The Spatial Fourier Transform

December 5, 2009

Vision:
Imagine a surveillance system listening for keywords (example: airports) and showing, on a camera recording, where in a distant crowd the words are coming from…

The idea:
is to add extra dimensions to the short-time Fourier transform, leading to a time-frequency-space representation. Why? To assign spectral bins not only to time instants but also to points in space. Such a multidimensional representation decomposes the signal through frequency (DFT), plus time (STFT) and “space”, leading to the definition of the Short-Time Spatial Fourier Transform (ST-SpFT).

How?
It is important to note that the Spatial Fourier Transform (SFT) can only be applied to multi-channel audio whose channels are very precisely synchronised. One way to obtain this multidimensional decomposition is by WEIGHTING THE FFT OF INDEPENDENT MICROPHONE CHANNELS, rotating their phase vectors according to the scanned DoA candidates.
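The phase-rotation idea can be sketched for a single frame as below. This is only a minimal illustration, not the ST-SpFT itself: it assumes a linear array with known microphone positions, a far-field source and a speed of sound of 343 m/s, and the names `spatial_spectrum`, `mic_x` and `angles_deg` are made up for the example. For every DoA candidate the per-channel FFTs are phase-rotated to undo the expected inter-microphone delay and then summed, giving one spectrum per scanned direction.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def spatial_spectrum(frames, mic_x, fs, angles_deg):
    """Frequency-vs-DoA magnitudes for one synchronised multi-channel frame.

    frames:     array (n_mics, n_samples), one row per microphone
    mic_x:      microphone positions along a line, in metres (assumed geometry)
    angles_deg: DoA candidates to scan, in degrees from broadside
    Returns an array of shape (n_angles, n_bins).
    """
    n_mics, n = frames.shape
    spectra = np.fft.rfft(frames * np.hanning(n), axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.empty((len(angles_deg), freqs.size))
    for i, angle in enumerate(np.deg2rad(angles_deg)):
        # Far-field delay of each microphone for this DoA candidate (seconds).
        delays = mic_x * np.sin(angle) / SPEED_OF_SOUND
        # Rotate the phase of every bin to compensate for that delay.
        steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        out[i] = np.abs((spectra * steer).sum(axis=0)) / n_mics
    return out
```

Applying the function to successive frames would stack these slices into a time-frequency-DoA volume. The delay-compensating sum is ordinary frequency-domain beam steering, chosen here because it is the simplest reading of "rotating the phase vectors".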

Applications:
- (NOT BLIND!) source separation (we know the spectrum of the sound and its origin in space)
- Acoustic source localisation

The challenges:
- one is to implement the decomposition in a way that provides clear time-frequency-space (simply: time-frequency-DoA, TFD) atoms
- another is a fast implementation of this TFD decomposition

Possible derivatives of the idea:
- Spatial Wavelet Transform (replace FFT with DWT)
- Spatial Harmonic Chirp Transform (possibly a spatial Short-Time Harmonic Chirp Transform)

Nice images and code coming soon.

RELATED PROJECTS:
1) MarPanning from Marsyas visualises spectral bins across the “pan” space, derived from stereo musical recordings. MarPanning seems to use the angle information of cross-spectrum bins, meaning that every frequency bin is assigned ONLY TO ONE SPATIAL (Left-Right) position.
Nice video here.

Environmental Noise – Based Automatic Volume Control

August 29, 2009

This intelligent automatic level control is another feature proposal for mobile and SW-based audio players, as well as for communication devices and loudspeaker systems in public places such as airports, trams, trains, etc.

Problem examples: with current music players, people need to set the volume much higher when traveling on a train, lower it when the train stops at a station, raise it again when the train starts to move, and so on. People need to raise the ringer volume of their phones when taking the phone out on the street, lower it when going to the library, etc. Imagine a phone with its ringer volume set to 90% that rings at a very high level right in the middle of a meeting.

The idea: to track the level of environmental noise and automatically
- raise or lower the sound volume in the headphones of the mobile music player,
- raise or lower the speaker volume of mobile phones during phone calls when the user enters a noisy area (road crossing, etc.),
- raise or lower the ringer volume,
- adjust loudspeaker volumes in trains, buses and airports individually for every loudspeaker, based on the environmental noise level around it
(how many times have you missed an announcement about a route change because of the noise in the train?)
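A minimal sketch of such a noise-tracking volume control is given below. All numbers (the -60 dB "quiet" and -20 dB "loud" thresholds, the gain range and the smoothing factor) are placeholder assumptions, not measured values, and the function names are made up for the illustration. The noise level of a microphone frame is measured in dB, mapped linearly onto a gain range, and smoothed so that the volume does not jump.

```python
import numpy as np


def noise_level_db(frame, eps=1e-12):
    """Mean power of a microphone frame in dB (relative to full scale)."""
    return 10.0 * np.log10(np.mean(np.square(frame)) + eps)


def update_gain_db(prev_gain_db, noise_db, quiet_db=-60.0, loud_db=-20.0,
                   min_gain_db=-30.0, max_gain_db=0.0, smoothing=0.9):
    """One update step of the playback gain.

    The thresholds and gain limits are hypothetical tuning parameters.
    """
    # Position of the current noise level between "quiet" and "loud" (0..1).
    pos = np.clip((noise_db - quiet_db) / (loud_db - quiet_db), 0.0, 1.0)
    target_db = min_gain_db + pos * (max_gain_db - min_gain_db)
    # First-order smoothing avoids audible jumps in volume.
    return smoothing * prev_gain_db + (1.0 - smoothing) * target_db
```

Calling `update_gain_db` once per frame gives a slowly varying gain. A real device would also have to keep its own playback out of the noise measurement, which this sketch ignores.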

Application fields:
- all mobile music players (CD, MP3, etc.)
- all mobile phones
- all loudspeaker installations in trams, trains, airports, etc.

Possible features:
- simple automatic level control based on the environmental noise
- parametric-equalizer-like control based on the features of the environmental noise (not all frequencies necessarily need to be boosted every time)
- learn-by-listening (LBL) feature: you take the mp3 player with you, enter the train, activate the LBL feature, start playing music, and show the player how you would set the music level at different noise levels and train speeds. From this the system learns how much it needs to adjust the volume in different cases.
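The learn-by-listening feature could be prototyped as a simple lookup curve, sketched below. The class name `LearnedVolumeCurve` and the choice of linear interpolation between taught points are assumptions made for the illustration. Each time the user sets the volume, the current noise level and the chosen volume are stored, and later the volume for a new noise level is interpolated from those points.

```python
import numpy as np


class LearnedVolumeCurve:
    """Stores (noise level, user-chosen volume) pairs and interpolates between them."""

    def __init__(self):
        self.noise_db = []
        self.volume = []

    def teach(self, noise_db, volume):
        """Record the volume the user chose at the given noise level."""
        self.noise_db.append(float(noise_db))
        self.volume.append(float(volume))

    def predict(self, noise_db):
        """Volume for a new noise level; held constant outside the taught range."""
        if not self.noise_db:
            raise ValueError("no examples taught yet")
        order = np.argsort(self.noise_db)  # np.interp needs increasing x values
        x = np.asarray(self.noise_db)[order]
        y = np.asarray(self.volume)[order]
        return float(np.interp(noise_db, x, y))
```

Train speed, which the feature description also mentions, could be handled the same way with a second input dimension. It is left out here to keep the sketch short.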

Technology analogy:
- the automatic contrast control of mobile phone displays, based on the ambient light conditions

Keywords:

- Automatic (level/volume/gain) control,
- Adaptive noise masking,
- parametric equalizer,
- environmental noise spectrum

Audible Pause

August 10, 2009

Audible Pause is a feature proposal for mobile and SW-based audio players. It would keep playing the last “sound” after the Pause button is pressed, just as a video player keeps displaying the last video frame when Pause is pressed. However, since playing a single audio sample in a loop does not lead to anything audible, this Audible Pause feature needs a little signal processing behind the scenes.

[Image: pause]

The trick would be to:
1. look for frames of the signal with a similar spectral envelope,
2. smooth the envelope a bit in time [1],
3. apply phase-synchronous overlap-add [2] in a loop to keep the automatically chosen audio frames (point 1) playing continuously.
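The looping step can be approximated with a much simpler time-domain stand-in, sketched below. It is not a phase vocoder and it skips steps 1 and 2. It takes the last grain before the pause position, picks a loop start whose waveform best matches the end of the grain, and cross-fades the seam. The function name `sustain_loop` and all durations are assumptions for illustration only.

```python
import numpy as np


def sustain_loop(signal, pause_idx, fs, grain_s=0.08, fade_s=0.02, out_s=1.0):
    """Loop the audio just before pause_idx with a waveform-aligned cross-fade."""
    grain_len = int(round(grain_s * fs))
    fade = int(round(fade_s * fs))
    if pause_idx < grain_len or grain_len < 3 * fade:
        raise ValueError("grain does not fit before the pause position")
    grain = np.asarray(signal[pause_idx - grain_len:pause_idx], dtype=float)

    # Find the loop start whose first samples look most like the grain's tail,
    # so that the cross-fade joins two nearly in-phase pieces of waveform.
    tail = grain[-fade:]
    n_starts = grain_len - 2 * fade
    scores = [np.dot(grain[s:s + fade], tail) for s in range(n_starts)]
    loop = grain[int(np.argmax(scores)):]

    # One loop period: the cross-faded seam followed by the middle of the loop.
    ramp = np.linspace(0.0, 1.0, fade)
    seam = loop[-fade:] * (1.0 - ramp) + loop[:fade] * ramp
    unit = np.concatenate([seam, loop[fade:-fade]])

    n_out = int(round(out_s * fs))
    reps = int(np.ceil(n_out / unit.size))
    return np.tile(unit, reps)[:n_out]
```

For strongly periodic sounds (a held vowel or note) this already gives a steady tone. For noisy or fast-changing material the spectral-envelope frame selection and smoothing from points 1 and 2 would still be needed.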

Applications:
- Useful feature for note transcription by listening
- Useful for learning sounds of speech in foreign languages (ü, ö, ä, etc)

References:
[1] anything about RASTA filtration
[2] anything on phase vocoders

The Chirprate-Pitch Plane

December 7, 2008

Problem:
Crossing pitch trajectories make multipitch tracking a difficult task. We are looking for extra features that the tracking algorithm could use when assigning pitch trajectories to speakers (acoustic sources). Since we address single-channel recordings, the PoPi plane (i.e. linking position to pitch) is not the way to go.

Method Description:
Use the pitch rate as an additional cue for multipitch tracking (regardless of HOW one obtains that feature). Why? If the pitch trajectories of two speakers cross in a given frame, then the pitch of each speaker is definitely coming “from somewhere”, and this “somewhere” is hopefully not the same for both, as the trajectories are JUST crossing in the problematic frame. The information about where the pitch is “coming from” and where it is “going to” is nothing else but the chirp rate. Of course, the question is how to decompose the signal effectively into a representation that links pitch to its rate of change.

One possible solution is to take the frame under analysis, pre-warp it with different chirp-rate candidates (just as in the fast implementation of the Chirp Transform), and extract all the pitch candidates for each pre-warping factor. This gives a Chirprate vs. Pitch plane that shows not only the actual pitch value of the speaker but also the “direction” the pitch is coming from, i.e. whether the pitch was higher or lower in the previous frame and how big the difference between the two frames is.
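The pre-warping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the figure: it uses the fan-chirp time warping phi(t) = (1 + alpha*t/2)*t, linear interpolation for the resampling, and a plain normalised autocorrelation as the pitch-candidate extractor. The function names are made up for the example. Each chirp-rate candidate `alpha` yields one row of the plane, and the row whose warping matches the true pitch change shows the sharpest peak at the pitch lag.

```python
import numpy as np


def warp_frame(frame, fs, alpha):
    """Resample a frame so that a pitch changing as f0*(1 + alpha*t) becomes constant."""
    n = frame.size
    t = (np.arange(n) - (n - 1) / 2.0) / fs   # time axis centred on the frame
    if abs(alpha) < 1e-9:
        return frame.astype(float)
    if abs(alpha) * t[-1] >= 1.0:
        raise ValueError("chirp rate too large for this frame length")
    phi = (1.0 + 0.5 * alpha * t) * t          # warped time phi(t)
    tau = np.linspace(phi[0], phi[-1], n)      # uniform grid in warped time
    t_src = (np.sqrt(1.0 + 2.0 * alpha * tau) - 1.0) / alpha  # inverse of phi
    return np.interp(t_src, t, frame)


def autocorr(x):
    """Normalised autocorrelation for lags 0..len(x)-1, computed via the FFT."""
    x = x - np.mean(x)
    spec = np.fft.rfft(x, 2 * x.size)
    acf = np.fft.irfft(np.abs(spec) ** 2)[:x.size]
    return acf / (acf[0] + 1e-12)


def chirprate_pitch_plane(frame, fs, alphas):
    """One autocorrelation row per chirp-rate candidate: shape (n_alphas, n_lags)."""
    return np.vstack([autocorr(warp_frame(frame, fs, a)) for a in alphas])
```

The location of the maximum of the plane (within a plausible lag range) gives a joint estimate of chirp rate and pitch lag for a single source.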

Below is one possible Chirprate-Pitch decomposition, depicting one acoustic source in the scene. The x axis is the pitch rate (chirpFactor) and the y axis is the correlation lag (the inverse of pitch):

[Image: warpedpitchestimation1]
As you see, there is only one correlation lag candidate, and one chirpFactor candidate for our speaker, which is the result of using the ACF-CEP based pitch estimation mentioned in the previous post.

References:
[1] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, Proc. of DAFx 2010, Graz, Austria, September 6-10, 2010. (Chirprate-Pitch Plane discussed in section 5.2)

STChT-based Noise Suppression

December 7, 2008

Description:
Pitch Estimation module (ACF+CEP) + Spectral estimation module (STChT) + Noise spectrum estimation (adaptive Quantile) + Noise removal (anything from Wiener filter to the most complicated TF-based methods)

[Image: onevoice_blockscheme]

Building blocks:

  • Pitch estimation: an enhanced version of the reindexing method published in [1] was used as the basis for this module.
  • Short-Time Fan-Chirp (STFCh) Transform: we use the fast version of the Fan-Chirp transform [2].
  • Noise estimation: this module is based on the well-known quantile filtering idea [3], according to which the noisy background can be estimated by taking an empirically chosen quantile of the sorted time-frequency atoms.
  • Noise suppression: the “speech enhancement” happens in this module. It takes the estimated noise spectrum and removes it from the representation provided by the STHChT module.
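The noise estimation and suppression blocks can be illustrated with the short sketch below. It works on any non-negative time-frequency power matrix, so the chirp transform itself is not needed here. The median (q = 0.5) as quantile, the gain floor and the function names are assumptions for the example. The suppression rule is a basic Wiener-type gain, just one of the many options the description allows.

```python
import numpy as np


def quantile_noise_psd(power_tf, q=0.5):
    """Per-bin noise power estimate from a (n_frames, n_bins) power matrix.

    For every frequency bin the values are sorted over time and the q-quantile
    is taken, assuming speech occupies each bin only part of the time.
    """
    return np.quantile(power_tf, q, axis=0)


def wiener_like_gain(power_tf, noise_psd, floor=0.05):
    """Gain (P - N) / P per time-frequency atom, limited from below by a floor."""
    gain = 1.0 - noise_psd[None, :] / np.maximum(power_tf, 1e-12)
    return np.maximum(gain, floor)
```

The gain would then be multiplied onto the complex time-frequency representation before resynthesis. The floor limits the musical-noise artefacts that a hard zero would produce.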


References:

[1] Képesi, M and Weruaga, L.: “Harmonic Tracking based Short-Time Chirp Analysis of Speech Signals”, Robust2004 COST278 & ISCA ITRW Workshop on Robustness Issues in Conversational Interaction, 30th and 31st August 2004, University of East Anglia, Norwich, UK
[2] L. Weruaga, M. Kepesi, “The fan-chirp transform for non-stationary harmonic sounds”, Signal Proc., vol. 87, pp. 1504-1522, 2007.
[3] V. Stahl, A. Fischer, R. Bippus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2000, vol. 3, pp. 1875-1878.

Cep(ACF)

December 7, 2008

Problem:
STChT requires reliable pitch estimation in order to provide a sharp TF representation. This is challenging, as pitch estimation in noisy and multi-speaker environments is never easy.

Method Description:
A combined pitch estimation method that joins autocorrelation (ACF) with the cepstrum (Cep), plus some additional tricks. We know that the autocorrelation extracts the periodicity of the speech signal even in a noisy background, but it gives multiple pitch candidates because of double-pitch, half-pitch, etc. The cepstrum applied on top of the ACF takes care of this by merging all autocorrelation peak candidates into one cepstrum-based pitch candidate. The trick is in between: the cepstrum is reliable only if the spectrum it uses is nice enough, i.e. dominant, rich in harmonics, and as flat as possible. But how could a noisy speech spectrum be that nice?

[Image: cep_acf]

Well, a half-wave rectified autocorrelation leads almost to a spectrum like that, with boosted periodicities and an enhanced spectral representation of hidden harmonicities.

References:
No publications yet.

Multiband PoPi Decomposition

December 7, 2008

Problem:
When applying the PoPi decomposition to concurrent-speaker scenarios (the cocktail party effect) in order to track multiple speakers who move while speaking, we see that the original formulation of the PoPi decomposition always shows only the more dominant speaker (i.e. the more dominant micro-periodicities in the given signal frame), while the other speaker (let’s call him the background speaker) is suppressed and hardly visible in the representation.

Solution:
A similar problem is addressed in Klapuri’s PhD work, which targets automatic transcription of simultaneous musical tones. Klapuri’s (and the most logical) way to enhance multiple pitch candidates is to give each of them a chance to be dominant in one of multiple frequency bands.

[Image: bandwisepopi]

We do the same: the “multiband” version of the PoPi plane is based on subband processing. This already gives good results with as few as 17 bands, as shown on several double-talk and triple-talk scenarios recorded in 3 different rooms with different reverberation times. However, for non-speech scenarios and for more speakers, 17 bands might be too few.

[Image: multispeaker_popi]

The image above shows the PoPi decomposition of 2 concurrent speakers. Their positions and the corresponding pitch values are easy to read out. The recording shows the voices of Tania and Lukas.

References:
[1] T. Habib, L. Ottowitz, and M. Kepési, “Experimental Evaluation of Multi-band Position-Pitch Estimation (M-PoPi) Algorithm for Multi-Speaker Localization,” INTERSPEECH 2008, Sept. 22-26, Brisbane, Australia.
[2] T. Habib, M. Kepési and L. Ottowitz, “Experimental Evaluation of the Joint Position-Pitch Estimation (PoPi) algorithm in Noisy Environments,” 5th IEEE Workshop on Sensor Array and Multi-Channel Signal Processing (SAM 2008), Jul. 21-23, Darmstadt, Germany.
[3] M. Kepési, L. Ottowitz and T. Habib, “Joint Position-Pitch Estimation for Multiple Speaker Scenarios,” IEEE Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA 2008), May 6-8, Trento, Italy.

Housing for Microphone Arrays for Size Minimization

December 7, 2008

Problem:
It would be the dream of every microphone array researcher to have a mic array the size of a matchbox. Unfortunately this is not the case, and many mic arrays measure hundreds of centimetres. This size restriction comes from the frequency of the signal we are trying to capture with the array: the lower the frequency, the bigger the array must be.

Possible Solution:
If the array size depends on the wavelength, and we can NOT change the frequency of the signal being acquired, what can we change in order to use smaller arrays? The guess is right: the speed of sound.
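The arithmetic behind this is the relation wavelength = speed of sound / frequency, sketched below. The 343 m/s value for air is an assumed room-temperature figure, and the slower medium inside the housing is purely hypothetical. The post does not say how the speed of sound would be lowered.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, assumed value for air at about 20 °C


def wavelength(freq_hz, speed_m_s=SPEED_OF_SOUND_AIR):
    """Acoustic wavelength in metres for a given frequency and speed of sound."""
    return speed_m_s / freq_hz


def scaled_array_size(size_in_air_m, speed_in_housing_m_s):
    """Array size that spans the same number of wavelengths in a slower medium."""
    return size_in_air_m * speed_in_housing_m_s / SPEED_OF_SOUND_AIR
```

For example, a 100 Hz tone has a wavelength of 3.43 m in air. If the housing halved the speed of sound, an array that needs 1 m in air would need only 0.5 m for the same size measured in wavelengths.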

The PoPi plane

December 7, 2008

The idea:
Although the unvoiced sounds of speech that dominate at high frequencies (“s”, “z”, “f”, etc.) give a very clear DoA estimate, the real information is still hidden in the not-so-clear and confusing micro-periodicities of the autocorrelation. These micro-periodicities carry information not only about the DoA of the source they belong to, but also about its pitch. This means that, even with two moving and not always active acoustic sources, each source's identity can be tied to its position by linking its pitch to its DoA.

Method Description:
Decomposing a frame of a 2-channel acoustic signal into a Position-Pitch plane shows clearly where the source is (at which DoA angle) and what the pitch of that speaker is (from the correlation lag). The image below shows a PoPi plane extracted from a 16-channel circular mic array signal for a voiced speech frame.
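One way to build such a plane for two channels is sketched below. It rests on the assumption that the plane is formed by sampling the inter-channel cross-correlation at the time-difference-of-arrival (TDOA) lag plus whole multiples of the pitch period. The published PoPi formulation may differ in weighting and normalisation, and the function name and parameters are made up for the example. Lags are integer sample counts here, and a TDOA lag can be converted to a DoA angle from the microphone spacing and the speed of sound.

```python
import numpy as np


def popi_plane(x1, x2, tdoa_lags, pitch_lags, n_periods=2):
    """Position-pitch plane from two channels of one frame.

    tdoa_lags:  integer candidate delays of channel 2 against channel 1 (samples)
    pitch_lags: integer candidate pitch periods (samples)
    Returns an array of shape (len(tdoa_lags), len(pitch_lags)).
    """
    n = x1.size
    spec1 = np.fft.rfft(x1, 2 * n)
    spec2 = np.fft.rfft(x2, 2 * n)
    # Cross-correlation r[l] = sum_m x1[m] * x2[m + l]; negative lags wrap around.
    xcorr = np.fft.irfft(spec2 * np.conj(spec1))
    norm = np.sqrt(np.dot(x1, x1) * np.dot(x2, x2)) + 1e-12

    tdoa = np.asarray(tdoa_lags)[:, None]
    pitch = np.asarray(pitch_lags)[None, :]
    plane = np.zeros((tdoa.shape[0], pitch.shape[1]))
    for k in range(-n_periods, n_periods + 1):
        # Sample the cross-correlation at the TDOA lag shifted by k pitch periods.
        plane += xcorr[(tdoa + k * pitch) % (2 * n)]
    return plane / ((2 * n_periods + 1) * norm)
```

A voiced source produces correlation peaks at its TDOA and at that TDOA plus or minus multiples of its pitch period, so only the matching (position, pitch) cell collects all of them.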

[Image: picture-13]

References:
[1] M. Képesi, F. Pernkopf, M. Wohlmayr, “Joint Position-Pitch Tracking for 2-Channel Audio,” CBMI 2007, Jun 25-27, Bordeaux, France
[2] M. Wohlmayr, M. Képesi, “Joint Position-Pitch Extraction from Multichannel Audio,” Proc. Interspeech 2007, August 27-31, Antwerpen, Belgium
[3] M. Képesi, M. Wohmayr, T. Habib, “Pitch-Driven Position Estimation of Speakers in Multispeaker Environments,” The 3rd Congress of the Alps Adria Acoustics Association, September 27-28, 2007, Graz, Austria
More publications under the Multichannel PoPi topic.
Patent applications:
[P1] Képesi, M. – Wohlmayr, M. – Kubin, G.: “Joint Position-Pitch Estimation of Acoustic Sources for Their Tracking and Separation,” European patent, submitted: May, 2007.

Links:
[1] Related page at SPSC
[2] PoPi Demo Videos and Audio Files
[3] Elmar using the method for controlling the orientation of his robot.

Spectral Reindexing for Pitch Estimation

December 7, 2008

Problem:
We need a powerful but straightforward pitch estimation method to drive the chirp transform. One option is to reorder the information carried by all the frequency bins of an FFT (or ChT) to obtain a precise pitch estimate.

Method Description:
The main idea is to scan through all possible pitch candidates (80-500 Hz) and assign to each candidate Fo the mean of the energy values found at Fo, 2Fo, … , kFo. As an equation it looks like this:

and an example of a pitch trajectory derived from a low-resolution spectrogram is shown below:

What we see is that it is feasible to achieve a pitch resolution far finer than the frequency resolution of the FFT (usually 15-20 Hz per frequency bin).
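The scan over pitch candidates can be sketched as follows. This is a minimal reading of the description above, not the published method: it assumes a fixed number of harmonics per candidate, a zero-padded FFT, and linear interpolation of the power spectrum between bins. The names `reindexed_pitch`, `n_harmonics` and `pad` are made up for the example.

```python
import numpy as np


def reindexed_pitch(frame, fs, f0_grid, n_harmonics=5, pad=8):
    """Score each pitch candidate by the mean spectral power at its harmonics.

    Returns (best_f0, scores), with scores aligned to f0_grid.
    Requires n_harmonics * max(f0_grid) to stay below fs / 2.
    """
    n = frame.size
    n_fft = pad * n                               # zero-padding refines the bin grid
    power = np.abs(np.fft.rfft(frame * np.hanning(n), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    harmonics = np.arange(1, n_harmonics + 1)
    scores = np.array([
        # Read the power at f0, 2*f0, ..., k*f0 and average it.
        np.mean(np.interp(f0 * harmonics, freqs, power))
        for f0 in f0_grid
    ])
    return float(f0_grid[int(np.argmax(scores))]), scores
```

Because every harmonic contributes its own estimate of the fundamental, the candidate grid can be much finer than the bin spacing of the unpadded FFT. Taking the mean rather than the sum keeps sub-harmonic candidates (Fo/2, Fo/3, …) from scoring as high as the true pitch, since many of their harmonic positions fall between spectral peaks.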

References:
[1] M. Képesi, L. Weruaga, E. Schofield, “Detailed Multidimensional Analysis of our Acoustical Environment,” Forum Acusticum. Budapest (Hu), September 2005, pp. 2649-2654.

[2] M. Képesi and L. Weruaga, “High-resolution noise-robust spectral-based pitch estimation,” Interspeech 2005, pp. 313-316, Lisboa (P), Sep. 2005
[3] P. Cancela, “Tracking melody in polyphonic audio. mirex 2008,” in Proc. Music Inf. Retrieval Evaluation eXchange, 2008
[4] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, DAFx 2010.
(“F0-gram” ie. GlogS discussed in chapter 4)

Related Methods:
- Harmonic Product Spectrum
- Harmonic Sum Spectrum

