Archive for the ‘Uncategorized’ Category

15. Meet FoPi – the formant-pitch plane

January 31, 2018

Below: FoPi decomposition of the vowels E and O, Fo = 93 Hz.
x-Axis: Formants
y-Axis: Pitch

[Images: fopi EE 92Hz, fopi OO3]



14. PoPi decomposition, the Fast way

January 13, 2018
  1. Load 2 channels of audio signals, recorded with a pair of microphones. The example below uses microphones with 60 cm spacing.
  2. Cut frames, get cross-correlation of the frames.
  3. Calculate the look-up tables for phase (PhiLUT), pitch (PitLUT)…
    [Figure: phi_lut and pit_lut]
    … and “phase-pitch” 😉
  4. Multiply the cross-correlation vector from point 2 by the 3 PoPi-LUTs above.
  5. Sum the 3 sub-results (L+O+R).
  6. Do you believe in cascading different approaches?
    – If not, just go to point 9. But.
    – If yes, create one more final matrix by multiplying the sub-results (dot products from point 4) instead of summing.
  7. Sum the sums (from point 5), and also sum the dot products (from point 6).
  8. Take a deep breath. Below you find examples of summing (L+O+R) and multiplying (L*O*R) the sub-results. Finally, you see the results of cascading the two approaches.
  9. Download the code. Learn the theory.
  10. Have a nice day!
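
Points 1 and 2 can be sketched in a few lines. This is only a minimal illustration, not the downloadable code: the 16 kHz sampling rate, the synthetic noise "recording" and the helper name frame_xcorr are my assumptions, and the PoPi look-up tables from point 3 are not reproduced here.

```python
import numpy as np

def frame_xcorr(a, b, max_lag):
    """Cross-correlation r[k] = sum_n a[n + k] * b[n] for k = -max_lag..max_lag."""
    n = len(a)
    full = np.correlate(a, b, mode="full")   # lags -(n-1) .. (n-1)
    zero = n - 1                             # index of lag 0 in `full`
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, full[zero - max_lag: zero + max_lag + 1]

# Assumed setup: 16 kHz sampling, 60 cm microphone spacing, c = 343 m/s.
FS, SPACING, C = 16000, 0.60, 343.0
MAX_LAG = int(np.ceil(SPACING / C * FS))     # largest physically possible delay

# Synthetic stand-in for a recording: noise that reaches the right mic 10 samples late.
rng = np.random.default_rng(0)
source = rng.standard_normal(1024 + 10)
left, right = source[10:], source[:1024]

lags, r = frame_xcorr(right, left, MAX_LAG)
delay = lags[np.argmax(r)]                   # samples by which `right` lags `left`
```

Only lags up to MAX_LAG are kept, because a larger delay cannot be produced by a source in front of a 60 cm microphone pair. The vector r is what point 4 would multiply by the LUTs.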

13. Spectral reindexing, the Fast way

January 3, 2018
  1. Get the spectrum of a signal frame. Aim for a spectral resolution of about 10 Hz/bin or finer, provided the harmonics are stable enough over time. Example: at 22 kHz sampling, cut 2000+ samples and apply a 2048- or 4096-point FFT. At 11 kHz sampling, use a 1024-point FFT.
    [Figure: signal and spectrum]
  2. Calculate the LUT to extract spectral bins at the F0, 2F0, 3F0, 4F0, 5F0 candidate pitch values, as well as LUT vectors in between, i.e. 1.5F0, 2.5F0, etc. The LUT depends on Fs, Nfft, Fo_min and Fo_max, see code.
    [Figure: reindexing LUT]
  3. Extract spectral bins at F0, 2F0, 3F0, 4F0, 5F0, etc. candidates, and store them in separate vectors.
    [Figure: disassembled spectrum]
  4. Assemble the “re-indexed” spectrum by summing up the vectors.
  5. Save time and download the Code.
  6. Take a look at the theory behind Re-indexing.
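
A rough sketch of points 1 to 4, under my own assumptions: the function names, the 1 Hz candidate grid, the Hann window and the synthetic 150 Hz test tone are illustrative, and the in-between vectors (1.5F0, 2.5F0, ...) are left out because the post does not say how they are combined.

```python
import numpy as np

def reindex_lut(fs, nfft, f0_min, f0_max, n_harm=5, step_hz=1.0):
    """Bin indices of h*F0 for every candidate F0 and harmonic h (n_harm x n_cand)."""
    f0 = np.arange(f0_min, f0_max + step_hz, step_hz)
    harm = np.arange(1, n_harm + 1)
    bins = np.rint(np.outer(harm, f0) * nfft / fs).astype(int)
    return f0, bins

def reindexed_spectrum(frame, nfft, bins):
    """Sum the magnitude spectrum over the harmonic bins of each F0 candidate."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    parts = mag[bins]          # point 3: one vector per harmonic number
    return parts.sum(axis=0)   # point 4: sum the vectors

# Assumed example: 22.05 kHz sampling, 2048-sample frame, 4096-point FFT.
FS, NFFT = 22050, 4096
t = np.arange(2048) / FS
frame = sum(np.cos(2 * np.pi * h * 150.0 * t) / h for h in range(1, 6))

f0, bins = reindex_lut(FS, NFFT, 60.0, 400.0)
f0_est = f0[np.argmax(reindexed_spectrum(frame, NFFT, bins))]
```

The LUT is computed once per (Fs, Nfft, Fo_min, Fo_max) setting, so the per-frame cost is one FFT plus an indexed sum. For the synthetic tone, f0_est should land close to 150 Hz.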


12. Wave Surfing – Intro

October 28, 2017

Sources in space and time
In audio signal processing, we observe signals by sampling air pressure at one point in space (by using one microphone), over time.

However, we could also sample the signal over space at one point in time, using one sampling period only (but thousands of microphones).

The aim of this simple experiment is to see the difference between the two approaches by analysing a frame of 2 superposed acoustic signals: a whistle and a vowel.
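
To make the two sampling views concrete, here is a toy sketch for a single plane wave (not the whistle-plus-vowel experiment itself). The tone frequency, the sampling rate, the 343 m/s speed of sound and the microphone spacing of c/Fs are all my assumptions.

```python
import numpy as np

C = 343.0      # speed of sound in air, m/s (approximate, assumed)
FS = 8000.0    # temporal sampling rate, Hz (assumed)
F = 500.0      # tone frequency, Hz (assumed)
N = 64         # number of samples, or number of microphones

def pressure(x, t):
    """Plane wave travelling in +x direction: p(x, t) = sin(2*pi*F*(t - x/C))."""
    return np.sin(2 * np.pi * F * (t - x / C))

n = np.arange(N)
over_time = pressure(0.0, n / FS)        # one microphone, N sampling periods
over_space = pressure(n * C / FS, 0.0)   # N microphones spaced C/FS apart, one instant
```

For this idealised wave the spatial snapshot is the time-reversed version of the temporal record, which for a sine starting at zero phase is simply its negative. With real sources coming from different directions the two views are no longer interchangeable.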


An initial version of the code is on GitHub.
[Image: PoPi video screenshot]

Click on the image above to see the short video. Feel free to extend the Code.


11. First book-chapter on HChT and HChT Code online

June 13, 2017

HChT here:

Book Chapters on HChT:
Chapters 7.10.2 (theory), 7.11.4 (signal examples), B.5.12 (code).



10. The Chirprate-Pitch Plane

December 7, 2008

Crossing pitch trajectories make multipitch tracking a difficult task. We are therefore looking for extra features that the tracking algorithm could use when assigning pitch trajectories to speakers (acoustic sources). Since we address single-channel recordings, the PoPi plane (i.e. linking position to pitch) is not the way to go.

Method Description:
We use the pitch rate as an additional cue for multipitch tracking (regardless of HOW that feature is obtained). Why? If the pitch trajectories of two speakers cross in a given frame, then the pitch of each speaker is coming “from somewhere”, and this “somewhere” is hopefully not the same for both, as the trajectories are crossing JUST now, in the problematic frame. The information on where the pitch is “coming from” and where it is “going to” is nothing else but the chirp rate. The question, of course, is how to decompose the signal effectively into a representation that links pitch to its rate of change.

One possible solution is to take the frame under analysis, pre-warp it with different chirp-rate candidates (just like in the Fast implementation of the Chirp Transform), and extract the pitch candidates for every pre-warping factor. This gives a Chirprate vs. Pitch plane (i.e. the alpha-f0 plane), which shows not only the actual pitch value of the speaker but also the “direction” the pitch is coming from, i.e. whether the pitch was higher or lower in the previous frame (positive or negative alpha) and how big the difference between the two frames is (the value of alpha itself).
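
A minimal sketch of such a plane, under my own assumptions: the warping function phi(t) = (1 + alpha*t/2)*t of the fan-chirp transform, linear interpolation for the resampling, and a plain harmonic sum as the pitch salience. The function names and the synthetic test chirp are illustrative and are not taken from any published code.

```python
import numpy as np

def warp_frame(frame, fs, alpha):
    """Resample `frame` on a uniform grid of warped time phi(t) = (1 + alpha*t/2) * t."""
    n = len(frame)
    t = (np.arange(n) - n / 2) / fs            # time axis centred on the frame
    phi = (1.0 + 0.5 * alpha * t) * t          # monotonic while |alpha * t| < 1
    tau = np.linspace(phi[0], phi[-1], n)      # uniform grid in warped time
    t_of_tau = np.interp(tau, phi, t)          # numerical inverse of phi
    return np.interp(t_of_tau, t, frame), tau[1] - tau[0]

def chirprate_pitch_plane(frame, fs, alphas, f0s, n_harm=5, nfft=8192):
    """Rows: chirp-rate candidates, columns: pitch candidates, values: harmonic sum."""
    plane = np.zeros((len(alphas), len(f0s)))
    win = np.hanning(len(frame))
    for i, alpha in enumerate(alphas):
        warped, dtau = warp_frame(frame, fs, alpha)
        mag = np.abs(np.fft.rfft(warped * win, nfft))
        freqs = np.fft.rfftfreq(nfft, d=dtau)
        for h in range(1, n_harm + 1):
            plane[i] += np.interp(h * f0s, freqs, mag)
    return plane

# Assumed test signal: 5 harmonics, 200 Hz at the frame centre, chirp rate 2 per second.
FS, N = 16000, 2048
t = (np.arange(N) - N / 2) / FS
frame = sum(np.cos(2 * np.pi * h * 200.0 * (1 + 0.5 * 2.0 * t) * t) / h
            for h in range(1, 6))

alphas = np.linspace(-4.0, 4.0, 9)
f0s = np.arange(100.0, 401.0, 2.0)
plane = chirprate_pitch_plane(frame, FS, alphas, f0s)
i, j = np.unravel_index(np.argmax(plane), plane.shape)
alpha_est, f0_est = alphas[i], f0s[j]
```

When the warping candidate matches the true chirp rate, the harmonics become stationary and their spectral peaks sharpen, so the harmonic sum is largest in that row. With a mismatched alpha the higher harmonics smear first, which is why the peak of the plane is well localised in both dimensions.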

Below are two different Chirprate-Pitch decompositions (or pitch salience planes, see [1]) depicting one acoustic source in the scene. For more details see [1].

[Figure: salience space]
As we see, there is only one dominant Fo candidate and one chirp-rate candidate for our speaker (no ghost peaks, cross-terms, etc.). A further option would be to apply the ACF-CEP based pitch estimation mentioned in the previous post.

Related work:
[1] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, Proc. of DAFx 2010, Graz, Austria, September 6-10, 2010. (Chirprate-Pitch Plane discussed in section 5.2),
[2] Martín Rocamora, Pablo Cancela: Pitch tracking in polyphonic audio by clustering local fundamental frequency estimates, 9th Brazilian AES Audio Engineering Congress, São Paulo, Brazil, May 17–19, 2011,
[3] Luis Jure, Ernesto López, Martín Rocamora, Pablo Cancela, Haldo Spontón, Ignacio Irigaray: Pitch content visualization tools for music performance analysis, Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), Porto, Portugal, pages 493–498, 2012

Software Tools
[SW1] Matlab GUI Tool, incl. code from the Audio Processing Group (FING|EUM),
[SW2] Vamp plugin for Sonic Visualiser from the Audio Processing Group (FING|EUM),

9. STChT-based Noise Suppression

December 7, 2008

Pitch Estimation module (ACF+CEP) + Spectral estimation module (STChT) + Noise spectrum estimation (adaptive Quantile) + Noise removal (anything from Wiener filter to the most complicated TF-based methods)


Building blocks:

  • Pitch estimation. An enhanced version of the reindexing method published in [1] was used as a basis for this module.
  • Short-Time Fan Chirp (STFCh) Transform. We use the fast version of the Fan-Chirp transform [2].
  • Noise estimation. This module is based on the well-known quantile filtering idea [3], according to which the noisy background can be estimated by taking an empirically chosen quantile of the sorted time-frequency atoms.
  • Noise suppression. The “speech enhancement” happens in this module: it takes the estimated noise spectrum and removes it from the representation provided by the STHChT module.
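
The last two building blocks can be sketched as follows. This is a simplified illustration, assuming a generic time-frequency power matrix (frames x bins) in place of the chirp-based representation, the median as the quantile, and a Wiener-like gain with a floor. The function names, the floor value and the synthetic data are my assumptions.

```python
import numpy as np

def quantile_noise_psd(power_spec, q=0.5):
    """Per-bin noise estimate: the q-quantile over time of the power values."""
    return np.quantile(power_spec, q, axis=0)

def wiener_gain(power_spec, noise_psd, floor=0.05):
    """Wiener-like gain (P - N) / P per time-frequency atom, limited to [floor, 1]."""
    gain = 1.0 - noise_psd / np.maximum(power_spec, 1e-12)
    return np.clip(gain, floor, 1.0)

# Synthetic stand-in: exponentially distributed noise power with unit mean ...
rng = np.random.default_rng(0)
power = rng.exponential(1.0, size=(200, 64))
# ... plus a strong component in 4 bins during the first 60 frames.
power[:60, 10:14] += 100.0

noise = quantile_noise_psd(power)
gain = wiener_gain(power, noise)
enhanced_power = gain ** 2 * power     # gain is applied to the magnitude
```

The quantile works because speech occupies a given bin only part of the time, so the lower part of the sorted values in that bin is dominated by the background. The gain floor is a common way to limit musical noise. Any other suppression rule can be dropped in at this point.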


[1] Képesi, M. and Weruaga, L.: “Harmonic Tracking based Short-Time Chirp Analysis of Speech Signals”, Robust2004 COST278 & ISCA ITRW Workshop on Robustness Issues in Conversational Interaction, 30th and 31st August 2004, University of East Anglia, Norwich, UK
[2] L. Weruaga, M. Kepesi, “The fan-chirp transform for non-stationary harmonic sounds”, Signal Proc., vol. 87, pp. 1504-1522, 2007.

Related Work:
[3] Stahl, V.; Fischer, A.; Bippus, R.: “Quantile based noise estimation for spectral subtraction and Wiener filtering”, Acoustics, Speech, and Signal Processing (ICASSP), 2000, vol. 3, pp. 1875–1878
[4] Sidsel Marie Norholm, PhD Thesis, 2015

8. Cep(ACF)

December 7, 2008

STChT requires reliable pitch estimation in order to provide a sharp TF representation. This is challenging, as pitch estimation in noisy and multi-speaker environments is never easy.

Method Description:
A combined pitch estimation method that joins Autocorrelation (ACF) with Cepstrum (Cep), plus some additional tricks. We know that the autocorrelation extracts the periodicity of the speech signal even in a noisy background, but it gives multiple pitch candidates because of double-pitch, half-pitch, etc. This is taken care of by the cepstrum applied on top of the ACF, which merges all autocorrelation peak candidates into one cepstrum-based pitch candidate. And the trick is in between: the cepstrum is reliable only if the spectrum it uses is nice enough, i.e. dominant, rich in harmonics, and as flat as possible. But how could a noisy speech spectrum be as nice as that?


Well, a half-wave rectified autocorrelation leads almost to a spectrum like that, with boosted periodicities and an enhanced spectral representation of hidden harmonicities.
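
Since there is no publication yet, here is only my reading of the chain as a minimal sketch: ACF, half-wave rectification, spectrum of the rectified ACF, then a cepstrum on top. The function name, the search range and the pulse-train test signal are assumptions, and the "additional tricks" are not included.

```python
import numpy as np

def cep_acf_pitch(frame, fs, f0_min=80.0, f0_max=400.0):
    """Pitch estimate from the cepstrum of the half-wave rectified autocorrelation."""
    n = len(frame)
    x = frame - np.mean(frame)
    nfft = 1 << (2 * n - 1).bit_length()              # zero-padding gives a linear ACF
    acf = np.fft.irfft(np.abs(np.fft.rfft(x, nfft)) ** 2, nfft)[:n]
    acf_hw = np.maximum(acf, 0.0)                     # half-wave rectification
    log_mag = np.log(np.abs(np.fft.rfft(acf_hw, nfft)) + 1e-12)
    cep = np.fft.irfft(log_mag, nfft)                 # index = quefrency in samples
    lo, hi = int(fs / f0_max), int(fs / f0_min)       # lag range of the pitch search
    lag = lo + int(np.argmax(cep[lo:hi + 1]))
    return fs / lag

# Assumed test signal: pulse train with a 64-sample period (125 Hz at 8 kHz) in light noise.
FS = 8000
frame = np.zeros(1024)
frame[::64] = 1.0
frame += 0.02 * np.random.default_rng(0).standard_normal(1024)

f0_est = cep_acf_pitch(frame, FS)
```

The rectified ACF is a train of positive humps at multiples of the pitch period, so its spectrum is a regular comb and the cepstrum collapses that comb into one peak at the period. The search range is kept below two pitch periods here to avoid the second rahmonic.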

No publications yet.

7. Multiband PoPi Decomposition

December 7, 2008

When applying the PoPi decomposition to concurrent-speaker scenarios (the cocktail party effect) in order to track multiple speakers who move while speaking, we see that the original formulation of the PoPi decomposition always shows only the more dominant speaker (i.e. the more dominant microperiodicities in the given signal frame), while the other speaker (let’s call him the background speaker) is suppressed and hardly visible in the representation.

A similar problem is addressed in Klapuri’s PhD work, which targets the automatic transcription of simultaneous musical tones. Klapuri’s (and the most logical) way to enhance multiple pitch candidates is to give each of them a chance to be dominant in one of multiple frequency bands.


We do the same: the “multiband” version of the PoPi plane is based on subband processing. It already provides good results with as few as 17 bands. This has been confirmed on several double-talk and triple-talk scenarios recorded in 3 different rooms with different reverberation times. However, for non-speech-like scenarios and for more speakers, 17 bands might be too few.


The image above shows the PoPi decomposition of 2 concurrent speakers. Their positions and the corresponding pitch values are easy to read out. The recording shows the voices of Tania and Lukas.

[1] T. Habib, L. Ottowitz, and M. Képesi, “Experimental Evaluation of Multi-band Position-Pitch Estimation (M-PoPi) Algorithm for Multi-Speaker Localization,” INTERSPEECH 2008, Sept. 22-26, Brisbane, Australia.
[2] T. Habib, M. Képesi and L. Ottowitz, “Experimental Evaluation of the Joint Position-Pitch Estimation (PoPi) algorithm in Noisy Environments,” 5th IEEE Workshop on Sensor Array and Multi-Channel Signal Processing (SAM 2008), Jul. 21-23, Darmstadt, Germany.
[3] M. Képesi, L. Ottowitz and T. Habib, “Joint Position-Pitch Estimation for Multiple Speaker Scenarios,” IEEE Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA 2008), May 6-8, Trento, Italy.


6. Housing for Microphone Arrays for Size Minimization

December 7, 2008

It is the dream of every microphone array researcher to have a mic array the size of a matchbox. Unfortunately, many mic arrays to date still measure hundreds of cm. This size restriction comes from the frequency of the signal we are trying to capture with the array: the lower the frequency, the bigger the array must be. But not any more…

101 – Miniature Microphone Array
102 – Main Enclosure
103 – Sensor array, Microphone Array
104 – Sensors, Microphones
105 – Air
106 – Media: Argon, Krypton, Xenon, Sulfur Hexafluoride, Carbon Dioxide, etc.

The Solution:
If the array size depends on the wavelength, and we can NOT change the frequency of the signal under acquisition, what can we change in order to use smaller arrays? The guess is right: the speed of sound.
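
A back-of-the-envelope sketch of the effect, using the ideal-gas formula c = sqrt(gamma * R * T / M) and wavelength = c / f. The heat capacity ratios and molar masses below are approximate textbook values that I am assuming for illustration. They are not taken from the patent.

```python
import math

R = 8.314462618   # universal gas constant, J/(mol K)

def speed_of_sound(gamma, molar_mass_kg, temp_k=293.15):
    """Ideal-gas speed of sound in m/s at the given temperature (default 20 degrees C)."""
    return math.sqrt(gamma * R * temp_k / molar_mass_kg)

# (heat capacity ratio, molar mass in kg/mol), approximate values (assumed)
GASES = {
    "air": (1.40, 0.02897),
    "argon": (5 / 3, 0.03995),
    "krypton": (5 / 3, 0.08380),
    "xenon": (5 / 3, 0.13129),
    "sulfur hexafluoride": (1.10, 0.14606),
}

c_air = speed_of_sound(*GASES["air"])
for name, params in GASES.items():
    c = speed_of_sound(*params)
    wavelength_1khz = c / 1000.0       # metres, at a signal frequency of 1 kHz
    scale = c / c_air                  # array size relative to the same array in air
    print(f"{name:20s} c = {c:6.1f} m/s  lambda(1 kHz) = {wavelength_1khz:.3f} m  scale = {scale:.2f}")
```

Because the wavelength scales linearly with the speed of sound, an array designed for a given frequency range shrinks by the same factor as c when the sensors sit in a slower medium.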

EP2218267: Kubin, Kepesi, Stark: Housing for Microphone Arrays and Multi-Sensor Devices for Their Size-Optimization, and
US8767993 (B2): Kubin, Kepesi, Stark: Housing for Microphone Arrays and Multi-Sensor Devices for Their Size-Optimization.