Archive for the ‘Uncategorized’ Category

HChT and PoPi code online

June 13, 2017

HChT here:

PoPi here:

Book Chapters on HChT:
chapters 7.10.2 -theory, 7.11.4 -signal examples, B.5.12 -code.



The Chirprate-Pitch Plane

December 7, 2008

Crossing pitch trajectories make multipitch tracking a difficult task. We are looking for extra features that the tracking algorithm could use to assign pitch trajectories to speakers (acoustic sources). Since we address single-channel recordings, the PoPi plane (i.e. linking position to pitch) is not the way to go.

Method Description:
Using the pitch rate as an additional cue for multipitch tracking (regardless of HOW one obtains that feature). Why? If the pitch trajectories of two speakers cross in a given frame, then the pitch of each speaker is definitely coming “from somewhere”, and this “somewhere” is hopefully not the same for both, as the trajectories cross only in the problematic frame. The information about where the pitch is “coming from” and where it is “going to” is nothing else but the chirp rate. The question, of course, is how to decompose the signal effectively into a representation that links pitch to its rate of change.

One possible solution is to take the frame under analysis, pre-warp it with different chirp-rate candidates (just like in the fast implementation of the Chirp Transform), and extract the pitch candidates for every pre-warping factor. This gives a Chirprate vs. Pitch Plane (i.e. an alpha-f0 plane) that shows not only the actual pitch value of the speaker, but also from which “direction” the pitch is coming, i.e. whether the pitch value was higher or lower in the previous frame (positive or negative alpha), and how big the difference between the two frames is (the value of alpha itself).
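The pre-warping idea can be sketched in plain Python. Everything named here is an assumption made for illustration: the warping function phi(t) = (1 + alpha*t/2)*t follows the fan-chirp formulation of Weruaga and Képesi, the resampling uses simple linear interpolation, and the pitch candidates are scored by a plain harmonic sum. It is not the fast implementation referred to above.

```python
import cmath
import math

def warp_frame(x, fs, alpha):
    """Resample x so that a pitch moving as f0*(1 + alpha*t) becomes constant."""
    n = len(x)
    t0 = -(n - 1) / (2.0 * fs)                 # time axis centred on the frame
    tau0 = (1.0 + 0.5 * alpha * t0) * t0       # phi(t0), start of the warped axis
    out = []
    for i in range(n):
        tau = tau0 + i / fs                    # warped axis keeps the step 1/fs
        if alpha == 0.0:
            t = tau
        else:                                  # invert phi(t) = t + alpha*t^2/2
            t = (math.sqrt(max(1.0 + 2.0 * alpha * tau, 0.0)) - 1.0) / alpha
        pos = (t - t0) * fs                    # fractional sample index
        k = max(0, min(int(pos), n - 2))
        frac = pos - k
        out.append((1.0 - frac) * x[k] + frac * x[k + 1])
    return out

def harmonic_salience(y, fs, f0, n_harm):
    """Sum of DFT magnitudes of y at f0, 2*f0, ..., n_harm*f0."""
    total = 0.0
    for h in range(1, n_harm + 1):
        w = -2j * math.pi * h * f0 / fs
        total += abs(sum(v * cmath.exp(w * i) for i, v in enumerate(y)))
    return total

def chirprate_pitch_plane(x, fs, alphas, f0s, n_harm=4):
    plane = {}
    for alpha in alphas:
        y = warp_frame(x, fs, alpha)           # one pre-warped frame per candidate
        for f0 in f0s:
            plane[(alpha, f0)] = harmonic_salience(y, fs, f0, n_harm)
    return plane

# Synthetic voiced frame: 4 harmonics, 200 Hz at the frame centre, rising pitch
fs, n = 8000.0, 320
true_f0, true_alpha = 200.0, 10.0
x = []
for i in range(n):
    t = (i - (n - 1) / 2.0) / fs
    phase = 2.0 * math.pi * true_f0 * (1.0 + 0.5 * true_alpha * t) * t
    x.append(sum(math.cos(h * phase) for h in range(1, 5)))

alphas = [-20.0, -10.0, 0.0, 10.0, 20.0]
f0s = [150.0 + 5.0 * k for k in range(21)]     # 150 .. 250 Hz
plane = chirprate_pitch_plane(x, fs, alphas, f0s)
best_alpha, best_f0 = max(plane, key=plane.get)
```

The candidate grids are deliberately coarse to keep the pure-Python loops short; a real analysis would use finer grids and an FFT.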

Below are two different Chirprate-Pitch decompositions (or pitch salience planes, see [1]), each depicting one acoustic source in the scene.

[Figure: salience space]
As we see, there is only one dominant Fo candidate and one chirp-rate candidate for our speaker (no ghost peaks, cross-terms, etc.). A further option would be to apply the ACF-CEP based pitch estimation mentioned in the previous post.

Related work:
[1] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, Proc. of DAFx 2010, Graz, Austria, September 6-10, 2010. (Chirprate-Pitch Plane discussed in section 5.2),
[2] Martín Rocamora, Pablo Cancela, Pitch tracking in polyphonic audio by clustering local fundamental frequency estimates, Brazilian AES Audio Engineering Congress, 9th, São Paulo, Brazil, May 17–19, 2011,
[3] Luis Jure, Ernesto López, Martín Rocamora, Pablo Cancela, Haldo Spontón, Ignacio Irigaray: Pitch content visualization tools for music performance analysis, International Society for Music Information Retrieval Conference, 13th, Proceedings. ISMIR 2012. Porto, Portugal, page 493–498 – 2012

Software Tools
[SW1] Matlab GUI Tool, incl. code from the Audio Processing Group (FING|EUM),
[SW2] Vamp plugin for Sonic Visualiser from the Audio Processing Group (FING|EUM),

STChT-based Noise Suppression

December 7, 2008

Pitch Estimation module (ACF+CEP) + Spectral estimation module (STChT) + Noise spectrum estimation (adaptive Quantile) + Noise removal (anything from Wiener filter to the most complicated TF-based methods)


Building blocks:

  • Pitch estimation. An enhanced version of the reindexing method published in [1] was used as a basis for this module.
  • Short-Time Fan Chirp (STFCh) Transform. We use the fast version of the Fan-Chirp transform [2].
  • Noise estimation: this module is based on the well-known quantile filtering idea [3], according to which the noisy background can be estimated, per frequency, by sorting the time-frequency atoms over time and picking the value at an empirically chosen quantile.
  • Noise suppression: the “speech enhancement” happens in this module. It takes the estimated noise spectrum and removes it from the representation provided by the Short-Time Fan Chirp Transform module.
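The last two building blocks can be sketched as follows. This is a minimal illustration, not the module described above: the quantile q = 0.5, the gain rule (P - N) / P and the gain floor are assumptions picked for the example, and the input is taken to be a plain list of power-spectrum frames.

```python
def quantile_noise(power_frames, q=0.5):
    """Per-bin noise estimate: sort each bin over time and pick the q-quantile.

    power_frames[t][k] is the power at frame t and frequency bin k.
    """
    n_frames = len(power_frames)
    n_bins = len(power_frames[0])
    idx = int(q * (n_frames - 1))
    return [sorted(frame[k] for frame in power_frames)[idx] for k in range(n_bins)]

def wiener_gain(power_frame, noise, floor=0.05):
    """Wiener-type gain (P - N) / P per bin, limited from below by a floor."""
    gains = []
    for p, n in zip(power_frame, noise):
        g = max(p - n, 0.0) / p if p > 0.0 else 0.0
        gains.append(max(g, floor))
    return gains

# Toy data: 3 bins, noise power 1.0 everywhere, speech in bin 1 for 3 of 10 frames
frames = [[1.0, 1.0, 1.0]] * 7 + [[1.0, 10.0, 1.0]] * 3
noise = quantile_noise(frames)                 # estimated noise power per bin
gains = wiener_gain(frames[-1], noise)         # gains for a frame containing speech
cleaned = [g * p for g, p in zip(gains, frames[-1])]
```

The quantile works because speech occupies any given bin only part of the time, so the lower part of the sorted values is dominated by the background.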


[1] Képesi, M and Weruaga, L.: “Harmonic Tracking based Short-Time Chirp Analysis of Speech Signals”, Robust2004 COST278 & ISCA ITRW Workshop on Robustness Issues in Conversational Interaction, 30th and 31st August 2004, University of East Anglia, Norwich, UK
[2] L. Weruaga, M. Kepesi, “The fan-chirp transform for non-stationary harmonic sounds”, Signal Proc., vol. 87, pp. 1504-1522, 2007.
[3] Stahl, V.; Fischer, A.; Bippus, R.: Quantile based noise estimation for spectral subtraction and Wiener filtering, Acoustics, Speech, and Signal Processing, 2000, vol. 3, pp. 1875–1878


December 7, 2008

STChT requires reliable pitch estimation in order to provide a sharp TF representation. This is a challenge, as pitch estimation in noisy and multi-speaker environments is never easy.

Method Description:
A combined pitch estimation method that joins autocorrelation (ACF) with the cepstrum (CEP), plus some additional tricks. We know that the autocorrelation extracts the periodicity of the speech signal even in a noisy background, but it yields multiple pitch candidates because of double-pitch, half-pitch and similar errors. This is taken care of by the cepstrum applied on top of the ACF, which merges all autocorrelation peak candidates into one cepstrum-based pitch candidate. The trick lies in between: the cepstrum is reliable only if the spectrum it works on is nice enough, i.e. dominant, rich in harmonics, and as flat as possible. But how could a noisy speech spectrum be as nice as that?


Well, a half-wave rectified autocorrelation leads almost to a spectrum like that: one with boosted periodicities and an enhanced spectral representation of hidden harmonicities.
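Since there is no publication to point to, here is only a rough sketch of how such a chain could look: autocorrelation, half-wave rectification, spectrum of the rectified ACF, then a cepstrum searched over plausible pitch periods. The half-cosine lag taper, the zero-padded DFT length, the 80 to 380 Hz search range and the function name are all assumptions of this sketch, and the “additional tricks” are not covered.

```python
import cmath
import math

def acf_cep_pitch(x, fs, fmin=80.0, fmax=380.0, n_fft=512):
    n = len(x)
    n_lags = n // 2
    # 1) Autocorrelation for lags 0 .. n_lags-1
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n_lags)]
    # 2) Half-wave rectification, then a decaying half-cosine taper (assumption)
    r_plus = [max(v, 0.0) * 0.5 * (1.0 + math.cos(math.pi * lag / n_lags))
              for lag, v in enumerate(r)]
    # 3) Log-magnitude spectrum of the rectified ACF, bins 0 .. n_fft/2
    half = n_fft // 2
    log_spec = []
    for k in range(half + 1):
        w = -2j * math.pi * k / n_fft
        s = sum(v * cmath.exp(w * lag) for lag, v in enumerate(r_plus))
        log_spec.append(math.log(abs(s) + 1e-9))
    # 4) Real cepstrum, evaluated only at quefrencies in the pitch range
    def cepstrum(q):
        acc = log_spec[0] + log_spec[half] * math.cos(math.pi * q)
        acc += 2.0 * sum(log_spec[k] * math.cos(2.0 * math.pi * k * q / n_fft)
                         for k in range(1, half))
        return acc / n_fft
    q_min, q_max = int(fs / fmax), int(fs / fmin)
    best_q = max(range(q_min, q_max + 1), key=cepstrum)
    return fs / best_q

# Synthetic voiced frame: 8 harmonics of 200 Hz, 50 ms at 8 kHz
fs, f0 = 8000.0, 200.0
x = [sum(math.cos(2.0 * math.pi * h * f0 * i / fs) for h in range(1, 9))
     for i in range(400)]
f0_est = acf_cep_pitch(x, fs)
```

The estimate is quantised to whole-sample periods here; interpolating around the cepstral peak would refine it.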

No publications yet.

Multiband PoPi Decomposition

December 7, 2008

When the PoPi decomposition is applied to concurrent-speaker scenarios (cocktail party effect) in order to track multiple speakers who move while speaking, we see that the original formulation always shows only the more dominant speaker (i.e. the more dominant micro-periodicities in the given signal frame). The other speaker (let’s call him the background speaker) is suppressed and hardly visible in the representation.

A similar problem is addressed in Klapuri’s PhD work, which targets automatic transcription of simultaneous musical tones. Klapuri’s (and the most logical) way to enhance multiple pitch candidates is to give each of them a chance to be dominant in one of several frequency bands.


We do the same: the “multiband” version of the PoPi plane is based on subband processing. It already gives good results with as few as 17 bands. This has been confirmed on several double-talk and triple-talk scenarios recorded in 3 different rooms with different reverberation times. However, for non-speech-like scenarios and for more speakers, 17 bands might be too few.


The image above shows the PoPi decomposition of 2 concurrent speakers. Their positions and the corresponding pitch values are easy to read out. The recording shows the voices of Tania and Lukas.

[1] T. Habib, L. Ottowitz, and M. Kepési, “Experimental Evaluation of Multi-band Position-Pitch Estimation (M-PoPi) Algorithm for Multi-Speaker Localization,” INTERSPEECH 2008, Sept. 22-26, Brisbane, Australia.
[2] T. Habib, M. Kepési and L. Ottowitz, “Experimental Evaluation of the Joint Position-Pitch Estimation (PoPi) algorithm in Noisy Environments,” 5th IEEE Workshop on Sensor Array and Multi-Channel Signal Processing (SAM 2008), Jul. 21-23, Darmstadt, Germany.
[3] M. Kepési, L. Ottowitz and T. Habib, “Joint Position-Pitch Estimation for Multiple Speaker Scenarios,” IEEE Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA 2008), May 6-8, Trento, Italy.


Housing for Microphone Arrays for Size Minimization

December 7, 2008

It is the dream of every microphone array researcher to have a mic array the size of a matchbox. Unfortunately, many mic arrays to date measure hundreds of centimetres. This size restriction comes from the frequency of the signal we are trying to capture with the array: the lower the frequency, the bigger the array must be. But not any more…

101 – Miniature Microphone Array
102 – Main Enclosure
103 – Sensor array, Microphone Array
104 – Sensors, Microphones
105 – Air
106 – Media: Argon, Krypton, Xenon, Sulfur Hexafluoride, Carbon Dioxide, etc.

The Solution:
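If the array size depends on the wavelength, and we can NOT change the frequency of the signal under acquisition, what can we change in order to use smaller arrays? The guess is right: the speed of sound. Since wavelength = speed of sound / frequency, filling the enclosure with a medium in which sound travels more slowly than in air shortens the wavelength at every frequency.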
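To get a feeling for the numbers, the following sketch estimates the speed of sound in a few of the listed media with the ideal-gas formula c = sqrt(gamma * R * T / M) and compares the resulting wavelengths. The adiabatic indices and molar masses are approximate textbook values, and the 500 Hz test frequency and 20 °C temperature are arbitrary choices; none of these figures come from the patent.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol K)

def speed_of_sound(gamma, molar_mass_kg, temp_k=293.15):
    """Ideal-gas estimate c = sqrt(gamma * R * T / M)."""
    return math.sqrt(gamma * R * temp_k / molar_mass_kg)

# (adiabatic index, molar mass in kg/mol), approximate values
gases = {
    "air": (1.40, 0.02897),
    "argon": (5.0 / 3.0, 0.039948),
    "xenon": (5.0 / 3.0, 0.131293),
    "sulfur hexafluoride": (1.09, 0.14606),
}

freq_hz = 500.0
c_air = speed_of_sound(*gases["air"])
for name, (gamma, molar_mass) in gases.items():
    c = speed_of_sound(gamma, molar_mass)
    wavelength_cm = 100.0 * c / freq_hz
    print(f"{name:20s} c = {c:6.1f} m/s   wavelength = {wavelength_cm:5.1f} cm   "
          f"scale vs air = {c / c_air:.2f}")
```

If the array dimensions scale with the wavelength, the last column is the factor by which the array could shrink in that medium.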

EP2218267: Kubin, Kepesi, Stark: Housing for Microphone Arrays and Multi-Sensor Devices for Their Size-Optimization

The PoPi plane

December 7, 2008

The idea:
Although unvoiced speech sounds that dominate at high frequencies (“s”, “z”, “f”, etc.) give a very clear DoA estimate, the real information is still hidden in the not-so-clear and confusing micro-periodicities of the autocorrelation. These micro-periodicities carry information not only about the DoA of the source they belong to, but also about its pitch. This means that even with two moving, not always active acoustic sources, each source's identity can be tied to its position by linking its pitch to its DoA.

Method Description:
Decomposing a frame of a 2-channel acoustic signal into a Position-Pitch plane shows clearly where the source is (at which DoA angle) and what the pitch of that speaker is (from the correlation lag). The image below shows a PoPi plane extracted from a 16-channel circular mic array signal for a voiced speech frame.
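A toy version of the 2-channel decomposition is sketched below. It rests on the observation that the cross-correlation of a voiced source peaks at lags equal to the inter-channel delay plus or minus whole pitch periods. The scoring rule r(d) + r(d + T) + r(d - T), the use of sample delays instead of DoA angles, and the ideal pulse-train test signals are assumptions of this sketch, not the formulation from the papers.

```python
def cross_corr(x1, x2, lag):
    """Unnormalised cross-correlation sum of x1[i] * x2[i + lag]."""
    n = len(x1)
    return sum(x1[i] * x2[i + lag] for i in range(max(0, -lag), min(n, n - lag)))

def popi_plane(x1, x2, delays, periods):
    """Score every (delay, pitch period) pair, both measured in samples."""
    max_lag = max(abs(d) for d in delays) + max(periods)
    r = {lag: cross_corr(x1, x2, lag) for lag in range(-max_lag, max_lag + 1)}
    return {(d, T): r[d] + r[d + T] + r[d - T] for d in delays for T in periods}

# Toy source: pulse train with a 40-sample period (200 Hz at 8 kHz),
# arriving 3 samples later at the second microphone
n, period, delay = 400, 40, 3
x1 = [1.0 if i % period == 0 else 0.0 for i in range(n)]
x2 = [1.0 if i >= delay and (i - delay) % period == 0 else 0.0 for i in range(n)]

plane = popi_plane(x1, x2, range(-4, 5), range(21, 101))
best_delay, best_period = max(plane, key=plane.get)
```

With a known microphone spacing the delay axis can be mapped to a DoA angle, and the period axis to pitch via fs / T. The unnormalised correlation favours short lags slightly, which is what separates the true period from its double in this example.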


[1] M. Képesi, F. Pernkopf, M. Wohmayr, “Joint Position-Pitch Tracking for 2-Channel Audio,” CBMI 2007, Jun 25-27, Bordeaux, France
[2] M. Wohmayr, M. Képesi, “Joint Position-Pitch Extraction from Multichannel Audio,” Proc. Interspeech 2007, August 27-31, Antwerpen, Belgium
[3] M. Képesi, M. Wohmayr, T. Habib, “Pitch-Driven Position Estimation of Speakers in Multispeaker Environments,” The 3rd Congress of the Alps Adria Acoustics Association, September 27-28, 2007, Graz, Austria
Related work:
[4] T Habib, H Romsdorfer: Comparison of SRP-PHAT and multiband-POPI algorithms for speaker localization using particle filters, Proc. DAFX, 2010
[5] Stephan Gerlach, Stefan Goetze, Jörg Bitzer and Simon Doclo: Evaluation of Joint Position-Pitch Estimation Algorithm for Localising Multiple Speakers in Adverse Acoustical Environments, ICASSP, 2011, Prague
[6] MG Christensen, JX Zhang, SH Jensen: Joint DOA and Multi-Pitch Estimation based on Subspace Techniques, EURASIP Journal on Advances in Signal Processing 2012, 2012:1
[7] T Habib, H Romsdorfer: Auditory Inspired Methods for Localization of Multiple Concurrent Speakers, Computer Speech & Language, 2012
More publications
… under the Multichannel PoPi topic.


[1] Related page at SPSC (if it is not there, it is not there…)
[2] PoPi Demo Videos and Audio Files
[3] Elmar using the method for controlling the orientation of his robot.

[1] Simple and Fast Demo

Spectral Reindexing for Pitch Estimation

December 7, 2008

Need for a powerful but straightforward pitch estimation method.

The idea:
reordering the information represented by the frequency bins of a spectrogram (FFT, FChT or ChT) into an FoGram.

Auditory Perceptual Integration:
The main idea is to scan through all possible pitch candidates and assign to every candidate fo the sum of the energy values at fo, 2fo, … , nH·fo. As an equation it looks like this:

FoGram(fo) = S(1·fo) + S(2·fo) + … + S(nH·fo)

fo … pitch candidates (usually between 80 and 380Hz),
nH … number of harmonics considered for gathering,
S() … Spectral sample at i x fo.
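The harmonic gathering described above can be sketched in a few lines of Python. The function name `fogram`, the linear interpolation between neighbouring bins and the synthetic test spectrum are assumptions made for illustration; the published method may sample the spectrum differently.

```python
def fogram(spec, df, f0_grid, n_harm):
    """For each candidate f0, add up the spectrum at f0, 2*f0, ..., n_harm*f0.

    spec[k] is the spectral value at k*df Hz, df is the bin spacing in Hz.
    """
    def sample(freq):
        # Linear interpolation between the two neighbouring bins (assumption)
        pos = freq / df
        k = int(pos)
        if k + 1 >= len(spec):
            return 0.0
        frac = pos - k
        return (1.0 - frac) * spec[k] + frac * spec[k + 1]

    return [sum(sample(i * f0) for i in range(1, n_harm + 1)) for f0 in f0_grid]

# Synthetic spectrum with 20 Hz bins and peaks at the first 10 harmonics of 120 Hz
df = 20.0
spec = [0.0] * 200
for h in range(1, 11):
    spec[int(round(h * 120.0 / df))] = 1.0

f0_grid = [80.0 + 0.5 * k for k in range(601)]   # 80 .. 380 Hz in 0.5 Hz steps
energy = fogram(spec, df, f0_grid, n_harm=10)
best_f0 = f0_grid[energy.index(max(energy))]
```

Note that the candidate grid is much finer than the 20 Hz bin spacing: an error of 0.5 Hz in fo shifts the 10th harmonic by 5 Hz, which is why the summed curve is far more selective than any single bin.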

An example of such an FoGram derived from a HChT spectrogram is shown below (courtesy Cancela et al.):

After zooming into the image we see that the FoGram provides extremely fine frequency resolution (below 1 Hz!), far finer than the resolution of the spectral representation it is derived from (usually 10-30 Hz per frequency bin).

[1] M. Képesi, L. Weruaga, E. Schofield, “Detailed Multidimensional Analysis of our Acoustical Environment,” Forum Acusticum. Budapest (Hu), September 2005, pp. 2649-2654.
[2] M. Képesi and L. Weruaga, “High-resolution noise-robust spectral-based pitch estimation,” Interspeech 2005, pp. 313-316, Lisboa (P), Sep. 2005

Related Work:

[3] P. Cancela, “Tracking melody in polyphonic audio. mirex 2008,” in Proc. Music Inf. Retrieval Evaluation eXchange, 2008
[4] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, DAFx 2010.
(“F0-gram” ie. GlogS discussed in chapter 4)
[5] Pei Zhao, Zhiping Zhang, Xihong Wu: Monaural speech separation based on multi-scale Fan-Chirp Transform, ICASSP 2008. March 31 2008, Page(s): 161 – 164

Short-Time Chirp Transform

December 7, 2008

Smeared FFT representation of harmonic lines in case of changing pitch.

Method Description:
To replace the harmonics in the Fourier transform with Chirps.



Replacing the harmonics in the Fourier transform with properly designed chirps provides a new orthogonal transform that outperforms Fourier significantly in cases like the one mentioned above. The gain in T-F representation quality can be put to good use in speech enhancement and other methods, as discussed in our papers.
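As an illustration of what replacing sinusoids by chirps buys, the sketch below evaluates a fan-chirp style kernel directly, using the warping phi(t) = (1 + alpha*t/2)*t from the fan-chirp paper [6], and compares it with plain Fourier analysis (alpha = 0) on one partial with rising frequency. The direct summation, the function name and the test signal are assumptions of this sketch; it is neither the fast algorithm nor the original Short-Time Chirp Transform code.

```python
import cmath
import math

def fan_chirp(x, fs, freq, alpha):
    """Directly evaluate sum x(t) * sqrt(|phi'(t)|) * exp(-j*2*pi*freq*phi(t))."""
    n = len(x)
    acc = 0j
    for i, v in enumerate(x):
        t = (i - (n - 1) / 2.0) / fs          # time axis centred on the frame
        slope = 1.0 + alpha * t               # phi'(t), should stay positive
        phi = (1.0 + 0.5 * alpha * t) * t     # warped time
        acc += v * math.sqrt(abs(slope)) * cmath.exp(-2j * math.pi * freq * phi)
    return acc

# One partial at 1000 Hz in the frame centre, sweeping about 900 .. 1100 Hz in 40 ms
fs, n, alpha_true = 8000.0, 320, 5.0
x = []
for i in range(n):
    t = (i - (n - 1) / 2.0) / fs
    x.append(math.cos(2.0 * math.pi * 1000.0 * (1.0 + 0.5 * alpha_true * t) * t))

# Matched chirp kernel: the energy collapses into one sharp line
matched = abs(fan_chirp(x, fs, 1000.0, alpha_true))

# Plain Fourier analysis (alpha = 0): the same energy is smeared over the sweep
fourier_peak = max(abs(fan_chirp(x, fs, 800.0 + 5.0 * k, 0.0)) for k in range(81))
```

Comparing `matched` with `fourier_peak` shows how much taller the spectral line becomes once the kernel follows the pitch change.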

[1] Weruaga, L. and Képesi, M: “Speech analysis with the Short-time Chirp transform”, 8th European Conf. on Speech , EUROSPEECH 2003, Geneva, Sept 2003, vol.I, pp.53-56.
[2] Képesi, M. Weruaga, L.: “Speech Analysis with the Fast Chirp Transform, ” EUSIPCO 2004, the 12th European Signal Processing Conference, Wien, Austria, 7-10 September 2004
[3] Weruaga, L. and Képesi, M.: “EM-driven Stereo-like Gaussian Chirplet Mixture Estimation”, ICASSP 2005, IEEE International Conference on Acoustics, Speech, and Signal Processing. March 19–23, IV, pp. 473-476, 2005, Philadelphia, USA.
[4] L. Weruaga and M. Képesi, “Self-organizing chirp-sensitive artificial auditory cortical model,” Interspeech 2005, pp. 705-708, Lisboa (P), Sep. 2005.
[5] M. Képesi, L. Weruaga, “Adaptive chirp-based time-frequency analysis of speech signals”, Speech Comm., vol.48, pp. 474-492, 2006.
[6] L. Weruaga, M. Képesi, “The fan-chirp transform for non-stationary harmonic sounds”, Signal Proc., vol. 87, pp. 1504-1522, 2007.

Related Work:
[7] R Dunn, TF Quatieri, “Sinewave Analysis/Synthesis Based on the Fan-Chirp Transform,” IEEE Workshop on Applications of Signal Processing to …, 2007
[8] Maciej Bartkowiak, “Application of the Fan-Chirp Transform to Hybrid Sinusoidal+Noise Modeling of Polyphonic Audio,” Eusipco 2008, Lausanne, Switzerland
[9] Pei Zhao; Zhiping Zhang; Xihong Wu, “Monaural speech separation based on multi-scale Fan-Chirp Transform,” Acoustics, Speech and Signal Processing, 2008. ICASSP 2008, March 31 2008-April 4 2008 Page(s):161 – 164
[10] Ha Nguyen, Luis Weruaga: “Time–Frequency Analysis of Vietnamese Speech Inspired on Chirp Auditory Selectivity,” Book Series Lecture Notes in Computer Science, pp. 284-295, Springer Berlin / Heidelberg, Volume 5351/2008, 2008,
ISBN 978-3-540-89196-3

[11] “FAN CHIRP TRANSFORM FOR MUSIC REPRESENTATION”, P. Cancela, E. Lopez, M. Rocamora, DAFx 2010.
[12] Pei Zhao, Zhiping Zhang, Xihong Wu: Monaural speech separation based on multi-scale Fan-Chirp Transform, ICASSP 2008. March 31 2008, Page(s): 161 – 164
[13] Hui Yin, Climent Nadeu, and Volker Hohmann: Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition, Hindawi Publishing Corporation, EURASIP Journal on Audio, Speech, and Music Processing Volume 2009, Article ID 304579.

Auditory Feedback for Simulating Attention

December 7, 2008

The Trigger:
A need for a pitch estimation method that keeps track of the speaker of interest. In other words, a method that knows what to follow so as not to lose the pitch trajectory even when other speakers are active in the background (cocktail party effect).

Method Description:
The “what to follow” was the main question… I picked up the auditory model-based pitch estimation method, the block scheme of which is depicted below…


… and designed the “Enhance + summa” module so that it can accept feedback information from the estimated pitch value and boost the estimation if it was correct:


The block scheme clearly indicates that the answer to the “what to follow?” question is the “formant envelopes”, sampled at the estimated pitch value. Below is an example showing the internal states of the estimation module while it enhances the channels belonging to one speaker, with another active speaker in the background.


[1] Képesi, M.: “Auditory Model-Based Tracking of Mixed Acoustic sources,” Proc. of SPRA 2003, Rhodos, Greece 2003.

More detailed description here.