This page is under development: please excuse errors, duplications and unfinished sections.



ABSTRACT. On a normed linear space of signals an Adaptive Auto-Dissimilarity Function (AADF) is defined, to be used to estimate pitch contours in quasiperiodic signals such as those related to speech and music. Voicing, aperiodicity and intensity time functions are also directly found. This provides a rigorous, efficient and very fast algorithm, successfully used by the author for twenty years. Further analysis can be made on those pitch contours to obtain intensity, prosodic intonation patterns, rhythm, musical scale and melody estimation, and harmonic consonance.
0. Antecedents
Pitch and signal periodicity
The speech signal production
The musical signal production
Pitch estimators
1. The Normed Linear (Vector) Space of Signals
Quasiperiodic signals
Dissimilarity of signals
Adaptive Autodissimilarity
Aperiodicity, voicing and quasiperiod functions
 2. AADF Algorithm 
Problems and strategies
Relationships to other methods for pitch estimation.
Further applications
3. Conclusions
4. References
5. PEA: an actual implementation of AADF
0. Antecedents

Pitch and Periodicity

There is a perceptual feature in some sounds – called 'voiced' in speech, and 'melodic' or 'tuned' in music – named pitch. It goes from low to high (a conventional way of speaking), or from 'bass' to 'treble' in audio technology. Experience shows that pitch appears when the sound waveform – acoustical pressure or electrical voltage – is periodic or almost periodic, and disappears when it is not. Analyses of these periodic waveforms show that they can be obtained by summing a set of sinusoidal signals whose frequencies stand in an integer relationship 1, 2, 3, 4... This set is called the 'harmonic series' – better named 'succession' – and its components, 'harmonics'.

Since pitch remains appreciably unchanged when some harmonic components are reduced or suppressed, and so does the signal periodicity, pitch and period seem to be directly linked. In fact both terms, pitch and fundamental frequency, often appear interchanged in the specialized literature – though incorrectly, as they are perceptual and analytical terms, respectively.

Indeed, for periodic signals it is indifferent to look at the period or at the first harmonic f0; but for almost-periodic signals things behave differently: when the harmonic frequencies are not exact multiples of f0 (in fact we cannot properly speak of harmonics in this case) the composite period depends on all the (quasi)harmonics and no longer coincides with the inverse of f0. As this is the case with actual speech signals, it is not valid to estimate pitch by looking at f0 only; all the harmonics must be taken into account. And this is precisely what is done when we perform a periodicity analysis.

These are well-known facts, but we repeat them here to justify the estimation of pitch by processing the signal's periodicity properties, which is what our method does.

Speech Signal

The speech signal is mostly voiced (pitched), since pitch is perceived in all vowels, liquids, nasals, affricates and even in many stops. The waveform varies only slowly, presenting a strong periodicity, that is, a form repeated several times at regular (almost equal) time intervals, or quasiperiods. Since the vocal tract remains approximately constant during the quicker changes of f0 – the fundamental frequency, that of the vocal cords in the glottis – the resulting spectral envelope remains essentially stable; the result is a pattern shortened or lengthened according to f0, rather than an expanded or compressed spectrum. Therefore the waveform in two contiguous periods is almost the same during the duration of the shorter one. In speech studies this fact is related to the constancy of formants (responsible for timbre) during pitch evolution. For speech, pitch is produced by the vibration of the vocal cords and regulated (varied) by the tension of the laryngeal muscles, which varies continuously between two extreme (low and high) pitches; this range, called tessitura, depends on sex, age and subject.

So we will say that the voiced speech signal is locally quasiperiodic.

Speech periodicity can also be seen as emerging from a regularly spaced succession of vocal cord openings (and closings) that allow air thrusts which excite the vocal and nasal tracts on their way to the outer space. Each air thrust produces a single temporal response and, since during short time intervals – e.g. 100 msec – both the thrust or pulse shape and the vocal tract filter properties remain stable – its peaks, the formants, emphasize the frequencies which will form the bulk of the response frequency content – the spacing of the air pulses varies but not the shape of their time responses, of greater intensity at their beginnings. This accounts for the shape periodicity even during pitch changes.

Musical Signals

The musical signal is also mostly voiced or pitched, since melody (and harmony, for Western music) is the main musical feature, the other being rhythm. Even some rhythmical instruments utter pitched sounds, such as the timbale and the Indian tabla. The musical waveform presents strong similarities with the speech one: both exhibit pitched sounds with enhanced frequency zones – formants. For music, the differences are mainly the long time intervals with almost-constant pitch (the notes) and the fewer timbre (waveform) changes within a single instrument – the voice then appears as a hugely rich instrument, an orchestra of its own. There is also a greater tessitura for musical instruments: from three and a half octaves to seven or eight, versus two or two and a half for normal singing voices, and even one to one and a half for plain speech.

As with speech, pitched musical signals can be considered as emerging from a basic vibration, the excitation, filtered by a resonant body: strings by piano, violin or guitar boxes; reeds by clarinet, horn, saxophone or shehnai tubes; lips by trumpet or trombone metal tubes; even the tuned bars of vibraphones or xylophones use resonators; and the vibrations of tuned membranes, as in the timbale and the Indian tabla, are filtered by the hollow volumes to which they are clamped or fixed.

The excitation provides a basic pulse – whose shape can be controlled by means of finger position and velocity, or lip pressure – which is filtered by the resonant body, so producing a combined response that is output to the surrounding space. The frequency of the pulses is variable (string, tube or bar lengths) while the resonator is fixed and stable (its movements, as in violin bridges, are small and can be discarded in a first approximation); so its formants and frequency characteristics are stable too, and with them the shapes of its time responses, as in speech.

Hence the musical signal is mainly quasiperiodic.

Periodicity and harmonic content

As is well known, when several harmonics are added in time (point to point), the result is a periodic function whose period is that of the fundamental. This is often expressed, in a familiar way of speaking of perception, as "when hearing a voiced sound one perceives the fundamental".

1. The Normed Linear (Vector) Space of Signals

Space of Signals

Signals are real functions of a variable called time, t, t being either real (analog signals) or integer (discrete or sampled signals); signals take real values (positive or negative), and are bounded and continuous or almost continuous (with a finite number of discontinuities in any bounded time interval, when analog). They can be summed with one another and multiplied by real numbers, the results being again signals. The 0 signal and the opposite signal of f, -f, are also signals.

The set of these signals constitutes a linear (vector) space, S. In it, we can define a family of (semi)norms of f with real exponent p and window w as:

          || f ||_{p,w}  =  ( Σ_{t ∈ Sw} w_t | f_t |^p )^{1/p}                    1 <= p < ∞                                (1)

where the sum is extended over the support of w, Sw (the t interval where w is not null). So S becomes a seminormed space, Sp.

Time windows are special functions used to select and weight fragments of the whole signal, so as to obtain its local properties there. Thus they are positive continuous functions concentrated around instant 0, that is, they are not null in the vicinity of 0 and null otherwise. In this way they select only this vicinity of 0, the time interval Sw. On the other hand, their weighting property requires their sum or integral to be constant – increasing one sample means reducing another; so:

                             w_t > 0                       if t is within Sw

                             w_t = 0                       otherwise

                             Σ_t w_t = 1

To select other parts of the signal, the window is displaced to the desired point (instant) x, becoming d_x w. In this case we extend the family of seminorms to any point x in time:

          || f ||_{p,w,x}  =  ( Σ_{t ∈ Sw} (d_x w)_t | f_t |^p )^{1/p}

That is, we displace the window from instant 0 to instant x, selecting the fragment of the signal f within the (displaced) support of w, Sw; we raise the absolute value of the selected samples to the p power and multiply them by the corresponding values of w; we sum up (integrate) all these numbers and extract the p-th root of the result. Now we have the seminorm of signal f around instant x with window w and power p. This is a finite real number representing, roughly, a collective value of the absolute values of the samples, usually called the instantaneous signal amplitude or energy.
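As an illustration, the windowed p-seminorm of eq. (1) can be sketched in a few lines of Python; the names (`seminorm`, `f`, `w`) and the boundary handling are ours, not part of the text:

```python
import numpy as np

def seminorm(f, w, x, p=2):
    """Windowed p-seminorm of signal f around sample x, as in eq. (1).

    w is a discrete window whose samples sum to 1; it is displaced so that
    its midpoint lands on x. Samples falling outside f are ignored.
    """
    half = len(w) // 2
    lo, hi = x - half, x - half + len(w)
    # Clip the displaced window to the signal boundaries, trimming both sides.
    seg = np.abs(np.asarray(f[max(lo, 0):min(hi, len(f))], dtype=float))
    win = np.asarray(w)[max(0, -lo):len(w) - max(0, hi - len(f))]
    return float(np.sum(win * seg ** p) ** (1.0 / p))
```

Note that, because the window samples sum to 1, scaling the signal by a scales the seminorm by a, as property 3 below requires.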

As in any linear space, there exists a null signal, 0, whose values are all 0 for any t, and an opposite signal -f for each signal f, so that f_t = -(-f)_t, that is, their values are opposite at each instant t. Let us state some immediate properties of the p-norms that we will use later:

1. || 0 || = 0        The p-norm of the null signal is null.

2. || f || = 0 means f = 0 only inside Sw (that is why p-norms are only seminorms).

3. || a f || = a || f ||        for any positive real number a

4. || -f || = || f ||        for any signal f

5. || f - g || <= || f || + || g ||        Minkowski's inequality

6. || f - g || >= | || f || - || g || |        Derived from Minkowski's.

We are especially interested in the last two inequalities:

                         | || f || - || g || |   <=   || f - g ||   <=   || f || + || g ||

that is, the norm of the difference of two signals f and g is not less than the absolute value of the difference of their norms, and not greater than their sum (of course all the norms appearing in these inequalities share the same p, w, x).


The norm of the difference of two signals (their distance in the metric normed space S) grows as their waveforms grow more different, since the corresponding samples will also be more different. Thus it is a good index of their shape dissimilarity. But it is also affected by their amplitudes: || af - ag || = || a (f - g) || = a || f - g ||; to avoid this undesirable effect we can use the second Minkowski inequality, which provides us with an ideal normalizing factor for the difference, so that their quotient will always be comprised between 0 and 1:

  0 <=  || f - g || / ( || f || + || g || )  <= 1            with either || f || or || g || not null

Let us call D the central term comprised between 0 and 1. The value 0 is reached when the numerator is null, that is, when f = g i.s.w. ("inside the support of w"): both signals coincide i.s.w. Thus the value 0 corresponds to maximum similarity or minimum dissimilarity. D is defined only if the norms of f and g are not simultaneously 0, a situation that would nullify the denominator and leave D indeterminate. If only one of them is null, D will be || f - 0 || / ( || f || + || 0 || ) = || f || / || f || = 1, which means: any signal is maximally dissimilar to the null one. Can the value 1 be reached otherwise? Yes, when g = -f i.s.w., because then D = || f - (-f) || / ( || f || + || -f || ) = || 2f || / ( || f || + || f || ) = 2 || f || / 2 || f || = 1: opposite signals are maximally dissimilar.
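A minimal numeric sketch of D, assuming a flat (uniform) window so that plain vector p-norms can stand in for the windowed seminorms:

```python
import numpy as np

def dissimilarity(f, g, p=2):
    """D = ||f - g|| / (||f|| + ||g||), always comprised between 0 and 1.

    Undefined when both norms are null (the denominator would vanish).
    """
    nf, ng = np.linalg.norm(f, p), np.linalg.norm(g, p)
    if nf + ng == 0:
        raise ValueError("D is undefined when both segments are null")
    return np.linalg.norm(f - g, p) / (nf + ng)
```

It reproduces the three limit cases just discussed: D(f, f) = 0, D(f, 0) = 1 and D(f, -f) = 1.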

But we find D ill adjusted to our aim when comparing two signals equal but for size: let g be a fraction of f, g = a f. In this case D = || f - a f || / ( || f || + || a f || ) = (1-a) || f || / ( (1+a) || f || ) = (1-a)/(1+a), a fraction that comes near 1 when a is small. Since we want to deactivate the influence of signal size on the dissimilarity, we must restore a low value for this case (of course without reaching a = 0, a case already seen above, for which D = 1 is acceptable).

We find another ideal corrective element in the first term of the Minkowski inequalities: when subtracted from the present numerator of D – without overcoming it, because it is smaller or equal – it diminishes it by an amount equivalent to the difference of their norms. Thus we define a new expression, D', as:

    D'(f,g)  =  ( || f - g || - | || f || - || g || | )  /  ( || f || + || g || )

See in the figure the vector representation of these operations.

Now the former problem is removed: for g = a f, D' = ( || f - a f || - | || f || - || a f || | ) / ( || f || + || a f || ) = ( (1-a) - (1-a) ) || f || / ( (1+a) || f || ) = 0, a minimum dissimilarity for 'homothetic' signals, as we intended: only the shape counts, not the size.

However, now, for one null signal, D' = ( || f - 0 || - | || f || - || 0 || | ) / ( || f || + || 0 || ) = 0: both become maximally similar, i.e. equal, something neither expected nor seemingly desirable. A compromise must be made between including the subtracting term and suppressing it: this is done with a coefficient b comprised between 0 and 1. In this way D' arrives at its final expression: the dissimilarity between two signals is defined as

         D'(f,g)  =  ( || f - g || - b | || f || - || g || | )  /  ( || f || + || g || )

The representation of the value of D' as a function of a, with b as a parameter, shows its behavior with the mentioned homothetic signals: for g = a f, D' = ( || f - a f || - b | || f || - || a f || | ) / ( || f || + || a f || ) = ( (1-a) - b(1-a) ) || f || / ( (1+a) || f || ) = (1-b)(1-a)/(1+a), a value equal to 0 for g = f, to (1-b) for g = 0, and – computed directly, outside this formula – to 1 for opposite signals, g = -f. For a = .5 and b = .5, D' = 1/6 ≈ .17, a low value not far from 0 in the 0–1 range.
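The final D', with the compromise coefficient b, can be sketched likewise (again with a flat window as a simplifying assumption):

```python
import numpy as np

def dissimilarity_b(f, g, p=2, b=0.5):
    """D' = (||f - g|| - b |  ||f|| - ||g||  |) / (||f|| + ||g||).

    b = 0 recovers the plain D; b = 1 makes homothetic signals (g = a f)
    fully similar. Undefined when both norms are null.
    """
    nf, ng = np.linalg.norm(f, p), np.linalg.norm(g, p)
    if nf + ng == 0:
        raise ValueError("D' is undefined when both segments are null")
    return (np.linalg.norm(f - g, p) - b * abs(nf - ng)) / (nf + ng)
```

For g = .5 f and b = .5 this returns 1/6, the value worked out above.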

Autodissimilarity Function of Delay

When we calculate the value of the dissimilarity D between two segments of the same signal f, we call it Autodissimilarity, AD. Both segments are selected and weighted by a time window placed on the signal at two spots in relative delay τ around instant t. By varying τ, we finally obtain:

    ADF_{f,t,p,w,b}(τ)  =  ( || f_{t+τ/2} - f_{t-τ/2} || - b | || f_{t+τ/2} || - || f_{t-τ/2} || | )  /  ( || f_{t+τ/2} || + || f_{t-τ/2} || )

where f_{t±τ/2} denotes the signal f windowed around instant t ± τ/2.

We now have a new function related to the periodic properties of f at (or around) instant t, as we intended. Indeed, when τ = 0 both segments are the same for any f, and ADF = 0.

Let us suppose that f is periodic with period T. When τ grows, the segments become different until τ equals the signal period T and the selected segments become equal again; the process continues in the same way, so we finally obtain a function of the delay τ which is itself periodic. At all other points ADF will be positive, reaching the value 1 only if f inside the first window is equal to -f inside the second one: this happens, for instance, in sinusoidal, triangular, sawtooth or square signals with null mean value.

Let us note that if w weights all the samples in the period equally, the ADF for different instants t will all be the same, since for each one only the order of the samples changes in the calculation of the norms, not their values. This requirement will be developed later, in the definition of compensated windows.

For quasiperiodic signals it can be expected that ADF will also have a quasiperiodic character, since the samples vary slowly from period to period.

Thus, periodic signals will present null minima at all the multiples of T, including 0. For quasiperiodic signals, minima (null only at τ = 0) are to be expected at the multiples of the quasiperiod, and the more periodic the character, the smaller the minima. The minima at high multiples will surely be greater than at small multiples, since the changes from one quasiperiod to the next accumulate over distant quasiperiods.
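The behavior just described can be checked numerically with a sketch of the ADF using flat windows of support τ (a simplification of the compensated windows described later; the function and parameter names are illustrative): for a periodic signal, the delay of minimum ADF recovers the period.

```python
import numpy as np

def adf(f, t, tau, p=2, b=0.5):
    """ADF at instant t and delay tau: the dissimilarity D' of the two
    tau-long segments meeting at t (flat windows of support tau, k = 1)."""
    s1, s2 = f[t - tau:t], f[t:t + tau]
    n1, n2 = np.linalg.norm(s1, p), np.linalg.norm(s2, p)
    if n1 + n2 == 0:
        return 1.0
    return (np.linalg.norm(s2 - s1, p) - b * abs(n1 - n2)) / (n1 + n2)

def estimate_period(f, t, tau_min, tau_max):
    """Delay of minimum ADF in the search range: the quasiperiod candidate."""
    return min(range(tau_min, tau_max + 1), key=lambda tau: adf(f, t, tau))
```

For a sinusoid of period 50 samples, the minimum of ADF over the search range falls at τ = 50, where ADF is essentially null.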

A problem, theoretical and practical, is to determine how deep a minimum must be to decide whether the processed signal is 'quasiperiodic' or not, a matter without a definite answer. An (arbitrary) threshold a must be adopted: when the minimum is smaller than a, the signal f will be considered quasiperiodic at the instant t, and aperiodic otherwise.

The matter, that is, the value of a, will be decided empirically; and, when we speak of pitch, it is the ear that should judge. So a quasiperiodic signal will be considered 'pitched' or 'voiced' when the (general) ear decides so.

Aperiodicity, Quasiperiod and Intensity Time Functions

When we change the instant t around which we calculate the ADF of the delay to another nearby instant, the samples will be almost the same (only some are suppressed and some new ones appear) and similarly weighted (due to the smoothness of the windows w). So all the values of ADF will be similar, for each one of the delays. Thus the minimum to be found in quasiperiodic signals will also have a similar value, and will probably lie on the same side of the threshold a, so that the classification as quasiperiodic or aperiodic at both neighboring instants will probably be the same. Continuing this small displacement of t, we originate new time functions.

The first is the value A of this running minimum, the Aperiodicity function, which summarizes the periodicity properties of f along time. An immediate consequence, by means of the threshold a, is the segmentation of the signal f into quasiperiodic and aperiodic fragments. When the signals carry (as is our intention) speech or music information and a has been related to perceptive pitch, we can call these classified fragments pitched, voiced or tuned, while the others will be non-pitched, unvoiced, non-tuned or noise.

The second obtained function of t is the Quasiperiod function, that is, the value of τ at the minimum A. This is the estimation of the local (quasi)period of f at t. Of course the Quasiperiod function is defined only within voiced segments.

The inverse of the Quasiperiod at each instant provides a new and useful function: the Pitch function, supposedly what the auditory apparatus perceives in time, that is, intonations, melodies, accents, etc.

As an additional fruit of the former calculations we obtain one more function: the Intensity function, which we define as the norm at t, extending the integration to a window of support T. This support will provide smoother functions than any other.
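A toy frame-by-frame sketch of the three time functions just defined, under the same simplifying assumptions (flat windows of support τ, an illustrative threshold `alpha`); this is only an illustration, not the author's implementation (PEA):

```python
import numpy as np

def frame_analysis(f, hop, tau_min, tau_max, alpha=0.2, p=2, b=0.5):
    """Aperiodicity, Quasiperiod and Intensity time functions (sketch).

    The Quasiperiod entry is None where the frame is classed aperiodic,
    i.e. where the minimum of ADF is not below the threshold alpha.
    """
    def adf(t, tau):
        s1, s2 = f[t - tau:t], f[t:t + tau]
        n1, n2 = np.linalg.norm(s1, p), np.linalg.norm(s2, p)
        if n1 + n2 == 0:
            return 1.0
        return (np.linalg.norm(s2 - s1, p) - b * abs(n1 - n2)) / (n1 + n2)

    times, aper, qper, inten = [], [], [], []
    for t in range(tau_max, len(f) - tau_max, hop):
        best_tau = min(range(tau_min, tau_max + 1),
                       key=lambda tau: adf(t, tau))
        a = adf(t, best_tau)
        times.append(t)
        aper.append(a)                                 # Aperiodicity
        qper.append(best_tau if a < alpha else None)   # Quasiperiod (voiced only)
        inten.append(float(np.linalg.norm(f[t:t + best_tau], p)))  # Intensity
    return times, aper, qper, inten
```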

 2. AADF Algorithm 


We call AADF (Adaptive Autodissimilarity Function) an algorithm that calculates the ADF only when necessary; that is, we calculate only the segment of the ADF where we expect the minimum to occur: around the last estimated pitch, of course, because this parameter does not change quickly but rather smoothly. In analytical terms that means that the variation of pitch between neighboring points is limited. Then we take the last estimated period as the value of τ around which the ADF is calculated. In that way the calculation follows the pitch, thus saving a lot of useless values of ADF.

The extent of the search around the former pitch depends on the temporal interval between ADF estimations and on the expected pitch variation of the signal itself (think of a contrabass melody versus a bird song).
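The adaptive search can be sketched as follows: each frame scans τ only within a relative spread around the last estimate. The names `track_pitch` and `spread`, and the flat windows, are illustrative assumptions, not the author's implementation:

```python
import numpy as np

def track_pitch(f, hop, tau0, spread=0.15, alpha=0.2, p=2, b=0.5):
    """AADF-style tracking sketch: search tau only around the last estimate.

    tau0 is the initial period guess (samples); spread limits the relative
    change allowed between frames. Returns one period (or None when the
    frame is classed aperiodic) per hop.
    """
    def adf(t, tau):
        s1, s2 = f[t - tau:t], f[t:t + tau]
        n1, n2 = np.linalg.norm(s1, p), np.linalg.norm(s2, p)
        if n1 + n2 == 0:
            return 1.0
        return (np.linalg.norm(s2 - s1, p) - b * abs(n1 - n2)) / (n1 + n2)

    tau, periods = int(tau0), []
    t = int(tau0 * (1 + spread)) + 1
    while t + 2 * tau < len(f):
        lo = max(2, int(tau * (1 - spread)))       # search window follows
        hi = int(tau * (1 + spread)) + 1           # the last estimate
        tau = min(range(lo, hi), key=lambda c: adf(t, c))
        periods.append(tau if adf(t, tau) < alpha else None)
        t += hop
    return periods
```

Starting from a rough guess, the estimate locks onto the true period within the first frame and then follows it.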


But there is another important adaptation in the ADF calculation: the support of the window itself (the segment of t where w is not null) is made proportional to the delay τ. The reason for this second adaptation relates to the usual adaptation of the observation window to the size of the expected phenomenon. When we look in a map for a name, we adapt our window size to the expected size of the item: a big window for country names and a small one for little town names. In music we select windows of seconds for rhythms, windows of tenths of seconds for pitches, and so on.

Working on a signal, when we look for periodic phenomena, we use small windows for small phenomena (high pitches) and big windows for greater phenomena (low pitches). We have, thus:

Sw = k · τ

Also, to "get" the same energy from the signal independently of the window support, we have to change its height accordingly in order to keep the same area. Each family (shape) of windows then admits the general equation:

w_a(t) = (1/a) · V(t/a)

Two windows of the same family (triangular) with the same area and shape, their supports inversely proportional to their heights.

The constant k takes a value around one, which means that the support is similar to the searched period T. A bigger value would average pitch over several periods, while a shorter one would take into consideration incomplete periods of the signal, thus becoming more sensitive to pitch changes but less reliable or robust.
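A sketch of such a compensated window family: the support follows Sw = k·τ while normalization keeps the (discrete) area constant, as in w_a(t) = (1/a)·V(t/a). The triangular prototype V is a choice of ours, not prescribed by the text:

```python
import numpy as np

def scaled_window(tau, k=1.0):
    """Triangular window with support Sw = k * tau, scaled to unit area.

    Stretching the support by a factor a while dividing the height by a
    keeps the area constant, so windows for different delays extract
    comparable energy from the signal.
    """
    n = max(int(round(k * tau)), 2)
    w = np.bartlett(n)      # triangular prototype V
    return w / w.sum()      # normalize: samples sum to 1 (unit area)
```

Doubling τ doubles the support and halves the height, leaving the area unchanged.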

An actual placement of windows to calculate the ADF at two points, that is, for two values of the delay τ.

The result, for unvoiced and voiced sounds, can be seen in the next figures.

The actual AADF (above) for a Spanish [s] (below).
The actual AADF (above) for a long Spanish [e] (below).

See a description of the algorithm here.

Problems and Strategies

The problems faced by our algorithm are similar to those encountered in all the others:

A1. Harmonics (especially those emphasized by formants) preferred to (true?) pitch.

A2. Sub-harmonics preferred to (true?) pitch.

A3. Noise-embedded pitch.

A4. Voicing threshold.

A5. Random contiguous pitch values.

A6. Pitch discrimination.

A7. Direction of calculus.

We can take the natural time direction, the opposite one, or even a more complicated strategy in order to find short but reliable segments from which to continue the analysis into less clear segments (such as voiced fricatives, voiced occlusives or quickly changing pitches).

A8. Problem adaptation.

There is no such thing as a universal pitch algorithm (except perhaps the ear), so we must adapt our algorithm, that is, select its parameters for the specific type of signal we expect to estimate. We can consider several general cases:

- Human speech (with sub-cases: men, women and children)
- Human singing (the same cases, now expanded with the usual voice tessituras: bass, baritone, tenor, countertenor; alto, contralto and soprano).
- Animal pitch (from the big to the small: baleen whale, elephant, bull, lion, tiger, monkey, eagle, snake, nightingale, bat, etc.)
- Mechanical pitches (rotors, motors, etc.).
- Other human and animal noises (bees, flies, crickets...).
- Musical instruments (big and small; string, wind and percussion; continuous or percussive sound)

and some others, specific to our method, but also common to others under different nomenclature:

       B1. Value of exponent p in the norm definition.
       B2. Window shape.
       B3. Value of alpha (window support). The constant k (alpha in some descriptions) takes a value around 1,
                that is, the window support takes a value near τ.
       B4. Value of beta.
       B5. Treble de-emphasis.
       B6. Uncertainty of the estimated pitch.
       B7. Interpolation.
       B8. Smoothing.
Relationships to other methods for pitch estimation.

Further Applications

C1. Pitch transcription.
From the most general point of view, the problem of pitch transcription involves several related stages:
1) The choice of an algorithm for periodicity estimation.
2) A threshold for the voicing/unvoicing decision.
3) The implementation of a function to transform periodicities into pitches, using ear-perception properties (mel scale); these properties also affect the second stage.
4) The choice of the precision of the calculus in order to get closer pitch values, and a representation of the pitch, voicing and intensity functions of time.
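Stage 3 above can be sketched with the standard perceptual formulas; the function name and return convention are illustrative, while the mel and equal-temperament formulas are the usual ones:

```python
import math

def period_to_pitch(period_s):
    """Map a quasiperiod in seconds to perceptual pitch scales.

    Returns the fundamental frequency (Hz), its mel value and the nearest
    equal-tempered MIDI note number.
    """
    f0 = 1.0 / period_s
    mel = 2595.0 * math.log10(1.0 + f0 / 700.0)    # O'Shaughnessy mel formula
    midi = round(69 + 12 * math.log2(f0 / 440.0))  # nearest tempered note (A4 = 69)
    return f0, mel, midi
```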
C2. Scale estimation.
C3. Polyphonic (chord) pitch estimation.
C4. Rhythm estimation and transcription.
C5. Graphic domain.
C6. Generalized theory of likeness.
C7. ADG: AutoDissimilarityGram

3. Conclusions

(to be continued)


4. References

AMERIO, L. et al., Almost-periodic functions and functional equations, Van Nostrand, New York, 1971.

COMER, D. J. "The use of waveform asymmetry to identify voiced sounds", IEEE Trans. AU, vol.16, p.500, Dec.1968.

DOLANSKY, L. O., "An instantaneous pitch-period indicator", JASA, vol.27, p.67, Jan.1955.

GOLD, B. "Computer program for pitch extraction", JASA, vol.34

HESS, W., "On-line pitch determination of speech signals using a non-recursive digital filter", 8th ICA, London, 1974.

KOLMOGOROV, A. N. et al., Elementos de la teoría de funciones y del análisis funcional [Elements of the Theory of Functions and Functional Analysis], Mir, Moscow, 1972.

MAISSIS, A. H., "Méthode d'extraction du fondamental", L'Onde Electrique, vol.53, p.110, March 1973.

MAKSYM, J. N., "Real-time pitch extraction by adaptive prediction of the speech waveform", IEEE Trans. on AU, vol.21, p.149, Jun.1973.

MOORER, J. A. "The optimum comb method of pitch period analysis of continuous digitized speech", IEEE Trans. on ASSP, vol.22, p.330, Oct.1974.

NOLL, A. M. "Cepstrum pitch determination", JASA, vol.41, p. 293, 1967.

RABINER, L. R., "On the use of autocorrelation analysis for pitch detection", IEEE Trans. on ASSP, vol.25, p.24, Feb.1977.

REDDY, D. R., "Pitch period determination of speech sounds", Comm. ACM, vol.10, p.343, 1967.

Sánchez, F. J., Tratamiento de señales cuasiperiódicas. Aplicación a la estimación del tono fundamental [Processing of quasiperiodic signals: application to fundamental pitch estimation]. Doctoral thesis, Escuela Técnica Superior de Ingenieros Industriales, Madrid, 1982.

Sánchez, F. J., "Dissimilarity and Aperiodicity functions: temporal processing of quasiperiodic signals", 9th International Congress of Acoustics, Madrid, p.859, Jul.1977.

Sánchez, F. J. "Application of dissimilarity and aperiodicity functions to fundamental frequency measurement of speech and voiced-unvoiced decision", 9th International Congress of Acoustics, p.523, Madrid, 1977


5. An actual implementation: PEA



Last page update: 14/03/14