Expressive Timing Extraction & Acoustic Alignment of Polyphonic Audio to MIDI Based on MATCH
EE 675 Approaches to Music Cognition
Musical Prosody and Interpretation
University of Southern California
Spring 2010
Question
Can expressive tempo be measured automatically and reliably from an audio recording?
Idea & Method
Tempo rubato is one commonly studied metric of musical prosody. Usually, it is calculated from manual annotation
of note onset times, as the percentage change in instantaneous tempo. Sonic Visualiser
is a modern tool that allows such manual onset annotations. This manual method, however, is tedious and
time consuming. An automated method for extracting this kind of information would be desirable.
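As a minimal sketch of the calculation described above, the following Python computes instantaneous tempo from a list of onset times and then its percentage change. The function names, the example onsets, and the assumption that consecutive onsets are exactly one beat apart are illustrative only; the percentage change is taken relative to the previous tempo value, which is one plausible reading of the definition.

```python
def instantaneous_tempo(onsets, beats_per_onset=1.0):
    """Tempo in BPM for each inter-onset interval.

    Assumes (hypothetically) that consecutive onsets are `beats_per_onset`
    beats apart in the score.
    """
    tempi = []
    for prev, nxt in zip(onsets, onsets[1:]):
        ioi = nxt - prev
        if ioi <= 0:
            raise ValueError("onsets must be strictly increasing")
        tempi.append(60.0 * beats_per_onset / ioi)
    return tempi


def percent_tempo_change(tempi):
    """Percentage change of each tempo value relative to the previous one."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(tempi, tempi[1:])]


# Made-up onset annotations in seconds: the third interval is stretched.
onsets = [0.0, 0.5, 1.0, 1.625, 2.125]
tempi = instantaneous_tempo(onsets)
rubato = percent_tempo_change(tempi)
```

With these made-up onsets, the stretched interval appears as a drop in tempo followed by a return to the original value.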
MIDI representations contain the onset times of all the notes in a piece and can thus be used as an
annotation. The MATCH algorithm is used to acoustically align sets of similar recordings (1). The algorithm
has been implemented in Java for alignment but is currently unable to play back the two aligned recordings
simultaneously. The alignment process works by dynamic time-warping one of the recordings to minimize
the perceptual spectral differences between the two recordings.
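The idea of dynamic time-warping can be sketched in a few lines of Python. This is a plain offline DTW with a Euclidean frame cost and the three basic steps (diagonal, vertical, horizontal). It is not the MATCH implementation, which uses its own efficient path search (1), and the function names and toy feature frames are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(ref, perf):
    """Return (path, total_cost) aligning frame sequences ref and perf."""
    n, m = len(ref), len(perf)
    inf = float("inf")
    # acc[i][j]: minimum total cost of aligning ref[:i+1] with perf[:j+1]
    acc = [[inf] * m for _ in range(n)]
    step = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = euclidean(ref[i], perf[j])
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            acc[i][j] = cost + best
            step[i][j] = prev
    # Backtrack from the final frame pair to the first one.
    path = [(n - 1, m - 1)]
    while step[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(step[i][j])
    path.reverse()
    return path, acc[n - 1][m - 1]


# Toy one-dimensional "spectra": the performance lingers on the second frame.
ref = [[0.0], [1.0], [2.0], [3.0]]
perf = [[0.0], [1.0], [1.0], [2.0], [3.0]]
path, total = dtw_path(ref, perf)
```

Each pair (i, j) in the returned path says that frame i of the first sequence corresponds to frame j of the second, so multiplying the indices by the hop size gives a time-map between the two.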
If we have a MIDI representation available, we can align its synthesized version with a corresponding recording.
The result of the MATCH algorithm is a time-map which relates time in one recording to time in the other. Using this
time-map, we can warp the onsets available from the MIDI to generate an expressive version.
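A small Python sketch of this warping step follows, assuming the time-map is stored as two parallel lists of corresponding times (MIDI time and recording time) and that onsets falling between map points are linearly interpolated. The representation, the function name, and the numbers are hypothetical and are not taken from the MATLAB scripts.

```python
from bisect import bisect_right


def warp_times(times, map_src, map_dst):
    """Map each time in `times` from the source timeline to the target one.

    map_src must be non-decreasing; map_dst holds the corresponding target
    times. Times outside the map are clamped to its end points.
    """
    warped = []
    for t in times:
        if t <= map_src[0]:
            warped.append(map_dst[0])
            continue
        if t >= map_src[-1]:
            warped.append(map_dst[-1])
            continue
        k = bisect_right(map_src, t)  # map_src[k-1] <= t < map_src[k]
        t0, t1 = map_src[k - 1], map_src[k]
        y0, y1 = map_dst[k - 1], map_dst[k]
        warped.append(y0 + (y1 - y0) * (t - t0) / (t1 - t0))
    return warped


# Made-up time-map (seconds): MIDI time -> recording time.
map_midi = [0.0, 1.0, 2.0, 3.0]
map_recording = [0.0, 1.2, 2.0, 3.5]
midi_onsets = [0.5, 1.0, 2.5]
expressive_onsets = warp_times(midi_onsets, map_midi, map_recording)
```

Writing the warped onsets back into the note list, with pitches and velocities unchanged, would give a MIDI file that follows the timing of the recording.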
Comparing the inter-onset intervals (IOIs) of the MIDI with those of the aligned recording gives a measure of
relative tempo, since the metrical level of the onsets in MIDI format is ambiguous. If, however, the MIDI represents
a true constant (known) tempo with expressionless timing, then the relative tempo is equivalent to the absolute tempo rubato.
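The IOI comparison can be illustrated with a short Python sketch. Relative tempo is taken here as the ratio of each MIDI IOI to the corresponding IOI in the aligned recording, so a value below 1 means the performer is slower than the MIDI. The function names, the example onsets, and the 120 BPM reference tempo are assumptions made for the example.

```python
def inter_onset_intervals(onsets):
    """Differences between consecutive onset times."""
    return [b - a for a, b in zip(onsets, onsets[1:])]


def relative_tempo(midi_onsets, performed_onsets):
    """Ratio of MIDI IOI to performed IOI for each pair of matching onsets."""
    if len(midi_onsets) != len(performed_onsets):
        raise ValueError("onset lists must have the same length")
    ratios = []
    for m, p in zip(inter_onset_intervals(midi_onsets),
                    inter_onset_intervals(performed_onsets)):
        if p <= 0:
            raise ValueError("performed onsets must be strictly increasing")
        ratios.append(m / p)
    return ratios


# Made-up data: a deadpan MIDI at one onset every 0.5 s, and a performance
# that stretches the second interval.
midi_onsets = [0.0, 0.5, 1.0, 1.5]
performed_onsets = [0.0, 0.5, 1.25, 1.75]
rel = relative_tempo(midi_onsets, performed_onsets)

# If the MIDI is known to run at a constant tempo (assumed 120 BPM here),
# scaling the relative tempo by it gives an absolute tempo curve.
MIDI_BPM = 120.0
absolute = [r * MIDI_BPM for r in rel]
```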

Symbolic and Recorded Musical Sources
Monophonic Alignment
-Two MIDI/mp3 pairs were tested. The first, Beethoven's Moonlight Sonata played by Maurizio Pollini, was aligned with
a synthesized MIDI version of the same piece.
Spectrogram of the first 15 seconds of a recording of Beethoven's Moonlight Sonata (Pollini) alongside the same for the MIDI
(Hamming window size = 2048 samples (46 ms), hop size = 20 ms).
Spectrogram plots showing only the increase in energy spectral density for the first 15 seconds of a recording of Beethoven's Moonlight Sonata (Pollini)
and the corresponding MIDI.
Increase in energy spectral density using a nonlinear bin spacing for the first 15 seconds of a synthesized MIDI version of the same piece. The nonlinear frequency bins map low frequencies (< 370 Hz) linearly to bin numbers and high frequencies
(> 370 Hz) to a logarithmic semitone bin spacing. Energies at frequencies above 12.5 kHz are all summed together in the highest bin number, following MATCH (1).
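The bin mapping and the "increase in energy" feature can be sketched in Python as follows. The sketch assumes a 44.1 kHz sample rate (consistent with 2048 samples being about 46 ms), counts semitone bins upward from 370 Hz, and takes the increase in energy to be the positive part of the frame-to-frame difference. The edge conventions and function names are illustrative and may differ from those in myExtractFeatures.m.

```python
import math

LOW_EDGE_HZ = 370.0     # below this, FFT bins are kept one-to-one
HIGH_EDGE_HZ = 12500.0  # at or above this, everything goes into the top bin


def build_bin_map(sample_rate=44100, n_fft=2048):
    """Return (mapping, n_bins): mapping[k] is the feature bin of FFT bin k."""
    spacing = sample_rate / n_fft
    n_linear = int(LOW_EDGE_HZ / spacing) + 1
    n_semitones = math.ceil(12 * math.log2(HIGH_EDGE_HZ / LOW_EDGE_HZ))
    top_bin = n_linear + n_semitones
    mapping = []
    for k in range(n_fft // 2 + 1):
        freq = k * spacing
        if freq < LOW_EDGE_HZ:
            mapping.append(k)                      # linear region
        elif freq < HIGH_EDGE_HZ:
            semitone = int(12 * math.log2(freq / LOW_EDGE_HZ))
            mapping.append(n_linear + semitone)    # semitone region
        else:
            mapping.append(top_bin)                # summed top bin
    return mapping, top_bin + 1


def map_energy(fft_energy, mapping, n_bins):
    """Sum the energy of the FFT bins into the nonlinear feature bins."""
    feature = [0.0] * n_bins
    for k, energy in enumerate(fft_energy):
        feature[mapping[k]] += energy
    return feature


def energy_increase(prev_feature, cur_feature):
    """Keep only the increases in energy between consecutive frames."""
    return [max(c - p, 0.0) for p, c in zip(prev_feature, cur_feature)]


mapping, n_bins = build_bin_map()

# Two made-up energy spectra (one value per FFT bin, 0..1024).
frame_a = [0.0] * 1025
frame_b = [0.0] * 1025
frame_a[10], frame_b[10] = 1.0, 3.0      # low-frequency energy grows
frame_a[100], frame_b[100] = 5.0, 1.0    # mid-frequency energy decays
frame_b[1000] = 2.0                      # new energy above 12.5 kHz
increase = energy_increase(map_energy(frame_a, mapping, n_bins),
                           map_energy(frame_b, mapping, n_bins))
```

Under these assumptions the mapping produces 80 feature bins, the same length as the vectors used for the alignment cost. Discarding decreases in energy emphasizes note onsets, which are the events that matter for alignment.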
Cost matrix from dynamic time-warping between the Moonlight Sonata MIDI and the recording, showing the path of minimum total cost. The cost of
alignment between two frames is calculated as the Euclidean distance between the 80-element nonlinear frequency bin vectors.
Relative tempo for the Moonlight Sonata between the MIDI and Pollini's recording. This shows the relative tempo that is automatically calculated
from the MIDI onset information through the dynamic time-warping, as well as that which is obtained through manual annotation using Sonic Visualiser.
Aligning MIDI to Recordings
Aligning the MIDI representation to match the expressive timing of the recording can be accomplished simply by reversing the source and target
audio files given to the MATCH implementation. However, we can take advantage of the onset information contained within the MIDI file, along
with the dynamic time-warp map obtained before, to re-map the MIDI onsets, producing a new MIDI file with a similar expressive timing characteristic.
Taking the original Moonlight Sonata MIDI and remapping the onsets to match those of Pollini's recording results in
this MIDI.

Polyphonic Alignment
The same work was applied to an example of polyphonic music. The Beach Boys' hit Wouldn't It Be Nice, along with
this MIDI, provides a more difficult alignment task with which to test the performance of MATCH.
This MIDI is polyphonic but does not model the vocalization in the original recording.
Below is the MATCH feature extraction for the first 15 seconds of Wouldn't It Be Nice.
Cost matrix from dynamic time-warping between the Wouldn't It Be Nice MIDI and the recording, showing the path of minimum total cost. The cost of
alignment between two frames is calculated as the Euclidean distance between the 80-element nonlinear frequency bin vectors.
Relative tempo for Wouldn't It Be Nice between the MIDI and the Beach Boys' recording. This shows the relative tempo that is automatically calculated
from the MIDI onset information through the dynamic time-warping, as well as that which is obtained through manual annotation using Sonic Visualiser.
Conclusions
-The MATCH algorithm allows for the automatic extraction of (relative) tempo. This works with both monophonic and polyphonic pieces and is robust enough
for alignment even when vocals are not part of the MIDI model.
-The alignment of a MIDI and a recording can be used to automatically generate a new MIDI which matches the timing expression of the recording.
-The use of a phase vocoder allows simultaneous playback of a recording and its aligned MIDI, demonstrating the alignment result.
MATLAB Script Files
myMATCH.m
myExtractFeatures.m
dp.m
getTempoChanges.m
stft.m
istft.m
matrix2midi.m
midiInfo.m
myMapOnsetsToTarget.m
pvoc.m
pvsample.m
readmidi.m
writemidi.m
Related Work
1) Dixon, S., Widmer, G. "MATCH: A Music Alignment Tool Chest", Proc. 6th International
Conference on Music Information Retrieval, 2005.
2) Hu, N., Dannenberg, R.B., Tzanetakis, G. "Polyphonic Audio Matching and Alignment
for Music Retrieval", Proc. 2003 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics.
3) Turetsky, R., Ellis, D. "Ground-Truth Transcriptions of Real Music from Force-Aligned
MIDI Syntheses", Proc. 4th International Conference on Music Information Retrieval, 2003.
Thanks to Dan Ellis of Columbia University for his iterative MATLAB solution to the dynamic programming.
Thanks to www.kenshutte.com for the MATLAB MIDI routines.
You can e-mail me at:
highfill@usc.edu