Expressive Timing Extraction & Acoustic Alignment of Polyphonic Audio to MIDI Based on MATCH


EE 675 Approaches to Music Cognition
Musical Prosody and Interpretation
University of Southern California
Spring 2010

Question

Can expressive tempo be measured automatically and reliably from an audio recording?

Idea & Method

Tempo rubato is one commonly studied metric of musical prosody. It is usually calculated from manual annotation of note onset times, with rubato expressed as a percentage change in instantaneous tempo. Sonic Visualiser is a modern tool which allows such manual onset annotations. This manual method, however, is tedious and time-consuming, so an automated way of extracting this kind of information is desirable.
MIDI representations contain all of the note onsets for a piece and can thus be used as an annotation. The MATCH algorithm acoustically aligns sets of similar recordings (1). It has been implemented in Java for alignment but is currently unable to play back the two aligned recordings simultaneously. The alignment works by dynamically time-warping one of the recordings to minimize the perceptual spectral differences between the pieces. If a MIDI representation is available, its synthesized version can be aligned with a corresponding recording. The result of the MATCH algorithm is a time map which relates time in one recording to time in the other. Using this time map, the onsets available from the MIDI can be warped to generate an expressive version.
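MATCH itself uses an efficient online variant of dynamic time warping (1), and this project's own code is in the MATLAB scripts listed below. Purely to illustrate the idea, here is a minimal offline DTW sketch in Python; the function name, the 20 ms hop assumption, and the quadratic-time recurrence are illustrative, not the actual implementation.

```python
import numpy as np

def dtw_time_map(feats_a, feats_b, hop_s=0.020):
    """Minimal offline dynamic time warping between two feature
    sequences (frames x bins). Returns a list of (time_a, time_b)
    pairs mapping time in recording A to time in recording B."""
    n, m = len(feats_a), len(feats_b)
    # Frame-to-frame cost: Euclidean distance between feature vectors.
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)
    # Accumulated cost with the usual three-way recurrence.
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + best
    # Backtrack from the end to recover the minimum-cost path.
    i, j, path = n - 1, m - 1, [(n - 1, m - 1)]
    while i > 0 or j > 0:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i, j))
    path.reverse()
    # Convert frame indices to seconds via the hop size.
    return [(i * hop_s, j * hop_s) for i, j in path]
```

Aligning a recording against itself should yield the diagonal path, i.e. the identity time map.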
Comparing the inter-onset intervals (IOIs) between the MIDI and the aligned recording yields a measure of relative tempo, since the metrical level of the onsets in the MIDI is ambiguous. If, however, the MIDI represents a true constant (known) tempo with expressionless timing, then the relative tempo is equivalent to absolute tempo rubato.
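Given the original and warped onset times, the IOI comparison reduces to a ratio of successive differences. A sketch (hypothetical function name; assumes onset arrays in seconds):

```python
import numpy as np

def relative_tempo(midi_onsets, warped_onsets):
    """Relative tempo between a deadpan MIDI rendition and the aligned
    recording: the ratio of corresponding inter-onset intervals (IOIs).
    Values > 1 mean the performance is faster than the MIDI at that
    point; values < 1 mean it is slower (rubato)."""
    midi_ioi = np.diff(midi_onsets)    # IOIs in the expressionless MIDI
    perf_ioi = np.diff(warped_onsets)  # IOIs after time-warping
    return midi_ioi / perf_ioi
```

If the MIDI is known to be at a constant tempo of T BPM, multiplying this ratio by T gives the absolute instantaneous tempo.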

Diagram

Symbolic and Recorded Musical Sources

Monophonic Alignment

Two MIDI/MP3 pairs were tested. First, a recording of Beethoven's Moonlight Sonata played by Maurizio Pollini was aligned with a synthesized MIDI version of the same piece.

Spectrogram of the first 15 seconds of a recording of Beethoven's Moonlight Sonata (Pollini) alongside the same for the MIDI (Hamming window size = 2048 samples (46 ms), hop size = 20 ms). Moonlight Recording Spectrogram
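A spectrogram with these parameters could be computed as follows. This is an illustrative Python/scipy sketch with an assumed 44.1 kHz sample rate and random stand-in audio; the project's own analysis used the MATLAB scripts listed below.

```python
import numpy as np
from scipy.signal import stft

fs = 44100                    # assumed CD-quality sample rate
win = 2048                    # Hamming window length (~46 ms at 44.1 kHz)
hop = int(0.020 * fs)         # 20 ms hop -> 882 samples

x = np.random.randn(fs * 15)  # stand-in for the first 15 s of audio

# Short-time Fourier transform with the caption's window and hop.
f, t, Z = stft(x, fs=fs, window='hamming', nperseg=win,
               noverlap=win - hop, boundary=None)
# Log-magnitude spectrogram in dB (small floor avoids log of zero).
power_db = 20 * np.log10(np.abs(Z) + 1e-12)
```

Each column of `power_db` is one 20 ms analysis frame; rows run from 0 Hz to the Nyquist frequency.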

Spectrogram plots showing only the increase in energy spectral density for the first 15 seconds of a recording of Beethoven's Moonlight Sonata (Pollini) and the corresponding MIDI. Moonlight Increase in Spectral Energy vs. Time

Increase in energy spectral density using a nonlinear bin spacing for the first 15 seconds of a synthesized MIDI version of the same piece. The nonlinear frequency bins map low frequencies (< 370 Hz) linearly to bin numbers and high frequencies (> 370 Hz) to a logarithmic semitone bin spacing. Energies at frequencies above 12.5 kHz are all summed together in the highest bin, following MATCH (1). Moonlight MATCH Features
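A sketch of such a bin mapping, assuming a 2048-point FFT at 44.1 kHz and following the 80-bin description above; the exact bin edges used by MATCH may differ, and the function names are illustrative.

```python
import numpy as np

def nonlinear_bin_map(n_fft=2048, fs=44100, n_bins=80):
    """Map FFT bin centre frequencies to 80 nonlinear bins: linear
    spacing below 370 Hz, semitone (logarithmic) spacing from 370 Hz
    to 12.5 kHz, and one top bin collecting everything above 12.5 kHz."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    df = fs / n_fft                        # linear FFT bin width
    n_linear = int(np.ceil(370.0 / df))    # FFT bins kept at linear spacing
    mapping = np.empty(len(freqs), dtype=int)
    for k, f in enumerate(freqs):
        if f < 370.0:
            mapping[k] = k                 # one-to-one below 370 Hz
        elif f < 12500.0:
            semis = int(12 * np.log2(f / 370.0))   # semitones above 370 Hz
            mapping[k] = min(n_linear + semis, n_bins - 2)
        else:
            mapping[k] = n_bins - 1        # sum everything above 12.5 kHz
    return mapping

def to_match_features(power_spectrum, mapping, n_bins=80):
    """Sum linear-bin energies into the nonlinear bins."""
    feats = np.zeros(n_bins)
    np.add.at(feats, mapping, power_spectrum)
    return feats
```

The mapping is monotone in frequency, so each nonlinear bin covers a contiguous band of FFT bins.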

Cost matrix from dynamic time-warping between the Moonlight Sonata MIDI and recording, showing the minimum total-cost path. The cost of aligning two frames is calculated as the Euclidean distance between their 80-element nonlinear frequency bin vectors. Moonlight Cost Matrix

Relative tempo for the Moonlight Sonata between the MIDI and Pollini's recording. This shows the relative tempo automatically calculated from the MIDI onset information through the dynamic time-warping, as well as that obtained through manual annotation using Sonic Visualiser. Moonlight Relative Tempo (Smoothed)

Aligning MIDI to Recordings

Aligning the MIDI representation to match the expressive timing of the recording can be accomplished simply by reversing the source and target audio files given to the MATCH implementation. However, we can take advantage of the onset information contained within the MIDI file, along with the dynamic time-warp map obtained before, to re-map the MIDI onsets, producing a new MIDI file with similar expressive timing characteristics.
Taking the original Moonlight Sonata MIDI and remapping the onsets to match those of Pollini's recording results in this MIDI.
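Re-mapping the onsets through the time map amounts to piecewise-linear interpolation between the map's anchor points. A minimal Python sketch (the project itself does this with myMapOnsetsToTarget.m and the MIDI routines listed below; the names here are illustrative):

```python
import numpy as np

def remap_onsets(midi_onsets, time_map):
    """Warp MIDI onset times through a DTW time map.

    time_map is a sequence of (midi_time, recording_time) pairs from the
    alignment. Each onset is moved by linear interpolation between map
    points, yielding onsets with the recording's expressive timing."""
    tm = np.asarray(time_map, dtype=float)
    return np.interp(midi_onsets, tm[:, 0], tm[:, 1])
```

The warped onset times can then be written back to a new MIDI file, e.g. with routines like matrix2midi.m and writemidi.m in the MATLAB case.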

Aligning MIDI to recording

Polyphonic Alignment

The same method was applied to an example of polyphonic music. The Beach Boys hit Wouldn't It Be Nice, along with this MIDI, provides a more difficult alignment task to test the performance of MATCH. The MIDI is polyphonic but does not model the vocals in the original recording.

Below is the MATCH feature extraction for the first 15 seconds of Wouldn't It Be Nice.

Wouldn't It Be Nice MATCH Features

Cost matrix from dynamic time-warping between the Wouldn't It Be Nice MIDI and recording, showing the minimum total-cost path. The cost of aligning two frames is calculated as the Euclidean distance between their 80-element nonlinear frequency bin vectors.

Wouldn't it be Nice Cost Matrix

Relative tempo for Wouldn't It Be Nice between the MIDI and the original recording. This shows the relative tempo automatically calculated from the MIDI onset information through the dynamic time-warping, as well as that obtained through manual annotation using Sonic Visualiser.

Wouldn't It Be Nice Relative Tempo (Smoothed)

Conclusions

-The MATCH algorithm allows for the automatic extraction of (relative) tempo. This works with both monophonic and polyphonic pieces and is robust enough for alignment even when vocals are not part of the MIDI model.
-The alignment of MIDI and a recording can be used to automatically generate a new MIDI which matches the timing expression of the recording.
-The use of a phase vocoder allows simultaneous playback of a recording and its aligned MIDI, demonstrating the alignment result.

MATLAB Script Files

myMATCH.m
myExtractFeatures.m
dp.m
getTempoChanges.m
stft.m
istft.m
matrix2midi.m
midiInfo.m
myMapOnsetsToTarget.m
pvoc.m
pvsample.m
readmidi.m
writemidi.m

Related Work

1) Dixon, S., Widmer, G., "MATCH: A Music Alignment Tool Chest", Proc. 6th International Conference on Music Information Retrieval, 2005.
2) Hu, N., Dannenberg, R.B., Tzanetakis, G., "Polyphonic Audio Matching and Alignment for Music Retrieval", Proc. 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
3) Turetsky, R.J., Ellis, D.P.W., "Ground-Truth Transcriptions of Real Music from Force-Aligned MIDI Syntheses", Proc. 4th International Conference on Music Information Retrieval, 2003.
Thanks to Dan Ellis of Columbia University for his iterative MATLAB solution to the dynamic programming.
Thanks to www.kenschutte.com for the MATLAB MIDI routines.


You can e-mail me at: highfill@usc.edu