
Segmenting Popular Music Sentence by Sentence
by Wan-chi Lee
Final Project for ISE 575
Introduction:
Automatic audio segmentation is an important topic in audio signal
processing. Segmentation is an essential step for indexing and retrieval of
audio data. Because musical data vary so widely, they are difficult to segment
using a fixed set of features or a predefined procedure. Different
kinds of music should be segmented in different ways.
In this project I address the problem of segmenting popular
music with vocals. The vocal part usually carries the main melody of
this kind of music. When people sing, they pause between sentences.
Therefore, the signal energy between sentences should be lower than the energy
within a sentence. I use this idea as the basis of the segmentation method:
if we can find the breaks between sentences, we can
segment the music sentence by sentence.
The practical difficulty is that many accompaniment sounds are present
besides the vocal. The good news is that the main voice of the
accompaniment usually follows the vocal part, so it produces an
energy gap between sentences, too. The other accompaniment voices
are usually lower in pitch. If we extract the frequency band of the vocal
sounds, most of the interference can be eliminated. Of course, some sounds,
such as drums, do not follow this rule.
Another difficulty is that the dynamic range of an audio signal varies
a lot. It is not easy to classify different segments of music by setting a
threshold on the features or by using a fixed classifier. Thus we must treat
the features as a time series: the same feature value can mean different
things when it occurs at different positions. In this project I use a
piecewise linear representation to segment the time series.
Method:
Feature extraction:
The feature I use here is very simple: the short-time energy
of one sub-band of the music signal. The waveform samples are first passed
through a 6th-order elliptic band-pass filter with a passband of 800 Hz to
1600 Hz. The filtered samples are then divided into short frames, each 100 ms
long. The average energy of the band-pass filtered signal within each frame
is used as the feature.
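The framing step can be sketched in Python as follows. This is a minimal illustration, not the project's MATLAB code. It assumes the samples have already been band-pass filtered, that the frames do not overlap (the text does not say either way), and that a trailing partial frame is dropped. The name `frame_energy` and its parameters are hypothetical.

```python
def frame_energy(samples, sample_rate, frame_ms=100):
    """Average energy of consecutive frames of an already filtered signal.

    Assumption: frames are non-overlapping and a trailing partial
    frame is discarded.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    energies = []
    # Step through the signal one full frame at a time.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Average energy = mean of the squared sample values.
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies
```

For example, at a sample rate of 20 Hz a 100 ms frame holds two samples, so `frame_energy([1, 1, 2, 2, 3], 20)` yields one value per complete frame and ignores the last sample.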
Piecewise linear representation:
A detailed description of the piecewise linear representation (PLR) can
be found in reference [1]. Here we use a top-down approach to derive the linear
approximation of a time series. The steps are as follows:
1. Before searching for linear segments, specify an error bound.
2. Find the best point at which to split the time series into two segments.
Every point in the series is tried as a splitting point; for each, we compute
the linear regression of the two resulting segments and their errors. The
point that gives the smallest error is chosen as the splitting point.
3. Recursively split the left segment using step 2 if its error bound is not
met.
4. Recursively split the right segment using step 2 if its error bound is not
met.
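The four steps above can be sketched in Python as follows. This is one possible reading under stated assumptions rather than the project's implementation: the error of a segment is taken to be the sum of squared residuals of its least-squares line, the error of a candidate split is the sum of the two segment errors, and every segment keeps at least two points. The names `fit_error`, `top_down`, and `max_error` are hypothetical.

```python
def fit_error(series, lo, hi):
    """Sum of squared residuals of a least-squares line over series[lo:hi]."""
    n = hi - lo
    if n < 3:
        return 0.0  # one or two points are fit exactly by a line
    xs = range(lo, hi)
    mean_x = sum(xs) / n
    mean_y = sum(series[lo:hi]) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (series[x] - mean_y) for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((series[x] - (slope * x + intercept)) ** 2 for x in xs)


def top_down(series, lo, hi, max_error):
    """Return the sorted split indices for series[lo:hi].

    A split index k divides a segment into series[lo:k] and series[k:hi].
    """
    # Step 1: stop when the segment already meets the error bound
    # (or is too short to split into two segments of two points each).
    if hi - lo < 4 or fit_error(series, lo, hi) <= max_error:
        return []
    # Step 2: try every admissible point and keep the one with the
    # smallest combined error (assumed: sum of the two segment errors).
    best = min(range(lo + 2, hi - 1),
               key=lambda k: fit_error(series, lo, k) + fit_error(series, k, hi))
    # Steps 3 and 4: recurse into the left and right segments.
    return (top_down(series, lo, best, max_error) + [best]
            + top_down(series, best, hi, max_error))
```

Calling `top_down(series, 0, len(series), max_error)` on a step-shaped series such as `[0, 0, 0, 0, 5, 5, 5, 5]` places a single split where the step occurs. The exhaustive search in step 2 refits both segments for every candidate, so the sketch favors clarity over speed.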
Segmenting the time series:
After the piecewise linear approximation of the features is found,
each splitting point is examined as a candidate segmentation point. If the
slope of the approximation line changes from falling to rising at a splitting
point, that point is detected as a valley of the time series. These valley
points are chosen as the segmentation points. Segments that are too long or
too short are then split again or merged.
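The valley test can be sketched in Python as follows, assuming the splitting points are given as sorted indices into the feature series and that "falling" and "rising" mean a strictly negative and a strictly positive least-squares slope. The final merging and re-splitting of segments is not shown, because the text gives no length thresholds. The names `slope` and `valleys` are hypothetical.

```python
def slope(series, lo, hi):
    """Least-squares slope of series[lo:hi] against the sample index."""
    n = hi - lo
    mean_x = (lo + hi - 1) / 2
    mean_y = sum(series[lo:hi]) / n
    sxx = sum((x - mean_x) ** 2 for x in range(lo, hi))
    sxy = sum((x - mean_x) * (series[x] - mean_y) for x in range(lo, hi))
    return sxy / sxx if sxx else 0.0  # a single point has no slope


def valleys(series, splits):
    """Split points where the fitted slope turns from falling to rising."""
    bounds = [0] + list(splits) + [len(series)]
    # One slope per linear segment between consecutive boundaries.
    slopes = [slope(series, a, b) for a, b in zip(bounds, bounds[1:])]
    return [bounds[i + 1] for i in range(len(slopes) - 1)
            if slopes[i] < 0 and slopes[i + 1] > 0]
```

With `series = [4, 3, 2, 1, 0, 1, 2, 3, 4, 3, 2, 1]` and `splits = [4, 9]`, only index 4 is reported: the slope turns from falling to rising there, while at index 9 it turns from rising to falling, which marks a peak rather than a valley.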
Experiment and Result:
I used the above procedure to analyze two popular songs: "Numb" by
Linkin Park and "What you never know" by Sarah
Brightman. The former has louder background accompaniment, so the gaps
between sentences are considered harder to find. The sub-band energy of each
song and its PLR are illustrated in the figure below. Only 40 seconds of data
are shown in the figure.

In "Numb", 31 segmentation points are found in 120 seconds of music. Not
counting those that fall in purely instrumental sections, 10 of them seem to
occur within a sentence and are false alarms.
In "What you never know", the situation is better: 24 segmentation points are
found and only 5 seem to be false alarms.
List of the segmentation points found by the algorithm (in seconds):
Numb:
12.4 15.8 20 25.6 29.7 33.8 36.3 42.3 47 51.6 54.1 56.9 60.4 63.4 65.7 68.7
71 74.1 78.3 81.8 85.1 87.7 90.3 94.9 98.3 101.2 105.8 109.4 113.6 117 119.3
What you never know:
4.2 11 19.4 22.9 28 31.4 34 41.3 44.2 49.8 52.7 55.3 63.9 66.6 71.6 74.3
77.3 85.4 93.1 98.4 102.7 105.2 114.4 118.4
Conclusion:
The experiment shows that the proposed method can sometimes find meaningful
segmentation points, but the false-alarm rate is still very high. This is
probably because only one feature is used in the analysis. If we can
incorporate more features, the performance should improve. The PLR method can
easily be extended to multiple features, but the more challenging problem is
how to choose good features. If we choose inappropriate features, the result
can be worse than with a single feature.
We can refine the system by integrating onset detection and aligning the
segmentation points with the starts of beats. The results will
sound better if the music is cut where a beat starts.
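One simple way to do such an alignment is to move each segmentation point to the nearest detected onset. The Python sketch below only illustrates that idea and was not part of the experiment. It assumes the onset times come from a separate onset detector and are sorted in ascending order. The name `snap_to_onsets` is hypothetical.

```python
from bisect import bisect_left


def snap_to_onsets(points, onsets):
    """Move each segmentation time to the nearest onset time.

    Assumption: `onsets` is sorted in ascending order.
    """
    if not onsets:
        return list(points)  # nothing to align to
    snapped = []
    for t in points:
        i = bisect_left(onsets, t)
        # Only the onsets just before and just after t can be nearest.
        candidates = onsets[max(i - 1, 0):i + 1]
        snapped.append(min(candidates, key=lambda o: abs(o - t)))
    return snapped
```

For instance, with made-up onsets at 12.0, 12.5, and 16.0 seconds, the points 12.4 and 15.8 would move to 12.5 and 16.0.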
This system uses a heuristic method for segmentation. The advantage
of such a system is that it does not need any training process and so cannot
overfit to training data. The disadvantage is that many parameters
need to be tuned: we have to decide the passband of the filter, the
error bound of the PLR, and so on. In this project I chose these values
mostly by trial and error.
Reference:
[1] E. Keogh, S. Chu, D. Hart, and M. Pazzani, "An Online Algorithm for
Segmenting Time Series," in The IEEE International
Conference on Data Mining (ICDM), 2001.