Segmenting Popular Music Sentence by Sentence


by Wan-chi Lee


Final Project for ISE 575









Introduction:


        Automatic audio segmentation is an important topic in audio signal processing, and it is an essential step for indexing and retrieving audio data. Because musical data vary so widely, they are difficult to segment using a fixed set of features or a predefined procedure. Different kinds of music should be segmented in different ways.

 

In this project I deal with the problem of segmenting popular music that contains vocals. The vocal part usually carries the main melody of this kind of music. When people sing, there are breaks between sentences, so the signal energy between sentences should be lower than the energy within a sentence. I use this idea as the basis of the segmentation: if we can find the breaks between sentences, we can segment the music sentence by sentence.

 

The practical difficulty is that there are many accompaniment sounds besides the vocal. The good news is that the main voice of the accompaniment usually follows the vocal part, so it also produces an energy gap between sentences. The other accompaniment voices usually lie at a lower pitch. If we extract the frequency band of the vocal sounds, most of the interference can be eliminated. Of course, some sounds, such as drums, do not follow this rule.

 

Another difficulty is that the dynamic range of an audio signal varies a lot. It is not easy to classify different segments of music by setting a threshold on the features or by using a fixed classifier. We must therefore treat the features as a time series: the same feature value can have a different meaning depending on where it occurs. In this project I use a piecewise linear representation to segment the time series.

 

Method:

 

•   Feature extraction:

The feature I use here is very simple: the short-time energy of one sub-band of the music signal. The waveform samples are first passed through a sixth-order elliptic band-pass filter with a passband of 800 Hz to 1600 Hz. The filtered samples are then divided into short frames of 100 ms each, and the average energy of the band-pass-filtered signal within each frame is used as the feature.
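The framing and averaging step can be sketched as follows. This is only an illustration, not the project's MATLAB code: the function name frame_energies is hypothetical, the frames are assumed not to overlap, a trailing partial frame is assumed to be dropped, and the input is assumed to have already passed through the band-pass filter (the filter design itself is left out).

```python
def frame_energies(samples, sample_rate, frame_ms=100):
    """Average energy (mean of squared samples) of each non-overlapping frame.

    Assumes `samples` is the band-pass-filtered signal; filtering is not done here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    energies = []
    # Step through the signal one full frame at a time; a partial last frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies
```

With 100 ms frames, each feature value corresponds to a tenth of a second of music, which is also the time resolution of any segmentation point derived from it.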

 

•   Piecewise linear representation:

A detailed description of the piecewise linear representation (PLR) can be found in reference [1]. Here we use a top-down approach to derive the linear approximation of a time series. The steps are as follows:

1.          Before we begin to search for linear segments, an error bound needs to be specified.

2.          Find the best point at which to split the time series into two segments. Every point in the series is tried as a splitting point: we fit a linear regression to each of the two resulting segments and compute the fitting errors. The point that gives the smallest error is chosen as the splitting point.

3.          Recursively split the left segment using step 2 if its error exceeds the error bound.

4.          Recursively split the right segment using step 2 if its error exceeds the error bound.
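The four steps above can be sketched as a short recursive routine. This is a minimal illustration under stated assumptions, not the project's own code: the names fit_line and top_down_plr are hypothetical, the error of a segment is assumed to be the sum of squared residuals of its regression line, the error bound is assumed to apply to each segment separately, and the min_len parameter (at least two points per segment) is added only so that every line is well defined.

```python
def fit_line(ys, lo, hi):
    """Least-squares line through ys[lo:hi], with x = sample index.

    Returns (slope, intercept, sum of squared residuals).
    """
    n = hi - lo
    xs = range(lo, hi)
    mean_x = sum(xs) / n
    mean_y = sum(ys[lo:hi]) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ys[x] - mean_y) for x in xs)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((ys[x] - (slope * x + intercept)) ** 2 for x in xs)
    return slope, intercept, sse


def top_down_plr(ys, max_error, lo=0, hi=None, min_len=2):
    """Top-down PLR of ys[lo:hi] as a list of (lo, hi, slope) half-open segments."""
    if hi is None:
        hi = len(ys)
    if hi - lo == 0:
        return []
    slope, _, sse = fit_line(ys, lo, hi)
    # Step 1 bound: stop if the fit is good enough or the segment is too short to split.
    if sse <= max_error or hi - lo < 2 * min_len:
        return [(lo, hi, slope)]
    # Step 2: try every admissible point and keep the one with the smallest total error.
    best_split, best_err = None, float("inf")
    for split in range(lo + min_len, hi - min_len + 1):
        err = fit_line(ys, lo, split)[2] + fit_line(ys, split, hi)[2]
        if err < best_err:
            best_split, best_err = split, err
    # Steps 3 and 4: recurse into the left and right parts.
    return (top_down_plr(ys, max_error, lo, best_split, min_len)
            + top_down_plr(ys, max_error, best_split, hi, min_len))
```

For example, top_down_plr([4, 3, 2, 1, 1, 2, 3, 4], max_error=0.01) splits the series into one falling and one rising segment. Trying every split point makes each level of the recursion quadratic in the segment length, which is acceptable for a few thousand 100 ms frames but would need incremental regression sums for longer recordings.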

 

•   Segmenting the time series:

After the piecewise linear approximation of the features is found, each splitting point is examined as a candidate segmentation point. If the slope of the approximating line changes from falling to rising at a splitting point, that point is detected as a valley of the time series. These valley points are chosen as the segmentation points. Segments that are too long are then split again, and segments that are too short are merged.
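The valley test can be sketched as below. This is an illustration under assumptions: the function name valley_points is hypothetical, and each linear piece is assumed to be given as a (start_frame, end_frame, slope) tuple with consecutive pieces meeting at end_frame. The later merging and re-splitting of segments is not shown, because the text does not state the length limits used.

```python
def valley_points(segments, frame_ms=100):
    """Times in seconds of the boundaries where the slope turns from falling to rising.

    `segments` is a list of (start_frame, end_frame, slope) tuples in time order.
    """
    times = []
    # Compare each piece with the one that follows it.
    for (_, boundary, left_slope), (_, _, right_slope) in zip(segments, segments[1:]):
        if left_slope < 0 and right_slope > 0:
            # Convert the boundary frame index to seconds.
            times.append(boundary * frame_ms / 1000)
    return times
```

Only the sign of the slope on each side matters here, so the test does not depend on the absolute energy level, which is the reason for working on the linear approximation rather than thresholding the raw feature.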

 

Experiment and Result:


        I used the above procedure to analyze two popular songs: “Numb” by Linkin Park and “What You Never Know” by Sarah Brightman. The former has a louder background accompaniment, so the gaps between sentences are expected to be harder to find. The sub-band energy of each song and its PLR are illustrated in the figure below. Only 40 seconds of data are shown in the figure.

 

In “Numb”, 31 segmentation points are found in 120 seconds of music. Not counting those that fall in purely instrumental sections, 10 of them appear to occur within a sentence and are therefore false alarms.

In “What You Never Know”, the situation is better: 24 segmentation points are found and only 5 appear to be false alarms.

 

List of the segmentation points found by the algorithm:

 

Numb (seconds):

12.4  15.8  20  25.6  29.7  33.8  36.3  42.3  47  51.6  54.1  56.9  60.4  63.4  65.7  68.7  71  74.1  78.3  81.8  85.1  87.7  90.3  94.9  98.3  101.2  105.8  109.4  113.6  117  119.3

What You Never Know (seconds):

4.2  11  19.4  22.9  28  31.4  34  41.3  44.2  49.8  52.7  55.3  63.9  66.6  71.6  74.3  77.3  85.4  93.1  98.4  102.7  105.2  114.4  118.4

 

Conclusion:

 

        The experiment shows that the proposed method can sometimes find meaningful segmentation points, but the false-alarm rate is still very high. This is probably because only one feature is used in the analysis; incorporating more features should improve the performance. The PLR method can easily be extended to multiple features, but the more challenging problem is how to choose good ones. With inappropriate features, the result can be worse than with a single feature.

We can refine the system by integrating onset detection and trying to align each segmentation point with the start of a beat. The results will sound better if the music is cut where a beat starts.

This system uses a heuristic method for segmentation. The advantage of such a system is that it does not need any training process and therefore cannot overfit to training data. The disadvantage is that there are many parameters to tune: we have to decide the passband of the filter, the error bound of the PLR, and so on. In this project I chose these values essentially by trial and error.

       

Reference:


[1] Keogh, E., Chu, S., Hart, D., Pazzani, M. “An Online Algorithm for Segmenting Time Series.” In The IEEE International Conference on Data Mining (ICDM), 2001.







