Robust voice activity detection using long-term signal variability
Prasanta Kumar Ghosh, Andreas Tsiartas, and Shrikanth Narayanan


IEEE Trans. Audio, Speech and Language Processing, Volume 19, No. 3, March 2011, pp 600-613

Abstract: We propose a novel long-term signal variability (LTSV) measure, which describes the degree of non-stationarity of a signal. We analyze the LTSV measure both analytically and empirically for speech and for various stationary and non-stationary noises. Based on this analysis, we find that the LTSV measure can discriminate noise from noisy speech and, hence, can serve as a feature for voice activity detection (VAD). We describe an LTSV-based VAD scheme and evaluate its performance under eleven types of noise and five signal-to-noise ratio (SNR) conditions. Comparison with standard VAD schemes demonstrates that the accuracy of the LTSV-based VAD scheme, averaged over all noises and all SNRs, is about 6% (absolute) better than that of the best among the considered VAD schemes, namely AMR-VAD2. We also find that, at -10 dB SNR, the VAD accuracies obtained by the proposed LTSV-based scheme and by the best considered VAD scheme are 88.49% and 79.30%, respectively. This improvement in VAD accuracy indicates the robustness of the LTSV feature for VAD at low SNR for most of the noise types considered.
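
The abstract does not state the LTSV formula. As a rough illustration only, the sketch below assumes the measure is computed as follows: for each frequency bin, take the entropy of the short-time power spectrum normalized over the last R frames, then take the variance of those entropies across bins. The function names, the plain windowed periodogram, and every parameter value (20 ms frames, 10 ms hop, R = 30, the chirp used as a stand-in for a non-stationary signal) are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def short_time_power_spectrum(x, frame_len=320, hop=160, n_fft=512):
    """Windowed periodogram per frame; shape (num_frames, n_fft // 2 + 1).

    frame_len, hop and n_fft are illustrative values (20 ms / 10 ms at 16 kHz).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

def ltsv(power_spec, R=30, eps=1e-12):
    """Assumed LTSV: variance across bins of the per-bin temporal entropy.

    Frames with fewer than R frames of history are left as NaN.
    """
    num_frames = power_spec.shape[0]
    out = np.full(num_frames, np.nan)
    for m in range(R - 1, num_frames):
        block = power_spec[m - R + 1 : m + 1] + eps      # (R, K)
        p = block / block.sum(axis=0, keepdims=True)     # normalize over time
        entropy = -(p * np.log(p)).sum(axis=0)           # one value per bin
        out[m] = entropy.var()                           # variance over bins
    return out

# Hypothetical test signals: stationary white noise, and the same noise plus
# a fast linear chirp (200 Hz to 7 kHz in 1 s) as a non-stationary component.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(fs)
chirp = np.sin(2 * np.pi * (200 * t + 0.5 * 6800 * t ** 2))

ltsv_noise = ltsv(short_time_power_spectrum(noise))
ltsv_mixed = ltsv(short_time_power_spectrum(noise + chirp))
print(np.nanmedian(ltsv_noise), np.nanmedian(ltsv_mixed))
```

Under these assumptions, stationary noise gives similar entropies in every bin and therefore a small LTSV, while bins swept by the chirp have their energy concentrated in a few frames, which lowers their entropy and raises the variance across bins. A VAD built on this feature would compare the LTSV value against a threshold; how that threshold is chosen and adapted is not described in the abstract.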














