Transcript of: SPEECH SPECTRUM NON-STATIONARITY DETECTION BASED ON LINE SPECTRUM FREQUENCIES AND RELATED APPLICATIONS (M.S. thesis, Bilkent University, October 1998)

Page 1: (scanned cover page; text illegible in this copy)

Page 2:

SPEECH SPECTRUM NON-STATIONARITY DETECTION BASED ON LINE SPECTRUM

FREQUENCIES AND RELATED APPLICATIONS

A THESIS

SUBMITTED TO THE DEPARTMENT OF ELECTRICAL AND

ELECTRONICS ENGINEERING

AND THE INSTITUTE OF ENGINEERING AND SCIENCES

OF BILKENT UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

MASTER OF SCIENCE

By
Ali Erdem ERTAN

October 1998

Page 3: (scanned page; content illegible in this copy)

Page 4:

I certify that I have read this thesis and that in my opinion it is fully adequate,

in scope and in quality, as a thesis for the degree of Master of Science.


A. Enis Çetin, Ph. D (Supervisor)

I certify that I have read this thesis and that in my opinion it is fully adequate,

in scope and in quality, as a thesis for the degree of Master of Science.

Mübeccel Demirekler, Ph. D

I certify that I have read this thesis and that in my opinion it is fully adequate,

in scope and in quality, as a thesis for the degree of Master of Science.

Orhan Arikan, Ph. D

Approved for the Institute of Engineering and Sciences:

Prof. Dr. Mehmet Baray
Director of Institute of Engineering and Sciences

Page 5:

ABSTRACT

SPEECH SPECTRUM NON-STATIONARITY DETECTION BASED ON LINE SPECTRUM FREQUENCIES AND

RELATED APPLICATIONS

Ali Erdem ERTAN
M.S. in Electrical and Electronics Engineering

Supervisor: A. Enis Çetin, Ph. D
October 1998

In this thesis, two new speech variation measures for speech spectrum non-stationarity detection are proposed. These measures are based on the Line Spectrum Frequencies (LSF) and the spectral values at the LSF locations. They are formulated to be subjectively meaningful, mathematically tractable, and of low computational complexity. In order to demonstrate the usefulness of the non-stationarity detector, two applications are presented: The first application is an implicit speech segmentation system which detects non-stationary regions in the speech signal and obtains the boundaries of the speech segments. The other application is a Variable Bit-Rate Mixed Excitation Linear Predictive (VBR-MELP) vocoder utilizing a novel voice activity detector to detect silent regions in the speech. This voice activity detector is designed to be robust to non-stationary background noise and provides efficient coding of silent sections and unvoiced utterances to decrease the bit-rate. Simulation results are also presented.

Keywords: Speech variation measure, spectrum non-stationarity detection, formant estimation, Line Spectrum Frequencies (LSF), speech segmentation, Mixed Excitation Linear Predictive coding (MELP), variable bit-rate vocoder, voice activity detector.


Page 6:

ÖZET

DETECTION OF NON-STATIONARITY IN THE SPEECH SPECTRUM BASED ON LINE SPECTRUM FREQUENCIES AND RELATED APPLICATIONS

Ali Erdem ERTAN
M.S. in Electrical and Electronics Engineering

Supervisor: Prof. Dr. A. Enis Çetin
October 1998

In this thesis, two new speech variation measures for the detection of non-stationarities in the speech spectrum are proposed. These measures are based on the Line Spectrum Frequencies (LSF) and the spectral values at the LSF locations. The proposed measures are formulated to be subjectively meaningful, mathematically tractable, and of low computational complexity. Two applications are presented to demonstrate the usefulness of the non-stationarity detector: The first application is a speech segmenter which finds the non-stationary regions in the speech signal and detects the boundaries of the speech segments within them. The other application is a Variable Bit-Rate Mixed Excitation Linear Predictive (VBR-MELP) speech coder which uses a new voice activity detector to detect the silent regions in the speech. This voice activity detector is designed to be robust to non-stationary background noise, and it enables efficient coding of silent regions and unvoiced sounds in order to reduce the bit-rate. Test results are also presented in the thesis.

Keywords: Speech variation measure, detection of non-stationarities in the spectrum, formant estimation, Line Spectrum Frequencies (LSF), speech segmentation, Mixed Excitation Linear Predictive (MELP) coding, variable bit-rate vocoder, voice activity detection.


Page 7:

ACKNOWLEDGEMENT

I would like to express my deep gratitude to Prof. Dr. A. Enis Çetin for his supervision, guidance, suggestions and patience throughout this study, and also for encouraging me to attend international conferences, which provided me with motivation and experience.

I would also like to express my special thanks to Prof. Dr. Mübeccel Demirekler for her enormous help and guidance over the last two years; I really enjoyed learning the basics and the theory behind speech processing from her.

I would like to thank Dr. Orhan Arikan for reading and commenting on the thesis.

I would like to acknowledge TÜBİTAK-BİLTEN, which supported our work.

I would also like to thank all of my friends in TÜBİTAK-BİLTEN and Bilkent University who have been with me during my M.Sc. study and the MELP project, especially Emre, Levent, Önder, Haydar, Murat and Taner for their great friendship and for sharing long working nights in TÜBİTAK, and Dr. H. Gökhan Ilk for the long discussions on my thesis.

My special thanks go to Cem Baydar and Yılmaz Acar for their continuous "Eye of the Tiger"-style moral support. I also want to thank S. Muzaffer for providing our security throughout the development of this thesis!

Finally, it is a pleasure for me to express my sincere thanks to my family for their continuous moral support throughout my graduate study, and my sweetheart, Didem Öztürk, for her endless moral support and patience.

Page 8:

Contents

1 Introduction 1
1.1 Linear Modeling of Vocal Tract 6
1.2 Linear Prediction 11
1.2.1 Autocorrelation Method 12
1.2.2 Covariance Method 13
1.3 Modeling of Human Speech Production System 14
1.3.1 Line Spectrum Frequencies (LSF) 16
2 Spectrum Non-Stationarity Detection Algorithm Based on Line Spectrum Frequencies 19
2.1 Peak Estimation 21
2.1.1 Experimental Data 21
2.1.2 Detection of a Peak in a LSF Region 22
2.1.3 Accurate Estimation of Peak Location 28
2.2 Non-stationarity Detection 31
2.3 Simulation Studies 36

Page 9:

2.3.1 Performance Test for Peak Estimation Algorithm 36
2.3.2 Performance Test for Non-Stationarity Detector 38
2.4 Summary 41
3 Speech Segmentation 43
3.1 Phonological Units 45
3.1.1 Phonemic and Phonetic Classification 45
3.1.2 Characterization of Segments and Boundaries 47
3.2 Speech Segmentation System 48
3.2.1 Pre-Processing System 49
3.2.2 Boundary Location Estimation 49
3.3 Simulation Studies 50
3.4 Summary 54
4 Variable Bit-Rate Mixed Excitation Linear Prediction Vocoder 55
4.1 Mixed Excitation Linear Prediction Vocoder 58
4.1.1 Basic Synthesizer 60
4.1.2 Mixed Excitation 62
4.1.3 Aperiodic Pulses 63
4.1.4 Adaptive Spectral Enhancement 63
4.1.5 Pulse Dispersion Filter 65
4.1.6 Fourier Series Magnitudes 65

Page 10:

4.1.7 Flowchart of the MELP Decoder 66
4.1.8 Flowchart of the MELP Encoder 67
4.1.9 Performance Evaluation 68
4.2 Voice Activity Detectors 68
4.3 Variable Bit-rate MELP Vocoder 71
4.3.1 VAD for VBR-MELP Vocoder 72
4.3.2 Bit Allocation for VBR-MELP Vocoder 81
4.4 Simulation Studies 82
4.4.1 Diagnostic Rhyme Test 83
4.4.2 Mean Opinion Score 84
4.4.3 Performance of VBR-MELP Vocoder 85
4.5 Summary 86
5 Conclusion 88
APPENDICES 90
A Threshold Extraction for Elimination of Misclassified LSF Regions 91
B Selection of parameters for minimizing peak location estimation error 108

Page 11:

List of Figures

1.1 Simplified interconnected acoustic tube modeling. 6
1.2 Simplified system. 7
1.3 Discrete-time equivalent of the system. 7
1.4 Single stage of transfer system. 8
1.5 Modified single stage of transfer system. 9
1.6 Interconnected N stages. 10
1.7 LPC vocoder synthesizer. 15
1.8 LP power spectrum and the associated LSFs for voiced and unvoiced speech. 18
2.1 Logarithm and 0.15th power of the power spectrum of the utterance /a/, plotted in (a) and (b), respectively. 23
2.2 Algorithm for classification of the LSF regions. 27
2.3 Number of occurrences versus difference between original and estimated peaks. The solid, dashed and dashed-dotted lines correspond to the simple mean, the energy weighted mean with r = 0.15 and the energy weighted mean with r = τi, respectively. 30

Page 12:

2.4 Time sequence and spectrogram of the Turkish sentence "Şanslı adam kaybettiği mücevheri buldu". Formant frequencies are plotted on the spectrogram. 31
2.5 The top, middle and bottom figures belong to the signals which contain two noise excited regions, two pulse excited regions, and both excitation and spectral change regions, respectively. The regions between the dashed lines are the ones which are flagged as non-stationary regions by the proposed algorithm. 40
3.1 Pre-processing system applied to 1.05 seconds of male speech containing the words "Firma tanıtımında". The regions between two dashed lines are transient regions detected by the proposed algorithm. 51
3.2 Boundary location estimator applied to 1.05 seconds of male speech containing the words "Firma tanıtımında" after detection of non-stationary regions. The dashed lines show detected boundaries. 52
4.1 Fourier transform of a mixed excited phoneme. 60
4.2 Synthesizer of MELP Vocoder. 61
4.3 Flowchart of the MELP decoder. 66
4.4 Flowchart of the MELP encoder. 67
4.5 The voice activity detector for the pan-European digital cellular mobile telephone service. 73
4.6 Voice activity detector for VBR-MELP Vocoder. 75
4.7 Flowchart of initial silence detector. 76
4.8 State transition diagram of the decision box. SI stands for silence state. PD stands for primary detection state. SE stands for speech detected frames. HO stands for hangover state. 79
4.9 Flowchart of periodicity detector. 80

Page 13:

4.10 Flowchart of noise variance adaptation block. PFS and CFS stand for the state of the previous frame and current frame, respectively. 82
A.1 Percentage versus Ti for the LSF regions 1, 2 and 3. First column shows the percentage of correctly estimated regions which contain a peak. Second column shows the misclassified regions which have no peak. T1 = 320, T2 = 300, T3 = 320. 93
A.2 Percentage versus Ti for the LSF regions 4, 5 and 6. T4 = 330, T5 = 320, T6 = 340. 94
A.3 Percentage versus Ti for the LSF regions 7, 8 and 9. T7 = 325, T8 = 340, T9 = 400. 95
A.4 Percentage versus γi for the LSF regions 1, 2 and 3. γ1 = 130, γ2 = 106, γ3 = 164. 96
A.5 Percentage versus γi for the LSF regions 4, 5 and 6. γ4 = 130, γ5 = 160, γ6 = 140. 97
A.6 Percentage versus γi for the LSF regions 7, 8 and 9. γ7 = 190, γ8 = 148, γ9 = 165. 98
A.7 Percentage versus αi for the LSF regions 1, 2 and 3. α1 = 164, α2 = 130, α3 = 215. 99
A.8 Percentage versus αi for the LSF regions 4, 5 and 6. α4 = 250, α5 = 250, α6 = 170. 100
A.9 Percentage versus αi for the LSF regions 7, 8 and 9. α7 = 250, α8 = 120, α9 = 250. 101
A.10 Percentage versus βi for the LSF regions 1, 2 and 3. β1 = N/A, β2 = N/A, β3 = … 102
A.11 Percentage versus βi for the LSF regions 4, 5 and 6. β4 = N/A, β5 = 80, β6 = 32. 103

Page 14:

A.12 Percentage versus βi for the LSF regions 7, 8 and 9. β7 = 25, β8 = 75, β9 = 25. 104
A.13 Percentage versus Ci for the LSF regions 1, 2 and 3. C1 = 100, C2 = 100, C3 = 62.5. 105
A.14 Percentage versus Ci for the LSF regions 4, 5 and 6. C4 = 21, C5 = 17.5, C6 = 40. 106
A.15 Percentage versus Ci for the LSF regions 7, 8 and 9. C7 = 17.5, C8 = 75, C9 = 25. 107
B.1 Statistics of error in peak location estimation for LSF region 1. First and second columns present statistical data about voiced frames and all frames, respectively. First index in all figures corresponds to the data extracted by simple averaging. First row shows standard deviation of error versus varying r values. Second and third rows show the percentage of errors in peak location estimation smaller than 25 Hz and 50 Hz versus varying r values, respectively. τ1 is selected as 2.5. 112
B.2 Statistics of error in peak location estimation for LSF region 2. τ2 is selected as 2.6. 113
B.3 Statistics of error in peak location estimation for LSF region 3. τ3 is selected as 2.25. 114
B.4 Statistics of error in peak location estimation for LSF region 4. τ4 is selected as 2.6. 115
B.5 Statistics of error in peak location estimation for LSF region 5. τ5 is selected as 2.35. 116
B.6 Statistics of error in peak location estimation for LSF region 6. τ6 is selected as 3.0. 117
B.7 Statistics of error in peak location estimation for LSF region 7. τ7 is selected as 2.35. 118

Page 15:

B.8 Statistics of error in peak location estimation for LSF region 8. τ8 is selected as 2.9. 119
B.9 Statistics of error in peak location estimation for LSF region 9. τ9 is selected as 2.3. 120
B.10 Number of occurrences versus difference between original and estimated peaks for the LSF regions 1, 2 and 3. First and second columns present statistical data about voiced frames and all frames, respectively. The solid, dashed and dashed-dotted lines correspond to the simple mean, the weighted mean with r = 0.15 and the weighted mean with r = τi, respectively. τ1 = 2.5, τ2 = 2.6 and τ3 = 2.25. 121
B.11 Number of occurrences versus difference between original and estimated peaks for the LSF regions 4, 5 and 6. The solid, dashed and dashed-dotted lines correspond to the simple mean, the weighted mean with r = 0.15 and the weighted mean with r = τi, respectively. τ4 = 2.6, τ5 = 2.35 and τ6 = 3.0. 122
B.12 Number of occurrences versus difference between original and estimated peaks for the LSF regions 7, 8 and 9. The solid, dashed and dashed-dotted lines correspond to the simple mean, the weighted mean with r = 0.15 and the weighted mean with r = τi, respectively. τ7 = 2.35, τ8 = 2.9 and τ9 = 2.3. 123

Page 16:

List of Tables

2.1 Statistics about classification of LSF regions in the bandwidth based method for voiced speech. Pc stands for the percentage of the correctly classified LSF regions which contain a peak. Pm stands for the percentage of the misclassified LSF regions which contain a peak. Pfp stands for the percentage of false peak assigned LSF regions with respect to total peak assigned LSF regions. Pfnp stands for the percentage of false peak assigned LSF regions with respect to the LSF regions which do not contain a peak. 24
2.2 Statistics about classification of LSF regions in the energy based method for voiced speech. 24
2.3 Thresholds for elimination and detection of misclassified regions. 26
2.4 Statistics about classification of LSF regions in both methods for voiced speech. 28
2.5 Statistics about classification of LSF regions in both methods for entire speech. 28
2.6 Overall performance of the system for the proposed method. Percentage of correct classification of the LSF region state is tabulated. 28
2.7 Selected τi and corresponding mean of the error in peak estimation, in radians. 30

Page 17:

2.8 Statistics about classification of LSF regions for the proposed algorithm for the voiced speech in the test set. 37
2.9 Statistics about classification of LSF regions for the proposed algorithm for the entire speech in the test set. 37
2.10 Overall performance of the proposed method for the test set. Percentage of correct classification of the LSF region state is tabulated. 37
2.11 Percentage of errors smaller than 25 Hz and 50 Hz between actual peak location and estimated peak location for the test set, for the voiced speech. 38
2.12 Percentage of errors smaller than 25 Hz and 50 Hz between actual peak location and estimated peak location for the test set, for the entire speech. 38
3.1 Some characteristics of the explicit and implicit segmentation methods. 44
3.2 Success rate of the estimation of the transient regions in the continuous speech signal. Both end-point detection and segmentation within words are performed by the pre-processing system of the new algorithm. Pe stands for the percentage of correctly estimated end-points. Pb stands for the percentage of correctly estimated segment boundaries. Pi stands for the percentage of insertions with respect to the whole of the non-stationary detected regions. 53
4.1 Bit allocation table for fixed bit-rate MELP vocoder. 71
4.2 Setting of the coefficients. 78
4.3 Bit allocation table for variable bit-rate MELP vocoder. 82
4.4 DRT scores of MELP and ACELP vocoders. WD stands for wrong decision. 84
4.5 MOS scores of MELP, ACELP and LPC-10 vocoders. 85

Page 18:

4.6 Performance of proposed VAD at various SNR levels for male speech. Pcl stands for the percentage of clipped regions with respect to the overall speech sections. Pns stands for the percentage of missed regions with respect to background noise sections. 86
B.1 Mean of the error between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for voiced speech. 109
B.2 Mean of the error between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for entire speech. 109
B.3 Standard deviation of the error between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for voiced speech. 110
B.4 Standard deviation of the error between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for entire speech. 110
B.5 Percentage of errors smaller than 25 Hz between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for voiced speech. 110
B.6 Percentage of errors smaller than 25 Hz between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for entire speech. 111
B.7 Percentage of errors smaller than 50 Hz between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for voiced speech. 111
B.8 Percentage of errors smaller than 50 Hz between actual peak location and estimated peak location for the simple mean technique, r = 0.15 and r = τi, for entire speech. 111

Page 19:

To My Family and My Beloved...

Page 20:

Chapter 1

Introduction

Communication is defined in Webster's Dictionary as the imparting or interchange of thoughts, opinions, or information by speech, writing, or signs. Every living organism which can move likes to communicate with its own race and its living environment. Even bacteria, single-celled organisms, communicate with each other to exchange DNA and gain more immunity to changing environmental conditions. As the most complex and developed organism on the earth, the human race also enjoys communicating to share information and express emotions.

Over the centuries, human communication methods have grown more sophisticated, from body language and simple sounds to highly structured spoken and written languages consisting of syntactic, semantic and linguistic rules. Among these methods, speech is the most used in the daily life of a human. Humans use their articulatory and auditory systems to generate and perceive speech. The rules for this communication method are described by language. The speaker produces different sounds and concatenates them to generate meaningful words. On the other side, the listener receives the generated sound through her/his auditory system, and this incoming signal is processed within the brain to extract its meaning.

Page 21:

Advances in the digital signal processing area have also made speech a serious communication medium between human and machine. As a result, speech has become a central component in digital communication. For several decades, considerable research has focused on several areas of speech processing. These areas can be summarized as compression, recognition, enhancement and synthesis [1].

Speech segmentation is an important first step in coding, recognition and synthesis. The primary purpose is to segment the continuous speech signal into phonetic units [2]. The simplest way is to divide the speech into fixed, non-overlapping time intervals. This type of segmentation is mostly used in variable-rate speech compression algorithms [3]. In continuous speech recognition, the end-points of the utterances within the continuous speech are extracted first [4]. After this step, various segmentation algorithms may be applied to the signal to obtain the boundaries of the segments prior to the recognition algorithms.

In this thesis we propose new speech spectrum variation measures, which are similar to speech distortion measures. They are used to detect the amount of change in the speech spectrum according to the variation of a selected parameter set. Since our purpose is to detect the speech spectrum variations among analyzed frames rather than quantization effects, we use 'speech variation measure' instead of 'distortion measure'.

Segmentation algorithms can be classified into two major groups: implicit segmentation, in which no prior information about the signal is used [5-8], and explicit segmentation, in which a phonetic transcription is also available [9-13]. The success of implicit segmentation algorithms depends primarily on the selection of the parameter set and the speech variation measure. In the literature, the following parameters are reported to be used:

• Modeling error of linear prediction filter [6],

• Linear Prediction Coefficients (LPC)-smoothed log amplitude spectra [2],

• parametric filtering [5],

• energies in subbands [8],

• auditory model [12], and

Page 22:

• Line Spectrum Frequencies (LSFs) [14].

These parameters are used in various speech variation measures, whose values are generally compared with a threshold to detect the non-stationarity between consecutive frames or to obtain the exact change point. In [15], it is stated that, in order for a distortion measure to be useful, the following conditions must be satisfied:

1. It must be subjectively meaningful in the sense that small and large distortions correspond to good and bad subjective quality, respectively.

2. It must be tractable in the sense that it is amenable to mathematical analysis and leads to practical design techniques.

3. It must be computable in the sense that the actual distortions arising in a real system can be efficiently computed.

The most commonly used distortion measure is the traditional mean squared error. However, in speech processing, this distortion measure does not provide any subjective meaning. Therefore, speech processing algorithms usually use the Itakura-Saito distortion measure (1.1), Itakura's likelihood ratio (1.2) or the L2 distance of log spectra (1.3). An excellent review of distortion measures for speech processing can be found in [15].

• Itakura-Saito distortion measure:

$$ D_{IS} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{S_1(w)}{S_2(w)} - \ln\frac{S_1(w)}{S_2(w)} - 1 \right] dw \qquad (1.1) $$

where S_1(w) and S_2(w) are the estimated spectral density functions of the two speech frames.

• Itakura likelihood ratio:

$$ D_I = \frac{a_2^T R_1 a_2}{a_1^T R_1 a_1} \qquad (1.2) $$

where a_1, R_1 and a_2 are the LPC coefficient vector of the reference frame, the autocorrelation matrix of the reference frame and the LPC coefficient vector of the comparison frame, respectively.

Page 23:

• L2 distance of log spectra, or spectral distortion:

$$ D_{L2} = \int_{-\pi}^{\pi} \left| \log S_1(w) - \log S_2(w) \right|^2 dw \qquad (1.3) $$

where S_1(w) and S_2(w) are the estimated spectral density functions of the two speech frames.
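For concreteness, a minimal numerical sketch of two of these measures is given below, assuming both power spectra are sampled on the same uniform grid over [-π, π); the function names and the toy spectra are illustrative and are not part of the original text.

```python
import numpy as np

def itakura_saito(S1, S2):
    """Itakura-Saito distortion (1.1); the 1/(2*pi) integral over a
    uniform grid of length 2*pi is approximated by the sample mean."""
    ratio = S1 / S2
    return np.mean(ratio - np.log(ratio) - 1.0)

def log_spectral_distortion(S1, S2):
    """L2 distance of log spectra (1.3) on the same grid; here the
    plain integral is approximated by 2*pi times the sample mean."""
    diff = np.log(S1) - np.log(S2)
    return 2.0 * np.pi * np.mean(diff ** 2)

# Toy usage: two arbitrary positive spectra on a 512-point grid.
w = np.linspace(-np.pi, np.pi, 512, endpoint=False)
S1 = 1.0 / (1.05 - np.cos(w))
S2 = 1.0 / (1.10 - np.cos(w))
print(itakura_saito(S1, S2), log_spectral_distortion(S1, S2))
```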

Besides these well-known measures, the following measures are reported to work successfully in the literature:

• Brandt's generalized likelihood ratio test [6],

• divergence test [6],

• the pulse method: a modified divergence test with a priori unvoiced-voiced detection [6],

• the normalized correlation between selected parameters of two frames [2],

• time-correlation based speech variation measures [5], and

• weighted Euclidean distance measure [14].

In Erkelens' work [16], it is proved that spectral distortion can be approximated by a weighted squared distance of the coefficients of the LPC filter and derived parameters, if the cubic and higher terms of the Taylor series expansion are neglected. The weighting matrix is equal to the inverse of the theoretical covariance matrix of the coefficients. Furthermore, LSFs are found to be uncorrelated, so that only the main diagonal entries of the weighting matrix are non-zero. Therefore, with the usage of LSFs, this equation also reduces to a weighted Euclidean distance measure. Due to this fact, it is possible to derive LSF based speech variation measures which not only provide a meaningful comparison of the spectra of two speech frames with low computational complexity, but are subjectively meaningful as well.
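Written out under these assumptions, the approximation takes a form like the following, where f_i^(1) and f_i^(2) denote the i-th LSFs of the two frames and c_i the inverse-variance diagonal weights (this notation is introduced here for illustration only):

$$ D\!\left(f^{(1)}, f^{(2)}\right) \;\approx\; \sum_{i=1}^{p} c_i \left( f_i^{(1)} - f_i^{(2)} \right)^2 $$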

The main contribution of this thesis is the new speech variation measures based on the peaks of the spectrum estimated from the LSF locations and the angular difference between consecutive LSFs, and the usage of these measures in the detection

Page 24:

of spectrum non-stationarity between consecutive frames. Both of these measures reflect the strong correlation between the speech spectrum and the LSF displacement. These measures are formulated to be subjectively meaningful, mathematically tractable, and of low computational complexity. In order to demonstrate the usefulness of the non-stationarity detector, two applications using this detector are presented: a speech segmentation system and a variable rate speech vocoder.

In Chapter 2, a novel spectrum non-stationarity detection algorithm based on the new speech variation measures is presented. In this algorithm, peaks of the spectrum are estimated from the LSFs by a two-step algorithm, and these estimated peaks and the angular differences between consecutive LSFs are used to form new, perceptually meaningful speech variation measures. Non-stationarity detection of the speech spectrum is obtained by comparing the computed values of these measures with thresholds, which may be adjusted for different algorithms.

Chapter 3 and Chapter 4 present applications using this non-stationarity detector: a frame based speech segmentation algorithm is presented in Chapter 3. This algorithm is of the implicit type and only finds the non-stationary regions in the speech signal.

In Chapter 4, a Variable Bit-Rate Mixed Excitation Linear Predictive Coding (VBR-MELP) system is described. This new coder is based on the federal standard fixed-rate MELP coder [17]. In order to reduce the bit-rate, unvoiced frames are encoded with fewer bits, sufficient to synthesize these sections, and parameters for the frames containing only silence or background noise are not transmitted. Silent parts and background noise sections are detected by a novel voice activity detector (VAD) which has non-stationary noise immunity.

Conclusions and discussions are given in Chapter 5.

In the following sections, theoretical background information including vocal tract modeling, linear prediction and modeling of the human speech production system is presented. Also, detailed information about Line Spectrum Frequencies (LSFs) is given.

Page 25:

1.1 Linear Modeling of Vocal Tract

In this section, a description of the linear modeling of the vocal tract is presented, and the parameters required for this model are defined. The model should be mathematically tractable, while imitating the actual system as much as possible. This task can be achieved via the following assumptions:

1. The effects of nasal tract can be ignored.

2. The vocal tract can be assumed to consist of N interconnected sections where each individual section is of uniform cross-sectional area.

3. The transverse dimension of each section is small enough compared with a wavelength, so that the sound propagation through an individual section can be treated as a plane wave.

4. The internal losses due to wall vibration, viscosity and heat conduction are negligible.

5. A linear, time varying acoustic tube model of the vocal tract, uncoupled from the glottis, can be constructed.

A typical example for this modeling can be seen in Figure 1.1.

Figure 1.1: Simplified interconnected acoustic tube modeling.

In digital modeling of speech, the length of every tube is assumed to be equal. Also, τ is defined as the time required for a wave to travel from one junction to another. If the system is excited with an impulse, it propagates down the tubes,


Page 26:

being partially reflected and partially propagated at the junctions. The soonest the impulse can reach the output is Nτ seconds. Successive impulses due to the reflections reach the output at multiples of 2τ seconds later. As a result, the impulse response of the system can be written as:

$$ h(t) = \alpha_0\,\delta(t - N\tau) + \sum_{k=1}^{\infty} \alpha_k\,\delta(t - N\tau - 2k\tau) \qquad (1.4) $$

and the system function is:

$$ H(s) = e^{-sN\tau} \sum_{k=0}^{\infty} \alpha_k\, e^{-2k\tau s} \qquad (1.5) $$

The term e^{-sNτ} is a pure time delay. Furthermore, the resonances of the system in Figure 1.2 are defined as follows:

$$ \hat{H}(s) = \sum_{k=0}^{\infty} \alpha_k\, e^{-2k\tau s}, \quad \text{where } \hat{h}(t) = h(t + N\tau) \qquad (1.6) $$

and

$$ \hat{H}(\Omega) = \sum_{k=0}^{\infty} \alpha_k\, e^{-j 2k\tau \Omega} \qquad (1.7) $$


Figure 1.2: Simplified system.

Note that Ĥ(Ω) is periodic with period π/τ, resembling the frequency response of a discrete-time system as in Figure 1.3. If u_G(t) is bandlimited, we can sample it with period 2τ and filter it with a digital filter whose impulse response is ĥ(n) = α_n, n ≥ 0, to obtain u_L(n), from which u_L(t) can be reconstructed with an appropriate filter. Notice that a delay of Nτ seconds corresponds to a shift of N/2 samples.


Figure 1.3: Discrete-time equivalent of the system.

Page 27:

The system function corresponding to ĥ(n) is

$$ \hat{H}(z) = \sum_{k=0}^{\infty} \alpha_k\, z^{-k} \qquad (1.8) $$

This transfer function can also be written in the form:

$$ \hat{H}(z) = \frac{U_L(z)}{U_G(z)} \qquad (1.9) $$

In order to derive the transfer function, first consider a single stage, as shown in Figure 1.4, to find the chain parameter representation of a 2-port network:


Figure 1.4: Single stage of transfer system.

$$ U_k^{+}(z) = \frac{z^{1/2}}{1+r_k}\, U_{k+1}^{+}(z) - \frac{r_k\, z^{1/2}}{1+r_k}\, U_{k+1}^{-}(z) \qquad (1.10) $$

$$ U_k^{-}(z) = \frac{-r_k\, z^{-1/2}}{1+r_k}\, U_{k+1}^{+}(z) + \frac{z^{-1/2}}{1+r_k}\, U_{k+1}^{-}(z) \qquad (1.11) $$

where

$$ r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k} \qquad (1.12) $$

and A_k is the cross-sectional area of the k-th section at the junction.

The parameters r_k used in this modeling are called reflection coefficients and can be used as a representation of the vocal tract. Furthermore, they are more suitable for quantization purposes, since their values are bounded between −1 and 1 for non-negative cross-sectional areas.


Page 28:

The chain matrix Q_k and the vector Û_k are defined as:

$$ Q_k = \begin{bmatrix} \dfrac{z^{1/2}}{1+r_k} & \dfrac{-r_k\, z^{1/2}}{1+r_k} \\[2mm] \dfrac{-r_k\, z^{-1/2}}{1+r_k} & \dfrac{z^{-1/2}}{1+r_k} \end{bmatrix} \qquad (1.13) $$

$$ \hat{U}_k = \begin{bmatrix} U_k^{+}(z) \\ U_k^{-}(z) \end{bmatrix} \qquad (1.14) $$

so (1.10) and (1.11) can be expressed in matrix form:

$$ \hat{U}_k = Q_k \cdot \hat{U}_{k+1} \qquad (1.15) $$

To eliminate these half-sample delays, a small modification can be performed on Figure 1.4. With an additional half-sample delay at the end of each stage, the half-sample delay in the lower branch can be moved to the upper branch, eliminating the usage of half-sample delays. The result of the modification is illustrated in Figure 1.5:


Figure 1.5: Modified single stage of transfer system.

$$ Q'_k = \begin{bmatrix} \dfrac{z}{1+r_k} & \dfrac{-r_k\, z}{1+r_k} \\[2mm] \dfrac{-r_k}{1+r_k} & \dfrac{1}{1+r_k} \end{bmatrix} \qquad (1.16) $$

Note that Q'_k = z^{1/2} Q_k. Apart from the delay term, Q'_k and Q_k are equal, completing the discussion on the half-sample delays. Now, if N stages are considered as in Figure 1.6:

$$ \hat{U}_1 = Q'_1 \cdot Q'_2 \cdots Q'_N \cdot \hat{U}_{N+1} = \prod_{k=1}^{N} Q'_k \cdot \hat{U}_{N+1} \qquad (1.17) $$

Page 29:

Figure 1.6: Interconnected N stages.

The equations for the boundaries are as follows:

$$ U_G(z) = \begin{bmatrix} \dfrac{2}{1+r_G} & \dfrac{-2 r_G}{1+r_G} \end{bmatrix} \cdot \hat{U}_1 \qquad (1.18) $$

and

$$ \hat{U}_{N+1} = \begin{bmatrix} U_L(z) \\ 0 \end{bmatrix} \qquad (1.19) $$

If we write the transfer function, we obtain the final formulation as follows:

$$ \frac{U_G(z)}{U_L(z)} = \frac{1}{H(z)} = \begin{bmatrix} \dfrac{2}{1+r_G} & \dfrac{-2 r_G}{1+r_G} \end{bmatrix} \cdot \prod_{k=1}^{N} \hat{Q}_k \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad (1.20) $$

where

$$ \hat{Q}_k = \begin{bmatrix} \dfrac{1}{1+r_k} & \dfrac{-r_k}{1+r_k} \\[2mm] \dfrac{-r_k\, z^{-1}}{1+r_k} & \dfrac{z^{-1}}{1+r_k} \end{bmatrix} \qquad (1.21) $$

The elements of Q̂_k are either constants or constants times z^{-1}, implying that the complete matrix product reduces to a polynomial in z^{-1} of order N. So the transfer function can be written as:

$$ H(z) = \frac{0.5\,(1+r_G)\,\prod_{k=1}^{N}(1+r_k)\; z^{-N/2}}{D(z)} \qquad (1.22) $$

where

$$ D(z) = \begin{bmatrix} 1 & -r_G \end{bmatrix} \begin{bmatrix} 1 & -r_1 \\ -r_1 z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & -r_N \\ -r_N z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad (1.23) $$

or

$$ D(z) = 1 - \sum_{k=1}^{N} a_k\, z^{-k} \qquad (1.24) $$


Page 30:

Neglecting the delay term, the transfer function can be expressed as:

$$ H(z) = \frac{G}{D(z)} \qquad (1.25) $$

As a result, an all-pole model of the vocal tract is obtained, where the poles of H(z) are the resonance frequencies, the so-called formants, of the acoustic tube system.

1.2 Linear Prediction

The linear prediction estimate ŝ(n) of s(n) from previous samples of s(n) is defined as:

$$ \hat{s}(n) = \sum_{k=1}^{p} \alpha_p(k)\, s(n-k) \qquad (1.26) $$

where α_p(k) are the weights.

The error signal e(n) between the original and the predicted signal is defined as:

$$ e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_p(k)\, s(n-k) \qquad (1.27) $$

Now the problem is to select the coefficients α_p(k) so that an error criterion is minimized. Generally, this error criterion is chosen to be the total squared error:

$$ \mathcal{E}_p = \sum_{n=0}^{\infty} e^2(n) \qquad (1.28) $$

$$ = \sum_{n=0}^{\infty} \left( s(n) - \hat{s}(n) \right)^2 = \sum_{n=0}^{\infty} \left( s(n) - \sum_{k=1}^{p} \alpha_p(k)\, s(n-k) \right)^2 \qquad (1.29) $$

To find the α_p(k)'s which minimize E_p, the derivative of E_p with respect to α_p(j) must be equated to 0 and the resulting equations solved:

$$ \frac{\partial \mathcal{E}_p}{\partial \alpha_p(j)} = \sum_{n=0}^{\infty} -2\, s(n-j) \left( s(n) - \sum_{k=1}^{p} \alpha_p(k)\, s(n-k) \right) = 0 \quad \text{for } j = 1 \text{ to } p \qquad (1.30) $$


Page 31:

so:

$$ \sum_{n=0}^{\infty} s(n)\, s(n-j) = \sum_{k=1}^{p} \alpha_p(k) \sum_{n=0}^{\infty} s(n-k)\, s(n-j) \quad \text{for } j = 1 \text{ to } p \qquad (1.31) $$

If we define ω(k, j) as:

$$ \omega(k, j) = \sum_{n=b_l}^{b_h} s(n-k)\, s(n-j) \qquad (1.32) $$

and set b_l and b_h to zero and infinity, respectively, the final equation reduces to:

$$ \omega(0, j) = \sum_{k=1}^{p} \alpha_p(k)\, \omega(k, j) \quad \text{for } j = 1 \text{ to } p \qquad (1.33) $$

The steps up to this point are the same for all modeling methods. The assumptions on the boundaries of the summation term in (1.32) make the difference between the finite-sample modeling methods.

1.2.1 Autocorrelation Method

In this method, the signal is first windowed (generally with a Hamming window) and then used in the calculations above. Since the boundaries b_l and b_h are set to 0 and ∞, respectively, ω(k, j) reduces to ω(|k − j|). After this modification, (1.33) reduces to:

$$ \omega(j) = \sum_{k=1}^{p} \alpha_p(k)\, \omega(|k - j|) \quad \text{for } j = 1 \text{ to } p \qquad (1.34) $$

Now we have p equations and p unknowns, hence the α_p(k)'s can be found by solving Equations (1.34), which can also be written in matrix form:

$$ \begin{bmatrix} \omega(0) & \omega(1) & \cdots & \omega(p) \\ \omega(1) & \omega(0) & \cdots & \omega(p-1) \\ \vdots & \vdots & \ddots & \vdots \\ \omega(p) & \omega(p-1) & \cdots & \omega(0) \end{bmatrix} \begin{bmatrix} 1 \\ \alpha_p(1) \\ \vdots \\ \alpha_p(p) \end{bmatrix} = \begin{bmatrix} \mathcal{E}_p \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (1.35) $$


Page 32:

or

$$ \Omega \cdot a = \mathcal{E}_p \cdot u_p \qquad (1.36) $$

where a = [1, α_p(1), ..., α_p(p)]^T and u_p = [1, 0, ..., 0]^T.

The a vector can be found easily by:

$$ a = \mathcal{E}_p \cdot \Omega^{-1} \cdot u_p \qquad (1.37) $$

The inverse of Ω can be obtained by the Gaussian elimination method. Since Ω is a Toeplitz matrix, a recursive formulation, called the Levinson-Durbin recursion, can also be used to obtain a. The complexity of this algorithm is O(n^2), compared to O(n^3) for Gaussian elimination. The whole algorithm is reviewed in [18]. In addition to the α_p(k)'s, this recursion also produces the reflection coefficients r_k and the modeling error E_p as side products. Note that the gain parameter in the all-pole formulation in (1.25) is equal to the square root of E_p. As a final remark, the filter produced by this method is always stable [18].
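As an illustration, a minimal sketch of the Levinson-Durbin recursion described above is given below, assuming the autocorrelation values ω(0), ..., ω(p) are available; the code and its names are illustrative, not the thesis implementation.

```python
import numpy as np

def levinson_durbin(omega, p):
    """Solve the Toeplitz normal equations (1.35) recursively.

    omega: autocorrelation sequence omega(0) .. omega(p).
    Returns (alpha, refl, err): predictor coefficients alpha_p(1..p),
    reflection coefficients r_k, and the final modeling error E_p.
    """
    alpha = np.zeros(p + 1)
    refl = np.zeros(p + 1)
    err = omega[0]
    for k in range(1, p + 1):
        # partial correlation of the new lag against the current predictor
        acc = omega[k] - np.dot(alpha[1:k], omega[k - 1:0:-1])
        r = acc / err
        refl[k] = r
        # order update: alpha_new(j) = alpha(j) - r * alpha(k - j)
        prev = alpha[1:k].copy()
        alpha[1:k] = prev - r * prev[::-1]
        alpha[k] = r
        err *= (1.0 - r * r)           # modeling error shrinks each step
    return alpha[1:], refl[1:], err
```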

1.2.2 Covariance Method

In the covariance method, no assumptions are made on the sequence. The lower boundary b_l and the upper boundary b_h in (1.32) are set to p and to the last element of the sequence, N, respectively. The matrix form of these equations is as follows:

$$ \begin{bmatrix} \omega(1,1) & \cdots & \omega(p,1) \\ \vdots & \ddots & \vdots \\ \omega(1,p) & \cdots & \omega(p,p) \end{bmatrix} \begin{bmatrix} \alpha_p(1) \\ \vdots \\ \alpha_p(p) \end{bmatrix} = \begin{bmatrix} \omega(0,1) \\ \vdots \\ \omega(0,p) \end{bmatrix} \qquad (1.38) $$

The disadvantage of this method is that the positive definiteness of this matrix is not guaranteed, hence the filter obtained by this method may not be stable. Since the matrix is not Toeplitz, these equations cannot be solved by the Levinson-Durbin recursion. As there is no assumption on the input, the energy


Page 33:

of the residual signal is smaller than the one extracted by the autocorrelation method. Therefore, it provides better modeling of the input signal, especially for deterministic sequences.
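A compact sketch of the covariance method under the boundary choices above (b_l = p, b_h = the last sample of the frame) might look as follows; the function name is illustrative.

```python
import numpy as np

def covariance_lpc(s, p):
    """Covariance-method predictor: build the matrix of omega(k, j)
    values from (1.32) and solve the normal equations (1.38).
    A sketch for illustration; no windowing is applied.
    """
    N = len(s)
    # omega(k, j) = sum_{n=p}^{N-1} s(n - k) * s(n - j)
    omega = np.array([[np.dot(s[p - k:N - k], s[p - j:N - j])
                       for k in range(1, p + 1)] for j in range(1, p + 1)])
    rhs = np.array([np.dot(s[p:N], s[p - j:N - j]) for j in range(1, p + 1)])
    return np.linalg.solve(omega, rhs)   # alpha_p(1) .. alpha_p(p)
```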

1.3 Modeling of Human Speech Production System

As discussed in Section 1.1, the human vocal tract can be modeled by an all-pole filter. Furthermore, the human speech production system uses two types of excitation to produce the desired sounds:

1. The vocal folds make quasi-periodic movements to produce an air flow from the lungs through the glottis, which has an impulsive nature. This type of excitation can be modeled with an impulse train whose period is the same as the period of this quasi-periodic movement.

2. The vocal folds are completely open to produce noise-like sounds. This type of excitation can be modeled with uniformly distributed white noise.

The block diagram of this basic model can be seen in Figure 1.7. In this model, the human vocal tract is modeled with an all-pole filter as discussed in Section 1.2. This filter is excited with either an impulse train or white noise for voiced and unvoiced speech, respectively. Finally, a gain term is applied to the synthesized speech to amplify the signal to a desired level. Generally, it is assumed that the statistical characteristics of the speech signal do not vary over 20-30 ms periods. Hence, frames whose lengths are between 20 and 30 ms can be used to obtain synthetic speech. Most speech processing algorithms, mostly speech coders, use these facts to obtain synthetic speech of acceptable quality: The encoder extracts only the state of the voiced/unvoiced switch, the pitch period for voiced speech, the LPC filter coefficients and the gain of the input speech, and transmits these parameters with efficient quantization methods. The decoder uses these parameters to synthesize the desired speech signal.


Page 34:

Figure 1.7: LPC vocoder synthesizer.
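The source-filter model of Figure 1.7 can be sketched in a few lines, assuming a frame-level voiced/unvoiced flag, pitch period, gain and LPC coefficients are given; this is a schematic illustration, not the vocoder code used later in the thesis.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(alpha, gain, voiced, pitch_period, n_samples, rng=None):
    """One frame of the source-filter model: an impulse train (voiced)
    or white noise (unvoiced) drives the all-pole filter 1/D(z),
    D(z) = 1 - sum_k alpha_k z^-k, scaled by a gain term."""
    rng = rng or np.random.default_rng(0)
    if voiced:
        excitation = np.zeros(n_samples)
        excitation[::pitch_period] = 1.0          # impulses at the pitch period
    else:
        excitation = rng.uniform(-1.0, 1.0, n_samples)  # noise excitation
    # all-pole synthesis filter: denominator [1, -alpha_1, ..., -alpha_p]
    a = np.concatenate(([1.0], -np.asarray(alpha)))
    return gain * lfilter([1.0], a, excitation)
```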

The vocal tract filter coefficients are generally not quantized directly: The dynamic range of these coefficients is high, and quantization error may yield an unstable filter. To solve these problems, new sets of parameters derived from the LPC filter coefficients are used. These parameters can be summarized as follows:

1. Reflection coefficients,

2. log-area ratio parameters,

3. inverse sine transform of reflection coefficients, and

4. Line Spectrum Frequencies (LSF).

As stated before, reflection coefficients are side products of the Levinson-Durbin recursion, and they are used in the lattice form of the same all-pole filter. For stable filters, these coefficients are bounded by 1 and −1, hence they are more suitable for quantization than LPC coefficients. Furthermore, it is possible to obtain reflection coefficients directly with the Schur recursion, without computing the direct form of the LPC filter. Usage of this recursion enables the computation of these coefficients with fixed point arithmetic without risking the stability of the filter.

Although the reflection coefficients are bounded by −1 and 1, the spectrum becomes very sensitive to quantization errors when the coefficients are close to the boundaries. To overcome this problem, two new sets of coefficients are introduced. Both of these transformations warp the scale of the parameters, and then


Page 35:

uniform quantization of these parameters becomes non-uniform quantization for reflection coefficients.

The log-area ratio (LAR) is defined as

$$ LAR_i = \log\frac{1 + r_i}{1 - r_i} \qquad (1.39) $$

and the inverse sine transform is defined as follows:

$$ Q_i = \arcsin(r_i) \qquad (1.40) $$

where r_i is the reflection coefficient.

Both of these coefficient sets have good performance in quantization and hence are widely used in vocoders. Also note that the second one is bounded between −π/2 and π/2.
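Both transformations are one-liners; a small sketch (function names are illustrative):

```python
import numpy as np

def lar(r):
    """Log-area ratio (1.39) of reflection coefficients r in (-1, 1)."""
    r = np.asarray(r)
    return np.log((1.0 + r) / (1.0 - r))

def inverse_sine(r):
    """Inverse sine transform (1.40); bounded between -pi/2 and pi/2."""
    return np.arcsin(np.asarray(r))
```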

Besides these parameters, another type of parameter, called line spectrum frequencies, is used to quantize the speech spectrum efficiently. These parameters have some unique features and excellent performance for quantization purposes.

1.3.1 Line Spectrum Frequencies (LSF)

The linear prediction filter coefficients can be represented by Line Spectrum Frequencies (LSFs). This parameter set was first introduced by Itakura [19]. For a minimum phase, m-th order polynomial, A_m(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_m z^{-m}, one can construct two (m+1)-st order LSF polynomials, P_{m+1}(z) and Q_{m+1}(z), by setting the (m+1)-st reflection coefficient to 1 and −1 in the Levinson-Durbin algorithm:

$$ P_{m+1}(z) = A_m(z) + z^{-(m+1)}\, A_m(z^{-1}) \qquad (1.41) $$

and

$$ Q_{m+1}(z) = A_m(z) - z^{-(m+1)}\, A_m(z^{-1}) \qquad (1.42) $$

This is equivalent to setting the vocal tract acoustic tube model completely closed or completely open at the (m+1)-st stage. It is clear that P_{m+1}(z) and


Page 36:

Q_{m+1}(z) are symmetric and anti-symmetric polynomials, respectively. There are three important properties of these two polynomials:

1. All of the zeros of the LSF polynomials are on the unit circle and can be represented by only their angles,

2. the zeros of the symmetric and anti-symmetric LSF polynomials are interlaced, and

3. the reconstructed linear prediction all-pole filter maintains its minimum phase property, if the first two properties are preserved during quantization.

Since these parameters can be represented by only their angles, they are called line spectrum frequencies. Therefore, the LSFs are also bounded, between 0 and 2π, similar to reflection coefficients.
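The construction (1.41)-(1.42) translates directly into the simplest (root-finding) computation of the LSFs mentioned at the end of this section; a sketch under the convention A_m(z) = 1 + Σ a_k z^{-k}, with illustrative names:

```python
import numpy as np

def lsf_from_lpc(a):
    """LSFs of A_m(z) = 1 + a_1 z^-1 + ... + a_m z^-m, by rooting the
    (m+1)-st order polynomials P(z) and Q(z) of (1.41)-(1.42).
    Direct root finding is the simplest, not the fastest, method;
    real-time coders use the Chebyshev-grid methods cited below.
    """
    c = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    p = np.concatenate((c, [0.0])) + np.concatenate(([0.0], c[::-1]))  # P_{m+1}
    q = np.concatenate((c, [0.0])) - np.concatenate(([0.0], c[::-1]))  # Q_{m+1}
    # zeros of P and Q lie on the unit circle; keep angles in (0, pi),
    # which drops the trivial roots at z = 1 and z = -1
    angles = []
    for poly in (p, q):
        roots = np.roots(poly)
        angles.extend(np.angle(r) for r in roots if 0.0 < np.angle(r) < np.pi)
    return np.sort(angles)   # interlaced LSFs, in radians
```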

Besides these properties, in recent studies [16], it is found that LSFs are uncorrelated. This property of LSFs makes them a suitable parameter for quantization. It is also observed that LSFs are closely related to the speech formants, as shown in Figure 1.8, and hence they provide a spectrally meaningful representation of the linear prediction filter [20]. Furthermore, it is observed that the spectral changes due to the perturbation of any LSF frequency are highly localized around the specific frequency [21].

Due to the above reasons, the LSFs are widely used in speech coding [22] and in speech recognition as speech feature parameters [23]. For example, it is possible to quantize the LSF coefficients of a 10th order LPC filter with 21 bits for a speech frame of duration 20 ms without introducing any audible distortion [24]. Various quantization methods can be found in [25] for a review of scalar quantization methods, [26-30] for various vector quantization methods and [31-37] for different interframe quantization methods.

Several methods for the computation of LSFs are reported: The simplest way to compute LSFs is to obtain the root locations of these two polynomials by complex arithmetic. However, this method is obviously very complex and, due to the iterative nature of complex root-finding algorithms, the time


Page 37:


Figure 1.8: LP power spectrum and the associated LSFs for voiced and unvoiced speech.

To overcome this problem, several methods have been proposed: Soong and Juang adopted a discrete cosine transform to evaluate cosine functions on a fine grid [39]. Furthermore, an all-pass ratio filter can also be used to extract the locations of the LSFs [38]. However, all of these methods require a large number of trigonometric function evaluations. Therefore, Kabal and Ramachandran [40] presented a backward recursive formulation to determine the values of the cosine function on a fine grid using a Chebyshev expansion and the bisection method. Wu and Chen reported a similar method which uses a modified Newton-Raphson technique for faster convergence [41]. These latter two methods are widely used in real-time speech coding algorithms. Besides these methods, Goalic and Saoudi utilize the Split Levinson algorithm to compute LSFs independently [42]. Finally, an LMS based adaptive algorithm, applied on a sample-by-sample basis to find the LSFs, is reported by Cheetham [43].
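To make the construction above concrete, the following minimal Python/NumPy sketch illustrates the direct root-finding approach (the simplest but most expensive method mentioned above; the function name is ours, and real-time coders use the cheaper Chebyshev or Newton-Raphson schemes cited above). It builds the two LSF polynomials of (1.41)-(1.42) and reads the LSFs off as the angles of their unit-circle zeros:

```python
import numpy as np

def lsfs_from_lpc(a):
    """LSFs of A(z) = 1 + a[0] z^-1 + ... + a[m-1] z^-m via direct root finding."""
    A = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))  # Eq. (1.41)
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))  # Eq. (1.42)
    # All zeros lie on the unit circle, so each zero is fully described by its angle.
    angles = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    eps = 1e-8
    # Keep one zero of each conjugate pair and drop the trivial zeros at 0 and pi.
    lsf = angles[(angles > eps) & (angles < np.pi - eps)]
    return np.sort(lsf)   # the P and Q zeros interlace when A(z) is minimum phase
```

For the $10^{th}$ order filters used in this thesis, the sketch returns ten ascending angles in $(0, \pi)$.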


Chapter 2

Spectrum Non-Stationarity Detection Algorithm Based on Line Spectrum Frequencies

One of the well-known properties of Line Spectrum Frequencies (LSFs) is that their locations are closely related to the peaks of the speech spectrum: Two or three consecutive LSFs generally cluster to represent a peak, also called a formant frequency, in the spectrum. As the formant frequencies change, the LSF locations also change. By taking advantage of this fact, LSFs can be used to track the changes in the spectrum. We introduce two definitions which we use in the rest of this chapter:

1. The area between two consecutive LSFs is defined as an LSF region.

2. The difference between two consecutive LSFs is defined as the bandwidth of that LSF region.

The displacement of the LSFs usually gives a clue about the formation of the spectrum [25]: If the bandwidth of an LSF region is larger than those of its neighboring LSF regions, usually a valley is present in the spectrum at this LSF region. Similarly, if the bandwidth of an LSF region is smaller than that of the previous neighboring region and larger than that of the next neighboring region, the energy of the spectrum is said to be increasing with increasing frequency. Besides, if the bandwidths of two consecutive LSF regions are almost the same, usually the three LSFs involved come together to form a peak between them. Coetzee and Barnwell [44] used the above relations to construct a speech quality measurement algorithm based on LSFs.

However, the generalization described above may not hold in all cases. Sometimes formants come much closer to each other, and two consecutive LSF regions may both contain peaks; conversely, an LSF region whose bandwidth is smaller than those of its neighboring regions may still not contain a peak if its bandwidth is nevertheless wide.

In addition to the bandwidths of the LSF regions, the spectral values at the LSF locations, obtained by evaluating the prediction filter on the unit circle at the LSF locations, can be used to characterize the spectrum formation: If the energy at an LSF is larger than those at its neighboring LSFs, a peak is said to be present in the neighborhood of that LSF. In other words, it is easier to characterize the region by using both the differences between LSF locations and the corresponding spectrum values.

In this chapter, a new and simple spectrum non-stationarity detector based on LSF related speech variation measures is introduced. In Section 2.1, an algorithm which estimates the peaks of the spectrum, the so-called formant frequencies for voiced speech, is presented. This section is divided into two parts: The first part describes the algorithm which decides whether an LSF region contains a peak or not, and the second part describes the algorithm used to estimate the locations of the peaks precisely. The non-stationarity detection algorithm, using speech variation measures based on the bandwidths of the LSF regions and the peak locations, is presented in Section 2.2. Simulation studies are given in Section 2.3.


2.1 Peak Estimation

In this section, a two-step peak estimation algorithm is presented. In the first step, the LSF regions which contain the peaks are detected, and in the second step, the locations of the peaks are calculated. Details of the first and second steps are presented in Section 2.1.2 and Section 2.1.3, respectively. In Section 2.1.1, the speech database used in this thesis is described.

2.1.1 Experimental Data

In this work, all required statistical data are obtained from two databases, owned by TUBITAK-BILTEN, each containing 50 male and 50 female speakers. The first database contains telephone speech recorded with various handsets, while the other one is formed by digitizing speech from close microphone talk. The databases include 12 words from each speaker - the numbers from zero to nine and 'yes' and 'no' in Turkish. The total number of frames for the voiced and the entire speech is approximately 40000 and 100000, respectively. The sampling rate is 8000 samples/sec and each sample is represented by 16 bits.

The LSFs used in this work are extracted from the coefficients of the $10^{th}$ order vocal tract filter, calculated by the autocorrelation method followed by the Levinson-Durbin recursion [18]. This recursion uses Hamming windowed 200-sample frames, previously filtered with a type-II Chebyshev high-pass filter with cut-off frequency at 60 Hz. Although it is known that the covariance method gives better results, it is not used because of its computational complexity. The method used in the extraction of the LSFs is defined in [41], which uses a modified version of the Newton-Raphson method for faster convergence in the root finding algorithm. Before the computation of the LSFs, a bandwidth expansion of 15 Hz is applied to the poles of the all-pole filter to increase the bandwidth of the peaks.
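A minimal sketch of this analysis front end is given below, assuming the common radius-scaling form of bandwidth expansion ($a_k \to \gamma^k a_k$ with $\gamma = e^{-\pi \Delta f / f_s}$; how the 15 Hz figure is applied is our assumption). The 60 Hz Chebyshev high-pass stage and the LSF root finder of [41] are omitted:

```python
import numpy as np

FS = 8000                                  # sampling rate (Hz)
GAMMA = np.exp(-np.pi * 15.0 / FS)         # assumed mapping of the 15 Hz expansion

def lp_coefficients(frame, order=10):
    """Hamming window + autocorrelation method + Levinson-Durbin recursion.
    Returns a = [a1, ..., a_order] of A(z) = 1 + a1 z^-1 + ... + a_m z^-m."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:len(w) + order]  # lags 0..order
    a, e = np.zeros(order), r[0]
    for i in range(order):
        k = -(r[i + 1] + np.dot(a[:i], r[i:0:-1])) / e  # reflection coefficient
        a[:i] = a[:i] + k * a[:i][::-1]                 # update lower-order taps
        a[i] = k
        e *= 1.0 - k * k                                # prediction error update
    return a

def expand_bandwidth(a, gamma=GAMMA):
    """Scale a_k by gamma^k: pulls the poles toward the origin, widening peaks."""
    return a * gamma ** np.arange(1, len(a) + 1)
```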


2.1.2 Detection of a Peak in an LSF Region

Initially, two separate algorithms, running in parallel, based on the bandwidths of the LSF regions and on the energies of the LSFs, are used to make initial peak assignments to the LSF regions.

In the bandwidth based method, the bandwidths are calculated, and the LSF regions whose bandwidths are smaller than the bandwidths of their neighboring LSF regions are assigned to contain peaks. The bandwidth of the $i^{th}$ LSF region is defined as $f_i - f_{i-1}$, where $f_i$ is the $i^{th}$ LSF.

In the energy based method, first, the energies of the line spectrum frequencies are calculated as follows:

$$P_i^r = \frac{1}{\left|1 + \sum_{k=1}^{m} a_k e^{-jkf_i}\right|^{2r}} \qquad (2.1)$$

In (2.1), r is selected to be 0.15, since $P_i^{0.15}$ approximates the logarithm of the spectrum, as shown in Figure 2.1. It can also be observed that more emphasis is given to low frequency regions.

Equation (2.1) is also used in the peak location estimation algorithm described in Section 2.1.3, where the value of r is varied for different LSF regions.

To find a region containing a peak, an energy-bandwidth based measure, $El_i$, is defined for the LSF region before the $i^{th}$ LSF:

$$El_i = \frac{P_{i-1}^{0.15}}{f_i - f_{i-1}} \qquad (2.2)$$

where $f_i$ represents the $i^{th}$ LSF.

To detect the LSF region containing the peak, (2.2) is applied to the previous and next LSF regions of the LSF whose spectral value is larger than those of its neighboring LSFs. It is experimentally observed that the region which gives the highest score contains the peak.

After finding the peak regions with both algorithms, a merging strategy which reduces misclassification of the states of the regions is applied to obtain the final states of the regions.
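The two parallel detectors can be sketched as follows (the helper names are ours, and the region indexing is simplified to the interior regions between consecutive LSFs; the thesis-specific merging and threshold refinements follow below):

```python
import numpy as np

def lp_power(f, a, r=0.15):
    """Evaluate (2.1): the r-th power of the LP power spectrum at frequency f (rad)."""
    k = np.arange(1, len(a) + 1)
    return 1.0 / np.abs(1.0 + np.sum(a * np.exp(-1j * k * f))) ** (2.0 * r)

def bandwidth_detector(lsf):
    """Flag the regions (lsf[i], lsf[i+1]) that are narrower than both neighbours."""
    bw = np.diff(lsf)
    flags = np.zeros(len(bw), dtype=bool)
    flags[1:-1] = (bw[1:-1] < bw[:-2]) & (bw[1:-1] < bw[2:])
    return flags

def energy_detector(lsf, a):
    """Around each locally dominant LSF, score the two adjacent regions with (2.2)
    and assign the peak to the higher-scoring one."""
    p = np.array([lp_power(f, a) for f in lsf])
    flags = np.zeros(len(lsf) - 1, dtype=bool)
    for j in range(1, len(lsf) - 1):
        if p[j] > p[j - 1] and p[j] > p[j + 1]:          # locally dominant LSF
            left = p[j - 1] / (lsf[j] - lsf[j - 1])      # score of region before f_j
            right = p[j] / (lsf[j + 1] - lsf[j])         # score of region after f_j
            flags[j - 1 if left >= right else j] = True
    return flags

# merging strategy sketched in the text: keep only regions detected by both methods
# candidates = bandwidth_detector(lsf) & energy_detector(lsf, a)
```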


Figure 2.1: Logarithm and $0.15^{th}$ power of the power spectrum of the utterance /a/ plotted in (a) and (b), respectively.

The success rates of both algorithms are obtained from the data set described in Section 2.1.1. Since the aim is to estimate formants, only the part of the database which contains voiced speech is used. For the voiced/unvoiced estimator, the one based on the normalized autocorrelation of the input sequence, described in detail in [17], is used, with the exception that only the frames whose first bandpass voicing strength is larger than 0.8 are considered to be voiced speech. In other words, only strongly voiced frames are considered in the calculations. After deciding on voiced speech, the power spectrum is calculated with a 1 Hz step size, a peak picking algorithm is applied to find the peaks, and the LSF regions which contain these peaks are located. Statistics for correct classification and misclassification rates are calculated for the bandwidth based method and the energy based method separately and are tabulated in Table 2.1 and Table 2.2, respectively.


Table 2.1: Statistics about the classification of LSF regions in the bandwidth based method for voiced speech. P_C stands for the percentage of correctly classified LSF regions which contain a peak. P_M stands for the percentage of misclassified LSF regions which contain a peak. P_FP stands for the percentage of false peak assigned LSF regions with respect to the total peak assigned LSF regions. P_FNP stands for the percentage of false peak assigned LSF regions with respect to the LSF regions which do not contain a peak.

Regions:     1      2      3      4      5      6      7      8      9
P_C      97.79  95.38  93.90  96.08  97.54  94.24  95.50  90.69  99.60
P_M       2.21   4.62   6.10   3.91   2.46   5.76   4.50   9.31   0.40
P_FP      4.33  23.98  63.41  33.37  24.33  21.32  26.10  28.70  35.42
P_FNP    22.13   2.79  18.54  10.63  18.90   5.97  22.77   6.60  45.75

Table 2.2: Statistics about the classification of LSF regions in the energy based method for voiced speech.

Regions:     1      2      3      4      5      6      7      8      9
P_C      98.77  91.56  67.98  93.45  93.99  97.85  90.76  98.28  93.00
P_M       1.33   8.44  32.02   6.55   6.01   2.15   9.24   1.72   7.00
P_FP      2.86  13.25  25.18  19.31   9.95  19.16  11.94  24.35  13.65
P_FNP    14.53   1.29   2.61   4.94   6.26   5.42   8.30   5.72  11.86

In these experiments, the bandwidth based method is observed to detect approximately 96% of the regions containing a peak, while for the energy based method this number reduces to 94%. However, the critical problem in both of these methods is the large number of false peak assigned regions: In the third row of both tables, where the percentage of false peak assigned regions with respect to the total peak assigned regions is shown, nearly 25% and 12% of the peak assigned regions are false alarms for the bandwidth based method and the energy based method, respectively. Although some of these false alarms occur due to the selection of the neighboring LSF regions of the LSF regions containing the peak, the remaining large number of false peak assignments must be eliminated. The best solution for the elimination of these regions is found to be selecting only the LSF regions detected by both methods.

Furthermore, sometimes three LSFs are clustered together to form a peak. In this case, the bandwidths of the two LSF regions formed by these three LSFs are almost equal to each other, and usually the peak location is around the middle LSF. If this formation occurs, the proposed methods detect only one of these LSF regions, and sometimes they detect different LSF regions from these two. Because of our merging criterion, such peaks are missed. In order to get rid of this problem, if the detection of one LSF region by one method and of its neighboring LSF region by the other method is encountered, a small test is applied to both LSF regions to select the correct one: If the absolute difference between the bandwidths of these two LSF regions is smaller than 8 percent of the bandwidth of the LSF region detected by the bandwidth based method, the LSF region detected by the energy based method is considered to be the true one. Otherwise, the decision of the bandwidth based method is accepted.

Unfortunately, a large number of false estimations still occur in some regions, and some peaks, estimated by one method but missed by the other, remain uncaught. To eliminate these false peaks, a bandwidth threshold, $T_i$, is assigned to the $i^{th}$ LSF region. If the bandwidth of the LSF region is larger than $T_i$, the detected peak is assumed to be a false peak. Also, to include the missing peaks detected by only one method, the following two tests are applied to those regions (a code sketch follows the list):

1. If the bandwidth of the region is smaller than a threshold, then a peak is assigned to the LSF region. This lower bandwidth threshold for the $i^{th}$ LSF region is $l_i$ for the bandwidth based method and $\alpha_i$ for the energy based method.

2. Let us define the energy-bandwidth based measure, $e_i$, as follows:

$$e_i = \frac{P_i^{0.15} + P_{i-1}^{0.15}}{f_i - f_{i-1}} \qquad (2.3)$$

for the $i^{th}$ LSF region. If $e_i$ is larger than a threshold, then a peak is assigned to the region. This higher energy-bandwidth based measure threshold for the $i^{th}$ LSF region is $\lambda_i$ for the bandwidth based method and $\epsilon_i$ for the energy based method.
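One possible rendering of this elimination-and-rescue logic, reusing the hypothetical lp_power helper from the earlier sketch (the per-region threshold arrays correspond to Table 2.3 below; the three-LSF special case is omitted, and the exact control flow is our reading of the text):

```python
def refine_flags(bw_flags, en_flags, lsf, a, T, l_th, alpha, lam, eps):
    """AND-merge the two detectors, discard wide-region false peaks, and
    rescue regions flagged by only one method via the two tests above."""
    def e_measure(i):   # energy-bandwidth measure (2.3) for region (lsf[i], lsf[i+1])
        return (lp_power(lsf[i + 1], a) + lp_power(lsf[i], a)) / (lsf[i + 1] - lsf[i])
    final = [False] * len(bw_flags)
    for i in range(len(bw_flags)):
        bw = lsf[i + 1] - lsf[i]
        if bw_flags[i] and en_flags[i]:
            final[i] = bw <= T[i]                          # false-peak elimination
        elif bw_flags[i]:                                  # bandwidth method only
            final[i] = bw < l_th[i] or e_measure(i) > lam[i]
        elif en_flags[i]:                                  # energy method only
            final[i] = bw < alpha[i] or e_measure(i) > eps[i]
    return final
```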


The thresholds used in this algorithm are also estimated from the same database. In order to obtain the thresholds for the elimination of false peak assignments, the distribution of the percentage of correct estimation and false estimation versus the threshold, $T_i$, is calculated, and the point which introduces the minimum loss of correctly detected LSF regions while providing the maximum false peak assignment elimination is selected. Similar calculations are performed for the thresholds that are used to catch the misclassified LSF regions which actually contain peaks. These thresholds are selected so that the maximum number of misclassified LSF regions is corrected, while the minimum number of false peak assignments is introduced. The flow diagram of the algorithm is given in Figure 2.2, and the final thresholds are tabulated in Table 2.3. The distributions of the percentage of correct estimation and false estimation versus the thresholds for the LSF regions are described in Appendix A in detail.

Table 2.3: Thresholds for the elimination and detection of misclassified regions.

Regions:        1       2      3      4       5      6      7      8      9
$T_i$       0.251   0.236  0.251  0.259   0.251  0.267  0.255  0.267  0.314
$l_i$       0.102   0.083  0.129  0.102   0.126  0.110  0.149  0.116  0.130
$\alpha_i$  0.129   0.102  0.168  0.196   0.196  0.134  0.196  0.094  0.196
$\lambda_i$   N/A     N/A  99949    N/A  101858  40744  31831  95493  40744
$\epsilon_i$ 127324 127324 79578  26738   22282  50930  22282  95493  40744

Final statistics for the proposed classification method are given in Table 2.4, Table 2.5 and Table 2.6. In these tables, it can be seen that the number of false peak assigned regions is reduced dramatically. Furthermore, the highest false alarm percentage, which occurs in the $3^{rd}$ region, involves only 3.5% of all peaks. If all regions are considered together, the false alarm rates go down to 5% and 10% for voiced speech and the entire speech, respectively. It must be noted that, since bandwidths are wide in unvoiced speech, an increase in the false alarm rate is expected. Also, the total percentage of misclassified regions which contain peaks remains at 7%. The overall results can be seen in Table 2.6: Approximately 95% of the regions are classified correctly by the proposed method for both voiced and entire speech.


Figure 2.2: Algorithm for classification of the LSF regions.


Table 2.4: Statistics about the classification of LSF regions in both methods for voiced speech.

Regions:     1      2      3      4      5      6      7      8      9
P_C      98.39  81.23  70.11  86.84  93.22  92.64  90.54  85.85  92.67
P_M       1.61  18.77  29.89  13.16   6.78   7.36   9.46  14.15   7.33
P_FP      1.78   4.80  11.20   7.13   3.78   6.85   5.92  10.53   9.80
P_FNP    10.22   0.56   0.96   1.63   2.38   1.79   4.30   2.12   8.29

Table 2.5: Statistics about the classification of LSF regions in both methods for the entire speech.

Regions:     1      2      3      4      5      6      7      8      9
P_C      95.38  80.93  74.45  90.95  91.10  92.30  90.56  86.07  94.69
P_M       4.62  19.01  25.55   9.05   8.90   7.70   9.44  13.93   5.31
P_FP      3.06   7.00  14.03  12.72  10.03  15.39  12.56  14.69  16.34
P_FNP     8.64   0.30   2.20   2.98   5.09   3.11   7.54   2.20  15.99

Table 2.6: Overall performance of the system for the proposed method. Percentage of correctly classified regions.

Regions:          1      2      3      4      5      6      7      8      9
Entire speech 94.34  98.82  94.22  95.90  93.64  96.16  91.76  96.28  88.97
Voiced speech 97.11  97.25  96.22  96.10  95.89  97.06  93.47  95.79  92.14

2.1.3 Accurate Estimation of Peak Location

After finding the regions which contain the peaks, another algorithm is applied to find the exact location of the peak. In Coetzee and Barnwell's work [44], the peak location is estimated by the mean of the two LSFs which form the region. This estimate gives acceptable peak locations only if the bandwidth of that region is sufficiently small - smaller than 150 Hz. Since the bandwidth of an LSF region may become as large as 300 Hz, this estimate will not give satisfactory results, and the difference between the actual peak and the estimated peak may be as large as 150 Hz. As an alternative, weighted means, whose weights are the


same as the ones used in the quantization of LSFs in [30], may be used to estimate the peak locations as follows:

$$p_i = f_{i-1} + (f_i - f_{i-1})\,\frac{P_i^r}{P_i^r + P_{i-1}^r} + \mu_i \qquad (2.4)$$

where $p_i$ represents the location of the peak in the $i^{th}$ region and $\mu_i$ is the correction term for the peak in the $i^{th}$ LSF region. For this alternative procedure, r is chosen as 0.15 and $\mu_i$ is set to zero for all LSF regions.
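As a sketch, (2.4) is a one-liner given the hypothetical lp_power helper from the earlier sketch; the region-specific exponents $r_i$ and corrections $\mu_i$ selected below (Table 2.7) would be passed in per region:

```python
def peak_location(lsf, a, i, r=0.15, mu=0.0):
    """Energy-weighted peak estimate of (2.4) for the region (lsf[i-1], lsf[i])."""
    hi = lp_power(lsf[i], a, r)        # P_i^r
    lo = lp_power(lsf[i - 1], a, r)    # P_{i-1}^r
    return lsf[i - 1] + (lsf[i] - lsf[i - 1]) * hi / (hi + lo) + mu
```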

Unfortunately, this energy weighted mean makes only a small improvement in the peak location estimation. In order to get better results, different values of r may be considered. For this purpose, the mean, the standard deviation, the percentage of peak estimation errors smaller than 25 Hz and the percentage of peak estimation errors smaller than 50 Hz are calculated for different values of r, ranging from 0.15 to 1.75, for both the voiced and the entire speech data for all LSF regions. Based on this experiment, different r values are selected for different LSF regions such that the standard deviation is at or near its minimum and the percentage of peak estimation errors smaller than 25 Hz is maximum. This criterion is selected because, if we try to maximize the percentage of peak estimation errors smaller than an error range lower than 25 Hz, the percentage of peak estimation errors larger than 50 Hz also increases, which exceeds the acceptable range. Furthermore, Schafer et al. reported that a 25 Hz error is negligible for formant estimation [45]. Besides, the correction term in (2.4), $\mu_i$, is selected as the mean of the error between the actual and estimated peak locations for the selected r value for the $i^{th}$ LSF region.

After selecting optimum exponents for all LSF regions, (2.4) is used to estimate the peak locations with a different r for each LSF region. The selected $r_i$'s and $\mu_i$'s are tabulated in Table 2.7. The number of occurrences versus the error in the estimated peak location for voiced speech and whole speech is given in Figure 2.3a and Figure 2.3b, respectively. Nearly 97% and 95% of the peaks are estimated within a 25 Hz error range for the voiced and the entire speech, respectively. These values increase to 99.5% and 99% if 50 Hz is also accepted as a tolerable error range.


Table 2.7: Selected $r_i$ and the corresponding mean of the error in peak estimation, $\mu_i$, in radians.

Regions:       1       2       3        4       5        6       7       8       9
$r_i$       1.25     1.3   1.125      1.3   1.175      1.5   1.175    1.45    1.15
$\mu_i$   0.0057  0.0058  0.0019  -0.0009  0.0009  -0.0001  0.0018  0.0011  0.0040

Figure 2.3: Number of occurrences versus the difference between original and estimated peaks, for voiced speech (a) and whole speech (b). The solid, dashed and dash-dotted lines correspond to the simple mean, the energy weighted mean with r = 0.15, and the energy weighted mean with $r = r_i$, respectively.

The figures for the standard deviation, the percentage of peak estimation errors smaller than 25 Hz and the percentage of peak estimation errors smaller than 50 Hz versus different r values, and the tabulated statistics for the simple mean estimation method, the energy weighting method with r = 0.15 and the energy weighting method with $r = r_i$, are given in Appendix B. The proposed method decreases the standard deviation by a factor of four compared to the simple mean method and increases the accuracy of the estimation dramatically.

Without the non-stationary region detector, the output of this algorithm may be used as a formant tracker in conjunction with a voiced/unvoiced estimator. An example of this formant estimator is illustrated in Figure 2.4 for the Turkish sentence "Şanslı adam kaybettiği mücevheri buldu."


Figure 2.4: Time sequence and spectrogram of the Turkish sentence "Şanslı adam kaybettiği mücevheri buldu". Formant frequencies are plotted on the spectrogram.

2.2 Non-stationarity Detection

The non-stationarity of the speech spectrum can be detected using the LSF based peak estimation method described in Section 2.1. The simplest way is to examine the $L_2$ norm of the difference between the peak locations. Unfortunately, since the peak estimation algorithm may miss some peak locations, or the number of peaks may change, especially in transient regions, direct application of the $L_2$ norm to the estimated peak locations will not give good results. Furthermore, the $L_2$ norm lacks the ability to incorporate perceptual information into the speech variation measure, which is essential for speech. A weighted Euclidean distance measure whose weights are selected according to the nature of the peaks is more suitable. Furthermore, instead of using only the estimated peaks, all data related to the LSF regions are used in the calculation of the speech variation measure for better results. In this section, two speech variation measures, one based on the


bandwidth of the LSF regions and one based on the peak locations in the LSF regions, are used in detection of spectrum non-stationarity.

In the beginning of the algorithm, the LSF regions containing peaks are detected and the peak locations are found accurately with the methods described in Section 2.1. Since the new speech variation measure based on peak locations requires a peak location for all regions, virtual peak locations are computed, using the same method, even for the regions which contain no peak, as if they had one. For an $m^{th}$ order LPC model, $m-1$ peak locations are calculated. In order to be used in the speech variation measure, a vector for the $k^{th}$ frame, $\mathbf{p}^k$, whose entries are the weighted differences between the peak locations of the $k^{th}$ and $(k-1)^{st}$ frames, is defined as follows:

$$\mathbf{p}^k = [\,\tilde{p}_1^k \;\; \tilde{p}_2^k \;\; \cdots \;\; \tilde{p}_{m-1}^k\,]^t \qquad (2.5)$$

where $\tilde{p}_i^k = w_i\,(p_i^k - p_i^{k-1})$. The weights, $w_i$, are obtained experimentally and are set according to the states of the regions in the consecutive frames, to emphasize a change in the state of the LSF region, as follows:

1. If the region in both frames does not contain a peak, $w_i$ is selected to be 0.1.

2. If the region in both frames contains a peak, $w_i$ is selected to be 0.8.

3. If the region in only one of the frames contains a peak, $w_i$ is selected to be 1.0.

The speech variation measure based on peak locations, $\Lambda_k$, for the $k^{th}$ frame is defined as follows:

$$\Lambda_k = \Lambda_k^+ + \Lambda_k^- \qquad (2.6)$$

where $\Lambda_k^+ = \mathbf{p}^{k\,t}\,\mathbf{W}_p^k\,\mathbf{p}^k$, $\Lambda_k^- = \mathbf{p}^{k\,t}\,\mathbf{W}_p^{k-1}\,\mathbf{p}^k$, and $\mathbf{W}_p^l$ is the weighting matrix whose entries are determined according to the perceptual sensitivity of the peak locations estimated for the $l^{th}$ frame. By this method, changes in different parts of the spectrum are emphasized in a perceptual manner.

Entries of $\mathbf{W}_p$ are determined according to the relationship between the peak locations in the spectrum. We define the weights as the correlation between the peak locations. As discussed in Chapter 1, LSFs are reported to be


uncorrelated [16]. If LSFs were used in our system directly, calculating the main diagonal entries of the weighting matrix would be sufficient, and the rest of the entries would be set to zero. As peaks are derived from two consecutive LSFs, consecutive peaks are correlated. Therefore, the diagonal entries next to the main diagonal of the weighting matrix must also be calculated. Although the usage of peaks instead of LSFs may seem redundant, it must be noted that the LSF regions which contain a formant can easily be emphasized by this method while forming the vector $\mathbf{p}^k$.

Entries of $\mathbf{W}_p$ are computed according to (2.10). Note that the entries other than the main diagonal entries and the diagonal entries next to the main diagonal are set to zero, as there is no correlation between those peak locations.

Let us rewrite (2.4):

$$p_i = f_{i-1} + (f_i - f_{i-1})\,\frac{P_i^r}{P_i^r + P_{i-1}^r} + \mu_i = \omega_{i1} f_i + \omega_{i2} f_{i-1} + \mu_i \qquad (2.7)$$

where

$$\omega_{i1} = \frac{P_i^r}{P_i^r + P_{i-1}^r} \qquad (2.8)$$

and

$$\omega_{i2} = \frac{P_{i-1}^r}{P_i^r + P_{i-1}^r} \qquad (2.9)$$

Entries of the weighting matrix are defined as follows:

$$W_{p_{ij}} = \mathcal{E}\{p_i p_j\} = \mathcal{E}\{(\omega_{i1} f_i + \omega_{i2} f_{i-1} + \mu_i)(\omega_{j1} f_j + \omega_{j2} f_{j-1} + \mu_j)\} \qquad (2.10)$$

The main diagonal entries are given as follows:

$$\mathcal{E}\{p_i p_i\} = \mathcal{E}\{(\omega_{i1} f_i + \omega_{i2} f_{i-1})(\omega_{i1} f_i + \omega_{i2} f_{i-1})\} = \omega_{i1}^2\,\mathcal{E}\{f_i^2\} + 2\,\omega_{i1}\omega_{i2}\,\mathcal{E}\{f_i f_{i-1}\} + \omega_{i2}^2\,\mathcal{E}\{f_{i-1}^2\} \qquad (2.11)$$

and the diagonal entries next to the main diagonal are given by:

$$\mathcal{E}\{p_i p_{i+1}\} = \mathcal{E}\{(\omega_{i1} f_i + \omega_{i2} f_{i-1})(\omega_{(i+1)1} f_{i+1} + \omega_{(i+1)2} f_i)\} = \omega_{i1}\,\omega_{(i+1)2}\,\mathcal{E}\{f_i^2\} \qquad (2.12)$$


and

$$\mathcal{E}\{p_i p_{i-1}\} = \omega_{i2}\,\omega_{(i-1)1}\,\mathcal{E}\{f_{i-1}^2\} \qquad (2.13)$$

The other entries turn out to be zero, as shown below:

$$\mathcal{E}\{p_i p_{i+k}\} = \mathcal{E}\{(\omega_{i1} f_i + \omega_{i2} f_{i-1})(\omega_{(i+k)1} f_{i+k} + \omega_{(i+k)2} f_{i+k-1})\} = 0, \qquad k \geq 2 \ \text{or} \ k \leq -2 \qquad (2.14)$$

because $\mathcal{E}\{f_i f_j\} = 0$ for $i \neq j$. The $\mu_i$'s are neglected in the calculations because their values are negligibly small compared to the other parameters in (2.10).

The only missing part in this formulation is the variance of the LSFs. For the variances, values similar to those used in [30,46] are adopted.
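Putting (2.5)-(2.14) together, a compact sketch follows, assuming the hypothetical lp_power helper from the earlier sketch and a caller-supplied vector of LSF variances (the thesis takes its variance model from [30,46]):

```python
import numpy as np

def peak_weight_matrix(lsf, a, var_f, r=0.15):
    """Tridiagonal W_p of (2.10)-(2.14). var_f[i] plays the role of E{f_i^2};
    cross-terms E{f_i f_j}, i != j, vanish because LSFs are uncorrelated,
    and the mu_i correction terms are neglected as in the text."""
    m = len(lsf)
    p = np.array([lp_power(f, a, r) for f in lsf])
    w1 = [p[i] / (p[i] + p[i - 1]) for i in range(1, m)]       # omega_{i1}, (2.8)
    w2 = [p[i - 1] / (p[i] + p[i - 1]) for i in range(1, m)]   # omega_{i2}, (2.9)
    W = np.zeros((m - 1, m - 1))
    for i in range(m - 1):                                     # peak in (lsf[i], lsf[i+1])
        W[i, i] = w1[i] ** 2 * var_f[i + 1] + w2[i] ** 2 * var_f[i]       # (2.11)
        if i + 1 < m - 1:
            W[i, i + 1] = W[i + 1, i] = w1[i] * w2[i + 1] * var_f[i + 1]  # (2.12)-(2.13)
    return W

def peak_variation(p_k, p_km1, w, W_k, W_km1):
    """Lambda_k of (2.6) from the weighted peak displacement vector of (2.5)."""
    d = w * (p_k - p_km1)
    return d @ W_k @ d + d @ W_km1 @ d
```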

The speech variation measure, $\Gamma$, based on the bandwidths of the LSF regions is determined in a similar manner. First of all, the area between $0$ and the first LSF and the area between the last LSF and $\pi$ are also considered as LSF regions for this speech variation measure. Therefore, $m+1$ parameters are extracted for the $m^{th}$ order LPC filter. A vector for the $k^{th}$ frame, $\mathbf{b}^k$, whose entries are the weighted differences between the bandwidths of the LSF regions in the $k^{th}$ and $(k-1)^{st}$ frames, is defined as:

$$\mathbf{b}^k = [\,\tilde{\psi}_0^k \;\; \tilde{\psi}_1^k \;\; \cdots \;\; \tilde{\psi}_m^k\,]^t \qquad (2.15)$$

where $\tilde{\psi}_i^k = w_i\,(\psi_i^k - \psi_i^{k-1})$ and $\psi_i = f_i - f_{i-1}$ for $i = 1$ to $m-1$. Note that $\psi_0$ and $\psi_m$ are equal to $f_0$ and $(\pi - f_{m-1})$, respectively. The weights, $w_i$, are assigned according to the following criteria and are determined experimentally:

1. If the region in both frames does not contain a peak, $w_i$ is selected to be 0.25.

2. If the region in both frames contains a peak, $w_i$ is selected to be 0.9.

3. If the region in only one of the frames contains a peak, $w_i$ is selected to be 1.0.


Since the first and the last regions can never contain a peak, their weights are automatically set to 0.25.

The speech variation measure, $\Gamma_k$, for the $k^{th}$ frame is defined as follows:

$$\Gamma_k = \Gamma_k^+ + \Gamma_k^- \qquad (2.16)$$

where $\Gamma_k^+ = \mathbf{b}^{k\,t}\,\mathbf{W}_b^k\,\mathbf{b}^k$, $\Gamma_k^- = \mathbf{b}^{k\,t}\,\mathbf{W}_b^{k-1}\,\mathbf{b}^k$, and $\mathbf{W}_b^l$ is the weighting matrix whose entries are determined according to the perceptual sensitivity of the bandwidths of the peaks for the $l^{th}$ frame.

The entries of the weighting matrix for bandwidths are defined as follows:

$$W_{b_{ij}} = \mathcal{E}\{\psi_i \psi_j\} \qquad (2.17)$$

The main diagonal entries are given as follows:

$$\mathcal{E}\{\psi_i \psi_i\} = \mathcal{E}\{(f_i - f_{i-1})(f_i - f_{i-1})\} = \mathcal{E}\{f_i^2\} + \mathcal{E}\{f_{i-1}^2\} \qquad (2.18)$$

The diagonal entries next to the main diagonal are given by:

$$\mathcal{E}\{\psi_i \psi_{i+1}\} = \mathcal{E}\{(f_i - f_{i-1})(f_{i+1} - f_i)\} = -\mathcal{E}\{f_i^2\} \qquad (2.19)$$

and

$$\mathcal{E}\{\psi_i \psi_{i-1}\} = \mathcal{E}\{(f_i - f_{i-1})(f_{i-1} - f_{i-2})\} = -\mathcal{E}\{f_{i-1}^2\} \qquad (2.20)$$

The other entries turn out to be zero, as shown below:

$$\mathcal{E}\{\psi_i \psi_{i+k}\} = \mathcal{E}\{(f_i - f_{i-1})(f_{i+k} - f_{i+k-1})\} = 0, \qquad k \geq 2 \ \text{or} \ k \leq -2 \qquad (2.21)$$
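Because the cross-correlations vanish, $\mathbf{W}_b$ is again tridiagonal and can be sketched directly from the LSF variances (the helper name and the constant-edge treatment of $0$ and $\pi$ are our assumptions):

```python
import numpy as np

def bandwidth_weight_matrix(var_f):
    """Tridiagonal W_b of (2.17)-(2.21). var_f[i] stands for E{f_i^2}; the
    band edges at 0 and pi are constants, so they contribute no variance."""
    v = np.concatenate(([0.0], np.asarray(var_f, dtype=float), [0.0]))
    n = len(v) - 1                      # m + 1 bandwidth regions psi_0..psi_m
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = v[i] + v[i + 1]                      # (2.18)
        if i + 1 < n:
            W[i, i + 1] = W[i + 1, i] = -v[i + 1]      # (2.19)-(2.20)
    return W
```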

Finally, the only remaining task is to compare the calculated speech variation measures with the experimentally determined thresholds, $\lambda$ and $\eta$, for $\Lambda$


and $\Gamma$, respectively. However, to eliminate small fluctuations, a one frame delay is introduced to compare the calculated measures with the ones calculated for the previous and next frames: If the variation calculated for either the previous or the next frame is larger than the current one, the calculated variation measure is not compared with the thresholds and the frame is assumed to be stationary. Otherwise, the variation values are compared with the thresholds, and if one of the speech variation measures exceeds its threshold, the frame is flagged as non-stationary. This method extracts the frames which have the maximum variation with respect to their neighboring frames.

The values of $\lambda$ and $\eta$ can be adjusted for different purposes: They may be lowered to catch small changes, or they may be set to high values to catch only abrupt changes.
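A sketch of this decision rule, under one reading of the local-maximum test (each measure must be a local maximum before it is compared with its own threshold; the defaults are the Section 2.3 values $\lambda = 0.1$, $\eta = 0.2$):

```python
def flag_nonstationary(lams, gams, lam_th=0.1, eta_th=0.2):
    """Frame k is flagged when Lambda_k or Gamma_k both peaks locally and
    exceeds its threshold; the look at frame k+1 is the one-frame delay."""
    flags = [False] * len(lams)
    for k in range(1, len(lams) - 1):
        lam_peak = lams[k] >= lams[k - 1] and lams[k] >= lams[k + 1]
        gam_peak = gams[k] >= gams[k - 1] and gams[k] >= gams[k + 1]
        flags[k] = (lam_peak and lams[k] > lam_th) or (gam_peak and gams[k] > eta_th)
    return flags
```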

2.3 Simulation Studies

In this section, two simulations are performed to test the performance of the proposed algorithms. The first simulation tests the performance of the peak estimation algorithm; its results are given in Section 2.3.1. The other simulation is carried out to assess the performance of the non-stationarity detector. The results of this test are presented in Section 2.3.2.

2.3.1 Performance Test for Peak Estimation Algorithm

For this simulation, a database, owned by TUBITAK-BILTEN, containing the voices of 104 people, including children, is used. This database is formed by digitizing speech from close microphone talk. The database includes 10 words from each speaker - the numbers from zero to nine in Turkish. The total number of frames for the voiced and the entire speech is approximately 27000 and 40000, respectively. The sampling rate is 8000 samples/sec and each sample is represented by 16 bits.


Performance Test for Peak Detection Algorithm

The performance of the peak detection algorithm for the voiced and the entire speech is given in Table 2.8 and Table 2.9, respectively. The overall results are also illustrated in Table 2.10. On this test set, it is observed that the performance of the algorithm is similar to that obtained on the training set, with even better results for some LSF regions.

Table 2.8: Statistics about the classification of LSF regions for the proposed algorithm for the voiced speech of the test set.

Regions:     1      2      3      4      5      6      7      8      9
P_C      90.79  95.89  85.05  89.35  89.77  96.85  91.05  94.62  96.19
P_M       9.21   4.11  14.95  10.65  10.23   3.15   8.95   5.38   3.81
P_FP      4.15   4.19  15.67  16.08   8.78  13.04   8.42  14.40  16.96
P_FNP     2.49   1.17   0.96   1.17   3.08   3.93   3.24   4.08  12.26

Table 2.9: Statistics about the classification of LSF regions for the proposed algorithm for the entire speech of the test set.

Regions:     1      2      3      4      5      6      7      8      9
P_C      90.01  93.73  75.93  90.26  89.20  96.04  90.01  94.18  96.34
P_M       9.99   6.27  24.07   9.74  10.80   3.96   9.99   5.82   3.66
P_FP      7.10   6.59  15.27  18.40  10.51  15.53  12.91  18.19  22.20
P_FNP     3.14   0.94   1.59   2.54   3.75   4.78   4.20   5.17  13.68

Table 2.10: Overall performance of the proposed method for the test set. Percentage of correctly classified regions.

Regions:          1      2      3      4      5      6      7      8      9
Entire speech 94.71  98.40  96.07  96.66  94.40  95.40  94.42  94.70  89.65
Voiced speech 94.90  98.19  97.58  97.49  95.03  96.24  95.17  95.66  90.99

Performance Test for Accurate Peak Location Estimation Algorithm

In this test, the percentages of peak estimation errors smaller than 25 Hz and 50 Hz are extracted from the same database for the selected $r_i$ and $\mu_i$ values, and the results are tabulated in Table 2.11 and Table 2.12 for the voiced and the entire speech, respectively.

Table 2.11: Percentage of the errors smaller than 25 Hz and 50 Hz between the actual peak location and the estimated peak location, for the test set, for the voiced speech.

Regions:      1      2      3      4      5      6      7      8      9     All
< 25 Hz   97.79  95.85  93.40  94.51  96.60  91.31  92.55  87.17  88.09   92.90
< 50 Hz  100.00  99.80  99.53  99.42  99.58  98.79  98.73  97.62  98.66   99.11

Table 2.12: Percentage of the errors smaller than 25 Hz and 50 Hz between the actual peak location and the estimated peak location, for the test set, for the entire speech.

Regions:      1      2      3      4      5      6      7      8      9     All
< 25 Hz   94.75  92.65  91.88  90.42  94.60  88.21  91.11  85.45  84.97   90.19
< 50 Hz   99.42  99.73  99.33  97.96  99.24  98.29  98.49  96.85  97.08   98.46

From these results, it is observed that nearly 93% and 90% of the peaks are estimated within a 25 Hz error range for the voiced and the entire speech, respectively. This indicates a small performance degradation relative to the training set. As the percentages of errors smaller than 50 Hz are approximately 99% for both the voiced and the entire speech, it can be said that this estimation algorithm still obtains the locations of the peaks within an acceptable range.

2.3.2 Performance Test for Non-Stationarity Detector

The simulations in this section are performed on three artificially created speech signals. All sequences are created by the MELP synthesizer [17], and they contain two synthesized speech phonemes with a transitional region between them. The parameters used in the synthesizer are extracted from a real speech signal. The proposed algorithm uses a frame length of 22.5 ms and a step size


of 11.25 ms. The LSFs are extracted with the same method described in Section 2.1.1, and $\lambda$ and $\eta$ are set to 0.1 and 0.2, respectively. These thresholds are determined experimentally from part of the same database described in Section 2.1 and are found sufficient to detect small changes.

Three signals are used in this experiment:

1. The first signal contains two noise excited parts: The first half contains the utterance /sh/, whereas the second half contains the utterance /s/. This signal is used to test the performance of the non-stationarity detector for stochastic signals.

2. The second signal contains two periodic impulse excited parts: The first half contains the utterance /a/, whereas the second half contains the utterance /o/. This signal is used to test the performance of the non-stationarity detector for deterministic signals.

3. The third signal contains both periodic impulse and noise excited parts: The first half contains the utterance /sh/, whereas the second half contains the utterance /o/. This signal is used to test the performance of the non-stationarity detector in regions where both the excitation and the spectral shape change.

The result of the application of the algorithm on these signals can be seen in Figure 2.5.

Although these three experiments do not give much information about the real performance of the algorithm, the simulation section of the next chapter, which describes an implicit speech segmentation algorithm, gives detailed success rates of this algorithm for different parameter settings.


Figure 2.5: The top, middle and bottom figures belong to the signals which contain two noise excited regions, two pulse excited regions, and both excitation and spectral change regions, respectively. The regions between the dashed lines are the ones flagged as non-stationary by the proposed algorithm.


2.4 Summary

In this chapter, a new spectral peak location estimation and spectrum non-stationarity detection algorithm based on LSFs is presented. The proposed algorithm consists of three parts:

1. Detection of the peak presence in the current LSF region.

2. Accurate estimation of the location of the peaks.

3. Detection of spectrum non-stationarity based on two speech variation measures using the peak presence in the LSF regions, the locations of the peaks, and the bandwidths of these regions.

The first part combines two different methods to detect the presence of a peak in an LSF region, comparing the measured values with experimentally found thresholds. By using this algorithm, the correct state of an LSF region can be found with 95% accuracy. The second part of the system obtains the locations of the peaks with a weighted mean of the LSFs. The weights are calculated by applying exponents, which are different for each LSF region, to the power spectrum values evaluated at the LSF locations. If a 25 Hz error is considered acceptable, the success rate of this algorithm is around 95% for voiced speech. These two parts also demonstrate the strong correlation between the displacement of the LSFs and the speech spectrum. The final part of the system uses the parameters extracted in the previous parts to detect spectrum non-stationarity. It is experimentally observed that the system can be tuned by adjusting the thresholds for the different requirements of the applications: The thresholds can be set to lower values to detect smaller changes, for use in speech segmentation systems, or set higher to detect only abrupt changes.

Finally, it must be stated that the computational complexity of the proposed algorithm is low: The LSFs can be extracted in an efficient way, as described in [41], and the computation of the rest of the parameters can be done without high complexity arithmetic operations.

In the next two chapters, two applications using this non-stationarity detection algorithm are presented. In Chapter 3, an implicit speech segmentation algorithm is discussed. The novel non-stationarity detector is used to find the transient regions of the speech signal in this segmentation system. As the second application, a variable bit-rate speech coder based on the mixed excitation linear prediction model is described. This vocoder utilizes a voice activity detector to detect the silence parts of the conversation. The new non-stationarity detector is used in the voice activity detector to provide non-stationary noise immunity to the vocoder.


Chapter 3

Speech Segmentation

Speech segmentation algorithms are necessary for several speech processing applications, to overcome problems such as computational complexity that grows with the nature of the algorithm. They can alleviate these problems by providing stationary regions of the speech signal.

In a speech recognition system, especially one designed for continuous speech recognition, the memory and computational complexity of the algorithms increase dramatically with increasing vocabulary size. In order to solve this problem, recognition of sub-word units, like diphones or triphones, combined with a segmentation system, may be used. In these systems, segmentation algorithms are used to segment the speech signal into the desired sub-word units, and recognition algorithms are applied to these segmented parts to extract the content of the speech signal.

Similarly, speech segmentation can be used in very low bit-rate speech coding systems. In these systems, speech is segmented into stationary regions and these regions are coded separately, sometimes with different algorithms according to the type of the phonological units [47]. Very low bit rates, as low as 150 b/s, can be achieved with segmentation based vocoders [48].


Segmentation methods can be roughly classified into two groups: implicit segmentation methods and explicit segmentation methods. Implicit segmentation methods split up the utterance into segments without the use of a phonetic transcription. These systems define segments as spectrally stable parts of the signal. References [5-8] are typical examples of this kind of segmentation algorithm. Explicit segmentation methods split up the incoming utterance into segments that are explicitly defined by a phonetic transcription. In general, explicit segmentation methods have the disadvantage that reference patterns have to be generated before the method can be used. Since explicit methods use reference patterns while implicit methods do not, explicit methods are expected to perform better. However, such patterns may not fit the utterance well and may not account for all the variability occurring in natural speech. Furthermore, in [2] both implicit and explicit methods are used together to obtain better results. The disadvantages and advantages of implicit and explicit methods are tabulated in Table 3.1.

Table 3.1: Some characteristics of the explicit and implicit segmentation methods.

Implicit segmentation:
• The method does not always give the correct number of segments.
• The segments are unlabelled.
• The segment boundaries are determined accurately enough for diphone segmentation.

Explicit segmentation:
• The method produces the number of segments given by the phonetic transcription.
• The segments are labeled in accordance with the phonetic transcription.
• The segment boundaries may be inaccurate due to a possible poor resemblance between the reference and test spectra.

In this chapter, a speech segmentation system based on the spectrum non-stationarity detection algorithm proposed in Chapter 2 is presented. The outline of this chapter is as follows: In Section 3.1, a brief description of phonological units is given. In the following section, Section 3.2, the new speech segmentation algorithm is described. Section 3.3 presents experimental results for this algorithm. Concluding remarks for this chapter are given in Section 3.4.


3.1 Phonological Units

When humans communicate with each other, they use meaningful words which are constructed by concatenating some basic sound units. These basic sound units are called phonemes. However, the human speech production system does not work solely to produce these sounds, and hence spoken words are not generally composed of these ideal sounds. Therefore, segments of pronounced words are usually not interchangeable. As an example, if the 'b' used in 'bet' is concatenated to 'ol' to produce 'bol', it sounds disjointed or weird. This phenomenon can be explained as follows: If a phoneme is spoken in isolation, the acoustic waveform of that phoneme can be distinguished without any difficulty. However, if phonemes are used in context, the boundaries between them can hardly be detected, owing to the behaviour of the speech articulators. Since the vocal tract articulators are formed by human tissue, the transition from one phoneme to another is controlled by muscle movements. Hence, the movement of the tissues generally slightly modifies the production of phonemes. Therefore, associated with each phoneme there is a collection of allophones (variations of phones) that represents acoustic variations on the basic unit. Allophones capture the degrees of freedom in the production of phonemes, which are related not only to the structure of the unit, but also to the position of the unit within the word. As a result, although phonemes are defined to be the basic units of speech production, the speaker has some degree of freedom in producing these sounds.

3.1.1 Phonemic and Phonetic Classification

Phonemes can be classified according to the following criteria:

1. Time waveform.

2. Spectral characteristics.

3. Manner of articulation.

4. Type of excitation.

5. The stationarity of the phoneme:


• Continuant: The vocal tract configuration is fixed during the production of the phoneme. Examples of this type of phoneme are vowels, fricatives and nasals.

• Noncontinuant: The vocal tract configuration changes during the production of the phoneme. Examples of this type of phoneme are diphthongs, liquids, glides and stops.

Furthermore, phonemes are generally classified according to the articulatory movement and their acoustical features:

Vowels and Vowel Like Phonemes

1. Vowels:

• Vowels have the highest energy among all phonemes.

• The duration of a vowel can vary from 40 to 400 ms.

• The variations in the cross-sectional areas of the vocal tract determine the spectral shape of the vowel.

• Vowel formant characteristics have great variation across different speakers.

• The length of the vocal tract affects the locations of the formants in the spectrum.

• The bandwidths of formants can also characterize vowels.

2. Diphthongs: They contain two target vowel formations. Hence, a diphthong can also be defined as a transition between two vowels, especially a transition of their formant structures.

3. Glides: They contain the transient part of one vowel, and their durations are short.

4. Liquids: They have spectral characteristics similar to vowels, but since the vocal tract is more constricted, they are weaker than vowels.


Consonants

1. Fricatives: They are produced by excitation of the vocal tract with a steady air stream. Turbulence may occur at points of constriction.

2. Nasals: They are produced by the glottal waveform exciting an open nasal cavity and a closed oral cavity. Due to the nature of the excitation source, they resemble vowels, but their energies are weaker due to the limited ability of the nasal cavity to radiate sound.

3. Stops: They are transient sounds that are produced by building up pressure behind a total closure somewhere along the vocal tract, and suddenly releasing this pressure. After the air pressure is released, there is a brief period of noise-like frication due to the sudden turbulence from the escaping air. Unvoiced plosives usually possess longer periods of frication than voiced stops. The frication and aspiration is called the stop-release. The interval of time leading up to the release, during which pressure is built up, is called the stop-gap.

3.1.2 Characterization of Segments and Boundaries

Characterization of Boundaries

Boundaries are formed due to changes in articulatory movement or spectral formation. These changes may occur in different forms:

1. An abrupt change, such as the termination or start of voicing.

2. Some degree of spectral changes:

• A variation of energy inside a frequency band.

• A fluctuating variation of formant locations.

• A loss of formant structure.


Segment Categories

1. Stationary segments.

2. Short segments.

3. Transient segments:

• Between two voiced phonemes: Monotonous changes occur among the formants.

• Between a voiced phoneme and an unvoiced phoneme: The formant structure and noise are superposed.

• Between a phoneme and silence: These regions occur at the end of a word or before a plosive.

3.2 Speech Segmentation System

In this section, a speech segmentation system based on the spectrum non-stationarity detector proposed in Chapter 2 is described. The proposed segmentation system is of the implicit type, i.e., it does not require any phonetic transcription, and the success of the algorithm directly depends on the pronunciation of the words to be segmented. The system does not incorporate an end-point detector, but the non-stationarity detector, in conjunction with a silence detector, may also be used to make accurate detection of the end-points of the utterances. In our system, the non-stationarity detector is used to detect end-points as well.

The system consists of two main parts. The first part detects transient regions by the same method described in Chapter 2. In the second part, a modified version of the same algorithm is applied sample-by-sample to find the exact locations of the boundaries.


3.2.1 Pre-Processing System

In the pre-processing system, the speech signal is processed frame-by-frame, and the non-stationarity detector is applied to each frame to detect any change in the spectrum with one frame delay. To detect minor changes, the thresholds, $\lambda$ and $\eta$, are set to 0.1 and 0.2, respectively. As discussed in Chapter 2, these values are determined experimentally from speech signals consisting of phonetically balanced words. The output of this block provides the transient regions explained in Section 3.1.

An example of the output of this block is given in Figure 3.1. In this experiment, the window length, the frame length and the overlap duration are set to 25 ms, 22.5 ms and 11.25 ms, respectively.

The proposed algorithm detects almost all of the non-stationary regions, including voiced-voiced transitions, voiced-unvoiced transitions and silence-voiced transitions. Further test results on the performance of the algorithm are given in Section 3.3.

3.2.2 Boundary Location Estimation

After the stationary parts are detected, another measure, based on the speech variation measures described in Section 2.2, is used to find the locations of the boundaries. This algorithm is applied to the transient regions of the signal on a sample-by-sample basis. For each sample, (2.6) and (2.16) are used to calculate $\Lambda$ and $\Gamma$ by taking the current sample, the $l^{th}$ sample, as the center of the frame. Furthermore, the next $\Lambda$ and $\Gamma$ values are also calculated by taking the $(l+L)^{th}$ sample as their center, where $L$ is the frame length. A new measure, $\Omega_l$, is defined for the $l^{th}$ sample as follows:

$$\Omega_l = \Lambda_l + \Lambda_{l+L} + \Gamma_l + \Gamma_{l+L} \qquad (3.1)$$

The sample, $l^*$, which maximizes $\Omega_l$ is selected as the boundary location:

$$l^* = \mathop{\arg\max}_{kL - (L - O) < l < (k+1)L} \{\Omega_l\} \qquad (3.2)$$


where $O$ represents the overlap duration and $k$ is the index of the processed frame in which the transient behaviour exists.
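A direct sketch of the search in (3.1)-(3.2); measures_at is a hypothetical callback returning $(\Lambda, \Gamma)$ for an analysis frame centred on a given sample:

```python
def boundary_sample(k, L, O, measures_at):
    """Scan the samples of transient frame k and return the one maximizing
    Omega_l of (3.1) over the search window of (3.2)."""
    best_l, best_omega = None, float('-inf')
    for l in range(k * L - (L - O) + 1, (k + 1) * L):
        lam0, gam0 = measures_at(l)          # measures for frame centred at l
        lam1, gam1 = measures_at(l + L)      # and at l + L, one frame ahead
        omega = lam0 + lam1 + gam0 + gam1    # Eq. (3.1)
        if omega > best_omega:
            best_omega, best_l = omega, l
    return best_l
```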

An example of this algorithm applied to the transient regions extracted by the pre-processing system is shown in Figure 3.2.

This algorithm can also be understood intuitively: For each sample, a frame whose center is the current sample is constructed, and the vocal tract filter, the corresponding LSFs and the required parameters are extracted. This process is also repeated for the samples which are one frame length away from the current sample in both directions in the time domain. After the calculation of the required parameters, $\Lambda$ and $\Gamma$ are calculated for both directions. Since the values of these parameters reflect the amount of change in the spectrum, the sample which yields the maximum change with respect to both the previous and the next frames is accepted as the boundary point. As the length of the transient regions may vary from 8.75 ms to 40 ms, the success of the algorithm depends on the selected frame length. If the selected frame length is close to the length of the transient region, the exact boundary location is obtained. Otherwise, large deviations from the exact boundary location may be encountered.

3.3 Simulation Studies

Simulation studies are performed on two signals consisting of 50 words which contain balanced phonological units, one for male and one for female speech. In all simulations, the window length and the frame update duration are set to 25 ms and 11.25 ms, respectively. The duration between the analyzed frame and the comparison frame is varied to test the performance of the system under different conditions. Success rates are computed over the transient regions extracted by the pre-processor system described in Section 3.2, since transient region estimation is more important in segmentation systems. Moreover, finding the accurate locations of boundaries is sometimes impossible even by hand.


Figure 3.1: Pre-processing system applied to 1.05 seconds of male speech containing the words "Firma tanıtımında". The regions between two dashed lines are the transient regions detected by the proposed algorithm.


Figure 3.2: Boundary location estimator applied to 1.05 seconds of male speech containing the words "Firma tanıtımında" after detection of the non-stationary regions. The dashed lines show the detected boundaries.


Experiments are performed for three different cases:

1. The duration between analyzed frames is set to 22.5 ms.

2. The duration between analyzed frames is set to 11.25 ms.

3. The duration between analyzed frames is set to 16.875 ms.

The results for these three configurations for male and female speech are presented in Table 3.2:

Table 3.2: Success rates for the estimation of the transient regions in the continuous speech signal. Both end-point detection and segmentation within words are performed by the pre-processing system of the new algorithm. P_E stands for the percentage of correctly estimated end-points. P_B stands for the percentage of correctly estimated segment boundaries. P_I stands for the percentage of insertions with respect to all detected non-stationary regions.

                                P_E     P_B     P_I
1st case for male speech      95.00   94.44   24.23
1st case for female speech    90.00   93.51   18.72
2nd case for male speech      76.00   62.03    8.33
2nd case for female speech    59.00   64.81    9.79
3rd case for male speech      85.00   85.18   14.61
3rd case for female speech    80.00   85.18   15.68

The proposed algorithm obtains the highest scores when the duration between analyzed frames is set to 22.5 ms. This is logical, since any increase in the distance between analyzed frames enables the algorithm to catch more transient regions whose transient durations are shorter than that distance. Unfortunately, since the thresholds $\lambda$ and $\eta$ are selected so that the algorithm detects even small spectrum changes, the number of accidentally detected regions also increases. It can be observed that in the $2^{nd}$ case the number of insertions decreases dramatically, since the difference between analyzed frames is decreased, but this also yields a large number of misses.

The best solution to this problem is to set the duration between the analyzed frames larger than the largest possible length of a transient region, and to use another approach such as Brandt's Generalized Likelihood Ratio (GLR) to eliminate the insertions and to find accurate locations of segment boundaries [6].


3.4 Summary

In this chapter, a simple speech segmentation algorithm is proposed. The algorithm consists of two main blocks:

1. A pre-processing system operating on the speech signal on a frame-by-frame basis to extract transient regions.

2. A boundary location estimator operating on the detected transient regions to extract the exact boundaries of the phonemes by sample-by-sample processing.

In the simulation studies, it is observed that an increase in the duration between analyzed frames gives the proposed algorithm the capability of catching more transient regions. Unfortunately, since the thresholds for the speech variation measures are set to small values to catch even minor spectral changes, the algorithm also classifies some non-transient regions as transient. In order to overcome this problem, more sophisticated algorithms can be used to eliminate these wrongly classified regions. Since most of these sophisticated algorithms require high computational complexity [5,6], they may be applied only in the regions flagged as transient by the pre-processing part of our algorithm, whose computational complexity is shown to be low. Furthermore, more parameters such as signal energy and pitch contour can be used to improve the performance of this system at additional computational cost.


Chapter 4

Variable Bit-Rate Mixed Excitation Linear Prediction Vocoder

Compression of telephone-bandwidth speech has been an ongoing area of research for several decades [49,50]. Especially in the last several years, with the improving speed and decreasing price of DSP microprocessors, many of the algorithms and coding methods formerly impossible to implement can be realized in real time. Most of these efforts are focused on the usual telephone bandwidth, which is between 200 Hz and 3.4 kHz.

Speech coding algorithms are classified according to their bit-rates:

• Large bit-rate : Bit-rates larger than 16 kb/s.

• Medium bit-rate : Bit-rates between 8 kb/s and 16 kb/s.

• Low bit-rate : Bit-rates between 8 kb/s and 2.4 kb/s.

• Very low bit-rate : Bit-rates below 2400 b/s.


Analysis-by-synthesis models can work well down to 4.8 kb/s, but the quality deteriorates rapidly below this rate. Therefore, low bit-rates can usually be achieved only by a parametric representation of speech. The most well-known and studied speech coding algorithm is the LPC-10 vocoder, based on the human speech production system described in Chapter 1.

Although the LPC-10 algorithm can encode speech intelligibly at 2.4 kb/s, its quality is usually unacceptable for many applications. After 1993, a competition for a new 2400 b/s vocoder standard to replace the old LPC10-E vocoder was started in the USA by the U.S. Department of Defense Digital Voice Processing Consortium (DoD-DDVPC). After a long period of testing and refinement of the candidate algorithms, Mixed Excitation Linear Predictive coding (MELP) was selected as the new 2400 b/s federal standard in 1997. The MELP algorithm was originally developed by Alan McCree and T.P. Barnwell at the Georgia Institute of Technology [51].

Variable Bit-Rate (VBR) coding is a special type of multi-mode coding scheme, in which each acoustic-phonetic class is encoded by a different coding algorithm and represented by a different number of bits. VBR coders are particularly useful for voice storage, code-division multiple access (CDMA) wireless networks, and packetized communication systems.

One of the most important parts of a VBR coder is the voice activity detector (VAD), which is used to detect the presence of a speech signal in the channel. In a typical two-way telephone conversation, the voicing activity is generally around 40 percent. Therefore, the average coding rate can be reduced by efficient coding of the silent regions.

In the past decades, several researchers have presented different variable rate coding techniques. Although various speech representation techniques are used, most of them are based on CELP coders [52]: the first VR-CELP coder was proposed by Vaseghi [53]. In his work, several versions of different bit-rate CELP coders, allocating different numbers of bits to the quantized LPC parameters and the excitation signal, are used to code different speech types. Besides this coder, several VR-CELP algorithms addressing different problems are reported in the literature: in the work of Cellario et al. [54], a VR-CELP coder for CDMA applications is demonstrated. Gomez et al. present a real-time implementation


of the Federal Standard FS1016 CELP coder which can also switch to higher and lower bit-rates according to the distortion in the synthesized speech [55]. In the work of Iacovo and Serena [56], an embedded variable rate CELP coding technique is presented. In their algorithm, an embedded bit-stream is used to provide robustness to packet loss in packetized transmission systems. Lupini et al. proposed a VR-CELP coder which selects its coding mode according to the conditions of both the input signal and the network [57]. In recent studies, McClellan and Gibson present a VR-CELP algorithm based on subband measures of spectral flatness using an entropy functional [58], and Kroon and Recchione implement a low-complexity toll-quality VR-CELP coder for CDMA cellular systems [59]. The majority of these coders use bit-rates varying from 16 kb/s down to 0.8 kb/s.

In addition to VR-CELP coders, several other variable rate coding algorithms based on other speech coding methods are reported: Peng and Cuperman present a variable rate coder based on a lattice low-delay vector excitation coding technique [60]. It is reported that it is possible to obtain good quality speech between 8 and 16 kb/s with this method. Francesco et al. present an algorithm which performs speech segmentation and coding at the same time with reasonable complexity using fast algebraic codes [61]. Their algorithm is reported to achieve an average bit-rate between 5 and 6 kb/s. Wang and Gersho propose a phonetically segmented vector excitation based vocoder which works at a 3.0 kb/s average bit-rate and whose subjective performance closely matches the 4.8 kb/s DoD CELP coder [62]. Another variable-rate subband coder is presented by Shen et al. [63]. At a 12 kb/s average bit-rate, this coder is reported to produce better quality speech than the QCELP coder and better quality music than both the QCELP and the full-rate GSM coder. In the work of Paksoy et al. [64], a variable rate multimodal speech coder based on the analysis-by-synthesis method is presented. This coder is reported to achieve a 3 kb/s average bit-rate with a quality comparable to the full-rate GSM coder. In the work of Yu and Chan [65], a variable rate coder based on the multiband excitation coding algorithm is presented. This coder utilizes different bit allocations for spectral quantization for different frame types, and it is capable of transmitting good quality speech at an average bit-rate of 1.24 kb/s. In addition to these coders, Villette et al. present a high quality split-band LPC vocoder in both fixed rate and variable rate versions [66]. The variable rate coder is reported to achieve an average bit-rate of 1.4 kb/s.


In this chapter, a variable bit-rate MELP vocoder based on the federal standard is presented. The parameter extraction scheme is the same as in the original MELP vocoder; after the frame type of the speech signal is detected, a different number of bits is assigned to each frame type. This encoding scheme reduces the bit-rate from 2400 b/s to approximately 1000 b/s without considerable loss of quality. Silence and background noise sections are detected by a novel voice activity detector, in which the non-stationarity detector proposed in Chapter 2 is used to extract the stationary segments in the signal. Our non-stationarity detector provides non-stationary background noise immunity to the voice activity detector in this vocoder.

The outline of this chapter is as follows: brief descriptions of the MELP vocoder and of voice activity detectors are given in Section 4.1 and Section 4.2, respectively. The design of the VBR-MELP vocoder and the novel VAD is presented in Section 4.3. Section 4.4 presents the performance of the new VBR vocoder.

4.1 Mixed Excitation Linear Prediction Vocoder

Although it is possible to obtain synthetic speech at 2400 b/s with the traditional LPC vocoder, its performance is unacceptable for many applications due to low quality. The problems of the traditional LPC vocoder can be summarized as follows:

• Lack of naturalness,

• tonal noise especially for female speakers,

• buzziness especially for male speakers,

• mechanical and tense sound,

• thumps,

• lack of transition control within frames, and

• poor noise robustness.


The first problems are due to spectral envelope mismatch. Since the LPC vocoder is based on an oversimplified model of the human voice generation system, it cannot produce the more complex sounds that humans generate.

Another problem in the LPC vocoder is wrong decisions by the voiced/unvoiced detector. Since none of the voiced/unvoiced decision algorithms can always find the true state of this switch, wrong decisions lead to deterioration of the quality of the synthesized speech. Unvoiced classification in the uncertain situations increases thumps, whereas voiced classification of these parts makes the synthesized speech sound buzzy.

In order to correct the above problems, more complex models must be used to simulate other properties of human voice generation system. Some of these problems can be eliminated as follows:

• Some phonemes, such as voiced fricatives (e.g. /z/ and /v/), have both voiced and unvoiced regions in the spectrum. Therefore, a mixture of voiced and unvoiced excitation can produce them more naturally [67]. The Fourier transform of a mixed-excited utterance can be seen in Figure 4.1; the spectrum contains both harmonic structure and noise in distinct bands. In addition, mixed excitation removes the voiced/unvoiced switch, which decreases the buzziness and thumps in the synthesized speech. Furthermore, mixed excitation improves the background noise immunity of the system [68].

• Since tonal noise occurs due to the strong periodicity of the impulse train, this problem can be eliminated by destroying the periodicity of the impulse train [69].

• The pulse shape of the periodic excitation waveform may be changed to a glottal excitation shape by spreading the energy of the excitation signal between consecutive pitch periods [68].

• Noise due to the mismatch of the LPC filter spectrum may be compensated by reducing the noise in the valleys of the LPC spectrum with a time-varying filter [70].


• Vocoders may also try to match the waveform of the original sequence in transitional regions by other kinds of methods, similar to [71].

The MELP vocoder implements all of these new features except the last one. Although the MELP vocoder has no special solution for transitional regions of speech, it can still produce good quality speech in those regions by using its new features.

Figure 4.1: Fourier transform of a mixed-excited phoneme.

4.1.1 Basic Synthesizer

The basic synthesizer of the MELP vocoder is shown in Figure 4.2. The voiced and unvoiced excitations are created separately. The only differences in the generation of the voiced excitation are that the periodicity is destroyed if the position jitter is activated, and that the pulse shape changes from frame to frame due to the different Fourier magnitudes. These two excitations are fed into spectral shaping band-pass filters. The band-pass filters are linear-phase FIR filters and are complements of each other; their sum therefore has unity magnitude in all bands. After this step, the two excitations are


summed together to form the mixed excitation. It is still possible to simulate the voiced/unvoiced switch of the LPC-10 encoder with this excitation generation method.
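The structure just described translates into a short C sketch. The complementary filter coefficients and the fir() helper here are illustrative assumptions; only the overall scheme (pulse and noise filtered by complementary band-pass filters and summed) follows the text.

#include <stdlib.h>

/* plain FIR convolution, y[i] = sum_k h[k] * x[i-k] */
static void fir(const double *h, int taps, const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        double acc = 0.0;
        for (int k = 0; k < taps && k <= i; k++)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}

/* One frame of mixed excitation: band-pass filtered pulse train plus the
 * complementary band-pass filtered white noise.  Because the two filters
 * sum to unity magnitude, an all-pulse or all-noise mixture reproduces
 * the old LPC-10 voiced/unvoiced switch as a special case. */
void mixed_excitation(const double *h_pulse, const double *h_noise, int taps,
                      const double *pulse, const double *noise,
                      double *out, int n)
{
    double *v = malloc(n * sizeof *v);
    double *u = malloc(n * sizeof *u);
    fir(h_pulse, taps, pulse, v, n);
    fir(h_noise, taps, noise, u, n);
    for (int i = 0; i < n; i++)
        out[i] = v[i] + u[i];
    free(v);
    free(u);
}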

After this step, the excitation is filtered with the adaptive spectral enhancement filter. The filtered excitation signal is spectrally shaped by the vocal tract filter, and after the energies of the original and synthesized signals are matched, the resulting signal is filtered with the pulse dispersion filter. All of these new blocks are described in detail in the following subsections.

This system is an enhanced version of the traditional LPC vocoder described in Chapter 1. To increase the performance of the old vocoder, these new blocks are added and the voiced/unvoiced switch is replaced with the mixed excitation model. Note that the mixed excitation model is successfully used in multiband excitation coding in the frequency domain as well [72].

Figure 4.2: Synthesizer of MELP Vocoder.

The new properties of the MELP vocoder can be summarized as follows:

• Mixed Excitation : Reduces buzz, matches the spectrum more accurately, and is more robust when operating in noisy environments.

• Aperiodic Pulses : Reduces tonal noise.


• Adaptive Spectral Enhancement Filter : Emphasizes formants and decreases the noise in the valleys of the LPC spectrum.

• Pulse Dispersion Filter : Modifies the pulse shape and spreads the energy of the excitation signal between consecutive pitch periods. Since it is spectrally flat, it has no influence on the unvoiced excitation signal.

• Fourier Magnitudes : Reduces the mismatch between the excitations of the original and synthesized speech.

4.1.2 Mixed Excitation

The mixed excitation model is mainly used for synthesizing natural-sounding speech, which is possible with more accurate matching of the original speech signal's spectrum by the synthesized speech signal's spectrum. In addition, this type of excitation eliminates the need for a voiced/unvoiced switch and reduces the thumps and buzziness in the synthesized speech caused by wrong voicing decisions. Because of the nature of the excitation generation, the new vocoder's performance is also superior to the traditional vocoder in noisy environments.

There are several versions of mixed excitation generation in the literature: in the work of Makhoul et al. [73], the periodic impulse train is low-pass filtered and the white noise sequence is filtered with a high-pass filter, where the cut-off frequencies of these filters are equal. The encoder tries to find the cut-off frequency for these filters that matches the original speech with the synthesized one. Since the cut-off frequency is varied in multiples of 500 Hz, the vocoder does not work well, and it is not robust to background noise, as it has only two bands. In the work of Kwon and Goldberg [67], some degree of voiced and unvoiced excitation is mixed in all bands. This approach tries to match the residual signal with the excitation signal of the synthesizer. The problem of this approach is that, because the entire spectrum has the same degree of voicing, there is no band-specific mixing. The first version of the MELP vocoder uses two filters as in [73]. However, in this implementation the zeros of these filters can be varied, so that the cut-off frequency is varied continuously and the


degree of mixing for the frequency bands can be adjusted. But this system is still not robust to background noise.

4.1.3 Aperiodic Pulses

There are short, isolated tones in the speech synthesized by the LPC-10 vocoder, especially for female speakers. These can be removed by adding noise to the low frequency region, but the artificial noise results in harsh speech. Another solution is to destroy the periodicity of the impulse train. However, when the periodicity is destroyed for all voiced frames, strongly voiced frames sound distorted. Therefore, an additional voicing state is added to the MELP vocoder [69]. With this jittery voiced flag, speech may be modeled with aperiodic pulses for the voiced excitation at the expense of 1 bit/frame. Aperiodic pulses are generated by varying each pitch period length by a pulse position jitter which is uniformly distributed within ±25% of the original pitch period value. Jittery voicing corresponds to erratic glottal pulses, so it can be detected from either marginal correlation or peakiness in the input speech. Peakiness is defined as the ratio of the RMS power to the average value of the full-wave rectified LPC residual signal. If the value of peakiness is lower than a predetermined value, the sequence is assigned to be aperiodic. This carefully controlled use of aperiodic pulses effectively removes the occasional tones from the synthetic speech without introducing any distortion.
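The peakiness measure itself is simple to compute; the sketch below follows the definition above directly (RMS power over the mean of the full-wave rectified residual). The decision threshold is not reproduced, since only its existence is stated in the text.

#include <math.h>

/* Peakiness of an LPC residual frame: RMS value divided by the mean
 * absolute (full-wave rectified) value. */
double peakiness(const double *residual, int n)
{
    double sq = 0.0, abs_sum = 0.0;
    for (int i = 0; i < n; i++) {
        sq += residual[i] * residual[i];
        abs_sum += fabs(residual[i]);
    }
    if (abs_sum == 0.0)
        return 0.0;
    return sqrt(sq / n) / (abs_sum / n);
}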

4.1.4 Adaptive Spectral Enhancement

There are various reasons for using the adaptive spectral enhancement filter:

1. This block compensates for the quantization error of the LPC coefficients. The adaptive filter widens the bandwidth and reduces the peakiness of the formant frequencies in the LPC spectrum. This increases speech quality because the mismatch in the spectrum is decreased.


2. The noise injected into the system is more audible in the frequency bands which correspond to the valleys in the spectrum of the LPC filter, due to the masking property of the human auditory system [74]. Since the spectral enhancement filter de-emphasizes those regions, the noise is less audible, and this increases the subjective quality of the synthesized speech.

3. This filter helps the band-pass filtered speech to match the natural speech waveform in formant regions. Typical formant resonances do not usually decay completely in the time between pitch pulses in either natural or synthesized speech, but the synthetic speech waveform reaches a lower valley between peaks than the natural waveform does [75]. The adaptive spectral enhancement filter corrects this undesired behavior.

These problems can be eliminated by varying the synthesis pole bandwidth within each pitch period. This can be done by replacing the 1/A(z) term of the LPC filter by 1/A(z/α), where α is smaller than 1. This operation moves the poles of the LPC filter away from the unit circle and weakens the pole resonances. Unfortunately, the resulting filter usually has a low-pass characteristic and therefore makes the speech sound hoarse. Therefore, an all-zero filter A(z/β), which has the same phase angles as the all-pole section but whose zeros are farther away from the unit circle than the poles, is also added. This removes the low-pass characteristic of the filter while preserving the emphasis on the formant frequencies. The resulting filter becomes:

H(z) = A(z/β) / A(z/α)        (4.1)

Generally α is selected to be 0.8 and β is selected to be 0.5. To reduce the low-pass effect further, a first-order FIR filter is added in cascade to the system:

H_tilt(z) = 1 − μ·z⁻¹        (4.2)

where μ is generally selected to be 0.5·k₁, with k₁ the first reflection coefficient. Since this filter produces a slightly high-pass spectral tilt, it helps to reduce the low-pass effect.

An excellent review of spectral enhancement filters can be found in [70], and a frequency domain version of this filter is presented in [76].
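A sketch of the complete enhancement stage of (4.1)-(4.2) is given below. The direct-form filtering and the per-frame interface are illustrative assumptions; the coefficient handling (bandwidth-expanded numerator and denominator, tilt from the first reflection coefficient) follows the equations above.

#include <string.h>

#define P 10  /* LPC order used throughout the document */

/* Apply H(z) = A(z/beta)/A(z/alpha) followed by H_tilt(z) = 1 - mu*z^-1.
 * a[] holds A(z) = 1 + a[1]z^-1 + ... + a[P]z^-P; filter state is not
 * carried across calls in this sketch. */
void spectral_enhance(const double a[P + 1], double alpha, double beta,
                      double k1, const double *x, double *y, int n)
{
    double num[P + 1], den[P + 1], w[P + 1] = {0.0};
    double wa = 1.0, wb = 1.0;
    for (int i = 0; i <= P; i++) {
        den[i] = a[i] * wa;   /* A(z/alpha): poles pulled inward, alpha = 0.8   */
        num[i] = a[i] * wb;   /* A(z/beta): zeros pulled further in, beta = 0.5 */
        wa *= alpha;
        wb *= beta;
    }
    const double mu = 0.5 * k1;  /* tilt from the first reflection coefficient */
    double prev = 0.0;
    for (int i = 0; i < n; i++) {
        double s = x[i];                      /* all-pole part 1/A(z/alpha) */
        for (int k = 1; k <= P; k++)
            s -= den[k] * w[k];
        double out = s;                       /* all-zero part A(z/beta) */
        for (int k = 1; k <= P; k++)
            out += num[k] * w[k];
        memmove(&w[2], &w[1], (P - 1) * sizeof w[0]);  /* shift state */
        w[1] = s;
        y[i] = out - mu * prev;               /* tilt filter of (4.2) */
        prev = out;
    }
}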


4.1.5 Pulse Dispersion Filter

One of the main reasons for the buzziness of the LPC vocoder is the purely impulsive excitation of voiced frames. With this type of excitation, the synthesized speech has higher peaks for the first few samples and decays so rapidly that it cannot match the original speech, especially in the frequency bands which do not have formant resonances [68]. This phenomenon can also be explained by the duration of the opening and closing of the vocal cords, which cannot be purely impulsive. In order to eliminate this problem, the shape of the impulsive excitation must be changed. Various researchers have tried to find the pulse shape which gives the highest quality in the synthesized speech [77-79]. The best pulse shape is found to be one which has no discontinuity and varies over approximately 50% of the pitch period [79].

In MELP, this spreading of the pulse across samples is performed by an FIR filter [68], continuously applied to the synthesized speech. The coefficients of this filter are obtained as follows:

First, the DFT of a triangular pulse is computed and the magnitude of the DFT is set to unity. Then, its inverse DFT is evaluated with the phase preserved. This filter produces less peaked synthesized speech; since the synthesized pulse does not decay rapidly, it matches the original speech to a higher degree.
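The design procedure translates directly into code. The sketch below uses a naive O(N²) DFT, which is adequate because the filter is designed once, offline; the 65-tap length and the exact triangle are assumptions, not values taken from the standard.

#include <math.h>
#include <complex.h>

#define NTAPS 65

/* Design the pulse dispersion filter: DFT of a triangular pulse, force
 * unity magnitude while preserving phase, inverse DFT back to time. */
void design_pulse_dispersion(double h[NTAPS])
{
    double tri[NTAPS];
    double half = (NTAPS - 1) / 2.0;
    for (int n = 0; n < NTAPS; n++)
        tri[n] = 1.0 - fabs(n - half) / half;      /* symmetric triangle */

    double complex X[NTAPS];
    for (int k = 0; k < NTAPS; k++) {              /* forward DFT */
        double complex acc = 0.0;
        for (int n = 0; n < NTAPS; n++)
            acc += tri[n] * cexp(-I * 2.0 * M_PI * k * n / NTAPS);
        X[k] = cabs(acc) > 0.0 ? acc / cabs(acc) : 1.0;  /* |X[k]| := 1 */
    }
    for (int n = 0; n < NTAPS; n++) {              /* inverse DFT */
        double complex acc = 0.0;
        for (int k = 0; k < NTAPS; k++)
            acc += X[k] * cexp(I * 2.0 * M_PI * k * n / NTAPS);
        h[n] = creal(acc) / NTAPS;
    }
}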

4.1.6 Fourier Series Magnitudes

This extension was first used to enhance the quality of the vocoder in its higher bit-rate version (4800 b/s). First, the original speech is filtered with the inverse linear prediction filter. Then, the DFT of the resulting residual signal is computed. Although the DFT of this signal is expected to be spectrally flat, this is not true in most cases. As a remark, if the sequence is considered to be periodic with a period equal to the size of the DFT, the DFT coefficients are exactly the same as the Fourier series expansion coefficients. To increase the quality of speech, the magnitudes of the peaks of the harmonics are also transmitted to the decoder. In the decoder, these quantized magnitude values are decoded and the impulse train is synthesized by computing the inverse DFT of these coefficients. The


magnitudes of the harmonics which are not transmitted are assumed to be unity in the decoder.

In the MELP vocoder, the magnitudes of only the first 10 harmonics are quantized with vector quantization and transmitted to the decoder [80].
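The sketch below shows the extraction side under simplifying assumptions: a naive 512-point DFT of the residual and a nearest-bin lookup at the pitch harmonics (the standard's peak search around each harmonic may differ); pitch_period is in samples.

#include <math.h>
#include <complex.h>

#define NFFT  512
#define NHARM 10

void fourier_magnitudes(const double *residual, int n, double pitch_period,
                        double mags[NHARM])
{
    double complex X[NFFT];
    for (int k = 0; k <= NFFT / 2; k++) {      /* DFT of zero-padded residual */
        double complex acc = 0.0;
        for (int i = 0; i < n && i < NFFT; i++)
            acc += residual[i] * cexp(-I * 2.0 * M_PI * k * i / NFFT);
        X[k] = acc;
    }
    for (int h = 1; h <= NHARM; h++) {
        int bin = (int)(h * (double)NFFT / pitch_period + 0.5); /* h-th harmonic */
        mags[h - 1] = (bin <= NFFT / 2) ? cabs(X[bin]) : 1.0;   /* unity if out of range */
    }
}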

4.1.7 Flowchart of the MELP Decoder

Figure 4.3: Flowchart of the MELP decoder.


4.1.8 Flowchart of the MELP Encoder

Figure 4.4: Flowchart of the MELP encoder. (Recoverable blocks from the diagram: preprocessor; subband decomposition into the 0-0.5, 0.5-1, 1-2, 2-3 and 3-4 kHz bands; envelope extraction; V/UV decision for the bands; fractional pitch analysis and final pitch estimate with pitch-doubling check and refinement; low-pass filtering at fc = 1 kHz and initial pitch period extraction; 10th-order LPC calculation with a Hamming window and 15 Hz bandwidth expansion; LPC-to-LSF conversion, quantization and LSF-to-LPC conversion; RMS calculation; peakiness calculation on the 1−A(z) residual; 512-point FFT; multiplexer and coder.)


4.1.9 Performance Evaluation

A detailed performance analysis of the MELP vocoder was carried out in [81]. In most of these tests, MELP outperforms LPC-10 and CVSD and has nearly the same performance as the higher bit-rate, 4.8 kb/s, FS1016 CELP standard. Moreover, in noisy environments MELP also outperforms the federal standard CELP algorithm. More subjective test results, from the subjective tests conducted by the TUBITAK-BILTEN Speech Processing Group, are given in Section 4.4.

4.2 Voice Activity Detectors

The Voice Activity Detector (VAD) is the most important part of VBR coders and DTX transmission systems. The basic assumptions behind the design of VAD algorithms can be summarized as follows:

• Speech is a non-stationary signal which changes in 20-30 ms periods.

• Background noise is stationary over much longer periods, i.e. the silences and pauses between "talk-spurts" in two-way conversations.

• The energy of the speech signal is usually higher than that of the background noise.

Based on these assumptions, it is possible to design VAD algorithms to detect silence gaps as well as background noise without speech. If the energy of the background noise in the environment is very low, a simple algorithm based on the signal energy level can be used to distinguish between silence and active speech periods. However, for high-energy non-stationary background noise, which is usually encountered in mobile communication systems, more sophisticated algorithms must be developed.

Another problem for a VAD algorithm is the discrimination of low-energy unvoiced sounds, like fricatives, at high background noise energy levels. Since it


is hard to detect these parts, a "hang-over time" is used, during which the VAD algorithm delays its decision to declare silence and continues to observe the signal until it decides that a transition from active speech to silence has occurred. This approach greatly reduces misclassification of weak fricatives.

The accuracy and robustness of the VAD determine the quality and capacity of the vocoder, so reliable silence detection is essential. If silence is detected as speech, the capacity is reduced; on the other hand, if speech is detected as silence, "clipping" and other degradations are introduced in the synthesized speech.

Another important consideration for VBR systems is the synthesis of non-speech segments. In order to preserve naturalness, background noise must be reproduced in some fashion. For stationary noise, it is sufficient to transmit the noise characteristics once in the beginning, and the decoder can produce "comfort noise" during non-speech intervals based on this information. For non-stationary background noise, the spectral parameters and gain of the noise must also be transmitted continuously at a very low bit rate.

The following parameters have been used in the literature for speech activity detection and silence discrimination:

• Energy of the first derivative of the signal [82],

• ratio between energies of consecutive frames [82],

• modeling error of the LPC filter [83],

• periodicity and the pitch period [83],

• optimum modeling order of the LPC filter [83],

• LPC cepstrum distance between analyzed frames

• energy and log-energy of the signal [4,84],

• zero-crossing rate [84],

• Teager’s energy measure [85], and

• a distance measure based on wavelet analysis [86].


It is experimentally observed that the distance measure given in the last entry works best in low SNR signals [86].

The first system based on a VAD, known as time assignment speech interpolation (TASI), was introduced to increase the capacity of the submarine telephone system used in analog telephony [87]. TASI was subsequently replaced by a similar digital system, known as the digital speech interpolation (DSI) system.

Several VAD algorithms have been presented in the literature. The early works are mostly concentrated on the on-off pattern extraction of speech; these algorithms do not provide any noise robustness. In Brady's work [88], the statistical distribution of spurts and gaps was obtained experimentally and a VAD algorithm was designed based on these statistical data. In Yatsuzuka's work [87], a speech and voice-band data discriminator for DSI-ADPCM systems was presented. This algorithm provides a highly sensitive speech detector with a decision system consisting of a finite state machine, but the system does not have any noise immunity. Another well-known VAD algorithm is used in the Pan-European Digital Cellular Mobile Telephone Service [89]. This algorithm was selected by CEPT-GSM to be used in DTX systems and is explained in detail in Section 4.3.1. A large number of variations of this algorithm are presented in the literature, most of which are reported to increase the noise immunity of the system. In the work of Paksoy et al. [90], energy levels in four distinct subbands are compared with corresponding thresholds, and if the threshold of any of these subbands is exceeded, the frame is declared as speech. Furthermore, the spectral flatness at the output of the noise suppression filter is also measured to detect speech activity. Finally, a variable hang-over time is utilized for different noise levels to improve the efficiency of the VAD. In the work of Cellario et al. [54], shorter frames are reported to have better performance in noise. In recent studies, more complicated algorithms are proposed: in the work of Sohn and Sung, a VAD algorithm employing a soft-decision-based noise spectrum adaptation is presented [91]. This algorithm is reported to have a good noise spectrum tracking property. In the work of Cavallaro et al. [92], a fuzzy logic based VAD algorithm is presented and reported to have better efficiency than the one presented in [89].


4.3 Variable Bit-Rate MELP Vocoder

The design of a variable bit-rate vocoder mainly concerns the selection of encoding modes to optimize the system in terms of both computational complexity and bit-rate reduction. In addition, the order of the mode selection and of the processing of the encoding algorithms must be decided.

Since our aim is to develop a variable bit-rate vocoder based on MELP without changing the overall system, a simple bit-rate reduction scheme is developed. The MELP vocoder separates its bit-stream format into two sections, as shown in Table 4.1:

Table 4.1: Bit allocation table for fixed bit-rate MELP vocoder.

Parameters                    Voiced   Unvoiced
LSFs                            25        25
Fourier Magnitudes               8        —
Gain (2 per frame)               8         8
Pitch, Overall Voicing           7         7
Bandpass Voicing                 4        —
Aperiodic Flag                   1        —
Error Protection                —        13
Sync Bit                         1         1
Total Bits / 22.5 ms frame      54        54

Synthesis of unvoiced sections does not require the Fourier magnitudes, pitch period, bandpass voicing decisions or aperiodic flag. Hence, in the fixed-rate coder these bits are used for error protection (13 bits/frame) and for transmission of the sequence type (7 bits/frame in the place of the pitch period). Since a header is always transmitted in a variable bit-rate vocoder, the transmission of pitch period information is unnecessary for unvoiced frames. Furthermore, error protection is also removed in our vocoder. In addition, the gain is generally stable in unvoiced sections, so transmission of the first gain parameter is also redundant. As a result, eliminating the transmission of these parameters decreases the required bits from 54 bits/frame to 30 bits/frame for unvoiced frames.
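As a quick check against Tables 4.1 and 4.3: dropping the error protection (13 bits), the pitch and overall voicing field (7 bits), the sync bit (1 bit) and the first of the two gain values (3 of the 8 gain bits, leaving 5) gives 54 − 13 − 7 − 1 − 3 = 30 bits/frame, i.e. the 25 LSF bits plus the 5 remaining gain bits; the 3 + 5 split of the gain bits is inferred from the two tables rather than stated explicitly here.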

In addition to the bit-rate reduction in unvoiced sections, further reduction can be achieved by efficient coding of the silence and background noise sections of the input sequence, which cover nearly 40 percent of a typical


conversation. For this purpose, the new VBR-MELP coder utilizes a voice activity detector to detect silent regions and encodes them efficiently to reduce the average bit-rate.

In the following subsections, a novel VAD algorithm for the MELP vocoder and the final bit allocation table for the variable bit-rate MELP vocoder are presented.

4.3.1 VAD for VBR-MELP Vocoder

The design of the proposed voice activity detector is based on two existing voice activity detectors. The first one was designed for DSI-ADPCM systems [87] and utilizes a finite state machine for the detection of silence parts. This detector uses the energy of the frame, the ratio of energies in consecutive frames and the zero-crossing number as its parameter set. Due to this selection of parameters, the detector does not have any noise robustness.

The second VAD, on which our algorithm is based, is the VAD used in the second phase of the GSM 6.10 standard [89]. This detector utilizes two separate detectors: one for giving the voice activity decision by comparing some parameters with predetermined thresholds, and one for adapting the inverse prediction filter of the background noise. In the first detector, the incoming signal is filtered with the inverse prediction filter of the background noise and the energy of the resulting signal is computed. The frame is assumed to contain a speech signal when this energy is larger than a threshold; otherwise, the frame is classified as noise. Furthermore, a 60-100 ms hang-over time is introduced to avoid clipping of the final parts of the utterances in the speech signal. In the second detector, the distance between the LPC filter computed from the averaged autocorrelation values of the noise sequence and the LPC filter computed for the current frame is calculated by the Itakura likelihood ratio. The frame is assumed to be stationary as long as this result is smaller than a pre-determined threshold. Furthermore, the periodicity of the signal is also checked. If the input frame is classified as non-stationary and does not contain any periodic structure, the frame is assumed to be the initial point of a noise sequence. Moreover, before updating the coefficients of the prediction filter of the noise, the system waits for 8 frames to ensure the continuity of the noise sequence. If all these conditions are met, the coefficients of


the prediction filter of the noise used in the first detector are updated. An illustration of this VAD can be seen in Figure 4.5. The computational complexity of this VAD is low: the parameters required for the algorithm are already extracted within the RPE-LTP vocoder, and the only computations performed within this system are those of the decision system.

Figure 4.5: The voice activity detector for the Pan-European digital cellular mobile telephone service.

The MELP algorithm extracts the following parameters within the encoder:

• Pitch period,

• bandpass voicing strengths,

• LSFs, and

• two gain calculations, one for the first and one for the second half of the frame.

In order to decrease the computational complexity, these parameters are used directly within our VAD.

The newly proposed VAD uses the distance measure D_k, defined as follows:

D_k = 10 · log₁₀ [ (1/L) · Σ_{l=1}^{L} E_l^k / σ_l² ]        (4.3)

This measure is based on the logarithm of the ratio of the signal energies in the subbands to the variance of the noise in the corresponding bands, and is a


modified version of the one used in [86]. σ_l² is the estimated variance of the background noise in the l-th subband, and E_l^k is the energy parameter of the k-th frame for the l-th subband over a time window, computed as follows:

E_l^k = (1/N_l) · Σ_{n=1}^{N_l} ( s_l^k[n] )² ,    l = 1, 2, ..., L        (4.4)

where s_l^k[n] is the l-th subband signal of the k-th frame and N_l is the number of samples in the subband. In our system there is no decimation in the subband decomposition, and N_l is equal to 90 samples, corresponding to half of a 22.5 ms MELP frame. As the signal is decomposed into 5 subbands, L is selected to be 5.
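Computing D_k from the subband energies is then a one-line loop; the sketch below assumes the 1/L normalization used in the reconstruction of (4.3) above.

#include <math.h>

#define NBANDS 5

/* Distance measure of (4.3): mean subband energy-to-noise-variance ratio,
 * in dB.  For well-adapted noise variances this hovers around 0 dB in
 * pure background noise. */
double distance_measure(const double energy[NBANDS],
                        const double noise_var[NBANDS])
{
    double ratio_sum = 0.0;
    for (int l = 0; l < NBANDS; l++)
        ratio_sum += energy[l] / noise_var[l];
    return 10.0 * log10(ratio_sum / NBANDS);
}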

The original method is reported to be successful for end-point detection in noisy environments. However, the end-point detection system described in [86] assumes that the initial few frames are always noise, and the noise variances in the subbands are extracted from these frames. Unfortunately, in a typical telephone conversation the first few frames may contain speech information, so it is impossible to make this kind of assumption in our system. To overcome this problem, the required noise variances in the subbands are extracted by a method similar to that of the GSM VAD.

The diagram of the new VAD is shown in Figure 4.6. The system has two main parts, labeled VAD1 and VAD2:

VAD1 is used to detect the presence of the speech signal. First, the energies in the 5 subbands are extracted for the first and the second half of the frame, and D_k is computed twice, once for each sub-frame, to obtain D_k1 and D_k2. Besides, an initial silence detector is used to detect inaudible signals by comparing the gain values with an experimentally derived threshold. This detector also provides some degree of echo suppression when no background noise is present: voiced sections and information tones always have high energies, so if a strong periodic structure is detected with a gain smaller than a second threshold, that frame is assumed to belong to an echo signal. The output of this initial silence detector and the D_k values are used by the decision system to make the final decision about the voicing state. The decision box contains a finite state machine consisting of four different states for silence detection, including a 'hang-over' state. States are updated twice in a 22.5 ms frame, once for the value of D_k1 and once for the value of D_k2.


Figure 4.6: Voice activity detector for VBR-MELP Vocoder.

VAD2 is used in parallel with VAD1 to update the noise variances in the 5 subbands. The adaptation is performed by the Noise Variance Adaptation block, which takes the required parameters from the Stationarity Check block, the Periodicity Check block and the decision of VAD1 for the previous frame. Since the proposed non-stationarity detector has a one-frame delay, VAD2 also has a one-frame delay. The frame length of this block is 180 samples, as in the MELP vocoder.

Details of the sub-blocks are described in the following subsections:

Initial Silence Detector

This block is used to detect inaudible frames prior to the application of the rest of the voice activity detector to the current frame. Furthermore, it uses a simple echo suppression logic to detect frames containing a quasi-periodic signal with a very low energy level, which is never encountered in voiced speech.


This block requires the gain values and the final estimated pitch period calculated in the MELP encoder. The gain values are compared with two gain thresholds, T_GL and T_GH. If both gain values are lower than T_GL, the analyzed frame is declared as silence. If either of them exceeds T_GH, classification of the state of the frame is left to the Speech/Silence Decision Block. If neither condition is met, the voicing strength measure of the extracted pitch period is checked [17]. If periodicity is detected, the frame is classified as an echo of the other channel; otherwise, classification of the state of the frame is left to the Speech/Silence Decision Block. The flowchart of the algorithm is plotted in Figure 4.7.

Figure 4.7: Flowchart of initial silence detector.


The thresholds T_GL and T_GH are found experimentally from a 2 minute conversation^ including echoes from the other channel. T_GL and T_GH are set to 38.0 dB and 48.0 dB, respectively.
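The logic of Figure 4.7 can be summarized in a few lines of C; the three-way return code is our paraphrase of the flowchart, and the voicing-strength test reuses the 0.65 strong-voicing value quoted later in this section as an assumption.

enum isd_result { ISD_SILENCE, ISD_ECHO, ISD_UNDECIDED };

enum isd_result initial_silence_detector(double gain1, double gain2,
                                         double voicing_strength)
{
    const double T_GL = 38.0, T_GH = 48.0;      /* gain thresholds, dB */

    if (gain1 < T_GL && gain2 < T_GL)
        return ISD_SILENCE;                     /* inaudible frame */
    if (gain1 > T_GH || gain2 > T_GH)
        return ISD_UNDECIDED;                   /* loud: defer to decision block */
    /* mid-level gain: a strongly periodic structure at such a low energy
     * is treated as an echo of the other channel */
    if (voicing_strength > 0.65)
        return ISD_ECHO;
    return ISD_UNDECIDED;
}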

Bandpass Energy Computation

This block computes the energies of the bandpass signals, E_l^k, for the frame. The signals are filtered in the encoder with sixth-order Butterworth filters whose passband regions are [0-500 Hz], [500-1000 Hz], [1000-2000 Hz], [2000-3000 Hz] and [3000-4000 Hz].

Energies are computed twice per frame, once for the first half of the frame and once for the second half. These values are used in the distance measure calculation, D_k, and in the Noise Variance Adaptation block.

Distance Measure Calculation

In this block, D_k defined in (4.3) is computed twice, for the first and the second half of the frame. The variances σ_l² are set to 1.0 for all bands prior to the first update, in order to classify all frames as speech-detected frames. The computed values, D_k1 and D_k2, are used in the decision block.

Speech/Silence Decision Block

In this block, D_kn and the decision of the initial silence detector are used to discriminate the silence frames in the conversation. For this purpose, a four-state finite state machine is developed:

^This conversation is taken from ’Switchboard Corpus - Recorded Telephone Conversations’ database collected by Texas Instruments


1. Silence (SI): These regions contain only background noise.

2. Primary Detection of Signal (PD): These regions are a primary detection of a signal which may contain information. If the system stays in this state for longer than a pre-determined duration, the 'Speech Enable' state is activated.

3. Speech Enable (SE): These regions contain speech signal.

4. Hang-over Period (HO): Silence is detected in these regions; however, to eliminate misclassification of the weak fricatives and final nasals at the end of speech, a hang-over period is inserted in the system.

State transitions are performed according to the values of two coefficients:

• PDF_k : Primary detection of speech for the frame. Its value is set to 0 if silence is declared in the initial silence detector or D_kn is below D_t1. Otherwise, it is set to 1.

• SDF_k : Definite presence of speech for the frame. Its value is set to 1 if D_kn exceeds D_t2. Otherwise, it is set to 0.

Setting of these coefficients can also be summarized as follows:

Table 4.2: Setting of the coefficients.

PDF_k = 0                Silence is declared in the initial silence detector, or D_kn < D_t1
PDF_k = 1                D_t1 ≤ D_kn < D_t2
PDF_k = 1, SDF_k = 1     D_t2 ≤ D_kn

The values of D_t1 and D_t2 are obtained experimentally. They are selected such that clipping of speech and misclassification of silence frames are minimized. The system is found to be optimal when D_t1 and D_t2 are set to 5.0 and 10.0, respectively.

The state transition diagram is given in Figure 4.8. The transitions from PD to SE and from HO to SI require some past information, i.e. memory. To go from PD to SE, PDF must be set to 1 for 20 half-frames, which corresponds to a


wait state of 225 ms. The transfer from HO to SI generally requires 45 ms in this system. This relatively short hang-over period is due to the robustness of the system to background noise. Furthermore, if the talk-spurt duration is shorter than 56.25 ms, this period is reduced to 22.5 ms. In simulation studies, it is observed that only stop-gaps are missed with this approach.

Figure 4.8: State transition diagram of the decision box. SI stands for the silence state, PD for the primary detection state, SE for speech-detected frames, and HO for the hang-over state. (Recoverable transition labels from the diagram: SDF_k = 1 takes PD directly to SE; PDF_k = 1 over 20 half-frames takes PD to SE; HO returns to SI after 4 half-frames of PDF_k = 0 when the talk-spurt exceeds 56.25 ms, and after 2 half-frames otherwise.)
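A compact C rendering of this state machine is sketched below; the counter values follow the text (20 half-frames of PDF = 1 for PD → SE, a 4 half-frame hang-over shortened to 2 for talk-spurts under 56.25 ms), while the exact tie-breaking between PDF and SDF inside PD is our assumption.

typedef enum { ST_SI, ST_PD, ST_SE, ST_HO } vad_state;

typedef struct {
    vad_state st;
    int pd_count;    /* consecutive PDF = 1 half-frames while in PD        */
    int ho_count;    /* remaining hang-over half-frames                    */
    int spurt_len;   /* current talk-spurt length, in 11.25 ms half-frames */
} vad_fsm;

void vad_fsm_update(vad_fsm *m, int pdf, int sdf)
{
    switch (m->st) {
    case ST_SI:
        if (pdf) { m->st = ST_PD; m->pd_count = 1; }
        break;
    case ST_PD:
        if (sdf) { m->st = ST_SE; m->spurt_len = 0; }   /* definite speech */
        else if (pdf && ++m->pd_count >= 20) {          /* 225 ms of PDF=1 */
            m->st = ST_SE; m->spurt_len = 0;
        }
        else if (!pdf) m->st = ST_SI;
        break;
    case ST_SE:
        m->spurt_len++;
        if (!pdf) {
            m->st = ST_HO;
            /* 45 ms hang-over in general, 22.5 ms after short talk-spurts */
            m->ho_count = (m->spurt_len * 11.25 > 56.25) ? 4 : 2;
        }
        break;
    case ST_HO:
        if (pdf) m->st = ST_SE;
        else if (--m->ho_count <= 0) m->st = ST_SI;
        break;
    }
}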

Periodicity Control

This block determines the state of periodicity by analyzing the bandpass voicing strengths of the bands and the voicing strength measure of the pitch period estimated in the encoder [17]. The flowchart of the algorithm is shown in Figure 4.9.

The flags vbp1, vbp2, vbp3 and vbp4 are the bandpass voicing strengths of the first 4 bands. Only the first bandpass voicing strength has a fractional value


Figure 4.9: Flowchart of the periodicity detector.

between 0 and 1; the others are set to either 0 or 1. The coefficient r_P3 stands for the voicing strength measure of the final estimated pitch. For strongly voiced frames, the value of this parameter exceeds 0.65.
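Since the exact branch structure of Figure 4.9 is not reproduced in the text, the sketch below is only a plausible reading: periodicity is declared for a strong final-pitch voicing strength, or for a clearly voiced low band supported by the upper bands. The 0.6 low-band value and the two-band majority rule are assumptions.

/* vbp1 is fractional in [0,1]; vbp2..vbp4 are 0/1 flags; rp3 is the
 * voicing strength of the final estimated pitch. */
int periodicity_detected(double vbp1, int vbp2, int vbp3, int vbp4, double rp3)
{
    if (rp3 > 0.65)                                 /* strongly voiced frame */
        return 1;
    if (vbp1 > 0.6 && (vbp2 + vbp3 + vbp4) >= 2)    /* assumed band rule */
        return 1;
    return 0;
}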

Stationarity Control

This block uses the non-stationarity detector algorithm described in Chapter 2. The thresholds Δ and η are both set to 1.0, in order to detect only abrupt changes. The output of this block has a one-frame delay.


Noise Variance Adaptation

This block adapts the noise variances in the subbands when the statistics of the background noise change. Before adaptation takes place, the following conditions must be met:

1. The signal must be stationary for a period of time longer than S_x · 22.5 ms, where S_x is the number of stationary frames the system waits before adaptation takes place.

2. The signal must not be periodic. Since information tones have long durations, they may be classified as long stationary regions; these regions must not be included in the noise adaptation.

3. The decision state of the final decision box in VAD1 must be the same for the previous two frames.

If these conditions are met, the noise variances in the five subbands are calculated and then continuously averaged in every frame as follows:

σ_l² = ( σ_l² · N_s + E_l^k ) / ( N_s + 1 )        (4.5)

where N_s is the number of stationary frames after the first frame of adaptation and the σ_l² on the right-hand side is the noise variance from the previous frame. This averaging is repeated until at least one of the three conditions above is violated.

The value of S_x ranges from 6 to 20. It is observed that 8 is a reasonable value, since practically no unvoiced phoneme lasts longer than 180 ms.
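The adaptation of (4.5) is a per-subband running average; a minimal sketch:

#define NBANDS 5

/* Fold the current frame's subband energies into the noise variance
 * estimates, as in (4.5).  n_stationary is N_s, the number of stationary
 * frames seen since adaptation started. */
void adapt_noise_variance(double noise_var[NBANDS],
                          const double energy[NBANDS], int n_stationary)
{
    for (int l = 0; l < NBANDS; l++)
        noise_var[l] = (noise_var[l] * n_stationary + energy[l])
                       / (n_stationary + 1);
}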

The flowchart of this block is illustrated in Figure 4.10.

4.3.2 Bit Allocation for VBR-MELP Vocoder

In this novel variable rate MELP coder, a two-bit header is used to classify the frame type: voiced, unvoiced or silence/noise. For the silence/noise frames,


Figure 4.10: Flowchart of the noise variance adaptation block. PFS and CFS stand for the state of the previous frame and the current frame, respectively.

the parameters of the first frame of the silence region are transmitted as if it were an unvoiced frame, and these parameters are reused until a header indicating a frame type other than silence/noise is encountered. Table 4.3 shows the final bit allocation.

Table 4.3: Bit allocation table for variable bit-rate MELP vocoder.

Parameters                    Voiced   Unvoiced   Silence/Noise
Header                           2         2            2
LSFs                            25        25           —
Fourier Magnitudes               8        —            —
Gain (2 per frame)               8         5           —
Pitch, Overall Voicing           7        —            —
Bandpass Voicing                 4        —            —
Aperiodic Flag                   1        —            —
Total Bits / 22.5 ms frame      55        32            2

4.4 Simulation Studies

As part of a sponsored project, a MELP vocoder, a slightly modified version of the federal standard, was implemented in both floating point and fixed point in the


C programming language at the TUBITAK-BILTEN Speech Processing Laboratory by my colleagues and me. Furthermore, the fixed-point C source code was ported to the TMS320C54x DSP architecture, and in order to obtain the highest efficiency, the entire code was hand-optimized in assembly language. Both the encoder and the decoder of this MELP vocoder are bitstream-compatible with the federal standard.

At the end of the project, subjective listening tests were conducted to compare the ACELP (a CELP-based vocoder whose quality is better than the FS1016 federal standard CELP coder), LPC-10 and MELP algorithms. The tests are evaluated using speech signals taken from two native Turkish speakers, one for male speech and one for female speech. Two types of tests are conducted:

1. Diagnostic Rhyme Test (DRT)

2. Mean Opinion Score (MOS)

The goal of these tests is to evaluate the intelligibility and quality of the output of the synthesizer. The final data are obtained by evaluating the test results from 20 subjects.

Noise robustness tests are performed as follows: noise recorded in a Volvo 340 car^ driven on a rainy asphalt road is mixed with the clean speech. The SNR of the resulting signal is approximately 10 dB. The resulting signals are used as the noisy test data in the following tests.

4.4.1 Diagnostic Rhyme Test

This test was carried out with 50 rhyming word pairs. Subjects are asked to identify which word of the pair was spoken. The classification of the word pairs is as follows:

^Institute for Perception, TNO, The Netherlands


1. Voiced/Unvoiced Difference: These words have only three letters, and the word pairs differ only at the beginning of the word. In these tests, subjects are asked to discriminate the letter pairs b-p, g-k, v-f, j-ş and d-t.

2. Nasality Difference: These words again have only three letters, and the word pairs differ only at the beginning of the word. In these tests, subjects are asked to discriminate the letter pairs m-b and n-d.

3. Sustained/Interrupted Difference: In these tests, words which have similar sounds but different meanings are used. In this case, one word of the pair takes longer to read.

4. Sibilated/Unsibilated Difference: In these tests, subjects are asked to discriminate the letter pairs z-t, p-k and j-g.

In this test, the following results are obtained:

Table 4.4: DRT scores of MELP and ACELP vocoders. WD stands for wrong decisions.

Male Speech                      ACELP    MELP
50 word pairs in clean speech     2 WD    3 WD
50 word pairs in noisy speech    29 WD    9 WD

Female Speech                    ACELP    MELP
50 word pairs in clean speech     7 WD    8 WD
50 word pairs in noisy speech    57 WD   33 WD

From these results, the ACELP vocoder is only 0.2% more intelligible than the MELP vocoder in clean speech. However, in noisy speech the MELP vocoder clearly outperforms ACELP: MELP is 4% and 4.8% more intelligible for male speech and female speech, respectively. (These percentages are consistent with 500 trials per condition, i.e. 50 word pairs each evaluated by 10 subjects per speaker, so that a difference of 20 wrong decisions corresponds to 4% and of 24 to 4.8%.)

4.4.2 Mean Opinion Score

This test was carried out with ten short, phonetically balanced sentences. In the test, subjects are asked to grade the quality of the synthesized speech by giving


marks between 1 and 5, where 5 indicates excellent and 1 indicates unacceptable. Subjects gave grades for the outputs of all three vocoders for male and female speech.

Results of these tests are as follows:

Table 4.5: MOS scores of MELP, ACELP and LPC-10 vocodersMale SpeechClean speechNoisy speech

Female SpeechClean speechNoisy speech

ACELP4.141.72

ACELP3.761.90

MELP4.184.33

MELP3.643.67

LPC-102..362.99

LPC-102.682.98

From these tables, it is observed that the MELP vocoder has a quality similar to the ACELP vocoder for clean speech. However, for noisy speech MELP outperforms the ACELP vocoder, and the output of the MELP vocoder is usually preferred over both the ACELP and LPC-10 vocoders in noisy environments.

4.4.3 Performance of VBR-MELP Vocoder

Conducting subjective tests for vocoders requires a considerable amount of work, and since these tests were already conducted for the fixed-rate version, tests are conducted only to obtain the performance of the voice activity detector at various SNR levels. In these tests, SNR values are calculated only from the portions that include speech. The percentages of clipped regions and misclassified noise regions are tabulated in Table 4.6. The test sequence is obtained from a 50 second telephone conversation^, 57 percent of which consists of only background noise. The noise source is the same as the one used in the subjective tests of the fixed-rate fixed-point MELP vocoder.

^This conversation is taken from the 'Switchboard Corpus - Recorded Telephone Conversations' database collected by Texas Instruments.


Table 4.6: Performance of the proposed VAD at various SNR levels for male speech. P_cl stands for the percentage of clipped regions with respect to the overall speech sections. P_mis stands for the percentage of missed regions with respect to the background noise sections.

SNR level (dB)   P_cl    P_mis   Avg. bit-rate
∞                3.84     3.76    977.83 bps
30               2.29     3.03    940.05 bps
20               2.35    13.23   1076.00 bps
15               2.41    20.36   1088.50 bps
10               1.83    10.81   1012.00 bps
5                9.92    19.09    998.00 bps

From these experiments, it is observed that our VAD works without any problem when the SNR of the sequence is higher than 10 dB. The clipped regions at these levels occur due to a long laugh, during which the detector treats these regions as noise and equates the noise variances to the energy of the speech in that region. Therefore, some utterances are missed due to this wrong adaptation. With the beginning of background noise, the system adapts its thresholds again and recovers. The misclassified silence regions are mostly due to the hang-over period and small energy variations in the background noise. Finally, it is observed that in nearly all cases the average bit-rate is around 1000 b/s, which provides a further 1:2.4 compression over the fixed-rate MELP vocoder without a considerable loss of quality.

4.5 Summary

In this chapter, a new variable bit-rate MELP vocoder based on the fixed bit-rate version is presented. In order to decrease the bit-rate, the unvoiced frames and background noise are encoded with fewer bits by eliminating the parameters which are not required in the synthesis of these frames.


In order to discriminate the speech signal from background noise, a novel voice activity detector which is robust to both stationary and non-stationary background noise is introduced. Since the VAD uses the parameters already extracted in the MELP encoder, its computational complexity is low. In the simulations, it is observed that the performance of the VAD remains high even if the SNR decreases down to 10 dB. Finally, with this new VBR-MELP vocoder, the average bit-rate is decreased from 2400 b/s to 1000 b/s in a typical telephone conversation.

As a final remark, since only three frame types are present in this system, a fourth one can also be defined to encode a different type of speech signal, such as onsets at the beginning of utterances, to increase the naturalness of the synthesized speech.


Chapter 5

Conclusion

In this thesis, a new frame-based speech spectrum non-stationarity detection algorithm based on line spectrum frequencies is developed. The proposed algorithm is used successfully in a speech segmentation system and a voice activity detector, specifically designed to work in a MELP vocoder.

The proposed algorithm works in three steps. In the first step, the system decides whether a formant is present between two consecutive LSFs by analyzing the angular difference between these LSFs and using the logarithmic energies of the spectrum at the LSF locations. By applying the decision system described in Section 2.1, the regions which contain formants are detected; the state of the regions is determined with 95% accuracy. In the second step, the exact locations of the peaks are calculated as the energy-weighted mean of the LSFs forming each region. To obtain the desired accuracy, the spectral values at the LSF locations are calculated by evaluating the prediction filter on the unit circle at those locations; then an exponent value, which is different for each LSF pair and is obtained experimentally, is applied to these values. With this method, approximately 95% of the formant locations are estimated within a 25 Hz error range. In the third step, two speech variation measures based on the estimated peak locations and the differences between consecutive LSFs are used to detect speech spectrum non-stationarity. It is observed that the success rate of the algorithm depends on the time difference between consecutive frames


and on the thresholds against which the computed values of the proposed speech variation measures are compared for the analyzed frames. These thresholds can be adjusted to change the sensitivity of the algorithm to spectrum changes.
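The second step, the peak location estimate, can be made concrete with a small sketch. The following is a minimal rendering of the energy-weighted mean described above, assuming LSFs in radians and a prediction filter A(z) given by a coefficient vector a with a[0] = 1; the function name and the 8 kHz sampling rate are assumptions of the sketch.

    import numpy as np

    def peak_location(region_lsfs, a, tau, fs=8000.0):
        """Energy-weighted mean of the LSFs forming a region (a sketch).

        region_lsfs : LSFs (rad) forming a region decided to hold a formant
        a           : prediction filter coefficients, a[0] = 1
        tau         : region-dependent exponent (values chosen in Appendix B)
        """
        w = np.asarray(region_lsfs, dtype=float)
        a = np.asarray(a, dtype=float)
        k = np.arange(len(a))
        # 1/|A(e^{jw})|^2: the inverse prediction filter power spectrum,
        # evaluated on the unit circle only at the LSF locations.
        S = np.array([1.0 / abs(np.sum(a * np.exp(-1j * wi * k))) ** 2
                      for wi in w])
        weights = S ** tau                 # the exponent sharpens the weighting
        w_peak = np.sum(weights * w) / np.sum(weights)
        return w_peak * fs / (2.0 * np.pi) # estimated peak location in Hz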

The proposed algorithm is first used in an implicit-type speech segmentation system. The system does not have any prior information about the input speech signal and hence makes no assumptions about it. The algorithm tries to locate the boundaries of the phonemes in two steps. In the first step, the non-stationary regions are extracted on a frame-by-frame basis, and in the second step, a modified version of the measures proposed in Chapter 2 is used to detect the exact boundaries of the phonemes. The success rate of this algorithm directly depends on the time difference between consecutive frames, because transient regions in the speech signal may vary from 8.75 ms to 40 ms. Unfortunately, increasing the time difference between consecutive frames yields a large number of insertions, which must be eliminated by a more complex method. The main advantage of this system, especially of its first step, is its low computational complexity, which makes it possible to use at the front end of a speech recognition system.
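As an illustration of the first step, non-stationary frames might be flagged as in the following sketch. The L1-style distances and the single thresholds are placeholders for the measures actually defined in Chapter 2; all names here are illustrative.

    import numpy as np

    def nonstationary_frames(peaks, lsf_diffs, t_peak, t_lsf):
        """Flag frames whose spectrum moved noticeably since the last frame.

        peaks     : (n_frames, n_regions) estimated formant locations (Hz)
        lsf_diffs : (n_frames, n_lsf - 1) differences of consecutive LSFs
        t_peak, t_lsf : thresholds controlling the sensitivity
        """
        d_peak = np.sum(np.abs(np.diff(peaks, axis=0)), axis=1)
        d_lsf = np.sum(np.abs(np.diff(lsf_diffs, axis=0)), axis=1)
        flags = (d_peak > t_peak) | (d_lsf > t_lsf)
        return np.concatenate(([False], flags))  # first frame: no predecessor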

In addition to the speech segmentation system, the proposed non-stationarity detection algorithm is used in the Voice Activity Detector (VAD) of a variable bit-rate MELP vocoder. This new VAD consists of two parts. The first part extracts the energies of the signal in five bands and computes a distance measure based on the variance of the noise in the subbands and the energy of the signal in the analyzed frame. These values are compared with experimentally derived thresholds to determine the state of the frame. The decision system is based on a finite state machine, which also includes a hang-over period. Prior to the calculations in this first part, a simple silence detector is utilized to detect inaudible frames and, to some degree, echo signals. The second part of the system updates the variances of the noise in the subbands by using the proposed non-stationarity detector: long sequences of background noise are searched for in the conversation, and the subband noise variances are updated from these portions. Nearly all of the parameters used in this VAD are already computed in the MELP encoder; hence, the computational complexity of the detector is low.
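A compressed sketch of one decision step of such a detector is given below, assuming the five band energies and the tracked noise variances are available from the encoder. The normalized distance, the constant k and the hang-over length are illustrative stand-ins for the experimentally derived thresholds, not the values used in the thesis.

    import numpy as np

    SILENCE, SPEECH, HANGOVER = 0, 1, 2

    def vad_step(band_energy, noise_var, state, hang, k=4.0, hang_frames=8):
        """One decision step of a hang-over VAD (a sketch; k and
        hang_frames stand in for the experimentally derived values)."""
        # Distance of the five band energies from the tracked noise floor,
        # normalized by the noise variance in each subband.
        d = np.sum((band_energy - noise_var) / np.maximum(noise_var, 1e-12))
        if d > k:                          # the frame looks like speech
            return SPEECH, hang_frames     # reload the hang-over counter
        if state in (SPEECH, HANGOVER) and hang > 0:
            return HANGOVER, hang - 1      # keep trailing frames as speech
        return SILENCE, 0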


In simulation studies, it is found that the proposed VAD has excellent performance until the input signal SNR decreases below the 10 dB level. Below this level, clipping becomes a severe problem.

As future work, the selection of the thresholds for the proposed speech variation measures and of the duration between analyzed frames used in the speech segmentation algorithm can be made adaptive to obtain better results. This will increase the computational complexity of the current method; however, we believe that the number of insertions and missed boundaries will decrease.


APPENDIX A

Threshold Extraction for Elimination of Misclassified LSF Regions

As discussed in Chapter 2, the elimination of misclassified regions is performed by comparing the bandwidth and εi, defined by (2.3), with some fixed thresholds. If these thresholds are exceeded, the state of the region is changed. Five types of thresholds are extracted from the speech database (a sketch of the combined decision rules is given after the list):

1. Ti : Threshold on the bandwidth of the i-th LSF region. If the bandwidth of an LSF region which is assigned to contain a peak by both methods exceeds this threshold, the state of this LSF region is changed.

2. γi : Lower bandwidth threshold of the bandwidth-based method for the i-th LSF region. If an LSF region is assigned to have a peak by only the bandwidth-based method and the bandwidth of this LSF region is smaller than this threshold, this region is assigned to contain a peak.

3. αi : Lower bandwidth threshold of the energy-based method for the i-th LSF region. If an LSF region is assigned to have a peak by only the energy-based method and the bandwidth of this LSF region is smaller than this threshold, this LSF region is assigned to contain a peak.

4. βi : Higher threshold on εi of the bandwidth-based method for the i-th LSF region. If an LSF region is assigned to have a peak by only the bandwidth-based method and εi of this LSF region is larger than this threshold, this LSF region is assigned to contain a peak.

5. ζi : Higher threshold on εi of the energy-based method for the i-th LSF region. If an LSF region is assigned to have a peak by only the energy-based method and εi of this LSF region is larger than this threshold, this LSF region is assigned to contain a peak.
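Read together, the five rules amount to the following per-region decision. How a lower bandwidth threshold and a higher εi threshold combine when both could apply to the same region is an assumption here (taken as a logical OR), as are the function and argument names.

    def region_contains_peak(by_bw, by_en, bw, eps, T, gamma, alpha, beta, zeta):
        """Final peak/no-peak state of one LSF region under the five rules.

        by_bw, by_en : True if the bandwidth / energy based method found a peak
        bw, eps      : region bandwidth and the measure eps defined by (2.3)
        """
        if by_bw and by_en:
            return bw <= T                     # rule 1: too wide -> state changed
        if by_bw:                              # bandwidth-based method only
            return bw < gamma or eps > beta    # rules 2 and 4
        if by_en:                              # energy-based method only
            return bw < alpha or eps > zeta    # rules 3 and 5
        return False                           # neither method found a peak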

To extract these thresholds, the percentages of correctly and falsely detected regions assigned to contain a peak are calculated as a function of the threshold and plotted. The thresholds which correct the maximum number of misclassified regions while introducing the minimum number of new misclassifications are selected. These figures are given on the following pages.

Figures in the first and second columns show the percentage of correct and wrong classification versus the threshold, respectively. Three pages are grouped to give the figures for one type of threshold, with the LSF regions presented sequentially on those pages. The selected thresholds are also given in the captions.


Figure A.1: Percentage versus Ti for the LSF regions 1, 2 and 3. The first column shows the percentage of correctly estimated regions which contain a peak; the second column shows the percentage of misclassified regions which contain no peak. T1 = 320, T2 = 300, T3 = 320. [Plots omitted; horizontal axis: difference between line spectrum pairs (Hz).]


Figure A.2: Percentage versus Ti for the LSF regions 4, 5 and 6. T4 = 330, T5 = 320, T6 = 340. [Plots omitted.]


Figure A.3: Percentage versus Ti for the LSF regions 7, 8 and 9. T7 = 325, T8 = 340, T9 = 400. [Plots omitted.]


Figure A.4: Percentage versus γi for the LSF regions 1, 2 and 3. γ1 = 130, γ2 = 106, γ3 = 164. [Plots omitted.]


Figure A.5: Percentage versus γi for the LSF regions 4, 5 and 6. γ4 = 130, γ5 = 160, γ6 = 140. [Plots omitted.]


Figure A.6: Percentage versus γi for the LSF regions 7, 8 and 9. γ7 = 190, γ8 = 148, γ9 = 165. [Plots omitted.]


Figure A.7: Percentage versus αi for the LSF regions 1, 2 and 3. α1 = 164, α2 = 130, α3 = 215. [Plots omitted.]


Figure A.8: Percentage versus αi for the LSF regions 4, 5 and 6. α4 = 250, α5 = 250, α6 = 170. [Plots omitted.]


Figure A.9: Percentage versus αi for the LSF regions 7, 8 and 9. α7 = 250, α8 = 120, α9 = 250. [Plots omitted.]


Figure A.10: Percentage versus βi for the LSF regions 1, 2 and 3. β1 = N/A, β2 = N/A, β3 = 78.5. [Plots omitted; horizontal axis: energy and bandwidth based distance measure (EBk).]


Figure A.11: Percentage versus βi for the LSF regions 4, 5 and 6. β4 = N/A, β5 = 80, β6 = 32. [Plots omitted.]


Figure A.12: Percentage versus βi for the LSF regions 7, 8 and 9. β7 = 25, β8 = 75, β9 = 25. [Plots omitted.]


Figure A.13: Percentage versus ζi for the LSF regions 1, 2 and 3. ζ1 = 100, ζ2 = 100, ζ3 = 62.5. [Plots omitted.]


Figure A.14: Percentage versus ζi for the LSF regions 4, 5 and 6. ζ4 = 21, ζ5 = 17.5, ζ6 = 40. [Plots omitted.]


Figure A.15: Percentage versus ζi for the LSF regions 7, 8 and 9. ζ7 = 17.5, ζ8 = 75, ζ9 = 25. [Plots omitted.]


APPENDIX B

Selection of Parameters for Minimizing Peak Location Estimation Error

In Chapter 2, a novel method for precise estimation of the peak location is proposed. In this appendix, the statistical data used to obtain the exponents τi for i = 1 · · · 9 are presented.

For each LSF region, the number of occurrences of the error between the estimated peak and the actual peak is calculated by (2.4) from the database for different values of τ, ranging from 0.15 to 1.75, for both voiced and entire speech. The value of τi which minimizes, or almost minimizes, the standard deviation and maximizes the percentage of errors smaller than 25 Hz is selected.
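The selection rule can be paraphrased in code as follows. The scalar score trading the two criteria off against each other is an illustrative choice, since the thesis selects τi by inspecting the plotted statistics rather than by a single formula.

    import numpy as np

    def select_tau(errors_by_tau, tol=25.0):
        """Pick the exponent with the best error statistics (a sketch).

        errors_by_tau : dict mapping each candidate tau to the array of
                        peak location errors (Hz) measured on the database
        """
        best_tau, best_score = None, -np.inf
        for tau, err in errors_by_tau.items():
            err = np.asarray(err, dtype=float)
            pct25 = 100.0 * np.mean(np.abs(err) < tol)
            # Favor many errors inside +/-25 Hz while penalizing a large
            # standard deviation; this scalar trade-off is illustrative.
            score = pct25 - np.std(err)
            if score > best_score:
                best_tau, best_score = tau, score
        return best_tau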

In the following tables, the mean, the standard deviation, the percentage of errors smaller than 25 Hz and the percentage of errors smaller than 50 Hz are tabulated for the simple averaging method, the energy-weighted mean method with τ = 0.15, and the energy-weighted mean method with τ = τi. The selected τi values are tabulated in Chapter 2. These tabulated data are also plotted versus τi, and the figures are given on the following pages.

The figures are divided into two sets. In the first set, the standard deviation, the percentage of errors smaller than 25 Hz and the percentage of errors smaller than 50 Hz are plotted. On each page, all figures in the first and second columns belong to voiced and entire speech, respectively. The first row shows the standard deviation versus τi, and the second and third rows show the percentage of peak location estimation errors smaller than 25 Hz and 50 Hz versus τi, respectively. LSF regions are presented sequentially on those pages. In the second set, the number of occurrences versus the error in peak location estimation is given. The column formation is the same as in the first set, and LSF regions are again presented sequentially.

Table B.1: Mean of the error (Hz) between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for voiced speech.

LSF region       1         2         3         4         5
Simple mean   -13.44    -25.04     14.05     18.55     -7.681
τ = 0.15       -9.322   -17.03     13.18     14.27     -5.467
τ = τi          5.656     7.956     4.643    -0.33      1.704

LSF region       6         7         8         9
Simple mean     3.326    -0.9552   -6.884    14.08
τ = 0.15        2.655     0.2175   -4.508    12.91
τ = τi         -0.569     3.094     3.654     4.535

Table B.2: Mean of the error (Hz) between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for entire speech.

LSF region       1         2         3         4         5
Simple mean   -34.017   -23.4      17.8       8.194     -7.959
τ = 0.15      -25.67    -16.06     15.34      6.003     -6.062
τ = τi          7.266     7.497     2.394    -1.096      1.179

LSF region       6         7         8         9
Simple mean     4.951    -0.535    -1.565    14.21
τ = 0.15        4.086     0.266    -0.703    13.03
τ = τi         -0.177     2.395     1.339     5.132


Table B.3: Standard deviation of the error (Hz) between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for voiced speech.

LSF region       1         2         3         4         5
Simple mean    37.16     38.42     55.46     62.95     52.22
τ = 0.15       28.9      31.03     44.84     51.52     40.84
τ = τi          8.191     9.235    13.92     18.05     12.98

LSF region       6         7         8         9
Simple mean    64.57     55.41     67.91     55.43
τ = 0.15       53.46     44.16     56.62     44.84
τ = τi         16.85     13.44     17.14     13.93

Table B.4: Standard deviation of the error (Hz) between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for entire speech.

LSF region       1         2         3         4         5
Simple mean    43.74     43.1      58.4      71.45     56.33
τ = 0.15       34.73     35.4      48.49     61.01     45.96
τ = τi          9.047    10.35     13.43     21.39     13.64

LSF region       6         7         8         9
Simple mean    69.4      56.71     71.09     54.44
τ = 0.15       59.38     46.58     61.09     44.93
τ = τi         19.31     13.73     19.87     13.33

Table B.5: Percentage of errors smaller than 25 Hz between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for voiced speech.

LSF region       1         2         3         4         5
Simple mean    58.23     56.69     30.29     29.43     39.16
τ = 0.15       67.45     67.04     38.57     38.29     49.17
τ = τi         99.6      98.23     95.28     91.69     97.56

LSF region       6         7         8         9       All regions
Simple mean    25.88     35.41     22.19     38.65     41.07
τ = 0.15       31.62     44.32     27.92     47.77     50.15
τ = τi         92.6      97.32     91.52     98.12     97.07


Table B.6: Percentage of errors smaller than 25 Hz between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for entire speech.

LSF region       1         2         3         4         5
Simple mean    43.74     50.71     30.68     24.08     34.98
τ = 0.15       57.29     61.62     37.33     29.99     43.29
τ = τi         99.06     97.44     95.45     86.43     96.2

LSF region       6         7         8         9       All regions
Simple mean    23.74     33.89     20.96     38.48     36.16
τ = 0.15       28.68     41.84     25.5      46.73     45.25
τ = τi         88.83     96.15     86.68     97.71     95.72

Table B.7: Percentage of errors smaller than 50 Hz between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for voiced speech.

LSF region       1         2         3         4         5
Simple mean    81.46     83.36     58.16     55.43     65.23
τ = 0.15       90.23     88.33     70.76     68.21     78.03
τ = τi         99.99     99.81     99.36     97.83     99.69

LSF region       6         7         8         9       All regions
Simple mean    49.12     61.9      45.63     66.03     66.46
τ = 0.15       61.74     73.45     56.16     75.43     77.14
τ = τi         98.55     99.76     98.72     99.78     99.55

Table B.8: Percentage of errors smaller than 50 Hz between the actual and estimated peak locations for the simple mean technique, τ = 0.15 and τ = τi, for entire speech.

LSF region       1         2         3         4         5
Simple mean    74.58     79.02     56.66     47.18     60.77
τ = 0.15       84.82     84.51     67.24     57.05     72.05
τ = τi         99.97     99.70     99.58     95.82     99.44

LSF region       6         7         8         9       All regions
Simple mean    45.74     60.75     43.36     66.82     63.42
τ = 0.15       56        71.36     52.48     75.79     73.46
τ = τi         97.27     99.54     97.09     99.71     99.17


Figure B.1: Statistics of the error in peak location estimation for LSF region 1. The first and second columns present statistical data for voiced frames and all frames, respectively. The first index in all figures corresponds to the data extracted by simple averaging. The first row shows the standard deviation of the error versus varying τ values; the second and third rows show the percentage of peak location estimation errors smaller than 25 Hz and 50 Hz versus varying τ values, respectively. τ1 is selected as 2.5. [Plots omitted; horizontal axis: exponent used in evaluation of the inverse prediction filter power spectrum.]


Figure B.2: Statistics of the error in peak location estimation for LSF region 2. τ2 is selected as 2.6. [Plots omitted.]


Figure B.3: Statistics of the error in peak location estimation for LSF region 3. τ3 is selected as 2.25. [Plots omitted.]


Figure B.4: Statistics of the error in peak location estimation for LSF region 4. τ4 is selected as 2.6. [Plots omitted.]


Figure B.5: Statistics of the error in peak location estimation for LSF region 5. τ5 is selected as 2.35. [Plots omitted.]


Figure B.6: Statistics of the error in peak location estimation for LSF region 6. τ6 is selected as 3.0. [Plots omitted.]


Figure B.7: Statistics of the error in peak location estimation for LSF region 7. τ7 is selected as 2.35. [Plots omitted.]


Figure B.8: Statistics of the error in peak location estimation for LSF region 8. τ8 is selected as 2.9. [Plots omitted.]


Figure B.9: Statistics of the error in peak location estimation for LSF region 9. τ9 is selected as 2.3. [Plots omitted.]


Figure B.10: Number of occurrences versus the difference between the original and estimated peaks for the LSF regions 1, 2 and 3. The first and second columns present statistical data for voiced frames and all frames, respectively. The solid, dashed and dash-dotted lines correspond to the simple mean, the weighted mean with τ = 0.15 and the weighted mean with τ = τi, respectively. τ1 = 2.5, τ2 = 2.6 and τ3 = 2.25. [Plots omitted; horizontal axis: difference between actual and estimated peak (Hz); vertical axis: number of occurrences.]


Figure B.11: Number of occurrences versus the difference between the original and estimated peaks for the LSF regions 4, 5 and 6. The solid, dashed and dash-dotted lines correspond to the simple mean, the weighted mean with τ = 0.15 and the weighted mean with τ = τi, respectively. τ4 = 2.6, τ5 = 2.35 and τ6 = 3.0. [Plots omitted.]


Figure B.12: Number of occurrences versus the difference between the original and estimated peaks for the LSF regions 7, 8 and 9. The solid, dashed and dash-dotted lines correspond to the simple mean, the weighted mean with τ = 0.15 and the weighted mean with τ = τi, respectively. τ7 = 2.35, τ8 = 2.9 and τ9 = 2.3. [Plots omitted.]

