Adding New Words Into A Language Model Using Parameters Of Known Words With Similar Behavior
Abstract
This article presents a study on how to automatically add new words to a language model without re-training it or adapting it (which would require a large amount of new data). The proposed approach consists of finding, for each new word to be added, a list of similar known words. Based on a small set of sentences containing the new words and on a set of n-gram counts containing the known words, we search for the known words whose neighbor distributions (over the few preceding and few following words) are most similar to those of the new words. The similar words are determined by computing Kullback-Leibler (KL) divergences between these neighbor-word distributions. The n-gram parameter values associated with the similar words are then used to define the n-gram parameter values of the new words. In the context of speech recognition, a performance assessment on an LVCSR task shows the benefit of the proposed approach.
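To make the similarity computation concrete, the sketch below (in Python) illustrates the kind of procedure the abstract describes: the neighbor-word distribution of a new word is estimated from its few example sentences, compared by KL divergence to the neighbor distributions of known words, and the closest known words are returned. This is an illustrative reconstruction, not the authors' implementation; the whitespace-tokenized corpus format, the window size, the additive smoothing constant, and all function names are assumptions.

import math
from collections import Counter

def neighbor_distribution(word, sentences, window=2):
    # Count the words occurring within `window` positions before or after `word`.
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != word:
                continue
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def kl_divergence(p_counts, q_counts, alpha=1e-6):
    # KL(P || Q) over the union of observed neighbors; additive smoothing
    # (an assumption of this sketch) keeps unseen neighbors from producing
    # infinite divergences.
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl

def most_similar_known_words(new_word, new_word_sentences, known_words, corpus_sentences, top_k=5):
    # Rank known words by how close their neighbor distribution is to that of
    # the new word; the top-ranked words would then lend their n-gram
    # parameter values to the new word.
    p = neighbor_distribution(new_word, new_word_sentences)
    scored = [(kl_divergence(p, neighbor_distribution(w, corpus_sentences)), w)
              for w in known_words]
    return [w for _, w in sorted(scored)[:top_k]]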
Title: Textual Data Selection For Language Modelling In The Scope Of Automatic Speech Recognition
Authors: Freha Mezzoudj, David Langlois, Denis Jouvet, Abdelkader Benyettou
Abstract
The language model is an important module in many applications that produce natural language text, in particular speech recognition. Training language models requires large amounts of textual data that matches the target domain. Selection of target-domain (or in-domain) data has been investigated in the past. For example, [1] proposed a criterion based on the difference of cross-entropy between models representing in-domain and non-domain-specific data. However, evaluations were conducted using only two sources of data: one corresponding to the in-domain data, and another one to generic data from which sentences are selected. In the scope of broadcast news and TV show transcription systems, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models representing the in-domain and non-domain-specific data is critical. Moreover, it is better to apply the data selection only to some selected data sources. This way, the selection process leads to an improvement of 8.3 in terms of perplexity and 0.2% in terms of word error rate on the French broadcast transcription task.
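As an illustration of the cross-entropy difference criterion of [1] on which this selection process builds, the sketch below scores each candidate sentence by the difference of its per-word cross-entropies under an in-domain model and a non-domain-specific model, and keeps the sentences that look most in-domain. This is a schematic reconstruction, not the system evaluated in the paper: the language-model objects and their logprob(word, history) method are assumed interfaces, and the threshold is illustrative.

def cross_entropy(tokens, lm):
    # Average negative log-probability per word under `lm`; the
    # lm.logprob(word, history) interface is an assumption of this sketch.
    history = []
    total = 0.0
    for word in tokens:
        total -= lm.logprob(word, tuple(history))
        history.append(word)
    return total / max(1, len(tokens))

def moore_lewis_select(candidate_sentences, in_domain_lm, generic_lm, threshold=0.0):
    # Keep sentences whose score H_in(s) - H_gen(s) falls below the
    # threshold, i.e. sentences better explained by the in-domain model
    # than by the generic one; lower scores mean more in-domain.
    kept = []
    for sentence in candidate_sentences:
        tokens = sentence.split()
        score = cross_entropy(tokens, in_domain_lm) - cross_entropy(tokens, generic_lm)
        if score < threshold:
            kept.append((score, sentence))
    return [sentence for _, sentence in sorted(kept)]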
Keywords
data selection process; language models; speech transcription
References
R. C. Moore and W. Lewis, Intelligent selection of language model training data, in Proceedings of the ACL 2010 Conference Short Papers. Association for Computational Linguistics, 2010, pp. 220-224.
L. Lamel, J. Gauvain, V. Le, I. Oparin, and S. Meng, Improved models for Mandarin speech-to-text transcription, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 4660-4663.
A. Rousseau, P. Deléglise, and Y. Estève, Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks, in Proceedings of LREC, 2014, pp. 3935-3939.
B. Dalvi, C. Xing, and J. Callan, A language modeling approach to entity recognition and disambiguation for search queries, in Proceedings of the First International Workshop on Entity Recognition and Disambiguation. ACM, 2014, pp. 45-54.
P. Koehn and B. Haddow, Towards effective use of training data in statistical machine translation. Association for Computational Linguistics, 2012, pp. 317-321.
P. Goyal, L. Behera, and T. M. McGinnity, A novel neighborhood based document smoothing model for information retrieval, Information Retrieval, vol. 16, no. 3, pp. 391-425, 2013.
R. D. Brown, Finding and identifying text in 900+ languages, Digital Investigation, vol. 9, p. S34, 2012.
M. Hamdani, P. Doetsch, M. Kozielski, A. E. D. Mousa, and H. Ney, The RWTH large vocabulary Arabic handwriting recognition system, in Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE, 2014, pp. 111-115.
R. Rosenfeld, Two decades of statistical language modeling: where do we go from here?, Proceedings of the IEEE, vol. 88, no. 8, 2000.
L. R. Rabiner and B. Juang, Statistical methods for the recognition and understanding of speech, Encyclopedia of Language and Linguistics, 2004.
S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.-F. Bonastre, and G. Gravier, The ESTER phase II evaluation campaign for the rich transcription of French broadcast news, in Interspeech, 2005, pp. 1149-1152.
S. Galliano, G. Gravier, and L. Chaubard, The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts, in Interspeech, vol. 9, 2009, pp. 2583-2586.
G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel, and O. Galibert, The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, in LREC - Eighth International Conference on Language Resources and Evaluation, 2012.
Y. Estève, T. Bazillon, J.-Y. Antoine, F. Béchet, and J. Farinas, The EPAC corpus: manual and automatic annotations of conversational speech in French broadcast news, in LREC, 2010.
T. Bazillon, Transcription et traitement manuel de la parole spontanée pour sa reconnaissance automatique [Transcription and manual processing of spontaneous speech for its automatic recognition], Ph.D. dissertation, Université du Maine, 2011.
D. Klakow, Selecting articles from the language model training corpus, in Acoustics, Speech, and Signal Processing (ICASSP 2000), Proceedings of the 2000 IEEE International Conference on, vol. 3. IEEE, 2000, pp. 1695-1698.
X. Shen and B. Xu, The study of the effect of training set on statistical language modeling for Chinese, June 2000. [Online]. Available: http://research.microsoft.com/apps/default.aspx?id=68833
J. Gao, J. Goodman, M. Li, and K.-F. Lee, Toward a unified approach to statistical language modeling for Chinese, ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 3-33, 2002.
K. Yasuda, R. Zhang, H. Yamamoto, and E. Sumita, Method of selecting training data to build a compact and efficient translation model, in IJCNLP, 2008, pp. 655-660.
G. Foster, C. Goutte, and R. Kuhn, Discriminative instance weighting for domain adaptation in statistical machine translation, in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 451-459.
A. Axelrod, X. He, and J. Gao, Domain adaptation via pseudo in-domain data selection, in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 355-362.
H. Schwenk, A. Rousseau, and M. Attik, Large, pruned or continuous space language models on a GPU for statistical machine translation, in Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT. Association for Computational Linguistics, 2012, pp. 11-19.
A. Mendonça, D. Graff, and D. DiPersio, French Gigaword second edition, Linguistic Data Consortium, web download, 2009.
D. Jouvet and D. Langlois, A machine learning based approach for vocabulary selection for speech transcription, in Text, Speech, and Dialogue. Springer, 2013, pp. 60-67.
A. Stolcke, SRILM - an extensible language modeling toolkit, in Interspeech, 2002.
A. Stolcke, J. Zheng, W. Wang, and V. Abrash, SRILM at sixteen: update and outlook, in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2011, p. 5.
S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1996, pp. 310-318.