Regional Corpus of Modern Standard Arabic
Abstract
Until recently, only two Arabic corpora were commonly available to researchers: the Arabic Newswire from Agence France-Presse (AFP), distributed by the Linguistic Data Consortium (LDC), and the Al-Hayat newspaper collection from the European Language Resources Distribution Agency (ELDA). The availability of an appropriate corpus is key to much objective research in language engineering and in any other field related to natural language. This article presents experimental results from comparing corpora of Modern Standard Arabic (MSA) collected from samples of newspapers published online in different Arab countries. The results show significant differences in vocabulary and style across regions. In-depth study of these differences will allow a better understanding of the language and will have implications for a range of computational and linguistic research. Developing adequate resources is more crucial than ever to carrying out this task.