1. Machine Translation:
- is in Buckwalter format.
- More than 6000 sentences.
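Since the corpus is distributed in Buckwalter format, readers may need to convert it back to Arabic script. The following is a minimal sketch of such a conversion, assuming the standard Buckwalter symbol table for letters and common diacritics; the function names are illustrative and are not part of the corpus distribution.

```python
# Standard Buckwalter symbols for Arabic letters and common diacritics.
# Arabic characters are written as Unicode escapes to avoid encoding issues.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062A",
    "v": "\u062B", "j": "\u062C", "H": "\u062D", "x": "\u062E", "d": "\u062F",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063A", "_": "\u0640", "f": "\u0641", "q": "\u0642", "k": "\u0643",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064A", "F": "\u064B", "N": "\u064C", "K": "\u064D",
    "a": "\u064E", "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
}

# Reverse table for converting Arabic script back to Buckwalter.
ARABIC_TO_BUCKWALTER = {ar: bw for bw, ar in BUCKWALTER_TO_ARABIC.items()}


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic; leave other characters unchanged."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text):
    """Map each Arabic character to Buckwalter; leave other characters unchanged."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


# Hypothetical input: "ktAb" is the Buckwalter spelling of the word for "book".
word = buckwalter_to_arabic("ktAb")
```

Characters outside the table (spaces, digits) pass through unchanged, so whole sentences can be converted line by line.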
PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built in the framework of the National Research Project "TORJMAN", code: 24/u23/1902, led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research.
- It contains six dialects in addition to MSA (Modern Standard Arabic).
I prepared this corpus to run experiments on Topic Identification for the Arabic language. It was extracted from thousands of articles downloaded from an online newspaper.
- M. Abbas, K. Smaili. Comparison of Topic Identification Methods for Arabic Language, International Conference RANLP05: Recent Advances in Natural Language Processing, 21-23 September 2005, Borovets, Bulgaria. [pdf]
- M. Abbas, K. Smaili, D. Berkani. Multi-category support vector machines for identifying Arabic topics, 10th International Conference on Intelligent Text Processing and Computational Linguistics - CICLing 2009 (2009), Mexico [pdf]
- M. Abbas, D. Berkani. Topic Identification by Statistical Methods for Arabic Language. WSEAS Transactions on Computers, Issue 9, Volume 5, pp. 1908-1913, 2006. [pdf]
| Topic | Corpus size (number of documents) |
| International News | 953 |
| Local News | 2398 |
| Economy | 909 |
| Total number of docs | 5690 |
The Watan-2004 corpus contains about 20,000 articles covering the following six topics (categories):
Culture, Religion, Economy, Local News, International News and Sports.
Punctuation has been intentionally omitted from this corpus to make it suitable for Language Modeling.
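Because the text carries no punctuation, whitespace tokenization is enough to collect n-gram statistics. The sketch below estimates bigram probabilities by maximum likelihood. It is an illustration only: the sample lines are made up (they are not taken from the corpus), the function names are hypothetical, and a real language model would also need smoothing.

```python
from collections import Counter


def bigram_counts(lines):
    """Count unigrams and bigrams over whitespace-tokenized lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        # Sentence boundary markers let the model learn line starts and ends.
        tokens = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Made-up, punctuation-free sample lines standing in for corpus text.
sample = ["the match ended in a draw", "the match was postponed"]
unigrams, bigrams = bigram_counts(sample)
p_match_given_the = bigram_prob("the", "match", unigrams, bigrams)
```

With punctuation present, each mark would either stick to a neighboring word or need a separate tokenization step, which is the extra work the corpus design avoids.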
- M. Abbas, K. Smaili, D. Berkani. (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, Vol. 9, No. 5, pp. 185-192.
- M. Abbas, K. Smaili, D. Berkani. (2010). TR-Classifier and kNN Evaluation for Topic Identification Tasks. Special Issue on Advances in Arabic Language Processing, the International Journal on Information and Communication Technologies (IJICT), Vol. 3, No. 3, pp. 65-74, Serial Publications.
- M. Abbas, K. Smaili, D. Berkani. Efficiency of TR-Classifier versus TFIDF. First International Conference on Integrated Intelligent Computing, August 5-7, 2010.
| Topic | Corpus size (number of documents) |
| Culture | 2782 |
| Local News | 3596 |
| International News | 2035 |
| Total number of docs | 20291 |
Other Arabic corpora can be found in the Blark Content.
ToolAR: a search engine for corpora and tools for Arabic language processing.