+213 (0) 23 18 00 95



وزارة التعليم العالي والبحث العلمي

Ministry of Higher Education and Scientific Research

المديرية العامة للبحث العلمي والتطوير التكنولوجي

General Directorate of Scientific Research and Technological Development

مركز البحث العلمي والتقني لتطوير اللغة العربية

Center of Scientific and Technical Research for the Development of Arabic Language



Text corpora

1. Machine Translation:

PADIC: Parallel Arabic DIalectal Corpus

PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built within the National Research Project "TORJMAN" (code: 24/u23/1902), led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research.
- Contains six dialects in addition to MSA (Modern Standard Arabic).
- Is written in Buckwalter format.
- Contains more than 6000 sentences.
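
Since the corpus is distributed in Buckwalter format, readers may want to convert it back to Arabic script. The following is a minimal sketch, assuming PADIC uses the standard Buckwalter symbol set; dialectal corpora sometimes add extra symbols, which this table would pass through unchanged. The function names are illustrative and not part of any PADIC tooling.

```python
# Standard Buckwalter symbols mapped to Arabic code points (written as
# \u escapes so the table is unambiguous in any editor).
# Assumption: the corpus follows the standard scheme without extensions.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A", "F": "\u064B", "N": "\u064C", "K": "\u064D",
    "a": "\u064E", "u": "\u064F", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
}

# Translation tables for both directions; the mapping is one-to-one.
_TO_ARABIC = str.maketrans(BUCKWALTER_TO_ARABIC)
_TO_BUCKWALTER = str.maketrans({v: k for k, v in BUCKWALTER_TO_ARABIC.items()})


def buckwalter_to_arabic(text: str) -> str:
    """Convert Buckwalter-transliterated text to Arabic script.

    Characters outside the table (spaces, digits) are left untouched.
    Any Latin-script words in the input would be converted too, so this
    should only be applied to lines known to be transliterated Arabic.
    """
    return text.translate(_TO_ARABIC)


def arabic_to_buckwalter(text: str) -> str:
    """Convert Arabic script to Buckwalter transliteration."""
    return text.translate(_TO_BUCKWALTER)


if __name__ == "__main__":
    print(buckwalter_to_arabic("ktAb"))  # the Arabic word for "book"
```

Using `str.translate` keeps the conversion a single pass over the text and makes the round trip lossless for every symbol in the table.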

2. Topic Identification:


Khaleej-2004 corpus
Size: 4.1 MB
Number of categories (topics): 4

Watan-2004 corpus
Size: 14.4 MB
Number of categories (topics): 6

I prepared the Khaleej-2004 corpus to run experiments on topic identification for the Arabic language. It was extracted from thousands of articles downloaded from an online newspaper.

The corpus contains more than 5000 articles, which correspond to nearly 3 million words.
Punctuation has been removed on purpose. For more information, see the works based on the Khaleej-2004 corpus:
  • M. Abbas, K. Smaili. Comparison of Topic Identification Methods for Arabic Language. International Conference RANLP05: Recent Advances in Natural Language Processing, 21-23 September 2005, Borovets, Bulgaria.
  • M. Abbas, K. Smaili, D. Berkani. Multi-category support vector machines for identifying Arabic topics. 10th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2009), Mexico.
  • M. Abbas, D. Berkani. Topic Identification by Statistical Methods for Arabic language. WSEAS Transactions on Computers, Issue 9, Volume 5, pp. 1908-1913, 2006.
Topic                   Corpus size (number of documents)
International News
Local News
Sports                  1430
Total number of docs

Watan-2004 corpus:

The Watan-2004 corpus contains about 20000 articles covering the following six topics (categories):
Culture, Religion, Economy, Local News, International News and Sports.
In this corpus, punctuation has been omitted intentionally in order to make it useful for language modeling.
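
For readers who want to prepare their own text in the same punctuation-free form, here is a minimal sketch. It is one possible approach and not necessarily the procedure used to build the corpus: it assumes that "punctuation" means every character in a Unicode punctuation category, which covers both Latin marks and Arabic ones such as the Arabic comma and question mark. The function name `strip_punctuation` is hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove punctuation and normalize whitespace.

    Assumption: a character counts as punctuation when its Unicode
    general category starts with "P" (Po, Ps, Pe, Pd, ...).
    """
    # Replace each punctuation character with a space, so that words
    # separated only by punctuation are not glued together.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacement.
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(strip_punctuation("Hello, world! (test)"))
```

This should be applied to Arabic-script text only. Buckwalter-transliterated text uses symbols such as `*` and `'` as letters, so stripping punctuation there would delete part of the words.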

My works based on the Watan-2004 corpus:
  • M. Abbas, K. Smaili, D. Berkani. (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, Vol. 9, No. 5, pp. 185-192.
  • M. Abbas, K. Smaili, D. Berkani. (2010). TR-Classifier and kNN Evaluation for Topic Identification Tasks. Special Issue on Advances in Arabic Language Processing, the International Journal on Information and Communication Technologies (IJICT), Vol. 3, No. 3, pp. 65-74, Serial Publications.
  • M. Abbas, K. Smaili, D. Berkani. Efficiency of TR-Classifier versus TFIDF. First International Conference on Integrated Intelligent Computing, August 5-7, 2010.
Topic                   Corpus size (number of documents)
Religion                3860
Economy                 3468
Local News
International News
Sports                  4550
Total number of docs
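
The publications listed for Watan-2004 evaluate kNN and TF-IDF based methods for topic identification. As a rough illustration of that family of methods, and not a reproduction of the authors' experiments, the sketch below classifies a tokenized article by majority vote among its nearest training articles under cosine similarity of TF-IDF vectors. The toy English tokens, the helper names (`to_vector`, `knn_predict`), the IDF variant log(N/df) + 1, and `k=3` are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Compute IDF weights from a list of token lists."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def to_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in training are ignored."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(tokens, train_vectors, train_labels, idf, k=3):
    """Return the majority label among the k most similar training docs."""
    query = to_vector(tokens, idf)
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical data, labels borrowed from the topic list).
train_docs = [
    ["match", "goal", "team"],
    ["team", "player", "goal"],
    ["market", "bank", "price"],
    ["bank", "trade", "price"],
]
train_labels = ["Sports", "Sports", "Economy", "Economy"]

idf = build_idf(train_docs)
train_vectors = [to_vector(doc, idf) for doc in train_docs]

if __name__ == "__main__":
    print(knn_predict(["goal", "team", "win"], train_vectors, train_labels, idf))
```

On a real corpus the articles would first be tokenized, and `k` and the weighting scheme would be tuned on held-out data.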


Other Arabic corpora can be found in the Blark Content.

N.B.: This corpus is for scientific use only. Any use of it to create and release other resources or software requires the authorization of Mourad Abbas.
Speech corpora
The Arpod corpus is a speech corpus built for language and dialect identification. It can be downloaded from:
More details can be found in:
Khaled Lounnas, Mourad Abbas, Mohamed Lichouri. Building a Speech Corpus based on Arabic Podcasts for Language and Dialect Identification. 3rd International Conference on Natural Language and Speech Processing (ICNLSP 2019), Italy, 2019.



ToolAR: A search engine for corpora and tools for Arabic language processing.