Computational Analysis of Printed Arabic Text Database for Natural Language Processing

Hassina Bouressace

doi:10.11649/cs.3027

Authors

Hassina Bouressace, 8 Mai 1945 - Guelma University, Guelma. ORCID: https://orcid.org/0000-0002-2858-8999

Keywords:

Arabic language, vocabulary, Arabic documents, frequency dictionary, Arabic printed text database

Abstract

A frequency dictionary of printed Arabic text is an essential resource for natural language processing. The dictionary presented here is built on the PATD database, which comprises 1,251 XML files of Arabic documents collected from ten newspapers and magazines from different countries. A total of 2,344 articles were created with various structures, covering open-vocabulary, multi-font, multi-size, and multi-style text. From these articles, 1,102,078 tokens, 19,926 sentences, and 1,000,000 words were extracted. For each word, the dictionary gives English equivalents, usage statistics, and usage distribution, and it identifies the most widely used terms. A thematic vocabulary list of the top words on various topics is also provided. The dictionary is a useful resource of modern Arabic vocabulary for specialists, students, and learners, and it is freely available to interested researchers on the project webpage (http://www.inf.u-szeged.hu/patd/fdpatd/).
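To make the kind of usage statistics mentioned in the abstract concrete, the following is a minimal sketch of how word counts, per-million rates, and document distribution could be computed from XML documents. It is not the author's pipeline: the element names in the sample documents, the helper names (`tokenize`, `count_words`, `frequency_table`), and the simple letter-run tokenizer are all assumptions made for illustration, and English equivalents are not covered.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Arabic diacritics (tashkeel) and the tatweel elongation mark.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Runs of Arabic letters (hamza through yeh); a deliberately simple tokenizer.
_ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def tokenize(text):
    """Strip diacritics and tatweel, then return runs of Arabic letters."""
    return _ARABIC_WORD.findall(_DIACRITICS.sub("", text))


def count_words(xml_documents):
    """Return (word frequencies, document frequencies) over XML strings."""
    freq, doc_freq = Counter(), Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # itertext() gathers all text regardless of element layout,
        # so no particular PATD schema is assumed here.
        words = tokenize(" ".join(root.itertext()))
        freq.update(words)
        doc_freq.update(set(words))
    return freq, doc_freq


def frequency_table(freq, doc_freq, n_docs, top=10):
    """Rows of (word, count, rate per million words, share of documents)."""
    total = sum(freq.values())
    return [
        (word, count, count * 1_000_000 / total, doc_freq[word] / n_docs)
        for word, count in freq.most_common(top)
    ]


# Hypothetical sample documents; the tag names are illustrative only.
docs = [
    "<article><title>في المدينة</title><p>ذهب الولد في الصباح</p></article>",
    "<article><p>في البيت كتاب</p></article>",
]
freq, doc_freq = count_words(docs)
table = frequency_table(freq, doc_freq, len(docs), top=1)
```

In this sketch, the document share is the fraction of documents in which a word occurs at least once, which is one simple way to express usage distribution. A real dictionary would likely also need lemmatization and clitic handling, which are omitted here.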
