Exploring the CAMeL Tools for Arabic Language Processing

16 min read · Apr 17, 2021

Natural Language Processing has been among the most successful applications of Artificial Intelligence in the past few decades. Nevertheless, work on individual languages has been full of challenges and limitations. Take Arabic, for example: it is regarded as one of the most difficult languages to learn, so you can imagine how difficult it is to build development tools that process such a language.

A lack of relevant resources, the lexical and morphological complexity of the language, its dialectal variation and, last but not least, a shortage of supporting research are some of the challenges that Arabic Language Processing researchers and developers have been striving to overcome. Today, we have a number of tools, libraries and models that have paved the way for steady progress in this field. Among the best of these tools is the reason behind this article: today, we will be exploring the CAMeL Tools suite for Arabic Language Processing. It has been developed by a team of great researchers and developers at New York University Abu Dhabi (NYUAD) under the supervision of Dr. Nizar Habash.

I cannot but admire the tremendous efforts dedicated to the creation of the CAMeL tools. The word “camel” can be read in two ways: the first that most probably came to your mind is the famous desert animal; however, read as a transliteration of the Arabic word “كامل,” it means “perfect.” In our context, though, CAMeL stands for “Computational Approaches to Modeling Language” (did not mean to disappoint you).

This article will go through some of the useful tools provided by CAMeL that developers could utilize. In a future article, we will hopefully be implementing a full project using these tools.

Shall we start? Great!

Transliterating:

Transliteration is the representation of a linguistic expression (a word or a letter) using the closest corresponding character(s) from another language's script. An example that I mentioned above was the word “camel,” which is a transliteration of the Arabic word “كامل.”

In this example, we will see how we can use this tool and its benefits. First of all, it should be noted that there are many character mapping schemes; we will explore some of them here and if you want to know more, you can find more information in this link.

We need to define a block of text to experiment on.

text_block

Output

بين أستوديوهات ورزازات وصحراء مرزوكة وآثار وليلي ثم الرباط والبيضاء انتهى المخرج المغربي سهيل بن بركة من تصوير مشاهد عمله السينمائي الجديد الذي خصصه لتسليط الضوء عن حياة الجاسوس الإسباني دومينغو باديا الذي عاش فترة من القرن التاسع عشر بالمغرب باسم علي باي هذا الفيلم الذي اختار له مخرجه عنوان حلم خليفة يصور حياة علي باي العباسي الذي ما زال أحد أحياء طنجة يحمل اسمه عاش حياة فريدة متنكرا بشخصية تاجر عربي من سلالة الرسول صلى الله عليه وسلم فيما كان يعمل جاسوسا لحساب إسبانيا وكشف مخرج الفيلم سهيل بن بركة في تصريح لهسبريس أن الفيلم السينمائي دخل مرحلة التوضيب التي تتم خارج المغرب مبرزا أن الفيلم الذي يروي حياة الجاسوس الإسباني دومينغو باديا منذ أن قرر من طنجة بدء رحلاته نحو عدد من المناطق في العالم الإسلامي بداية القرن العشرين سيكون جاهزا بعد شهرين ويجمع الفيلم السينمائي عددا من الممثلين من مختلف الجنسيات واختار لدور البطولة الممثلة السينمائية الإيطالية كارولينا كريشنتيني للقيام بدور الإنجليزية الليدي هستر ستانهوب التي اشتهرت في الكتب الغربية بـ زنوبيا والتي عاشت بدورها بالدول العربية وارتبطت بعلي باي بعلاقة عاطفية إضافة إلى وجوه سينمائية معروفة وعن اختيار المخرج المغربي لحياة علي باي العباسي يوضح في تصريح لوكالة الأنباء الفرنسية هذه الشخصية عاشت أحداثا مشوقة كثيرة تستحق أن تسلط عليها الأضواء مشيرا إلى أن الفيلم سيحمل الكثير من المفاجآت لا سيما أن البطل قتل على يد امرأة دست له السم خلال رحلة الحج وأضاف شخصية طموحة وشجاعة ومثقفة ومذهلة في آن واحد كان يرى نفسه مستكشفا في أول الأمر نال علي باي إعجاب السلطان بعلمه فجعله من المقربين منه في ظرف وجيز ودعاه إلى اللحاق به إلى فاس وبرحيله إلى فاس تنتهي قصته مع طنجة وعاش علي باي العباسي بمدينة طنجة على أنه رجل مسلم أصله من الشام ونال ثقة الجميع في هذه المدينة حيث تم تشييد تمثال له في عروسة الشمال نظرا لتمكنه من بعض العلوم خاصة علم الفلك الذي مكنه من رصد كسوف الشمس الذي تزامن مع وجوده في طنجة فكان لعلمه دور كبير ساعده في إخفاء هويته كما أبان هذا الأمر أيضا عن تراجع كبير في ميدان العلم والمعرفة لدى المغاربة والمسلمين بصفة عامة

The following segment of code returns our Arabic block of text in Buckwalter transliteration, i.e. written with Latin characters. We start by building the CharMapper for the chosen scheme, then instantiate the Transliterator object and apply it to our text.

Code:

from camel_tools.utils.charmap import CharMapper
from camel_tools.utils.transliterate import Transliterator

# Build the Arabic-to-Buckwalter character map and wrap it in a transliterator
ar2bw = CharMapper.builtin_mapper('ar2bw')
ar2bw_translit = Transliterator(ar2bw)

bw_str1 = ar2bw_translit.transliterate(text_block)
bw_str1

Output

"byn >stwdywhAt wrzAzAt wSHrA' mrzwkp w|vAr wlyly vm AlrbAT wAlbyDA' AnthY Almxrj Almgrby shyl bn brkp mn tSwyr m$Ahd Emlh AlsynmA}y Aljdyd Al*y xSSh ltslyT AlDw' En HyAp AljAsws Al<sbAny dwmyngw bAdyA Al*y EA$ ftrp mn Alqrn AltAsE E$r bAlmgrb bAsm Ely bAy h*A Alfylm Al*y AxtAr lh mxrjh EnwAn Hlm xlyfp ySwr HyAp Ely bAy AlEbAsy Al*y mA zAl >Hd >HyA' Tnjp yHml Asmh EA$ HyAp frydp mtnkrA b$xSyp tAjr Erby mn slAlp Alrswl SlY Allh Elyh wslm fymA kAn yEml jAswsA lHsAb <sbAnyA wk$f mxrj Alfylm shyl bn brkp fy tSryH lhsbrys >n Alfylm AlsynmA}y dxl mrHlp AltwDyb Alty ttm xArj Almgrb mbrzA >n Alfylm Al*y yrwy HyAp AljAsws Al<sbAny dwmyngw bAdyA mn* >n qrr mn Tnjp bd' rHlAth nHw Edd mn AlmnATq fy AlEAlm Al<slAmy bdAyp Alqrn AlE$ryn sykwn jAhzA bEd $hryn wyjmE Alfylm AlsynmA}y EddA mn Almmvlyn mn mxtlf AljnsyAt wAxtAr ldwr AlbTwlp Almmvlp AlsynmA}yp Al<yTAlyp kArwlynA kry$ntyny llqyAm bdwr Al<njlyzyp Allydy hstr stAnhwb Alty A$thrt fy Alktb Algrbyp b_ znwbyA wAlty EA$t bdwrhA bAldwl AlErbyp wArtbTt bEly bAy bElAqp EATfyp <DAfp <lY wjwh synmA}yp mErwfp wEn AxtyAr Almxrj Almgrby lHyAp Ely bAy AlEbAsy ywDH fy tSryH lwkAlp Al>nbA' Alfrnsyp h*h Al$xSyp EA$t >HdAvA m$wqp kvyrp tstHq >n tslT ElyhA Al>DwA' m$yrA <lY >n Alfylm syHml Alkvyr mn AlmfAj|t lA symA >n AlbTl qtl ElY yd Amr>p dst lh Alsm xlAl rHlp AlHj w>DAf $xSyp TmwHp w$jAEp wmvqfp wm*hlp fy |n wAHd kAn yrY nfsh mstk$fA fy >wl Al>mr nAl Ely bAy <EjAb AlslTAn bElmh fjElh mn Almqrbyn mnh fy Zrf wjyz wdEAh <lY AllHAq bh <lY fAs wbrHylh <lY fAs tnthy qSth mE Tnjp wEA$ Ely bAy AlEbAsy bmdynp Tnjp ElY >nh rjl mslm >Slh mn Al$Am wnAl vqp AljmyE fy h*h Almdynp Hyv tm t$yyd tmvAl lh fy Erwsp Al$mAl nZrA ltmknh mn bED AlElwm xASp Elm Alflk Al*y mknh mn rSd kswf Al$ms Al*y tzAmn mE wjwdh fy Tnjp fkAn lElmh dwr kbyr sAEdh fy <xfA' hwyth kmA >bAn h*A Al>mr >yDA En trAjE kbyr fy mydAn AlElm wAlmErfp ldY AlmgArbp wAlmslmyn bSfp EAmp"

Other schemes include the “Arabic to safe Buckwalter” scheme, ar2safebw:

'byn OstwdywhAt wrzAzAt wSHrAC mrzwkp wMvAr wlyly vm AlrbAT wAlbyDAC AnthY Almxrj Almgrby shyl bn brkp mn tSwyr mcAhd Emlh AlsynmAQy Aljdyd AlVy xSSh ltslyT AlDwC En HyAp AljAsws AlIsbAny dwmyngw bAdyA AlVy EAc ftrp mn Alqrn AltAsE Ecr bAlmgrb bAsm Ely bAy hVA Alfylm AlVy AxtAr lh mxrjh EnwAn Hlm xlyfp ySwr HyAp Ely bAy AlEbAsy AlVy mA zAl OHd OHyAC Tnjp yHml Asmh EAc HyAp frydp mtnkrA bcxSyp tAjr Erby mn slAlp Alrswl SlY Allh Elyh wslm fymA kAn yEml jAswsA lHsAb IsbAnyA wkcf mxrj Alfylm shyl bn brkp fy tSryH lhsbrys On Alfylm AlsynmAQy dxl mrHlp AltwDyb Alty ttm xArj Almgrb mbrzA On Alfylm AlVy yrwy HyAp AljAsws AlIsbAny dwmyngw bAdyA mnV On qrr mn Tnjp bdC rHlAth nHw Edd mn AlmnATq fy AlEAlm AlIslAmy bdAyp Alqrn AlEcryn sykwn jAhzA bEd chryn wyjmE Alfylm AlsynmAQy EddA mn Almmvlyn mn mxtlf AljnsyAt wAxtAr ldwr AlbTwlp Almmvlp AlsynmAQyp AlIyTAlyp kArwlynA krycntyny llqyAm bdwr AlInjlyzyp Allydy hstr stAnhwb Alty Acthrt fy Alktb Algrbyp b_ znwbyA wAlty EAct bdwrhA bAldwl AlErbyp wArtbTt bEly bAy bElAqp EATfyp IDAfp IlY wjwh synmAQyp mErwfp wEn AxtyAr Almxrj Almgrby lHyAp Ely bAy AlEbAsy ywDH fy tSryH lwkAlp AlOnbAC Alfrnsyp hVh AlcxSyp EAct OHdAvA mcwqp kvyrp tstHq On tslT ElyhA AlODwAC mcyrA IlY On Alfylm syHml Alkvyr mn AlmfAjMt lA symA On AlbTl qtl ElY yd AmrOp dst lh Alsm xlAl rHlp AlHj wODAf cxSyp TmwHp wcjAEp wmvqfp wmVhlp fy Mn wAHd kAn yrY nfsh mstkcfA fy Owl AlOmr nAl Ely bAy IEjAb AlslTAn bElmh fjElh mn Almqrbyn mnh fy Zrf wjyz wdEAh IlY AllHAq bh IlY fAs wbrHylh IlY fAs tnthy qSth mE Tnjp wEAc Ely bAy AlEbAsy bmdynp Tnjp ElY Onh rjl mslm OSlh mn AlcAm wnAl vqp AljmyE fy hVh Almdynp Hyv tm tcyyd tmvAl lh fy Erwsp AlcmAl nZrA ltmknh mn bED AlElwm xASp Elm Alflk AlVy mknh mn rSd kswf Alcms AlVy tzAmn mE wjwdh fy Tnjp fkAn lElmh dwr kbyr sAEdh fy IxfAC hwyth kmA ObAn hVA AlOmr OyDA En trAjE kbyr fy mydAn AlElm wAlmErfp ldY AlmgArbp wAlmslmyn bSfp EAmp'

And the “Arabic to Habash-Soudi-Buckwalter” scheme, ar2hsb:

"byn ÂstwdywhAt wrzAzAt wSHrA' mrzwkħ wĀθAr wlyly θm AlrbAT wAlbyDA' Anthý Almxrj Almγrby shyl bn brkħ mn tSwyr mšAhd ςmlh AlsynmAŷy Aljdyd Alðy xSSh ltslyT AlDw' ςn HyAħ AljAsws AlĂsbAny dwmynγw bAdyA Alðy ςAš ftrħ mn Alqrn AltAsς ςšr bAlmγrb bAsm ςly bAy hðA Alfylm Alðy AxtAr lh mxrjh ςnwAn Hlm xlyfħ ySwr HyAħ ςly bAy AlςbAsy Alðy mA zAl ÂHd ÂHyA' Tnjħ yHml Asmh ςAš HyAħ frydħ mtnkrA bšxSyħ tAjr ςrby mn slAlħ Alrswl Slý Allh ςlyh wslm fymA kAn yςml jAswsA lHsAb ĂsbAnyA wkšf mxrj Alfylm shyl bn brkħ fy tSryH lhsbrys Ân Alfylm AlsynmAŷy dxl mrHlħ AltwDyb Alty ttm xArj Almγrb mbrzA Ân Alfylm Alðy yrwy HyAħ AljAsws AlĂsbAny dwmynγw bAdyA mnð Ân qrr mn Tnjħ bd' rHlAth nHw ςdd mn AlmnATq fy AlςAlm AlĂslAmy bdAyħ Alqrn Alςšryn sykwn jAhzA bςd šhryn wyjmς Alfylm AlsynmAŷy ςddA mn Almmθlyn mn mxtlf AljnsyAt wAxtAr ldwr AlbTwlħ Almmθlħ AlsynmAŷyħ AlĂyTAlyħ kArwlynA kryšntyny llqyAm bdwr AlĂnjlyzyħ Allydy hstr stAnhwb Alty Ašthrt fy Alktb Alγrbyħ b_ znwbyA wAlty ςAšt bdwrhA bAldwl Alςrbyħ wArtbTt bςly bAy bςlAqħ ςATfyħ ĂDAfħ Ălý wjwh synmAŷyħ mςrwfħ wςn AxtyAr Almxrj Almγrby lHyAħ ςly bAy AlςbAsy ywDH fy tSryH lwkAlħ AlÂnbA' Alfrnsyħ hðh AlšxSyħ ςAšt ÂHdAθA mšwqħ kθyrħ tstHq Ân tslT ςlyhA AlÂDwA' mšyrA Ălý Ân Alfylm syHml Alkθyr mn AlmfAjĀt lA symA Ân AlbTl qtl ςlý yd AmrÂħ dst lh Alsm xlAl rHlħ AlHj wÂDAf šxSyħ TmwHħ wšjAςħ wmθqfħ wmðhlħ fy Ān wAHd kAn yrý nfsh mstkšfA fy Âwl AlÂmr nAl ςly bAy ĂςjAb AlslTAn bςlmh fjςlh mn Almqrbyn mnh fy Ďrf wjyz wdςAh Ălý AllHAq bh Ălý fAs wbrHylh Ălý fAs tnthy qSth mς Tnjħ wςAš ςly bAy AlςbAsy bmdynħ Tnjħ ςlý Ânh rjl mslm ÂSlh mn AlšAm wnAl θqħ Aljmyς fy hðh Almdynħ Hyθ tm tšyyd tmθAl lh fy ςrwsħ AlšmAl nĎrA ltmknh mn bςD Alςlwm xASħ ςlm Alflk Alðy mknh mn rSd kswf Alšms Alðy tzAmn mς wjwdh fy Tnjħ fkAn lςlmh dwr kbyr sAςdh fy ĂxfA' hwyth kmA ÂbAn hðA AlÂmr ÂyDA ςn trAjς kbyr fy mydAn Alςlm wAlmςrfħ ldý AlmγArbħ wAlmslmyn bSfħ ςAmħ"

What is the difference? And what is the purpose?

The schemes are practically the same: each one maps Arabic text to Latin characters and back. The only difference is in the symbols used to represent the same Arabic letters; for example, the letter ‘ش’ appears as ‘$’ in Buckwalter, ‘c’ in safe Buckwalter and ‘š’ in Habash-Soudi-Buckwalter in the outputs above. For more information on the symbols used in each scheme, refer to this link. The purpose of transliteration is to help readers understand how Arabic words are written and pronounced using Latin characters (very useful for non-native speakers). Moreover, many datasets containing Arabic text use such schemes; therefore, this tool is essential for making full use of such data.
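To make the difference between the schemes concrete, here is a minimal pure-Python sketch. It is not how CAMeL implements its CharMapper; it simply hand-codes the mapping for the five letters of the word مشاهد, using the symbols visible in the three outputs above. The dictionary and function names are made up for illustration.

```python
# Toy character maps for three schemes, limited to the letters of one word.
# The symbols are copied from the outputs above; the real CharMapper
# tables cover the full Arabic alphabet.
BW = {"م": "m", "ش": "$", "ا": "A", "ه": "h", "د": "d"}
SAFE_BW = {**BW, "ش": "c"}   # safe Buckwalter replaces symbols such as $
HSB = {**BW, "ش": "š"}       # Habash-Soudi-Buckwalter uses accented letters


def transliterate(text, table):
    """Map each character through the table, leaving unknown ones as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table):
    """Build the reverse map (Latin symbol -> Arabic letter)."""
    return {latin: arabic for arabic, latin in table.items()}


word = "مشاهد"
for name, table in [("bw", BW), ("safebw", SAFE_BW), ("hsb", HSB)]:
    print(name, transliterate(word, table))

# Round trip: this toy mapping is one-to-one, so the original word comes back.
print(transliterate(transliterate(word, BW), invert(BW)) == word)
```

Because each scheme is a one-to-one table, going back to Arabic is just a matter of inverting the dictionary, which is why data stored in one of these schemes can be restored to Arabic script.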

Dediacritizing:

One of the complex aspects of the Arabic language is the use of diacritization. Similar to English, the Arabic language has vowels and consonants, but diacritization plays a major role in how a specific word is pronounced. In fact, it can also help in determining the POS (part of speech) of a certain word in a sentence. Interestingly enough, the same letters with two different diacritizations can produce two words with totally different meanings. For example, the word “ْأَنَس” is a name for a person pronounced “Anas,” whereas the word “إِنْسَ” is an imperative verb meaning “forget” and is pronounced as “Ensa.” As much as diacritization is meant to make things easier, it can also make them more complex.

The upcoming tool dediacritizes text, i.e. it removes the diacritics from all the words of a sentence. The following text is the same as the previous text_block but with diacritization.

diac_txt

Output

بَيْن أستوديوهات ورزازات وَصَحْرَاء مرزوكة وَآثَارٌ وَلِيَلِي ثُمّ الرِّبَاط وَالْبَيْضَاء انْتَهَى الْمَخْرَج الْمَغْرِبِيّ سُهَيْلِ بْنِ بِرْكَةٍ مِنْ تَصْوِيرِ مُشَاهَدٌ عَمَلِه السِّينِمائِيّ الْجَدِيد الَّذِي خَصَّصَه لِتَسْلِيط الضَّوْء عَنْ حَياةِ الْجَاسُوس الإسباني دومينغو بَادِئًا الَّذِي عَاش فَتْرَةً مِنْ الْقَرْنِ التَّاسِعَ عَشَرَ بِالْمَغْرِب بِاسْم عَلِيّ بِأَيّ هَذَا الفِيلْم الَّذِي اخْتَارَ لَهُ مَخْرَجِه عُنْوَان حَلَم خَلِيفَة يُصَوِّر حَيَاة عَلِيّ بِأَيّ الْعَبَّاسِيّ الَّذِي مَا زَالَ أَحَدُ إحْيَاء طَنْجَة يُحْمَل اسْمُه عَاشَ حَيَاةً فَرِيدَة متنكرا بشخصية تَاجِرٌ عَرَبِيٌّ مِنْ سُلَالَةٍ الرَّسُولِ صَلَّى اللَّهُ عَلَيْهِ وَسَلَّمَ فِيمَا كَانَ يَعْمَلُ جَاسُوسًا لِحِسَاب إِسْبانِيا وَكَشَف مَخْرَج الفِيلْم سُهَيْلِ بْنِ بَرَكَةٌ فِي تَصْرِيحِ لهسبريس أَن الفِيلْم السِّينِمائِيّ دَخَل مَرْحَلَة التوضيب الَّتِي تَتِمُّ خَارِج الْمَغْرِب مُبَرِّزًا أَن الفِيلْم الّذِي يَرْوِي حَيَاة الْجَاسُوس الإسباني دومينغو بَادِئًا مُنْذ أَنْ قَرَّرَ مِنْ طَنْجَة بَدَأ رحلاته نَحْو عَدَدٍ مِنْ الْمَنَاطِق فِي الْعَالَمِ الإِسْلاَمِيِّ بِدَايَةِ القَرْنِ العِشْرِينَ سَيَكُون جَاهِزًا بَعْدَ شَهْرَيْنِ وَيُجْمَع الفِيلْم السِّينِمائِيّ عَدَدًا مِنْ الْمُمَثِّلِينَ مِنْ مُخْتَلِفِ الجنسيات وَاخْتَار لِدُور البُطولَة الْمُمَثَّلَة السينمائية الإيطالية كارولينا كريشنتيني لِلْقِيَام بِدُور الإنْجلِيزِيَّة الليدي هستر ستانهوب الَّتِي اشْتُهِرَتْ فِي الْكُتُبِ الْغَرْبِيَّة بـ زنوبيا وَاَلَّتِي عَاشَت بدورها بالدول الْعَرَبِيَّة وارتبطت بِعَلِيّ بِأَيّ بِعِلَاقَة عَاطِفِيَّةٌ أَضَافَهُ إلَى وُجُوهِ سينمائية مَعْرُوفَةٌ وَعَن اخْتِيَار الْمَخْرَج الْمَغْرِبِيّ لِحَيَاة عَلِيّ بِأَيّ الْعَبَّاسِيّ يُوَضِّح فِي تَصْرِيحِ لِوَكَالَة الْإِنْبَاء الْفَرَنْسِيَّة هَذِه الشَّخْصِيَّة عَاشَت أَحْدَاثًا مشوقة كَثِيرَةٌ تَسْتَحِقّ أَن تَسَلَّط عَلَيْهَا الأَضْوَاء مُشِيرًا إلَى أَنْ الفِيلْم سيحمل الْكَثِيرِ مِنْ المفاجآت لَا سِيَّمَا إنْ البَطَل قُتِلَ عَلَى يَدِ امْرَأَةٍ دَسْت لَه السُّمّ خِلَال رَحْلِه الْحَجّ وَأَضَاف شَخْصِيَّةٌ طُمُوحِه وَشَجَاعَة ومثقفة ومذهلة فِي 
أَنَّ وَاحِدٍ كَانَ يَرَى نَفْسَهُ مُسْتَكْشِفًا فِي أَوَّلِ الْأَمْرِ نَال عَلِيّ بِأَيّ إعْجَاب السُّلْطَان بِعِلْمِه فَجَعَلَهُ مِنْ الْمُقَرَّبِينَ مِنْهُ فِي ظَرْفٍ وَجِيز وَدَعَاه إلَى اللِّحَاق بِهِ إلَى فَاس وبرحيله إلَى فَاس تَنْتَهِي قِصَّتَهُ مَعَ طَنْجَة وَعَاش عَلِيّ بِأَيّ الْعَبَّاسِيّ بِمَدِينَة طَنْجَة عَلَى أَنَّهُ رَجُلٌ مُسْلِمٍ أَصْله مِنْ الشَّامِ وَنَال ثِقَةٌ الْجَمِيعِ فِي هَذِهِ الْمَدِينَةِ حَيْثُ تَمَّ تَشْيِيد تِمْثَالٌ لَهُ فِي عَرُوسَه الشِّمَال نَظَرًا لِتَمَكُّنِهِ مِنْ بَعْضِ الْعُلُومِ خَاصَّة عِلْمِ الفَلَكِ الَّذِي مَكَّنَهُ مِنْ رَصَد كُسُوفِ الشَّمْسِ الَّذِي تَزَامَن مَعَ وُجُودِهِ فِي طَنْجَة فَكَان لِعِلْمِه دُور كَبِيرٌ سَاعَدَهُ فِي إخْفَاءِ هُوِيَّتِه كَمَا أَبَان هَذَا الْأَمْرِ أَيْضًا عَنْ تَرَاجَع كَبِيرٌ فِي مَيْدَانِ الْعِلْمِ وَالْمَعْرِفَةِ لَدَى الْمَغَارِبَة وَالْمُسْلِمِين بِصِفَةٍ عَامَّةٍ.

Code:

from camel_tools.utils.dediac import dediac_ar

# Strip all diacritical marks from the text
dediac_sent = dediac_ar(diac_txt)
print(dediac_sent)

The output of this code will be the same as the initial text we saw above.

The removal of diacritization is among the initial steps in Arabic language processing as it eliminates any non-essential components in the Arabic text.
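To show roughly what dediacritization involves, here is a small sketch using only the standard library. It assumes that "diacritics" means the Unicode code points U+064B to U+0652 plus the dagger alef U+0670; the set that dediac_ar actually removes may differ, and the function name below is hypothetical.

```python
import re

# Assumption for this sketch: diacritics are U+064B..U+0652 (tanween,
# short vowels, shadda, sukun) plus the dagger alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Return the text with the assumed Arabic diacritic marks removed."""
    return _DIACRITICS.sub("", text)


# The first word of diac_txt, written with escapes: beh, fatha, yeh, sukun, noon
diacritized = "\u0628\u064E\u064A\u0652\u0646"
plain = strip_diacritics(diacritized)

print(plain)
print(len(diacritized), "code points ->", len(plain))
```

The diacritics are separate combining characters that follow the letter they modify, so removing them is a plain character filter and the base letters are left untouched.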

Normalization:

There are some letters in Arabic that can be confusing in some cases. For instance, the letter ‘ه’ is pronounced in a similar manner to the letter ‘h’ in “hospital.” The letter ‘ة’, however, is pronounced like ‘t’ in some positions and like its counterpart ‘ه’ in others. Thus, in certain situations, we would like to normalize the whole text by unifying such instances. The following excerpt of code will make it clearer:

Code:

from camel_tools.utils.normalize import normalize_teh_marbuta_ar, normalize_alef_ar

teh_marbuta = "تاء مربوطة"
alef_maksura = "إزدراء الغير ليس مفخرة"

print("Non-normalized Teh Marbuta:", teh_marbuta)
print("Non-normalized Alef Maksura:", alef_maksura)

print("Non-normalized Teh:", ar2bw_translit.transliterate(teh_marbuta))
print("Non-normalized Alef:", ar2bw_translit.transliterate(alef_maksura), "\n\n")

# Despite the variable name, normalize_alef_ar unifies the hamzated alef
# forms (such as إ) into a bare alef (ا); alef maksura (ى) has its own
# function, normalize_alef_maksura_ar.
normalized_teh = normalize_teh_marbuta_ar(teh_marbuta)
normalized_alef = normalize_alef_ar(alef_maksura)

print("Normalization of Teh Marbuta:", normalized_teh)
print("Normalization of Alef Maksura:", normalized_alef)

print("\n\nNormalized Teh:", ar2bw_translit.transliterate(normalized_teh))
print("Normalized Alef:", ar2bw_translit.transliterate(normalized_alef))

Output:

Non-normalized Teh Marbuta: تاء مربوطة
Non-normalized Alef Maksura: إزدراء الغير ليس مفخرة


Non-normalized Teh: tA' mrbwTp
Non-normalized Alef: <zdrA' Algyr lys mfxrp


Normalization of Teh Marbuta: تاء مربوطه
Normalization of Alef Maksura: ازدراء الغير ليس مفخرة


Normalized Teh: tA' mrbwTh
Normalized Alef: AzdrA' Algyr lys mfxrp

We can see that certain characters have changed: the final ‘ة’ became ‘ه’, and the ‘إ’ with hamza became a bare ‘ا’. The transliteration reflects this as well, with ‘p’ turning into ‘h’ and ‘<’ turning into ‘A’, so the way the words would be read has changed along with their spelling.
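The two normalizations above boil down to character substitution. The sketch below reproduces the idea with str.translate; the sets of characters it covers are an assumption made for illustration, and the library functions may handle more or fewer variants. The function names are hypothetical.

```python
# Assumed character sets for this sketch only.
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
})
TEH_MARBUTA = str.maketrans({"\u0629": "\u0647"})  # teh marbuta -> heh


def normalize_alef(text):
    """Replace the assumed alef variants with a bare alef."""
    return text.translate(ALEF_VARIANTS)


def normalize_teh_marbuta(text):
    """Replace teh marbuta with heh."""
    return text.translate(TEH_MARBUTA)


# First word of the alef example and second word of the teh marbuta example
alef_word = "\u0625\u0632\u062F\u0631\u0627\u0621"
teh_word = "\u0645\u0631\u0628\u0648\u0637\u0629"

print(alef_word, "->", normalize_alef(alef_word))
print(teh_word, "->", normalize_teh_marbuta(teh_word))
```

Note that normalization throws information away on purpose: once ‘ة’ and ‘ه’ are merged, the original spelling cannot be recovered from the normalized text.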

Use of Morphology Database:

Morphology is the study of word structure. CAMeL provides built-in models that specialize in certain dialects and forms of Arabic (currently only the Modern Standard Arabic and Egyptian Arabic models are available). We will see how we can utilize these models in more than one application.

The following excerpt of code showcases the analyzer, which I find to be a very rich tool. It uses the built-in model to provide different properties related to the morphology of a word or set of words.

Code:

from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

# Load the built-in database in reinflection mode ('r')
morph_db = MorphologyDB.builtin_db(flags='r')
analyzer = Analyzer(morph_db)

# A word can have several analyses; look at the first one
analysis = analyzer.analyze('جاءت')
analysis[0]

Output:

{'diac': 'جاءَت',
 'lex': 'جاء_1',
 'bw': 'جاء/PV+َت/PVSUFF_SUBJ:3FS',
 'gloss': 'arrive;come;occur+it;they;she_<verb>',
 'pos': 'verb',
 'prc3': '0',
 'prc2': '0',
 'prc1': '0',
 'prc0': '0',
 'per': '3',
 'asp': 'p',
 'vox': 'a',
 'mod': 'i',
 'stt': 'na',
 'cas': 'na',
 'enc0': '0',
 'rat': 'n',
 'source': 'lex',
 'form_gen': 'f',
 'form_num': 's',
 'pattern': '1اءَت',
 'root': 'ج.#.#',
 'catib6': 'VRB',
 'ud': 'VERB',
 'd1seg': 'جاءَت',
 'd1tok': 'جاءَت',
 'atbseg': 'جاءَت',
 'd3seg': 'جاءَت',
 'd2seg': 'جاءَت',
 'd2tok': 'جاءَت',
 'atbtok': 'جاءَت',
 'd3tok': 'جاءَت',
 'bwtok': 'جاء_+َت',
 'pos_lex_logprob': -3.293341,
 'caphi': 'j_aa_2_a_t',
 'pos_logprob': -1.023208,
 'gen': 'f',
 'lex_logprob': -3.293341,
 'num': 's',
 'stemcat': 'PV_V',
 'stem': 'جاء',
 'stemgloss': 'arrive;come;occur'}

The output can be restricted to only the properties we want, as we will show later. However, to show how powerful this tool is, we opted to print the whole output. As for the flag, we can choose ‘g,’ which stands for “generate,” or ‘a,’ which stands for “analysis.” In our case, we used ‘r,’ which stands for “reinflection” and is equivalent to using ‘ag.’ We can also use the function analyze_words() to analyze a number of words in a sentence at once.
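Since each analysis is an ordinary Python dictionary, narrowing it down to the properties of interest needs nothing beyond a dict comprehension. The sketch below works on a few entries copied from the output above; select_features is a hypothetical helper, not part of CAMeL.

```python
# A few entries copied from the analyzer output above.
analysis = {
    "gloss": "arrive;come;occur+it;they;she_<verb>",
    "pos": "verb",
    "per": "3",
    "asp": "p",
    "gen": "f",
    "num": "s",
    "pos_lex_logprob": -3.293341,
}


def select_features(analysis, keys):
    """Keep only the requested keys, skipping any that are missing."""
    return {key: analysis[key] for key in keys if key in analysis}


wanted = ["pos", "gen", "num", "asp"]
print(select_features(analysis, wanted))
```

The same helper can be mapped over the full list returned by analyze() when a word has several candidate analyses.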

For more information regarding the morphological features that CAMeL tools support, refer to this link.

Generators:

One of the successful tools relying on the built-in morphological models is the generator. It works by providing a word and a set of morphological features to a Generator() object. The object returns the word or set of words that belong to the same morphological family and correspond to the features provided.

Code:

from camel_tools.morphology.generator import Generator

morph_gen = Generator(morph_db)

lemma = 'موظف'  # "employee"
features = {
    'pos': 'noun',
    'gen': 'm',
    'num': 'p'
}

analysis_gen = morph_gen.generate(lemma, features)
# Print the distinct diacritized forms among the generated analyses
for word in set(a['diac'] for a in analysis_gen):
    print(word)

Output:

مُوَظَّفُون
مُوَظَّفِين
مُوَظَّفِي
مُوَظَّفُو

These outputs correspond to the noun, plural and masculine features we specified in our code. To know more about the features and the options available, you can refer to this link.
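One way to picture what the generator returns is as a filter over all the forms of a lemma, keeping only those whose features match the request. The sketch below does exactly that over a tiny hand-written table. It is a toy model only: the real generator builds forms from the morphology database rather than looking them up in a list, and the singular entry in the table is added here purely for contrast.

```python
# Hand-written toy table, NOT CAMeL's morphology database.
# The two plural forms are taken from the output above; the singular
# entry is added only so the filter has something to exclude.
lemma = "موظف"
FORMS = [
    {"lex": lemma, "diac": "مُوَظَّف", "pos": "noun", "gen": "m", "num": "s"},
    {"lex": lemma, "diac": "مُوَظَّفُون", "pos": "noun", "gen": "m", "num": "p"},
    {"lex": lemma, "diac": "مُوَظَّفِين", "pos": "noun", "gen": "m", "num": "p"},
]


def toy_generate(lemma, features):
    """Return the forms of `lemma` whose features all match `features`."""
    return [
        form["diac"]
        for form in FORMS
        if form["lex"] == lemma
        and all(form.get(key) == value for key, value in features.items())
    ]


plurals = toy_generate(lemma, {"pos": "noun", "gen": "m", "num": "p"})
print(len(plurals), "plural forms:", plurals)
```

Leaving a feature out of the request (case, for instance) is what produces several outputs for one query, just as in the real output above.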

Reinflection:

Reinflection is similar to generation. In generation, we provide the lemma (the base form of the word) and get the variations corresponding to the morphological features we specify, as in the example above. In reinflection, we may provide any form of a word (plural, singular, masculine, feminine, etc.) and obtain the words corresponding to the morphological features we provide.

Code:

from camel_tools.morphology.reinflector import Reinflector

morph_reinf = Reinflector(morph_db)

word = 'طير'
features1 = {
    'gen': 'm',
    'num': 'p',
    'prc1': 'bi_prep'
}

analysis_reinf = morph_reinf.reinflect(word, features1)
# Print the distinct diacritized forms among the reinflected analyses
for form in set(a['diac'] for a in analysis_reinf):
    print(form)

Output:

بِأَطْيارِ
بِأَطْيار
بِطُيُورٍ
بِطُيُور
بِطُيُورِ
بِأَطْيارٍ

Disambiguator:

The disambiguator is used to obtain the morphological features of a word in a certain context (sentence). As we may know, a word in a certain context does not necessarily hold the same meaning it has when used in other contexts.

We start by loading the pretrained model and then call its disambiguate() function.

Code:

from camel_tools.disambig.mle import MLEDisambiguator

mle = MLEDisambiguator.pretrained()

sentence = ['قَام', 'الْمُعَلِّمُون', 'بواجبهم', 'عَلَى', 'أَكْمَلِ', 'وَجْهٍ']
disambiguated = mle.disambiguate(sentence)

# For each word, take the top-ranked analysis and pull out three fields
diacritized_word = [d.analyses[0].analysis['diac'] for d in disambiguated]
pos_tag_word = [d.analyses[0].analysis['pos'] for d in disambiguated]
stem_word = [d.analyses[0].analysis['stemgloss'] for d in disambiguated]

for row in zip(diacritized_word, pos_tag_word, stem_word):
    print(row)

Output:

('قامَ', 'verb', 'undertake;carry_out')
('المُعَلِّمُونَ', 'noun', 'teacher')
('بِواجِبهم', 'noun', 'duty;obligation;requirement')
('عَلَى', 'prep', 'on;above')
('أُكْمِل', 'verb', 'complete;finish')
('وَجْهِ', 'noun', 'face;front')

In our case, we chose to display the diacritized word, part of speech and meaning in English. Native Arabic speakers will notice that the word “أَكْمَلِ” has not been disambiguated correctly. In the output, it was diacritized as a verb, while the initial diacritization clearly shows its position as a noun in the sentence. Such mistakes are to be expected, as no model is 100% accurate (maybe it is not as “camel” as we thought). Nevertheless, the other words are correctly disambiguated.
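The “MLE” in MLEDisambiguator stands for maximum likelihood estimation. One plausible way to picture the mistake above is the following sketch, which assumes that choosing an analysis comes down to picking the candidate with the highest log-probability score (similar to the pos_lex_logprob field in the analyzer output) without looking at the neighbouring words. The candidate list and its scores are invented for illustration and are not taken from the model.

```python
# Invented candidates for one ambiguous word; the scores are made up.
candidates = [
    {"pos": "noun", "pos_lex_logprob": -5.3},
    {"pos": "verb", "pos_lex_logprob": -4.1},
]


def pick_most_likely(analyses, score_key="pos_lex_logprob"):
    """Return the analysis with the highest log-probability."""
    return max(analyses, key=lambda analysis: analysis[score_key])


best = pick_most_likely(candidates)
print(best["pos"])
```

With these made-up scores the verb reading wins, because a log-probability closer to zero means a more likely analysis. Under this assumption, a reading that is more frequent overall can beat the one that fits the sentence.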

Tagger:

Every NLP enthusiast knows how important word tagging is in the NLP world. For tagging, we will still rely on the MLEDisambiguator() model.

Code:

from camel_tools.tagger.default import DefaultTagger

# Tag each word with its part of speech, using the disambiguator loaded above
tagger = DefaultTagger(mle, 'pos')
tags = tagger.tag(sentence)

for word, tag in zip(sentence, tags):
    print(f'{word}: {tag}')

Output:

قَام: verb
الْمُعَلِّمُون: noun
بواجبهم: noun
عَلَى: prep
أَكْمَلِ: verb
وَجْهٍ: noun

As we have discussed, we have control over the information displayed in our output. Since we are analyzing the same sentence with the same model, we get the same result.

Tokenizer:

Finally! You were probably waiting for this: to know how tokenizers for the Arabic language operate. Well, guess what? There is not one specific tokenizer. Each tokenizer is used for a specific purpose. You may recall entries such as d1tok, d3tok and bwtok in the analyzer output above; these are the tokenization schemes that govern how tokenizing works. Let’s see.

Code:

from camel_tools.tokenizers.morphological import MorphologicalTokenizer

normal_sen = 'نجحت المستثمرة في استقطاب العديد من المشتركين الجدد الذين سيساهمون في نمو الشركة'.split()

# Run the same sentence through each tokenization scheme
d1tok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='d1tok')
print("d1tok:", d1tok_tokenizer.tokenize(normal_sen))

d2tok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='d2tok')
print("\nd2tok:", d2tok_tokenizer.tokenize(normal_sen))

d3tok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='d3tok')
print("\nd3tok:", d3tok_tokenizer.tokenize(normal_sen))

bwtok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='bwtok')
print("\nbwtok:", bwtok_tokenizer.tokenize(normal_sen))

atbtok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='atbtok')
print("\natbtok:", atbtok_tokenizer.tokenize(normal_sen))

Output:

d1tok: ['نجحت', 'المستثمرة', 'في', 'استقطاب', 'العديد', 'من', 'المشتركين', 'الجدد', 'الذين', 'سيساهمون', 'في', 'نمو', 'الشركة']

d2tok: ['نجحت', 'المستثمرة', 'في', 'استقطاب', 'العديد', 'من', 'المشتركين', 'الجدد', 'الذين', 'س+_يساهمون', 'في', 'نمو', 'الشركة']

d3tok: ['نجحت', 'ال+_مستثمرة', 'في', 'استقطاب', 'ال+_عديد', 'من', 'ال+_مشتركين', 'ال+_جدد', 'الذين', 'س+_يساهمون', 'في', 'نمو', 'ال+_شركة']

bwtok: ['نجح_+ت', 'ال+_مستثمر_+ة', 'في', 'ٱستقطاب', 'ال+_عديد', 'من', 'ال+_مشترك_+ين', 'ال+_جدد', 'الذين', 'س+_ي+_ساهم_+ون', 'في', 'نمو', 'ال+_شرك_+ة']

atbtok: ['نجحت', 'المستثمرة', 'في', 'استقطاب', 'العديد', 'من', 'المشتركين', 'الجدد', 'الذين', 'س+_يساهمون', 'في', 'نمو', 'الشركة']

You will notice that bwtok is the most aggressive scheme, as it separates every prefix and suffix a word has, which can make the original word hard to recognize. In this output, d3tok splits off prefixes such as the definite article ‘ال’ and the future marker ‘س’ but keeps suffixes attached. At the other end, d1tok is the most conservative: it leaves every word of this sentence intact. The tokenizers d2tok and atbtok split some prefixes (here, only the future marker ‘س’) but not the definite article.

So, which one should I use?

I believe the answer to this question can only be determined by you and by how deep your task/analysis goes.
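The tokenized strings follow a simple convention that can be read off the outputs above: pieces are joined with ‘_’, a trailing ‘+’ marks a prefix and a leading ‘+’ marks a suffix. Assuming that convention holds for every token, a few lines of plain Python are enough to break a token into labelled pieces. Both helper functions are hypothetical and not part of CAMeL.

```python
# Two tokens copied from the bwtok output above.
bwtok_tokens = ["ال+_مستثمر_+ة", "س+_ي+_ساهم_+ون"]


def split_morphemes(token):
    """Split a tokenized word into its pieces on the '_' separator."""
    return token.split("_")


def classify(piece):
    """Label a piece by the position of its '+' marker."""
    if piece.endswith("+"):
        return "prefix"
    if piece.startswith("+"):
        return "suffix"
    return "stem"


for token in bwtok_tokens:
    pieces = split_morphemes(token)
    print([(piece.strip("+"), classify(piece)) for piece in pieces])
```

Counting the pieces per word across the five schemes is a quick way to compare how aggressive each one is on your own data before settling on a scheme.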

More…

There are more tools that I will not be able to cover here but that are worth mentioning. CAMeL Tools does not only work with Modern Standard Arabic; it also works with Egyptian Arabic (but not on Windows machines). In fact, there is a module concerned with identifying the dialect of a piece of Arabic. There is also a module that specializes in sentiment analysis (it requires the installation of the AraBERT model), in addition to the NER (Named Entity Recognition) module, which requires the installation of the whole CAMeL data. I urge everyone who is interested to check out the official documentation of CAMeL and try things out for themselves. You can also watch the introductory session available on YouTube by Dr. Nizar Habash and Eng. Ossama Obeid.

Conclusion…

I hope that you have benefitted from reading this long blog. I personally enjoyed exploring the tools and understanding more about the challenges of Arabic language processing. I hope that I will be able to write about my experience with CAMeL under the umbrella of a real project in the near future :).

Thank you for Reading.

Written by Zyad Al-Azazi