TUNIS, TUNISIA, 22.06.2021: InstaDeep and iCompass announced Tuesday that the two AI companies would make TunBERT, their world-first Natural Language Processing (NLP) model for the underrepresented Tunisian dialect, available free and open-source to help spur innovation in Tunisia’s already rapidly growing AI and tech ecosystem.
Tunisia’s tech ecosystem has emerged as an economic and technological phenomenon in North Africa over the last decade. TunBERT uses the latest advances in AI and Machine Learning (ML) to accomplish several tasks, including sentiment analysis, dialect classification, and question answering to demonstrate reading comprehension.
By open-sourcing TunBERT, InstaDeep and iCompass aim to help pave the way for further R&D in multiple directions, accelerating progress by providing a foundation that others can build upon and apply to nearly any field of expertise.
“We’re excited to open-source TunBERT, a joint research project between iCompass and InstaDeep that redefines the state of the art for the Tunisian dialect. This work also highlights the positive results that can be achieved when leading AI startups collaborate, benefiting the Tunisian tech ecosystem as a whole,” said InstaDeep CEO and Co-Founder Karim Beguir.
Overcoming dialect variations and misinterpretation
“We are thrilled to make our findings available to the wider community, as very little research has been done on underrepresented languages in the past. Misinterpretation of dialect variations is a particularly challenging area today: spoken Arabic features a wide variety of regional dialects, which makes NLP for dialects especially difficult, and Tunisian is no different,” said iCompass Co-Founder and CTO Hatem Haddad.
Improving diversity and better representation of all people — and their languages — is crucial for developing future AI that is fair.
TunBERT has generated much interest across Tunisia and beyond since InstaDeep and iCompass announced their collaboration in March. The result was unveiled by Haddad and InstaDeep AI Research Engineer Nourchene Ferchichi at chipmaker NVIDIA’s annual GPU Technology Conference (GTC) in May.
Spoken by 12 million people, Tunisian is closely related to the other North African dialects, which are spoken by an estimated 105 million people in total. The biggest challenge posed by the Tunisian dialect is that it is non-standardised: it has no codified grammatical rules or orthography. It is also under-resourced compared to languages such as English, due to the scarcity of publicly available Tunisian datasets. With numerous variations and interpretations, translations can easily be misunderstood and provoke negative reactions from other Arabic speakers. For example:
| Arabic Sentence | English Translation | Tunisian Interpretation |
| --- | --- | --- |
| وين احصل على جلبيات و عبايات كشخة؟ | Where can I find dresses and pretty cloaks? | Where can I find dresses and ugly cloaks? |
| عندي شقيقة | I have a sister | I have a headache |
To overcome these challenges, the InstaDeep and iCompass research team created a 67.2 MB Common-Crawl-based dataset extracted from social media. The dataset may seem small, but combined with recent advances in deep learning for NLP, it proved sufficient to train a performant model. The team used the NVIDIA NeMo toolkit and its NVIDIA-optimised BERT implementation to pre-train the language model on the collected Tunisian text.
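As a rough illustration of this pre-training step, the sketch below shows a BERT-style masked-language-model run on a raw text corpus. Note that the team used NVIDIA NeMo; Hugging Face `transformers` is used here only as a stand-in, and the file name, tokenizer choice, and hyperparameters are assumptions rather than the project’s actual settings.

```python
# Illustrative sketch of BERT-style masked-language-model pre-training on a
# raw Tunisian text corpus. The team used NVIDIA NeMo; Hugging Face
# `transformers` is a stand-in here. The file path, tokenizer, and
# hyperparameters are assumptions, not the project's settings.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file: one social-media sentence per line.
corpus = load_dataset("text", data_files={"train": "tunisian_corpus.txt"})

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Train a BERT masked-LM from scratch on the dialect corpus.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Randomly mask 15% of tokens, the standard BERT pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tunbert-pretrain", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```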
To evaluate the language model’s performance, the team conducted extensive benchmark experiments on six datasets across three downstream tasks, fine-tuning TunBERT for sentiment analysis, dialect identification, and question answering.
Let’s look at each task and the results the research team achieved.
When fine-tuned on the various labelled datasets, TunBERT achieves state-of-the-art results on all three downstream tasks. Compared to larger models such as m-BERT, GigaBERT and AraBERT, TunBERT represents Tunisian-dialect tokens better and yields better performance, while being less computationally costly at inference time.
Sentiment Analysis
For Sentiment Analysis, the team benchmarked TunBERT’s performance against multiple models, including Word2Vec, Doc2Vec, and BERT-based models such as multilingual-BERT, GigaBERT and AraBERT. The results show that TunBERT significantly outperformed those models on both accuracy and macro F1 scores.
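The two reported metrics are straightforward to compute; a minimal sketch using scikit-learn is shown below, with placeholder label arrays rather than the actual benchmark outputs.

```python
# Minimal sketch of the two reported metrics, accuracy and macro F1,
# computed with scikit-learn. The label arrays are placeholders, not
# actual benchmark outputs.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]   # gold sentiment labels (0 = negative, 1 = positive)
y_pred = [1, 0, 1, 0, 0]   # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro F1 averages per-class F1 scores equally, so minority classes
# weigh as much as majority ones.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```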
The examples in the table below show the model’s predictions on the Sentiment Analysis test sets, demonstrating that the model correctly distinguishes positive from negative comments.
| Tunisian Example | English Translation | Predicted Label |
| --- | --- | --- |
| واحد ماسط لاسط قعر موش متربي ميجيش ربع فنان | He is rude, he can’t be considered an artist | 0 (negative) |
| تعجبني وتشرف التمثيل والمسرح في تونس ربي يوفقها | I like her, she is a great Tunisian actress, may God bless her | 1 (positive) |
Dialect Identification
In the Dialect Identification task, the team built two new datasets, TAD and TADI, and benchmarked TunBERT’s performance against multilingual-BERT, GigaBERT and AraBERT on their test sets. The results show that TunBERT significantly outperformed those models on both accuracy and macro F1 scores, highlighting the positive impact of a dialect-based language model for such a specific task. Pre-training on ‘noisy data’ instead of ‘uniform data’ also proved helpful in this specific instance.
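A hedged sketch of what such task-specific fine-tuning can look like is shown below, again using Hugging Face `transformers` as a stand-in for the team’s NeMo-based setup; the checkpoint path, the CSV file standing in for TADI, and the hyperparameters are illustrative assumptions.

```python
# Hedged sketch of fine-tuning a pre-trained checkpoint for binary dialect
# identification (Tunisian vs. non-Tunisian). Hugging Face `transformers`
# stands in for the team's NeMo setup; the checkpoint path, the CSV file
# (a stand-in for TADI) and the hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("tunbert-pretrain")  # hypothetical path
model = AutoModelForSequenceClassification.from_pretrained(
    "tunbert-pretrain", num_labels=2  # 0 = non-Tunisian, 1 = Tunisian
)

# Hypothetical CSV with `text` and `label` columns.
data = load_dataset("csv", data_files={"train": "tadi_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tunbert-tadi", num_train_epochs=3),
    train_dataset=data["train"].map(tokenize, batched=True),
)
trainer.train()
```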
The following table highlights TunBERT’s ability to distinguish the Tunisian dialect from the Egyptian dialect, even though the two dialects can be written in very similar ways.
| Example | English Translation | Predicted Label |
| --- | --- | --- |
| التفكير في الزحمة الي هلاقيها قادر يخليني استنى لبعد العيد عادي | Thinking about the crowds that I will find can make me wait until after Eid | 0 (Non-Tunisian dialect) |
| مرا و عليها الكلام | A wonderful woman | 1 (Tunisian dialect) |
Question Answering
In the Question Answering task, the team created the Tunisian Reading Comprehension Dataset (TRCD) and benchmarked TunBERT’s performance against multilingual-BERT, GigaBERT and AraBERT, after adding, for all models, a pre-training step on a Modern Standard Arabic reading comprehension dataset.
The example below showcases TunBERT outputs when tested on questions from a dialectal version of the Tunisian constitution. The predicted response demonstrates the model’s understanding of the question and the context of the given paragraph.
|  | Tunisian Dialect | English Translation |
| --- | --- | --- |
| Question | شكون ترجم الدستور؟ | Who translated the constitution? |
| Context | الجمعية التونسية للقانون الدستوري جمعية علمية و الناس الي ترجمو الدستور باللغة الدارجة أساتذة متاع قانون | The Tunisian association of constitutional law is a scientific association, and the people who translated the constitution into the dialect are law professors |
| Predicted Answer | أساتذة متاع قانون | Law professors |
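For illustration, extractive QA inference on the constitution example above could look like the minimal sketch below; the `pipeline` helper from Hugging Face `transformers` stands in for the team’s NeMo setup, and the checkpoint name is hypothetical.

```python
# Minimal sketch of extractive question answering with a fine-tuned
# BERT-style checkpoint. The checkpoint name is hypothetical; the question
# and context come from the constitution example in the table above.
from transformers import pipeline

qa = pipeline("question-answering", model="tunbert-trcd")  # hypothetical path

result = qa(
    question="شكون ترجم الدستور؟",  # "Who translated the constitution?"
    context=(
        "الجمعية التونسية للقانون الدستوري جمعية علمية و الناس الي ترجمو "
        "الدستور باللغة الدارجة أساتذة متاع قانون"
    ),
)
# The model predicts an answer span inside the context, e.g.
# "أساتذة متاع قانون" ("law professors"), with a confidence score.
print(result["answer"], result["score"])
```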
Overall, the results achieved by TunBERT outperformed previous research in the field. The experimental results indicate that the proposed pre-trained TunBERT model, trained on a small, noisy dataset, yields improvements over other BERT-based language models trained on large, uniform corpora.
By open-sourcing the model, along with the newly created fine-tuning datasets, InstaDeep and iCompass are excited to see what the wider AI research community in Africa can achieve using TunBERT as a foundation.
The open-source code is available to download here.
Download the press kit including animations and images here.
For more information, get in touch: