ABSTRACT
Language models have proven highly effective and achieve state-of-the-art results across Natural Language Processing tasks. In particular, Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for many of these tasks. Most available language models have been trained on Indo-European languages, and such models are known to require very large training datasets. Only a few studies, however, have focused on under-represented languages and dialects. In this work, we describe the pretraining of two BERT language models for the Tunisian dialect: one based on a customized version of Google's TensorFlow implementation (named TunBERT-T) and one based on NVIDIA's PyTorch implementation (named TunBERT-P). We describe the creation of the training dataset, from collecting a Common-Crawl-based corpus to filtering and pre-processing the data. We present the training setup and detail the fine-tuning of the TunBERT-T and TunBERT-P models on three downstream NLP tasks. We challenge the assumption that very large amounts of training data are required, and explore the effectiveness of training a monolingual Transformer-based language model for a low-resource language, taking the Tunisian dialect as a use case. Our results indicate that a proportionately small Common-Crawl-based dataset (500K sentences, 67.2MB) yields performance comparable to that obtained with costly, much larger datasets (from 24GB to 128GB of text). We demonstrate that, using newly created datasets, our proposed TunBERT-P model achieves comparable or better performance on three downstream tasks: Sentiment Analysis, Language Identification, and Reading Comprehension Question-Answering. We release both pretrained models along with all the datasets used for fine-tuning.