Project – Digitizing Armenian Linguistic Heritage (DALiH)

The project Digitizing Armenian Linguistic Heritage (DALiH): Armenian Multivariational Corpus and Data Processing* aims to build, for the first time, a unified open-access and open-source digital linguistic platform covering the whole spectrum of Armenian language variation. More specifically, it will provide annotated corpora for 1) Classical Armenian; 2) Modern Western Armenian; 3) Middle Armenian (a pilot corpus); 4) dialects (three pilot corpora); and 5) Modern Eastern Armenian (an update of the existing corpus).

Research will be conducted from both natural language processing (NLP) and linguistic perspectives in order to provide full grammatical annotation and automatic speech recognition (ASR) models for the target Armenian varieties. Resources combining deep-learning and rule-based approaches will be designed to process the written and oral databases and to cross-check their value for further corpus enlargement, in the context of multiparameter variation in an under-resourced language.

NLP-based linguistic research, such as language identification, measurement of the distance between varieties, and lexical and morphological disambiguation, will be carried out to revisit existing research questions and to raise new ones, backed by the newly available processed written and oral data.

*The project is funded by the French National Research Agency (ANR-21-CE38-0006).