Lezghian ML — AI for Language Preservation
Founded a 1,000+ volunteer community preserving the Lezgin language through AI. Built the first Russian-Lezgin translator, TTS model, and open-source language corpus.
Founded an international community of ML enthusiasts and linguists dedicated to preserving the Lezgin language — a UNESCO-classified vulnerable language spoken by ~800,000 people in Dagestan and Azerbaijan.
Honored with the Lezgi Star award for this work.
What we built
Russian ↔ Lezgin Translator 2.0
Fine-tuned the NLLB-200-distilled-600M model for translation between Russian and Lezgin. The 2.0 release was trained on a synthetic corpus of 200K sentences tagged via Gemini, significantly improving quality over v1. Available as a Telegram bot and on HuggingFace.
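For readers who want to try a fine-tuned NLLB checkpoint, inference with HuggingFace Transformers might look roughly like the sketch below. It is not the project's actual serving code. The language code `lez_Cyrl` and the `model_id` argument are assumptions for illustration; check the published model card for the real identifiers. `rus_Cyrl` is the standard NLLB code for Russian.

```python
from typing import Dict, List

# NLLB-style language codes. "rus_Cyrl" is the standard NLLB-200 code for Russian.
# "lez_Cyrl" is an ASSUMED code for Lezgin; the published tokenizer may differ.
LANG_CODES: Dict[str, str] = {"ru": "rus_Cyrl", "lez": "lez_Cyrl"}


def resolve_lang(short_code: str) -> str:
    """Map a short code ("ru", "lez") to its NLLB-style language token."""
    try:
        return LANG_CODES[short_code]
    except KeyError:
        raise ValueError(f"Unsupported language: {short_code!r}") from None


def translate(texts: List[str], src: str, tgt: str, model_id: str) -> List[str]:
    """Translate a batch of sentences with a fine-tuned NLLB checkpoint.

    model_id is a placeholder for whichever checkpoint you load
    (a HuggingFace repo name or a local path).
    """
    # Heavy imports are deferred so the helpers above work without them.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=resolve_lang(src))
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=256
    )
    # NLLB selects the output language via the first generated token.
    target_token_id = tokenizer.convert_tokens_to_ids(resolve_lang(tgt))
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=target_token_id,
        max_new_tokens=256,
        num_beams=4,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

A call would then look like `translate(["..."], src="ru", tgt="lez", model_id="<checkpoint>")`, with the checkpoint name taken from the project's HuggingFace page.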
Lezgin Text-to-Speech
Trained a VITS-based TTS model on 30 hours of studio-recorded speech in collaboration with publicdictionary.org. It is one of the first TTS systems for the Lezgin language and is integrated directly into the translator.
Language corpus & datasets
Assembled the largest open Lezgin language corpus with 1,000+ volunteers:
| Dataset | Size | Description |
|---|---|---|
| Synthetic corpus | 200K sentences | Tagged via Gemini, used for Translator 2.0 |
| Manual annotations | 40K sentences | Community-validated by linguistic experts |
| Bible (Lezgin-Russian) | 13.8K parallel sentences | Largest parallel corpus |
| Lezgi Gazet Archives | 402 articles | News articles corpus |
| CNAL Lezgin-Russian | 762 entries | Literary translations |
| Lez Wiki | 4.4K articles | Wikipedia dump |
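Parallel data gathered from many sources usually needs a cleaning pass before fine-tuning. The sketch below shows one common approach: Unicode normalization, exact-duplicate removal, and a length-ratio filter. It is an illustration under assumed settings, not the project's documented pipeline. The function names and the `max_len_ratio` threshold are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (Lezgin sentence, Russian sentence)


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs: Iterable[Pair], max_len_ratio: float = 3.0) -> List[Pair]:
    """Drop empty, badly length-mismatched, and duplicate sentence pairs.

    max_len_ratio is an assumed threshold; tune it on real data.
    """
    seen: Set[Pair] = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # likely misaligned
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # duplicate up to case and whitespace
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Placeholder strings stand in for real sentences.
sample = [
    ("lez  text A", "ru text A"),
    ("lez text a", "RU text A"),         # duplicate of the first pair
    ("", "ru text B"),                   # empty source side
    ("x", "a much longer sentence"),     # length ratio too high
]
cleaned = clean_pairs(sample)
```

Only the first pair survives in this toy run. The first occurrence of a duplicate is kept, so ordering the input by source quality (for example, manual annotations before synthetic data) decides which copy wins.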
Multilingual embeddings
Fine-tuned multilingual-e5-large and LaBSE models for Lezgin language understanding and semantic search.
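Semantic search over sentence embeddings typically reduces to ranking passages by cosine similarity to the query. The sketch below shows that ranking step with a pluggable `embed` function. E5-family models expect `query: ` and `passage: ` prefixes on their inputs, while LaBSE does not, hence the `use_e5_prefixes` flag. The `toy_embed` function is a stand-in for a real encoder, and all names here are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[List[str]], List[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query: str,
    passages: List[str],
    embed: Embedder,
    top_k: int = 3,
    use_e5_prefixes: bool = True,
) -> List[Tuple[str, float]]:
    """Return the top_k passages ranked by cosine similarity to the query."""
    q_text = f"query: {query}" if use_e5_prefixes else query
    p_texts = [f"passage: {p}" if use_e5_prefixes else p for p in passages]
    q_vec = embed([q_text])[0]
    p_vecs = embed(p_texts)
    scored = [(p, cosine(q_vec, v)) for p, v in zip(passages, p_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_embed(texts: List[str]) -> List[Vector]:
    """Stand-in encoder: counts three vocabulary words. Not a real model."""
    vocab = ["cat", "dog", "fish"]
    return [[float(t.split().count(w)) for w in vocab] for t in texts]


results = search("cat cat", ["cat dog", "fish", "cat"], toy_embed, top_k=2)
```

With a real encoder, `embed` would wrap the model's batch encoding call and return one vector per input string. For larger collections, passage vectors would normally be computed once and stored rather than re-embedded per query.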
Impact
- 1,000+ volunteers contributing translations and validations
- First open-source NLP toolkit for the Lezgin language
- Telegram bot used by native speakers daily
- Lezgi Star award for cultural preservation through AI
- All models and datasets freely available on HuggingFace
- Also building Lekion, a professional network for the Lezgin community
Technologies
Python, PyTorch, HuggingFace Transformers, NLLB, VITS, LaBSE, Gemini, mT5