Lezghian ML — AI for Language Preservation

Founded a 1,000+ volunteer community preserving the Lezgin language through AI. Built the first Russian-Lezgin translator, TTS model, and open-source language corpus.

Tags: NLP, Translation, TTS, Language Preservation, HuggingFace, Python

Founded an international community of ML enthusiasts and linguists dedicated to preserving the Lezgin language — a UNESCO-classified vulnerable language spoken by ~800,000 people in Dagestan and Azerbaijan.

Honored with the Lezgi Star award for this work.

What we built

Russian ↔ Lezgin Translator 2.0

Fine-tuned the NLLB-200-distilled-600M model for translation between Russian and Lezgin in both directions. The 2.0 release was trained on a 200K-sentence synthetic corpus tagged via Gemini, which significantly improved quality over v1. Available as a Telegram bot and on HuggingFace.
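A minimal sketch of how a fine-tuned NLLB checkpoint like this could be called through HuggingFace Transformers. The checkpoint ID below is a placeholder, not the project's actual repository, and the `lez_Cyrl` language code is an assumption: stock NLLB-200 does not appear to ship a Lezgin code, so a fine-tune would have to introduce its own. The sentence splitter is a deliberately simple illustration.

```python
import re

# Hypothetical values: replace with the real checkpoint and language codes.
MODEL_ID = "your-org/nllb-200-distilled-600M-rus-lez"  # placeholder, not a real repo
SRC_LANG = "rus_Cyrl"  # NLLB code for Russian
TGT_LANG = "lez_Cyrl"  # assumed Lezgin code, presumably added during fine-tuning


def split_sentences(text: str) -> list[str]:
    """Split text after sentence-final punctuation so inputs stay short."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def translate(text: str, model_id: str = MODEL_ID,
              src_lang: str = SRC_LANG, tgt_lang: str = TGT_LANG) -> str:
    """Translate text sentence by sentence with a seq2seq checkpoint."""
    # Imported lazily so split_sentences works without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    sentences = split_sentences(text)
    if not sentences:
        return ""

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(sentences, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        # NLLB selects the output language via the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
        num_beams=4,
    )
    return " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Swapping `src_lang` and `tgt_lang` would give the reverse direction, assuming the checkpoint was trained on both.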

Lezgin Text-to-Speech

Trained a VITS-based TTS model on 30 hours of studio-recorded speech, in collaboration with publicdictionary.org. It is one of the first TTS systems for the Lezgin language and is integrated directly into the translator.
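For illustration, here is a sketch of VITS inference, assuming the checkpoint has been exported in the format that the Transformers `VitsModel` class loads. The source does not say which VITS implementation was used, so that is an assumption, and `model_id` is a hypothetical parameter the caller has to supply. The WAV writer uses only the standard library.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


def synthesize(text, model_id, out_path):
    """Synthesize speech for `text` and save it to `out_path`."""
    # Imported lazily so write_wav works without torch or transformers.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform[0]  # first item in the batch

    write_wav(out_path, waveform.tolist(), model.config.sampling_rate)
```

In a translator integration, `synthesize` would presumably be called on the Lezgin output of the translation step.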

Language corpus & datasets

Assembled the largest open Lezgin language corpus with 1,000+ volunteers:

| Dataset | Size | Description |
| --- | --- | --- |
| Synthetic corpus | 200K sentences | Tagged via Gemini, used for Translator 2.0 |
| Manual annotations | 40K sentences | Community-validated by linguistic experts |
| Bible (Lezgin-Russian) | 13.8K parallel sentences | Largest parallel corpus |
| Lezgi Gazet Archives | 402 articles | News articles corpus |
| CNAL Lezgin-Russian | 762 entries | Literary translations |
| Lez Wiki | 4.4K articles | Wikipedia dump |
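As a rough sketch of how parallel sentence pairs from several sources like these might be combined before training, the snippet below normalizes text, drops empty pairs, and removes duplicates. The record layout (`lez`, `rus`, `source`) and the function names are hypothetical and do not reflect the project's actual pipeline.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def merge_parallel(sources: dict[str, list[tuple[str, str]]]) -> list[dict]:
    """Merge (lezgin, russian) pairs from named sources.

    Pairs with an empty side are skipped; duplicates are detected
    case-insensitively and the first occurrence is kept.
    """
    seen = set()
    merged = []
    for name, pairs in sources.items():
        for lez, rus in pairs:
            lez, rus = normalize(lez), normalize(rus)
            if not lez or not rus:
                continue
            key = (lez.casefold(), rus.casefold())
            if key in seen:
                continue
            seen.add(key)
            merged.append({"lez": lez, "rus": rus, "source": name})
    return merged
```

Keeping the `source` field on every record makes it possible to weight or filter corpora later, for example to hold out one source for evaluation.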

Multilingual embeddings

Fine-tuned multilingual-e5-large and LaBSE models for Lezgin language understanding and semantic search.
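Semantic search with such embeddings can be sketched as ranking passages by cosine similarity to a query. The example below takes the embedding function as a parameter; `e5_embed` shows one possible backend using the public base model `intfloat/multilingual-e5-large` as a stand-in, since the ID of the fine-tuned Lezgin checkpoint is not given here. E5 models expect `query: ` and `passage: ` prefixes on their inputs; LaBSE does not use them, so those two lines would need adjusting for it.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, passages, embed, top_k=3):
    """Return the top_k (passage, score) pairs, best match first.

    `embed` maps a list of strings to a list of vectors.
    """
    q_vec = embed(["query: " + query])[0]
    p_vecs = embed(["passage: " + p for p in passages])
    scored = [(p, cosine(q_vec, v)) for p, v in zip(passages, p_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def e5_embed(texts, model_id="intfloat/multilingual-e5-large"):
    """One possible embedding backend (stand-in model ID, see note above)."""
    # Imported lazily so search() can be used with any embedding function.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(texts, normalize_embeddings=True).tolist()
```

A call such as `search("...", passages, e5_embed)` would then rank passages with the real model. In practice the model would be loaded once and reused rather than reloaded on every call.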

Impact

  • 1,000+ volunteers contributing translations and validations
  • First open-source NLP toolkit for the Lezgin language
  • Telegram bot used by native speakers daily
  • Lezgi Star award for cultural preservation through AI
  • All models and datasets freely available on HuggingFace
  • Also building Lekion — a professional network for the Lezgian community

Technologies

Python, PyTorch, HuggingFace Transformers, NLLB, VITS, LaBSE, Gemini, mT5