Unlike major global languages such as English and Mandarin, Dzongkha severely lacks large-scale digitized datasets, machine translation tools, and essential NLP resources.
Dzongkha has comparatively few speakers worldwide and is spoken almost exclusively in Bhutan, making this a uniquely Bhutan-specific challenge.
A significant gap exists between the traditional written orthography and the modern spoken form, making computational processing and modeling especially difficult.
The difficulty of learning Dzongkha, combined with the dominance of English in high-paying jobs, is driving Bhutanese youth to increasingly prefer English.
This linguistic shift threatens cultural preservation: much of Bhutan's manuscript heritage and scientific knowledge (such as Himalayan medicine) is written in Dzongkha.
Through His Majesty's digital economy initiative, many documents have been successfully digitized. However, they remain largely inaccessible to youth because they are difficult to search and read.
We aim to graduate Dzongkha from low-resource status to that of a stable, well-supported language, directly aligning with His Majesty's national vision for the preservation of culture and heritage.
Enabling seamless searchability across bilingual documents for foreign investors in the Gelephu Mindfulness City initiative.
Mapping topics and discovering patterns in traditional Himalayan medical knowledge that currently remain inaccessible.
Providing an English-based interface to navigate digitized Dzongkha documents, aligning with His Majesty's vision of cultural preservation.
Collect and filter ~700,000 semantically parallel Dzongkha-English sentence pairs.
Initialize a ~100M-parameter encoder-only model that projects into a 1024-dimensional embedding space.
Pretrain with Masked Language Modeling on large generic Dzongkha and English corpora to build basic semantics.
Apply contrastive learning (InfoNCE loss) on the 700k pairs for search optimization.
Align the model with deployment use cases using English search keywords and Dzongkha documents.
Apply Matryoshka Representation Learning so embeddings can be truncated to smaller dimensions (usable with ClickHouse QBit).
Validate cross-language semantic alignment via recall scores and UMAP visualization.
Fine-tune weights for document-level mapping and full long-context retrieval capabilities.
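The contrastive step above can be sketched with a symmetric InfoNCE loss. The function below is an illustrative NumPy version: `en_emb` and `dz_emb` stand for batches of English and Dzongkha sentence embeddings, and the 0.05 temperature is an assumed default, not a detail from our pipeline.

```python
import numpy as np

def info_nce_loss(en_emb, dz_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of parallel sentence embeddings.

    The English embedding at index i treats the Dzongkha embedding at
    index i as its positive; all other in-batch pairs act as negatives.
    """
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    dz = dz_emb / np.linalg.norm(dz_emb, axis=1, keepdims=True)
    logits = en @ dz.T / temperature  # (B, B) scaled cosine similarities

    def xent_diag(l):
        # cross-entropy with the matching pair (diagonal) as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logprobs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprobs))

    # average over both retrieval directions (en->dz and dz->en)
    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```

In training, a framework's cross-entropy op would replace the manual log-softmax, but the structure (similarity matrix, diagonal targets, both directions averaged) is the same.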
Take an English or Dzongkha sentence and tokenize it into token IDs with a Unigram LM tokenizer (potentially via SentencePiece).
Feed the token IDs into the model’s input layer and run it through the deep layers of the transformer.
At the final layer, we take the vector assigned to each token and average (pool) them.
The pooled output is a single point, a vector, in high-dimensional space.
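The pooling step above can be sketched as mean pooling over non-padding tokens. The names `token_states` and `attention_mask` are illustrative, assuming the model emits one vector per token plus a mask marking real tokens.

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average the final-layer token vectors, ignoring padding positions.

    token_states:   (seq_len, hidden) final-layer outputs for one sentence
    attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=0)
    count = np.maximum(mask.sum(), 1e-9)  # avoid division by zero
    return summed / count                 # one vector for the whole sentence
```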
Once trained, the model is deployed for search, clustering, and topic modeling on Dzongkha documents. Here is how a query is processed in real-time.
Run the English query through the model (e.g. "medicinal herbs") to get its high-dimensional point in space.
Run the target Dzongkha documents through the model to get the spatial points for those documents.
Sort the documents by their vector distance to the query point to get an ordered list from closest to furthest.
Display the retrieved documents. The closest documents are the top results, and the furthest are the least related at the bottom.
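The retrieval loop above can be sketched as follows, assuming cosine similarity as the distance measure (any vector distance would work the same way); `query_vec` and `doc_vecs` are illustrative names for embeddings produced by the model.

```python
import numpy as np

def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q           # cosine similarity of each document to the query
    return np.argsort(-sims)  # highest similarity (closest) first
```

At production scale, an approximate nearest-neighbor index would replace this brute-force scan, but the ordering logic is identical.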