DzoSEM

A sentence embedding model that learns a shared vector space for Dzongkha-English cross-lingual retrieval.

The Challenge

Core Issue

Low-Resource Status

Unlike major global languages such as English and Mandarin, Dzongkha severely lacks large-scale digitized datasets, machine translation tools, and essential NLP resources.

Demographics
600K

Speakers globally, making this a uniquely Bhutanese challenge.

Linguistics

Orthographic Divergence

A significant gap exists between the traditional written orthography and the modern spoken form, making computational processing and modeling especially difficult.

Socioeconomics

Economic Pressures

The inherent difficulty of learning Dzongkha, combined with the dominance of English in high-paying jobs, is driving Bhutanese youth to increasingly prefer English.

Heritage

Cultural Threat

This linguistic shift threatens cultural preservation: the majority of Bhutan's manuscripts, heritage records, and scientific knowledge (such as traditional Himalayan medicine) are written in Dzongkha.

The Paradox

Digitized, Yet Inaccessible

Through His Majesty's digital economy initiative, many documents have been successfully digitized. However, they remain largely inaccessible to youth because they are difficult to search and difficult to read.

The Mission

Graduating the Language

We aim to graduate Dzongkha from low-resource status to a stable, well-resourced language, directly aligning with His Majesty's national vision for the preservation of culture and heritage.

The Solution & Impact

Core Innovation

A bilingual sentence embedding model optimized for cross-language retrieval and clustering.

GMC Investment

Enabling seamless searchability across bilingual documents for foreign investors in the Gelephu Mindfulness City initiative.

Medical Knowledge

Mapping topics and discovering patterns in traditional Himalayan medical knowledge that currently remain inaccessible.

Youth Access

Providing an English-based interface to navigate digitized Dzongkha documents, aligning with His Majesty's vision of cultural preservation.

Training Roadmap

01

Data Collection

Collect & filter ~700,000 Dzongkha-English pairs with parallel semantic meaning.
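One common filter for noisy parallel data is a character-length-ratio check: genuine translations rarely differ wildly in length. The sketch below illustrates the idea only; the 3.0 threshold and the sample pairs are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal parallel-corpus filter: drop pairs whose character-length ratio
# is implausible for a true translation. Threshold and examples are
# illustrative assumptions.

def plausible_pair(dz: str, en: str, max_ratio: float = 3.0) -> bool:
    """Keep a Dzongkha-English pair only if neither side is empty and
    their character lengths are within max_ratio of each other."""
    if not dz.strip() or not en.strip():
        return False
    a, b = len(dz), len(en)
    return max(a, b) / min(a, b) <= max_ratio

pairs = [
    ("བཀྲ་ཤིས་བདེ་ལེགས།", "Hello and good luck."),  # plausible pair
    ("བཀྲ་ཤིས།", "x"),                              # suspiciously short English side
]
filtered = [p for p in pairs if plausible_pair(*p)]
```

In practice this would be combined with deduplication and an embedding-based semantic-similarity filter.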

02

Architecture Setup

Initialize a ~100M-parameter encoder-only model projecting into a 1024-dimensional embedding space.
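To see why such a configuration lands near 100M parameters, here is a back-of-the-envelope count for a BERT-base-style encoder. Every hyperparameter here (32K vocabulary, hidden size 768, 12 layers, 3072 feed-forward width) is an assumption for illustration; the source specifies only the ~100M total and the 1024-dimensional output space.

```python
# Rough parameter count for an assumed BERT-base-style encoder config.
vocab, hidden, layers, ffn, out_dim = 32_000, 768, 12, 3_072, 1_024

embeddings = vocab * hidden                      # token embedding table
per_layer = (
    4 * hidden * hidden + 4 * hidden             # Q, K, V, O projections + biases
    + 2 * hidden * ffn + ffn + hidden            # feed-forward up/down + biases
    + 4 * hidden                                 # two LayerNorms (scale + bias)
)
projection = hidden * out_dim + out_dim          # final 768 -> 1024 projection

total = embeddings + layers * per_layer + projection
print(f"~{total / 1e6:.0f}M parameters")
```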

03

Pre-training (MLM)

Masked language modeling on large generic Dzongkha and English corpora to build foundational semantics.
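The MLM objective hides a fraction of token IDs and trains the model to recover them. A toy sketch of the masking step, where the token IDs, the hypothetical [MASK] id of 4, and the 30% demo rate (standard practice is ~15%) are all illustrative assumptions:

```python
import random

MASK_ID = 4  # hypothetical [MASK] token id

def mask_tokens(ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels): labels hold the original id at masked
    positions and -100 (ignored by the loss) everywhere else."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels

ids = [101, 2009, 317, 5820, 44, 102]
# Higher mask_prob than the usual 15%, just so the tiny demo masks something.
masked, labels = mask_tokens(ids, mask_prob=0.3, rng=random.Random(1))
```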

04

Fine-tuning

Apply contrastive learning (InfoNCE loss) on the 700k pairs for search optimization.
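InfoNCE treats each (Dzongkha, English) pair in a batch as a positive and the other in-batch English sentences as negatives: each Dzongkha vector should score highest against its own translation. A minimal pure-Python sketch, with toy 2-d embeddings and a 0.05 temperature as illustrative assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(dz_vecs, en_vecs, temperature=0.05):
    """Mean cross-entropy of matching each dz_vecs[i] to en_vecs[i]
    against all in-batch English vectors."""
    loss = 0.0
    for i, dz in enumerate(dz_vecs):
        logits = [cosine(dz, en) / temperature for en in en_vecs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(dz_vecs)

# Aligned pairs give a much lower loss than shuffled (mismatched) pairs.
dz = [[1.0, 0.0], [0.0, 1.0]]
en_aligned = [[0.9, 0.1], [0.1, 0.9]]
en_shuffled = [en_aligned[1], en_aligned[0]]
loss_good = info_nce(dz, en_aligned)
loss_bad = info_nce(dz, en_shuffled)
```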

05

Post-training

Align the model with deployment use cases using English search keywords and Dzongkha documents.

06

MRL Optimization

Apply Matryoshka Representation Learning so embeddings can be truncated to smaller dimensions with little accuracy loss (usable with ClickHouse QBit).
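The inference-time payoff of MRL is that an embedding can simply be cut to a prefix of its dimensions and re-normalized, trading a little accuracy for storage and speed. A sketch of that truncation step; the 8-dim vector and 4-dim target are toy assumptions (a real model might go 1024 to 256):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
small = truncate_embedding(full, 4)
```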

07

Evaluation

Validate cross-language semantic alignment via recall scores and UMAP visualization.
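Recall@k, the main retrieval metric here, asks: for each English query, does its gold Dzongkha document appear in the top-k results? A sketch with illustrative ranked lists as assumptions:

```python
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    """Fraction of queries whose gold document id is in the top-k ranking."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids_per_query, gold_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)

ranked = [
    [3, 1, 2],  # query 0: gold doc 3 ranked first
    [2, 0, 1],  # query 1: gold doc 1 ranked third
]
gold = [3, 1]
```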

08

Document Tuning

Fine-tune weights for document-level mapping and full long-context retrieval capabilities.

System Architecture

Phase 1: Inference Pipeline
Step 1

Tokenization

Take an English or Dzongkha sentence and tokenize it into token IDs using a Unigram language-model tokenizer (potentially via SentencePiece).

Step 2

Input Layer

Feed the token IDs into the model's embedding layer and run them through the transformer's stack of encoder layers.

Step 3

Pooling

At the final layer, the hidden state of each token is read out and pooled, typically by averaging (mean pooling).
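Mean pooling averages only the real tokens' final-layer states, skipping padding positions. A sketch with toy 2-d hidden states and an attention mask as illustrative assumptions:

```python
def mean_pool(hidden_states, attention_mask):
    """hidden_states: list of per-token vectors; attention_mask: 1 = real
    token, 0 = padding. Returns the average vector over real tokens."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [x / count for x in total]

states = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last position is padding
sentence_vec = mean_pool(states, [1, 1, 0])
```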

Step 4

Vectorization

The pooled output is a single vector: one point in a high-dimensional space.

The Core Objective

Ideally, the same sentence in English and Dzongkha lands at (nearly) the same point in this space, so the model represents equivalent concepts identically in both languages.
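"Same point" is usually measured with cosine similarity: a well-aligned bilingual pair should score near 1.0, while an unrelated sentence scores far lower. The vectors below are toy assumptions standing in for real model outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

en_vec = [0.70, 0.70, 0.10]   # hypothetical embedding of an English sentence
dz_vec = [0.68, 0.72, 0.12]   # its Dzongkha translation (nearly same direction)
other  = [0.05, -0.90, 0.40]  # an unrelated sentence

sim_pair = cosine(en_vec, dz_vec)
sim_other = cosine(en_vec, other)
```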

Phase 2: Deployment & Retrieval

Search & Discovery

Once trained, the model is deployed for search, clustering, and topic modeling on Dzongkha documents. Here is how a query is processed in real time.

1

Embed the Query

Run the English query (e.g. "medicinal herbs") through the model to get its high-dimensional point in space.

2

Embed Documents

Run the target Dzongkha documents through the model to get the spatial points for those documents.

3

Sort by Distance

Sort the documents by their vector distance to the query point to get an ordered list from closest to furthest.

4

Ranked Results

Display the retrieved documents. The closest documents are the top results, and the furthest are the least related at the bottom.
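The four steps above can be sketched end to end: embed the query, embed the documents, rank by cosine distance. The document names and 2-d embeddings are toy assumptions; in a real deployment both sides would come from the same DzoSEM encoder.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

query_vec = [0.9, 0.1]            # e.g. the query "medicinal herbs"
documents = {
    "doc_herbs":   [0.85, 0.15],  # closely related Dzongkha document
    "doc_weather": [0.10, 0.90],  # unrelated document
    "doc_plants":  [0.70, 0.40],  # somewhat related document
}

# Step 3: sort document ids by distance to the query point.
ranked = sorted(documents, key=lambda d: cosine_distance(query_vec, documents[d]))
# Step 4: ranked[0] is the top result, ranked[-1] the least related.
```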