Making complex information accessible with Mayura

In the diverse linguistic landscape of India, English to Indic language translations play a crucial role in knowledge sharing and consumption. While translation technologies have existed for years, a significant gap has persisted between formal translations and the way Indians actually communicate in their daily lives, and as a result much knowledge is lost in translation. Existing translation models have long struggled with the nuances of colloquial language, regional expressions, and the unique phenomenon of code-mixing that characterizes Indian multilingualism.
In India, where most people are bilingual, spoken language often mixes words from English and regional languages. Colloquial language differs vastly from formal language in the Indian context and varies across dialects. Existing translation models, trained primarily on formal sources like newspapers and books, don't represent how Indians actually communicate. Sarvam's translation model is trained on real-world formal and colloquial communication data, making translations closer to everyday speech.
Conventional models fall short in several aspects:
- Everyday Relevance: They focus on standardized language, while most real-world interactions occur in casual, colloquial settings.
- Linguistic Flexibility: Rigid adherence to grammatical rules fails to capture the fluid structures of natural speech. Faced with complex sentences, these models cannot simplify them, so the resulting translations in Indian languages are often inaccurate.
- Vocabulary Richness: Limited "proper" vocabulary excludes slang, dialects, and code-mixing that color genuine conversations.
- Cultural Nuance: Stripping away cultural context (e.g., the difference between formal and spoken Tamil) results in technically correct but culturally tone-deaf translations.
- Authenticity: Pursuit of "universal" pronunciation often comes at the cost of regional accents and variations that give language its character.
- Gender: Most Indic languages are gendered, unlike English. Translations need to represent these gender differences.
These shortcomings have real-world consequences. Social media conversations, local news, and person-to-person exchanges – the fabric of daily digital life – remain largely inaccessible across language barriers. This hinders personal communication and limits the reach and effectiveness of critical information dissemination, e-commerce, and digital services.
Sarvam AI's new translation model aims to address these challenges by embracing colloquial language and code-mixing. It represents a significant step towards making translation technology truly reflective of how people communicate in their daily lives. This approach has the potential to expand the reach of digital services, enhance cross-cultural understanding, and make the internet's vast resources more accessible to a broader audience.
How we built Sarvam Translate
Sarvam Translate was developed with a practical, application-first approach. Conventional translation models often fail to reduce the information asymmetry that arises when translating from English to Indian languages, particularly for specialized or context-rich content. We extended the model further by training it on code-mixed, colloquial Indic data, so that people can consume knowledge in the language they actually speak.
Our approach to building this model was multi-faceted:
- Diverse Data Collection: We gathered a wide variety of data from high-quality sources, including conversational text, domain-specific and technical documents, and narrative text, to cover a diverse range of linguistic contexts and use cases.
- Real-World Language Patterns: We focused on understanding how people actually communicate in the 10 supported languages in formal and colloquial settings. This meant recognizing and incorporating code-mixing patterns, where English words (particularly difficult verbs, technical nouns, and domain-specific terms) are naturally integrated into Indian language speech.
- Contextual Awareness: We used Indian context data to accurately handle the complexities of second and third person respectful forms in various languages. This ensures that translations maintain appropriate levels of formality and respect based on the context of the sentence.
- Gender Sensitivity: Recognizing that English is often gender-neutral in the first person, we developed our model to support appropriate gendering in Indian languages where it's grammatically necessary.
This approach allows Sarvam Translate to produce translations that not only convey the meaning of the original English text but do so in a way that sounds natural and familiar to Indian language speakers.
Real life applications of Sarvam Translate
1. Conversational Translation
Create LLM-powered voice chatbots that are colloquial and preserve a consistent voice:
Translating LLM outputs into Indic languages to create bots that can communicate in the language of your customers.
Challenge: First, LLMs generate responses that suit the written word rather than spoken English, for example bullets and long, multi-clause sentences. Second, English does not mark gender in the first person, which makes it challenging to adjust salutations and keep gender consistent over the course of a real-time conversation.
Sarvam Approach: First, the simplification approach in Sarvam Translate converts this LLM text into a spoken Indic language format that sounds natural when read aloud. This includes appropriate translation of filler words, slang, proverbs, and similar expressions. Second, it provides a gender toggle for first-person translations, so that the gender of the translated output stays consistent throughout the conversation.
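To make the "written versus spoken" gap concrete, here is a minimal pre-processing sketch that flattens bullets and numbered items in LLM output into plain sentences before the text is translated or read aloud. It is an illustration only and not how Sarvam Translate works internally; the function name `to_spoken_text` and the bullet pattern are assumptions made for this example.

```python
import re

# Leading list markers: "-", "*", "•", or "1." / "1)" followed by whitespace.
# Hypothetical pattern for illustration; real LLM output may need more cases.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def to_spoken_text(llm_output: str) -> str:
    """Flatten bullets and numbered items into plain sentences."""
    sentences = []
    for line in llm_output.splitlines():
        line = BULLET.sub("", line).strip()
        if not line:
            continue  # skip blank lines between list items
        # Close each fragment with punctuation so a voice engine pauses there.
        if line[-1] not in ".!?:":
            line += "."
        sentences.append(line)
    return " ".join(sentences)

reply = "Here are your options:\n- Check balance\n- Block card\n\n1. Talk to an agent"
print(to_spoken_text(reply))
```

A step like this only removes visual structure. Splitting multi-clause sentences and choosing colloquial wording still has to happen in the translation model itself.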
2. Domain-specific, technically complex document translation
Most real-world, domain-specific documents are complex and dense with technical terminology, which makes it hard to translate them into other languages while preserving their meaning. The fundamental requirement is to translate specialized content, usually written in long, multi-clause sentences, while preserving accuracy, context, and often the technical jargon. This involves not just linguistic translation, but also cultural adaptation and domain-specific expertise.
Legal Content:
Translating intricate legal terminology into clear, accurate Indic language versions.
Challenge: While 30-word sentences with multiple clauses and subclauses are common in legal documents, Indic languages tend to favor shorter, simpler sentences. Conventional translation models produce literal, word-by-word translations of such complex paragraphs, which makes the output hard to follow.
Approach: Sarvam Translate breaks down these complex English syntactic structures and translates them into Indic languages, resulting in shorter sentences that feel natural to read.
Scientific and Technical Content:
Creating educational and scientific content in all major Indian languages by converting complex mathematical, scientific, or technical material into accessible Indic language formats.
Challenge: Most scientific and technical documentation contains many technical terms, inline code, and equations or formulas that need to be retained as is after translation.
Approach: Sarvam Translate employs a dual-stream architecture where one stream processes the textual content while the other analyzes and preserves formatting elements (HTML tags, Markdown syntax, code blocks etc.). These streams are merged post-translation, ensuring that the structural and stylistic elements of the original text are maintained in the output.
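The idea of keeping formatting separate from translatable text can be sketched with a simple mask-and-restore pass. This is a deliberately simplified stand-in, not the dual-stream architecture described above: protected spans are swapped for placeholders, the remaining text is translated, and the spans are put back. The `translate` callable, the `__SPAN_n__` placeholder format, and the pattern list are all assumptions for illustration.

```python
import re
from typing import Callable

# Spans that should pass through translation untouched.
# Illustrative pattern list only; a real system would cover many more cases.
PROTECTED = re.compile(
    r"```.*?```"        # fenced code blocks
    r"|`[^`\n]+`"       # inline code
    r"|<[^>]+>"         # HTML tags
    r"|https?://\S+",   # URLs
    re.DOTALL,
)

def translate_preserving_format(text: str, translate: Callable[[str], str]) -> str:
    """Mask protected spans, translate the rest, then restore the spans."""
    saved: list[str] = []

    def mask(match: re.Match) -> str:
        saved.append(match.group(0))
        return f"__SPAN_{len(saved) - 1}__"

    masked = PROTECTED.sub(mask, text)
    translated = translate(masked)  # hypothetical translation function
    # Put each original span back where its placeholder ended up.
    return re.sub(r"__SPAN_(\d+)__", lambda m: saved[int(m.group(1))], translated)

# Stand-in "translator" that upper-cases text, to show what is left untouched.
src = "Run `pip install x` and see <b>docs</b> at https://example.com/a now"
print(translate_preserving_format(src, str.upper))
```

This sketch assumes the translator returns the placeholders unchanged. A production system would need to verify that, since a model can drop or reorder them.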
Government Communication
Translating government notices and bills into multiple Indian languages, enhancing access to crucial information.
Challenge: Translations must use formal language and remain accurate and consistent across all languages. While good formal translation models might exist for languages like Hindi, the performance of these models degrades rapidly as you move to languages like Odia.
Approach: We have made sure to train on data for resource-constrained languages like Odia. This helps keep the performance of our models consistently good across all supported languages, so that speakers of less-resourced languages are not left behind.
Medical Communication
Translating medical instructions, prescriptions, and healthcare information for patients who are more comfortable in their native language is critical for public health.
Challenge: Translating medical content involves handling domain-specific terms and proper nouns, such as medicine names, which require code-mixing even in formal translation. It is crucial to retain these commonly used terms and names, and to keep dosages and units consistent.
Sarvam's Approach: In the critical field of healthcare, Sarvam Translate's ability to retain technical terms, proper nouns, and ensure consistency in dosages and units is not just a feature – it's a lifesaving necessity. It transforms complex medical instructions into clear, accurate translations that patients can easily understand in their native language.
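One way a downstream application could guard against dosage errors is to check that every number and unit in the English source also appears in the translation. The sketch below is a hypothetical post-translation check, not part of Sarvam Translate. It assumes digits and unit abbreviations stay in Latin script in the output, and the unit list is a small illustrative sample rather than a clinical vocabulary.

```python
import re

# A number with an optional unit, e.g. "500 mg", "2.5ml", or a bare "5".
QUANTITY = re.compile(
    r"([0-9]+(?:\.[0-9]+)?)\s*(mg|mcg|ml|g|kg|iu)?\b", re.IGNORECASE
)

def quantities(text: str) -> list[tuple[str, str]]:
    """Return the sorted (number, unit) pairs found in the text."""
    return sorted((num, unit.lower()) for num, unit in QUANTITY.findall(text))

def dosages_preserved(source: str, translated: str) -> bool:
    """True when both texts contain exactly the same quantities."""
    return quantities(source) == quantities(translated)

source = "Take 500 mg twice a day for 5 days"
candidate = "500 mg दिन में दो बार 5 दिन तक लें"
print(dosages_preserved(source, candidate))
```

A check like this flags a translation for review when a quantity changes. It does not judge whether the surrounding instructions were translated correctly.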
3. Content localization for promotional and advertising messages
Brands operating in India often struggle to reach beyond English-speaking audiences and to connect with the rest of their consumer base as effectively and consistently as they do in English. Translating promotional content while maintaining brand voice and local relevance becomes critical.
Challenge: Promotional messages often have a colloquial, casual tone, with many emojis and specific spacing. They also include URLs, discount codes, and numbers. Brands need to retain this formatting and tone to preserve their unique voice as well as the call to action.
Approach: Sarvam Translate retains the formatting, spacing, special entities like URLs and codes, and emojis to ensure the translated content maintains the original style and effectiveness.
News
Translating global news articles into Indian languages while keeping the expression, urgency, and facts intact is key. With a growing base of young readers, it also becomes imperative to match the language of news to the evolving linguistic needs of this audience.
Challenge: Producing formal translations while incorporating code-mixing and retaining proper nouns is complex. Formal translations must also stay close to real-life formal communication and avoid archaic words that are no longer understood.
Sarvam's Approach: Sarvam Translate integrates code-mixed patterns and preserves proper nouns, so that news delivery matches the evolving communication and comprehension patterns of the country.
The Path Forward
Sarvam Translate represents a significant step towards making India an equal participant in the global knowledge economy. By breaking down language barriers, we're not just translating content – we're opening doors to opportunities, education, and global connectivity for millions of Indians.
As we continue to refine and expand Sarvam Translate, we invite developers, businesses, and content creators to join us in this mission. Together, we can ensure that language is no longer a barrier to learning, growth, and success in the digital age.
Welcome to a future where every Indian, regardless of their linguistic background, can access, understand, and contribute to the world's knowledge.