Introducing Sarvam-M, A hybrid Indic model
Explorations in Post-Training and Inference Optimizations for a Hybrid Indic LLM
Download the model from Hugging Face, try it on our playground, and build with our APIs.
As part of our effort towards building a sovereign AI ecosystem in India, we plan to release models regularly and share our detailed technical findings. This is the first in that series of technical blogs, and it covers our findings on post-training. We look forward to hearing feedback and suggestions for collaboration.
In this blog, we share our explorations in post-training and inference optimization of an open pre-trained model to create a cutting-edge hybrid reasoning model specialized for Indic languages. The blog covers three broad steps: (i) supervised fine-tuning (SFT), (ii) reinforcement learning with verifiable rewards (RLVR), and (iii) inference optimizations.
On SFT, we detail how we (a) curate a diverse set of prompts with quality and hardness scoring, clustering, and sampling, (b) create prompt completions from permissible models, filtered by our custom scoring process, (c) perform character training by debiasing completions that carry a political slant and re-biasing them towards culturally relevant outputs, and (d) build a curriculum to train a hybrid model with both 'non-think' and 'think' modes, sketched below.
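To make the hybrid 'think'/'non-think' idea concrete, here is a minimal sketch of how a single SFT example could be formatted for either mode. This is our illustration rather than the exact Sarvam pipeline: the control strings (/think, /no_think) and the <think> tags are assumptions, since the exact tokens are not specified here.

```python
# Minimal sketch (illustrative, not the actual pipeline): formatting one SFT
# example in either 'think' or 'non-think' mode. The control strings and tag
# names below are assumptions.

def build_sft_example(prompt: str, reasoning: str, answer: str, mode: str) -> dict:
    """Return a chat-style training example for the chosen mode."""
    if mode == "think":
        # Think mode: the target contains the chain of thought wrapped in
        # tags, followed by the final answer.
        system = "/think"
        completion = f"<think>\n{reasoning}\n</think>\n{answer}"
    elif mode == "non_think":
        # Non-think mode: same prompt, but the target is the answer alone.
        system = "/no_think"
        completion = answer
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    }

# The same (prompt, reasoning, answer) triple can appear in the curriculum
# under both modes, teaching a single hybrid model to switch behaviors.
example_think = build_sft_example("2 + 2 = ?", "Two plus two is four.", "4", "think")
example_fast = build_sft_example("2 + 2 = ?", "Two plus two is four.", "4", "non_think")
```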
On RLVR, we outline (a) a curriculum that trains on a series of datasets combining instruction following, math, and programming, (b) prompt sampling strategies based on different hardness proxies, (c) custom reward engineering across the different tasks, and (d) hyper-parameter settings for GRPO, the algorithm we choose. We apply our SFT and RLVR steps to Mistral Small, a 24B-parameter model released under the Apache 2.0 license.
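As a concrete illustration of RLVR with GRPO, the sketch below pairs a simple verifiable reward (exact match on the final number in a math completion) with GRPO's group-relative advantage, where each completion's reward is standardized against the other completions sampled for the same prompt, so no learned value function is needed. The function names and the exact-match reward are simplifying assumptions, not the actual reward engineering used for training.

```python
# Minimal sketch (illustrative): a verifiable reward for math prompts plus the
# group-relative advantage used by GRPO. Names and the reward rule are assumptions.
import re
import numpy as np

def math_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 if the last number in the completion matches the gold answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO: for a group of completions sampled from the same prompt, each
    completion's advantage is its reward standardized by the group's mean and
    standard deviation (no critic / value model)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled completions for one prompt; only two end with the right answer.
completions = ["... so the answer is 42", "... giving 41", "answer: 42", "no idea"]
rewards = [math_reward(c, "42") for c in completions]
print(grpo_advantages(rewards))  # positive for correct completions, negative otherwise
```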
The resultant model, which we call Sarvam-M (M stands for Mistral), significantly improves on the base model, with large relative gains: +20% average improvement on Indian language benchmarks, +21.6% on math benchmarks, and +17.6% on programming benchmarks. The gains on tasks at the intersection of Indian languages and math are even higher, e.g., +86% on a romanized Indian-language GSM-8K benchmark. On most benchmarks, Sarvam-M outperforms Llama-4 Scout, and it is comparable to larger dense models like Llama-3.3 70B and to models like Gemma 3 27B that are pre-trained on significantly more tokens. One area where we leave room for improvement is English knowledge benchmarks such as MMLU, where Sarvam-M drops by about 1 percentage point relative to the baseline model. The key results are shown in the table below, and we discuss these numbers further in the results section (note: Sarvam-M numbers are with think mode enabled, unless specified otherwise).