Launched by Sarvam AI, Sarvam 1 LLM is Trained in English and Ten Indic Languages


On October 24, Sarvam AI, an artificial intelligence (AI) firm backed by Lightspeed, unveiled Sarvam 1, a Large Language Model (LLM). In a post on X (formerly Twitter), the company said it is India's first indigenous multilingual LLM, trained from scratch on domestic AI infrastructure in ten Indian languages and English.

In addition to English, Sarvam 1 supports ten major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The LLM is a two-billion-parameter language model trained on Nvidia H100 Graphics Processing Units (GPUs).

Sarvam AI uses Nvidia services and AI4Bharat's open-source technology

To optimise and deploy conversational AI agents with sub-second latency, Sarvam AI also uses a range of Nvidia products and services, including its microservices, conversational AI tools, LLM software, and inference server.

In addition to Nvidia, the LLM was built using AI4Bharat's open-source technology and language resources, along with Yotta's data centres for computational infrastructure. According to a blog post by the AI startup, Sarvam-1's strong performance and computational efficiency make it especially well suited for real-world uses, such as deployment on edge devices.

Specifically, the company stated that Sarvam 1 clearly outperforms Gemma-2-2B and Llama-3.2-3B on a number of standard benchmarks, including MMLU, ARC-Challenge, and IndicGenBench, while achieving results comparable to Llama 3.1 8B.

How the Company's Earlier and Latest LLMs Compare

The AI firm introduced India's first Hindi LLM, OpenHathi, in December 2023. That model was built on Meta AI's Llama2-7B architecture, with its vocabulary extended by 48,000 tokens. Sarvam 1, by contrast, was trained on a corpus of two trillion tokens.

The LLM's two trillion tokens of synthetic Indic data were produced by an efficient tokeniser and a custom data pipeline that can generate diverse, high-quality text while preserving factual correctness. Sarvam also claimed that the latest model from its stable matches or surpasses much larger models such as Llama 3.1 8B while being four to six times faster during inference.

In artificial intelligence, inference is the process by which a trained model makes predictions or draws conclusions from fresh data using the patterns it learned during training. Compared with existing Indic datasets, the company's pretraining corpus, Sarvam-2T, contains eight times as much scientific material, documents that are twice as long, and data of three times higher quality. Sarvam-2T holds around two trillion Indic tokens in total. Apart from Hindi, which makes up over 20% of the data, the data is distributed nearly evenly among the ten supported languages.
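The training-versus-inference distinction can be sketched with a toy example. The snippet below is purely illustrative and has nothing to do with Sarvam's actual architecture: it "trains" a bigram next-word model by counting word pairs, then runs "inference" by applying those learned counts to a new input word.

```python
# Toy sketch of training vs. inference (illustrative only; not Sarvam's method).
from collections import Counter, defaultdict

def train(corpus):
    """Training: learn patterns (here, bigram counts) from data."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def infer(model, word):
    """Inference: apply the learned patterns to fresh input."""
    followers = model.get(word)
    if not followers:
        return None  # word never seen during training
    return followers.most_common(1)[0][0]

model = train([
    "the model predicts the next word",
    "the model learns patterns",
])
print(infer(model, "the"))  # prints "model" (its most frequent follower)
```

Real LLM inference works on the same principle at vastly larger scale: parameters fixed during training are applied to new prompts, which is why inference speed (where Sarvam claims a four-to-six-times advantage) matters so much for deployment.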

