India Advances Multilingual AI to Bridge Language Gaps in Technology

India’s government-backed Bhashini initiative is building multilingual artificial intelligence (AI) systems to address the dominance of English in AI training data and expand access to AI technologies across the country’s diverse language landscape. Since its launch in July 2022 under the Ministry of Electronics and Information Technology, Bhashini has created over 350 open-source AI models and 4,500 language training datasets covering more than 22 Indic languages.

Building AI for Indic Languages

Bhashini’s approach involves assembling large volumes of speech, text, and translation data for various languages and dialects, a process CEO Amitabh Nag describes as “stitching together by brute force.” This data supports AI models for voice-first applications and neural machine translation, which are already integrated into multiple government services.

With only about 10% of India’s 1.4 billion people proficient in English, the predominance of English in AI training data creates a significant language gap. This disparity limits the reach and utility of AI tools, disproportionately affecting non-English speakers, including rural populations and tribal communities.

Community-Driven Data Collection

Bhashini takes a community-based approach through initiatives like BhashaDaan, which crowdsources language data such as texts, voice samples, and linguistic knowledge from Indian citizens. This civic participation lets speakers directly shape AI development for their native languages. The initiative also partners with linguistic experts and academic institutions to ensure the datasets are curated responsibly and ethically, without unauthorized web scraping.

Private Sector and Global South Efforts

Indian startups like Sarvam and Krutrim complement government efforts by building full-stack AI systems that function across multiple Indian languages. These companies develop foundational AI models and conversational agents that reflect the country’s linguistic diversity.

Similar multilingual AI initiatives exist across the Global South. In Africa, organizations such as Lelapa AI, Masakhane, and the African Language Lab are generating language datasets and training AI models tailored to African languages by collecting regional speech and text data. These projects highlight the global movement to create inclusive AI that respects cultural contexts.

Why it matters

The dominance of English-based AI systems risks deepening digital divides by limiting access for the majority of people who do not speak English. By developing multilingual AI models that support India’s numerous languages and dialects, initiatives like Bhashini enable broader, more equitable access to digital services. This approach not only improves usability for underserved populations but also helps ensure AI technologies are culturally relevant and practically useful across diverse communities.

Read more Artificial Intelligence stories on Goka World News.

Sources

This article is based on reporting and publicly available information from the following source:

techpolicy.press


Aisha Rahman
