India Advances Multilingual AI to Bridge Language Gaps in Technology

India’s government-backed Bhashini initiative is building multilingual artificial intelligence (AI) systems to address the dominance of English in AI training data and expand access to AI technologies across the country’s diverse language landscape. Since its launch in July 2022 under the Ministry of Electronics and Information Technology, Bhashini has created over 350 open-source AI models and 4,500 language training datasets covering more than 22 Indic languages.

Building AI for Indic Languages

Bhashini’s approach involves assembling large volumes of speech, text, and translation data across languages and dialects, a process CEO Amitabh Nag describes as “stitching together by brute force.” This data supports the development of AI models for voice-first applications and neural machine translation, which are already integrated into multiple government services.

Only about 10% of India’s 1.4 billion people (roughly 140 million) are proficient in English, so the predominance of English in AI training data creates a significant language gap. This disparity limits the reach and usefulness of AI tools and disproportionately affects non-English speakers, including rural populations and tribal communities.

Community-Driven Data Collection

Bhashini prioritizes a community-based methodology through initiatives like BhashaDaan, which crowdsources language data such as texts, voice samples, and linguistic knowledge from Indian citizens. This civic participation lets contributors directly shape AI development for their native languages. The initiative also partners with linguistic experts and academic institutions to ensure the datasets are curated responsibly and ethically, without relying on unauthorized web scraping.

Private Sector and Global South Efforts

Indian startups like Sarvam and Krutrim complement government efforts by building full-stack AI systems that function across multiple Indian languages. These companies develop foundational AI models and conversational agents designed to reflect the country’s linguistic diversity.

Similar multilingual AI initiatives exist across the Global South. In Africa, organizations such as Lelapa AI, Masakhane, and the African Language Lab are generating language datasets and training AI models tailored to African languages by collecting regional speech and text data. These projects highlight the global movement to create inclusive AI that respects cultural contexts.

Why it matters

The dominance of English-based AI systems risks deepening digital divides by limiting access for non-English speakers, who make up the large majority of India’s population. By developing multilingual AI models that support India’s numerous languages and dialects, initiatives like Bhashini enable broader, more equitable access to digital services. This approach improves usability for underserved populations and helps keep AI technologies culturally relevant and practically useful across diverse communities.

Giorgio Kajaia