Voice Technologies for Indian Languages: Best Practices & Recommendations for Responsible & Open AI
Voice-based technologies enable systems to process and respond to human speech. These technologies play an increasingly important role in enabling digital access, especially for users with limited literacy or visual impairments. In India’s socio-culturally and linguistically diverse context, such technologies are critical to digital inclusion, particularly as more users come online speaking underrepresented local languages.
Several initiatives around the world are currently working to build open speech datasets for Indian languages. These include global efforts such as Mozilla Common Voice and, closer to home, Bhashini, India’s flagship multilingual AI platform under MeitY, an umbrella initiative that supports academic and non-government projects such as:
- IndicVoices, a large multilingual dataset developed by AI4Bharat, IIT Madras and Sarvam AI,
- Project Vaani, which collects speech and text data from native speakers, and
- IndicASR, the first ASR model covering all 22 Scheduled Languages.
These efforts are supported by tools such as Kathbath and Shoonya, which facilitate data collection and labelling.
However, building high-quality and inclusive voice technologies in India requires sustained effort across the full speech data lifecycle. Current datasets are often under-representative in terms of dialectal coverage, speaker diversity, or transcription quality. Challenges such as code-switching, limited documentation, poor discoverability, and licensing complexity affect dataset usability, especially for small developers. Community-led contributions, while important, are hard to sustain due to financial and infrastructural constraints. Ethical concerns such as bias, misuse (e.g., audio deepfakes), and unclear ownership rights further complicate the use of open speech datasets.
This project sought to identify barriers and enablers for open-source voice technologies in India and to develop best practices and recommendations for their responsible development and use. It examined legal, ethical, and technical issues such as dataset representativeness, licensing limitations, risks of misuse, and regulatory compliance, including under the Digital Personal Data Protection Act, 2023.
Our Approach
The research followed a lifecycle approach across three tracks:
- Track 1: Ethical and Responsible AI Considerations (led by DFL): focused on risks such as exclusion and misuse, and on the role of community participation and cultural sensitivity in transcription and moderation.
- Track 2: Technical and Infrastructure Dimensions (led by ARTPARK): explored dataset accessibility, discoverability, hosting infrastructure, and safeguards for low-resource environments.
- Track 3: Legal and Regulatory Frameworks (led by Trilegal): analysed licensing, ownership, IP, liability, and alignment with emerging data protection regulations.
The study was supported by an Advisory Board and a Working Group comprising stakeholders from government, industry, and civil society. Its outputs were developed collaboratively, through expert guidance, workshop sprints, and the co-development of practical recommendations.
Our Work
Launched at the India AI Summit Expo 2026, the final outputs, a Policy Report and a Developers’ Toolkit, address key challenges in India’s multilingual voice technology ecosystem.
The Policy Report examines key barriers to building open and responsible speech systems in India, from data collection and model development to reliable compute infrastructure. It proposes targeted policy recommendations to strengthen the voice-technology ecosystem, including treating foundational speech datasets as digital public goods, improving openness and representativeness of models, investing in sustainable public infrastructure, and embedding safeguards to prevent misuse while enabling innovation.
The Developers’ Toolkit highlights the challenges developers face when using Indian-language voice datasets and building voice applications. It identifies structural gaps in India’s speech and language technology ecosystem, including uneven data representation, weak quality assurance, limited evaluation practices, and fragmented governance. Recognising that exclusionary outcomes are often embedded throughout the development lifecycle and cannot be corrected solely through post-deployment fixes, the toolkit proposes a layered, lifecycle-oriented approach to building inclusive and robust speech AI systems, presenting practical approaches commonly adopted within India’s voice-technology ecosystem across the product conceptualisation and development lifecycle.
The project was undertaken by Digital Futures Lab in partnership with ARTPARK and Trilegal, supported by Bhashini and GIZ.
Project Lead: Urvashi Aneja
Research Manager: Aarushi Gupta
Researcher & Project Coordinator: Harleen Kaur
Researcher: Dona Mathew



