Voice Technologies for Indian Languages: Best Practices & Recommendations for Responsible & Open AI
Voice-based technologies enable systems to process and respond to human speech. These technologies play an increasingly important role in enabling digital access, especially for users with limited literacy or visual impairments. In India’s socio-culturally and linguistically diverse context, such technologies are critical to digital inclusion, particularly as more users come online speaking underrepresented local languages.
Several initiatives around the world are currently working to build open speech datasets for Indian languages. These include global efforts such as Mozilla Common Voice and, closer to home, Bhashini, India’s flagship multilingual AI platform under MeitY, an umbrella initiative that supports academic and non-government projects such as:
- IndicVoices, a large multilingual dataset developed by AI4Bharat, IIT Madras and Sarvam AI,
- Project Vaani, which collects speech and text data from native speakers, and
- IndicASR, the first ASR model covering all 22 Scheduled Languages.
These efforts are supported by tools such as Kathbath and Shoonya, which facilitate data collection and labelling.
However, building high-quality and inclusive voice technologies in India requires sustained effort across the full speech data lifecycle. Current datasets are often under-representative in terms of dialectal coverage, speaker diversity, or transcription quality. Challenges such as code-switching, limited documentation, poor discoverability, and licensing complexity affect dataset usability, especially for small developers. Community-led contributions, while important, are hard to sustain due to financial and infrastructural constraints. Ethical concerns such as bias, misuse (e.g., audio deepfakes), and unclear ownership rights further complicate the use of open speech datasets.
This project sought to identify barriers and enablers for open-source voice technologies in India and to develop best practices and recommendations for their responsible development and use. It examined legal, ethical, and technical issues such as dataset representativeness, licensing limitations, risks of misuse, and regulatory compliance, including under the Digital Personal Data Protection Act, 2023.
Our Approach
The research followed a lifecycle approach across three tracks:
- Track 1: Ethical and Responsible AI Considerations (led by DFL): focused on risks such as exclusion and misuse, and on the role of community participation and cultural sensitivity in transcription and moderation.
- Track 2: Technical and Infrastructure Dimensions (led by ARTPARK): explored dataset accessibility, discoverability, hosting infrastructure, and safeguards for low-resource environments.
- Track 3: Legal and Regulatory Frameworks (led by Trilegal): analysed licensing, ownership, IP, liability, and alignment with emerging data protection regulations.
The study was supported by an Advisory Board and a Working Group comprising stakeholders from government, industry, and civil society. Its outputs were developed collaboratively, through expert guidance, workshop sprints, and the co-development of practical recommendations.
Our Work
Launched at the India AI Summit Expo 2026, the final outputs, a Policy Report and a Developers’ Toolkit, address key challenges in India’s multilingual voice technology ecosystem.
The Policy Report examines key barriers to building open and responsible speech systems in India, from data collection and model development to reliable compute infrastructure. It proposes targeted policy recommendations to strengthen the voice-technology ecosystem, including treating foundational speech datasets as digital public goods, improving openness and representativeness of models, investing in sustainable public infrastructure, and embedding safeguards to prevent misuse while enabling innovation.
The Developers’ Toolkit highlights the challenges developers face when using Indian-language voice datasets and building voice applications. It identifies structural gaps in India’s speech and language technology ecosystem, including uneven data representation, weak quality assurance, limited evaluation practices, and fragmented governance. Recognising that exclusionary outcomes are often embedded throughout the development lifecycle and cannot be corrected solely through post-deployment fixes, the toolkit proposes a layered, lifecycle-oriented approach to building inclusive and robust speech AI systems, presenting practical approaches commonly adopted within India’s voice-technology ecosystem across the product conceptualisation and development lifecycle.
The project was undertaken by Digital Futures Lab in partnership with ARTPARK and Trilegal, supported by Bhashini and GIZ.
Project Lead: Urvashi Aneja
Research Manager: Aarushi Gupta
Researcher & Project Coordinator: Harleen Kaur
Researcher: Dona Mathew



