In a notable step towards strengthening the presence of Indian languages in Large Language Model (LLM) training, AI4Bharat has released the IndicLLMSuite. This collection of resources is designed to address the challenges that low- and mid-resource languages face in LLM development. The aim of the initiative is to broaden access to advanced Natural Language Processing (NLP) technologies across India’s linguistically diverse landscape.
Empowering Linguistic Diversity
The IndicLLMSuite is a comprehensive resource with a data repository of 251 billion tokens and 74.8 million instruction-response pairs, spanning 22 Indian languages. Gathered from diverse sources such as curated URLs, multilingual corpora, and large-scale translations, this corpus gives Indian languages strong representation in language model training.
Diverse Data Collection
The suite’s cornerstone component, Sangraha, is a pre-training dataset assembled from numerous linguistic resources. With 251 billion tokens across 22 languages, it forms the foundation for training language models effectively. Setu, a Spark-based distributed pipeline customized for Indian languages, extracts content from various sources, including websites, PDFs, and videos. Its built-in stages, such as cleaning, filtering, toxicity removal, and deduplication, are intended to ensure the data’s integrity and quality.
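The four stages named above (cleaning, filtering, toxicity removal, and deduplication) can be illustrated with a minimal sketch. This is not Setu's actual code or API: it uses plain Python rather than Spark, and the function names, the `BLOCKLIST` contents, the `min_words` threshold, and the exact-hash deduplication are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical blocklist; a real pipeline would likely use per-language
# word lists or a trained classifier instead.
BLOCKLIST = {"badword"}


def clean(text: str) -> str:
    """Cleaning: normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Filtering and toxicity removal: drop very short documents and
    documents that contain a blocklisted word."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    return not any(word in BLOCKLIST for word in words)


def dedup_key(text: str) -> str:
    """Deduplication key: SHA-256 of the cleaned text (exact matches only)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_pipeline(docs):
    """Apply clean -> filter -> dedup and return the surviving documents."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        key = dedup_key(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```

Exact hashing only catches identical documents after cleaning; near-duplicate detection at corpus scale typically needs fuzzy techniques such as MinHash, which this sketch leaves out for brevity.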
Enhancing Language Model Instruction
IndicAlign-Instruct introduces a substantial collection of 74.7 million prompt-response pairs across 20 Indian languages. These pairs have been carefully curated using multiple methods, including the compilation of existing Instruction Fine-Tuning (IFT) datasets, translation of English datasets, generation of discussions from India-centric Wikipedia articles, and crowd-sourcing through the Anudesh platform. Additionally, a novel IFT dataset derived from IndoWordNet further enriches the suite’s resources by promoting enhanced language and grammar learning for models.
Fostering Safety in Language Models
IndicAlign-Toxic addresses safety alignment in language models by providing a curated dataset of 123K pairs of toxic prompts and non-toxic responses. The pairs were generated using open-source English LLMs and then translated into 14 Indian languages, with the goal of improving the safety and reliability of Indian language models.
Collaborative Efforts in Language Technology
The launch of IndicLLMSuite reflects a collaborative effort among stakeholders in the Indian AI landscape to advance language technologies. In partnership with Sarvam AI and IIT Madras, AI4Bharat recently introduced IndicVoices, an extensive speech dataset intended to foster inclusivity and diversity in speech recognition applications. With 7,348 hours of natural speech from 16,237 speakers across 145 Indian districts and 22 languages, IndicVoices complements IndicLLMSuite in enriching India’s linguistic ecosystem.
A Pivotal Moment in Inclusive Language Technology
The unveiling of IndicLLMSuite marks a significant milestone in the endeavor towards inclusive language technology development in India. By democratizing access to resources and fostering collaboration, AI4Bharat reaffirms its dedication to promoting linguistic diversity and empowering Indian languages in the digital era. As the landscape of NLP continues to evolve, initiatives like IndicLLMSuite act as catalysts for innovation and progress, paving the way for a more inclusive and accessible linguistic future.