Data Platform Engineer - Text, Speech, and Language Models
Job Summary:
BharatGen is on a mission to create AI that truly represents the diversity, culture, and unique context of India. At the
heart of this mission lies the need for robust, scalable infrastructure to build the multilingual and multimodal datasets
that power foundational AI models. We are seeking a skilled Data Platform Engineer to build the scalable tools,
platforms, and pipelines needed to process these large-scale datasets.
In this role, you will build scalable data pipelines to ingest, transform, and prepare data from diverse sources—text,
speech, images, and video—making it ready for Generative AI model training. Your work will involve developing and
managing the underlying platform while addressing challenges like governance, security, observability, lineage, and
scalability. The outcomes of your work will include efficient tools for data processing, a reliable data platform, and
high-quality datasets tailored to the evolving needs of large-scale AI and LLM training.
Collaborating closely with researchers and ML engineers, you will play a pivotal role in enabling BharatGen to deliver
state-of-the-art AI models, contributing to the advancement of India’s AI ecosystem through innovative data
engineering solutions.
Key Responsibilities:
● Design and Build Scalable Platforms: Develop distributed infrastructure for ingesting, processing, and
transforming diverse datasets (text, speech, images, video) at terabyte to petabyte scale.
● Develop Robust Data Pipelines: Create reliable, scalable pipelines to prepare datasets for Generative AI and LLM
training.
● Implement Governance and Observability: Build frameworks for data lineage, monitoring, and access control to
ensure data quality and operational reliability.
● Optimize Performance and Cost: Enhance platform performance and resource utilization using cost-effective
strategies, including GPU-accelerated preprocessing.
● Collaborate and Innovate: Work closely with researchers and ML engineers to adapt platforms and data
pipelines to evolving LLM requirements, addressing various data challenges.
● Drive Innovation: Stay updated on emerging tools, frameworks, and best practices to implement cutting-edge
solutions for large-scale dataset creation.
Skills:
1. Technical:
● Proficiency in distributed systems and frameworks (e.g., Kafka, Ray, PySpark) for scalable data workflows.
● Exposure to end-to-end data lifecycle management, including DataOps.
● Strong programming skills in Python, Scala, or Go, with a focus on high-performance pipeline development.
● Experience building and optimizing data pipelines, including ETL processes, data modeling, and integration
into scalable workflows.
● Expertise in data scraping, crawling frameworks, and modern dataset development techniques such as synthetic
data generation.
● Experience with cloud platforms (AWS, GCP, Azure) and container orchestration (Docker, Kubernetes).
● Deep understanding of data platform design, including data architecture, metadata tracking, data lineage,
observability, monitoring, and scalability best practices.
● Familiarity with Infrastructure-as-Code tools (e.g., Terraform, CloudFormation), CI/CD pipelines, relational/NoSQL
databases, and GPU-accelerated workflows.
● Familiarity with visualization and monitoring tools for lifecycle management and pipeline performance tracking.
2. Soft Skills:
● Adaptability and innovation in fast-paced, dynamic environments.
● Strong collaboration skills for interdisciplinary teamwork.
● Proactive problem-solving and a growth mindset to thrive in a mission-driven organization.
Other terms:
● The position is contractual and full-time, and is subject to periodic performance reviews.
Location of work: