Site Reliability Engineer
About Rheo
Rheo is an intelligent industrial AI platform that uses sensors and machine learning to
optimize operational processes.
Rheo fosters harmony between people and technology through a data-led focus
and transparency, helping manufacturing and operations teams work as a cohesive unit. At
Rheo, we hold ourselves to the same principles we advocate to our customers by creating
effective lean solutions.
Job Summary
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) with at least
2 years of experience in maintaining AWS Cloud infrastructure and working with Kubernetes. The
primary objective of this role is to ensure near-zero downtime for our services and applications by
proactively identifying and addressing issues. The ideal candidate will have a strong background
in troubleshooting, bug tracking, reporting, and resolution, as well as a willingness to participate
in an on-call schedule.
Key Responsibilities:
1. AWS Cloud Maintenance: Maintain and optimize AWS Cloud infrastructure to ensure
scalability, reliability, and performance. Monitor AWS resources and services to identify
and rectify potential issues before they impact the system.
2. Kubernetes Management: Manage and maintain Kubernetes clusters, ensuring high
availability and performance. Implement best practices for container orchestration and
scaling.
3. Incident Response: Participate in an on-call rotation to provide 24/7 support and respond
to critical incidents promptly. Collaborate with cross-functional teams to troubleshoot and
resolve system issues efficiently.
4. Bug Tracking and Resolution: Identify and document software and infrastructure bugs,
working closely with development teams to prioritize and resolve them. Continuously
improve monitoring and alerting systems to proactively detect issues.
12/1, Second Floor, Raghava Building,
Bashyam Basheer Ahmed Rd, Alwarpet,
Chennai, Tamil Nadu 600018
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or
equivalent work experience).
- At least 2 years of proven experience as a DevOps Engineer, Site Reliability Engineer, or
in a similar role.
- Strong hands-on experience with infrastructure-as-code tools like Terraform, configuration
management tools like Ansible, and version control systems like Git.
- Proficiency in scripting languages such as Python, Bash, or Ruby for automation tasks.
- In-depth knowledge of CI/CD concepts and experience with CI/CD tools like Jenkins,
GitLab CI/CD, CircleCI or GitHub Actions.
- Extensive experience working with cloud platforms like AWS, Azure, or GCP.
- Solid understanding of containerization technologies such as Docker and container
orchestration tools like Kubernetes.
- Familiarity with monitoring and logging solutions like Prometheus, Grafana, ELK stack, etc.
- Excellent problem-solving skills and the ability to troubleshoot complex issues across
different technology stacks.
- Strong communication and interpersonal skills to effectively collaborate with
cross-functional teams.
Preferred Qualifications:
- Relevant certifications in cloud platforms (AWS Certified DevOps Engineer, Azure DevOps
Engineer, etc.).
- Experience with infrastructure as code (e.g., Terraform, CloudFormation).
- Experience with serverless architectures and services.
Work Environment:
We value a culture of continuous improvement, collaboration, and innovation. Our team is
dedicated to maintaining high availability for our services and ensuring a seamless experience for
our users. As an SRE, you will play a crucial role in achieving these goals and driving the
company's success. If you are a proactive, detail-oriented, and experienced SRE who is
passionate about minimizing downtime and improving system reliability, we encourage you to
apply and join our dynamic team. Together, we will ensure the highest level of service for our
customers.