Quick Summary
Site Reliability Engineer (SRE) - Cloud Contact Center Software
Five9 is a leading provider of cloud contact center software, committed to bringing the power of cloud innovation to customers worldwide. We foster a team-first culture that celebrates diversity and empowers employees to thrive.
We are seeking a Site Reliability Engineer (SRE) to join our team and keep our systems highly reliable and scalable. The role is roughly 50% software development and 50% operations, with a strong emphasis on automation, monitoring, and system reliability rather than manual work. You will collaborate closely with platform, application, and database teams to deliver a reliable, highly available service.
Key Responsibilities: SRE Focus Areas
Observability & Monitoring
- Design and implement comprehensive dashboards covering OS/platform and application-level monitoring, using RED (rate, errors, duration) as primary indicators and USE (utilization, saturation, errors) as secondary indicators.
- Establish and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Build alerting systems and performance monitoring to proactively identify and resolve issues.
- Participate in 24/7 on-call rotations; lead incident response, including post-mortem analysis and remediation; and maintain the official on-call routing.
Infrastructure Automation & Deployment
- Maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines, working with cloud and on-premise deployment teams.
- Develop and maintain Infrastructure as Code (IaC) using tools like Terraform or Ansible.
- Automate system configuration to ensure consistency across environments, and implement configuration control best practices.
Security & Compliance
- Implement security automation, ensure scanning systems are in place, and review escalated vulnerabilities.
- Maintain proper authentication, authorization, and audit logging systems.
- Ensure systems meet regulatory requirements and industry standards through compliance reporting.
- Participate in security incident response and remediation efforts.
Cost Optimization
- Monitor and optimize cloud resource usage and costs, tracking both planned and unplanned resource changes.
- Analyze usage patterns for capacity planning.
- Provide recommendations for cost-effective architecture and implement automated scaling and resource optimization strategies (right-sizing).
Common Services & Platform Engineering
- Build and maintain shared infrastructure such as notification systems, caching layers, message queues, or third-party software stacks.
- Manage database reliability, performance, and scaling (where not handled by dedicated DB teams).
- Implement and maintain service discovery, load balancing, and network policies (Service Mesh & Networking).
- Create and maintain tools and platforms that improve developer productivity and system reliability.
Required Qualifications
Operational Experience
- 3+ years managing large-scale production environments.
- Comfortable with 24/7 on-call responsibilities and incident response.
- Strong Linux/Unix system administration skills.
- Understanding of networking concepts: TCP/IP, DNS, load balancing, and network security.
- Experience with SQL and NoSQL databases in production environments.
Technical Skills
- Proficiency in at least two programming languages, such as Python, Shell, PHP, or Java.
- Experience with the infrastructure and services of at least one major cloud platform (AWS, GCP, or Azure).
- Hands-on experience with containers and container orchestration (Docker, Kubernetes).
- Experience with Monitoring & Observability tools (Prometheus, Grafana, ELK stack, or similar).
- Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, or similar).
- Expert-level Git usage and collaborative development practices.
SRE-Specific Knowledge
- Experience defining and maintaining SLIs and SLOs.
- Understanding of error budget concepts and implementation.
- Track record of identifying and eliminating repetitive manual work (toil reduction).
- Experience with performance testing and capacity management.
Preferred Qualifications
- Bachelor's degree in Computer Science or Engineering, or equivalent experience.
- Experience with microservices architecture and distributed systems.
- Knowledge of security best practices and compliance frameworks.
- Experience with chaos engineering and reliability testing.
- Previous experience in an SRE or DevOps role at a technology company.
- Contributions to open-source projects or technical communities.
Five9 is an equal opportunity employer.


