Atera is looking for a motivated Senior Site Reliability Engineer to join us and build the
framework that lets our engineering operations scale.
This is a full-time, hybrid role based at our Tel Aviv office.
Responsibilities:
• Build tools and automation to monitor system health, performance, and reliability, ensuring quick
detection and resolution of any anomalies or issues.
• Write high-quality infrastructure-as-code that automates provisioning, deployment, and scaling,
and sets up effective monitoring, alerting, and logging solutions.
• Work with other engineers to ensure that new services are well-designed, properly monitored, and have well-defined SLIs and achievable SLOs.
• Maintain runbooks for manual tasks and replace those runbooks with automation whenever possible.
• Proactively track our capacity, quotas, and other performance limits to plan for growth.
• Participate in a 24x7 on-call rotation to handle product availability issues as well as urgent
customer support escalations.
• Investigate and resolve incidents and outages, performing root cause analysis to identify systemic issues and implement preventive measures.
• Develop and maintain disaster recovery plans and perform regular testing to ensure data integrity
and business continuity.
Requirements:
• 3+ years of experience as an SRE in large-scale production environments
• Previous experience as a DevOps Engineer is a big plus
• Strong experience in designing, implementing, and managing Azure cloud infrastructure
• Proficiency in at least one scripting language (Python, Ruby, Perl) and in infrastructure-as-code technologies (e.g., Terraform, CloudFormation)
• Strong ability to lead, design, and execute cross-organization projects
• Experience in managing container and infrastructure orchestration tools (e.g., Kubernetes, Terraform)
• Hands-on experience administering public clouds (Azure)
• Experience with building CI/CD pipelines for applications and microservices
• Excellent English communication skills
Advantages:
Knowledge of advanced monitoring and observability tools beyond basic logging and alerting.
Experience with tools such as Prometheus, Grafana, the ELK stack, or similar.