Atera is looking for a motivated Senior Site Reliability Engineer to join us and build the
framework that lets our engineering operations scale.
This is a full-time, hybrid (onsite and remote) role based at our Tel Aviv office.
Responsibilities:
● Build tools and automation to monitor system health, performance, and reliability, ensuring quick
detection and resolution of any anomalies or issues.
● Write high-quality infrastructure-as-code that automates provisioning, deployment, and scaling,
and delivers effective monitoring, alerting, and logging solutions.
● Work with other engineers to ensure that new services are well designed and properly monitored, with well-defined SLIs and achievable SLOs.
● Maintain runbooks for manual tasks and replace those runbooks with automation whenever possible.
● Proactively track our capacity, quotas, and other performance limits to plan for growth.
● Participate in a 24x7 on-call rotation to handle product availability issues as well as urgent
customer support escalations.
● Investigate and resolve incidents and outages, performing root cause analysis to identify systemic issues and implement preventive measures.
● Develop and maintain disaster recovery plans and perform regular testing to ensure data integrity
and business continuity.
Requirements:
● 3+ years of experience as an SRE in large-scale production environments
● Previous experience as a DevOps Engineer is a big plus
● Strong experience in designing, implementing, and managing Azure cloud infrastructure
● Proficiency in at least one scripting language (Python, Ruby, Perl) and infrastructure-as-code technologies (e.g., Terraform, CloudFormation)
● Strong ability to lead, design, and execute cross-organization projects
● Experience in managing container and infrastructure orchestration tools (e.g., Kubernetes, Terraform)
● Hands-on experience administering public clouds (Azure)
● Experience with building CI/CD pipelines for applications and microservices
● Excellent English communication skills
Advantages:
● Knowledge of advanced monitoring and observability tools beyond basic logging and alerting
● Experience with tools like Prometheus, Grafana, the ELK stack, or similar