Control Plane - Site Reliability Engineer (Hosted Infrastructure)

Elastic
On-site
Australia
Wait & Guest Services

What is the role

We integrate, scale, and evolve multi-cloud infrastructure across four cloud service providers (CSPs), more than 70 globally distributed regions, and tens of thousands of compute instances to power Elastic Cloud. We scale our capabilities through automation, Infrastructure as Code (IaC), configuration management, and software that minimizes toil while improving reliability and efficiency for our customers. From provisioning to termination, the complete lifecycle of a host is our focus, and we want it to live its best life.

If this kind of work gives you positive vibes, we would love your experience to help us continue offering a truly outstanding customer experience across a diverse suite of cloud infrastructure!

What you will be doing

  • Applying software engineering methods to automate large scale systems administration.
  • Optimizing the lifecycle and reliability of compute across multiple cloud providers.
  • Ensuring proactive monitoring and alerting to prevent incidents before they happen.
  • Growing our global infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automation.
  • Collaborating in an inclusive environment, focusing on Operational Excellence and uplifting each other with constructive feedback.
  • Being part of an SRE on-call rotation, responding to operational needs and incidents.

What you bring

  • 2+ years in software engineering using Golang.
  • 2+ years operating hundreds (or more) of cloud compute instances via automated solutions.
  • 2+ years with Linux systems - you are proficient with the terminal and shell.
  • 2+ years working with containerized services (such as Docker).
  • A customer-first approach in solving operational problems from an SRE perspective.
  • Comfort working remotely on distributed teams.

Bonus Points

  • Worked with any of the following in a production environment: Terraform, Puppet, Ansible, Argo CD, Argo Workflows, CUE, Kubernetes, or a programming language other than Golang.
  • Experience being on-call during incidents and using observability tools (e.g. Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues, quantify impact, and confirm mitigations.
  • Designed, implemented, and engineered solutions with the Elastic Stack.