Site Reliability Engineering | Specialist

Remote: Full Remote

Offer summary

Qualifications:

  • Proficiency in public cloud platforms (Azure, AWS, or GCP).
  • Experience with CI/CD, deployment automation, and GitOps practices.
  • Familiarity with observability tools like Prometheus, Grafana, and Datadog.
  • Knowledge of infrastructure as code using Terraform or similar tools.

Key responsibilities:

  • Act as a guardian of service reliability in production, aiming to reduce MTTR and increase availability.
  • Define and monitor SLOs, SLIs, and SLAs in collaboration with business and development areas.
  • Develop automations for failure mitigation, deployments, and incident recovery.
  • Conduct root cause analyses and document incident retrospectives.

Compass.uol XLarge
5001 - 10000 Employees

Job description


RESPONSIBILITIES AND ASSIGNMENTS


  • Act as a guardian of the reliability of production services, working to reduce MTTR (Mean Time to Recovery) and increase availability;
  • Define and monitor SLOs, SLIs, and SLAs together with the business and development teams;
  • Develop automations for failure mitigation, deployments, resilience testing, and incident recovery;
  • Promote observability practices with dashboards, alerts, and distributed tracing;
  • Take an active part in the system lifecycle, from design through production support;
  • Conduct and document root cause analyses (RCA) and incident retrospectives;
  • Work with infrastructure as code for secure, scalable provisioning;
  • Foster an engineering culture focused on reliability, performance, and sustainable operations;
  • Ensure monitoring tools are implemented across environments.

REQUIREMENTS AND QUALIFICATIONS


  • Proficiency in public cloud (Azure, AWS, or GCP);
  • Knowledge of CI/CD, deployment automation, and GitOps practices;
  • Experience with observability: Prometheus, Grafana, Loki, Elastic Stack, Datadog, New Relic, or similar;
  • Container management and orchestration (Kubernetes, Helm, Istio/Linkerd);
  • Experience with infrastructure as code (Terraform, Pulumi, or similar);
  • Proficiency in scripting languages (Bash, Python, or Go);
  • Knowledge of networking, DNS, load balancers, TLS, failover, and scalability;
  • Ability to lead incident analyses and preventive actions (blameless postmortems);
  • Nice to have: relevant certifications (CKA, AZ-400, AWS DevOps Pro, GCP SRE); experience with Chaos Engineering and failure testing (Gremlin, LitmusChaos, etc.); experience in high-request-volume environments with distributed architectures (microservices, serverless); and familiarity with FinOps practices and cloud cost optimization.


Don't meet all the requirements for this role?


That's fine! At Compass UOL, we encourage the continuous development of new talent and turn challenges into opportunities.


ADDITIONAL INFORMATION



DREAM BIG WHEN IT COMES TO TECHNOLOGY. BE A COMPASSER! 🚀

Compass UOL is a global company that is part of AI/R, which drives the transformation of organizations through Artificial Intelligence, Generative AI, and Digital Technologies.


We design and build digitally native platforms using cutting-edge technologies to help companies innovate, transform businesses, and drive success in their markets. With a focus on attracting and developing the best talent, we create opportunities that improve lives and highlight the positive impact of disruptive technologies on society.


That's why our selection process goes beyond technical skills. Our goal is to find unique individuals with the potential to make an extraordinary impact on our clients.


We empower talent without borders and promote knowledge and opportunities in the latest market trends, driving significant results.


Join us and be part of the AI-driven digital revolution in the technology universe.


HOW OUR SELECTION PROCESS WORKS

1. ONLINE APPLICATION
Choose the opportunity that best fits your goals. Remember: having a well-detailed profile with your experiences and knowledge can make all the difference!
2. INTERVIEWS
Learn about our culture and company! During interviews, be present and do your best to share your expertise in a chronological and structured way.
3. EVALUATION
Our tests and assessments focus on finding talent with the cultural and technical fit for the position applied for.
4. FEEDBACK
Wait for our response regardless of the result! We have Gupy platform feedback certification.


Required profile

Experience

Spoken language(s):
Portuguese, English
Check out the description to know which languages are mandatory.

Other Skills

  • Problem Solving
  • Analytical Skills
