Offer summary

Qualifications:

Experience with monitoring and observability tools like New Relic, Prometheus, and Grafana.

Proficiency in GitHub and GitOps practices using GitHub Actions.

Strong experience with AWS and infrastructure as code using Terraform and Terragrunt.

Solid knowledge of SRE, resilience, performance, and automation.

Key responsibilities:

Define and implement monitoring and observability standards to ensure system reliability.

Analyze metrics and alerts to anticipate failures and optimize performance.

Propose and implement architectural improvements to enhance efficiency and availability.

Collaborate with development and infrastructure teams to ensure service resilience and scalability.

Job description

Come and impact millions of Brazilians!

Want to make a difference in the lives of millions of Brazilians? At RecargaPay, we create accessible and innovative financial solutions that transform the way people interact with money. Be part of this impactful and innovative journey, connecting people with opportunities that truly make a difference in their daily lives.

Our purpose is to deliver the best mobile payment experience for Brazilians, addressing real-world challenges with smart solutions like Pix Parcelado, while staying attentive to market trends and our customers' needs. Here, we value collaboration, ownership, and a relentless pursuit of results, delivering excellence in every interaction.

If you’re looking to join a dynamic environment that challenges the status quo and puts people at the center of decision-making, RecargaPay is the perfect place for you to grow, co-create, and make a difference!

Responsibilities

We are looking for a Senior Site Reliability Engineer (SRE) to define and implement monitoring and observability standards, ensuring the reliability and efficiency of our environment. This professional will be responsible for analyzing metrics and alerts, anticipating failures, identifying infrastructure and application bottlenecks, and proposing architectural improvements to enhance efficiency and availability. They will also play a key role in post-mortems, sharing knowledge and contributing to effective action plans.

Define and enhance monitoring and observability standards;
Support the definition and monitoring of SLIs/SLOs and other key performance indicators to ensure alignment with reliability goals;
Analyze metrics and alerts to anticipate failures and optimize performance;
Identify bottlenecks and areas for improvement in infrastructure and applications;
Propose and implement software architecture and infrastructure improvements to increase efficiency and availability;
Lead and support post-mortems, promoting best practices and lessons learned;
Document best practices, incident learnings, and technical solutions to foster knowledge sharing and accelerate problem resolution;
Work in a GitOps environment, using GitHub Actions for automation;
Collaborate with development and infrastructure teams to ensure service resilience and scalability;
Conduct troubleshooting and performance optimization in containers and Kubernetes (EKS);
Serve as a technical reference for reliability, supporting the adoption of SRE practices across squads and contributing to the evolution of engineering culture;
Work alongside Security, Platform, and Data teams to ensure a holistic approach to reliability and scalability;
Influence technical decisions and drive improvements, including in teams they are not directly part of;
Maintain a mindset focused on continuous learning, resilience in handling incidents, and a strong emphasis on prevention and automation.

Requirements

Experience with monitoring and observability tools, including New Relic, Prometheus, and Grafana;
Proficiency in GitHub and GitOps practices with GitHub Actions;
Strong experience with AWS and infrastructure as code using Terraform and Terragrunt;
Experience with microservices architecture and Kubernetes;
Solid knowledge of SRE, resilience, performance, and automation;
Hands-on experience with troubleshooting and performance tuning in complex environments;
Expertise in infrastructure and problem analysis in containers and Kubernetes (EKS);
Knowledge of scripting and automation tools such as Python, Ansible, and shell scripting (preferred);
Experience with distributed environments, high availability, and scalability;
Familiarity with post-mortems and incident response.