Offer summary

Qualifications:

7+ years in DevOps, Site Reliability Engineering, or a related role.

Advanced hands-on experience with Datadog and Splunk.

Strong background in cloud providers like AWS and Azure, and container orchestration.

Familiarity with Terraform and strong scripting abilities in languages such as Python or Bash.

Key responsibilities:

Provide technical expertise on SRE best practices focusing on reliability and performance.

Collaborate with teams to design monitoring solutions and maintain observability tools.

Participate in incident resolution and conduct postmortems to improve processes.

Mentor peers on observability tools and promote a culture of shared responsibility.

Job description

We are looking for a Lead Site Reliability Engineer (SRE) to join our team and spearhead our observability and monitoring initiatives. You will collaborate with cross-functional teams to implement, maintain, and refine our systems on modern cloud infrastructures. Your expertise with tools like Datadog and Splunk will be critical in driving visibility into system performance, reliability, and security. This is a hands-on, high-impact role where you will lead the effort to create and maintain a state-of-the-art observability strategy, fostering a culture of proactive monitoring and continuous improvement.

Key Responsibilities

Provide deep technical expertise and guidance on SRE best practices, focusing on system reliability, performance, and scalability.
Serve as a subject matter expert for Datadog and Splunk, influencing the organization’s observability strategy and tooling decisions.
Collaborate with developers and senior engineers to drive high-level design and technology roadmaps.
Design and implement end-to-end monitoring solutions in Datadog and Splunk, ensuring comprehensive visibility into system performance and availability.
Develop and maintain dashboards, alerts, and analytics tools for proactive detection and rapid incident response.
Continually evaluate and refine monitoring practices, ensuring best-in-class observability and minimal noise in alerting.
Partner with development, infrastructure, and operations teams to design architectures that prioritize uptime, fault tolerance, and disaster recovery.
Implement Infrastructure-as-Code (e.g., Terraform) to automate provisioning and scaling of cloud services (AWS, Azure).
Participate in incident triage and resolution during critical events, providing advanced troubleshooting expertise and guidance.
Conduct thorough postmortems to identify root causes, facilitate improvements, and document lessons learned.
Ensure continuous improvement of incident management processes, driving down Mean Time to Recovery (MTTR).
Work closely with developers, QA, and other stakeholders to integrate reliability principles into every stage of the software development lifecycle.
Share knowledge and mentor peers on observability tools, performance optimization, and SRE methodologies.
Promote a culture of shared responsibility, encouraging teams to adopt and adhere to SRE best practices.
Identify and prioritize improvements in reliability, performance, and cost efficiency.
Evaluate emerging technologies and tools, recommending adoption where they enhance system observability or reliability.
Contribute to internal documentation, ensuring best practices are easily accessible and understood across the organization.

Required Qualifications

7+ years in DevOps, Site Reliability Engineering, or a related role.
Proven track record as a technical lead or subject matter expert (no direct people management required).
Advanced hands-on experience with Datadog (metrics, logs, APM) and Splunk (log management, queries, dashboards).
Strong background in cloud providers (AWS, Azure) and container orchestration (e.g., Kubernetes).
Familiarity with Terraform or equivalent Infrastructure-as-Code tools.
Experience with CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions; strong scripting/programming abilities (Python, PowerShell, Bash, etc.).
Solid understanding of Linux/UNIX fundamentals, networking, and common distributed system patterns.

Required profile