Covetrus is dedicated to advancing the world of veterinary medicine and empowering veterinary healthcare teams across the companion, equine, and large-animal health markets. We provide a comprehensive suite of products, software, and services to help drive improved patient health, strong client relationships, and successful financial outcomes for veterinary professionals.
SUMMARY
The Site Reliability Manager is responsible for the stability and performance of our production environment. Our customers’ experience is critical to our success, which makes the health of our platform the highest priority for our SRE team. Additionally, this role will be responsible for software release management and for establishing monitoring and alerting criteria in keeping with best practices for a high-availability platform. Candidates for this role should have extensive knowledge of Dynatrace for monitoring and alerting, including the ability to create and modify dashboards and to leverage synthetic monitoring to alert on degradation of the user experience. Strong familiarity with Azure, the Azure Portal, and Kafka is preferred, as is a good working knowledge of Splunk, Kong (application gateway), and Cloudflare (firewall). We depend heavily on PagerDuty for alerting the team and maintain a 24/7 on-call rotation. Since this is a leadership position, the ability to mentor the team and manage its continual improvement will be essential.
ESSENTIAL DUTIES AND RESPONSIBILITIES include the following. Other duties may be assigned.
· Develop methodologies for monitoring and operating highly available and scalable services.
· Work with the DevOps team to create more scalable and resilient infrastructure.
· Proactively monitor and review application performance.
· Monitor specific metrics, set thresholds, and trigger alerts based on those thresholds.
· Collect and analyze logging and diagnostic information.
· Help develop better monitoring and incident resolution practices.
· Troubleshoot business and production issues.
· Properly document all incident responses.
· Provide updates and documentation to runbooks and operational manuals.
· Document mean time to recover (MTTR) and mean time to failure (MTTF).
· Participate in on-call rotations.
· Evaluate, build and modify automation for deploying and operating production services.
· Provide leadership in reducing and resolving production incidents.
· Mentor and develop site reliability engineers.
· Create a culture of reliability and high availability in the Information Technology department to improve customer satisfaction.
· Train application development resources to build more resilient applications.
· Identify opportunities to improve all operations processes.
· Facilitate the effective transition of services into production (release management), ensuring that all requirements have been met in accordance with our Change Management standards.
· Regularly review deployment configurations and make recommendations for optimal performance in terms of hardware and scaling.
SUPERVISORY RESPONSIBILITIES
· Technical leadership for a nine-person team
QUALIFICATIONS:
EDUCATION AND/OR EXPERIENCE
· Bachelor’s degree in software engineering or computer science and/or equivalent years of related experience
· Minimum 3 years in an SRE role for a highly available environment
· Minimum 1 year in a similar leadership role.
· Experience with Kafka, Kong, and Elasticsearch is a plus
COMPETENCIES (SKILLS AND ABILITIES)
· Strong problem-solving and troubleshooting skills with a sense of urgency to restore services for our customers
· Versatile with a passion to learn
· Ability to understand the ‘big picture’
· Strong skills with Dynatrace and Azure Portal
· History of self-improvement and keeping up-to-date with current technologies
Salary may vary depending on factors such as confirmed job-related skills, experience, and location. However, the pay range for this position is as follows. Sales positions are eligible for a variable incentive.
$102,400.00 - $190,100.00
We offer the following benefits for you to take advantage of while you are here provided you meet the eligibility requirements under each governing program:
• 401k savings & company match
• Paid time off
• Paid holidays
• Maternity leave
• Parental leave
• Military leave
• Other leaves of absence
• Health, dental, and vision benefits
• Health savings accounts
• Flexible spending accounts
• Life & disability benefits
• Identity theft protection
• Pet insurance
• Certain positions may include eligibility for a short-term incentive plan
Covetrus is an equal opportunity/affirmative action employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law.