Senior Reliability Engineer

Remote: Full Remote
Experience: Mid-level (2-5 years)

Offer summary

Qualifications:

  • 5+ years of experience with .NET Framework (C#) for production system stability.
  • Strong coding, debugging, and troubleshooting skills, especially in performance optimization.
  • Expertise in incident management and resolving live production issues.
  • Knowledge of Kubernetes, Docker, and cloud platforms, with proficiency in monitoring tools like Prometheus and Grafana.

Key responsibilities:

  • Develop and maintain code to resolve product issues and ensure system stability.
  • Provide operational support across client applications, monitoring services to detect critical failures.
  • Own and troubleshoot complex incidents, conducting root cause analyses and implementing long-term solutions.
  • Collaborate with cross-functional teams to deliver fixes for production issues and mentor reliability engineers.

Flinks Financial Services Scaleup https://flinks.com
51 - 200 Employees

Job description

About Flinks 

Flinks is where financial data moves—with purpose, trust, and impact.

We’re on a mission to simplify access to financial data and help businesses build better, faster, and more secure financial products and experiences. Since 2016, we’ve been bridging the gap between fintechs, financial institutions, and consumers by enabling seamless, secure data connectivity.

From instant account funding to smarter lending, our solutions help power some of the most innovative financial products in North America. We partner with lenders, banks, and fintechs to streamline onboarding, prevent fraud, and fuel real-time decision-making with enriched, reliable data.

As pioneers in Canada’s open banking movement, we're not waiting for the future—we're building it. If you're bold, curious, and ready to help shape the future of finance, we’d love to meet you.

About the Reliability Team 🚒

As a Senior Reliability Engineer, you will play a pivotal role in ensuring the stability, performance, and reliability of Flinks' fintech product platforms and their monitoring and alerting systems. You will serve as an expert in both software development and system support, working closely with engineering, operations, and product teams to troubleshoot complex issues, resolve incidents, and continuously improve the technical foundation of our products. This role demands a combination of advanced coding skills, incident management experience, and an understanding of the fintech industry.

What You’ll Do

  • Develop and maintain code to quickly resolve product issues, ensuring fast recovery and long-term system stability.
  • Provide live operational support across multiple client applications, monitoring services and alerts to detect and resolve critical failures with minimal downtime.
  • Own and troubleshoot complex incidents, conduct root cause analyses, and implement long-term solutions—adhering to SLAs and internal SLOs.
  • Build monitoring dashboards and alerting systems to proactively detect and address issues, supporting system scalability and stability.
  • Analyze operational metrics and KPIs to identify trends, surface client pain points, and drive improvements.
  • Automate tooling and processes to improve efficiency and reduce manual work across LiveOps.
  • Collaborate with cross-functional teams to deliver lasting fixes for production issues and contribute to technical analyses of product gaps.
  • Lead and mentor reliability engineers, providing guidance and ensuring consistent delivery of high-quality work.
  • Participate in post-incident reviews, documenting outcomes and driving preventative action items.
  • Support after-hours on-call coverage as part of the LiveOps rotation.

Who You Are 💪

  • 5+ years of experience with .NET Framework (C#), ensuring production system stability
  • Strong coding, debugging, and troubleshooting skills, particularly in performance optimization of large-scale applications
  • Operationally focused with expertise in incident management and resolving live production issues
  • Proven experience in building and maintaining reliable monitoring and alerting systems in high-demand environments, with a focus on production support
  • Strong knowledge of Kubernetes, Docker, and cloud platforms (GCP preferred)
  • Proficiency with monitoring tools like Prometheus, Grafana, and Kibana
  • Experience with incident ticketing/documentation tools like FreshDesk and Confluence
  • Critical thinker who can identify system weaknesses and find innovative solutions
  • Strong project management skills with a focus on scalability and system stability

Nice to haves

  • ITIL Service Management certification (such as ITIL v3 or ITIL v4), or an equivalent certification, is highly desired.
  • Experience with PowerBI, web scraping, or Golang

The Interview Process 🏗

  1. Head of People Ops
  2. Case Assignment & Presentation
  3. Team Lead Interview
  4. Director Interview



Required profile

Experience

Level of experience: Mid-level (2-5 years)
Industry: Financial Services
Spoken language(s): English

Other Skills

  • Incident Reporting
  • Troubleshooting (Problem Solving)
  • Critical Thinking
