Senior Inference Engineer

Remote: Full Remote

Offer summary

Qualifications:

  • Experience optimising complex models for real-time systems, particularly in speech and audio processing.
  • Strong programming skills in Python, with expertise in C++, CUDA, or Rust.
  • Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime.
  • Deep understanding of GPU architecture and memory optimisation techniques.

Key responsibilities:

  • Design high-performance model serving solutions for speech recognition.
  • Develop and implement encoder optimisation strategies and batching techniques.
  • Establish robust benchmarking systems to improve latency metrics.
  • Research and implement emerging acceleration methods for model efficiency.

techire ai http://www.techire.ai
2 - 10 Employees

Job description

Ready to drive next-generation performance for real-time speech AI?


Join a well-funded speech technology company that's rapidly establishing itself as a performance leader in real-time audio processing. Their platform already serves hundreds of enterprise clients across 100+ languages, consistently outperforming established competitors in both accuracy and latency.


With significant investment secured and a growing client base, they're now focusing on scaling their inference infrastructure to maintain their competitive edge. This creates an exceptional opportunity for an inference specialist to work with substantial GPU resources and make a direct impact on a platform that's already deployed at scale.


The role

As Senior Inference Engineer, you'll bridge cutting-edge research and production-ready speech systems. You'll take ownership of the GPU infrastructure, ensuring exceptional performance for real-time applications. Your work at the intersection of high-performance computing and speech processing will push the boundaries of what's possible with modern hardware.


What you'll do

  • Design high-performance model serving solutions for speech recognition
  • Develop encoder optimisation strategies including sliding window approaches
  • Implement batching strategies to maximise throughput
  • Create advanced model efficiency techniques through quantisation and pruning
  • Establish robust benchmarking systems to improve latency metrics
  • Develop internal tools for streamlined model deployment
  • Research and implement emerging acceleration methods


What you'll bring

  • Experience optimising complex models (seq2seq, speech/audio, real-time systems)
  • Strong Python skills plus expertise in C++, CUDA, or Rust
  • Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime
  • Experience with flash attention, CUDA, or equivalent optimisation techniques
  • Deep understanding of GPU architecture and memory optimisation
  • Track record implementing model quantisation for real-time production systems


Ideal additions

  • Open-source inference project contributions (vLLM, TGI, Triton)
  • Experience with ASR or audio processing workloads
  • Background in high-performance computing with large GPU clusters
  • Knowledge of model serving architectures at scale


Your package

  • Competitive salary up to €150K depending on experience
  • Remote with team gatherings in Europe
  • Comprehensive health coverage for you and your family
  • Unlimited PTO


If you're passionate about squeezing every millisecond from speech models and want the technical freedom to implement cutting-edge optimisation techniques, this is your opportunity to make a significant impact.

Required profile

Experience

Spoken language(s):
English

Other Skills

  • Teamwork
  • Problem Solving
