Senior Inference Engineer

Remote: Full Remote

Offer summary

Qualifications:

  • Experience optimising complex models for real-time systems, particularly in speech and audio processing.
  • Strong programming skills in Python, with expertise in C++, CUDA, or Rust.
  • Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime.
  • Deep understanding of GPU architecture and memory optimisation techniques.

Key responsibilities:

  • Design high-performance model serving solutions for speech recognition.
  • Develop and implement encoder optimisation strategies and batching techniques.
  • Establish robust benchmarking systems to improve latency metrics.
  • Research and implement emerging acceleration methods for model efficiency.

techire ai http://www.techire.ai
2 - 10 Employees

Job description

Ready to drive next-generation performance for real-time speech AI?


Join a well-funded speech technology company that's rapidly establishing itself as a performance leader in real-time audio processing. Their platform already serves hundreds of enterprise clients across 100+ languages, consistently outperforming established competitors in both accuracy and latency.


With significant investment secured and a growing client base, they're now focusing on scaling their inference infrastructure to maintain their competitive edge. This creates an exceptional opportunity for an inference specialist to work with substantial GPU resources and make a direct impact on a platform that's already deployed at scale.


The role

As Senior Inference Engineer, you'll bridge cutting-edge research and production-ready speech systems. You'll take ownership of the GPU infrastructure, ensuring exceptional performance for real-time applications. Your work at the intersection of high-performance computing and speech processing will push the boundaries of what's possible with modern hardware.


What you'll do

  • Design high-performance model serving solutions for speech recognition
  • Develop encoder optimisation strategies including sliding window approaches
  • Implement batching strategies to maximise throughput
  • Create advanced model efficiency techniques through quantisation and pruning
  • Establish robust benchmarking systems to improve latency metrics
  • Develop internal tools for streamlined model deployment
  • Research and implement emerging acceleration methods


What you'll bring

  • Experience optimising complex models (seq2seq, speech/audio, real-time systems)
  • Strong Python skills plus expertise in C++, CUDA, or Rust
  • Hands-on experience with inference frameworks like Triton, TensorRT, or ONNX Runtime
  • Experience with flash attention, CUDA, or equivalent optimisation techniques
  • Deep understanding of GPU architecture and memory optimisation
  • Track record implementing model quantisation for real-time production systems


Ideal additions

  • Open-source inference project contributions (vLLM, TGI, Triton)
  • Experience with ASR or audio processing workloads
  • Background in high-performance computing with large GPU clusters
  • Knowledge of model serving architectures at scale


Your package

  • Competitive salary up to €150K depending on experience
  • Remote with team gatherings in Europe
  • Comprehensive health coverage for you and your family
  • Unlimited PTO


If you're passionate about squeezing every millisecond from speech models and want the technical freedom to implement cutting-edge optimisation techniques, this is your opportunity to make a significant impact.

Required profile

Experience

Spoken language(s):
English

Other Skills

  • Teamwork
  • Problem Solving
