Senior Research Engineer - Performance Optimization

Remote: Full Remote

Offer summary

Qualifications:

  • Significant problem-solving experience in PyTorch, CUDA, and distributed systems.
  • Experience training large models using Python & PyTorch, including the entire development pipeline.
  • Familiarity with high-performance parallel C++ and optimization techniques for CUDA and Triton.
  • Good understanding of deep learning concepts such as Transformers and multimodal generative models.

Key responsibilities:

  • Ensure efficient implementation of models and systems for data processing, training, inference, and deployment.
  • Identify and implement optimization techniques for massively parallel and distributed systems.
  • Work closely with the research team to ensure systems are planned for maximum efficiency.
  • Build tools to visualize, evaluate, and filter datasets.

Luma AI https://lumalabs.ai/dream-machine
11 - 50 Employees

Job description

We are looking for engineers with significant problem-solving experience in PyTorch, CUDA, and distributed systems. You will work with Research Scientists to build and train cutting-edge foundation models on thousands of GPUs.

Responsibilities

  • Ensure efficient implementation of models & systems for data processing, training, inference and deployment

  • Identify and implement optimization techniques for massively parallel and distributed systems

  • Identify and remedy efficiency bottlenecks (memory, speed, utilization) by profiling and implementing high-performance CUDA, Triton, C++ and PyTorch code

  • Work closely with the research team to ensure systems are designed for maximum efficiency from start to finish

  • Build tools to visualize, evaluate and filter datasets

  • Implement cutting-edge product prototypes based on multimodal generative AI

Experience

  • Experience training large models using Python & PyTorch, including practical experience with the entire development pipeline, from data processing, preparation, and data loading to training and inference.

  • Experience optimizing and deploying inference workloads for throughput and latency across the stack (inputs, model inference, outputs, parallel processing, etc.).

  • Experience profiling CPU & GPU code in PyTorch, using tools such as NVIDIA Nsight or similar.

  • Experience writing & improving highly parallel & distributed PyTorch code, with familiarity in DDP, FSDP, Tensor Parallel, etc.

  • Experience writing high-performance parallel C++. Bonus if done in an ML context with PyTorch, e.g., for data loading, data processing, or inference code.

  • Experience with high-performance Triton / CUDA and writing custom PyTorch kernels. Top candidates will be able to utilize tensor cores, optimize CUDA memory usage, and apply similar performance techniques.

  • Good to have: experience with deep learning concepts such as Transformers and multimodal generative models such as diffusion models and GANs.

  • Good to have: experience building inference / demo prototype code (incl. Gradio, Docker, etc.).
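As a rough illustration of the data-parallel pattern behind tools like DDP mentioned above, here is a minimal, stdlib-only Python sketch of the gradient all-reduce step; the worker count and gradient values are invented for the example, and real implementations use collective communication primitives rather than plain lists.

```python
# Sketch of the gradient averaging at the heart of data-parallel training
# (conceptually what DDP's all-reduce does after each backward pass).
# Pure Python, no torch; the numbers below are made up for illustration.

def all_reduce_mean(per_worker_grads):
    """Average each parameter's gradient across all workers (simulated all-reduce)."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    return [
        sum(worker[p] for worker in per_worker_grads) / num_workers
        for p in range(num_params)
    ]

# Four simulated workers, each holding local gradients for two parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged = all_reduce_mean(grads)  # -> [4.0, 5.0]; every replica applies the same update
```

In a real system this averaging is fused with communication (ring or tree all-reduce over NCCL) and overlapped with the backward pass, which is where much of the optimization work described in this role lives.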


Compensation

  • The pay range for this position in California is $180,000 - $250,000 per year; however, base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

Your applications are reviewed by real people.

Required profile

Experience

Spoken language(s):
English

Other Skills

  • Training And Development
  • Collaboration
  • Problem Solving
