Senior Distributed Systems Engineer

Remote: 
Full Remote
Contract: 
Work from: 

Offer summary

Qualifications:

5+ years of work experience in distributed systems and machine learning., Strong proficiency in Python and experience with Pytorch., Familiarity with high performance computing and multi-modal ML pipelines., Experience with C++ or CUDA is a plus..

Key responsibilities:

  • Collaborate with researchers to scale systems for training large models on GPU clusters.
  • Optimize model training code for hardware efficiency.
  • Develop systems for efficient workload distribution across GPU clusters.
  • Implement robust training methods to handle hardware failures.

Luma AI logo
Luma AI https://lumalabs.ai/dream-machine
11 - 50 Employees
See all jobs

Job description

We are looking for people with strong ML & Distributed systems backgrounds. This role will work within our Research team, closely collaborating with researchers to build the platforms for training our next generation of foundation models.

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.

  • Profile and optimize our model training code-base to achieve best in class hardware efficiency.

  • Build systems to distribute work across massive GPU clusters efficiently.

  • Design and implement methods to robustly train models in the presence of hardware failures.

  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.

  • Experience working with multi-modal ML pipelines, high performance computing and/or low level systems.

  • Passion for diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability.

  • Experience building stable and highly efficient distributed systems.

  • Strong generalist Python and Software skills including significant experience with Pytorch.

  • Good to have experience working with high performance C++ or CUDA.


Compensation

  • The pay range for this position in California is $180,000 - $250,000yr; however, base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan. 

Your application is reviewed by real people.

Required profile

Experience

Spoken language(s):
English
Check out the description to know which languages are mandatory.

Other Skills

  • Collaboration
  • Problem Solving

System Engineer Related jobs