Hyphen Connect Limited
LLM Pre-training & Distributed Engineer (AI Infrastructure)
Lead orchestration and optimization of large-scale LLM pretraining across 1,000+ GPUs. Manage distributed training with PyTorch/DeepSpeed/Megatron-LM, tune networking (InfiniBand/RDMA) and memory usage, and implement checkpointing and robust failure recovery for long-running jobs.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training efficient and reliable.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) performance and manage memory to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Required Skills:
- Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).

