Senior HPC and AI Networking Performance Research and Analysis Engineer
What You'll Be Doing
- Explore and research AI workloads and DL models for large-scale LLM training on NVIDIA supercomputers, with a focus on high-performance networking.
- Benchmark, profile, and analyze performance to find bottlenecks and identify opportunities for improvement and optimization, with a strong emphasis on networking.
- Implement performance analysis tools.
- Collaborate with many teams, from HW to SW, to provide performance analysis insights.
- Define performance test plans, set performance expectations for new technologies and solutions, and work to reach the performance targets.
What We Need To See
- B.Sc. in Computer Science or Software Engineering
- 6+ years of experience with high-performance networking (RDMA, MPI, NCCL)
- Demonstrated performance analysis skills and methodologies
- Experience with NVIDIA GPUs, the CUDA library, and deep learning frameworks like TensorFlow or PyTorch, combined with expertise in collective communication libraries (such as NCCL) and networking protocols (such as RoCE and RDMA)
- Fast self-learner with strong analytical and problem-solving skills
- Programming languages: Python, Bash, and C
- Experience with Linux distributions
- Team player with good communication and interpersonal skills
- In-depth knowledge of and experience with AI workload benchmarking for distributed LLM training, and with the CUDA and NCCL libraries
- In-depth system knowledge and understanding (Intel / AMD / ARM CPUs, NVIDIA GPUs, HCAs, memory, PCI)
- Knowledge of congestion control algorithms
JR2011934