KMC Solutions Inc Himalayas · Posted 1mo ago

XTN-DAD3686 | SITE RELIABILITY ENGINEER

USD Full time Remote

Continue to application Add your email once, then Caio opens the original posting.

Indexed description

You will be working on validating and testing GPU clusters prior to production release, ensuring hardware integrity, system reliability, and optimal performance. This role involves provisioning clusters, executing performance benchmarks, maintaining automated validation frameworks, and troubleshooting Linux-based systems in high-performance compute environments. You will collaborate closely with engineering and operations teams to ensure seamless handovers and production readiness.

.• Health Insurance/HMO

• Enjoy unlimited MadMax Coffee

• Diverse learning & growth opportunities
• Accessible Cloud HR platform (Sprout)

• Above standard leaves

Cluster Validation & Testing

Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release
Perform functional and reliability testing of GPUs, servers, and associated components
Verify network connectivity and performance, including InfiniBand where applicable

Orchestration & Benchmarking

Provision and configure GPU clusters using automated workflows
Execute and analyse performance and stability benchmarks orchestrated via Slurm
Validate results against expected performance and reliability thresholds

Test Framework & Automation

Maintain and extend the automated validation framework built using Python and Ansible
Integrate new test cases to support additional hardware platforms and GPU generations
Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

Diagnose and remediate unhealthy nodes through configuration changes or software fixes
Coordinate with on-site support and Smart Hands teams for hardware replacements when required
Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

Produce clear, accurate documentation of test results, hardware states, and remediation actions
Ensure smooth handovers to operations and engineering teams
Maintain up-to-date runbooks and validation procedures

Essential

• Strong hands-on experience administering and troubleshooting Linux systems (Prio)
• Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system

services

• Excellent written and verbal English communication skills
• High standards for system reliability, consistency, and documentation
Preferred / Desirable
• Experience working with GPU-based or high-performance compute environments
• Familiarity with Slurm or other workload schedulers
• Understanding of datacenter hardware lifecycle and server validation processes
• Exposure to InfiniBand or high-speed networking technologies
• Experience working with distributed or remote infrastructure teams
• Proficiency in Python for automation, test execution, and parsing results (Preferred)
• Proven experience writing and maintaining Ansible playbooks (Preferred)

.

Originally posted on Himalayas

Free. 20 seconds. No password. See every match in this search.

Create a free Caio profile to unlock more results and save your role and location preferences.

Unlock free search

Want help applying to roles like this? Search Caio for free. If repetitive applications get heavy, Managed Job Search adds supervised execution for $99/month.

View Managed Job Search

KMC Solutions Inc Company profile preview

Source: Himalayas
Location
Compensation: USD
Open on Caio: 97 roles

Salary insight

USD

Caio highlights salary ranges whenever the original posting exposes them. Compare similar roles as the index fills in.

Similar role details

Full time roles Remote matches Himalayas postings

Company stats

Current index details for KMC Solutions Inc, based on roles Caio has indexed from public sources.

97open roles 3sources 1markets Posted yesterdaylatest role