- 10+ years of experience building and operating large-scale distributed systems and infrastructure.
- Deep, hands-on GPU expertise at the machine management layer and above: GPU host provisioning, driver and firmware lifecycle, GPU health and reliability, and the realities of running accelerators in production.
- A track record as an expert in compute, not just fleet management, with the scars to prove you have scaled GPU or accelerator infrastructure that other teams depend on.
- Strong proficiency in Go or a comparable programming language.
- Experience operating GPU and AI workloads in production, including familiarity with CUDA, GPU scheduling, and high-performance networking (NVLink, InfiniBand, RoCE).
- Familiarity with Kubernetes for GPU workloads and with bare-metal concepts (firmware, BMC/IPMI/Redfish, OS imaging) is a strong plus.
- A history of being the anchor expert that an organization relies on for its hardest GPU and compute problems, and the leadership to raise the level of the engineers around you.