- Strong experience building and operating production-grade online ML inference systems.
- Experience with model serving frameworks such as NVIDIA Triton Inference Server, TorchServe, Ray Serve, or TensorFlow Serving.
- Experience optimizing inference workloads using techniques such as dynamic batching, model compilation, quantization, GPU acceleration, GPU kernel optimization, caching, or runtime tuning.
- Strong experience with distributed systems, Kubernetes, autoscaling, service reliability, and production observability.
- Strong programming skills in Python, with practical experience working on production ML systems and high-scale services.
- Experience with PyTorch and modern model deployment workflows, including model packaging, validation, and serving lifecycle management.
- Experience designing infrastructure for safe model rollout, canary testing, A/B experimentation, and automated rollback.
- Strong systems thinking, with the ability to reason about latency, throughput, reliability, scalability, and cost tradeoffs in online systems.
- Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.