- Experience working with distributed computing frameworks such as Ray, Spark, or Flink, and familiarity with the Ray ecosystem (Ray Data, Ray Train) for distributed data processing and model training
- Experience building and optimizing large-scale distributed ML training pipelines using techniques such as Torch compilation, quantization, CUDA, and GPU kernel optimization
- Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
- Deep experience designing and operating production-grade data pipelines
- Strong programming skills in Python and experience working with large-scale distributed workloads
- Experience with modern data infrastructure (data lakes, warehouses, orchestration systems, streaming platforms)
- Strong systems thinking, with the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems
- Proven ability to lead technical direction and influence architectural decisions across teams without formal authority