- Experience working with distributed computing frameworks such as Flink, Spark, or Ray for data processing
- Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines
- Experience optimizing big data pipelines and infrastructure for cost efficiency
- Strong programming skills in Python and experience working with large-scale distributed workloads
- Experience with modern data infrastructure (data lakes, warehouses, orchestration systems, streaming platforms)
- Strong systems thinking, with the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems
- Proven ability to lead technical direction and influence architectural decisions across teams without formal authority