- Help define the roadmap for leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
- Improve realtime anomaly detection capabilities by leveraging various state of the art ML techniques, thereby directly contributing to improving Mean Time to Detect Production issues.
- Develop methods to build pipelines to consume various streams of data (metrics, logs, traces, change management systems etc.).
- Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
- Build time-series models to predict capacity exhaustion and seasonal traffic spikes to drive automated scaling.
- Knowledge of various modeling techniques, ability to go deep and fine tune models to fit our use cases.
- Ability to propose and architect the infrastructure that allows us to implement systems that learn from user and/or automated feedback.
- Good distributed systems fundamentals and understanding of large scale high throughput systems.