- Strong background in software engineering, with experience applying SRE or platform practices to improve system reliability, scalability, and performance.
- Experience owning or operating systems in production, including incident response, troubleshooting, and driving improvements based on operational learnings.
- Demonstrated ability to take ownership of complex systems in production and improve them over time.
- Ability to get up to speed quickly in unfamiliar codebases and, with minimal supervision, to identify and implement improvements to reliability, observability, and overall system health.
- Experience debugging complex distributed systems and analyzing issues across service boundaries.
- Strong communication skills and the ability to collaborate effectively with both technical and non-technical stakeholders.
- Passion for continuous improvement and for staying current with emerging cloud, security, and automation trends.
- Infrastructure as Code: Solid experience with Terraform or similar IaC tools.
- Containerized Platforms: Experience operating containerized workloads on cloud-native platforms (Kubernetes/EKS, ECS, or equivalent).
- Cloud Architecture: Familiarity with the AWS Well-Architected Framework or equivalent cloud architecture standards.
- Observability & Logging: Experience designing observability strategies and implementing solutions for metrics, logs, and traces.
- Software Engineering & Automation: Strong programming skills (e.g., Go or Python) with experience building systems, tooling, or services that improve reliability and developer workflows.