About the Job
We are currently seeking an ambitious, driven, and experienced DevOps/SRE to help us scale up our infrastructure to meet increasing demand.
You will play a key role in our growing engineering team in the conceptualization, design, deployment, and continuous improvement of the infrastructure that supports our developer platform.
Your experience will guide us in architecting our network to deliver products that meet the highest standards. Your knowledge of metrics, logs, and traces will play a major role in managing our infrastructure and backend applications and in keeping vital services performing at their best for our users. You should be self-driven and conscientious, with a keen eye for identifying and automating high-impact tasks.
Requirements
- Own and manage the production systems from an operational standpoint (e.g., deployment, data logging, monitoring, alerts).
- Use key metrics and usage data to continuously design and implement solutions that improve the reliability, security, and scalability of our infrastructure.
- Develop and own best practices for managing production infrastructure: provisioning, application scaling, configuration management, capacity planning, monitoring, etc.
- Provide key updates and operational support to our users via our engagement channels.
- Contribute input and fresh ideas to long-term platform requirements and operational guidelines, with a key focus on reliability.
- Continuously raise our standard of engineering excellence by implementing best practices for coding, testing, and deployment.
- Build and maintain documentation around processes and workflows.
We're looking for candidates with experience in
- DevOps or Site Reliability Engineering roles
- Designing and operating large-scale, multi-region production systems
- Working with GCP or other cloud service providers such as AWS, DigitalOcean, or Azure
- Real-time telemetry and tracing tools like Prometheus, Stackdriver, and Datadog
- Building deployment pipelines leveraging common CI/CD tools