Senior Site Reliability Engineer (SRE) (GCP) - Remote
Required Qualifications
- 3+ years of experience in DevOps, Cloud Engineering, Site Reliability Engineering, or similar roles.
- Hands-on experience with Google Cloud Platform (GCP).
- Strong understanding of core GCP services, including:
  - Compute Engine
  - Cloud Run
  - App Engine
  - Google Kubernetes Engine (GKE)
- Production experience managing Kubernetes environments.
- Experience configuring Kubernetes resources such as Deployments, Services, Ingress, ConfigMaps, Secrets, and Autoscaling.
- Solid understanding of Kubernetes health checks, including readiness and liveness probes.
- Experience with Infrastructure as Code using Terraform.
- Understanding of Terraform state management and multi-environment infrastructure design.
- Strong Linux administration and troubleshooting skills.
- Good understanding of networking concepts, including:
  - VPCs
  - Subnets
  - Firewall rules
  - Load balancing
  - Private networking
- Experience with monitoring, logging, and observability platforms.
- Experience investigating and resolving production incidents.
- Understanding of reliability concepts such as SLA, SLO, and SLI.
- Strong verbal and written English communication skills.
Preferred Qualifications
- Experience designing highly available and globally distributed applications in GCP.
- Knowledge of zero-downtime deployment strategies.
- Experience supporting large-scale production environments.
- Experience with multi-tenant architectures.
- Scripting experience using Python, Bash, or similar languages.
- Experience working in hybrid cloud/on-premises environments.
- Experience participating in SEV incident management.
- Familiarity with capacity planning and performance tuning.
Technology Stack
- Cloud: Google Cloud Platform (GCP)
- Containers: Kubernetes, GKE
- Infrastructure as Code: Terraform
- Monitoring & Observability: Grafana, Prometheus, logging platforms
- Operating Systems: Linux
- Incident Management: PagerDuty, ServiceNow, Slack (or equivalent tools)
Working Requirements
- Availability to work within Central Time (CT) business hours.
- Participation in an on-call rotation that includes coverage for one weekend day when scheduled.
What Success Looks Like
- Reliable operation of production systems during periods of high traffic and critical business activity.
- Fast and effective incident response and troubleshooting.
- Well-automated, maintainable infrastructure managed through Infrastructure as Code.
- Strong collaboration with development teams to improve reliability, scalability, and operational efficiency.
Responsibilities
- Deploy, maintain, and improve cloud infrastructure in Google Cloud Platform (GCP).
- Operate and support Kubernetes environments, including GKE.
- Build and maintain Infrastructure as Code using Terraform.
- Monitor production systems and proactively identify reliability risks.
- Troubleshoot infrastructure, networking, application, and performance issues.
- Participate in incident response, root cause analysis, and postmortem activities.
- Implement and maintain observability solutions, dashboards, and alerting systems.
- Collaborate with software engineering teams to improve deployment processes and operational excellence.
- Support highly available and scalable production environments.
- Contribute to automation initiatives that reduce operational overhead and improve reliability.
At Devsu, you'll work alongside top-tier professionals, with the opportunity for continuous learning and participation in challenging, high-impact projects for global clients. Our team is present in more than 18 countries, collaborating on a variety of software products and solutions.
We are looking for a hands-on Semi Senior DevOps Engineer to join a high-impact project supporting a global-scale sports event. This role is ideal for someone who enjoys working close to production systems, troubleshooting complex issues, automating infrastructure, and ensuring platform reliability in mission-critical environments.
Benefits
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
Published 30 days ago