About the Role
We're building a new SRE team and looking for founding members to help shape how we operate. As a Lead SRE, you’ll be a technical and operational leader for reliability across Develocity. You’ll help define our SRE vision, set standards for how we operate production services, and mentor other SREs as the team grows. This is a hands-on role with broad influence across engineering, cloud platform, and customer-facing teams.
The SRE team will be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries.
You'll work on our internally built Cloud Application Platform (Kubernetes on AWS) and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well.
You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential.
Responsibilities
Operate and maintain all Develocity instances and supporting services in production.
Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs.
Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents.
Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements.
Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets.
Identify systemic reliability risks and continuously evolve Develocity’s SaaS operations as the platform and customer base grow.
Lead and influence architectural and design reviews to ensure reliability, scalability, and operability.
Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows.
Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting.
Own disaster recovery, backups, and business continuity planning and execution.
Partner with engineering leadership to balance feature delivery with reliability and operational excellence.
Mentor and coach SREs, supporting technical growth and strong operational practices.
Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity.
Communicate clearly with customers during incidents and maintenance windows.
Optimize performance, resource utilization, and operational costs.
Minimum Qualifications
7+ years in SRE, DevOps, or an equivalent role operating production services at scale.
Experience leading reliability initiatives across multiple teams or services.
Demonstrated ability to influence technical direction without direct authority.
Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost.
Strong Kubernetes experience in production environments.
Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2).
Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform).
Track record of incident management and response in a 24/7 on-call environment.
Scripting proficiency (Python, Bash) for automation.
Strong written and verbal English communication skills.
Preferred Qualifications
Experience as a founding or early SRE establishing practices in a growing SaaS organization.
Familiarity with Develocity.
JVM language experience (Java, Kotlin).
Experience with customer-facing and executive-level incident communications.
Tech Stack
SRE, DevOps, Kubernetes, AWS, EKS, RDS, S3, EC2, Prometheus, Grafana, Terraform, Python, Bash