Staff Site Reliability Engineer

United States (US), via direct

// Job Type

Full Time

// Salary

Not disclosed

// Posted

4 months ago

// Seniority

lead

// Experience

5+ years

About the Role

Why Lytx – Staff Site Reliability Engineer

At Lytx, our engineering culture is built around being hungry, low-ego, and highly capable. We are pragmatic engineers who take ownership, collaborate openly, and focus on delivering measurable operational impact. Our mission is to design, operate, and continuously improve the cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services at scale.

As our platform grows in scale and complexity, we are investing in next-generation observability, intelligent automation, and data-driven operations to improve reliability, reduce operational noise, and enable faster detection and recovery. We are also expanding the use of AI and advanced analytics to move toward more proactive and automated operations.

The Site Reliability Engineering (SRE) team is responsible for the availability, reliability, observability, and resilience of our cloud-native environments. This includes building automation, improving operational intelligence, and partnering across engineering to ensure systems are designed and operated for reliability and scale.

As a Staff SRE, you will operate as a technical leader across multiple teams and services. You will drive reliability initiatives, influence architecture and operational practices, and lead efforts that reduce operational risk, improve system visibility, and increase the effectiveness of engineering through automation and intelligent operations.

If you enjoy solving complex distributed systems challenges, building scalable operational solutions, and leading improvements that have broad impact across the organization, this role is an excellent fit.

Responsibilities / You’ll get to

Technical Leadership Across Services - Lead reliability, performance, and operational improvements across multiple services or platform domains, working with engineering teams to ensure systems meet availability and scalability goals.

Observability Architecture & Strategy (Team-Level) - Design and drive improvements to monitoring, logging, tracing, and alerting. Establish patterns and reusable solutions that improve signal quality, reduce alert noise, and enable faster detection and diagnosis.

Operational Automation & AIOps - Lead initiatives that reduce operational toil through automation, including runbook automation, self-healing workflows, event correlation, anomaly detection, and automated remediation.

Incident Leadership & Systemic Improvement - Provide technical leadership during high-severity incidents and drive blameless postmortems that identify systemic issues and result in durable reliability improvements.

Reliability Engineering & Resilience - Partner with product and platform teams to embed reliability, performance, and fault tolerance into system design, including capacity planning, scaling strategies, and failure-mode analysis.

Infrastructure & Cloud Engineering - Design and implement scalable AWS infrastructure using Infrastructure-as-Code and cloud-native best practices, enabling consistent and reliable service operation.

Standards & Best Practices Influence - Influence SRE practices such as SLO/SLI design, alerting standards, operational readiness, and reliability reviews. Contribute to evolving operational standards and engineering guidelines.

Cross-Functional Collaboration - Work closely with developers, platform engineers, architects, and operations teams to drive reliability, observability, and operational maturity across the engineering organization.

Mentorship & Technical Guidance - Mentor Senior and mid-level engineers, provide technical guidance, and act as a subject matter expert for reliability, observability, and operational excellence.

Innovation & Tooling Evaluation - Evaluate and introduce new tools, AWS-native capabilities, and emerging observability or AI-enabled operational technologies that improve reliability and engineering efficiency.
Requirements / You’ll Need

Experience

6 - 8 years of experience in SRE, DevOps, platform engineering, or cloud infrastructure roles supporting large-scale production environments.

Demonstrated experience leading reliability or infrastructure initiatives across multiple teams or services.

Strong experience operating 24/7 production systems, including incident leadership, root cause analysis, and proactive reliability improvement.

Cloud & Infrastructure

Deep hands-on experience designing and operating production workloads in AWS, including services such as EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, VPC, IAM, and CloudWatch.

Strong experience building and managing infrastructure using Terraform, CloudFormation, or similar Infrastructure-as-Code tools.

Observability

Strong experience designing and implementing observability solutions using tools such as New Relic, Datadog, Prometheus/Grafana, CloudWatch, or similar.

Experience with OpenTelemetry or modern telemetry standards.

Experience improving telemetry quality, alert tuning, dashboard design, and operational visibility across multiple services.

Automation & Engineering

Strong programming or scripting skills (Python, Go, Bash, or similar) for building automation, operational tooling, and integrations.

Experience building reusable automation frameworks or shared tooling preferred.

Systems & Platform Expertise

Strong understanding of Linux systems, networking fundamentals (TCP/IP, DNS, TLS), and distributed system behavior.

Experience with Kubernetes and cloud-native architectures.

Operational Intelligence (Preferred)

Experience improving operational signal quality through alert noise reduction, event correlation, anomaly detection, or automated remediation.

Experience with AIOps concepts or AI-assisted operational tooling.

Leadership & Influence

Demonstrated ability to influence technical decisions without direct authority.

Experience mentoring engineers and driving cross-team technical initiatives.

Ability to operate effectively in complex, high-impact production environments.

Innovation Lives Here

You go all in no matter what you do, and so do we. At Lytx, we’re powered by cutting-edge technology and Happy People. You want your work to make a positive impact in the world, and that’s what we do. Join our diverse team of hungry, humble and capable people united to make a difference. Together, we help save lives on our roadways.

Find out how good it feels to be a part of an inclusive, collaborative team. We’re committed to delivering an environment where everyone feels valued, included and supported to do their best work and share their voices.

Lytx, Inc. is proud to be an equal opportunity/affirmative action employer and maintains a drug-free workplace. We’re committed to attracting, retaining and maximizing the performance of a diverse and inclusive workforce. EOE/M/F/Disabled/Vet.

Lytx® is a leading provider of video telematics, analytics, safety and productivity solutions for commercial and public sector fleets. Our unrivaled Driver Safety Program, powered by our best-in-class DriveCam® Event Recorder, is proven to help save lives and reduce risk. We harness the power of video to help clients see what happened in the past, manage their operations more efficiently in the present and improve driver behavior to change the future. Our customizable services and programs span driver safety, risk detection, fleet tracking, compliance and fuel management. Using the world’s largest driving database of its kind, along with proprietary machine vision and artificial intelligence technology, we help protect and connect thousands of fleets and more than one million drivers worldwide. For more information, visit www.lytx.com, @lytx on X, LinkedIn, our Facebook page or YouTube channel.

Privacy Notice: Please see Lytx’s Global Human Resources Privacy Statement for more information related to Personal Information we process and store related to our applicants.

Tech Stack

SRE, DevOps, AWS, EC2, EKS, ECS, RDS, DynamoDB, S3, ALB, NLB, VPC, IAM, CloudWatch, Terraform, CloudFormation, New Relic, Datadog, Prometheus, Grafana, OpenTelemetry, Python, Go, Bash, Linux, TCP/IP, DNS, TLS, Kubernetes
