Operations-Focused Engineer

Houston, Texas, United States

// Job Type

Full Time

// Salary

Not disclosed

// Posted

5 months ago

About the Role

<h2>REMOTE</h2> <h4><u>About TTC</u></h4> <p>The Testing Consultancy (TTC) is a global specialist software testing company with a focus on helping organizations transform the way they deliver quality software. We have broad capabilities across a wide range of testing areas that enable our clients to increase the speed and quality of software development while reducing risk and cost.</p> <h4><u>Perks of working for TTC</u></h4> <ul> <li>Competitive Base Salary</li> <li>Medical, Dental, Vision Benefits</li> <li>401K w/ company match</li> <li>Paid Time Off</li> <li>Paid Holidays</li> <li>Work Life Balance</li> <li>Relaxed Work Environment</li> <li>Growth and Development Opportunities</li> </ul> <h4><span style="text-decoration: underline;">Role summary</span></h4> <p>We’re looking for an <strong>operations-focused engineer</strong> to join our team. This role owns the <strong>day-to-day reliability and operational excellence</strong> of a portfolio of business-critical third-party enterprise platforms and integrations, partnering closely with engineering and cross-functional infrastructure teams to keep systems healthy, scalable, and secure. 
</p> <h4><span style="text-decoration: underline;">Responsibilities</span></h4> <ul> <li>Serve in an <strong>on-call rotation</strong> and lead incident response for production issues: triage, mitigation, escalation, and restoration.</li> <li>Drive <strong>operational excellence</strong>: improve alert quality, reduce toil, document runbooks, and create repeatable operational processes.</li> <li>Perform <strong>root cause analysis</strong> for incidents and recurring issues; drive corrective and preventive actions to completion.</li> <li>Execute and coordinate <strong>maintenance activities</strong> (upgrades, patching, configuration changes) with minimal risk and downtime.</li> <li>Build and maintain <strong>monitoring, dashboards, and health checks</strong> to detect issues early and reduce mean time to recovery.</li> <li>Automate routine operational workflows using scripts and small tools; improve reliability through safe incremental change.</li> <li>Partner cross-functionally (security, networking, storage, compute, vendor/third-party partners) to resolve complex issues.</li> <li>Maintain accurate system documentation, operational standards, and service ownership practices across supported platforms.</li> </ul> <h4><span style="text-decoration: underline;">Minimum qualifications</span></h4> <ul> <li>3+ years of experience in <strong>production operations</strong>, SRE, systems engineering, or production support for enterprise services.</li> <li>Strong <strong>Linux/systems troubleshooting</strong> skills (processes, logs, performance, networking basics).</li> <li>Experience participating in or leading <strong>on-call</strong> rotations and handling production incidents with clear communication.</li> <li>Proficiency in <strong>scripting/automation</strong> (e.g., Python and/or shell) and comfort with change management / peer review workflows.</li> <li>Strong written and verbal communication; able to write clear runbooks and incident summaries.</li> </ul> <h4><span 
style="text-decoration: underline;">Preferred qualifications</span></h4> <ul> <li>Experience operating <strong>third-party enterprise platforms</strong> (integration middleware, identity/auth systems, web/app tiers, databases, batch/scheduled jobs).</li> <li>Familiarity with <strong>vulnerability remediation</strong> and patch management practices in production environments.</li> <li>Demonstrated track record of reducing operational toil and improving reliability metrics (MTTR, alert noise, incident recurrence).</li> <li>Experience coordinating complex incidents across multiple teams and stakeholders.</li> <li>Experience using Capirca for network provisioning, Chef for configuration management, and Infrastructure as Code and containers for deployment.</li> </ul> <h4><span style="text-decoration: underline;">Success in the first 60–90 days</span></h4> <ul> <li>Ramp to <strong>primary on-call</strong> ownership for supported systems.</li> <li>Demonstrate ability to independently troubleshoot common failure modes and follow operational playbooks.</li> <li>Deliver at least 1–2 measurable reliability improvements (toil reduction, alert cleanup, monitoring gap closure, recurring issue fix).</li> </ul> <h4><span style="text-decoration: underline;">Working style</span></h4> <ul> <li>Calm under pressure, structured problem-solver, prioritizes reliability and safety.</li> <li>Proactive communicator who keeps stakeholders informed during incidents and planned work.</li> <li>“Automate and document” mindset: reduces repeated manual work and makes operations scalable.</li> </ul> <p> </p> <p>If your experience or qualifications are similar to our profile of a successful candidate, please consider applying. Experience comes in many forms; skills can be transferred, but passion for your career can't be substituted. At TTC, we understand the importance of diversity and how much value it brings to the table. 
Diversity fosters creativity and new perspectives, which is why we encourage everyone to apply.</p>
