Job Details

Senior Software Engineer - Reliability (Remote)

  2025-11-30     Jobgether     all cities,AK  
Description:

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer - Reliability (Remote) in California (USA).

We are seeking a Senior Software Engineer specializing in Reliability to help design, implement, and operate systems that ensure cloud-based production environments remain secure, compliant, and highly available. In this role, you will be a foundational member of a new Site Reliability Engineering (SRE) team, building processes and infrastructure to support mission-critical workloads in regulated environments. You will collaborate with engineering, product, and operational teams to define service-level objectives, develop monitoring and automation, and improve overall system reliability. The ideal candidate is experienced in cloud infrastructure, automation, and observability, and enjoys solving complex distributed system challenges. This role offers the opportunity to shape the SRE culture and practices from the ground up, while contributing to high-impact projects that support regulated and commercial operations.

Accountabilities

  • Design and implement observability practices including metrics, traces, dashboards, logs, and alerting for production systems
  • Partner with engineering, product, and lab teams to define SLIs/SLOs, error budgets, and incident response procedures
  • Develop and maintain operational playbooks and runbooks for reliability and compliance
  • Participate in on-call rotations, championing automation and self-healing for production systems
  • Contribute to deployment processes and infrastructure automation using Infrastructure as Code (IaC)
  • Collaborate on incident reviews, postmortems, and disaster recovery exercises to improve system reliability
  • Mentor peers, promote best practices, and help establish the SRE culture and strategy

Requirements
  • Bachelor's degree in Computer Science, Engineering, or equivalent experience
  • 5+ years of experience in software engineering, SRE, or DevOps roles (Python or Go preferred)
  • Hands-on experience deploying and operating production workloads in cloud environments (AWS, GCP, or Azure)
  • Expertise in Infrastructure as Code (Terraform, Pulumi, Bicep/ARM)
  • Experience with incident management platforms (e.g., Incident.io, ServiceNow, Opsgenie, PagerDuty)
  • Strong knowledge of Kubernetes (AKS, GKE, EKS) and cloud networking
  • Proficiency with observability platforms such as Datadog, Prometheus/Grafana, or OpenTelemetry
  • Excellent troubleshooting, root-cause analysis, and automation skills
  • Ability to work autonomously and collaborate effectively with cross-functional teams
  • Experience in regulated environments (healthcare, biotech) and familiarity with compliance-driven change management are a plus

Benefits
  • Competitive salary: $131,325-$201,000 USD, with potential for pre-IPO equity and cash bonuses
  • Comprehensive medical, dental, and vision coverage
  • Paid time off and holidays
  • Remote work flexibility
  • Opportunities for professional growth, mentorship, and leadership in a foundational SRE team
  • Participation in shaping processes for high-reliability systems in regulated environments

Seniority Level
  • Mid-Senior level

Employment Type
  • Full-time

Job Function
  • Information Technology

Industries
  • Non-profit Organizations and Primary and Secondary Education



