Availability Engineer
Who we are
DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content, and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today’s threats and prepare for a quantum-safe future at www.digicert.com
Job summary
We are seeking a highly skilled Observability & Incident Response Site Reliability Engineer (SRE) to own incident management practices across all production systems. In this role, you will be the subject matter expert for monitoring, alerting, tracing, and logging, and you will lead incident response efforts. You will work at the intersection of product engineering, platform, and security teams to ensure our systems are observable, resilient, and compliant with SLA/SLO commitments.
What you will do
- Apply excellent knowledge of Kubernetes clusters and container workloads to maintain production reliability.
- Administer and optimize CI/CD pipelines to support safe, fast, and frequent deployments, and automate repetitive manual tasks (Harness, GitHub Actions, etc.).
- Act as the primary Incident Manager for high-priority production incidents, coordinating swift resolution across engineering, infrastructure, and business teams.
- Own and continuously improve incident response runbooks, escalation matrices, and on-call schedules.
- Drive root cause analysis for all major incidents, ensuring action item tracking and long-term resolution.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through proactive alerting and automated remediation.
- Establish and enforce SLA/SLO/SLI frameworks across all production services.
- Build automated runbooks and self-healing mechanisms to reduce manual intervention during incidents.
- Implement synthetic monitoring to proactively detect customer-facing issues.
- Use Splunk queries hands-on to investigate incidents, build dashboards, and drive observability across production systems.
- Exceptional communication skills — able to lead high-pressure incident bridges calmly and clearly.
- Detail-oriented with a strong sense of ownership and accountability.
- Ability to manage multiple concurrent incidents and priorities without losing composure.
What you will have
- 4+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
- Hands-on experience leading incident response for high-severity production incidents.
- Strong background in Linux systems administration and distributed systems troubleshooting.
- Experience defining and managing SLOs, SLIs, and Error Budgets in production.
Nice to have
- Monitoring & alerting: New Relic, Nagios, or equivalent.
- Log management: Splunk.
- Incident management: PagerDuty, OpsGenie, VictorOps, or equivalent.
- Container orchestration: Kubernetes, Helm, Docker — with deep observability integration experience.
- Scripting & automation: Python, Bash or similar for building tooling and automations.
- Infrastructure as Code: Terraform or Salt.
- CI/CD pipelines: GitHub Actions, Harness.
Benefits
- Generous time off policies.
- Top-shelf benefits.
- Education, wellness and lifestyle support.
To protect candidate information and maintain a secure hiring process, all applications must be submitted through our careers portal. Resumes or CVs sent directly via email will not be reviewed or considered.