Staff Software Engineer, Core Reliability
Responsibilities
- Build and launch reliability projects and features that improve resiliency across the service environment in partnership with reliability teams.
- Work closely with critical T0/T1 services to understand architecture, improve scalability and reliability, and reduce operational toil.
- Build and enhance systems that securely manage service configurations and secrets at scale.
- Improve canary-based release systems to make deployments safer and reduce incidents.
- Expand deployment capabilities to support thousands of services and hundreds of daily deployments.
- Partner across teams to promote reliability best practices and strengthen reliability culture.
Requirements
- 7+ years of software engineering experience.
- Experience designing, building, scaling, and maintaining production services in service-oriented architectures.
- Strong system design and coding skills, with a track record of writing high-quality, well-tested code.
- Strong observability, debugging, and performance tuning skills.
- Excellent written and verbal communication skills, with the ability to explain technical concepts clearly.
- Sound judgment under pressure and a willingness to debug and improve any layer of the stack.
- Ability to participate in an on-call rotation and respond to issues outside normal business hours.
- Experience building reliable, high-throughput, low-latency systems.
- Experience with observability tools such as Kibana and Datadog.
- Familiarity with rapid-growth environments.
- Experience with Ruby, Go, Terraform, and cloud platforms such as AWS, GCP, or Azure.
- Responsible use of generative AI, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality.
Benefits
- Equity
- Bonus eligibility
- Medical insurance
- Dental insurance
- Vision insurance
- 401(k)