Site Reliability Developer 5

As a Principal Site Reliability Developer (IC5), you will provide technical leadership for the reliability, availability, and operational strategy of OCI's Japan Sovereign Cloud platform. You will lead large-scale reliability initiatives, influence architecture decisions, develop advanced automation frameworks, and drive operational excellence across multiple cloud services. You will define SRD reliability strategy, operational standards, and improvement roadmaps that connect business requirements with technical execution across alerting, incident response, availability, and reliability.

This position requires deep expertise in distributed systems, cloud infrastructure, and software engineering, combined with the ability to collaborate effectively with senior engineering leaders across global OCI organizations. You will align operational practices with JP Sovereign Cloud, EU Sovereign Cloud, and global OCI reliability teams, and drive standardization where it improves service resiliency. The role includes participation in a 24x7 operational support model, and you will serve as a key escalation point for high-severity incidents and strategic reliability improvements. You will sponsor improvements raised from shift operations, ensure recurring issues are addressed through durable fixes, and mentor senior engineers in owning both planning and execution.

Qualifications


- Native-level Japanese language proficiency and business-level English communication skills
- 8+ years of experience in Site Reliability Engineering, Cloud Infrastructure Engineering, Software Development, or large-scale distributed systems operations
- Extensive experience designing, operating, and improving highly available cloud platforms and mission-critical services
- Expert-level proficiency in software development and automation using languages such as Java, Go, Python, or similar
- Deep understanding of distributed systems architecture, networking, storage, observability, and service resiliency principles
- Proven track record leading major incident response efforts, reliability programs, and cross-organizational technical initiatives
- Ability to influence architecture, operational standards, and engineering best practices across multiple teams
- Willingness to participate in a 24x7 shift rotation and act as a senior escalation resource for critical production events
- Proven ability to define reliability strategy and operational standards across multiple teams or services
- Experience translating business and operational requirements into prioritized reliability roadmaps with measurable outcomes
- Ability to drive cross-sovereign collaboration, including shared operational practices and tooling alignment across JP Sovereign Cloud and EU Sovereign Cloud

Career Level - IC5
