Site Reliability Developer 4

As a Senior Site Reliability Developer (IC4), you will play a key role in ensuring the availability, scalability, and operational excellence of OCI's Japan Sovereign Cloud services. You will design and implement automation, drive service reliability improvements, lead complex incident investigations, and partner with development teams to improve operational readiness. You will own and prioritize an SRD operational improvement backlog based on shift feedback, incident reviews, alert quality reviews, and business reliability requirements.

The role combines software engineering expertise with large-scale cloud operations and requires participation in a 24x7 shift rotation supporting critical cloud infrastructure. You will translate operational and business requirements into reliability plans, then execute improvements through tooling, automation, runbook updates, process changes, and cross-team coordination. You will also serve as a technical mentor for less experienced engineers and contribute to continuous improvement initiatives across the organization. You will collaborate with JP Sovereign Cloud and EU Sovereign Cloud teams to share operational practices and align reliability improvements where appropriate.

Qualifications

- Native-level Japanese language proficiency and business-level English communication skills
- 5+ years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, DevOps, or related technical disciplines
- Proficiency in one or more programming languages such as Java, Python, Go, C++, or similar
- Experience with cloud platforms, infrastructure automation, observability, monitoring, and incident response practices
- Strong understanding of Linux systems administration, networking, storage, and performance optimization
- Demonstrated ability to troubleshoot complex cross-functional production issues and drive root cause analysis
- Ability to participate in a 24x7 shift rotation and provide technical leadership during critical service events
- Demonstrated ability to intake, triage, and prioritize operational issues raised by shift teams and convert them into executable improvement plans
- Experience improving alert quality, reducing alert noise, increasing actionability, and ensuring operational documentation supports timely incident response
- Ability to balance business requirements, technical feasibility, and operational risk when planning reliability improvements

Career Level - IC4
