Senior Site Reliability (DevOps) Engineer

Job Description:

Rakuten Group, Inc. is a global leader in internet services, with a diverse ecosystem spanning e-commerce, fintech, communications, and more that serves approximately 1.8 billion members worldwide. Founded in Tokyo in 1997, the Group operates in over 30 countries and regions with more than 30,000 employees.

Based in Singapore's Central Business District, Rakuten Asia Pte. Ltd. serves as the regional headquarters for Asia, driving value through areas such as advertising product development, product strategy, and data management to support Rakuten Group's global ecosystem. Learn more at: https://global.rakuten.com/corp/

The Marketing Cloud Platform Department (MCPD) drives Rakuten's marketing product strategy, executes product development, and ensures successful implementation. We empower Rakuten's internal marketing teams by creating engaging, respectful, and cost-efficient marketing platforms that prioritize our customers. Leveraging the Rakuten Ecosystem, we offer comprehensive marketing solutions, including campaign management, multichannel communication, and personalization. As a team of over 150 experts across Japan, India, and Singapore, we pride ourselves on being a technology-driven organization that shares knowledge within the Rakuten Tech community.

As a Senior Site Reliability (DevOps) Engineer in MCPD, you will drive operational excellence by implementing best practices in observability, incident management, and automation. This role bridges engineering and operations, requiring both strong technical expertise and people management skills to build and maintain highly available systems that serve millions of Rakuten's customers globally.

Main Responsibilities:

Define and drive SRE strategy, including SLO/SLI frameworks, error budgets, and reliability targets aligned with business objectives and customer expectations
Establish and improve incident management processes, including on-call rotations, escalation procedures, and blameless post-mortem practices to minimize MTTR and prevent recurring issues
Collaborate with development teams to embed reliability practices into the software development lifecycle, advocating for design reviews, chaos engineering, and production readiness reviews
Design and implement comprehensive observability solutions (monitoring, logging, tracing, alerting) to provide actionable insights into system health and performance
Drive automation initiatives to reduce toil, improve deployment reliability, and enable self-service capabilities for engineering teams
Partner with Architecture and Platform teams to ensure infrastructure decisions support scalability, fault tolerance, and cost optimization goals
Manage capacity planning and performance optimization for critical marketing platforms handling high-volume campaign executions and real-time personalization
Report on reliability metrics, incident trends, and operational health to leadership, translating technical insights into business impact assessments

Required Qualifications:

8+ years of experience in software engineering, DevOps, or site reliability engineering, with at least 3 years in a people management role
Proven track record of building and leading high-performing SRE or platform engineering teams in a distributed, multi-timezone environment
Deep expertise in cloud platforms (GCP preferred, AWS/Azure acceptable) including compute, networking, storage, and managed services
Strong knowledge of containerization and orchestration technologies (Kubernetes, Docker) and Infrastructure as Code (Terraform, Ansible)
Hands-on experience with observability tools and practices (Prometheus, Grafana, Datadog, ELK Stack, or similar) and defining meaningful SLOs/SLIs
Experience with CI/CD pipelines, deployment strategies (blue-green, canary), and release engineering best practices
Strong programming/scripting skills in languages such as Python, Go, or Java for automation and tooling development
Excellent communication skills with the ability to collaborate effectively across engineering, product, and business stakeholders
Strong incident management experience with a demonstrated ability to lead calmly and effectively in high-pressure situations

Rakuten is an equal opportunities employer and welcomes applications regardless of sex, marital status, ethnic origin, sexual orientation, religious belief, or age.