Senior Site Reliability (DevOps) Engineer

Job Description:

Rakuten Group, Inc. is a global leader in internet services, with a diverse ecosystem spanning e-commerce, fintech, communications, and more that serves approximately 1.8 billion members worldwide. Founded in Tokyo in 1997, the Group operates in over 30 countries and regions with more than 30,000 employees.

Based in Singapore's Central Business District, Rakuten Asia Pte. Ltd. serves as the regional headquarters for Asia, driving value through areas such as advertising product development, product strategy, and data management to support Rakuten Group's global ecosystem. Learn more at: https://global.rakuten.com/corp/

The Marketing Cloud Platform Department (MCPD) drives Rakuten's marketing product strategy, executes product development, and ensures successful implementation. We empower Rakuten's internal marketing teams by creating engaging, respectful, and cost-efficient marketing platforms that prioritize our customers. Leveraging the Rakuten Ecosystem, we offer comprehensive marketing solutions, including campaign management, multichannel communication, and personalization. As a team of over 150 experts across Japan, India, and Singapore, we pride ourselves on being a technology-driven organization that shares knowledge within the Rakuten Tech community.

As a Senior Site Reliability (DevOps) Engineer in MCPD, you will drive operational excellence by implementing best practices in observability, incident management, and automation. This role bridges engineering and operations, requiring both strong technical expertise and people management skills to build and maintain highly available systems that serve millions of Rakuten's customers globally.

Main Responsibilities:

  • Define and drive SRE strategy, including SLO/SLI frameworks, error budgets, and reliability targets aligned with business objectives and customer expectations

  • Establish and improve incident management processes, including on-call rotations, escalation procedures, and blameless post-mortem practices to minimize MTTR and prevent recurring issues

  • Collaborate with development teams to embed reliability practices into the software development lifecycle, advocating for design reviews, chaos engineering, and production readiness reviews

  • Design and implement comprehensive observability solutions (monitoring, logging, tracing, alerting) to provide actionable insights into system health and performance

  • Drive automation initiatives to reduce toil, improve deployment reliability, and enable self-service capabilities for engineering teams

  • Partner with Architecture and Platform teams to ensure infrastructure decisions support scalability, fault tolerance, and cost optimization goals

  • Manage capacity planning and performance optimization for critical marketing platforms handling high-volume campaign executions and real-time personalization

  • Report on reliability metrics, incident trends, and operational health to leadership, translating technical insights into business impact assessments

Required Qualifications:

  • 8+ years of experience in software engineering, DevOps, or site reliability engineering, with at least 3 years in a people management role

  • Proven track record of building and leading high-performing SRE or platform engineering teams in a distributed, multi-timezone environment

  • Deep expertise in cloud platforms (GCP preferred, AWS/Azure acceptable) including compute, networking, storage, and managed services

  • Strong knowledge of containerization and orchestration technologies (Kubernetes, Docker) and Infrastructure as Code (Terraform, Ansible)

  • Hands-on experience with observability tools and practices (Prometheus, Grafana, Datadog, ELK Stack, or similar) and defining meaningful SLOs/SLIs

  • Experience with CI/CD pipelines, deployment strategies (blue-green, canary), and release engineering best practices

  • Strong programming/scripting skills in languages such as Python, Go, or Java for automation and tooling development

  • Excellent communication skills with the ability to collaborate effectively with engineering, product, and business stakeholders

  • Strong incident management experience with a demonstrated ability to lead calmly and effectively in high-pressure situations

Rakuten is an equal opportunities employer and welcomes applications regardless of sex, marital status, ethnic origin, sexual orientation, religious belief, or age.