Site Reliability Engineer - Insurance Platform (Remote, China)

BJAK’s automation systems power end-to-end insurance journeys across quote generation, policy issuance, renewals, endorsements, claims, payments and insurer integrations. These systems are business-critical: their uptime, reliability and performance directly affect customers and operations.

We’re looking for a Site Reliability Engineer based in China to ensure the stability, scalability and resilience of BJAK’s insurance automation platform. The role bridges software engineering and infrastructure operations to keep systems running reliably at scale.

This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to operate and improve production systems.

The Mission

Ensure BJAK’s insurance automation platform is reliable, scalable and observable by building strong operational systems, improving incident response and driving engineering practices that prevent failures before they happen.

What You’ll Own

  • Own reliability and operational stability of BJAK’s production systems.

  • Design and improve monitoring, alerting, logging and observability across services.

  • Lead incident response, troubleshooting and structured root cause analysis.

  • Improve system resilience through redundancy, failover and recovery strategies.

  • Work with engineers to design systems that are reliable, scalable and operable in production.

  • Improve deployment safety through CI/CD pipelines, release strategies and automation.

  • Reduce recurring incidents by identifying root causes and driving long-term fixes.

  • Manage and optimize cloud infrastructure supporting business-critical workflows.

  • Strengthen operational practices including on-call processes, incident playbooks and SLAs.

  • Continuously improve system uptime, performance and operational maturity.

What We're Looking For

  • Experience in Site Reliability Engineering, DevOps, platform engineering or infrastructure roles.

  • Strong understanding of distributed systems, cloud infrastructure and production operations.

  • Experience with monitoring, alerting and observability tools.

  • Strong troubleshooting skills for production incidents and system failures.

  • Ability to design for reliability, scalability and fault tolerance.

  • Experience working with CI/CD pipelines and deployment automation.

  • Strong understanding of system performance, capacity planning and risk management.

  • Hands-on ownership mindset during incidents and operational issues.

  • Calm, structured and disciplined approach to production environments.

  • Strong collaboration with engineering teams in fast-paced environments.

Bonus Points

  • Experience with AWS, GCP, Azure or similar cloud platforms.

  • Experience with Kubernetes, Docker or container orchestration systems.

  • Experience with infrastructure-as-code tools (Terraform, Ansible, etc.).

  • Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).

  • Experience with incident management tools and on-call systems.

  • Experience with zero-downtime deployments and progressive delivery strategies.

  • Experience working in fintech, insurance or regulated industries.

  • Experience building reliability frameworks or SRE best practices in scaling systems.

  • Contributions to platform reliability or infrastructure resilience initiatives.

The Kind of Builder We Want

  • Calm and structured under pressure, especially during production incidents.

  • Hands-on engineer who understands both code and infrastructure deeply.

  • Thinks in failure modes, system risks and recovery strategies.

  • Strong focus on reliability, observability and long-term system health.

  • Proactive in preventing incidents, not just responding to them.

  • Careful and deliberate when making production changes.

  • Builds systems engineers can trust in high-pressure environments.

This Role Is Not For

  • Engineers who only react to incidents instead of preventing them.

  • People who are careless with production systems or access control.

  • Individuals who ignore monitoring, alerting or operational discipline.

  • Engineers who make risky changes without proper analysis or safeguards.

  • Candidates who cannot stay calm during incidents or outages.

Success in This Role

You'll be successful if you can:

  • Improve platform uptime, reliability and operational stability.

  • Reduce production incidents and recurring system failures.

  • Strengthen observability, monitoring and incident response maturity.

  • Enable engineers to deploy safely with minimal operational risk.

  • Improve overall resilience of BJAK’s insurance automation platform.

Why Join BJAK

  • Build Reliable Insurance Systems – Support mission-critical automation at scale.

  • High-Impact Engineering – Solve real-world reliability and distributed systems challenges.

  • Global Engineering Team – Work with experienced engineers across multiple countries.

  • Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.

  • International Exposure – Build systems used across Southeast Asian markets.

  • Learning & Development Budget – Support continuous technical growth and certifications.

  • High Ownership Environment – Strong autonomy over reliability and operational design.

  • Modern Engineering Culture – Focus on stability, observability and engineering excellence.

  • Competitive Compensation – Attractive salary package based on experience and impact.

Interview Process

We assess reliability engineering depth, incident handling capability and production systems thinking. The process usually includes an application review, two interviews and a technical scenario or systems discussion.