Senior Site Reliability Engineer (SRE)

Responsibilities

We are looking for an experienced Site Reliability Engineer who is passionate about building reliable, scalable, and automated infrastructure to support mission-critical platform services.

What You'll Do

Ensure the reliability, availability, and operational excellence of critical platform services and infrastructure.
Design, deploy, maintain, and optimize cloud-native infrastructure based on Kubernetes and Docker.
Build and improve observability systems including monitoring, alerting, logging, and distributed tracing.
Participate in architecture reviews and provide reliability-focused recommendations for high-concurrency, low-latency distributed systems.
Develop and maintain CI/CD pipelines to improve engineering productivity and deployment quality.
Lead capacity planning, performance tuning, disaster recovery planning, and resilience engineering initiatives.
Drive Infrastructure as Code (IaC) adoption and automation to reduce operational overhead and human error.
Define and continuously improve SLI/SLO/SLA frameworks across critical services.
Participate in incident response, root cause analysis (RCA), and postmortem reviews for production issues.
Collaborate closely with engineering, QA, product, and security teams to continuously improve platform reliability, scalability, and efficiency.
Leverage AI-powered tools (e.g., Cursor, Claude Code, ChatGPT) to enhance operational automation, troubleshooting, and productivity.

Requirements

Must-Have Skills

Bachelor's degree or higher in Computer Science or a related field.
5+ years of experience in SRE, DevOps, Infrastructure Engineering, or related roles.
Strong knowledge of Linux systems and performance optimization.
Proficiency in at least one programming language such as Go, Python, Java, or Rust.
Hands-on experience with Kubernetes, Docker, and cloud-native ecosystems.
Experience with CI/CD tools such as GitHub Actions, GitLab CI, or Jenkins.
Solid understanding of networking fundamentals including TCP/IP, HTTP, and WebSocket.
Strong troubleshooting, performance analysis, and capacity planning skills.
Experience building automation tools and operational platforms.
Demonstrated proficiency in AI-assisted development and operations tools such as Cursor and Claude Code.

Technical Stack

Container Platforms

Kubernetes
Docker

Observability

Prometheus
Grafana
Loki
ELK Stack (Elasticsearch, Logstash, Kibana)
OpenTelemetry

Messaging and Caching

Kafka
RocketMQ
Redis

Databases

MySQL
PostgreSQL
ClickHouse
Time-Series Databases

Infrastructure Automation

Terraform
Ansible
Helm

Cloud Platforms

AWS
GCP
Alibaba Cloud
Tencent Cloud

CI/CD

GitHub Actions
GitLab CI
Jenkins

Preferred Experience

Experience in large-scale internet, SaaS, fintech, e-commerce, or mission-critical platform environments.
Experience supporting high-concurrency distributed systems.
Strong understanding of distributed system architecture, scalability, and reliability engineering principles.
Experience operating multi-region or multi-datacenter infrastructure.

Nice to Have

Experience managing large-scale Kubernetes clusters (1,000+ nodes).
Hands-on experience with service mesh technologies (e.g., Istio) and OpenTelemetry.
Expertise in Kafka, ClickHouse, and large-scale distributed system optimization.
Experience implementing chaos engineering practices.
Strong background in incident management and large-scale production recovery.
Experience with AIOps, intelligent alerting, and automated fault diagnosis systems.