Senior Site Reliability Engineer (SRE)
Responsibilities
We are looking for an experienced Site Reliability Engineer who is passionate about building reliable, scalable, and automated infrastructure to support mission-critical platform services.
What You'll Do
- Ensure the reliability, availability, and operational excellence of critical platform services and infrastructure.
- Design, deploy, maintain, and optimize cloud-native infrastructure based on Kubernetes and Docker.
- Build and improve observability systems including monitoring, alerting, logging, and distributed tracing.
- Participate in architecture reviews and provide reliability-focused recommendations for high-concurrency, low-latency distributed systems.
- Develop and maintain CI/CD pipelines to improve engineering productivity and deployment quality.
- Lead capacity planning, performance tuning, disaster recovery planning, and resilience engineering initiatives.
- Drive Infrastructure as Code (IaC) adoption and automation to reduce operational overhead and human error.
- Define and continuously improve service level indicators, objectives, and agreements (SLIs, SLOs, and SLAs) for critical services.
- Participate in incident response, root cause analysis (RCA), and postmortem reviews for production issues.
- Collaborate closely with engineering, QA, product, and security teams to continuously improve platform reliability, scalability, and efficiency.
- Leverage AI-powered tools (e.g., Cursor, Claude Code, ChatGPT) to enhance operational automation, troubleshooting, and productivity.
Requirements
Must-Have Skills
- Bachelor's degree or above in Computer Science or a related field.
- 5+ years of experience in SRE, DevOps, infrastructure engineering, or related roles.
- Strong knowledge of Linux systems and performance optimization.
- Proficiency in at least one programming language such as Go, Python, Java, or Rust.
- Hands-on experience with Kubernetes, Docker, and cloud-native ecosystems.
- Experience with CI/CD tools such as GitHub Actions, GitLab CI, or Jenkins.
- Solid understanding of networking fundamentals including TCP/IP, HTTP, and WebSocket.
- Strong troubleshooting, performance analysis, and capacity planning skills.
- Experience building automation tools and operational platforms.
- Demonstrated proficiency in AI-assisted development and operations tools such as Cursor and Claude Code.
Technical Stack
Container Platforms
- Kubernetes
- Docker
Observability
- Prometheus
- Grafana
- Loki
- ELK Stack (Elasticsearch, Logstash, Kibana)
- OpenTelemetry
Messaging and Caching Systems
- Kafka
- RocketMQ
- Redis
Databases
- MySQL
- PostgreSQL
- ClickHouse
- Time-Series Databases
Infrastructure Automation
- Terraform
- Ansible
- Helm
Cloud Platforms
- AWS
- GCP
- Alibaba Cloud
- Tencent Cloud
CI/CD
- GitHub Actions
- GitLab CI
- Jenkins
Preferred Experience
- Experience in large-scale internet, SaaS, fintech, e-commerce, or mission-critical platform environments.
- Experience supporting high-concurrency distributed systems.
- Strong understanding of distributed system architecture, scalability, and reliability engineering principles.
- Experience operating multi-region or multi-datacenter infrastructure.
Nice to Have
- Experience managing large-scale Kubernetes clusters (1,000+ nodes).
- Hands-on experience with service mesh technologies (e.g., Istio) and OpenTelemetry.
- Expertise in Kafka, ClickHouse, and large-scale distributed system optimization.
- Experience implementing chaos engineering practices.
- Strong background in incident management and large-scale production recovery.
- Experience with AIOps, intelligent alerting, and automated fault diagnosis systems.