Sr. Manager, Site Reliability & Innovation, IT

Position Description
We are seeking a Senior Site Reliability Engineer who will be responsible for both build and shared services operations, including monitoring, site reliability engineering (SRE), and ensuring the stability, scalability, and performance of critical systems.
The ideal candidate is a strong technical problem-solver, capable of delivering end-to-end monitoring and reliability solutions and of diagnosing complex issues during critical incidents.


Key Areas of Responsibility

  • Own monitoring, Kubernetes platform reliability, and SRE operations to ensure highly reliable, available, and performant systems

  • Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, and Grafana

  • Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines

  • Troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configurations, patching, and maintenance, while determining appropriate monitoring requirements for system changes

  • Analyze logs, investigate issues, and perform fault finding to identify performance exceptions

  • Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability

  • Operate, maintain, and optimize Kubernetes environments, including cluster health, workload reliability, capacity planning, and platform observability

  • Continuously research and adopt modern monitoring and SRE tools and practices

Requirements

  • Bachelor’s degree or higher in Computer Science / Engineering

  • Around 8-10 years of experience within IT, preferably in site reliability engineering, production support, platform engineering, or investment banking environments

  • Strong experience configuring and maintaining monitoring and observability platforms, including:
    ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, Grafana, and Kibana

  • Experience with automation (e.g., Bash, Python, Ansible, CI/CD tools) is a must

  • Hands-on experience building and implementing Prometheus pipelines, including exporters, scraping configurations, relabelling, metric routing, and integrations with long-term storage (e.g., VictoriaMetrics)

  • Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch

  • Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems

  • Understanding of metrics, logging, alerting, dashboards, and observability pipelines

  • Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization

  • Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery / Business Continuity Planning activities

  • Experience managing GPU-enabled infrastructure for AI or machine learning platforms is preferred

  • Strong hands-on experience with Kubernetes, including cluster operations, workload orchestration, troubleshooting, scaling, and production support

  • Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems

  • Willingness to participate in on-call rotations, including after-hours and weekend support

  • Self-motivated and adaptable, able to prioritize, learn continuously, and manage multiple responsibilities effectively

  • Excellent command of English; proficiency in Chinese is an advantage
