Senior Software Engineer, Cloud Reliability

Zilliz is a fast-growing startup developing the industry’s leading vector database for enterprise-grade AI. Founded by the engineers behind Milvus, the world’s most popular open-source vector database, the company builds next-generation database technologies to help organizations quickly create AI applications. On a mission to democratize AI, Zilliz is committed to simplifying data management for AI applications and making vector databases accessible to every organization.

We're entering our next phase of 10x growth: more customers, larger datasets, and far higher expectations for reliability. You'll join a small, fast-moving Cloud Platform team that operates large-scale, multi-cloud, distributed database systems in production. This is a high-ownership role for engineers who want to move fast, build automation instead of toil, and take real responsibility for production stability.

What you will do:

Own the reliability, availability, and production stability of Zilliz Cloud as we scale through the next stage of growth

Debug complex production issues across Kubernetes, cloud infrastructure, networking, storage, and distributed database systems

Build automation and diagnostic tooling (log analysis, alert correlation, incident investigation, runbook automation, and remediation workflows) so problems get solved once, not repeatedly

Turn recurring incidents into reusable tools, automation, documentation, and product improvements

Improve observability across latency, availability, throughput, and resource efficiency

Partner with database and infrastructure engineers to make Zilliz Cloud more reliable, scalable, and automated

What we are looking for:

3+ years building or operating production cloud systems, infrastructure platforms, database systems, or large-scale online services

Bachelor's degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience

Strong hands-on experience with Kubernetes, Docker, and at least one major cloud platform (AWS, GCP, or Azure)

Solid understanding of distributed systems: availability, scalability, performance, failure recovery, and operational tradeoffs

Experience with distributed databases, storage systems, search systems, or large-scale online systems is a strong plus

Experience operating highly multi-tenant systems or large infrastructure fleets (thousands of nodes, clusters, tenants, or customer deployments) is especially valuable

Familiarity with modern cloud operations tooling such as Terraform, Helm, Argo CD, Prometheus, Grafana, and CI/CD systems

Strong bias for action and the drive to thrive in a fast-paced, rapidly scaling environment

How we operate:

High ownership: You own production reliability end-to-end. The whole system, not a slice of it. High autonomy, high trust, minimal process.

Fast and focused: We ship often and keep a high bar. This team suits engineers who want velocity and a steep growth curve over red tape.

Globally distributed: We work closely with our core engineering teams across APAC. Expect occasional early-morning or evening syncs; in exchange, the on-call setup is designed around timezone coverage, not overnight pages.

Zilliz is an Equal Opportunity Employer and welcomes people from all backgrounds, experiences, abilities, and perspectives. All qualified applicants will receive consideration for employment regardless of race, color, national origin, religion, sexual orientation, gender, gender identity, age, physical disability, or length of time spent unemployed.
