Zilliz is a fast-growing startup developing the industry’s leading vector database for enterprise-grade AI. Founded by the engineers behind Milvus, the world’s most popular open-source vector database, the company builds next-generation database technologies to help organizations quickly create AI applications. On a mission to democratize AI, Zilliz is committed to simplifying data management for AI applications and making vector databases accessible to every organization.
We're entering our next phase of 10x growth: more customers, larger datasets, and far higher expectations for reliability. You'll join a small, fast-moving Cloud Platform team that operates large-scale, multi-cloud, distributed database systems in production. This is a high-ownership role for engineers who want to move fast, build automation instead of toil, and take real responsibility for production stability.
What you will do:
Own the reliability, availability, and production stability of Zilliz Cloud as we scale through the next stage of growth
Debug complex production issues across Kubernetes, cloud infrastructure, networking, storage, and distributed database systems
Build automation and diagnostic tooling (log analysis, alert correlation, incident investigation, runbook automation, and remediation workflows) so problems get solved once, not repeatedly
Turn recurring incidents into reusable tools, automation, documentation, and product improvements
Improve observability across latency, availability, throughput, and resource efficiency
Partner with database and infrastructure engineers to make Zilliz Cloud more reliable, scalable, and automated
What we are looking for:
3+ years building or operating production cloud systems, infrastructure platforms, database systems, or large-scale online services
Bachelor's degree in Computer Science, Software Engineering, or a related field, or equivalent practical experience
Strong hands-on experience with Kubernetes, Docker, and at least one major cloud platform (AWS, GCP, or Azure)
Solid understanding of distributed systems: availability, scalability, performance, failure recovery, and operational tradeoffs
Experience with distributed databases, storage systems, search systems, or large-scale online systems is a strong plus
Experience operating highly multi-tenant systems or large infrastructure fleets (thousands of nodes, clusters, tenants, or customer deployments) is especially valuable
Familiarity with modern cloud operations tooling such as Terraform, Helm, Argo CD, Prometheus, Grafana, and CI/CD systems
Strong bias for action, and the drive to thrive in a fast-paced, rapidly scaling environment
How we operate:
High ownership: You own production reliability end-to-end, the whole system rather than a slice of it. High autonomy, high trust, minimal process.
Fast and focused: We ship often and keep a high bar. This team suits engineers who want velocity and a steep growth curve over red tape.
Globally distributed: We work closely with our core engineering teams across APAC. Expect occasional early-morning or evening syncs; in exchange, the on-call setup is designed around timezone coverage, not overnight pages.
Zilliz is an Equal Opportunity Employer and welcomes people from all backgrounds, experiences, abilities, and perspectives. All qualified applicants will receive consideration for employment regardless of race, color, national origin, religion, sexual orientation, gender, gender identity, age, physical disability, or length of time spent unemployed.