Cloud Platform Engineer (Agentic AI)

Project description

The project is for one of the world's best-known science and technology companies in the pharmaceutical industry. It supports initiatives in AWS, AI, and data engineering, with plans to launch more than 20 additional initiatives. We are seeking a highly skilled Cloud Engineer to lead the infrastructure design, deployment, and operations of the AI agent orchestration platform on AWS. This role is responsible for building and managing a Kubernetes-native, enterprise-grade platform that supports scalable AI agent workloads across development, QA, and production environments.

Responsibilities

1. AWS Infrastructure & Architecture
  • Design, provision, and manage AWS infrastructure using Terraform, aligned with the AWS Well-Architected Framework
  • Core services include:
      - Amazon EKS
      - VPC
      - IAM
      - Application Load Balancer (ALB)
      - Route 53
      - AWS Certificate Manager (ACM)

2. Kubernetes (EKS) Platform Operations
  • Own and operate EKS clusters end-to-end:
      - Managed node group lifecycle management
      - Karpenter-based autoscaling
      - Cluster add-on lifecycle upgrades
      - IRSA (IAM Roles for Service Accounts) configuration
      - Multi-AZ high availability and resilience

3. CI/CD & GitOps
  • Build and maintain automated deployment pipelines using:
      - GitHub Actions
      - ArgoCD (GitOps)
  • Enable multi-environment deployments:
      - Dev → QA → Production
  • Implement release strategies:
      - Blue/Green deployments
      - Canary releases

4. Security & Compliance
  • Integrate AWS-native security and governance controls:
      - AWS WAF
      - GuardDuty
      - Security Hub
      - KMS (encryption)
      - Secrets Manager
      - External Secrets Operator
  • Enforce policy controls using:
      - OPA / Kyverno (admission controllers)

5. Observability & Monitoring
  • Implement and manage observability stack:
      - Amazon Managed Prometheus
      - Amazon Managed Grafana
      - CloudWatch Container Insights
      - AWS X-Ray (distributed tracing)

6. AI/ML Integration
  • Leverage AWS AI/ML services to support agent orchestration:
      - Amazon Bedrock (model inference, agent APIs)
      - SageMaker (model hosting, endpoints)
      - Comprehend (NLP, PII detection)

7. Cost Optimization (FinOps)
  • Implement cost-efficient architecture practices:
      - Spot Instances
      - Savings Plans
      - Karpenter bin-packing strategies
      - Scheduled scale-to-zero for non-production environments

8. Platform & Engineering Collaboration
  • Partner with platform and ML teams to:
      - Onboard new AI agent workloads
      - Integrate MCP servers and execution frameworks
      - Support extensibility of the agent ecosystem

SKILLS

Must have

Experience & Certifications
  • 4+ years of hands-on AWS experience
  • AWS Certifications:
      - Required: AWS Solutions Architect (Associate or Professional)
      - Preferred: DevOps Engineer, Security Specialty

Kubernetes & EKS Expertise
  • Strong hands-on experience with:
      - EKS cluster provisioning and operations
      - Managed node groups and Karpenter
      - Helm chart management
      - Kubernetes RBAC and network policies

Infrastructure as Code (Terraform)
  • Advanced Terraform capabilities:
      - Modular design
      - Remote state management (S3 + DynamoDB)
      - Multi-environment configuration
      - Security scanning (Checkov, tfsec)

AWS Services Proficiency
  • Deep knowledge of:
      - EKS, ECR, ALB, Route 53, ACM
      - IAM, KMS, Secrets Manager
      - IAM Identity Center
      - CloudTrail, AWS Config
      - GuardDuty, Security Hub, AWS WAF

AI/ML Exposure
  • Practical experience with:
      - Amazon Bedrock (model invocation, agent APIs)
      - SageMaker (model deployment and endpoints)
      - Comprehend (NLP and PII detection)

DevOps & Identity
  • Experience with:
      - GitOps tools (ArgoCD or Flux)
      - CI/CD pipelines for container workloads
      - OIDC federation:
          - GitHub Actions → AWS
          - EKS OIDC provider integration

Observability & Debugging
  • Familiarity with:
      - Prometheus, Grafana
      - OpenTelemetry
      - AWS X-Ray
      - CloudWatch Logs Insights

Kubernetes Security
  • Strong understanding of:
      - Pod Security Standards
      - Network Policies
      - Admission webhooks
      - Service account least-privilege principles

Nice to have

  • Experience with AI agent frameworks:
      - LangChain, Claude Agent SDK, or similar
  • Knowledge of emerging protocols:
      - A2A (Agent-to-Agent)
      - MCP (Model Context Protocol)
  • Familiarity with:
      - Amazon Bedrock Agents, Knowledge Bases, Guardrails
  • Chaos engineering exposure:
      - AWS Fault Injection Service (FIS)
  • Multi-tenant platform design:
      - Namespace isolation
      - Self-service provisioning
  • Programming/debugging skills:
      - Python, Go, or Node.js
  • FinOps experience:
      - AWS Cost Explorer
      - Compute Optimizer
      - Tagging governance
      - Savings Plan management