Sr. SRE RunOps Engineer

Location: Irving, Texas
Length: 6-month contract
Pay: $60/hr - $67/hr (DOE)

The Planet Group is seeking a highly skilled Site Reliability Engineer (SRE) to join our Production Support team. This role is responsible for ensuring the reliability, performance, and stability of our production systems across AWS, MongoDB, and related application services. The ideal candidate has strong operational instincts, deep troubleshooting skills, and a passion for building resilient systems.

Qualifications:

Must-have requirements:
- Experience with on-call rotations and 24/7 production environments.
- Education: Bachelor's/4-year degree.
- Certifications/licenses: AWS certification is a plus.
- Ability to work cross-functionally with teams across the organization to establish and achieve SLOs.
- 5+ years of experience in SRE, DevOps, Production Support, or similar operational roles.
- Strong hands-on experience with AWS services and cloud-native architectures.
- Proficiency with MongoDB administration and troubleshooting.
- Experience with New Relic or similar APM/observability platforms.
- Experience using additional tools such as Postman, Intune, Firebase, ServiceNow, and CloudWatch.
- Strong understanding of Linux systems, networking, and distributed systems.
- Solid scripting skills (Python, Bash, or similar).
- 5+ years of experience with monitoring and alarming across all environments, and familiarity with tools such as MongoDB Charts, New Relic, CloudWatch, and ServiceNow.
- Proven experience managing high-severity incidents and driving RCA processes.
- Familiarity with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, etc.).

Additional skills and other requirements:
- Experience with container orchestration (ECS, EKS, Kubernetes).
- Knowledge of message queues (Kafka, SQS, RabbitMQ).
- Exposure to microservices architectures.
- Certifications such as AWS Solutions Architect, AWS SysOps, or MongoDB DBA.
- Working experience with IoT devices and Microsoft Intune.

Responsibilities:

Production Support & Incident Management
- Serve as a primary responder for production incidents, ensuring rapid triage, mitigation, and resolution.
- Lead root cause analysis (RCA) and drive long-term corrective actions.
- Maintain and improve incident response processes, runbooks, and escalation paths.
- Collaborate with engineering, QA, and product teams to prevent recurrence of issues.

AWS Infrastructure Operations
- Support and optimize AWS services such as EC2, ECS/EKS, Lambda, S3, CloudWatch, IAM, RDS, and VPC networking.
- Monitor system health, performance, and capacity across cloud environments.
- Implement infrastructure best practices around reliability, scalability, and cost efficiency.
- Assist with deployments, environment configuration, and CI/CD pipelines.

Database & Storage Support
- Manage and troubleshoot MongoDB clusters, including performance tuning, replication, backups, and failover.
- Diagnose query performance issues and collaborate with developers on schema optimization.
- Ensure data integrity, availability, and recovery readiness.

Monitoring, Observability & Alerting
- Use New Relic, CloudWatch, and other observability tools to monitor application and infrastructure performance.
- Build dashboards, alerts, and telemetry that provide actionable insights.
- Continuously refine monitoring thresholds to reduce noise and improve signal quality.
