Apex Systems · 1 week ago
AWS SRE
Apex Systems is seeking an experienced AWS Cloud Site Reliability Engineer (SRE) to join a high-performing team supporting critical cloud infrastructure within a secure federal environment. The ideal candidate will have deep AWS expertise, strong Infrastructure-as-Code skills, and a passion for automation, observability, and continuous improvement.
Human Resources · Information Technology · Recruiting
Responsibilities
Design, implement, and manage Infrastructure-as-Code (IaC) using tools such as AWS CloudFormation, Terraform, or Helm
Automate deployment, scaling, and configuration of cloud resources
Develop and maintain CI/CD pipelines (AWS CI/CD, GitLab CI/CD, Jenkins)
Implement robust monitoring and alerting solutions using CloudWatch, Datadog, Prometheus, Grafana, Dynatrace, or similar tools
Analyze logs, metrics, and system performance to proactively resolve issues and optimize reliability
Support incident response, participate in on-call rotations, and conduct post-incident reviews
Ensure AWS environments meet security standards and compliance requirements
Coordinate release planning and communication between development, QA, and operations
Create and submit change records and participate in Technical Change Advisory/Review Boards as required
Continuously evaluate and improve release processes, tooling, and cloud infrastructure
Collaborate with QA teams to validate releases and support quality assurance practices
Qualifications
Required
Bachelor's degree and 5+ years of relevant experience, or 9 years of experience in lieu of a degree
Proven experience as a Site Reliability Engineer or similar cloud operations role
Expertise with AWS services and cloud architecture
Advanced programming/scripting skills in at least three of the following: Python, Ansible, Helm, Playwright, Bash, JavaScript, Terraform, Java
Strong understanding of DevOps principles and CI/CD pipelines
Hands-on experience with Terraform, CloudFormation, Helm, or similar IaC tools
Experience creating configuration standards and automating workflows with Ansible
Proficiency with GitLab, AWS CodeCommit, or SVN and modern branching strategies
Experience with containers and orchestration tools: ECS, EKS, Docker, Kubernetes
Familiarity with monitoring/logging: Datadog, CloudWatch, Prometheus, Grafana, Dynatrace
Understanding of Agile methodologies and release management practices
Strong verbal and written communication skills
Excellent problem-solving and troubleshooting abilities
Ability to collaborate across teams and manage competing priorities
Must be a U.S. Citizen
Must be able to obtain and maintain a 6C Public Trust clearance
No dual citizenship
Preferred
Relevant DevOps/SRE certifications
Existing High Risk Public Trust or Secret clearance
3+ years supporting highly available, mission-critical platforms
Experience managing distributed container platforms (capacity, provisioning, workload management)
Experience leading major incident response across multiple vendors using tools such as Datadog and ServiceNow
Benefits
Medical
Dental
Vision
Life
Disability
Other insurance plans
ESPP (employee stock purchase program)
401K program
HSA (Health Savings Account on the HDHP plan)
SupportLinc Employee Assistance Program (EAP) with up to 8 free counseling sessions
Corporate discount savings program
Certification prep
Library of technical and leadership courses/books/seminars
Certification discounts and other perks through associations including CompTIA and IIBA
Company
Apex Systems
Apex Systems, a division of On Assignment, provides organizations with IT staffing solutions to address gaps in their current workforce.
Funding
Current Stage: Late Stage