Principal Platform Engineer (Site Reliability Engineer)
Clarity Innovations is a trusted national security partner, dedicated to safeguarding our nation’s interests and delivering innovative solutions. The Principal Platform Engineer (Site Reliability Engineer) will oversee daily operations of the classified cloud development environment, ensuring system security and availability while managing incident response and performance reporting.
Responsibilities
Carry out day-to-day operations of the classified Network Operations Center (NOC), ensuring adherence to service level agreements and system uptime requirements
Monitor and support cloud-based systems, networks, and containerized applications in Kubernetes clusters
Coordinate incident response, troubleshooting, and escalation procedures
Ensure timely detection, resolution, and documentation of service-impacting events
When the NOC lead is absent, act as the primary point of contact for cloud system alerts, outages, and classified network incidents, and communicate status to stakeholders and leadership
Ensure 24/7 observability of network, platform, and container-level components using tools such as Prometheus, Grafana, Fluentd, and the Elastic Stack
Draft technical guidance for NOC staff and collaborate with engineering, cybersecurity, and cloud teams
Maintain situational awareness of the system through dashboards, logs, and proactive monitoring tools
Develop and maintain standard operating procedures, incident response plans, runbooks, and shift logs
Assist the NOC lead in conducting daily stand-ups, shift handovers, and weekly ops reviews
Generate operational metrics and performance reports
Ensure compliance with federal security policies and contribute to continuous accreditation of the cloud system under the Risk Management Framework (RMF)
Conduct readiness drills and after-action reviews, and contribute to lessons-learned activities
Qualifications
Required
Must be able to obtain and maintain a TS/SCI security clearance (note: only US citizens are eligible for security clearances)
Expertise in cloud infrastructure (AWS GovCloud, Azure Government, or C2S/C2E/JWCC), virtualization, and hybrid environments
Understanding of secure networking, load balancers, DNS in cloud-native architectures, and inter-cluster communication
Operational experience with Kubernetes, containerized workloads, and supporting technologies (Docker, Helm, Fluentd, Kustomize)
Strong understanding of monitoring tools (e.g., Prometheus, Grafana, ELK Stack) and ticketing systems (e.g., osTicket, Jira)
Familiarity with GitOps workflows (e.g., Flux) and infrastructure as code (e.g., Terraform)
Familiarity with DoD/IC cybersecurity compliance standards, ATO processes, and classified system governance
Excellent communication skills and the ability to clearly brief complex operational topics to leadership and mission partners
Preferred
Active US TS/SCI security clearance with CI polygraph or higher
5+ years of experience in IT operations or network/system administration