INSPYR Solutions · 18 hours ago
Site Reliability Engineering Architect (SRE Architect)
INSPYR Solutions is a national expert in delivering flexible technology and talent solutions, and they are seeking a Site Reliability Engineering Architect. This role is responsible for designing automation-first, AI-augmented reliability platforms for large-scale cloud environments, ensuring systems can operate with minimal human intervention while improving resilience and delivery velocity.
Information Technology · Professional Services · Staffing Agency
Responsibilities
Design reliability architectures that prioritize automation and intelligent decision-making over manual processes. Define patterns for fault isolation, graceful degradation, and recovery that assume automated and AI-assisted execution. Ensure reliability, security, and governance requirements are embedded directly into operational systems and workflows. Establish architectural standards that reduce complexity, human dependency, and operational risk.
Architect event-driven automation platforms that span detection, decisioning, and execution. Design and implement workflow orchestration systems capable of handling both low-risk autonomous actions and higher-risk human-approved operations. Replace ticket-driven and static runbook processes with executable, testable automation. Standardize automation patterns across incident response, change execution, and platform operations. Ensure automation systems are resilient, observable, and auditable.
Design and own internal AI-driven operational platforms that act as a centralized interface for reliability and automation workflows. Build systems that allow intelligent components to retrieve operational context, reason over signals, and invoke controlled actions across infrastructure and services. Establish architectures for agent coordination, capability discovery, and safe execution in production environments. Define guardrails, approval paths, observability, and auditability for AI-initiated actions. Integrate AI-driven decisioning directly into operational workflows rather than treating it as an external enhancement.
Architect observability systems that feed automation and intelligent decision-making rather than static dashboards. Design signal pipelines that correlate metrics, logs, traces, and events into actionable context. Reduce alert fatigue through context-aware, noise-resistant detection and prioritization. Ensure every operational signal has a defined automated or AI-assisted response path. Drive continuous improvement through trend analysis and systemic remediation.
Define governance-backed use of enterprise low-code automation platforms to accelerate operational workflows. Enable secure, scalable automation for approvals, communications, enrichment, and orchestration while preventing platform sprawl. Establish clear boundaries between low-code automation and code-first systems. Integrate enterprise automation tools with cloud-native automation and AI-driven operational platforms.
Serve as the architectural authority for reliability, automation, and AI-driven operations. Mentor senior engineers and raise organizational maturity in automation and intelligent systems. Partner with engineering, security, and compliance teams to deliver safe, scalable operational platforms. Own reference architectures, operational standards, and long-term technical direction. Challenge designs that increase operational risk, toil, or manual dependency.
Qualifications
Required
5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering supporting complex distributed systems
Proven experience designing and operating automation-heavy or autonomous operational platforms
Strong programming and automation skills using modern languages and frameworks
Hands-on experience with workflow orchestration and event-driven systems
Practical experience integrating AI or intelligent decision systems into production operations
Deep understanding of failure modes, blast radius management, and risk-aware automation
Preferred
Experience designing or implementing agent-based or AI-assisted operational systems
Familiarity with modern AI platforms and model integration for operational use cases
Experience with control-plane architectures for automation and intelligent systems
Enterprise automation and governance experience
Knowledge of cost-aware reliability design, FinOps principles, and zero-trust security models
Relevant cloud or platform certifications
Company
INSPYR Solutions
INSPYR Solutions is an information technology staffing services provider.
Funding
Current Stage
Late Stage
Leadership Team
Company data provided by crunchbase