
Drawbridge Digital · 18 hours ago

Infrastructure and Observability Engineer

Drawbridge Digital is a veteran-owned company seeking an experienced Infrastructure and Observability Engineer to design and implement centralized monitoring and alerting systems. The role involves supporting a 24/7 production environment and contributing to long-term infrastructure planning and operational improvements.

Advertising, Asset Management, Content, IT Management

Responsibilities

Architect and deploy centralized monitoring and log aggregation solutions across cloud and on-premises infrastructure
Design and implement alerting systems for critical infrastructure events, ensuring the right people are notified at the right time
Support day-to-day operations of infrastructure serving a 24/7 production environment, including troubleshooting, maintenance, and capacity management
Establish observability standards, dashboards, and runbooks to support operations and incident response
Analyze monitoring data to identify performance bottlenecks, inefficiencies, and opportunities for optimization
Contribute to long-term infrastructure planning, including capacity forecasting, technology roadmaps, and operational improvements
Create and maintain technical documentation, including system architecture diagrams, standard operating procedures, and emergency response playbooks
Partner with infrastructure, operations, and engineering teams to implement improvements based on observability insights
Drive continuous improvement initiatives that enhance system reliability, reduce costs, and improve performance
Evaluate and integrate tooling that fits our hybrid environment needs
Participate in an on-call rotation, including occasional overnight shifts, to respond to critical infrastructure incidents

Qualifications

Monitoring platforms, Infrastructure design, Linux administration, Networking fundamentals, Technical writing, Collaboration skills, Communication skills, Independent work

Required

3+ years of experience in infrastructure, site reliability, or systems engineering roles
Hands-on experience with monitoring and observability platforms (e.g., Prometheus, Grafana, Datadog, ELK stack, Splunk, or similar)
Strong understanding of networking fundamentals and server infrastructure in both cloud and physical datacenter environments
Experience with Linux server administration, Ceph storage clusters, and highly available database clusters
Experience building alerting frameworks that balance signal quality with noise reduction
Demonstrated ability to translate monitoring insights into actionable infrastructure improvements
Strong technical writing skills—you'll be documenting systems, procedures, and emergency protocols
Strong collaboration and communication skills—you'll be working across teams to drive change
Comfortable working independently in a remote environment while collaborating effectively with distributed teams
Must reside in the greater NJ/NYC metropolitan area
Ability to commute to New Jersey 1–2 days per month for team meetings
Able to occasionally travel to customer locations to support on-site projects

Benefits

Health insurance
401(k)
On-call compensation

Company

Drawbridge Digital

Drawbridge Digital builds and manages full-spectrum content services, from complex workflows to digital asset management and archive systems.

Funding

Current Stage: Early Stage

Leadership Team

Jennifer Pottheiser
Founding Partner
Company data provided by Crunchbase