Company

Computerworld · 8 hours ago

Site Reliability Engineer (EMO Engineer)

United States

Full-time

Remote

Senior Level

7+ years exp

Information Technology · News

Senior Management

No H-1B

U.S. Citizen Only

Security Clearance Required

Responsibilities

Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.

Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.

Configuration and maintenance of Datadog integrations, including Webhooks, Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, and Palo Alto.

Configuration of telemetry logs through Cribl Stream including syslog, SNMP traps, JSON, AWS CloudWatch, AWS S3.

Development of custom data/telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.

Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, Microsoft 365.

Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker/Kubernetes containers.

Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM).

Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.

Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies/outliers, including monitoring of serverless workloads such as AWS Lambda functions.

Development of custom dashboards with a focus on reliability and performance of services.

Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.

Configuration of incident management and service-based analytics including integration with JIRA and/or ServiceNow.

Maintain code repositories and versioning of any scripting or automation.

Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.

Provide consultative services to support the application integrations that need to be observed/monitored, such as Hadoop HDFS, Hadoop MapReduce, and Hive.

Identify opportunities for monitoring improvement, including incorporating APM and RUM.

Update documentation and user guides as needed.

Collaborate with cross-functional teams.

Configure monitors & alerts to integrate with Incident Management tools.

Qualifications

Datadog · AWS · Cribl · Infrastructure & Monitoring as Code · Incident Response tools · Data onboarding · Agile · Scripting · Monitoring · Technical Communication

Required

Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification.

7+ years of experience in information technology with hands-on technical/engineering roles including:

Must have 2+ years of experience working with Datadog, including hands-on experience both administering and supporting a Datadog migration or implementation.

3+ years of experience with AWS.

3+ years of data onboarding experience within a large-scale enterprise environment.

Must have experience with Datadog, including building dashboards, reports, and alerts to meet customer requirements.

Experience with Infrastructure & Monitoring as Code tools.

Experience configuring and supporting additional Datadog modules.

Solid understanding of networking and device configuration.

Experience with migrating from other monitoring platforms to Datadog.

Experience with Incident Response tools.

Knowledge of Agile and continuous integration practices.

Collaborative mindset that thrives in fast-paced environments.

Excellent verbal and written communication skills, including the ability to author and present materials ranging from detailed technical specifications to high-level concepts for senior audiences.

Public Trust security clearance.

Must be a U.S. citizen.