Resolve Tech Solutions · 8 hours ago

Senior Machine Learning Engineer

DFW Metroplex

Full-time

Onsite

Senior Level

5+ years exp

Resolve Tech Solutions is hiring a Senior Machine Learning Engineer who will design and deliver production-grade machine learning capabilities for an observability and operations intelligence platform. The role involves working closely with DevOps, data engineering, and product teams to develop solutions for alert noise reduction, anomaly detection, and incident root cause assistance.

Artificial Intelligence (AI), Cloud Computing, Cloud Management, Consulting, Data Center Automation, Enterprise Resource Planning (ERP), Information Technology, Software

Growth Opportunities

H1B Sponsor Likely

Hiring Manager

Rohit George

Responsibilities

Own the machine learning design for operations and reliability use cases including alert noise reduction, alert grouping and clustering, anomaly detection, incident root cause assistance, and cost or usage insights

Translate product requirements and reliability targets into clear machine learning problems with well-defined metrics such as false positive rate, false negative rate, alert reduction goals, and impact on incident handling time

Select appropriate model families for each use case, including supervised and unsupervised classical models, deep learning models, and language-model-based approaches where suitable

Work with data engineering to define and refine pipelines that ingest monitoring alerts, events, logs, metrics, and incident or ticket data from operations tools

Design features that capture temporal patterns, service and infrastructure relationships, and business criticality of systems and alerts

Implement data validation rules and data quality checks and collaborate on detection and handling of data drift and schema evolution

Establish and maintain a modern machine learning operations workflow including experiment tracking, model registry, automated training, and automated deployment

Build production-ready inference services such as synchronous application programming interfaces, batch scoring jobs, and streaming-based scoring that integrate with backend services and user interfaces

Collaborate with on-site DevOps on deployment patterns in secure environments, including staging, canary releases, controlled rollouts, and rollback strategies

Define retraining strategies and schedules for models whose performance depends on changing alert distributions and operational patterns

Design offline and online evaluation suites using historical alert and incident data including realistic scenarios for alert suppression and recommendation quality

Build dashboards that make model behaviour and impact transparent to product owners, operations teams, and technical leadership

Monitor model performance and drift in production and drive corrective actions when degradation occurs

Incorporate feedback from operators and subject matter experts into continual improvement cycles and, where suitable, into active learning workflows

Work within the constraints of secure and regulated deployments including strict access control, logging, and change management practices

Ensure that experimentation and training environments that use sensitive or regulated data follow required security and compliance guidelines, including expectations associated with United States federal government workloads and FedRAMP-style environments

Document model inputs, outputs, assumptions, and controls so that the design can be reviewed by security, compliance, and audit teams

Coordinate shared machine learning components across multiple products such as embedding services, semantic search services, and evaluation frameworks

Participate in architecture and design discussions to promote reuse of patterns and components across the AI and data platform

Provide mentoring and technical guidance to junior engineers and data scientists where needed

Work primarily from the Dallas area office in close coordination with local engineering, product, and leadership teams

Participate in in-person design sessions, whiteboard reviews, and incident reviews that require physical presence and real-time collaboration

Help build a strong on-site engineering culture through knowledge sharing, pair design, and support for local team members

Qualifications

Machine Learning Engineering, Python, Deep Learning Frameworks, Cloud Services, MLOps Practices, Experiment Design, Data Engineering, Monitoring, Evaluation, Communication Skills, Collaboration

Required

Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience

At least five years of hands-on machine learning engineering experience with a strong record of shipping models into production systems

Strong programming skills in Python with fluency in libraries such as NumPy, pandas, and scikit-learn, and at least one deep learning framework such as PyTorch or TensorFlow

Proven experience building and operating production machine learning systems including application programming interfaces, batch jobs, or streaming jobs and partnering with DevOps teams

Solid understanding of the full machine learning lifecycle including data preparation, feature engineering, model training, evaluation, deployment, and ongoing monitoring

Experience with at least one major cloud provider; Amazon Web Services is preferred, including familiarity with services such as managed container platforms, serverless functions, object storage, and managed machine learning platforms

Experience with machine learning operations practices and tools such as experiment tracking, model registry, automated training pipelines, and automated deployment pipelines

Strong skills in experiment design and interpretation, including backtesting, A/B-style testing, and detailed error analysis

Excellent communication skills with the ability to explain model behaviour and trade-offs to engineers, product managers, and operations stakeholders

Ability and willingness to work full-time on-site in the Dallas-Fort Worth metro area

Preferred

Experience with observability and operations domains such as monitoring alerts, logs, metrics, traces, and incident ticket systems

Experience in environments that support United States Federal government or other highly regulated workloads with an understanding of security and compliance constraints

Background in large language models and retrieval-augmented search or summarization for operational or knowledge management use cases

Familiarity with vector databases and semantic search platforms and experience building embedding-based retrieval systems

Experience delivering anomaly detection, clustering, and time series modelling solutions at meaningful scale

Prior experience in a product engineering setting where the engineer owns design, implementation, and operational aspects of machine learning services

Company

Resolve Tech Solutions

Resolve Tech Solutions (RTS) is a technology services company focused on delivering SAP as a managed cloud service in the public cloud.

Founded in 2010

Addison, Texas, USA

501-1000 employees

http://www.resolvetech.com

H1B Sponsorship

Resolve Tech Solutions has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data from the US Department of Labor).

Trends of Total Sponsorships

2025 (36)

2024 (36)

2023 (52)

2022 (18)

2021 (15)

2020 (47)