Senior Machine Learning Engineer jobs in United States

Resolve Tech Solutions · 8 hours ago

Senior Machine Learning Engineer

Resolve Tech Solutions is hiring a Senior Machine Learning Engineer who will design and deliver production-grade machine learning capabilities for an observability and operations intelligence platform. The role involves working closely with DevOps, data engineering, and product teams to develop solutions for alert noise reduction, anomaly detection, and incident root cause assistance.

Artificial Intelligence (AI), Cloud Computing, Cloud Management, Consulting, Data Center Automation, Enterprise Resource Planning (ERP), Information Technology, Software
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Rohit George

Responsibilities

Own the machine learning design for operations and reliability use cases including alert noise reduction, alert grouping and clustering, anomaly detection, incident root cause assistance, and cost or usage insights
Translate product requirements and reliability targets into clear machine learning problems with well-defined metrics such as false positive rate, false negative rate, alert reduction goals, and impact on incident handling time
Select suitable model families for each use case, including supervised and unsupervised classical models, deep learning models, and language-model-based approaches where appropriate
Work with data engineering to define and refine pipelines that ingest monitoring alerts, events, logs, metrics, and incident or ticket data from operations tools
Design features that capture temporal patterns, service and infrastructure relationships, and business criticality of systems and alerts
Implement data validation rules and data quality checks and collaborate on detection and handling of data drift and schema evolution
Establish and maintain a modern machine learning operations workflow including experiment tracking, model registry, automated training, and automated deployment
Build production-ready inference services such as synchronous application programming interfaces, batch scoring jobs, and streaming-based scoring that integrate with backend services and user interfaces
Collaborate with on-site DevOps on deployment patterns in secure environments including staging, canary releases, controlled rollouts, and rollback strategies
Define retraining strategies and schedules for models whose performance depends on changing alert distributions and operational patterns
Design offline and online evaluation suites using historical alert and incident data including realistic scenarios for alert suppression and recommendation quality
Build dashboards that make model behaviour and impact transparent to product owners, operations teams, and technical leadership
Monitor model performance and drift in production and drive corrective actions when degradation occurs
Incorporate feedback from operators and subject matter experts into continual improvement cycles and, where suitable, into active learning workflows
Work within the constraints of secure and regulated deployments including strict access control, logging, and change management practices
Ensure that experimentation and training environments that use sensitive or regulated data follow required security and compliance guidelines, including expectations associated with United States Federal government workloads and FedRAMP-style environments
Document model inputs, outputs, assumptions, and controls so that the design can be reviewed by security, compliance, and audit teams
Coordinate shared machine learning components across multiple products such as embedding services, semantic search services, and evaluation frameworks
Participate in architecture and design discussions to promote reuse of patterns and components across the AI and data platform
Provide mentoring and technical guidance to junior engineers and data scientists where needed
Work primarily from the Dallas area office in close coordination with local engineering, product, and leadership teams
Participate in in-person design sessions, whiteboard reviews, and incident reviews that require physical presence and real-time collaboration
Help build a strong on-site engineering culture through knowledge sharing, pair design, and support for local team members

Qualifications

Machine Learning Engineering, Python, Deep Learning Frameworks, Cloud Services, MLOps Practices, Experiment Design, Data Engineering, Monitoring, Evaluation, Communication Skills, Collaboration

Required

Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field or equivalent practical experience
At least five years of hands-on machine learning engineering experience with a strong record of shipping models into production systems
Strong programming skills in Python with fluency in libraries such as NumPy, pandas, and scikit-learn, and in at least one deep learning framework such as PyTorch or TensorFlow
Proven experience building and operating production machine learning systems including application programming interfaces, batch jobs, or streaming jobs and partnering with DevOps teams
Solid understanding of the full machine learning lifecycle including data preparation, feature engineering, model training, evaluation, deployment, and ongoing monitoring
Experience with at least one major cloud provider. Experience with Amazon Web Services is preferred including familiarity with services such as managed container platforms, serverless functions, object storage, and managed machine learning platforms
Experience with machine learning operations practices and tools such as experiment tracking, model registry, automated training pipelines, and automated deployment pipelines
Strong skills in experiment design and interpretation including backtesting, A/B testing, and detailed error analysis
Excellent communication skills with the ability to explain model behaviour and trade-offs to engineers, product managers, and operations stakeholders
Ability and willingness to work full-time on site in the Dallas-Fort Worth metro area

Preferred

Experience with observability and operations domains such as monitoring alerts, logs, metrics, traces, and incident ticket systems
Experience in environments that support United States Federal government or other highly regulated workloads with an understanding of security and compliance constraints
Background in large language models and retrieval-augmented search or summarization for operational or knowledge management use cases
Familiarity with vector databases and semantic search platforms and experience building embedding-based retrieval systems
Experience delivering anomaly detection, clustering, and time series modelling solutions at meaningful scale
Prior experience in a product engineering setting where the engineer owns design, implementation, and operational aspects of machine learning services

Company

Resolve Tech Solutions

Resolve Tech Solutions (RTS) is a technology services company focused on delivering SAP as a managed cloud service in the public cloud.

H1B Sponsorship

Resolve Tech Solutions has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data provided by the US Department of Labor)
Trends of Total Sponsorships
2025 (36)
2024 (36)
2023 (52)
2022 (18)
2021 (15)
2020 (47)

Funding

Current Stage
Late Stage

Leadership Team

Vinod Muthuswamy
Chief Executive Officer
Syed Azhar
Chief Business Officer
Company data provided by Crunchbase