Morgan Stanley · 15 hours ago
Site Reliability Engineer (SRE) - Wealth Management
Morgan Stanley is a global leader in financial services, always evolving and innovating to better serve clients. The Site Reliability Engineer position focuses on overseeing the production environment, ensuring operational reliability of software, and optimizing performance while collaborating with various technical teams.
Asset Management · Finance · Financial Services · Lending
Responsibilities
The Wealth Management Production Management Site Reliability Engineer position is a highly visible, critical role on a team of technical SMEs managing the stability and optimization of the Wealth Management systems
Scope includes, but is not limited to, day-to-day support of the organization’s technology-related outages and collaboration on technology projects focused on stability, optimization, business impact analysis, and associated risk-related methodologies
This role will be responsible for overall stability of the Wealth Management Investment Management application platforms, participation in key optimization initiatives, and collaboration with multiple technical teams within Morgan Stanley
Additionally, the role partners with WM business units and various levels of management and staff to collect and analyze information and make recommendations on optimizing the platform
As a team member with expertise in deep analytical triage, you will provide subject matter expertise in debugging, issue analysis and troubleshooting, working with business and technical colleagues to provide reviews and recommendations to avoid any future application issues
Produce guidance documentation, standards and procedures, product assessments, and training material, including working with the various application and infrastructure support teams to ensure that they document every troubleshooting step in the Morgan Stanley knowledge base system so issues can be resolved faster
You will serve as a fully seasoned/proficient technical resource; provide technical knowledge in outage management and proactive solutions to improve the user experience
This position will mainly perform a DevOps/SRE role in application support, platform stability, and resiliency
Proactively detecting, troubleshooting, and resolving all issues affecting production applications
This involves coordination with and escalation to development and external teams where necessary
The team owns every issue escalated to it until the issue is resolved or a workaround is provided so end users can continue functioning
Responsible for maintaining clear, concise, and timely communication with affected parties during the investigation and resolution of any individual or system-wide outage
Responsible for the stability of the Production environment
Develop and continually revise (in partnership with other teams where necessary) suitable policies and procedures to ensure appropriate application development standards are available to guide development for systems deployed to Production
As the gatekeepers of the Production environment, responsible for ensuring the Change Implementation Management guidelines/policies are adhered to for all systems deployed to Production
Responsible for servicing all requests for data or other activities that require access to Production systems
Work with development teams at the appropriate stages in application development to ensure any new systems or projects meet the Production standard
Responsible for maintaining and growing a body of knowledge that is accessible to all team members
Ensure information regarding any support related activities or issues is available and easily accessible
The goal is to improve self-reliance and reduce dependency on the availability of development or external team resources for the initial troubleshooting and resolution of problems
Qualifications
Required
Minimum 5 – 7 years' experience in developing and/or supporting Enterprise Applications
3 – 5 years' experience in leading a small to medium team with a similar skill set
Willingness to embrace Agile and DevOps/SRE concepts
Working knowledge of any of the DevOps and observability tools (Grafana, Prometheus, Splunk, Kibana)
Solid analytical skills and experience with problem determination, resolution, and recovery processes
Ability to interface and cultivate excellent working relationships with technology teams, business analysts, and vendors
Understanding of database engineering and ability to develop high-quality database solutions
Strong Unix Shell scripting experience required
Have administrative competence in at least one major programming language or platform (for example: Perl, PowerShell, Python, Java, or C#/.NET)
HTML, JavaScript, jQuery
Knowledge of clickstream tagging
Experience with web analytics tools (preferably Adobe Experience Cloud tools) is a plus
Experience with Azure or AWS is a plus
Should be able to learn technologies quickly in a fast-paced environment
Have strong organizational skills and the ability to manage multiple tasks and high-pressure situations for outage handling, management, or resolution
Is driven to learn about new technologies, techniques and what it takes to be an integral member of this team
Hands-on experience administering large-scale, high-availability systems and the tools to monitor performance and availability
Experience in creating technical architecture documentation
Excellent communication and writing skills specific to technical discussions across the management layers
BS/MS or equivalent, preferably in a quantitative discipline (Computer Science, Computer Engineering)
Experience with incident “on call” and ability to respond to emergencies on a 24/7 basis
Experience working in the Financial Services area is a plus
Assisting in the investigation and troubleshooting of production issues and playing an active role in mentoring/coaching/training and development of team members
5-10 years' experience in supporting or developing transaction-based systems
Experienced, technically hands-on professional who understands both code and infrastructure
Solid track record in an operational/support role, understands incident/problem/change management and how to drive stability across organizations
Able to manage an outage incident, coordinating user communications and other teams to help resolve it
Strong and keen focus on metrics and trend analysis
Strong problem-solving skills with ability to analyze and understand data
Candidate must have the ability to forge strong relationships and coordinate effectively with multiple parties during outages and actively communicate updates to APG and BU partners
Must be comfortable with on-call rotation including weekend work
End user support - able to talk to users to discuss their problems and work through to a resolution
Self-motivated with exceptional oral and written communication skills, ability to communicate clearly and concisely
Strong ownership mentality with a focus on customer satisfaction
Detail oriented and organized with strong analytical skills
Experience working in a virtual or global team
Self-starter with the ability to multitask and a can-do attitude
Familiarity with ITIL terms around incident and problem management
Benefits
Commission earnings
Incentive compensation
Discretionary bonuses
Other short and long-term incentive packages
Other Morgan Stanley sponsored benefit programs
Company
Morgan Stanley
Morgan Stanley is a financial services company that offers securities, asset management, and credit services.
H1B Sponsorship
Morgan Stanley has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (222)
2024 (195)
2023 (173)
2022 (153)
2021 (165)
2020 (173)
Funding
Current Stage: Public Company
Total Funding: unknown
IPO: 1997-02-05
Company data provided by crunchbase