Malvern, PA - Three months remote
Description
• Are you an engineer who loves to solve impactful complex operational problems?
• Are you passionate about finding opportunities to improve system performance and efficiency, scalability, fault tolerance, and self-healing capabilities?
• Are you excited about Chaos Engineering? Do you want to apply these principles and creatively experiment with our systems to uncover hidden weaknesses?
• Are you obsessed with understanding a system's inner state, the interactions between systems, or observability-driven development?
If the above holds, then the Lead Site Reliability Engineer opportunity at Vanguard is for you! A successful candidate will likely have experience as a Full Stack Engineer who has supported their applications operationally. You will solve reliability problems across product families and continuously seek opportunities to improve our systems' "-ilities". You will also help define, maintain, and carry out subdivisional Reliability Engineering standards, contribute to enterprise-wide libraries for reliability, and train product SRE and product family SRE leads within the subdivision.
In this role you will:
1. Instrument, enhance, and advocate for system observability. Identify and develop solutions to bridge gaps in system observability.
2. Collaborate with internal teams to evaluate the health, stability, and reliability of systems and platforms. Look for opportunities to improve system performance, efficiency, and resiliency.
3. Develop and communicate new standards and newly available tools and frameworks across subdivisions. Enforce reliability standards. Design and develop new automated solutions for reliability.
4. Provide technical leadership, consultancy, and coaching on designing and implementing both traditional and serverless architectures in AWS, with an emphasis on repeatability, scaling options, resilience, reliability, telemetry, networking, etc., including design patterns for resilient systems.
5. Lead failure modes analysis spanning product families when new features and architecture patterns are introduced. Facilitate post-incident reviews for any high-severity, client-impacting events local to the product family.
6. Lead cross-product or cross-subdivision chaos experimentation.
7. Design, review, and coach others on performance tests using appropriate components (e.g., requests per minute, number of threads, the construction of a request with headers and cookies).
8. Consult on, review, coach on, and influence architectural decisions, including non-functional aspects, proposing potential technical solutions and enhancements and explaining convincingly which is better and why.
9. Contribute to or lead Reliability Engineering and Resilience communities of practice. Remain informed about Site Reliability Engineering activities happening within the subdivision.
10. Work with product owners to set subdivision goals for higher availability and SRE impact, and track progress toward achieving them.
11. Provide technical leadership, guidance, consulting, training, and governance on SRE to one or more product families in a subdivision.
12. Identify opportunities to automate away toil and develop solutions, monitor error budget exhaustion rates, configure auto scaling thresholds for the product, and incorporate resilience patterns, such as circuit breakers, into the application code. Develop complex deployment and/or routing strategies for high availability.
13. Maintain and look for opportunities to improve the centralized incident response playbook for the subdivision, which documents standards for managing communication and escalation during an incident.
14. Oversee blameless post-incident reviews for high-severity incidents involving multiple product families.
Core Responsibilities/Qualifications
• Minimum of eight years of related work experience, with at least three years of development experience.
• Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
• Full stack development: JDK 8+ preferred, with Spring Boot, REST APIs, multithreaded and multiprocessing applications, and GraphQL. Experience with UI development (familiarity with Angular, TypeScript, Node.js, etc.) is a plus.
• Ability to diagnose and resolve problems in high-throughput applications.
• Experience with one or more observability frameworks or tools, such as OpenTelemetry (Java, JavaScript, etc.), CloudWatch, Grafana, or Splunk.
• Exposure to *nix environments including some shell script development and basic command execution.
• Strong understanding of database principles and working knowledge of distributed storage and infrastructure solutions.
• Experience with container management (such as Docker) and microservices architectures in cloud and on-premises infrastructure.
• Working knowledge of AWS network foundations, application networking, edge, and network security.
• Excellent communication and documentation skills.