Session

Site Reliability Engineering for Reliable Machine Learning in Production

Keynote Interview

Many ML teams have moved beyond simply getting models to work and now focus on ensuring that those models meet the needs of the organization. This means building processes and systems that allow models to be produced and delivered efficiently, hardened against failure, and robust to the failures that inevitably occur.

ML and MLOps practitioners have much to learn from the evolution of DevOps in this regard, and particularly from the development and application of site reliability engineering (SRE), which sought to apply engineering discipline to the challenges of operating large-scale, mission-critical software systems.

In this live podcast interview, Sam speaks with Google SRE practitioners and authors Todd Underwood and Niall Murphy about the application of SRE to MLOps.

Session Speakers

Consultant, co-author of "Reliable Machine Learning: Applying SRE Principles to ML in Production" (Not Affiliated)
Senior Director, ML SRE (Google)
