How do you build a machine learning platform for one of the world’s largest websites from the ground up? This is the question we posed to Chris Albon, Director of Machine Learning at the Wikimedia Foundation.
This is what he told us:
- Learn everything you can about your organization, the customers it serves, and the critical questions that still need to be answered (or improved upon).
- Take a deep dive into the technical aspects and resource requirements of machine learning at your organization.
- Design the future state for your ML platform and identify the tools to help you get there.
- Develop an MVP of your ML platform and test it against a small but representative sample of the models you plan on serving.
Learn about your organization.
In order to build an ML platform that delivers value to your organization, you must first understand its history and how the team has approached problems in the past. As Chris put it, in the case of Wikipedia “the learning curve is massive. People have been thinking about the problems your organization encounters for a long time.” Remember you’re not the first person to think about your team’s ML challenges.
Understand the technical aspects and resource requirements of current-state ML.
Once you understand the organization’s history and its priorities, you must gain a deep understanding of your technical infrastructure and its limitations. In the case of Wikipedia, the requirement to be transparent to the public, coupled with the need to ensure its website is always available to a global community (and thereby avoid vendor lock-in and potential government censorship), necessitates a machine learning platform built from the ground up (e.g., starting with bare metal data warehouses). This challenge also requires Chris and his team to be strategic about model training and resource utilization, especially since the team can’t simply upgrade a cloud-based service to avoid overwhelming the platform during peak times.
Define the future state and identify the tools to help you get there.
With a deep understanding of your organization’s history and technical requirements, you are now ready to begin to think about the future. How do you design and build an ML platform that meets the needs of your stakeholders? What tools will help you get there? In the case of Wikipedia, this means working towards a world where ML models can be developed and trained both internally and by the external Wikipedia community but then deployed on Wikipedia’s platform with minimal hands-on support. This required Chris and his team to be thoughtful about how to architect their ML platform to enable ease of model deployment. It also required the selection of open source tools that enable scalability.
In this case, Wikipedia decided to use Kubeflow to manage their MLOps workflow and leveraged KFServing on Kubernetes to enable a standardized, scalable approach to model deployment. Though the reasons for choosing Kubernetes and Kubeflow were many, Chris particularly appreciated the active communities around each. He also emphasized how Kubeflow allowed the specialized folks on his team to contribute in a productive and efficient manner and highlighted the power of the Kubernetes cluster to enable scalability.
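To make the idea of "standardized" deployment a little more concrete, here is a minimal sketch of what describing a model to KFServing can look like. It is an illustration under stated assumptions, not Wikimedia's actual configuration: the helper `build_inference_service`, the model name, the namespace, and the storage URI are all hypothetical. The only parts taken from KFServing itself are the `InferenceService` resource kind, the `serving.kubeflow.org/v1beta1` API version, and the `spec.predictor.<framework>.storageUri` layout.

```python
import json

# A subset of the framework keys KFServing's v1beta1 API accepts under
# spec.predictor. The list here is illustrative, not exhaustive.
SUPPORTED_FRAMEWORKS = {"sklearn", "xgboost", "tensorflow", "pytorch"}


def build_inference_service(name, framework, storage_uri, namespace="default"):
    """Return a KFServing InferenceService manifest as a plain dict.

    Hypothetical helper: every model is described with the same small
    template, so deploying a new one means changing three values rather
    than writing bespoke serving code.
    """
    if framework not in SUPPORTED_FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                # The framework key tells KFServing which model server to run.
                framework: {"storageUri": storage_uri}
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values, chosen only for illustration.
    manifest = build_inference_service(
        name="example-model",
        framework="sklearn",
        storage_uri="s3://example-bucket/models/example-model",
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Because each model reduces to the same template, replacing one model with a better one amounts to pointing the storage URI at a new artifact and re-applying the manifest.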
Test an MVP of your ML platform.
Regardless of how aggressive your timeline is, you must allow yourself the time to test a minimum viable product of your ML platform. Chris shared how his first step in testing Wikipedia’s ML platform was to choose four ML models, three developed internally and one developed by the Wikipedia community, and put them into production. This initial experience forced the team to think hard about both the infrastructure and culture required to stand up an ML platform that deploys a multitude of models at scale.
With this experience under their belt, Chris and his team were then able to gradually increase the complexity of the use cases and models served by their ML platform, eventually deploying a large number of models in a standard and scalable manner. As Chris puts it, the goal is to get to a place where “now it’s about productization and treating models less like a crystal chandelier and more like a disposable coffee cup. If you find a better one, use it and throw away the old one.” In other words, build an ML platform that standardizes and automates model deployment. This allows your ML engineers and data scientists to focus on solving the analytical problems that really matter and quickly iterate on model improvements without the headache of fighting with ML model deployment tooling.
To learn more about how Chris is building an ML platform to serve the needs of the global Wikipedia community and hear from other global AI platform leaders like him, we encourage you to check out TWIMLcon On Demand. Last month, we gathered over 500 machine learning and artificial intelligence practitioners and leaders to explore the real-world challenges of developing, operationalizing, and scaling ML & AI in the enterprise. The conference featured 50+ world-class presenters and panelists from teams leading the application of AI and Machine Learning at companies like Netflix, Shopify, LinkedIn, Spotify, Google, Walmart, iRobot, Adobe, Intuit, Yelp, Salesforce, Prosus Group, Palo Alto Networks, Microsoft, Qualcomm, and more. TWIMLcon’s 20+ hours of presentations, workshops, and discussions will provide you with a practical blueprint for delivering machine learning efficiently and at scale. To explore this great content and learn more about building smarter, innovating faster, and avoiding costly mistakes across end-to-end ML model production, visit twimlcon.com/ondemand.