This post covers how Uber builds, scales and organizes around its AI/ML platform, using Uber’s “Scaling Machine Learning at Uber with Michelangelo” article from November 2018 (link). I’ll cover the subject at a high level, so if you’re one of those PhD-y types already focused on things like hybrid pointer-generator networks, spherical convolutional neural networks or Wasserstein Autoencoders then this may not be for you.
Recently a consulting firm reached out to me for advice about building and scaling Artificial Intelligence (AI) and Machine Learning (ML) platforms for their customers. I have some experience in this space at the infrastructure level working with Intel and NVIDIA, and at the software and services level working with Amazon, IBM and a few others so I decided to help them out. This post covers some key focus areas when it comes to scaling AI/ML platforms for their customers. I figure it is best to approach this topic through household names with products and services that most understand and use on a daily basis so I chose Facebook and Uber.
For Uber, I’ll focus on their ML platform, use cases and the organizational changes required to successfully scale AI/ML infrastructure. I’ve taken some points from Uber’s insightful article on “Scaling Machine Learning at Uber with Michelangelo,” interviews with Michelangelo’s Product Manager Mike Del Balso and other AI/ML companies I’ve worked for or with along the way. If you are interested in learning about how Facebook scales their software and hardware AI/ML infrastructure per Yangqing Jia, Facebook’s Director of AI Infrastructure, then click here. To continue with Uber’s AI/ML efforts, keep reading. One last note: if you are new to this space and need to understand terms like features, training, evaluation and inference at a high level, then it’s probably best for you to read the Facebook article first, as I detail those before going into specifics.
Products and Services Using Machine Learning at Uber
Uber has set a goal to “Democratize ML across Uber.” We see this in execution when we look at products and services that various Uber teams have created with Uber’s Machine Learning (ML) platform. Before getting into that platform and how they structure their organization to accommodate successful implementation of ML, let’s take a look at a few examples of ML in action across Uber’s products and services.
For those that have never had a lazy night in and used the service before, Uber Eats is an online food ordering and delivery platform launched by Uber in 2014. It allows users to order food from local restaurants and have it delivered to their front door by an Uber driver. The Uber Eats service uses ML models to make hundreds of predictions to optimize the eater experience and does so each time the app is opened. Those ML models suggest restaurants and menu items and estimate meal arrival times.
Uber leverages various forecasts to make data-driven decisions at scale:
- Marketplace Forecasting. Predicts user supply and demand to direct driver-partners to high demand areas before they arise, thereby increasing their trip count and earnings.
- Hardware Capacity Planning. Predicts hardware capacity requirements to avoid under-provisioning, which may lead to service outages, or over-provisioning, which means paying for expensive, underutilized infrastructure.
- Marketing. Makes estimates on marginal effectiveness of different media channels while controlling for trends, seasonality, and other dynamics (e.g., competition or pricing).
- Setting Financial Goals. Predicts future values of time-dependent data such as sales, weekly trip counts, available driver levels, app traffic, economic conditions, etc.
Uber has a customer support team that responds to thousands of tickets (e.g., a wallet or phone left in the car) from the 15 million trips that happen every day. They use ML models to speed up processing and resolve support tickets. The first iteration of the models sped up ticket handling time by 10 percent with similar or better customer satisfaction ratings. The second version drove an additional 6 percent speedup.
Read carefully, as this may be one of the most important notes of this article. Speeding up ticket resolution by roughly 15% represents the value of ML most enterprises can realize. (A 20min average resolution time made 10% faster is an 18min average; speeding that up by a further 6% gives about 16min 55sec, which is roughly 15% faster than the original 20min.) AI/ML and even deep learning are typically advertised as something that will “revolutionize” and “fundamentally change” business. In some cases that may be true, but in most cases companies will realize marginal but meaningful gains. Staffing a workforce to resolve 2,000 tickets daily and then getting a 15% reduction in resolution time means either (1) you need fewer staff to resolve the same number of tickets or (2) you have 15% more time to solve tickets above the 2,000 daily average. Assume your customer support reps resolve 24 tickets a day at a 20min average per ticket over 8hrs, those reps are paid $70,000 annually and you take approach (1). With a 15% reduction in ticket resolution time, you need roughly 12.5 fewer full-time reps (about 71 instead of about 83) to solve 2,000 tickets daily, saving roughly $875,000 annually in salaries (not saying fire them, reallocate them to something else). That is $875,000 saved with “only” a 15% improvement of the business process in this one use case. This is the real power of AI/ML. Tangent over.
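For anyone who wants to check the tangent’s math, here is a small Python sketch of it. Every input is one of the illustrative figures from the paragraph above (not Uber data), and the function names are my own. The staffing difference works out to about 12.5 full-time-equivalent reps.

```python
# Worked version of the support-ticket arithmetic above.
# All inputs are the illustrative figures from the text, not Uber data.

BASELINE_MINUTES = 20.0      # average resolution time per ticket
TICKETS_PER_DAY = 2_000      # daily ticket volume
WORKDAY_MINUTES = 8 * 60     # one rep's working day (24 tickets x 20 min)
SALARY = 70_000              # annual salary per rep, USD


def compounded_time(baseline, speedups):
    """Apply successive fractional speedups to a baseline time."""
    minutes = baseline
    for speedup in speedups:
        minutes *= 1.0 - speedup
    return minutes


def reps_needed(tickets, minutes_per_ticket, workday_minutes):
    """Full-time-equivalent reps needed to clear the daily volume."""
    return tickets * minutes_per_ticket / workday_minutes


# 10% faster, then a further 6% faster.
new_minutes = compounded_time(BASELINE_MINUTES, [0.10, 0.06])
reduction = 1.0 - new_minutes / BASELINE_MINUTES

# The text rounds the reduction to 15% for the staffing estimate.
reps_before = reps_needed(TICKETS_PER_DAY, BASELINE_MINUTES, WORKDAY_MINUTES)
reps_after = reps_needed(TICKETS_PER_DAY, BASELINE_MINUTES * 0.85, WORKDAY_MINUTES)
savings = (reps_before - reps_after) * SALARY

print(f"new resolution time: {new_minutes:.2f} min ({reduction:.1%} faster)")
print(f"reps: {reps_before:.1f} -> {reps_after:.1f}, savings about ${savings:,.0f}")
```

Swap in your own ticket volume, handling time and salary to see what a “marginal” improvement is worth in your shop.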
Estimating Time of Arrivals
Uber’s estimated time of arrival (ETA) system, probably one of the most important and visible metrics in the app, provides ETAs on pickups and arrivals. If you’ve ever taken a ride share before, you know that ETAs are critical to a positive user experience. For Uber, correct ETAs are also important because the ETA metrics feed into other internal systems to help determine pricing and routing. Uber’s Map Services team uses ML models to predict errors in their existing ETA estimation system and make corrections based on the error prediction. The addition of ML models to correct ETAs has significantly increased ETA prediction accuracy, in some cases reducing error rates by more than 50 percent.
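The “predict the error, then correct” idea is easy to miss, so here is a toy Python sketch of the pattern. It is not Uber’s system: the baseline (a constant 30 km/h), the single feature (trip distance), the trip data and every function name are assumptions I made up for illustration, and the error model is a plain least-squares line rather than whatever Uber actually trains.

```python
# Sketch: correct a baseline ETA with a learned error model.
# The baseline, the feature, the data, and all names are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def baseline_eta(distance_km):
    """Stand-in for the existing ETA system: assumes a constant 30 km/h."""
    return distance_km / 30.0 * 60.0  # minutes


# Historical trips: (distance in km, actual minutes observed).
history = [(2.0, 5.0), (5.0, 12.5), (10.0, 25.0), (20.0, 50.0)]

# Train the model on the baseline's error, not on the ETA itself.
distances = [d for d, _ in history]
errors = [actual - baseline_eta(d) for d, actual in history]
slope, intercept = fit_line(distances, errors)


def corrected_eta(distance_km):
    """Baseline ETA plus the predicted error for this trip."""
    return baseline_eta(distance_km) + slope * distance_km + intercept


print(f"baseline:  {baseline_eta(8.0):.1f} min")
print(f"corrected: {corrected_eta(8.0):.1f} min")
```

The nice property of this design is that the existing ETA system stays in place; the ML model only has to learn where that system is systematically wrong.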
Digitally Transforming Your Business with Machine Learning
Anyone that has ever been involved in digital transformation at a large enterprise understands how organizational structure can make or break the success of adopting a new technology. Gartner has identified at least six barriers to digital transformation outside of just the technology. Uber is no different, and successfully scaling ML at Uber requires getting more than just the technology right.
To get ML right at Uber, they considered critical success factors across three areas: Use of Technology, Processes in place to enable or manage the digital transformation and Organizational structure. The rest of this article covers these three areas, providing what Uber has learnt over the past three years and considerations for companies undergoing the same process of scaling ML across important use-cases critical to business success.
Leveraging Machine Learning Technology at Uber
Uber, like Facebook, has given us an intimate look into their journey with deploying and scaling AI/ML infrastructure across business units. They’ve seen their ML deployments grow to hundreds of use cases with thousands of models deployed in production and millions of predictions made every second. They’ve also grown their ML staff to hundreds of data scientists, engineers, product managers and researchers, all in just 3 years.
They did this through their advanced ML platform, Michelangelo.
Michelangelo consists of a mix of open source systems and components built in-house when open source solutions were not ideal for their use case. The primary open source components used are HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow. Uber’s preference is for mature open source options where possible and they fork or contribute to the open-source libraries as needed. To interface with Michelangelo, a web UI and APIs (via Jupyter notebooks) are provided. Many teams use the API interface to interact with Michelangelo during the ML workflow (i.e. training, evaluating, deploying, predicting).
Gathering those open-source technologies and combining them with custom in-house tools doesn’t inherently give you an ML platform that can scale to 40 million monthly active riders. Uber also makes sure they get the most important technical considerations correct: the end-to-end ML workflow, treating ML as software engineering, model development velocity and maintaining a modular and tiered architecture.
Michelangelo’s End-to-End Machine Learning Workflow
The team at Uber found the same general workflow exists across most of their ML use cases regardless of type (e.g., classification, regression, time-series forecasting). They also designed their standard workflow to be implementation and mode agnostic, so that support can easily be expanded to new algorithms and frameworks in either offline or online modes.
The standard workflow used in Michelangelo consists of the following six steps:
- Manage Data. Provides standard tools for building data pipelines to generate feature and label data sets for training and predictions. These tools have deep integration with Uber’s data lakes, data warehouses and the company’s online data serving systems. The ML team found that finding good features is often the hardest part of ML and building and managing data pipelines can be one of the most costly pieces of a complete ML solution.
- Train Models. A distributed model training system, with a front-end API, that can scale from small datasets up to large data sets with billions of samples. For every model trained, Michelangelo has a model store that keeps attributes like who trained the model, start/end time of training, reference to training data sets, model accuracy metrics, learned parameters of the model and model visualization statistics. The Michelangelo engineering team regularly adds new algorithms in response to customer (internal Uber business units) needs and allows customer teams to add their own models and code for flexibility.
- Evaluate Models. Model accuracy and feature reports, along with visualizations, are made available through a web UI and API, which help data scientists with inspecting the details of an individual model and comparing one or more models against each other.
- Deploy Models. Michelangelo has end-to-end support for managing model deployment via the UI or API and supports three model deployment modes: offline (scheduled batch or on-demand predictions), online (available for immediate individual or batch predictions) and library (model deployed to a serving container embedded as a library in another service).
- Make Predictions. Various services make API based inferences against the offline or online models to get predictions based on feature data loaded from a data pipeline or directly from a client service.
- Monitor Predictions. Uses ongoing live measurements of model accuracy when monitoring predictions. This ensures Uber’s data pipelines are continuing to send accurate data and production environments haven’t changed to the point their models are no longer accurate.
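To make the train, evaluate and deploy steps a little more tangible, here is a minimal Python sketch of a model store that keeps the attributes listed above and records one of the three deployment modes. This is my own illustration, not Michelangelo’s API: `ModelRecord`, `ModelStore`, the metric name and the dataset reference are all hypothetical, and a real store would be a durable service rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class DeploymentMode(Enum):
    OFFLINE = "offline"   # scheduled batch or on-demand predictions
    ONLINE = "online"     # immediate individual or batch predictions
    LIBRARY = "library"   # model embedded as a library in another service


@dataclass
class ModelRecord:
    """Attributes the text says are kept for every trained model."""
    model_id: str
    trained_by: str
    training_start: datetime
    training_end: datetime
    training_dataset_ref: str
    accuracy_metrics: dict
    learned_parameters: dict


class ModelStore:
    """Hypothetical in-memory store; a real one would be a durable service."""

    def __init__(self):
        self._records = {}
        self._deployments = {}

    def register(self, record):
        if record.model_id in self._records:
            raise ValueError(f"model {record.model_id!r} already registered")
        self._records[record.model_id] = record

    def compare(self, metric, *model_ids):
        """Return the given models sorted by a metric, lowest value first."""
        return sorted(model_ids,
                      key=lambda m: self._records[m].accuracy_metrics[metric])

    def deploy(self, model_id, mode):
        if model_id not in self._records:
            raise KeyError(f"unknown model {model_id!r}")
        self._deployments[model_id] = mode

    def deployment_mode(self, model_id):
        return self._deployments.get(model_id)


store = ModelStore()
for model_id, mae in [("eta-v1", 2.4), ("eta-v2", 1.9)]:
    store.register(ModelRecord(
        model_id=model_id,
        trained_by="data-scientist@example.com",
        training_start=datetime(2018, 11, 1, 9, 0),
        training_end=datetime(2018, 11, 1, 11, 30),
        training_dataset_ref="warehouse://trips/2018-10",  # made-up reference
        accuracy_metrics={"mae_minutes": mae},
        learned_parameters={},
    ))

# Evaluate: pick the model with the lowest error, then deploy it online.
best = store.compare("mae_minutes", "eta-v1", "eta-v2")[0]
store.deploy(best, DeploymentMode.ONLINE)
print(best, store.deployment_mode(best))
```

The point of keeping all of this metadata in one place is that any model in production can be traced back to who trained it, on what data, and how well it scored.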
ML as Software Engineering
An important principle the Michelangelo team adopted is to think of machine learning as software engineering. This means running an ML platform with the same iterative, rigorous, tested, and methodical processes used in software engineering. An example of this philosophy in execution is thinking of an ML model as a compiled software library. If you view the model as a software library, then in typical software engineering fashion you’d want to keep track of the model’s training configuration in a rigorous version controlled system. Just like with software, in the absence of good controls and tracking, Uber has seen cases where models are built and deployed to production but are impossible to reproduce because the data and/or training configuration was lost. To make sure software works correctly, it is important to run comprehensive tests before deploying the software and monitor the software in production to make sure it is operating as expected. Uber takes this approach with ML and always evaluates models against holdout sets before deploying and monitors models in production to make sure they don’t behave differently than they did in offline evaluation.
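Here is a rough Python sketch of what those three habits (version the training configuration, gate deployment on a holdout set, watch for divergence in production) can look like. None of this is Uber’s code. The function names, the use of mean absolute error, the 1.25 drift ratio and the sample numbers are all assumptions I picked for illustration.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a training configuration, usable as a version key."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mean_absolute_error(predictions, actuals):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


def passes_holdout_gate(candidate_mae, production_mae, tolerance=0.0):
    """Allow deployment only if the candidate is no worse on the holdout set."""
    return candidate_mae <= production_mae + tolerance


def has_drifted(offline_mae, live_mae, max_ratio=1.25):
    """Flag a model whose live error exceeds its offline error by max_ratio."""
    return live_mae > offline_mae * max_ratio


# 1. Version the configuration so the model can be reproduced later.
config = {"algorithm": "xgboost", "max_depth": 6, "dataset": "trips-2018-10"}
version = config_fingerprint(config)

# 2. Evaluate candidate and current production model on the same holdout set.
holdout_actuals = [10.0, 20.0, 30.0]
candidate_mae = mean_absolute_error([11.0, 19.0, 30.0], holdout_actuals)
production_mae = mean_absolute_error([12.0, 18.0, 33.0], holdout_actuals)
deployable = passes_holdout_gate(candidate_mae, production_mae)

# 3. After launch, compare live error against the offline number.
drifted = has_drifted(candidate_mae, live_mae=1.0)

print(version[:12], deployable, drifted)
```

Hashing a canonical JSON dump means two configurations that differ only in key order get the same version, which is what you want if the goal is reproducibility.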
Model Developer Velocity
Building ML systems requires many iterations to get right. This is not unique to ML: regardless of the technology, iteration speed affects how fast innovation scales out across organizations and how productive the teams using that innovation are. Uber’s ML team prioritized enabling data science teams to go faster, because the more experiments run, the more hypotheses are tested and the better the results.
Uber shared their Machine Learning Project Workflow, also detailing different feedback loops within that flow. The core of the workflow consists of (1) defining a problem, (2) prototyping a solution, (3) productionizing the solution and (4) measuring the impact of the solution. The other loops throughout the workflow represent areas where many iterations of feedback gathering are required to perfect the solution and complete the project.
Michelangelo’s “zero-to-one speed” or “time-to-value speed” is also critical to the spread of ML across Uber. A few principles have proven very useful in enabling teams to develop quickly:
- Solve the data problem (access, integration, feature management, etc) so the data scientists don’t waste their time doing so.
- Automate or provide tools to speed up common/recurring workflows.
- Make the deployment process fast and magical by hiding unnecessary details in the UI and providing single click deployments.
- Let the end-user use tools they love with minimal cruft.
- Have the platform enable collaboration and reusability with things like feature stores that allow modelers to easily share high-quality features, and metrics that allow those models to be reproduced and used by others.
- Guide the user through a structured workflow by providing sufficient detail where needed and automate the workflow where possible with the workflow management system, Michelangelo.
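If “feature store” is a new term for you, the idea from the list above can be sketched in a few lines of Python: one team publishes a named feature once, and any other team can ask for it by name instead of rebuilding the pipeline. This is a toy under my own assumptions, not how Michelangelo implements it. The `FeatureStore` class, the feature names and the restaurant data are all made up.

```python
class FeatureStore:
    """Hypothetical in-memory registry of shared, named features."""

    def __init__(self):
        self._features = {}

    def register(self, name, owner, compute):
        """Publish a feature: a name, an owning team, and a function of an entity."""
        if name in self._features:
            raise ValueError(f"feature {name!r} already exists; reuse it instead")
        self._features[name] = (owner, compute)

    def owner(self, name):
        """Return the team that published the feature."""
        return self._features[name][0]

    def vector(self, names, entity):
        """Compute the requested features for one entity, in the order given."""
        return [self._features[n][1](entity) for n in names]


features = FeatureStore()

# One team defines the features once...
features.register("avg_prep_minutes", "eats-ranking",
                  lambda r: sum(r["prep_minutes"]) / len(r["prep_minutes"]))
features.register("order_count", "eats-ranking",
                  lambda r: len(r["prep_minutes"]))

# ...and any other team can request them by name for its own model.
restaurant = {"prep_minutes": [10, 14, 12]}
row = features.vector(["avg_prep_minutes", "order_count"], restaurant)
print(row)
```

Rejecting duplicate names is a deliberate choice in this sketch: it nudges modelers toward reusing an existing feature rather than quietly redefining it.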
We’ll now move on to one of the more important parts of successfully adopting ML at any corporation, the organizational structure you put in place.
Structuring Your Organization for Successful ML Adoption
Widely varying requirements for ML problems and limited expert resources make organizational design difficult and important to get right. Getting the structure right has allowed Uber’s ML projects to be owned by teams with multiple ML data scientists but also by teams with little to no technical expertise.
Machine Learning Teams
Having the right people working on the right problems is critical to building and scaling high quality ML solutions. This becomes more challenging with the AI talent shortage. Forbes states, “There are about 300,000 AI professionals worldwide, but millions of roles available. While these are speculative figures, the competitive salaries and benefits packages and the aggressive recruiting tactics rolled out by firms to recruit AI talent would suggest the supply of AI talent is nowhere near matching up to the demand.”
With this shortage, Uber had to think intelligently about allocating scarce expert resources and amplifying their impact across many different ML problems. Let’s take a look at some of the key teams and how they work together to design, build, and deploy new ML systems in production.
Each product engineering team at Uber owns the models they build and deploy in production. For example, the Map Services team owns the models that predict Uber’s ETAs and the Uber Eats team owns their restaurant ranking models. These teams are typically staffed with the full set of skills they need, like informal pods of ML engineers and data scientists, to build and deploy models using Uber’s ML platforms. Depending on the product and complexity of the ML solution, the product teams can sometimes have a specialist team that addresses gaps between what the platform provides and the specific needs of that product team.
When product teams need additional expertise that their existing ML talent can’t cover, they utilize specialist teams. Uber structures its specialist teams to have deep expertise across different domains like natural language processing (NLP), computer vision, recommender systems, forecasting and others. An example of product teams collaborating with specialist ML teams can be seen with the improvement of customer care with NLP and ML. Projects involving specialist and product teams typically last from a few weeks to many quarters. Uber wisely de-risks projects as they move closer to launch by adding full-time experts to the product team to fill the expertise gap. This allows that product team to maintain their own systems in production and also has the positive side effect of freeing up specialist resources.
Machine Learning Platform Teams
The Michelangelo Platform team builds and operates a general purpose ML workflow and toolset used directly by the product engineering teams to build, deploy, and operate machine learning solutions. As ML use-cases at Uber become more sophisticated, the Michelangelo team also spins up domain-specific platforms to address specialized use cases that are not as well served by Michelangelo workflow tools. Examples of use cases that don’t fit well in the standard ML system are NLP and computer vision, for which specific platforms are being built with special visualization tools, pre-trained models, metadata tracking, and other components. To build these custom ML systems rapidly, with the lowest amount of engineering overhead, the platform team reuses as much of the existing Michelangelo platform as possible.
Uber maintains a research team that conducts AI/ML research and connects cutting-edge advancements back into the broader business through product teams. The research team’s charter is to pursue fundamental advancements at Uber with AI/ML and to vigorously engage with the broader ML community to learn and provide innovations.
Now that we’ve covered the team structures Uber uses to successfully scale and adopt ML, let’s next take a look at processes used to increase the impact those teams can have on the company.
Process to Increase Productivity and Effectiveness of ML Teams
When expanding the reach of ML across diverse organizations with different use cases, it is important to have the right processes in place to standardize productivity and the effectiveness of adopting ML technologies. As Uber’s ML operations mature, they’ve developed a number of processes to guide teams and avoid repeating mistakes. These internally focused community building efforts and transparent planning processes help to engage and align ML teams under common goals. Let’s review a few.
Designing reliable, repeatable processes that avoid common development pitfalls and verify intended model behavior is critical to safely scaling ML across organizations at Uber. Uber has also recognized that risk profiles differ significantly across use cases and require tailored approval and launch processes, as some use cases are more vulnerable to unintended behaviors, tricky edge cases, and complicated legal/ethical/privacy problems. As an example, launching a new pricing model will require more scrutiny and privacy review than launching an automated update to an ETA prediction model.
For these reasons, as mentioned before, product organizations own the launch processes around their ML models. They are provided a centralized launch playbook that walks through general product, privacy, legal, and ethical topics around experimenting with and launching ML models but are able to adapt these processes to their specific product area. This makes the process more efficient as the product teams themselves best understand the product implications of different model behavior and are best suited to consult with relevant experts to evaluate and eliminate risks.
We’ve all run into situations where our available tools and platforms aren’t perfectly aligned with the needs of the business unit. This is typically where we branch off and build our own systems tailored to our needs. Uber attempts to strike the perfect balance of ensuring product teams are empowered to solve their own problems but also making good engineering trade-offs to avoid fragmentation and technical debt. To manage this, Uber has an internal group of senior leaders that oversees the evolution of ML tooling across the company to ensure they make smart trade-offs and maintain long-term architecture alignment. This team is experienced, sensitive to the trade-offs mentioned earlier and positioned as help versus acting as an authoritative body with the sole purpose to regulate its citizens.
Scaling ML across any large company requires a connected and collaborative organization. To build that internal ML community, Uber hosts an annual internal ML conference called UberML which recently hosted around 500 employees from more than 50 groups presenting talks or posters on their work. I’ve experienced something similar working for Amazon. I was working as a global lead for artificial intelligence and invited to what I thought was going to be a big conference room with one to two hundred people max. What I instead found upon arriving at the internal event was a small stadium filled with thousands of people working on AI/ML across Amazon globally. It even had guest speakers from various other Fortune 500 companies working on AI/ML, one speaker a CSO of a company worth over $70 billion. Similar to Uber, the event did a great job of allowing teams to evangelize their AI/ML efforts and provided networking for me to meet others relevant to the work I was doing at Amazon. Uber also organizes community building events including ML reading groups, talk series, and regular brown bag lunches for Uber’s ML enthusiasts to learn about internal ML projects from the individuals that build them. Intelligently, they also engage heavily with the external ML community through conferences, publishing papers, contributing to open source projects, collaborating on ML projects and research with other companies and academia. This external community engagement process has grown into a global effort to share best practices, collaborate on cutting-edge projects, and generally improve the state of AI/ML.
Just as it is standard for sales people to Always Be Closing, it is important for ML teams to Always Be Learning. These teams need to stay on top of developments in ML theory, track and learn from internal ML projects and master the usage of new ML tools, which makes channels for efficiently sharing information and education on ML-related topics critical. Uber starts their ML education during employees’ first week by hosting special sessions for ML and Michelangelo boot camps for all technical hires. They communicate major new functionality of Michelangelo through special training sessions for employees that frequently use the platform. Complete and effective documentation of key tools (take note software developers) and user workflows also helps encourage knowledge sharing and scaled adoption of platform tools. Office hours are also held by different ML-focused groups in the company to offer support when questions arise.
Whether you look at how Facebook scales their AI/ML infrastructure to serve over 2 billion people or how Uber uses their ML platform, Michelangelo, to enable innovation across multiple products and services, establishing ML in a large enterprise is a non-trivial process. Uber has learned a lot from their successes and failures over the past three years: they got some things right immediately but, more frequently, needed multiple iterations to discover what worked best for them. The key lessons learned are:
- Let developers use the tools that they want.
- Data is the hardest part of ML and the most important piece to get right.
- It can take significant effort to make open source and commercial components work well at scale.
- Develop iteratively based on user feedback, with the long-term vision in mind.
- Structure your ML teams in a way that best responds to the demands of the business use cases you are using ML to solve.
- Think intelligently about all areas needed to adopt ML at your company (ability to rapidly launch products/services, project planning, education and establishing communities, etc) and put processes in place to ensure success across those areas.
- Have a senior leader that is technically competent enough to understand Machine Learning and advocate for teams needing to build solutions or adopt the technology. At the very least, make sure a senior leader has a trusting relationship with your technical lead so that the technical lead has someone to advocate for them with executive leadership.
And last but not least, real-time ML is challenging to get right. So take it easy on yourself if you don’t get everything right the first time around as your company learns, iterates and grows the scope of ML to fuel innovation.