This post covers how Facebook builds and scales AI/ML platforms, based on the March 2018 Scaled Machine Learning Conference session given by Yangqing Jia, Facebook’s Director of AI Infrastructure (link). I’ll cover the subject at a high level, so if you’re one of those PhD-y types already focused on things like hybrid pointer-generator networks, spherical convolutional neural networks or Wasserstein autoencoders, then this may not be for you.
Recently a consulting firm reached out to me for advice about building and scaling Artificial Intelligence (AI) and Machine Learning (ML) platforms for their customers. I have some experience in this space at the infrastructure level working with Intel and NVIDIA, and at the software and services level working with Amazon, IBM and a few others, so I decided to help them out. This post covers some key focus areas when it comes to scaling AI/ML platforms. I figure it is best to approach this topic through household names with products and services most people understand and use on a daily basis, so I chose Facebook and Uber.
For Facebook, I’ll focus on the software and hardware considerations they made to successfully scale AI/ML infrastructure, per an excellent talk given by Yangqing Jia, Facebook’s Director of AI Infrastructure, at the Scaled Machine Learning Conference. If you are interested in learning how Uber built their AI/ML platform, Michelangelo, and the organizational changes they’ve made to ensure successful AI/ML adoption, then click here; to continue with Facebook’s AI/ML efforts, keep reading. One last note: this discussion focuses on Facebook’s approach, so it is given from a non-cloud perspective, which changes some of the narrative, as public clouds eliminate some options and make other choices significantly easier.
Machine Learning Workflow Defined
Let’s start by quickly level-setting on some Machine Learning (ML) definitions by analyzing the Machine Learning Execution Flow.
The main components of an ML workflow you need to understand to read further are the following:
- Data. Information generated and stored by an organization. For a hospital think of data gathered by entering patient data into a database.
- Features. An individual measurable property or characteristic of something you observe. If you have a table of one million hospital patients with columns for things like their height, weight, BMI, blood pressure, blood types, etc. then each of those columns or attributes about your patients are considered features. When you hear about things like feature extraction, you can simply think pulling relevant data out of a larger set of data to use for model training.
- Training. Using your data with an ML algorithm to create a model to use for predictions on new data. For our hospital example, this would be taking your patient data and training a model to identify trends that predict people at high risk for diabetes, based on extracted features from your data set like a patient’s age, weight, blood pressure and BMI.
- Evaluation. Evaluating how accurate your model is at predicting the outcome you’re looking for. This can be percentage based, where your model’s accuracy is determined by the total number of correct predictions divided by the total number of predictions made. So we can give our model 1,000 diabetic patients with features like age, weight, blood pressure and BMI and test how many of the 1,000 it correctly predicts are diabetic. If it gets 875 out of 1,000, then we say the model is 87.5% accurate. Acceptable accuracy depends on the application: a model that correctly predicts diabetic patients 87.5% of the time may be acceptable. That same accuracy in a life-saving emergency system, failing people 12.5% of the time, may not be considered so acceptable.
- Inference. After you are satisfied with the accuracy of your model, you deploy it to a production environment where other applications and services can make API calls against the model to infer or “ask questions” against the model and receive a prediction back. An example with our diabetes prediction model above is having a new patient visit the hospital and after their age, weight, blood pressure and BMI are entered into the system, your application makes an API call (inference) against the model to provide the doctor with a percentage on how likely that patient is to have or develop diabetes.
- Online. For learning (improving your model), this means you are learning and updating that model as soon as data comes in. The online term is also used to indicate your model is available in production for other applications to make API calls (inferences) against to retrieve a prediction in return.
- Offline. With learning, this means you are learning with a static data set that isn’t constantly being updated. Offline is also used to indicate the model is not in a production environment and isn’t generally available for applications to make inferences against.
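The train → evaluate → infer loop above can be sketched end-to-end in a few lines of code. This is a minimal illustration using scikit-learn with synthetic stand-in data for the hospital example; the feature weights and labels are made up, not a real medical model.

```python
# Minimal sketch of the ML workflow: train, evaluate, infer.
# All data here is synthetic; features stand in for age, weight,
# blood pressure and BMI from the hospital example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic "patient" feature matrix: 1,000 patients, 4 features.
X = rng.normal(size=(1000, 4))
# Synthetic "diabetic" label: high weighted feature sum, plus noise.
y = (X @ np.array([0.8, 1.2, 0.5, 1.5])
     + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Training: fit a model on one slice of the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Evaluation: correct predictions divided by total predictions.
accuracy = (model.predict(X_test) == y_test).mean()
print(f"accuracy: {accuracy:.1%}")

# Inference: a new "patient" arrives; ask the model for a risk estimate.
new_patient = np.array([[0.3, 1.1, 0.2, 0.9]])
risk = model.predict_proba(new_patient)[0, 1]
print(f"diabetes risk: {risk:.1%}")
```

In production the inference step would sit behind an API endpoint rather than a local function call, but the shape of the workflow is the same.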
Primary Infrastructure Challenges
Now that we’ve level-set on terms, let’s take a look at infrastructure challenges. Three primary infrastructure challenges you’ll find when building out your AI/ML platform are storage, networking and compute, with cost cutting across all three.
- Storage. If you’re training on ImageNet, a large visual database designed for use in visual object recognition research, then you can do so locally on most laptops or desktops, as the total size of the 14 million included images is around 150 gigabytes. If you’re Facebook, with 350 million photos uploaded daily and over 250 billion stored at any given time, that laptop won’t quite cut it. You have to build a large storage system and focus on typical big-data storage tasks: ensuring you have enough space for 350 million new images daily, making your storage system fault tolerant to avoid losing data, and making that system highly available so your ML developers, data scientists and applications have access to the images 24x7x365. The requirement to make data in your storage system available to your ML platform may seem obvious but is also non-trivial, as you will likely have data stored in multiple systems (e.g. logging systems, BI tools, data warehouses) and multiple formats (e.g. text, CSV, JSON, ORC files), which can make integration difficult.
- Network. Machine learning is useful only when fed tons of relevant data, and transferring large amounts of data across corporate networks is a non-trivial task for most. Transferring a week’s worth of Facebook’s image data, over 2 billion images, from storage to the machines doing the computation requires a lot of available bandwidth. There is also a skills gap when it comes to network managers with experience in the traffic reduction and management techniques that improve network performance of distributed machine learning systems. Questions around the security requirements for networking traffic in an ML cluster, or how to establish network performance metrics when you have data scientists and developers creating thousands of models across hundreds of ML use cases, aren’t well understood by most network administrators.
- Compute. For compute, you need to decide whether a CPU, GPU, ASIC or FPGA is the best option for your ML workloads and then sort through the list of hardware vendors to determine who offers the best hardware for your needs. While there are some general axioms you can follow, “best” is relative to your specific ML workloads and requires benchmarking (which has only recently been standardized) to determine. Testing compute isn’t limited to the selection of hardware, as vendors like Intel, NVIDIA and others offer software solutions to further improve your ML performance by optimizing across processor instruction sets, firmware, programming runtimes, math processing libraries and deep learning frameworks.
- Cost. Cost-to-performance analyses need to be completed to know which hardware/software combinations give the best performance at the lowest price across similar ML workloads. There is also difficulty in weighing the cost against the benefits to your organization, as the benefits aren’t always clear at the time you begin building out your infrastructure. If you have pockets as deep as Google, Facebook, Uber and others, then you can probably afford the experimentation phase until you figure out how to deliver your ML platform in a performant, cost-effective manner. If your company doesn’t have a large IT budget, or leadership doesn’t understand or believe it is an ROI-positive investment, then your attempts to build out an ML infrastructure may be stopped before they begin.
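As a rough illustration of the cost bullet above, a cost-to-performance comparison can start as simple back-of-the-envelope arithmetic before graduating to full benchmarks. Every number in this sketch is made up for illustration; real throughput and pricing figures come from benchmarking your own workloads on each candidate configuration.

```python
# Back-of-the-envelope cost-to-performance comparison.
# All throughput and cost figures below are hypothetical.
candidates = {
    # name: (training throughput in images/sec, hourly cost in dollars)
    "cpu-server":  (200,  1.50),
    "gpu-server":  (2000, 6.00),
    "fpga-server": (900,  4.00),
}

def images_per_dollar(throughput, hourly_cost):
    """Images processed per dollar spent: throughput * 3600 sec / hourly cost."""
    return throughput * 3600 / hourly_cost

# Rank configurations by work done per dollar, best first.
ranked = sorted(candidates.items(),
                key=lambda kv: images_per_dollar(*kv[1]), reverse=True)
for name, (tput, cost) in ranked:
    print(f"{name}: {images_per_dollar(tput, cost):,.0f} images/$")
```

The point isn’t the arithmetic, it’s that the “best” raw performer and the best cost-to-performance performer are often different machines, which is why the analysis has to be run per workload.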
When building out your AI/ML infrastructure, it’s very difficult to get everything right the first time, as you have to work through the plethora of compute, networking and storage decisions while your ML use cases and requirements are constantly changing. Those that have solved these infrastructure challenges, like Uber, Facebook and others, all agree that this is an iterative, never-ending process. We’ve covered the generic challenges and will now dive into specific implementations Facebook uses, across hardware and software, when creating their scalable AI/ML infrastructure.
Facebook’s Leveraging of Machine Learning
Facebook was nice enough to show us the inner workings of how they build and scale ML infrastructure to support over 2 billion users. If you follow Facebook (in real life, not on their social media platforms) you’ll know their openness and willingness to share internal technical details is nothing new, as they have a history of sharing innovations and their data center designs with the public through opencompute.org. Their AI platform can be categorized into three primary pillars:
- Frameworks needed to create, migrate and train models
- Platforms for model deployment and management
- Infrastructure needed to compute workloads and store data
Use Cases, Model Selection & Training
Facebook uses machine learning for classification, ranking and content understanding services. These include, but are not limited to, things like your news feed, serving ads, search, classifying objects, identifying people’s faces in posts and translating between languages. To build a platform like Facebook’s, you will need to work across internal business units to take their use cases and have your AI/ML leads select the ML algorithms that are “best” for each particular use case, where “best” is defined by attributes like how interpretable, simple, accurate, fast and scalable the model is. When training these models to achieve the desired accuracy, Facebook asks and answers these questions:
- Training Frequency. How often should you train or re-train your models? For news feeds and advertisements, they need to train those models at a sub-24-hour rate, as the content of news feeds and the type of advertisements change often. For the computer vision models that identify things like human faces or dogs in your uploaded images, they typically train every couple of months, as dog and human faces don’t evolve or change daily (if they did, those models would quickly become outdated).
- Training Speed. How long does it take to train models? After determining how often they want to update their models, they also consider training speed. Computer vision models can take hours or days to train, depending on the size of the data sets. Models like news feed ranking can train in under an hour, and the online models can be updated in minutes.
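The frequency question above boils down to tracking, per model, how stale it is allowed to get. This is a hypothetical sketch of that bookkeeping; the model names and intervals are illustrative (following the daily-vs-every-couple-months pattern Jia describes), not Facebook’s actual values.

```python
# Hypothetical per-model retraining cadences. Fast-moving models (news
# feed, ads) retrain daily or faster; slow-moving ones (computer vision)
# every couple of months. Intervals are illustrative only.
from datetime import datetime, timedelta

RETRAIN_INTERVALS = {
    "news_feed_ranking": timedelta(hours=12),
    "ad_targeting":      timedelta(hours=24),
    "face_detection":    timedelta(days=60),
}

def needs_retraining(model_name, last_trained, now):
    """True if the model's allowed staleness interval has elapsed."""
    return now - last_trained >= RETRAIN_INTERVALS[model_name]

now = datetime(2018, 3, 25)
# A news feed model last trained a full day ago is already stale.
print(needs_retraining("news_feed_ranking", datetime(2018, 3, 24), now))
# A face detection model trained three weeks ago is still fresh.
print(needs_retraining("face_detection", datetime(2018, 3, 1), now))
```

A real scheduler would also trigger retraining on data-drift signals rather than on the clock alone, but the cadence table is the starting point.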
Let’s now take a look at the hardware that enables these ML use cases.
Facebook has been designing their own hardware since 2010. Their hardware design philosophy is to re-imagine hardware, making it more efficient, flexible and scalable. For their ML infrastructure, they identified a small number of major services with unique resource requirements and designed servers for those services, making the hardware available in one global shared pool rather than using customized, dedicated hardware. When building your own infrastructure, you will be tasked with typical data center operations (racking, stacking, installing, maintaining, monitoring, patching, etc.) while also ensuring that hardware is performant with ML workloads. Facebook found that designing hardware around a 3-tiered approach, with web, compute/memory and storage tiers, worked best.
The web tier is for, surprise-surprise, stateless web applications and services. Through testing they found the web tier processes a lot of I/O, so they used the Open Compute Yosemite design, which holds 12 machines (called Twin Lakes) in a power-efficient box with single-socket CPUs that best serve the throughput requirements of the web tier.
Compute/Memory & Storage Tiers
For compute- and memory-heavy workloads, they use the Tioga Pass design, built around powerful CPUs with the ability to add large amounts of memory as your application requires. For the storage tier they use the Bryce Canyon design, which is great for high-throughput storage as it supports many storage disks.
Facebook started with off-the-shelf GPUs in HP servers and learned hard lessons about things like serviceability and thermal effects on reliability. Building on that infrastructure knowledge, they began building their own boxes, starting with the Open Compute Big Sur design, which was later updated to the Big Basin design. A primary difference from Big Sur to Big Basin is that they built a pure GPU box and isolated the CPU-based head node for better reliability. Amazon does something similar by isolating the GPU from the CPU with their Elastic Inference offering, albeit to reduce cost by up to 75%, rather than for Facebook’s serviceability and reliability reasons.
We can see how it looks when we map the hardware to the Machine Learning Workflow mentioned earlier. This isn’t the entire picture of their ML infrastructure though, so let’s move on to software.
Now that you’ve mastered the hardware components of your AI/ML infrastructure, you need to establish the software components, such as ML libraries, frameworks and workflow orchestration.
Similar to Facebook’s hardware needs, there is no one-size-fits-all ML approach for their software. The principle they follow when evaluating ML frameworks is to split development and deployment (production) environments. Flexibility and a developer’s ability to iterate fast matter more in the development environment, whereas attributes like the framework’s stability and ability to scale to thousands of nodes matter more in production. To satisfy both the developers and the operations teams, they’ve settled on PyTorch and Caffe2. Note that PyTorch and Caffe2 aren’t the only frameworks in use, but they are the primary ones today, per Jia.
For those not very familiar with this space, you may have heard of TensorFlow, created by Google, but not Caffe2. Caffe2 is a newer, lightweight, modular and scalable deep learning framework that was created and open-sourced by Facebook in April 2017 (the “2” is because there is a Caffe “1” framework created by Jia during his UC Berkeley PhD). For infrastructure compatibility, they focus on making sure the framework supports the appropriate backend plugins to allow workloads to run efficiently across multiple platforms. For example, NNPACK, CUDA and cuDNN integration ensures Caffe2 runs well on CPUs and GPUs, while OpenGL ES and Qualcomm integration helps when deploying to mobile. Facebook also created Tensor Comprehensions and integrated it into Caffe2, shortening the typical workflow for creating new high-performance machine learning (ML) layers. During testing they found this drops workflow times from days or weeks to just minutes by automatically optimizing based on the underlying hardware’s characteristics.
During the experimentation phase, Facebook found frameworks like Caffe2 and TensorFlow too hard to debug, and developers were more comfortable with Python. To accommodate developers, they created PyTorch, an open-source deep learning platform that provides a seamless path from research prototyping to production deployment. PyTorch makes Python a first-class citizen, allowing those comfortable with NumPy, a popular Python package for scientific computing, to integrate with PyTorch easily. Developers also get the typical Python errors they are used to, which simplifies the debugging process.
Based on their Twitter feedback, this approach seems to be working effectively.
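Those two developer-experience points, NumPy interoperability and ordinary Python errors, are easy to see in a few lines. A small demonstration (assuming PyTorch is installed):

```python
# NumPy interop and Python-native errors in PyTorch.
import numpy as np
import torch

# A NumPy array converts to a tensor without copying the data,
# so in-place tensor ops are visible through the original array.
a = np.ones((2, 3))
t = torch.from_numpy(a)   # shares memory with `a`
t.add_(1)                 # in-place add; `a` now holds 2.0 everywhere
print(a)

# Because execution is eager, mistakes surface as ordinary Python
# exceptions at the line that caused them.
try:
    torch.ones(2, 3) @ torch.ones(2, 3)   # invalid matmul: (2,3) @ (2,3)
except RuntimeError as e:
    caught = type(e).__name__
    print("caught:", caught)
```

Contrast this with graph-based frameworks of the era, where a shape mismatch often surfaced only later, deep inside the session runtime, far from the line that introduced it.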
Your next question may be, “With two frameworks, how do they get models from the development framework to the production framework?” Facebook thought of that too and collaborated with Amazon and Microsoft to create ONNX, the Open Neural Network Exchange. ONNX is an open format that enables developers to more easily move models from one ML framework to another. Beyond the ML frameworks themselves, ONNX also standardizes portability across converters, runtimes, compilers and visualizers.
The image above shows creating an ONNX model with an ML framework of choice and then deploying to CPU, GPU or Arm-based hardware without having to make any vendor-specific changes, as the vendors in that graphic all support ONNX models. Another workflow could be one developer creating a model in TensorFlow, handing it off to another developer using Caffe2, and then the Caffe2 developer using the same model and deploying it to Apple devices (Apple Core ML) and NVIDIA-based GPU servers (NVIDIA TensorRT).
Even after establishing the frameworks you want to use, you still need an orchestration engine to assist with the entire ML workflow. To manage the ML workflow, Facebook created FBLearner.
FBLearner has three primary components for processing Facebook’s Machine Learning Workflow:
- Feature Store. Helpful for data manipulation and feature extraction. Has an API for developers to use to interact with the feature store. Reduces development time as common features and model attributes can be stored here with associated metadata.
- FBLearner Flow. Manages the workflow processes required during training. Takes care of requesting the required hardware, setting machines up in a cluster for training and packaging the models. Capable of easily reusing algorithms in different products and scaling to run thousands of simultaneous custom experiments.
- FBLearner Predictor. Used for serving the models that other applications make inferences against. Provides an API that makes inferences against the models easier.
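To make the division of labor concrete, here is a hypothetical sketch (emphatically not FBLearner’s real API; every class and method name here is invented) of how the three pieces relate: a feature store supplies reusable features, a flow trains a model from them, and a predictor serves it.

```python
# Hypothetical three-part orchestration sketch, loosely mirroring the
# Feature Store / Flow / Predictor split described above. The "training"
# is a trivial stand-in (per-feature means), purely for illustration.
class FeatureStore:
    """Maps entity ids to named feature dicts for shared reuse."""
    def __init__(self):
        self._features = {}

    def put(self, entity_id, features):
        self._features[entity_id] = features

    def get(self, entity_id):
        return self._features[entity_id]


class Flow:
    """Runs a training step over features pulled from the store."""
    def __init__(self, store):
        self.store = store

    def train(self, entity_ids):
        rows = [self.store.get(e) for e in entity_ids]
        n = len(rows)
        # Stand-in "model": the mean of each feature across entities.
        return {k: sum(r[k] for r in rows) / n for k in rows[0]}


class Predictor:
    """Serves a trained model behind a simple inference call."""
    def __init__(self, model):
        self.model = model

    def predict(self, features):
        # Score 1 if the example sits above the training means on average.
        diffs = [features[k] - v for k, v in self.model.items()]
        return int(sum(diffs) / len(diffs) > 0)


store = FeatureStore()
store.put("p1", {"age": 40, "bmi": 22.0})
store.put("p2", {"age": 60, "bmi": 30.0})
model = Flow(store).train(["p1", "p2"])
predictor = Predictor(model)
print(predictor.predict({"age": 70, "bmi": 33.0}))  # → 1
```

The value of the split is organizational as much as technical: features built once in the store are reusable across teams, and the serving API stays stable even as models behind it are retrained and swapped out.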
Putting it all together, this is how Facebook’s scalable AI/ML infrastructure looks with the hardware and software components discussed in this article.
A key takeaway, whether your focus is on hardware or software, is that building an ML platform from scratch that scales to millions or billions of users is difficult. To do it like Facebook, a few of the things you’d have to do are:
- Work across business units and understand their problems, mapping them to the appropriate ML models that are best for the use case
- Custom build your servers to match requirements of your ML workloads
- Benchmark available hardware and software components, determining the solutions that give the best cost-to-performance ratios
- Establish the appropriate training frequency and speeds that are aligned with the business use case
- Hire machine learning superstars like Yangqing Jia to lead your AI/ML infrastructure deployment
- Create multiple open-source deep learning frameworks for research, development and production teams
- Create an orchestration tool for end-to-end ML workflow management
- Work with partners like Microsoft, Amazon and others to create an open-neural network exchange. Then get other companies like Tencent, SAS, HP, Arm, AMD, IBM, Baidu, Alibaba onboard with the new standard
- Work with hardware vendors like Apple, NVIDIA, Intel and Qualcomm to make sure your framework has smooth integration into their hardware
Oh, and manage that AI/ML infrastructure that scales globally to service over two billion people with a team of 17.