Scaling Enterprise Machine Learning Through Governance & MLOps

This post covers the challenges organizations face when trying to operationalize ML and how best to move from experimentation into production. It's written for business executives and IT leaders, but it's also applicable to technical staff challenged with creating, supporting or using production-grade artificial intelligence and machine learning platforms.

Background

In my roles as a customer success and business development executive covering Artificial Intelligence & Machine Learning (AIML) at leading tech companies, I’ve spoken with executives, data scientists and IT managers across startups, Fortune 500 and Global 1000 companies about their AIML needs. After discussing what AIML is and which platform features or API services are easiest for non-specialists to use, companies get stuck on an equally important component of enterprise AIML: governance of operations. Companies get caught up in the hype led by consultants and industry media outlets that claim AIML-led digital transformation is happening across every industry, in companies of all sizes, with millions of models being deployed to production weekly. AIML software vendors promise that adopting their solution enables instant production readiness, allowing their customers to “Build and deploy a machine learning model in 9 minutes” with limited or no expertise. The reality is not quite as advertised, but I’ll help you on your journey by discussing why deploying ML in production can be difficult, providing a way to assess your return on investment (ROI) with AIML, showing how to create a comprehensive ML platform and offering a framework for assessing your organization’s AIML maturity so you can determine which capabilities you need to acquire to improve your org’s proficiency.

What is MLOps?

There are many definitions of Machine Learning Operations (MLOps) and governance, but to keep things simple, I’ll define governance and MLOps as the best practices and policies businesses need to run AIML successfully. While simple, this definition captures the desired outcome of running AIML successfully and makes clear that defined best practices and policies are what set you up for success. For newcomers to this space, MLOps is different from AIOps; the latter focuses on using AI to simplify complex IT operations management (e.g. using AI to predict network downtime, notify IT staff and/or make necessary changes to networking gear to avoid downtime). For a deeper dive into MLOps, check out the recent MLOps: From Model-centric to Data-centric AI talk from Andrew Ng, Founder & CEO of Landing AI, Coursera and deeplearning.ai, adjunct professor at Stanford, prior Chief Scientist at Baidu, Co-Founder and head of Google Brain…a Godfather of AIML. Check out Tesseract’s video for more details on AIOps.

Why is Deploying AIML in Production So Hard?

Gartner’s Predicts 2020: Artificial Intelligence — The Road to Production found that 50% of machine learning projects never see the light of day. Algorithmia’s 2020 State of Enterprise Machine Learning Report clarifies the difficulty of deployment, finding that the 50% of models that do make it take 18 months to get into production, a far cry from an advertised 9 minutes. Let’s review a typical diagram you’d see when someone talks about MLOps (Figure 1).

Figure 1: Sample MLOps Workflow

If you’ve managed enterprise IT systems, you’ll see the missing pieces needed to make Figure 1’s workflow production-ready. Similarly, CxOs, VPs, and heads of product development, data science, engineering and analytics call out barriers to getting AIML into production successfully that technology components and a flow diagram don’t address. Missing considerations include:

  • Product & Program Management — someone to rank AIML projects to ensure the data science team is focused on the highest-value projects, not just the coolest (e.g. geekiest) or easiest to complete, and to track relevant milestones
  • ROI Calculation — determining the metrics for measuring which AIML projects are the best investments for your company and evaluating the efficiency or profitability of production AIML
  • Cost Controls — putting guardrails in place to make sure AIML costs remain within budget and infrastructure components are used optimally
  • Identity & Access Management — identifying who is accessing what, auditing activity and preventing unauthorized access
  • Persona Enablement — ensuring the AIML platform meets the current and future needs of each persona and enables knowledge and asset sharing
  • Explainability — explaining the underlying model logic and proving predictions aren’t biased and don’t violate compliance requirements
  • Model & Data Governance — model risk management to govern the development, validation, approval, modification, implementation, retirement and inventorying of models, plus a functional framework to manage data across the ten data management functions

While not comprehensive, these points show that a workflow or technology alone doesn’t address every need for deploying and governing enterprise AIML. I’ll now further address two of the more critical considerations: determining ROI and creating a comprehensive AIML platform.

Enterprise-Grade AIML Platforms

Figure 2: Example of an Enterprise AIML Platform’s Components by Algorithmia

When building or adopting a comprehensive enterprise AIML platform, consider the following:

  • Users: Which personas do I need to enable? (e.g. data scientists, citizen data scientists, software engineers, business analysts, ML engineers, IT operations)
  • User Interface (UI) & Experience (UX): Does the platform offer an intuitive, easy-to-use UI for both technical and non-technical users? Are training, support and enablement readily available for end-users?
  • Supported AIML Operations: Which operations should this platform support? (e.g. collaboration; data ingestion, preparation, exploration and governance; feature engineering, training and hyperparameter optimization; model creation, testing, deployment and monitoring; bias detection and explainability; business value tracking)
  • Integrations & Interoperability: Which applications should my platform integrate with?
  • Infrastructure: Managed vs. unmanaged? Which deployment technologies best match existing skill sets? (e.g. containers, VMs, serverless) Should my platform be single-cloud, multi-cloud or on-prem?

The above represent the primary considerations, but there are other questions around open-source vs. proprietary, adherence to security and compliance requirements, build vs. buy and more. Reading reports like Gartner’s March 2021 Magic Quadrant for Data Science and Machine Learning Platforms (free from Alteryx) helps, as it compares and contrasts ML platform vendors, showing their strengths and weaknesses. There is also a Magic Quadrant for Cloud AI Developer Services (free from Google). In parallel with building an enterprise-ready AIML platform, you can create a framework to assess the ROI of AIML-enabled business outcomes.
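
To make a vendor comparison concrete, the criteria above can be turned into a simple weighted scorecard. The criteria names, weights and scores below are made-up placeholders for illustration, not recommendations; replace them with your own evaluation data.

```python
# Hypothetical weighted scorecard for comparing AIML platforms against the
# criteria discussed above. Weights and scores are illustrative placeholders.

CRITERIA_WEIGHTS = {
    "personas_supported": 0.25,   # Users
    "ui_ux": 0.15,                # UI & UX
    "aiml_operations": 0.30,      # Supported AIML Operations
    "integrations": 0.15,         # Integrations & Interoperability
    "infrastructure_fit": 0.15,   # Infrastructure
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0-5 scale) into one weighted total."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

# Example evaluation of a fictional platform
platform_a = {"personas_supported": 4, "ui_ux": 3, "aiml_operations": 5,
              "integrations": 4, "infrastructure_fit": 2}
print(round(weighted_score(platform_a), 2))  # 3.85
```

Weighting AIML operations and persona coverage most heavily reflects the emphasis of this post; adjust the weights to match your organization’s priorities.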

Determining ROI with AIML

You should know the ROI of AIML-enabled business use cases before starting the work. This aligns with Stephen Covey’s “Begin with the End in Mind,” Amazon’s secret to success of Working Backwards and general common sense, although it’s not often done. When considering ROI, sort projects into one of the three categories most common for organizations using AIML for digital transformation:

  • Optimization Effort. Improves productivity or reduces cost, with a 2x ROI on average. An example is a car manufacturer using sentiment analysis to classify customer support emails in queues, making it easier for support staff to pick or prioritize the appropriate tickets for immediate response. This increases customer satisfaction, reduces churn and reduces the average time support agents spend on tickets, lowering the average OpEx cost of supporting customers
  • Improved Decision Making. Improves a customer experience, revenue or margin, with a 10x ROI on average. Imagine the same car manufacturer using natural language processing to create a chatbot that responds to customers’ support needs faster with fewer human support agents, further reducing OpEx
  • Business Model Innovation. Disrupts industries. Creates new markets, businesses or revenue streams, with an average ROI of 100x. Picture the car manufacturer adding sensors to its cars and using the sensor data with deep learning for object detection and classification to create a new autonomous driving service. New industry, new revenue streams, C.R.E.A.M. to the ceiling 🤑

90% of AIML projects fit into the Optimization and Improved Decision Making categories, as they are more easily defined, attained and measured by organizations. Business Model Innovation can be the most transformational but also the most challenging to undertake. Before starting an initiative, categorize it into one of these buckets and assess your organization’s AIML maturity. Less mature organizations should start with Optimization and Improved Decision Making projects, as these better align with their ability to execute. More mature organizations are better positioned to drive Business Model Innovation with AIML.
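
A rough sketch of how these categories can frame expected returns during prioritization, using the average multipliers above (2x, 10x, 100x). The category keys and the function are my own illustration, not a formal financial model; real projects will vary widely around these averages.

```python
# Illustrative only: frame expected returns using the average ROI multipliers
# discussed above. Treat the output as a prioritization aid, not a forecast.

AVG_ROI_MULTIPLIER = {
    "optimization": 2,                 # productivity / cost-reduction efforts
    "improved_decision_making": 10,    # better CX, revenue or margin
    "business_model_innovation": 100,  # new markets or revenue streams
}

def expected_return(category: str, investment: float) -> float:
    """Estimate the expected return of an AIML project from its category."""
    if category not in AVG_ROI_MULTIPLIER:
        raise ValueError(f"Unknown category: {category}")
    return investment * AVG_ROI_MULTIPLIER[category]

# Example: a $250k optimization effort vs. a $1M innovation bet
print(expected_return("optimization", 250_000))                 # 500000.0 range
print(expected_return("business_model_innovation", 1_000_000))  # far larger, but far riskier
```

Note that the higher multiplier for Business Model Innovation comes with correspondingly higher risk and a longer time to value, which is why maturity matters when choosing a category.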

Assess Your Organization’s AIML Maturity Level

Your organization’s maturity level with adopting and deploying AIML can be assessed across the six dimensions shown in Figure 3.

Figure 3: Categories to Assess ML Maturity

For each of these categories, I’ll contrast what it means to have a low, medium or high maturity so, where appropriate, you can identify gaps.

Organizational Alignment

  • Low Maturity — no dedicated data scientists, tools chosen ad-hoc by AIML practitioners, IT Operations (IT Ops) is mostly opportunistic
  • Medium Maturity — multiple data science teams with tools chosen by need, IT-integrated AIML operations with some IT planning
  • High Maturity — centralized data science, a standardized platform and operations for management of tooling, IT Ops with established KPIs and an understanding of AIML performance requirements across business groups

Data & Training

  • Low Maturity — individual data owners, few or new data scientists using shared or personal file storage for data, limited IT oversight across ad-hoc projects
  • Medium Maturity — department-level data administration, data scientists using general corporate shared file storage systems, IT Ops applying emerging data governance principles across concurrent AIML projects
  • High Maturity — an executive owner of data (e.g. Chief Data Officer), IT Ops creating a data-first strategy, IT Ops understanding the challenges, complexities and value of effective data management

Deployment & Operations Management

  • Low Maturity — data scientists doing everything, limited tooling and workbenches, manual handoffs, limited or no integration across tools
  • Medium Maturity — more personas (e.g. data scientists, developers, DevOps) handling deployment & operations, some automation of the AIML deployment process, platform-specific infrastructure management, IT Ops providing non-standardized, workload-specific deployment automation
  • High Maturity — multiple experienced deployment & operations personas with clearly defined roles and responsibilities, persona- and role-specific tooling, IT Ops providing one-button deployment, CI/CD pipelines and infrastructure-agnostic tooling with performance monitoring

Governance

  • Low Maturity — no governance from IT Ops, data scientists using spreadsheets, sticky notes or napkins to track efforts
  • Medium Maturity — more personas involved with governance (e.g. DevOps, developers), reporting tools disconnected from other AIML efforts, ad-hoc management of AIML efforts from IT Ops
  • High Maturity — executive ownership of governance (e.g. Chief Risk Officer), an MLOps governance platform with IT Ops managing access, reporting and policies that integrate with existing IT operations
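
The rubric above can be turned into a lightweight self-assessment. Scoring low=1, medium=2, high=3 per dimension is my own assumption for illustration; the framework itself is qualitative, so use the output only to spot your weakest areas.

```python
# Lightweight self-assessment sketch based on the maturity rubric above.
# The numeric low/medium/high scoring is an illustrative assumption.

LEVELS = {"low": 1, "medium": 2, "high": 3}

DIMENSIONS = [
    "organizational_alignment",
    "data_and_training",
    "deployment_and_operations",
    "governance",
]

def assess(ratings: dict):
    """Return the average maturity score and the lowest-rated dimensions (gaps)."""
    scores = {d: LEVELS[ratings[d]] for d in DIMENSIONS}
    avg = sum(scores.values()) / len(scores)
    lowest = min(scores.values())
    gaps = [d for d, s in scores.items() if s == lowest]
    return avg, gaps

avg, gaps = assess({
    "organizational_alignment": "medium",
    "data_and_training": "low",
    "deployment_and_operations": "medium",
    "governance": "low",
})
print(avg, gaps)  # 1.5 ['data_and_training', 'governance']
```

An organization with this example profile would focus first on data management and governance, the two lowest-scoring dimensions, before attempting more ambitious AIML initiatives.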

Closing the Maturity Gap

Your desire to increase maturity and close identified gaps should be driven by your organization’s current capabilities and how well they align with your ability to execute against desired AIML business outcomes. Closing gaps can be done by adding personas with the needed expertise (e.g. Chief Science Officer, Chief AI Officer) to ensure AIML is applied successfully across organizational silos. Reviewing How Uber Organized Their AIML Teams, or how other AI-forward companies operate, may also help. Working with professional services (consulting) firms is another option; a good one will pair your team with its ML experts to identify and build ML solutions. In addition to governance of AIML operations, these firms can help you discover relevant use cases through ideation sessions, work backwards from your business problem and deliver a step-by-step implementation plan that includes training for your business and technical staff.

Conclusion

Adopting, governing and scaling AIML is still an emerging area for many organizations, so don’t stress if you don’t get everything right initially. The process is iterative and well worth the pursuit. In Notes from the AI frontier: Modeling the Impact of AI on the World Economy, McKinsey establishes the stakes for those who can successfully roll out AIML, stating that “by 2030, companies that fully absorb AI could double their cash flow” while “companies that don’t could see a 20% decline.” Apply the key takeaways I’ve provided, exercise a little patience, and it’s just a matter of time before you’ve established a lean, mean, AIML-business-outcome-delivering machine. Good luck on your journey.

If you enjoyed this article, please tap the claps 👏 button and share on social media.

Interested in learning more about Jamal Robinson or want to work together? Reach out through LinkedIn.

Enterprise technologist with experience across cloud, artificial intelligence, machine learning, big-data and other cool technologies.