5Vs of Big Data: Volume, Velocity, Variety, Virtualization & VMware Pt. 1
Part 1 of my series on the benefits of virtualizing big data workloads with VMware technologies. This part covers a quick intro to VMware, their upcoming initiatives, and some reasons to consider running big data on VMware.
Disclaimer: VMware hasn’t paid me or asked me to publish this article, and all thoughts and opinions are my own. I did give VMware early access to this writing to ensure I didn’t disclose any company intellectual property, shown to me at a special event under NDA, that they don’t deem ready for public consumption at the moment.
Last month I was invited out to VMware for their “Experts Workshop: Big Data, Scientific and Engineering Workloads” event, which wrapped up last week. As you can imagine from the title, we discussed VMware’s big data related technologies, and I was able to learn from other vendors across the big data landscape. There were speakers from premier big data companies, like Tom McCuch, VP of Solutions Engineering at Hortonworks, and the “Spark Godfather” Matei Zaharia, Co-founder and Chief Technologist of Databricks. Other attendees had networking, consulting, or software backgrounds in areas of science and High Performance Computing (HPC).
VMware also brought out the heavy hitters: their VP of Server Platform Technologies Richard Brunner, Sr. Director & Chief Technologist for HPC Josh Simons, and even Ray O’Farrell, the Executive Vice President & CTO of VMware. I’m assuming O’Farrell heard I was going to be there, so he changed his busy schedule around to make sure he could meet with me and present VMware’s future business initiatives, or at least that is what I’m telling everyone. In his presentation, O’Farrell laid out VMware’s future roadmap, covering their cross-cloud (private, hybrid, public) strategy and the tools they have to meet their customers’ infrastructure management requirements for emerging technology verticals like IoT, Core (Edge) Computing, and 5G networking.
I’ve noticed more and more companies trying to re-position themselves as “data companies.” VMware, by way of their parent company Dell Technologies, has pledged a $1 billion R&D investment in IoT to go after what they call the “Tidal Wave of IoT Opportunity.” From Figure 1 we see VMware’s plan is to provide virtualization technologies from the data center to the edge, allowing customers to easily manage servers or devices and deploy applications where needed in a secure fashion. Why are we talking IoT in a big data article? IoT and edge computing efforts matter for big data because both will provide us with more data than ever before while pushing computation and analysis closer to the device, meaning we will need to re-examine existing big data technologies along with architectural and access patterns. Let’s get into how VMware sees their overall business initiatives manifesting when it comes to virtualization of big data workloads, but first I’ll define what I mean by big data, as the term has become a bit ambiguous.
Defining Big Data
Before getting into VMware’s value proposition for running big data workloads on their technologies, let’s first quickly level set on what big data is. The quickest and best description of big data I use, from Eddie Satterly, is “data that doesn’t fit into traditional data models.” I like this description because when you focus solely on speed or the size of data, your results will vary from one use case to another. A “big data” application to one company that is getting GBs a day of data may be “small data” to a company bringing in 605 TB every hour. As traditional data wrangling technologies couldn’t handle the new use cases created by this large-scale data generation, a new solution, “big data,” was born. Since then, along with the explosion of data, we’ve seen an explosion in new technologies trying to wrangle it.
Most big data tools can be broadly categorized as tools that help you collect, store, process, or analyze data. Some tools span multiple categories and others, like machine learning’s analysis of data, create a new subcategory, but overall modern big data tools fit in one of these four categories (at least they do for the purpose of this article). If you need a deeper dive on “What is Big Data?” from a business or technical sense, then follow the links. Now that we’ve level set on a definition of big data, let’s discuss how VMware wants to help you with those workloads.
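As a quick sketch of the taxonomy above, here is how some well-known tools might slot into the four categories. The mapping is illustrative only (many tools legitimately span several buckets), and the helper function simply looks a tool up across categories:

```python
# Rough mapping of the four big data tool categories to well-known
# examples. Illustrative only: many tools span several categories.
BIG_DATA_CATEGORIES = {
    "collect": ["Apache Kafka", "Apache Flume", "Logstash"],
    "store":   ["HDFS", "Apache HBase", "Apache Cassandra"],
    "process": ["Apache Spark", "Hadoop MapReduce", "Apache Flink"],
    "analyze": ["Apache Hive", "Presto", "Spark MLlib"],
}

def categorize(tool):
    """Return every category a given tool appears under."""
    return [cat for cat, tools in BIG_DATA_CATEGORIES.items()
            if tool in tools]

print(categorize("Apache Spark"))   # ['process']
print(categorize("Apache Kafka"))   # ['collect']
```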
Why Big Data on VMware?
How VMware Sees the World of Big Data
From O’Farrell’s presentation, it is clear that VMware is tracking the onslaught of new data being generated, by people or devices, and looking to arm their enterprise customers with the necessary tools to navigate the big data jungle. From VMware’s dedicated big data site they state,
“VMware is the best platform for big data as well as traditional applications. Virtualizing big data applications simplifies the management of your big data infrastructure, delivers faster time to results and is more cost effective.”
They believe their offering’s ability to simplify management, deliver faster results, and reduce cost lets you clear big data’s biggest hurdles, which they identify as a lack of IT expertise, a shortage of enterprise-grade tools, and budget constraints. I’ll now take a deeper dive into what these benefits look like in execution.
Virtualized Architecture and Managing Big Data Workloads
Virtualization of Hadoop is a pretty common, well-known use case by now. I feel comfortable saying the majority of Hadoop workloads run in some type of virtualized environment rather than on bare metal. The value propositions of virtualization (ease of deployment, flexibility to scale as needed, ability to separate compute and storage, etc.) are well known and mostly achievable with any virtualized hardware. Rather than look at why virtualization is good for Hadoop in general, I tried to focus on how VMware differentiates itself from other virtualization technologies.
During the workshop’s presentations, I couldn’t discern VMware’s major differentiation in the virtualization and management of big data technologies like Hadoop or Spark. This is not to say that there are none: you can look at the other leader in Gartner’s Magic Quadrant for x86 Server Virtualization, Microsoft’s Hyper-V, and find differences in supported operating systems, pricing, ease of deployment, public cloud support, supported hardware, and more. But when looking specifically at differentiation for big data workloads, nothing immediately jumped out at me. A lack of an immediately noticeable differentiation over competitors isn’t necessarily a bad thing: VMware customers don’t have to look elsewhere to satisfy their big data needs, as the platform can help in realizing all of the virtualization benefits mentioned earlier. To realize these benefits, they’ve provided a Best Practices for Virtualized Big Data Applications guide to help with hardware selection, VM sizing, application tuning, and testing of your setup through benchmark analysis. Now on to accelerators.
Accelerating Big Data Workloads
All you have to do is take a look at NVIDIA’s stock performance over the past couple of years to see that accelerators like GPUs are really taking off across multiple compute use cases, primarily artificial intelligence. FPGAs are also gaining ground, in use cases like SmartNICs and Machine Learning (ML), due to their high efficiency and low power consumption.
As ML is utilized more and more in the analysis phase of big data, we are seeing a rise in GPU consumption: GPUs have significantly more cores (1,000 cores on average) than CPUs (12 cores on average), which makes them ideal for the repeated matrix multiplication required in Deep Learning (DL), a subfield of ML.
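To make the matrix multiplication point concrete, here is a minimal pure-Python sketch of a dense layer’s forward pass. Each output element is an independent dot product, which is exactly why thousands of GPU cores can compute them in parallel while a 12-core CPU works through them largely serially (the example data is hypothetical):

```python
# A dense neural network layer's forward pass reduces to a matrix
# multiply: output[i][j] is an independent dot product of input row i
# and weight column j, so all output elements can run in parallel.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

# Tiny example: a 2x3 input batch times a 3x2 weight matrix.
X = [[1, 2, 3],
     [4, 5, 6]]
W = [[1, 0],
     [0, 1],
     [1, 1]]
print(matmul(X, W))  # [[4, 5], [10, 11]]
```

On real hardware a framework like TensorFlow or PyTorch would dispatch this same operation to cuBLAS kernels on the GPU rather than a Python loop.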
GRID vGPU is recommended when you need to pair one VM to one GPU; for applications that require short training times and use multiple GPUs to speed up machine learning tasks, DirectPath I/O is the better option. For more details on the performance differences, check this article out. Note that I’ve used NVIDIA GPUs as an example, but other accelerators, like Intel’s Xeon Phi or AMD’s FirePro S7150x2, can also be utilized by vSphere in a similar manner. On to the important question: what is the virtualization “hit” on ML workloads?
For language modeling with a Recurrent Neural Network (RNN) on the Penn Treebank (PTB) data set, there was only a 4% performance hit due to virtualization. For most ML workloads, a 4% hit is minor when weighed against the benefits you get from virtualization. Looking at other GPU benchmarks across machine learning and graphics rendering workloads, the virtualization hit is anywhere from 1% to 4%, so the 4% for this specific language modeling use case sits at the upper end of that expected range. For more details and examples of ML on vSphere with accelerators, check out VMware’s presentation given at the 2017 GPU Technology Conference.
VMware’s technology stack shows a lot of maturity when used for big data and machine learning workloads. In the next section we’ll explore more of the answers I received when asking why a customer should run their big data workloads on VMware. Part 2 will cover networking capabilities, persistent memory, and benchmarks of big data workloads. I’ll end Part 2 by answering the question of whether you should move your big data workloads from another solution to VMware or, for existing customers, from VMware to an alternative.