5Vs of Big Data: Volume, Velocity, Variety, Virtualization & VMware Pt. 2
This is Part 2 of my series on the benefits of virtualizing big data workloads with VMware technologies. It continues Part 1’s reasons to consider big data on VMware and closes with a conclusion as to whether big data on VMware is right for you.
Disclaimer: VMware hasn’t paid me or asked me to publish this article, and all thoughts and opinions are my own. I did give VMware early access to the writing to ensure I didn’t disclose any of their intellectual property, shown to me under NDA at a special event, that they don’t yet deem ready for public consumption.
Last month I was invited out to VMware for their “Experts Workshop: Big Data, Scientific and Engineering Workloads” event, which wrapped up last week. Over the three-day session, VMware and their partners discussed their initiatives around big data and the capabilities of their software stack for big data and High Performance Computing (HPC). Part 1 gave a brief overview of what big data is, how VMware positions their platform for big data, the unique benefits of deploying and managing big data workloads on VMware, and the platform’s ability to utilize accelerators. This article picks up where the last left off, jumping into more reasons why you should run your big data workload on VMware.
More Reasons Why Big Data on VMware?
VMware also has a persistent memory initiative which will have significant effects on the speed of writing to and reading from storage with big data workloads.
The problem with available storage options is that they are either super-fast but volatile and expensive, or inexpensive and non-volatile but slower than modern applications require.
This new storage will speed up big data workloads by affordably putting more data next to the processor and reducing the I/O bottlenecks associated with existing storage options. As persistent memory (PMEM) is still bleeding-edge technology, VMware is actively working with hardware vendors, OEMs and ISVs to develop and support PMEM in vSphere.
Future architecture will look something like what is shown in Figure 4. For more details on persistent memory, check out Richard Burner’s vSphere’s Virtualization of PMEM talk at SNIA’s Persistent Memory Summit.
Mellanox also had representation at the event. Their Sr. Director of Enterprise Market Development Motti Beck and Distinguished Architect Liran Liss discussed the partnership with VMware and shared their roadmap of upcoming product and service releases. The focus was on how the partnership between VMware, Mellanox and hardware providers like Dell can enable high networking throughput with low latency which big data workloads benefit from as data sets grow.
As we got into discussions about bandwidth, it was clear that 10 Gigabit Ethernet (10 GbE) is no longer considered acceptable in the world of big data. Mellanox made the comment,
“25 GbE is the new 10 GbE. We recommend customers don’t consider purchasing a system if this capability isn’t there.”
In response, a Dell Technology Services (DTS) Architect stated they’ve done a price analysis of 25 GbE cards and found the price is 125% more but the performance is 250% more, meaning your overall cost per unit of performance is lower with the higher-bandwidth cards. The technologies shown didn’t stop at 25 GbE either, as the roadmap spanned the current 100 GbE offerings up to future 400 GbE offerings.
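The cost-per-performance claim is easy to sanity-check. A minimal sketch of the arithmetic, reading the quoted figures as roughly 1.25x the price for roughly 2.5x the throughput relative to a 10 GbE baseline (the talk didn’t define its exact baseline, so these normalized numbers are my assumption):

```python
# Hypothetical back-of-the-envelope cost-per-performance comparison.
# Baseline 10 GbE card normalized to price 1.0, performance 1.0;
# the 25 GbE multipliers below are my reading of the quoted figures.
base_price, base_perf = 1.0, 1.0
price_25gbe = base_price * 1.25   # ~1.25x the price of the 10 GbE card
perf_25gbe = base_perf * 2.5      # ~2.5x the throughput

cost_per_perf_10 = base_price / base_perf   # 1.00
cost_per_perf_25 = price_25gbe / perf_25gbe # 0.50

print(f"10 GbE cost per unit of performance: {cost_per_perf_10:.2f}")
print(f"25 GbE cost per unit of performance: {cost_per_perf_25:.2f}")
```

Under these assumptions, each unit of sort or shuffle throughput costs half as much on the 25 GbE card, which is the Architect’s point.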
Based on when I got into tech, 400 GbE is an insane number to me. For context, you could download all of the Star Wars movies in 4K quality (100 GB per movie, 1 TB total) in about 20 seconds (well, at least the 10 Star Wars movies that matter). To go with the 400 GbE solutions, they also discussed the BlueField family of products, Mellanox’s ARM-based systems-on-a-chip (SoCs) optimized for NVMe storage systems, Network Functions Virtualization (NFV), security systems and embedded appliances.
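The Star Wars math works out like this: 1 TB is 8,000 gigabits, and at a 400 Gb/s line rate that takes 20 seconds. A small sketch of the ideal-case calculation (it ignores protocol overhead, storage speed, and every other real-world bottleneck):

```python
def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Seconds to move size_gigabytes over a link_gbps link at full line rate."""
    gigabits = size_gigabytes * 8  # gigabytes -> gigabits
    return gigabits / link_gbps

# 10 movies x 100 GB = 1 TB, over successive Ethernet generations
for gbps in (10, 25, 100, 400):
    print(f"{gbps:>3} GbE: {transfer_seconds(1000, gbps):,.0f} s")
# 10 GbE needs 800 s (~13 minutes); 400 GbE needs 20 s
```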
Mellanox is considered a leader in networking solutions, and it was made clear that through their strong relationship with VMware, Mellanox’s advancements in networking technologies will be enabled in VMware’s virtualization stack, ensuring you have the best networking performance no matter what big data workload you’re running. The examples in Figure 6 were provided to show how Mellanox enables big data analytics with Machine Learning. For more details around Mellanox’s value proposition to their customers, check out their Enabling the Use of Data presentation.
But Does It Really Work in the Field?
If you’ve ever been hands-on in the field, you’ve experienced the frustration of new technologies not quite delivering on the promises made when they were sold to you. To alleviate those doubts, VMware’s Big Data Performance Lead, Dave Jaffe, delivered some of my favorite presentations (no offense, O’Farrell and Zaharia) showing actual performance of big data workloads running on VMware and alternative infrastructures.
For the benchmarks, Dave provided details of the hardware he tested on (Figure 7), the software components (Figure 8), and the node and rack placement of the servers. He then presented results from a common big data performance benchmark, TeraSort, showing how with 4 virtual machines you can get better sort performance than with a bare-metal server.
I found those results pretty amazing, as we have always been trained to think that virtualization carries an overhead, where in this case it had no overhead and actually performed faster. This trend continued with other big data technologies benchmarked on VMware. In Part 1, we already saw that across various Machine Learning (ML) workloads, the virtualization overhead for GPU performance is ~4% on average. Dave’s benchmarks of Apache Spark continued the trend of minimal virtualization overhead for machine learning, and in some places the virtualized setup performed better than bare metal.
Logistic regression, k-means (Figures 10, 11) and random forest are classification, clustering and regression algorithms used in Spark Machine Learning (Spark MLlib, commonly called SparkML) workloads. Across all three, Dave showed where VMware’s hypervisor could provide you at least bare-metal performance, if not better. If you’re looking to get similar performance out of your hardware with VMware virtualization, or want more details on the benchmarks for the logistic regression and random forest use cases, check out the VMware performance team’s verbosely named whitepaper, Fast Virtualized Hadoop and Spark on All-Flash Disks — Best Practices for Optimizing Virtualized Big Data Applications on VMware vSphere 6.5. Maybe for vSphere 7 Dave will cut it to three words: “VMware+Hadoop=Fast”
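To make concrete what the k-means benchmark is actually exercising, here is a minimal single-machine sketch of the underlying algorithm (Lloyd’s iteration) on 2-D points. This is plain Python for illustration only, not Spark code; SparkML’s KMeans runs the same assign/update loop but distributes the assignment step across executors, which is exactly the work that stresses CPU, memory and network in the benchmark.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two obvious blobs, one near (0, 0) and one near (10, 10);
# the centers converge to the two blob means.
pts = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0),
       (10.1, 9.9), (9.8, 10.2), (10.0, 10.0)]
print(sorted(kmeans(pts, 2)))
```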
VMware’s technology stack is one of the most mature out there when it comes to virtualization of your big data infrastructure. If you have any of these questions when considering a virtualization technology for your big data workloads:
- Can I use accelerators (FPGAs, GPUs, etc.) in my workloads?
- Is there integration with open-source technologies, like using Kubernetes to manage containers on the platform?
- Can I virtualize bleeding-edge hardware as it comes available, like persistent memory (PMEM)?
- Does VMware partner with big data platform vendors like Hortonworks, Cloudera, Databricks, etc. to ensure VMware runs their technologies optimally?
- Will I have access to enterprise-grade software and support?
VMware either currently provides a solution for you or is thinking deeply about integration of the technology into their virtualization stack. This is not theoretical as VMware has the benchmarks and Fortune 500 and Global 2000 companies running their production big data workloads on top of VMware’s technology stack to prove it. On to the major question…
Do I Dump My Existing Technology and Call VMware Sales?
My simple two-part answer is:
- It’s your money, spend it how you like…
- Think about the ecosystem.
The question of which big data virtualization technology is best for you is answered by considering your business needs and weighing them against each vendor’s ecosystem. I don’t foresee many existing VMware customers analyzing the VMware technology stack for big data and deciding to run their workloads elsewhere, just as I don’t foresee a Google Cloud, Amazon Web Services or Microsoft Azure customer leaving their existing big data technology stack to use VMware’s. With that said, I didn’t get the feeling VMware’s desire for its big data virtualization efforts is to pull customers from AWS’s, Azure’s or Google’s clouds. I believe instead they want to ensure that if their existing customers decide to venture into big data workloads, they can:
- Execute big data workloads in their existing VMware ecosystem without having to make trade-offs due to a lack of maturity with the technology stack
- Deploy and manage that infrastructure through the same interface they already use for their other infrastructure, with little to no requirements to re-train resources
- Run big data workloads on owned infrastructure to meet internal or external compliance requirements
- Use a technology that spans private, hybrid, public and edge cloud usage models
- Get the same level of enterprise support they’ve grown accustomed to with VMware
To those efforts, from what I’ve seen, VMware has succeeded thoroughly.