By now, even your non-techy mom has probably heard of Big Data, with IBM and others advertising it on TV and every other IT vendor pushing their platform. If you don't know about big data, here it is in a nutshell: as more and more devices are connected to the internet and storage capabilities continue to advance, we're able to collect, store, and run analytics on massive sets of information in order to discover insights and make more informed decisions.
Some industries, like research, oil and gas, manufacturing, and logistics, have been doing this for years, often on dedicated hardware. The advantages of virtualization can be leveraged for big data too, even though at its core big data is about distributing jobs across a wide array of resources, while virtualization, which consolidates many workloads onto shared hardware, is in concept the exact opposite.
If you’re gearing up for a big data deployment, you can use VMware tools to stack it on top of virtual machines, allowing you to add resources easily when you need to run large analytics jobs and scale back when you don’t need as much processing power or want to delete old unused datasets from storage. This elasticity helps maximize your available compute resources and can be used in a mixed-workload environment. Plus, you can manage and automate your big data VMs from the same tools as your other infrastructure.
Here is a quick primer on what to keep in mind with VMware big data platforms.
Big data hardware and storage must be able to scale with the rapid growth of information, handle enormous datasets, and provide the IOPS (I/O operations per second) needed for analytics software. Direct Attached Storage (DAS) is often the best option for keeping latencies low. The need for solid-state drives in each server means dedicated hardware is still common in the big data world, especially for near-instantaneous calculations such as deciding which custom web advertisement to serve.
Virtualized big data is better suited when the need is not quite as urgent, since virtualization inherently adds latency. A service provider should be able to offer both scale-out Network Attached Storage (NAS) and DAS options for virtualized hardware; note, however, that some VMware features like vMotion can't be used with DAS. Scale-out NAS is an acceptable storage architecture that can solve some issues such as data protection and efficiency, with offerings from EMC and others.
The general recommendation for a Hadoop node is two quad-core processors, 24-48 GB of memory, and 4-6 disks of 2 TB each; a deployment generally requires several slave machines with at least this configuration, plus one master machine.
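To get a feel for what that spec means in practice, here is a back-of-the-envelope capacity sketch. The 3x replication factor (HDFS's default) and the 25% overhead reserved for intermediate data and the OS are common rules of thumb, not figures from this article:

```python
# Rough usable-capacity estimate for a Hadoop cluster built from
# slave nodes matching the recommended spec above.
# Assumptions (rules of thumb, not from this article):
#   - HDFS replication factor of 3 (the default)
#   - ~25% of raw disk reserved for shuffle/intermediate data and OS

def usable_capacity_tb(slaves, disks_per_slave=6, tb_per_disk=2,
                       replication=3, overhead=0.25):
    """Return estimated usable HDFS capacity in TB."""
    raw_tb = slaves * disks_per_slave * tb_per_disk
    return raw_tb * (1 - overhead) / replication

# Five slave nodes at the high end of the recommended spec:
# 5 * 6 * 2 = 60 TB raw, minus overhead, divided by replication.
print(usable_capacity_tb(5))  # -> 15.0
```

The takeaway: raw disk numbers shrink quickly once replication and overhead are accounted for, which is worth remembering when comparing DAS and NAS sizing quotes.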
Born out of VMware's Serengeti project, announced nearly three years ago to virtualize Hadoop on VMware, the Big Data Extensions (BDE) for vSphere support Hadoop 2 management from vCenter, including HDFS, MapReduce, Pig, Hive, and HBase. BDE requires at least vSphere 5.0 with an Enterprise or Enterprise Plus license, and it runs on top of two VMs, one for management and one Hadoop template, to create, configure, and assign master and slave roles.
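To make the master/slave split concrete, BDE's Serengeti-based tooling describes a cluster with a JSON spec along these lines. The field and role names below follow the open-source Serengeti spec format as a sketch; exact names and values vary by BDE release and Hadoop distribution, so treat this as illustrative and check the documentation for your version:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 4,
      "memCapacityMB": 16384,
      "storage": { "type": "shared", "sizeGB": 50 }
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "cpuNum": 4,
      "memCapacityMB": 8192,
      "storage": { "type": "local", "sizeGB": 2000 }
    }
  ]
}
```

The useful point is the shape of the spec: one node group per role set, with per-group VM counts and resource sizes, which is what makes scaling workers up or down from vCenter straightforward.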
BDE also supports other Hadoop distributions besides Apache’s, like Cloudera, Pivotal, Hortonworks, and MapR. Unlike the open-source version of Serengeti, BDE enables runtime automation and elastic scaling.
BDE has shown runtime improvements of 13% when each Hadoop workload host is split into two to four virtual machines, bringing it up to par with a dedicated environment.
While it comes with its own considerations for compute, storage, and networking resources, and performance can still lag behind dedicated hardware, virtualizing big data platforms brings greater scalability, integration with existing cloud infrastructure, and elastic, automated management with separate scaling and deployment of storage and compute.
Both in-house virtualized infrastructure and service provider clouds can support BDE for vSphere, making it a great way to dip your toes into big data, or even to manage a large-scale implementation using the tools you already have. This simplifies operations and can also enable a self-service portal for data scientists and other analysts to access big data platforms without needing infrastructure expertise.