The Essentials of Big Data Analytics in the Cloud

Written by Joe Kozlowicz on Wednesday, December 10th 2014. Categories: Cloud Hosting, Cloud Storage, VMware

Big data is driving new business insights and is hyped as a world-changer as more and more devices connect to the internet (Gartner predicts enterprise data growth of 800% between 2011 and 2015). Big data is the practice of locating patterns in enormous datasets to make better decisions. It enables intelligent decision-making across industries and applications.

Scalable cloud environments are great for big data platforms, but they come with their own set of planning and management concerns. In many big data environments, a hybrid solution may be the best fit. Let’s see what IT managers have to contend with to deliver the big data insights demanded by CIOs and CEOs.

Types of Data and Primary Planning

Data comes in two flavors: structured and unstructured. Big data platforms can work with both. Unstructured data is unorganized content in many formats, like e-mails, documents, and media. Structured data is organized in a defined fashion, like contact information stored in named fields.

There are two main issues when evaluating and running a big data platform in the cloud: storage and performance. Big data workloads usually run either constantly, which enables predictive analytics and is better suited to on-premises infrastructure, or in batches. Even constant data analysis can encounter sudden, massive spikes in resource demand.

Monitoring and Streamlining Big Data Performance

These intense and sudden resource crunches strain networks, storage, and virtual machines alike. Cloud infrastructure is great for quick and elastic adjustments of resources, so companies don’t need to permanently purchase infrastructure to cover a potential maximum spike in demand. Instead, they can ramp up and down as needed. Monitoring tools are essential to keep an eye on resource consumption, as well as network activity.
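To make the trade-off concrete, here is a minimal sketch comparing capacity sized permanently for the peak against capacity that follows demand. The hourly demand figures are hypothetical and chosen only for illustration; real billing models (minimum instance counts, startup delays, reserved pricing) would change the numbers.

```python
def peak_provisioned_vm_hours(hourly_demand):
    """VM-hours paid for when capacity is permanently sized for the peak."""
    return max(hourly_demand) * len(hourly_demand)


def elastic_vm_hours(hourly_demand):
    """VM-hours paid for when capacity tracks demand hour by hour."""
    return sum(hourly_demand)


# Hypothetical day: a steady 4-VM baseline with a 3-hour spike to 20 VMs.
hourly_demand = [4] * 21 + [20] * 3

fixed = peak_provisioned_vm_hours(hourly_demand)
elastic = elastic_vm_hours(hourly_demand)

print(f"Sized for peak: {fixed} VM-hours")
print(f"Elastic:        {elastic} VM-hours")
print(f"Idle capacity avoided: {fixed - elastic} VM-hours")
```

The shorter and sharper the spike, the larger the gap between the two figures, which is the case where ramping up and down pays off most.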

Network capacity is a major bottleneck for large datasets, so if the information is being collected locally, it might make the most sense to analyze data on-premises to avoid long transfer periods and consumption of network resources. When transferring data, both bandwidth and latency affect performance. A cloud provider with 10-100 Gbps connections will be much better suited to push around terabytes quickly.
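A rough back-of-the-envelope calculation shows why link speed matters at this scale. The sketch below estimates a best-case transfer time from dataset size and bandwidth alone. It ignores latency, protocol overhead, and contention, so real transfers will be slower. The `efficiency` parameter is a hypothetical knob for approximating that overhead, and the 10 TB dataset is an assumed example size.

```python
TB = 10**12  # one terabyte in bytes (decimal)


def transfer_time_seconds(size_bytes, link_gbps, efficiency=1.0):
    """Best-case seconds to move size_bytes over a link of link_gbps.

    efficiency is the assumed fraction of nominal bandwidth actually
    achieved (1.0 means the full line rate, which is optimistic).
    """
    bits = size_bytes * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second


# Compare link speeds for an assumed 10 TB dataset.
for gbps in (1, 10, 100):
    hours = transfer_time_seconds(10 * TB, gbps) / 3600
    print(f"{gbps:>3} Gbps: {hours:6.2f} hours")
```

At the full line rate, 10 TB takes roughly 22 hours on a 1 Gbps link but only a little over 2 hours at 10 Gbps, which is why the decision to analyze locally or in the cloud often comes down to the available connection.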

Storage Constraints

Business applications, sensors, video cameras, and just plain old computers are generating both structured and unstructured data at a constant pace, meaning data analysis constitutes a heavy storage workload. Magnetic disks are slow at accessing scattered data and at handling reads and overwrites from many different tenants in a cloud environment. Spending the extra money on solid-state storage is often worth it for serious big data crunchers. NAS can often provide better performance as well, but it might need to be tuned to the platform at hand.

Hadoop, the most popular data analytics platform, uses both HDFS (Hadoop Distributed File System) storage and temporary data. HDFS files consist of a series of data blocks that can be spread across machines, generally with two additional replicas of each block. Temporary data is created when Hadoop reads the HDFS files specified by a big data task and finds patterns, which are stored as intermediate results. Higher-I/O storage, such as DAS and SSD, is better suited for temporary data.
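Replication has a direct effect on capacity planning. The sketch below estimates how many blocks a file occupies and how much raw disk it consumes. It assumes a 128 MiB block size (the default in Hadoop 2.x; older releases used 64 MiB) and a replication factor of 3, meaning the original block plus two replicas. Both values are configurable per cluster, so treat them as assumptions, and note that `hdfs_footprint` is an illustrative helper, not part of Hadoop.

```python
MIB = 1024**2
TIB = 1024**4


def hdfs_footprint(file_bytes, block_bytes=128 * MIB, replication=3):
    """Estimate block count and raw storage for one file in HDFS.

    Returns (logical_blocks, stored_block_copies, raw_bytes).
    The final block of a file only occupies its actual size on disk,
    so raw storage is file size times the replication factor.
    """
    logical_blocks = -(-file_bytes // block_bytes)  # ceiling division
    stored_block_copies = logical_blocks * replication
    raw_bytes = file_bytes * replication
    return logical_blocks, stored_block_copies, raw_bytes


# Assumed example: a single 1 TiB dataset file.
blocks, copies, raw = hdfs_footprint(1 * TIB)
print(f"Logical blocks:      {blocks}")
print(f"Stored block copies: {copies}")
print(f"Raw storage needed:  {raw / TIB:.1f} TiB")
```

Under these assumptions, every terabyte of source data needs about three terabytes of raw disk before any temporary data is written, which is worth factoring into storage sizing.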

Benefits of Virtualizing Big Data Software

Hadoop and NoSQL databases are two of the most common tools for big data crunching. VMware Serengeti is an open-source project that automates the deployment and management of Hadoop clusters in vSphere.

Virtualizing big data tools has a number of benefits, including automated cluster deployment and management through tools like Serengeti.

With faster internet connections and the wide array of compute and storage resources available from cloud providers, enterprises should give a hybrid environment a serious look when expanding their big data operations.