Following is the third article from guest blogger BlueData, by company co-founder Kumar Sreekanti. The article is the final installment in a three-part series on the complexities of Big Data and Hadoop implementation, to complement Technavio’s report on the Global Hadoop as a Service Market.
Over the last few weeks, my colleagues have framed the current state of Big Data (Greg Kirchoff’s “Big Data, Big Opportunity Cost”) and the steps that enterprises can take to avoid the common mistakes around Big Data deployments (Anant Chintamaneni’s “Guide on How to Avoid Big Data’s Pitfalls”).
In this week’s post, I want to address the business requirements for Hadoop-as-a-Service, and why enterprises should demand lower cost, greater agility, and faster time-to-insights for on-premises Big Data deployments.
First, let’s discuss IT spend on Big Data. A recent analysis put overall IT spend on Big Data at approximately $13B in 2013. In 2014, that number grew to $28B. Those are big numbers. And based on that growth rate, it certainly appears that the prediction of $50B by 2017 is possible. What’s also interesting is that when CIOs are surveyed about their ROI, the number comes in at a lackluster 50 percent or less. So, these numbers require a bit more analysis. As we peel back the onion, we see that roughly 40 percent of Big Data spend goes to services, 38 percent to hardware, and 22 percent to software. How can we gain efficiencies and cost savings across all three of these areas?
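As a back-of-the-envelope illustration of those figures (a sketch using only the approximate numbers cited above), the implied growth rate and the services/hardware/software split work out as follows:

```python
# Back-of-the-envelope math on the Big Data spend figures cited above.
# The dollar amounts are the approximate figures from the analysis; the
# script simply works out the implied growth and the 40/38/22 split.

spend_2013 = 13e9      # ~$13B IT spend on Big Data in 2013
spend_2014 = 28e9      # ~$28B in 2014
forecast_2017 = 50e9   # predicted spend by 2017

growth_2014 = spend_2014 / spend_2013 - 1
print(f"Year-over-year growth, 2013 to 2014: {growth_2014:.0%}")

# Even at a far lower rate than 2014's jump, $50B by 2017 is within reach:
required_cagr = (forecast_2017 / spend_2014) ** (1 / 3) - 1
print(f"Annual growth needed to hit $50B by 2017: {required_cagr:.0%}")

# Where the money goes, as a share of 2014 spend:
split = {"services": 0.40, "hardware": 0.38, "software": 0.22}
for category, share in split.items():
    print(f"{category:>8}: ${share * spend_2014 / 1e9:.1f}B")
```

Run the numbers and the 2014 jump comes out at well over 100 percent growth, while hitting $50B by 2017 requires only about 21 percent per year from there; services alone account for roughly $11B of the 2014 total.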
Deployment Time (and Failures)
There is no question that Hadoop and other related Big Data technologies are complex. Over the past few years, we have seen enterprises evolve from asking “what are Hadoop and Spark?” to “how do we fix this Big Data project we started?” While this can be a boon for consulting firms (and software vendors like BlueData) that come to the rescue, it shouldn’t be this way.
Enterprise IT can’t afford to invest in costly and time-consuming deployments with a high risk of failure.
What should you demand from your Big Data deployment? It shouldn’t take months to get up and running with Hadoop. I once tried to count the number of “clicks” it takes to set up and configure a physical Hadoop cluster on-premises – it was literally hundreds, if not thousands. Enterprises should demand Hadoop clusters in minutes, with five clicks or less. If it takes more than that, you’ve probably been contributing to the 40 percent of Big Data spend that goes to professional services and consulting.
Physical Servers (and Cluster Sprawl)
Enterprises that count themselves in the minority of “successful Big Data implementations” understand the challenges of Big Data infrastructure all too well. The vast majority of Hadoop deployments today run on dedicated, bare-metal physical servers. These enterprises must provision multiple servers and multiple physical Hadoop clusters to handle different applications with different Quality of Service levels and security requirements. The result is Hadoop cluster sprawl, high capital expenditures on hardware, and low hardware utilization (often 30 percent or even less).
What should you demand from your Big Data infrastructure? It’s about time that we take a serious look at virtualization for Hadoop. Concerns about performance are now largely a myth; here at BlueData, we’re seeing comparable performance between virtual and physical clusters on-premises. Amazon’s Elastic MapReduce has been doing this for years. The benefits of virtualization for other applications are well-documented. With logical separation, you can run multiple applications (and multiple “tenants”) on the same physical server. Businesses benefit from greater agility, lower costs, less server sprawl, and higher utilization.
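Containers are one way to get this kind of logical separation in practice. Below is a minimal sketch (not a description of BlueData’s platform) that uses the Docker SDK for Python to stand up two isolated Hadoop “tenant” environments on a single physical host; the image name, the service started, the tenant names, and the resource limits are all illustrative assumptions.

```python
# A minimal sketch of logical separation via containers: two isolated Hadoop
# "tenant" environments sharing one physical host. The image name, command,
# tenant names, and resource limits are illustrative assumptions.
import docker

client = docker.from_env()

for tenant in ["finance", "marketing"]:
    client.containers.run(
        "apache/hadoop:3",        # assumed Hadoop image; substitute your own
        command="hdfs namenode",  # placeholder service to run per tenant
        name=f"hadoop-{tenant}",
        hostname=f"nn-{tenant}",
        detach=True,
        mem_limit="4g",           # cap each tenant's memory on the shared host
        cpu_shares=512,           # relative CPU weight per tenant
    )
    print(f"Started isolated Hadoop environment for tenant '{tenant}'")
```

Each tenant gets its own namespace, hostname, and resource limits, yet both share the same underlying hardware – which is exactly where the utilization and agility gains come from.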
Data Storage (and the Risk of Data Swamps)
The “data lake” concept is popular because most Big Data applications need data in a centralized, dedicated file system. The goal of the data lake is that any application can access any data in it. The current reality in the enterprise, however, is quite different. Frequently, Big Data applications can access data only when it is stored in a specific type of file system, so data that already exists elsewhere must be copied into different file system formats before those applications can use it. Moreover, different compute clusters often cannot share the same clustered file system due to internal security and administration requirements. Administrators end up spending their time copying massive amounts of data into multiple enormous reservoirs. But why? Why should enterprises adopt an approach that risks data leakage and creates security issues? Why would an enterprise jump into the potential nightmare of duplicate and triplicate un-curated data? And to add insult to injury, all of this comes at the cost of additional hardware and storage.
What should you demand from your Big Data storage? Nothing less than logic. In its current form, the data lake is more of a data swamp. There is simply no need to take on the risk and cost of a centralized collection of disconnected information silos all sitting in one place. Instead, adopt a “logical” data lake strategy that keeps data where it currently resides. By virtualizing your Big Data environment and separating compute from storage, you eliminate the risk of moving the data and avoid the additional hardware costs needed to sustain the swamp.
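To make “keeping data in place” concrete, here is a minimal PySpark sketch of separating compute from storage: the compute cluster queries data directly from existing remote storage instead of first copying it into cluster-local HDFS. The paths, bucket names, and column names are placeholders, and the tooling shown is an illustrative assumption rather than any specific product.

```python
# A minimal PySpark sketch of separating compute from storage: the Spark
# cluster reads data where it already lives (an object store and an NFS
# mount) instead of copying it into cluster-local HDFS first. All paths,
# bucket names, and column names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-data-lake-demo").getOrCreate()

# Query data in place on remote/shared storage -- no copy step, no second
# "reservoir" to secure and pay for.
events = spark.read.parquet("s3a://existing-enterprise-bucket/events/")
customers = spark.read.csv("/mnt/shared-nfs/crm/customers.csv", header=True)

# The join runs on the compute cluster; the source data is never duplicated.
report = events.join(customers, on="customer_id").groupBy("region").count()
report.show()
```

The point is not the specific tooling but the pattern: compute comes to the data, so no duplicate copy has to be created, curated, and secured.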
Agility and Speed (or Lack Thereof)
I now want to address agility and speed. In cloud services like Amazon Elastic MapReduce (EMR), the agility of Hadoop-as-a-Service is fantastic. But what about data that must reside within the confines of the enterprise firewall? Financial institutions, hospital networks, insurance companies, government organizations, and most other large organizations have some types of data (or IP) that simply cannot be transported into the ether. How do we handle this type of information when we want to run Big Data jobs against it? Can you keep this data on-premises while delivering the agility and consumption model of a cloud service? Today, the answer is that these two models remain worlds apart.
In a public cloud model, users can take advantage of Hadoop-as-a-Service without having to deal with the complexity behind the scenes. Yet on-premises deployments must tackle this complexity head-on, cobbling together all the various tools required and seeking out specialized labor that is difficult to find and expensive to hire. Public cloud services eliminate the user’s need to understand Hadoop; the user can simply spin up or spin down projects with the click of a mouse. On the other hand, most on-premises deployments require heavy IT involvement for new server acquisition, cluster provisioning, storage configuration, etc.; it’s typically a lengthy and heavily capex-dependent deployment.
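As a concrete reference point for that public-cloud consumption model, here is a minimal sketch using boto3 (AWS’s Python SDK) to request an EMR cluster with a single API call; the release label, instance types, and IAM role names are illustrative and assume a standard AWS account setup.

```python
# A minimal sketch of the public-cloud consumption model: requesting a
# managed Hadoop cluster from Amazon EMR with one API call via boto3.
# Release label, instance types, and IAM role names are illustrative and
# assume a standard AWS account setup.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="adhoc-analytics",
    ReleaseLabel="emr-5.30.0",          # assumed EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default roles created by AWS
    ServiceRole="EMR_DefaultRole",
)
print("Cluster requested:", response["JobFlowId"])
```

That is the level of self-service the public cloud has set as the bar; the question is whether the same experience can be delivered against hardware behind the firewall.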
What should you demand for on-premises agility? Nothing short of an Amazon EMR experience for Hadoop-as-a-Service. Analyzing data with the gravitational pull to stay on-premises no longer needs to be slow and costly. A data scientist should not need to know the difference between working with Hadoop-as-a-Service in the public cloud and in their own data center. Data scientists and analysts should have self-service access to their data, and the ability to spin up Hadoop clusters without waiting for IT. Companies that have failed to unlock this agile and flexible environment for Big Data on-premises are at a distinct disadvantage.
A few years ago, my co-founder and I arrived at the same conclusions that many enterprises are reaching today: Big Data is hard. Big Data is complex. We started BlueData to address these challenges and provide a solution for all enterprises. Data has gravity; while some data will move to the cloud, some will need to remain on-premises. Enterprises shouldn’t have to copy and move their data. It’s time to separate compute and storage, challenging long-held Hadoop assumptions about virtualization and data locality. Data scientists should be able to spin up virtual Hadoop clusters on demand, rather than waiting for IT to provision and configure the required hardware. Hadoop-as-a-Service is a great model, but it’s not only for the public cloud. It’s time to demand lower cost, greater agility, and faster time-to-insights for on-premises Big Data deployments.