One of the common reactions I get when I start to talk to people about Big Data is puzzlement. Despite fairly frequent analyst coverage lately, many people still aren't familiar with the term. Therefore, discussing what Big Data means seems like a good first topic for this blog.
Big Data is a term that has come to describe problems that have four characteristics:

- Volume
- Velocity
- Variety
- Complexity
Let's look at each of those terms individually and examine what they really mean.
Volume is really pretty simple to define. It means that you have A LOT of data. How much is a lot? 1TB? 1PB? More? Less? That, unfortunately, is harder to pin down, and the truth is that it will vary. I believe that this aspect of Big Data is really a multiplier: the amount of data you have will multiply the degree of difficulty potentially caused by the other three factors.
Velocity means that you are having trouble keeping up. What you are having trouble keeping up with will vary. It will often be the rate at which new data has to be loaded into your system. This is what most people think of when they think of velocity. However, I think velocity is really much broader than just ingest rates. Your architecture may not be able to keep up with the rate of user queries, updates or even the rate at which "things" change. In this case "things" could be data formats, service specifications or any number of other items.
Bottom line: if you are having trouble (or will have trouble) keeping up, then you have a velocity problem.
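To make the "keeping up" idea concrete, here is a toy sketch. The rates and the `backlog_after` helper are invented purely for illustration, not drawn from any real system:

```python
# A toy illustration (all rates hypothetical) of a velocity problem:
# when data arrives faster than the system can process it, the
# backlog grows without bound -- you are "having trouble keeping up."

def backlog_after(seconds, ingest_rate, process_rate):
    """Events received but not yet processed after `seconds`."""
    return max(0, (ingest_rate - process_rate) * seconds)

# Ingesting 1,000 events/sec while processing only 800/sec:
print(backlog_after(10, 1000, 800))    # 2,000 events behind after 10 seconds
print(backlog_after(3600, 1000, 800))  # 720,000 events behind after an hour
```

The point of the sketch is that even a modest shortfall in processing rate compounds steadily over time, which is why velocity problems tend to surface as systems that slowly fall further and further behind.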
Variety, like velocity, is on its surface very simple. Variety problems occur when you have to deal with data in lots of different formats. Sometimes you can "normalize" this data into a common model. However, that normalization is often expensive to achieve, both in terms of man-hours and in terms of sheer complexity. Additionally, this method of dealing with variety can result in a loss of data fidelity as the information is morphed from its original format.
If you don't "normalize" your data into a common model, then you have to find a way to model and store each different data format. Again, this approach is often very time-consuming and complex, especially when tackled using relational models. You now also face the additional task of figuring out how to query across each different data format while still achieving meaningful results. This can be a very challenging problem.
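To illustrate the trade-off described above, here is a minimal sketch in Python. The record shapes, field names, and `normalize_*` helpers are all hypothetical; the point is only that a common model enables one query across sources, at the cost of dropping source-specific fields:

```python
# A minimal sketch (all field names hypothetical) of normalizing two
# different source formats into one common model -- and of the data
# fidelity that can be lost along the way.

def normalize_crm_record(rec):
    # Hypothetical CRM export: {"full_name", "phone", "lead_score"}
    return {"name": rec["full_name"], "contact": rec["phone"]}
    # "lead_score" has no slot in the common model: fidelity lost.

def normalize_billing_record(rec):
    # Hypothetical billing export: {"customer", "email", "balance"}
    return {"name": rec["customer"], "contact": rec["email"]}
    # "balance" is dropped for the same reason.

records = [
    normalize_crm_record({"full_name": "Ada Lovelace",
                          "phone": "555-0100", "lead_score": 87}),
    normalize_billing_record({"customer": "Alan Turing",
                              "email": "alan@example.com", "balance": 12.50}),
]

# Once normalized, a single query works across both sources...
names = [r["name"] for r in records]
print(names)  # ['Ada Lovelace', 'Alan Turing']
# ...but any question about lead scores or balances can no longer
# be answered from the common model.
```

Even in this two-format toy, someone had to decide which fields survive; multiply that decision across many formats and the cost described above becomes clear.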
Furthermore, whenever you are faced with an additional variety of data, you have to go through this process again, and every time you do, the problem of dealing with variety gets exponentially harder. Normalizing eight different types of data is far more than twice as difficult as normalizing four. The same can be said for modeling them separately and then trying to query them together in a meaningful way.
Unfortunately, the problems of variety don't stop with different data formats, although that is likely one of the most common and hardest forms of the variety problem. Organizations are also faced with the challenge of dealing with variety in the form of:
- User queries / analytics (often ad hoc)
- Data formats for exporting results
- Interface formats (REST, ATOM, RSS, SOAP, OpenSearch, etc.)
Ironically, defining complexity within this context is, in itself, simple: complexity can occur in any number of areas, including the data formats themselves, user queries, security requirements, or data retention policies, just to name a few.
Putting It Together
Unsurprisingly, this is where things really get hard, and it is what makes Big Data problems so challenging. If you have a system with one, two, or often even three of the problem areas described above, it is likely something you have some reasonable experience solving. It is only fairly recently that organizations have begun collecting enough varied data, quickly enough, and trying to extract value from that mass of data, to truly start to deal with these types of problems.

Many organizations have tried to deal with this new class of challenge using legacy tools. I understand that. I've spent many years successfully building large systems using a pretty standard set of programming languages, tools and techniques. It makes sense that we first try to apply those same techniques and tools to this problem to see if they work. The problem is that they don't. They weren't designed to deal with these types of challenges. In order to successfully deal with this new challenge, we need new techniques and new tools. As this blog progresses, we will examine some of those new approaches.