Wednesday, January 25, 2012

What is Big Data?

One of the common reactions I get when I start to talk to people about Big Data is puzzlement. Despite fairly frequent analyst coverage lately, many people still aren't familiar with the topic. Therefore, discussing what Big Data means seems like a good initial topic for this blog.

Big Data is a term that has come to describe problems that have four characteristics:

  • Volume
  • Velocity
  • Variety
  • Complexity

Let's look at each of those terms individually and examine what they really mean.

Volume

Volume is really pretty simple to define. It simply means that you have A LOT of data. How much is a lot? 1TB? 1PB? More? Less? That, unfortunately, is harder to define and the truth is that it will vary. I believe that this aspect of Big Data is really a multiplier. What I mean by that is the amount of data you have will act as a multiplier for the degree of difficulty potentially caused by the other three factors.

Velocity

Velocity means that you are having trouble keeping up. What you are having trouble keeping up with will vary. It will often be the rate at which new data has to be loaded into your system. This is what most people think of when they think of velocity. However, I think velocity is really much broader than just ingest rates. Your architecture may not be able to keep up with the rate of user queries, the rate of updates or even the rate at which "things" change. In this case "things" could be data formats, service specifications or any number of other items.

Bottom line: if you are having trouble (or will have trouble) keeping up then you have a velocity problem.

Variety

Variety, like velocity, is on the surface very simple. Variety problems occur when you have to deal with data that is in lots of different formats. Sometimes you can "normalize" this data into a common model. However, that normalization is often expensive to achieve, both in terms of man hours and in terms of sheer complexity. Additionally, this method of dealing with variety can result in a loss of data fidelity as the information is morphed from its original format.
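To make the fidelity point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the two sample documents, the field names, and the choice of "title" and "author" as the common model. It is not how any particular product normalizes data; it only shows that whatever the common model does not cover gets dropped.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical common model: only these two fields survive normalization.
COMMON_FIELDS = ("title", "author")


def normalize_json(raw):
    """Map a JSON document onto the common model."""
    record = json.loads(raw)
    return {field: record.get(field) for field in COMMON_FIELDS}


def normalize_xml(raw):
    """Map an XML document onto the common model."""
    root = ET.fromstring(raw)
    return {field: root.findtext(field) for field in COMMON_FIELDS}


# Made-up sample data in two different formats.
json_doc = '{"title": "Report A", "author": "Lee", "tags": ["draft", "internal"]}'
xml_doc = (
    '<report classification="internal">'
    "<title>Report B</title><author>Kim</author>"
    "</report>"
)

normalized = [normalize_json(json_doc), normalize_xml(xml_doc)]

for row in normalized:
    print(row)
```

Both records now share one shape and can be queried together, but the JSON document's "tags" list and the XML document's "classification" attribute are gone. Widening the common model to keep them is exactly the expensive part.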

If you don't "normalize" your data into a common model then you have to find a way to model and store each different data format. Again, this approach is often very time consuming and complex, especially when tackled using relational models. You now also face the additional task of figuring out how to query across each different data format while still achieving meaningful results. This can be a very challenging problem.

Furthermore, whenever you are faced with an additional variety of data, you have to go through this process again and every time you do, the problem of dealing with the variety gets exponentially harder. Normalizing eight different types of data is far more than twice as difficult as normalizing four different types. The same can be said for modeling them separately and then trying to query them together in a meaningful way.
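One rough way to see why the difficulty grows faster than the number of formats is to count how many pairs of formats have to be reconciled with each other. That pairwise count is only an assumed stand-in for effort, not a real measure of it, and the function name below is made up for this sketch.

```python
from math import comb


def pairwise_mappings(format_count):
    """Number of distinct format pairs that may need reconciling (assumed proxy for effort)."""
    return comb(format_count, 2)


four = pairwise_mappings(4)
eight = pairwise_mappings(8)

print(four, eight)  # 6 28
```

Under that assumption, doubling the number of formats from four to eight takes you from 6 pairs to 28, well over four times as many.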

Unfortunately, the problems of variety don't stop with different data formats, although that is likely one of the most common and most difficult forms of the variety problem. Organizations are also faced with the challenge of dealing with variety in the form of:

  • User queries / analytics (often ad hoc)
  • Data formats for exporting results
  • Interface formats (REST, ATOM, RSS, SOAP, OpenSearch, etc.)

Complexity

Ironically, defining complexity within this context is, in itself, simple. Complexity can occur in any number of areas, including the data formats, user queries, security requirements, or data retention policies, just to name a few.

Putting it Together

Unsurprisingly, this is where things really get hard. This is what makes Big Data problems so challenging. If you have a system with one, two or often even three of the problem areas described above, it is likely something that you have reasonable experience solving. It is only fairly recently that organizations have begun collecting enough varying data, quickly enough, and trying to extract value from that mass of data, to truly start to deal with these types of problems. Many organizations have tried to deal with this new class of challenge using legacy tools. I understand that. I've spent many years successfully building large systems using a pretty standard set of programming languages, tools and techniques. It makes sense that we first try to apply those same techniques and tools to this problem to see if they work. The problem is that they don't. They weren't designed to deal with these types of challenges. In order to successfully deal with this new challenge we need new techniques and new tools. As this blog progresses we will examine some of those new approaches.

Hello World!

I have a lot more experience writing software than I do blogging and when undertaking any new endeavor most software engineers start off with a simple hello world application. This first post is my blogging equivalent. 

First, a little (but hopefully not too much) about me. I'm currently a Principal Technologist and SE Manager, DoD for MarkLogic, where I have been working for four years. Prior to that I spent over ten years architecting and developing large scale applications in a variety of programming languages, primarily for Government customers. Additionally, I teach on and off in the Computer Science Master's program at Loyola College as well as in the undergraduate program at U.M.B.C. Enough about me already.

In my current job I spend a lot of time thinking about Big Data. MarkLogic is the world's leading Big Data company and I spend a lot of time helping customers solve Big Data problems. This blog will primarily be devoted to the things I've learned along the way and other musings about Big Data, with the occasional sprinkle of other software engineering topics. As I've already said, I work for MarkLogic, and many of these posts will directly touch on MarkLogic. However, this blog isn't a marketing pitch for MarkLogic. At times I will definitely advocate for the MarkLogic technology where appropriate, but many posts won't mention it at all.

I'm not entirely sure where this blog will go but I hope you will join me and perhaps even invite some of your friends to tag along. See you around!