Tuesday, October 14, 2014

The 5 Personas of Big Data

How do we decide WHICH Big Data is right?

The stories about Big Data talk about it as if it is one thing; it isn't. Big Data actually encompasses several different universes, based to some extent on the fact that there are different underlying technologies that can be deployed. And since technology is versatile but not omnipotent, it makes sense to first articulate what one is trying to achieve rather than just saying "let's do Big Data".

And any time there is hype like we're seeing today, a gap emerges between what the business thinks it wants and what the technologists believe they can deliver.

To help bridge that gap it might help to describe the various approaches as people with skills and characteristics. To provide context, I'll start with the old guard and then introduce the new characters.
  • Rachel - is the oldest, with a very structured and ordered view of the world. Her motto is "everything has a place, and everything is in its place", and it is put there immediately. She doesn't handle ambiguity very well, and her rigid outlook makes it difficult to change things as the world around her changes. She struggles when either too much complexity exists or the scale of the problem is too big. We all know Rachel as "relational databases", which became mainstream in the '80s/'90s when computation, storage and memory were all expensive. She is showing her age. Think business intelligence and MySQL or Oracle.
  • Chloe - views the world through a different lens. Rather than looking at everything as well-structured events like Rachel, she is more interested in how groups of those things overlap on key attributes. She is less interested in "how many" than in "what are the commonalities/differences" in a set of things. In addition, she clearly recognizes that not everything is known about everything, so she is quite comfortable with uncertainty. These holes in the data create a sparse matrix that is often best handled by "column-oriented databases". Think highly variable data elements and Google Bigtable or HBase.
  • Kayla - looks at the onslaught of data and says "I can manage this if I spread it around across as many buckets as possible." Her goal is to preserve things as quickly as possible and to make sure they are retrieved upon request. She defers understanding and definition to others. In the purest sense, "key-value pairs" make the data opaque by design, trading insight, and thus latency, for scale and availability. While she is excellent at writing information out, her challenge is organizing and integrating data from several processes that all use the same term to mean different things. Think extreme write and access of user-supplied data and Apache Cassandra or Amazon Dynamo.
  • Danielle - has probably the most unstructured view of the world: "everything is a chunk of stuff and tags". Success then depends on the completeness and accuracy of the descriptors assigned to a chunk, since searching its entire contents can be costly. The unstructured view applies to connectedness as well as content. In the "document store" world there are few relationships between the chunks beyond the tags. Consider posts tagged with #BigData: the hashtag is the only commonality. There are no implicit relationships beyond that fact. Think content management with MongoDB or CouchDB.
  • Grace - is the most social and believes relationships are everything. She thrives in scenarios where the existence or nature of a connection matters more than what is connected. This focus makes it very easy to extend the universe in new ways as new things happen or new actors emerge. Grace exists as "graph databases", which have been around for a while, most noticeably in social networking and the social sciences. Think recommendations, communities of interest and Horton or Neo4j.
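
To make the five personas a little more concrete, here is a rough sketch of how each one might hold a handful of records, using nothing but plain Python structures. None of this is how MySQL, HBase, Dynamo, MongoDB or Neo4j actually work internally; the records, names and helper functions are all made up for illustration, and the only point is the shape of the data each persona prefers.

```python
# Illustrative only: plain Python structures standing in for each data model.
# Every record, key and function name here is hypothetical.

# Rachel (relational): fixed columns, and every row fills every column.
users_table = [
    {"id": 1, "name": "Ann", "city": "Boston"},
    {"id": 2, "name": "Bob", "city": "Denver"},
]

# Chloe (column-oriented): values are grouped per column, and an unknown
# cell is simply absent, which is what keeps a sparse matrix cheap.
columns = {
    "name": {1: "Ann", 2: "Bob"},
    "city": {1: "Boston"},  # Bob's city is unknown, so nothing is stored
}

# Kayla (key-value): opaque values that can only be fetched by key.
kv_store = {
    "user:1": b'{"name": "Ann"}',
    "user:2": b'{"name": "Bob"}',
}

# Danielle (document): a chunk of content plus the tags that describe it.
documents = [
    {"body": "Notes on Hadoop", "tags": ["BigData", "Hadoop"]},
    {"body": "Graph basics", "tags": ["BigData"]},
    {"body": "Holiday photos", "tags": ["Travel"]},
]

# Grace (graph): the connections themselves are the data.
follows = {"Ann": {"Bob"}, "Bob": {"Cal"}, "Cal": set()}


def rows_where(table, field, value):
    """Rachel's kind of question: which rows match a known column value?"""
    return [row for row in table if row[field] == value]


def known_values(column_store, column):
    """Chloe's kind of question: what do we actually know for one attribute?"""
    return dict(column_store.get(column, {}))


def tagged(docs, tag):
    """Danielle's kind of question: which chunks carry this tag?"""
    return [doc for doc in docs if tag in doc["tags"]]


def friends_of_friends(graph, person):
    """Grace's kind of question: who is two hops away but not yet connected?"""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}
```

Notice that Kayla gets no helper function beyond a plain lookup such as `kv_store["user:1"]`: the store hands back the bytes and leaves their meaning to whoever asked, which is the trade-off described above.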
Which persona we adopt depends a lot on what we want to accomplish. And it is quite possible we'll end up deploying more than one solution in our quest to "do Big Data."
