Wednesday, September 17, 2014

Big Data vs. Data Science

What is the difference?

A lot of conversations I'm having these days come around to these two phrases: Have I done it? Can I lead a team doing it? To answer, I've had to put some stakes in the ground and define them from my point of view.
  • Big Data:  a state in which current systems and capacities are simply overwhelmed. One cannot use traditional thinking or tools because the data doesn't fit in memory on a single machine.
  • Data Science: the process of interrogating data in hopes of improving the human condition.
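To make the "doesn't fit in memory on a single machine" test concrete, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `fits_in_memory`, the 8 bytes per value, the 16 GiB machine, and the 50% headroom are all hypothetical defaults, not figures from any real system.

```python
def fits_in_memory(n_rows, n_cols, bytes_per_value=8,
                   available_bytes=16 * 1024**3, headroom=0.5):
    """Rough check: does a dense table of this shape fit on one machine?

    All defaults are illustrative assumptions:
    - bytes_per_value=8 assumes 64-bit numeric values
    - available_bytes assumes a hypothetical 16 GiB machine
    - headroom=0.5 assumes only half of RAM is usable for the data itself
    """
    required = n_rows * n_cols * bytes_per_value
    return required <= available_bytes * headroom


# One million rows by ten columns is about 80 MB: comfortably small.
print(fits_in_memory(1_000_000, 10))        # True

# Ten billion rows by ten columns is about 800 GB: "big" by this definition.
print(fits_in_memory(10_000_000_000, 10))   # False
```

The point of the sketch is only that "big" is relative to the machine: change `available_bytes` and the same table flips from one side of the line to the other.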
While Big Data is a state of being, it is by no means static. Like the Inga rapids on the Congo River, it can be a massive torrent of moving droplets. The bigger the wave, the more a Data Scientist (or team) needs computer science skills to navigate from point to point. And unlike its predecessors, "Data Science" as a discipline starts from a different place: given data, what questions could be answered? Empirical, theoretical and computational sciences start with a question and don't actually have much data; they tackle different problems through observation, logic/proof and Big Hardware.

Because we're looking at the world passing by as a torrential stream of bits, we need to have a goal, an objective or a problem to solve. One doesn't simply jump in; there needs to be a plan and a lot of preparation (did I mention a LOT of preparation?) grounded in experience, math and statistics.

Big is in the eye of the beholder.

Having worked with US and Canadian clients, I've seen that the line in the sand where things start to seem big moves with the market. For example, a reasonably sized loyalty program for a national US retailer is considered big by Canadian standards, since its membership is larger than Canada's total population. Frame of reference matters.

Science is a pursuit, a line of reasoning, not an algorithm.

Along the path we need to visualize, explain and communicate what we've learned to date. Sometimes it is enough to know that a tactical change is correlated with improved conversion; other times we need to explain why and address causality.

Big Data is not Data Science and Data Science is not Big Data, although it is quite clear that the two overlap and that the most frequently mentioned stories come out of that intersection.

Congo: The Grand Inga Project
The story of Steve Fisher and friends running those rapids was released in a documentary in 2012.
