The dark side of the mean
Poor data - it’s been subject to almost every analogy in the world lately. It’s been called the new oil, the new currency, and the new electricity. Can’t we let it just be what it usually is - confusing, duplicated, missing, imperfect pieces of information about how our world works?
Free text municipal traffic court data:
Not since data has become one of the most important political, technological, and social forces of the last 20 years. (I truly believe this.) Data has become both a tool that helps us make decisions and figure out where the rats are, and one that exposes our political and technological liabilities, our darkest hopes and dreams.
However, there are two types of data: one that we’re talking about excessively, and one that we’re not talking about at all.
For the first kind, in one of my absolute favorite pieces about data and general online life today, Maciej writes (in 2015),
The terminology around Big Data is surprisingly bucolic. Data flows through streams into the data lake, or else it's captured in logs. A data silo stands down by the old data warehouse, where granddaddy used to sling bits.
And high above it all floats the Cloud. Then this stuff presumably flows into the digital ocean. I would like to challenge this picture, and ask you to imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.
This is where I think we are with the state of social media tracking data, for sure. There is (finally!) a lot of swirl about this data in the public, corporate media: how Facebook, Google, and other internet giants are dealing with this data, what they’re doing with it, and how it impacts us as consumers.
How about the other type of data, the kind that lives in company servers, in Oracle and Teradata, in spreadsheets, in emails, in (shudder) Access 2010 workbooks, in all the dark, dirty corners of corporations that only one person can get to to pull for you? The type of data that can make your life as a paying customer miserable, data that can decide whether to shut off your water, whether your company should move forward with
This is dark data (what a great term). RedMonk wrote this recently about it:
We recently had a briefing that referenced the problem of dark data in the enterprise. The vendor offered the estimate that 85% of enterprise data was ‘dark’; this particular company’s definition of dark was that the data did not show up in a database, but generally speaking ‘dark data’ is data that exists in a format that is not easily usable or scalable for the enterprise.
People have been trying to combat dark data since the beginning of time, or at least since they thought they had dark data. Today, there are two ways to accomplish this, both of which actually make dark data worse.
First, companies are moving to The Cloud. Not only does The Cloud promise flexibility, scalability, and cost savings, one of the things it falsely offers is the chance to take a “fresh look” at your data in a new setting, with tools to analyze it, like machine learning packages, streaming libraries, extensible logging, and the like. What they don’t tell you about The Cloud is that you have to migrate your data somewhere, and that somewhere is usually either a super-fancy relational datastore like BigQuery or Redshift that you need to re-architect all of your data for, or, worst case, an object store like S3, where there is no metadata catalog and files drift in and out, uncurated, unloved, and named with hideous things like results_1_tar.gz in millions of folders (i.e. buckets), which are potentially open to the world. Microsoft, appropriately, has the best name for its version of this service: Blob Storage.
The ironic thing about moving data to the cloud for transparency is it makes data even darker, at least for the first 6-12 months.
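To make the "no metadata catalog" problem concrete, here is a rough sketch of the kind of inventory you end up writing by hand. It assumes you already have a plain list of object keys (however you listed them), and the word lists and function names are made up for illustration, not taken from any cloud SDK.

```python
import re
from collections import Counter

# Hypothetical word lists: tokens that tell you nothing about the contents.
GENERIC_WORDS = {"result", "results", "output", "data", "tmp", "temp",
                 "final", "export", "new", "copy"}
FORMAT_WORDS = {"tar", "gz", "zip", "csv", "json", "txt", "parquet"}


def is_uninformative(name):
    """True if a file name is made only of generic words, numbers, and formats."""
    tokens = [t for t in re.split(r"[._\-]+", name.lower()) if t]
    return bool(tokens) and all(
        t.isdigit() or t in GENERIC_WORDS or t in FORMAT_WORDS for t in tokens
    )


def summarize_keys(keys):
    """Count objects per top-level prefix and collect uninformative names."""
    by_prefix = Counter()
    uninformative = []
    for key in keys:
        parts = key.split("/")
        # Keys without a slash sit at the root of the store.
        top = parts[0] if len(parts) > 1 else "(root)"
        by_prefix[top] += 1
        if is_uninformative(parts[-1]):
            uninformative.append(key)
    return {
        "total": len(keys),
        "by_prefix": dict(by_prefix),
        "uninformative": uninformative,
    }


keys = [
    "exports/results_1_tar.gz",
    "exports/2019/customer_invoices_2019-06.parquet",
    "tmp.csv",
]
report = summarize_keys(keys)
```

A report like this tells you only where the sludge is, not what is in it, which is more or less the point.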
In order to combat this first thing, companies are, in futility, using more and more data visualization solutions to try and get to the data in these object stores. Nowhere is this more evident than in the recent slew of data visualization product acquisitions. As Tristan notes,
Within the past week we’ve seen the acquisitions of the two biggest players in the modern BI landscape, Looker (announcement) and Tableau (announcement). And if you broaden your view to the entire analytics tech stack, it’s bigger: here are the major acquisitions I’ve tracked over the past year:
Stitch. Acquired by Talend on 11/7/2018 for $60m.
Alooma. Acquired by Google on 2/19/19 for an undisclosed amount.
Periscope. Acquired by Sisense on 5/14/19 for an undisclosed amount.
Looker. Acquired by Google on 6/6/19 for $2.6b.
Tableau. Acquired by Salesforce on 6/10/19 for $15.7b.
He has a very astute analysis about why this is going on, but my simple reason is that companies are desperate to visualize their dark data, now shepherded into large data lakes.
Most companies, though, don’t realize that most of the problem is in the process of how dark data is handled. There are any number of reasons for visualization and The Cloud not being the end-all be-all. Here are some just off the top of my head:
It’s immensely hard to denormalize your data and move it from your large warehouse into a data object store
The Cloud is a political thing at your organization and every decision you try to make about it gets five rounds of meetings
Your data engineering pipeline is terrible and you put junk data in and get junk data out
Your engineering data pipeline is great, but the latency is so slow that no one cares what last week’s data looks like
Your organizational politics dictate that certain people can’t touch certain kinds of data, resulting in silos and the inability to visualize what should be available to the entire company
Your organizational politics dictate that different parts of the organization own different parts of the data engineering pipeline, and so there is never a single holistic way to look at the data
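The junk-in, junk-out and stale-data items above are the two you can at least measure. A minimal sketch, assuming each row is a dict carrying a hypothetical `loaded_at` timestamp written by the pipeline (the field names and the one-day threshold are illustrative only):

```python
from datetime import datetime, timedelta


def null_rate(rows, field):
    """Share of rows where `field` is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def is_stale(rows, now, max_age=timedelta(days=1)):
    """True if the newest row is older than `max_age` (or there are no rows)."""
    if not rows:
        return True
    newest = max(row["loaded_at"] for row in rows)
    return now - newest > max_age


rows = [
    {"customer_id": "a1", "loaded_at": datetime(2019, 6, 10)},
    {"customer_id": None, "loaded_at": datetime(2019, 6, 11)},
]
junk_share = null_rate(rows, "customer_id")
stale = is_stale(rows, now=datetime(2019, 6, 18))
```

Neither check does anything about the politics, which is where the remaining items on the list live.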
Neither The Cloud nor Data Visualization will be the magical solution if your company loves dark data patterns, but as the RedMonk post muses,
That said, I’m not convinced that there is value in all dark data. Some data is dark for a reason.
Expect to see me thinking and expanding more on this idea of data movement and dark data in coming newsletters.
Art: Dark Freshness, Kandinsky, 1927
What I’m reading lately
What happened to Target’s cash registers this weekend? They all went down. Here’s a report from the front lines.
A thread about terrible programming advice:
Tom writes, “This weekend in Evanston, Illinois (where I live) was the annual Custer Fair - a pretty chill and small street festival. Pretty standard street festival stuff: hot dogs, music, children’s face painting, and creepy paramilitary Department of Homeland Security vehicles”
This really inspirational book on what to do if you want to write
About the Author and Newsletter
I’m a data scientist in Philadelphia. This newsletter is about tech and everything around tech. Most of my free time is spent kid-wrangling, reading, and writing bad tweets. I also have longer opinions on things. Find out more here or follow me on Twitter.
If you like this newsletter, forward it to friends!