Messy data is beautiful

  • simonh

Data scientists have always been expected to curate data into ‘aha’ moments and tell stories that can reach a wider business audience. But what is the cost of this curation?

That beautifully crafted dataset is full of bias

Many data workers spend a long time cleansing datasets and building ETL pipelines that turn ugly, noisy data into something more report- and dashboard-worthy. They're veterans of munging data, fixing schemas and processing, and they're loath to let all of those hours of code, sweat, and tears go to waste.

While this was almost certainly the right thing to do before the age of AI, it’s now a critical business mistake.

The real signal is in the noise

The problem is, tidy data doesn’t help that much. 

Every aggregation and pivot performed on datasets reduces the total amount of information available to analyze. That clever NLP topic mining on free text fields was no doubt very useful, but the raw text is more interesting. Perhaps those ‘meaningless’ raw sensor logs have much more meaning than anticipated.

Some example signals we’ve seen in noisy data

A few examples:

  • Spelling mistakes on loan applications
  • Error reports from maintenance crews
  • Oscillating pressure changes in wells
  • Proximity of launderettes to grocery stores
  • Broken features on app causing customer churn

Once these types of data have been cleaned, they do more than populate organized data sets. They open up new lines of inquiry, and AI analytics can surface them faster and more efficiently than ever before.

An example

Let’s say there’s sensor data that’s difficult to understand. Typically, a sensor array will generate a lot of data, most of it unreadable in its raw form.

Following a detailed investigation, the analytics team has noticed that a sustained high reading with high variability on one of the sensors seems to predict one type of mechanical fault. As a result, reports now track the 3-hour rolling average and the 1-hour rolling variance for this sensor.
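To make the two curated metrics concrete, here is a minimal sketch of how they might be computed. It assumes one reading every 10 minutes (so 6 samples per hour), and the names `SAMPLES_PER_HOUR`, `rolling`, and `report_metrics` are hypothetical, not part of any particular platform.

```python
from collections import deque
from statistics import fmean, pvariance

# Assumption: the sensor reports one reading every 10 minutes.
SAMPLES_PER_HOUR = 6


def rolling(readings, window, func):
    """Apply func to the most recent `window` readings at each step."""
    buf = deque(maxlen=window)  # old readings fall off automatically
    result = []
    for value in readings:
        buf.append(value)
        result.append(func(buf))
    return result


def report_metrics(readings):
    """Return (3-hour rolling mean, 1-hour rolling variance) per reading."""
    mean_3h = rolling(readings, 3 * SAMPLES_PER_HOUR, fmean)
    var_1h = rolling(readings, SAMPLES_PER_HOUR, pvariance)
    return list(zip(mean_3h, var_1h))
```

Everything the report shows comes from these two numbers per timestamp. Any pattern that does not move the mean or the variance is invisible to the people reading it.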

These metrics are easy to explain, and everyone from the senior management team to the repair crews understands what they’re measuring. But what was the cost of curating data like this?

While tidy data delivers a nice, explainable story, it does so at the cost of ruling out hypotheses that may never have been considered. And that’s exactly where the actual underlying issue may lie.

Instead, a powerful AI-powered analytics platform can apply a host of functions to this and every other sensor reading, such as exponential moving averages, roots, and FFTs. Then, an analyst can try a range of threshold values and compare the results with context data sets such as weather, or with more bespoke domain knowledge.
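As a rough illustration of that idea, the sketch below derives a few candidate features from a raw series and then sweeps thresholds against known fault labels. It is a toy under stated assumptions, not how any specific platform works: the function names are hypothetical, the spectrum uses a naive DFT (an FFT produces the same values faster), and the square root assumes non-negative readings.

```python
import cmath
import math


def ema(values, alpha):
    """Exponential moving average with smoothing factor alpha in (0, 1]."""
    out, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out


def dft_magnitudes(values):
    """Magnitude spectrum via a naive O(n^2) discrete Fourier transform."""
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values)))
        for k in range(n)
    ]


def candidate_features(values, alpha=0.3):
    """Build several transformed views of the same raw readings."""
    return {
        "ema": ema(values, alpha),
        "sqrt": [math.sqrt(v) for v in values],  # assumes v >= 0
        "spectrum": dft_magnitudes(values),
    }


def threshold_sweep(feature, faults, thresholds):
    """For each threshold, the share of samples where
    (feature > threshold) agrees with the fault label."""
    scores = {}
    for th in thresholds:
        hits = sum((x > th) == bool(f) for x, f in zip(feature, faults))
        scores[th] = hits / len(feature)
    return scores
```

For example, `threshold_sweep(candidate_features(readings)["ema"], faults, [3, 7])` scores two cut-offs on the smoothed series. The same sweep can be repeated over every feature and every sensor, which is the kind of breadth a hand-curated report rules out.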

Capturing unique insights and revealing hidden patterns buried deep in messy data allows us to spot emerging trends, and identify new behaviors and customer needs. In these uncertain times, this is more critical than ever.
