Messy data is beautiful

SparkBeyond Team

October 6, 2021

Data scientists have always been expected to curate data into ‘aha’ moments and tell stories that can reach a wider business audience. But what is the cost of this curation?

Cognitive bottlenecks? There’s more than one.

Human intelligence is limited by preconceptions, experiences, assumptions, and behavioral bias. Even the most brilliant data scientist shares these limitations, which prevent them from gaining a wholly clear and precise understanding of the data.

You may have heard about cognitive bottlenecks in the context of building features, and with good reason. An AI-driven analytics engine can build millions of features that a human analyst would never consider or never have time to hard-code into a model. Yet what if there were another source of bottlenecks? What if that source was the data itself?

No one wants to deal with data from a free text box on a website, and few care much about seeing the untreated sensor logs from a heavy plant or server stack.


Your beautifully crafted dataset is full of bias

We understand. You’ve spent a long time cleansing your datasets and building the ETL pipelines that turn all of that ugly, noisy data into something worthy of a report or dashboard. You’re a veteran of munging data, fixing schemas, and processing, and you’re loath to let all of those hours of code, sweat, and tears go to waste.

While this was almost certainly the right thing to do before the age of AI, now you may be making a critical business mistake without even knowing it.

The real signal is in the noise

The problem is, tidy data doesn’t help that much. Every aggregation and pivot you’ve performed on your datasets reduces the total amount of information available to analyze. Your clever NLP topic mining on free text fields was no doubt very useful, but the raw text is more interesting. Maybe no one in your company wants to look at ‘meaningless’ raw sensor logs, but that’s exactly where AI analytics shines.

Just a few examples of messy data we’ve seen:

Spelling mistakes on loan applications
Error reports from maintenance crews
The amount of head tilt in video preview images
Oscillating pressure changes in wells
Proximity of launderettes to grocery stores
Broken features in an app causing customer churn

Data like this does more than fill out an organized data set. It points to possibilities that curated data hides, and AI-based technologies can surface those possibilities faster and more efficiently than ever before.

An example

Let’s say you have sensor data that’s difficult to understand. Typically, a sensor array will generate a lot of data, and that data is rarely easy to read, if it is readable at all.

Following a detailed investigation, your data science team has noticed that a sustained high reading combined with high variability on one of the sensors seems to predict one type of mechanical fault. As a result, you now report on the 3-hour rolling average and the 1-hour rolling variance for this sensor.
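To make the two curated metrics concrete, here is a minimal Python sketch of a 3-hour rolling average and a 1-hour rolling variance. It is an illustration only, not the pipeline any particular team uses: the 10-minute sampling interval, the function names, and the choice of population variance are all assumptions made for the example.

```python
from collections import deque
from statistics import fmean, pvariance

# Assumption for illustration: the sensor emits one reading every 10 minutes.
SAMPLES_PER_HOUR = 6


def rolling(values, window, stat):
    """Apply `stat` to each trailing window of `window` samples.

    Positions that do not yet have a full window yield None.
    """
    buf = deque(maxlen=window)  # oldest sample drops out automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(stat(list(buf)) if len(buf) == window else None)
    return out


def curated_metrics(readings):
    """Return (3-hour rolling mean, 1-hour rolling variance) for one sensor."""
    mean_3h = rolling(readings, 3 * SAMPLES_PER_HOUR, fmean)
    var_1h = rolling(readings, 1 * SAMPLES_PER_HOUR, pvariance)
    return mean_3h, var_1h
```

Each output value summarizes 18 or 6 raw readings in a single number, which is what makes the metrics easy to report and also what discards the detail inside each window.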

These metrics are easy to explain, and everyone from the senior management team to the repair crews understands what they’re measuring. But what was the cost of curating data like this?

While your tidy data gets you a nice, explainable story, it does so at the cost of ruling out hypotheses that you would never have considered or had the chance to test. And those hypotheses may be exactly where the underlying issue lies.

An AI-powered analytics engine can apply a host of functions to this and every other sensor reading, such as exponential moving averages, roots, and FFTs. Using one, we can try a range of threshold values and compare the results against context data sets such as weather or more bespoke domain knowledge.
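As a rough sketch of what applying a host of functions to a raw sensor window can look like, the Python below derives a few candidate features: an exponential moving average, a mean of square roots, the dominant frequency bin, and the fraction of readings above each of several thresholds. The function names, the smoothing factor of 0.3, and the feature names are hypothetical, and a naive discrete Fourier transform stands in for an FFT so the example needs only the standard library. It does not represent how any specific analytics engine works.

```python
import cmath
import math


def ema(values, alpha):
    """Exponential moving average with smoothing factor alpha in (0, 1]."""
    out = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out


def dft_magnitudes(values):
    """Magnitude spectrum from a naive O(n^2) DFT.

    An FFT computes the same spectrum faster; this version avoids dependencies.
    """
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(values)))
        for k in range(n // 2 + 1)
    ]


def candidate_features(values, thresholds):
    """Build a dict of candidate features from one window of raw readings.

    Expects at least two readings so a non-constant frequency bin exists.
    """
    spectrum = dft_magnitudes(values)
    features = {
        # Smoothing factor 0.3 is an arbitrary choice for illustration.
        "ema_last": ema(values, alpha=0.3)[-1],
        # Negative readings are clamped to zero before taking the root.
        "sqrt_mean": sum(math.sqrt(max(v, 0.0)) for v in values) / len(values),
        # Skip bin 0, which only reflects the average level of the signal.
        "dominant_freq_bin": max(range(1, len(spectrum)),
                                 key=spectrum.__getitem__),
    }
    # Try a range of thresholds instead of committing to a single one.
    for th in thresholds:
        features[f"frac_above_{th}"] = sum(v > th for v in values) / len(values)
    return features
```

Calling `candidate_features(window, thresholds=[40, 60, 80])` on each window of each sensor yields a table of candidates that can then be tested against fault labels or joined with context data such as weather.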

Capturing unique insights and revealing hidden patterns buried deep in messy data allows us to spot emerging trends and identify new behaviors and customer needs. In these uncertain times, this is more critical than ever.

