Human intelligence is limited by preconceptions, experiences, assumptions, and behavioral bias. Even the most brilliant data scientist has these human limitations, which prevent them from ever gaining a wholly clear and precise understanding of the data.
There’s more than one cognitive bottleneck
At SparkBeyond, we often talk about cognitive bottlenecks in the context of building features – and with good reason. Our AI engine can build millions of features that a human analyst would never consider, or would never have time to hard-code into a model. But what if there were another source of bottlenecks? What if the bottleneck were the data itself?
No one wants to deal with data from a free text box on a website, and few care much about seeing the untreated sensor logs from a heavy plant or server stack.
Data scientists have always been expected to curate data into ‘aha’ moments and tell stories that can reach a wider business audience. But what is the cost of this curation?
Your beautifully crafted dataset is full of bias
We understand. You’ve spent a long time cleansing your datasets and building the ETL pipelines that turn all of that ugly, noisy data into something more report- and dashboard-worthy. You’re a veteran of munging data and fixing schemas, and you’re loath to let all of those hours of code, sweat, and tears go to waste.
While this was almost certainly the right thing to do before the age of AI, now you may be making a critical business mistake without even knowing it.
The real signal is in the noise
The problem is that tidy data doesn’t help as much as you’d think. Every aggregation and pivot you’ve performed on your datasets reduces the total amount of information available to SparkBeyond. Your clever NLP topic mining on free-text fields was no doubt very useful, but the raw text is more interesting. Maybe no one in your company wants to look at ‘meaningless’ raw sensor logs, but that’s exactly where SparkBeyond’s Hypothesis Engine wants to work.
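As a minimal illustration of that information loss (hypothetical data; pandas and numpy assumed), consider two sensors whose hourly aggregate is identical even though their raw traces behave completely differently:

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings, one per minute over a single hour.
idx = pd.date_range("2021-01-01", periods=60, freq="min")
steady = pd.Series(50.0, index=idx)                      # flat at 50
spiky = pd.Series(np.tile([0.0, 100.0], 30), index=idx)  # oscillates 0/100

# The hourly aggregate a dashboard would report:
print(steady.resample("h").mean().iloc[0])  # 50.0
print(spiky.resample("h").mean().iloc[0])   # 50.0 -- indistinguishable

# The signal that told the two sensors apart lives only in the raw series:
print(steady.std(), spiky.std())  # 0.0 vs ~50.4
```

Once the aggregate is all that’s stored, no downstream model – human or machine – can recover the difference.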
Some example signals we’ve seen in noisy data
Just a few examples of messy data we’ve seen:
- Spelling mistakes on loan applications
- Error reports from maintenance crews
- The amount of head tilt in video preview images
- Oscillating pressure changes in wells
- Proximity of launderettes to grocery stores
- Broken app features causing customer churn
Once these types of data have been cleaned, they do more than fill out an organized dataset. They open up possibilities, and AI-based technologies can explore those possibilities faster and more efficiently than ever before.
Let’s say you have sensor data that’s difficult to understand. A sensor array typically generates a lot of data, and little of it is readable in its raw form, if at all.
Following a detailed investigation, your data science team has noticed that a sustained high reading with high variability on one sensor seems to predict a particular type of mechanical fault. As a result, you now report on the 3-hour rolling average for this sensor, along with its 1-hour rolling variance.
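In pandas, those two curated metrics might look something like this (the file name, the `pressure` column, and the per-minute sampling rate are all assumptions for illustration):

```python
import pandas as pd

# Hypothetical sensor log: one reading per minute, indexed by timestamp.
df = pd.read_csv("sensor_log.csv", parse_dates=["timestamp"],
                 index_col="timestamp")

# The two metrics the business now reports on:
df["rolling_mean_3h"] = df["pressure"].rolling("3h").mean()
df["rolling_var_1h"] = df["pressure"].rolling("1h").var()
```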
These metrics are easy to explain, and everyone from the senior management team to the repair crews understands what they’re measuring. But what was the cost of curating data like this?
While your tidy data gets you a nice, explainable story, it does so at the cost of ruling out hypotheses you would never have considered or had the chance to test. The SparkBeyond discovery engine can help you overcome this, and that’s why we love messy, granular data.
Within the SparkBeyond AI engine, we have the capability to apply a host of functions to this and every other sensor reading: exponential moving averages, roots, and FFTs. We can try a range of threshold values, and compare the readings against context datasets such as weather or more bespoke domain knowledge.
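As a rough sketch of what that search space can look like – not SparkBeyond’s actual implementation; the spans and thresholds below are arbitrary, and numpy/pandas are assumed:

```python
import numpy as np
import pandas as pd

def candidate_features(reading: pd.Series) -> dict:
    """Generate a batch of candidate features from one raw sensor series."""
    feats = {}
    # Exponential moving averages at several (arbitrary) time scales.
    for span in (10, 60, 360):
        feats[f"ema_{span}"] = reading.ewm(span=span).mean().iloc[-1]
    # Simple transforms such as roots.
    feats["sqrt_mean"] = np.sqrt(reading.clip(lower=0)).mean()
    # Dominant frequency from an FFT, e.g. to catch oscillating pressure.
    spectrum = np.abs(np.fft.rfft(reading - reading.mean()))
    feats["dominant_freq_bin"] = int(spectrum[1:].argmax()) + 1
    # A sweep of threshold values.
    for t in (50, 75, 100):
        feats[f"frac_above_{t}"] = (reading > t).mean()
    return feats
```

Multiply transforms like these across every sensor, every window, and every context dataset, and you quickly arrive at the millions of candidate features no analyst could enumerate by hand.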
Capturing unique insights and revealing hidden patterns buried deep in messy data allows us to spot emerging trends and identify new behaviors and customer needs. In these uncertain times, this is more critical than ever.