The Data Science Problems We All Have Made and How to Fix Them
by Oliver Gräser, Director of Data Science - APAC, 30 Jul 2019
Capable of driving truly groundbreaking results, data science is one of the most exciting forces in business and technology today. Yet despite this excitement, data science is often under-utilized, which means businesses fall short of the success they are hoping for.
Below are a few common traps that data scientists can avoid in order to deliver the impact our organizations are looking for.
Who Wants to Predict When You Can Create?
Predictive modeling is probably the first thing that comes to mind when people think “Data Science.” There is no doubt that predictive modeling is one of the many upsides of data science, especially in those frequent instances when the result is out of our control and predicting it is all we can do. But why limit data science to predictions? For example, should we simply accept that customers will churn and make retention offers to those most at risk? Or should we understand why people are likely to churn and make them happier customers in the first place?
We need to move beyond just building predictive models to uncovering underlying drivers. This is easier said than done, because finding the root causes of an open-ended problem is much more complex than building a model. But if you want to shape the future instead of being shaped by it, you need to discover what drives your problem.
Do you know what you want to know?
When turning a business problem into a data science use case, the first question is often, “What is my target variable?” This isn’t as trivial a question as you may think. For example, I was once involved with a study that looked for drivers of sales representative productivity. During the scoping process, each business stakeholder was asked, “How is it measured?” Different stakeholders had different perspectives: total sales volume, sales growth year-on-year, sales target fulfillment, etc. Together, we identified multiple target definitions, all of which had different drivers, in order to fully understand what improves sales rep productivity. Using traditional data science methods, the necessary effort would have skyrocketed, but with automated hypothesis search, the additional effort was minuscule.
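To see how much the choice of target variable matters, here is a minimal sketch in Python. The record fields (`sales_this_year`, `sales_last_year`, `sales_target`) and all numbers are made up for illustration; they are not taken from the study described above.

```python
from dataclasses import dataclass


@dataclass
class RepRecord:
    """One sales rep's yearly figures (hypothetical fields, illustrative only)."""
    name: str
    sales_this_year: float
    sales_last_year: float
    sales_target: float


def target_definitions(rep: RepRecord) -> dict:
    """Return three candidate target variables for the same rep."""
    return {
        "total_volume": rep.sales_this_year,
        "yoy_growth": (rep.sales_this_year - rep.sales_last_year) / rep.sales_last_year,
        "target_fulfillment": rep.sales_this_year / rep.sales_target,
    }


# Two made-up reps: A sells a lot but is shrinking, B sells little but is growing.
reps = [
    RepRecord("A", sales_this_year=900.0, sales_last_year=1000.0, sales_target=800.0),
    RepRecord("B", sales_this_year=300.0, sales_last_year=200.0, sales_target=400.0),
]

# Pick the "most productive" rep under each definition.
best_by_definition = {
    key: max(reps, key=lambda r: target_definitions(r)[key]).name
    for key in ("total_volume", "yoy_growth", "target_fulfillment")
}
print(best_by_definition)
```

In this toy data, rep A comes out on top by volume and target fulfillment, while rep B wins on year-on-year growth. Whatever drives "productivity" will therefore look different depending on which definition the stakeholders settle on.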
In the same manner, common analytics use cases often have multiple angles. Take insurance claims, for example. We would want to know which claims are low risk overall and can be fast-tracked. In addition, we would want to know which claims need to go through triage with another insurer, and which ones are likely not covered. Each of these objectives typically has different drivers, and with traditional methods, exploring five use cases requires five times the effort. That is not a sustainable way to create business impact.
What holds true today will not hold true tomorrow
A common starting point for many supervised learning problems is the question, “How much data do we have?” If the answer is, “Data for the last five years, about 100,000 records,” then the common reaction is simply to go ahead. However, how do we verify that the information contained in the bulk of our data is still relevant?
This is a very well-known problem, but it is often addressed incompletely. To fix it, data scientists typically just recalibrate their models on up-to-date data. The fallacy of this approach is that recalibration only allows your models to correctly interpret the information presented to them, that is, the information encoded in the features that the data scientist provides. But what about the information that was discarded or ignored by the data scientist because, in the past, it did not matter?
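One simple way to look for such overlooked information is to re-screen every candidate feature, including the ones dropped earlier, separately for each time period. The sketch below does this in Python with a plain Pearson correlation. The correlation measure, the feature names (`price_increase`, `app_crashes`), the thresholds, and the toy rows are all assumptions chosen for illustration, not a prescribed method.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def screen_by_period(rows, feature_names, target_name, period_name):
    """Correlation of every candidate feature with the target, per period."""
    periods = sorted({row[period_name] for row in rows})
    result = {}
    for period in periods:
        subset = [row for row in rows if row[period_name] == period]
        ys = [row[target_name] for row in subset]
        result[period] = {
            name: pearson([row[name] for row in subset], ys)
            for name in feature_names
        }
    return result


# Made-up churn data: in 2018 only price increases relate to churn,
# in 2019 only app crashes do.
rows = [
    {"year": 2018, "price_increase": 0, "app_crashes": 0, "churned": 0},
    {"year": 2018, "price_increase": 0, "app_crashes": 1, "churned": 0},
    {"year": 2018, "price_increase": 1, "app_crashes": 0, "churned": 1},
    {"year": 2018, "price_increase": 1, "app_crashes": 1, "churned": 1},
    {"year": 2019, "price_increase": 0, "app_crashes": 0, "churned": 0},
    {"year": 2019, "price_increase": 0, "app_crashes": 1, "churned": 1},
    {"year": 2019, "price_increase": 1, "app_crashes": 0, "churned": 0},
    {"year": 2019, "price_increase": 1, "app_crashes": 1, "churned": 1},
]

scores = screen_by_period(rows, ["price_increase", "app_crashes"], "churned", "year")

# Features that carried no signal in 2018 but do in 2019 (thresholds are arbitrary).
newly_relevant = [
    name for name in scores[2019]
    if abs(scores[2018][name]) < 0.1 and abs(scores[2019][name]) > 0.5
]
print(newly_relevant)
```

A model built in 2018 would reasonably have dropped `app_crashes`, and recalibrating that model on 2019 data would never bring the feature back. Only a fresh screening of all candidates surfaces it.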
As different as the above problems may seem, one approach offers a solution to all of them: leveraging AI to generate hypotheses at scale. If you are generating millions of hypotheses, you can evaluate them against any number of problem aspects and discover which ones actually influence the underlying behavior. And you can repeat the process, from scratch, again and again as soon as new data is available. Moreover, each time you do this, you can encode the relevant hypotheses and use them to build predictive models covering phenomena that have only just appeared in the data.
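To give a feel for what generating and evaluating hypotheses can look like, here is a deliberately tiny sketch in Python. It is a toy under stated assumptions, not a description of any particular product or algorithm: hypotheses are simple threshold conditions and their pairwise conjunctions, the score is a difference in target means, and the claims data (`claim_amount`, `policy_age_years`, `fast_track`, `not_covered`) is invented.

```python
from itertools import combinations


def single_conditions(rows, columns):
    """One hypothesis per (column, observed value): 'column >= value'."""
    hypotheses = {}
    for col in columns:
        for value in sorted({row[col] for row in rows}):
            hypotheses[f"{col} >= {value}"] = (
                lambda row, c=col, v=value: row[c] >= v
            )
    return hypotheses


def with_pairs(hypotheses):
    """Add the conjunction of every pair of single conditions."""
    combined = dict(hypotheses)
    for (name_a, fn_a), (name_b, fn_b) in combinations(hypotheses.items(), 2):
        combined[f"({name_a}) and ({name_b})"] = (
            lambda row, a=fn_a, b=fn_b: a(row) and b(row)
        )
    return combined


def effect(rows, hypothesis, target):
    """Mean of target where the hypothesis holds minus mean where it does not."""
    hit = [row[target] for row in rows if hypothesis(row)]
    miss = [row[target] for row in rows if not hypothesis(row)]
    if not hit or not miss:
        return 0.0
    return sum(hit) / len(hit) - sum(miss) / len(miss)


def top_hypotheses(rows, hypotheses, targets, k=1):
    """For each target, the k hypotheses with the largest absolute effect."""
    ranked = {}
    for target in targets:
        by_strength = sorted(
            hypotheses,
            key=lambda name: abs(effect(rows, hypotheses[name], target)),
            reverse=True,
        )
        ranked[target] = by_strength[:k]
    return ranked


# Invented insurance claims with two different objectives as targets.
claims = [
    {"claim_amount": 100, "policy_age_years": 5, "fast_track": 1, "not_covered": 0},
    {"claim_amount": 200, "policy_age_years": 1, "fast_track": 1, "not_covered": 1},
    {"claim_amount": 5000, "policy_age_years": 6, "fast_track": 0, "not_covered": 0},
    {"claim_amount": 8000, "policy_age_years": 1, "fast_track": 0, "not_covered": 1},
]

hypotheses = with_pairs(single_conditions(claims, ["claim_amount", "policy_age_years"]))
best = top_hypotheses(claims, hypotheses, ["fast_track", "not_covered"])
print(len(hypotheses), best)
```

The same pool of hypotheses is scored against both targets, and each target ends up with a different top driver: claim size for fast-tracking, policy age for coverage. Adding a third objective costs one more entry in the `targets` list, and rerunning the whole thing on new data costs nothing beyond compute. A real system would need far richer hypothesis generators and proper statistical controls for testing so many candidates at once.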