Why is Advanced Text Feature Discovery Important?

Mining raw text for insights

Many data science projects can benefit from effective handling of text data in addition to more traditional numeric analysis. Text, often referred to as a type of unstructured data, can be challenging to handle and requires many pre-processing steps before it is useful to a machine learning model. Mining raw text for insights is as much art as it is science, and this is where feature discovery comes in.

In this article, we outline how you can go from simple text methods such as keyword analysis to more advanced techniques that unlock deeper insights based on textual context.

From keywords to sentiment analysis

Before attempting feature extraction from raw text, it is very common to apply a number of pre-processing steps such as tokenization, removal of irrelevant punctuation and stop-words, and lemmatization. These techniques have been covered extensively in the literature, so we will not go into detail here.
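
As a brief illustration, here is one possible pre-processing pipeline, a minimal sketch using a simple regex tokenizer plus NLTK's stop-word list and lemmatizer; the library choice and the exact steps are assumptions, not a prescription:

```python
# A minimal pre-processing sketch: lowercase, tokenize, drop stop-words, lemmatize.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")  # one-off resource downloads
nltk.download("wordnet")

def preprocess(text: str) -> list[str]:
    """Return a cleaned list of lemmatized tokens."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    # Crude regex tokenization: lowercase letters only, punctuation dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

print(preprocess("The loans were approved, despite the missing documents!"))
# e.g. ['loan', 'approved', 'despite', 'missing', 'document']
```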

Once text fields have been pre-processed, one of the most straightforward ways of mining insights from text is searching for keywords and phrases. For example, imagine a dataset containing loan information for defaults/repayments, where one of the columns contains free text describing why the person requested the loan. You might calculate how frequently the word “bills” appears in the descriptions and find a correlation between heavy usage and an increased likelihood of default. Such an insight can lead you to pose a hypothesis that you can test and analyse further, eventually arriving at a new feature to add to your machine learning model.
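
As a sketch of this idea (the column names and data below are hypothetical), you could flag descriptions containing the keyword and compare default rates across the two groups:

```python
# Hypothetical loan data: 'description' (free text) and 'defaulted' (0/1).
import pandas as pd

df = pd.DataFrame({
    "description": [
        "need money to pay off my bills",
        "consolidating credit card debt",
        "medical bills piling up",
        "home renovation project",
    ],
    "defaulted": [1, 0, 1, 0],
})

# Binary keyword feature: does the description mention "bills"?
df["mentions_bills"] = df["description"].str.contains(r"\bbills\b", case=False)

# Compare default rates with and without the keyword.
print(df.groupby("mentions_bills")["defaulted"].mean())
# A large gap between the two rates suggests a hypothesis worth testing
# properly (e.g. with a significance test) before adopting the feature.
```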

Another common approach to natural language processing is sentiment analysis. This involves bringing in external information about the sentiment scores of words to enhance our ability to analyse text: is it positive or negative, and to what degree? This is useful when analysing, say, consumer reviews.
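
The VADER lexicon shipped with NLTK is one widely used source of such external sentiment scores; a minimal sketch (the review text is invented):

```python
# Lexicon-based sentiment scoring with NLTK's VADER (many alternatives exist).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off resource download

sia = SentimentIntensityAnalyzer()
review = "The delivery was fast and the product is excellent."
print(sia.polarity_scores(review))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
# 'compound' is an overall score in [-1, 1]: negative to positive.
```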

Above we covered a couple of relatively simple ways of analysing text; they often serve as a starting point.

From words and sentences to numeric representations

The biggest hindrance in analysing text is the lack of a unified way to process raw words and phrases. In the same way we carry out computations with numeric data, we need a way to compare text and compute metrics such as the difference or similarity between two sentences. Imagine we have a dataset of products: electronics such as computers and tablets, but also foods and beverages. How can we systematically compare the text descriptions of these items and figure out which are most similar to one another?

To accomplish that, we need to convert text into vectors, essentially turning sentences into a numerical representation through a technique called text embedding.

Arguably one of the oldest text representation methods is “bag-of-words”, which simply counts how many times each word appears in a text. An extension of this is tf-idf (term frequency-inverse document frequency), which re-weights words so that terms distinctive to a document count for more than terms common across the whole corpus. It transforms text into long, sparse vectors containing the tf-idf value for each word in our corpus.
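
With scikit-learn, for instance, the transformation takes a few lines (the toy corpus below is ours):

```python
# Turning a toy corpus into sparse tf-idf vectors with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "three bedroom house with garden",
    "two bedroom apartment near the station",
    "laptop with fast processor and large screen",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(X.shape)                        # (3, vocabulary size)
print(vectorizer.get_feature_names_out())
```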

Using these new vector representations, we can start calculating how similar one piece of text is to another, and the application to a traditional machine learning problem quickly becomes apparent. For example, if houses with a high price share a similar description, e.g. three bedrooms or specific amenities, this can imply that other houses with that description will also sell at a relatively higher price. Moreover, we can now systematically search for such similar descriptions using the numeric vectors.
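
Continuing the sketch above, cosine similarity is a common way to rank descriptions by closeness (the house listings here are invented):

```python
# Finding the most similar description via cosine similarity on tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "spacious three bedroom house with pool and garden",
    "three bedroom family home, large garden, close to schools",
    "compact studio flat in the city centre",
]

X = TfidfVectorizer().fit_transform(descriptions)
sims = cosine_similarity(X[0], X)  # similarity of the first listing to all
print(sims.round(2))               # the second listing should score highest
```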

Advanced text embeddings

One of the major disadvantages of the representation above is the resulting vectors: we can end up with very large, sparse vectors, and on top of that we do not learn any relationships between individual words. For example, the words “pleasant” and “enjoyable” are treated as completely separate entities even though they carry very similar meanings.

That brings us to one of the more modern vector representations stemming from the field of deep learning: word2vec. Word2vec uses a neural network model to produce much richer, denser vectors from raw text. These representations are not only more efficient at storing information but also encode far more meaning than simple tf-idf vectors, which in turn enables data scientists to carry out more sophisticated arithmetic with text.
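
A minimal sketch with gensim; the toy corpus is far too small to learn good vectors, but it shows the mechanics:

```python
# Training dense word vectors on a (toy) tokenized corpus with gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "loan", "was", "repaid", "early"],
    ["the", "loan", "went", "into", "default"],
    ["bills", "made", "repayment", "difficult"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

print(model.wv["loan"].shape)                # a dense 50-dimensional vector
print(model.wv.similarity("loan", "bills"))  # cosine similarity of two words
```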

A commonly cited example of a calculation you can perform with word2vec goes as follows. Take the vectors for the words King, Queen, Man and Woman. If you subtract Man from King and add Woman, the result is approximately the vector for Queen.
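
With pretrained vectors, this arithmetic is a one-liner in gensim; the sketch below uses a small GloVe model (downloaded on first use) as a stand-in for any word-embedding model:

```python
# The "king - man + woman ~ queen" arithmetic on pretrained word vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads on first use

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]
```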

Not only are we able to represent text numerically, but different words with similar contextual meaning are now “closer” together. This is incredibly powerful, as we can now compare texts that are lexically different (i.e. use different words) but semantically similar (i.e. convey a similar meaning).

Thinking about concept graphs

In the sections above we examined some of the ways we can use a raw text corpus to create numeric representations of words and sentences. These are incredibly powerful capabilities for our ever-growing toolbox, but they are not exhaustive.

One topic we have not yet touched on is so-called concept or knowledge graphs. In short, these let us introduce pre-existing knowledge about the relationships between different words and phrases.

Consider a group of foods: pizza, pasta, gelato. Intuitively, these are Italian foods, but a machine won’t know that unless it is trained. Likewise, a human knows that the company names Deloitte, KPMG and PwC are connected in some way and fall under the umbrella term of professional services firms.

You can encode such knowledge by making use of concept graphs, allowing your models to exploit the relationships between terms. This lets you derive much more aggregated and robust insights rather than relying on endless keyword features. For example, it is likely to be far more informative that food items related to Italian cuisine are popular in a specific region than replicating the same insight for each separate item.
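
As a toy sketch of the idea (the mapping and data below are invented; in practice the hierarchy would come from an external knowledge graph), items can be rolled up to their parent concept before computing features:

```python
# Rolling item-level counts up to a parent concept using a hand-built
# item -> concept mapping (a stand-in for a real knowledge graph).
from collections import Counter

concept_of = {
    "pizza": "italian cuisine",
    "pasta": "italian cuisine",
    "gelato": "italian cuisine",
    "sushi": "japanese cuisine",
}

orders = ["pizza", "pasta", "gelato", "sushi", "pasta"]

# One aggregated feature per concept instead of one feature per item.
concept_counts = Counter(concept_of[item] for item in orders)
print(concept_counts)  # Counter({'italian cuisine': 4, 'japanese cuisine': 1})
```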

By Iliya Kostov, Lead Data Scientist at SparkBeyond

SparkBeyond Discovery is a data science platform for supervised machine learning that helps data professionals save time, deepen their understanding of the problem space and improve model performance by automating feature discovery in complex data. 

Get a personalised demo to see the platform in action or watch the on-demand demo.
