April 30, 2021

The data mining process

Arne Wolfewicz

Growth Machine Builder

Artificial intelligence lacks important aspects of human intelligence

To this day, simply dumping a pile of data into even the most advanced machine is unlikely to give you back anything meaningful, let alone produce the outcome that you desire. Intelligent systems still need people to ask the right questions, set goals and evaluate performance.

At Levity, we have set ourselves the objective of democratizing machine learning: we allow users to prepare data, as well as train, evaluate and put a model into production, all without having to write a single line of code.

But how can we get from an idea to a functioning system?

In this article, we walk you through the CRISP-DM framework, highlight the critical elements of the process, and show how modern tools can take away much of the complexity.

CRISP for data mining

Machine learning practitioners around the world have been – consciously or unconsciously – following a certain pattern in order to make a machine produce good results: CRISP-DM (cross-industry standard process for data mining). It suggests that certain steps have to be taken in the following order:

Problem understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

However, it is usually wrong to assume that one can get from idea to working system by just taking each step once – iteration is the rule rather than the exception (see illustration).

CRISP-DM: Cross-industry standard process for data mining.
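To make the idea of iteration concrete, the phases and the moves between them can be written down as a small table. This is only a sketch: the forward steps follow the list above, while the backward steps reflect the loops as they are commonly drawn in the CRISP-DM diagram. The names `Phase`, `TRANSITIONS` and `is_valid_path` are made up for this illustration.

```python
from enum import Enum


class Phase(Enum):
    PROBLEM_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases (an assumption based on the usual diagram).
# Forward steps follow the order of the list; the rest are loops back.
TRANSITIONS = {
    Phase.PROBLEM_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.PROBLEM_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.PROBLEM_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A project that goes back to data preparation once before it is deployed.
project = [
    Phase.PROBLEM_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]
```

Here `is_valid_path(project)` accepts the path with its loop, whereas a jump straight from problem understanding to deployment is rejected.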

Let's go through the steps one by one.

1. Problem understanding

Note to those who are familiar with some machine learning techniques: This is the time for objective analysis. It is quite common to reframe the problem several times as the project progresses, but any iteration that can be avoided should be avoided.

This phase is also referred to as business understanding. It is important to first have a clear view of the problem at hand. Each business problem is unique in some way and may not present itself as a data mining case from the start.

Some guiding questions during this phase:

What is the issue you are facing?
What are the input, the processing steps and the desired output of the process?
How is the process being done today?
What steps would you like to automate?
Which aspects should be automated by machine learning and which can be handled by other tools?

2. Data understanding

After the problem is framed, it is important to understand (1) what data is available and (2) what that data looks like. In any business setting, data comes in a variety of formats: images, plain text, sound, videos and databases, both standardized and unstructured. Today's technology can deal with all of these, but we must first determine what can be used.

Besides the format, it also helps to get a first idea of where the data comes from. In the simplest case, you have immediate access to it.
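If the data already sits in a folder you can reach, a first inventory takes only a few lines. The sketch below counts files by coarse data type. The mapping `KIND_BY_SUFFIX` and the function name `inventory` are assumptions made for this example, and the extension list is far from complete.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to a coarse data type.
# Extend it to match the formats that occur in your own process.
KIND_BY_SUFFIX = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".txt": "text", ".md": "text",
    ".wav": "sound", ".mp3": "sound",
    ".mp4": "video", ".mov": "video",
    ".csv": "table", ".sqlite": "table",
}


def inventory(folder):
    """Count the files below `folder` (including subfolders) by data type."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            kind = KIND_BY_SUFFIX.get(path.suffix.lower(), "other")
            counts[kind] += 1
    return counts
```

Calling `inventory("some/folder")` returns a `Counter` that shows at a glance which formats dominate and how much falls into "other" and needs a closer look.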

3.-5. Data preparation, modeling & evaluation

Business people are usually great at framing the problem and understanding the data involved in the process. At this point, however, projects turn a lot more "technical": data needs to be retrieved in greater quantities, labeled and transformed into a machine-digestible format. Afterwards, a number of models have to be set up, trained and finally evaluated.
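To give a feeling for what these three stages involve when done by hand, here is a deliberately tiny sketch in plain Python. It is not how Levity works internally: the classifier is a simple naive Bayes model chosen for brevity, the labeled messages are made up, and all function names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict


def prepare(text):
    """Data preparation: turn raw text into lowercase word tokens."""
    return text.lower().split()


def train(examples):
    """Modeling: collect per-label word counts for a naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(prepare(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # Add-one smoothing: every vocabulary word counts at least once.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in prepare(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(model, examples):
    """Evaluation: share of held-out examples that are predicted correctly."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)


# Made-up labeled data, for illustration only.
labeled = [
    ("invoice attached please pay", "billing"),
    ("payment overdue invoice reminder", "billing"),
    ("refund for wrong invoice amount", "billing"),
    ("cannot log in password reset", "support"),
    ("app crashes on login screen", "support"),
    ("error message when resetting password", "support"),
]
# Hold out one example per label to evaluate on data the model has not seen.
train_set = labeled[:2] + labeled[3:5]
test_set = [labeled[2], labeled[5]]

model = train(train_set)
accuracy = evaluate(model, test_set)
```

Even in this toy version, the three stages are visible as separate functions, and the evaluation uses examples that were kept out of training. Real projects need far more data, a proper train/test split and usually several model candidates.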

All these activities typically go beyond the skillset of managers and non-technical staff, which is why the project has to be handed over at some point. There is nothing bad about this per se. However, most companies do not even have these skills on their payroll. This is the very reason why many projects either stall at this point or, worse, never get considered in the first place.

This is why we established Levity. On our platform, all three stages can be handled without code:

Data can be labeled in Slack (if needed)
State-of-the-art models are automatically selected and trained
The user receives immediate feedback on how well or badly the training process went

As such, control remains with whoever thought of the problem in the first place without having to employ someone or apply for developer capacity.

6. Deployment

Simply having a prediction machine is worth little to nothing. What ultimately drives performance, speed and quality in processes is a system that responds to requests automatically, possibly embedded in a no-touch workflow.

Traditionally, there are two popular ways of deploying a model:

Embed the trained model into an existing program
Set up a microservice that is able to communicate via API

While the first option is becoming less and less common, the second one at least allows you to connect workflow automation tools like Zapier. Our software does that for you: you get the Zapier integration right out of the box.
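For readers who wonder what the second option looks like when built by hand, here is a minimal sketch that uses only Python's standard library. The `/predict` endpoint, the JSON field names and the placeholder `predict` function are assumptions for this example; a real service would load a trained model and add authentication, logging and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Placeholder for a trained model; a real service would load one here."""
    return "billing" if "invoice" in text.lower() else "support"


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = json.loads(self.rfile.read(length))["text"]
            if not isinstance(text, str):
                raise TypeError("'text' must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON body with a 'text' string")
            return
        body = json.dumps({"label": predict(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the service on localhost and block until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

After calling `serve()`, any tool that can send an HTTP POST request with a body such as `{"text": "Invoice 42 is overdue"}` to `http://127.0.0.1:8000/predict` receives a JSON answer with the predicted label. This is the kind of interface that workflow automation tools can be connected to.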

On discipline and CRISP

Most users we speak to immediately get excited about the technology. This is understandable given that such software is usually not available without major investments of time, money or both. As a result, smaller or younger businesses in particular do not get to enjoy the benefits of such systems and end up either hiring manual labor or not doing the task at all.

Having said that, we strongly recommend that you do not "jump the gun" by skipping the initial steps. Skipping them typically means additional iterations that you might want to avoid.

We are working hard to make the flow as user-friendly as humanly possible. Our ultimate goal is to make the software entirely self-explanatory. Until then, feel free to connect with us to discuss your idea!