Knowing the tools and techniques of data mining is extremely important for both experts in the sector and students who want to work in big data. So, after our first article on data mining, where we explained its definition, the sectors in which its processes are applied and its stages, we now go a little deeper.
8 data mining techniques
In this section we present the 8 data mining techniques most used by companies and explain, in simple terms, what each one consists of.
Decision tree
It is called this because it has a tree structure containing two types of nodes: decision points and chance points.
The tree captures a decision problem and the sequence of choices it involves; each node is a junction connected to other nodes by branches.
The tree is built from left to right but evaluated in reverse, because the initial decision is on the left and the results are on the right.
It consists of 4 elements:
Decision points: represented by a square. Here the decision-maker chooses one alternative from a finite number of them, which are represented by branches with their associated costs written on them. A chosen branch can end in another decision point, in a chance point or in a result.
Chance points: drawn as a circle, they indicate that a random event is expected at this point in the process. Branches also emerge from here.
Branches: in the jargon, they are called alternatives when they leave a decision point and states of nature when they leave a chance point. In the latter case, each one is assigned a probability.
Result: the outcome at the end of each branch. The decision to take is chosen by comparing the results obtained along each branch.
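As a minimal sketch of how such a tree can be evaluated from right to left, the Python snippet below computes the value of each node. The dictionary layout, the `rollback` function and all the numbers are assumptions made up for illustration; they do not come from any particular tool.

```python
def rollback(node):
    """Evaluate a decision tree from its results back to the first decision."""
    kind = node["kind"]
    if kind == "result":
        return node["value"]
    if kind == "chance":
        # Chance point: average the branches, weighted by their probabilities.
        return sum(p * rollback(child) for p, child in node["branches"])
    if kind == "decision":
        # Decision point: keep the alternative with the best value net of its cost.
        return max(rollback(child) - cost for cost, child in node["branches"])
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical example: pay 10 to take a 50/50 gamble, or accept a sure 30.
tree = {
    "kind": "decision",
    "branches": [
        (10, {"kind": "chance", "branches": [
            (0.5, {"kind": "result", "value": 100}),
            (0.5, {"kind": "result", "value": 0}),
        ]}),
        (0, {"kind": "result", "value": 30}),
    ],
}

print(rollback(tree))  # 40.0: the gamble (50 expected, minus 10) beats the sure 30
```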
Neural network
This data mining technique is modeled on the way our neurons work: the human brain has billions of them, connected to each other through junctions called synapses. Thanks to this, each of us is able to think.
Like a biological network, an artificial neural network has input nodes (which receive information from the outside), output nodes (which transmit information to the outside) and hidden nodes (which exchange information with other nodes in the network).
Once these nodes are defined, we move on to the learning phase, in which the weights of the connections between nodes are adjusted repeatedly until the network gives the expected answers; the network itself modifies them automatically.
The main advantage of this data mining technique is its ability to work with incomplete data.
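To illustrate the learning phase in its simplest form, here is a sketch of a single artificial neuron (a perceptron, with no hidden nodes) that learns the logical AND function. The function names, the training data and the learning rate are illustrative assumptions; real networks have many nodes and use more elaborate training methods.

```python
def predict(weights, bias, inputs):
    # The neuron fires (outputs 1) when the weighted sum of its inputs is positive.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1):
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Hypothetical training data: the truth table of logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```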
Statistical modeling
It describes the relationships between variables in the data with mathematical equations and uses them to predict results.
It is the oldest of the data mining techniques, since it began to be developed in the 17th century with more archaic methods, although the essence was the same as today.
It is this old because it is a branch of mathematics that existed long before data mining and was brought into the world of data as data became part of our society.
Association rules
They allow us to find the combinations of items that occur together most frequently in a database, and to measure how strong those associations are.
An example of this data mining technique is an online store: when a customer is about to buy an item, the purchase is matched against those of other consumers in the database, and the customer is shown other products that are often bought with it or that fit their own history.
The data is stored as a list of transactions, either in a horizontal representation (each transaction with its items) or in a vertical one (each item with the transactions that contain it).
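A small sketch can make this concrete. The snippet below uses two standard measures, support (how often an itemset appears) and confidence (how often a rule holds when its left-hand side is present), on a made-up list of transactions. The data and function names are assumptions for illustration only.

```python
# Horizontal layout: one set of items per transaction (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the share that also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)


# Vertical layout: each item mapped to the ids of the transactions containing it.
vertical = {}
for tid, items in enumerate(transactions):
    for item in items:
        vertical.setdefault(item, set()).add(tid)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets also contain milk
print(sorted(vertical["milk"]))         # [0, 2, 3]
```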
Clustering
The elements of a data set are grouped into distinct subsets, called clusters.
The goal is for elements of the same cluster to be as similar as possible, while elements of different clusters are as dissimilar as possible.
There are many types of clustering, but the two most common are:
Hierarchical clustering: An object is more related to nearby objects than to distant objects.
Density-based clustering: Objects are grouped into clusters as long as the nearest elements are within a set threshold.
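The threshold idea can be sketched as follows. This is a deliberately simplified version that only chains together points lying within a distance `eps` of each other; it is not a full density-based algorithm such as DBSCAN, which also requires a minimum number of neighbours. The function name, the points and the threshold are illustrative assumptions.

```python
import math


def threshold_clusters(points, eps):
    """Group point indices so that each point is within eps of a point in its cluster."""
    clusters = []
    unvisited = set(range(len(points)))
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Pull in every unvisited point close enough to the current one.
            near = {j for j in unvisited if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Hypothetical data: three points near the origin and two far away.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(threshold_clusters(points, eps=1.5))  # [[0, 1, 2], [3, 4]]
```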
Genetic algorithm
Just as the neural network is based on our neurons, the genetic algorithm is based on the theory of evolution.
This data mining technique attempts to replicate the biological behavior of natural selection and genetics.
The algorithm is given an initial population of candidate solutions (chromosomes), each made up of bits (genes).
They go through the evaluation phase together, where each one is assigned a score based on its fitness. The fittest continue and the others do not, just as in Charles Darwin's theory.
After this, the survivors are crossed or mutated and the process is repeated until the expected result is reached or until it is stopped manually.
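The loop described above can be sketched in a few lines. The example assumes a toy objective (maximize the number of 1-bits in a chromosome, often called OneMax); the population size, mutation rate and the rule of keeping the fitter half are all illustrative choices rather than fixed parts of the technique.

```python
import random


def fitness(chromosome):
    # Toy objective (an assumption for this sketch): count the 1-bits.
    return sum(chromosome)


def evolve(pop_size=20, n_genes=16, generations=50, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: rank the chromosomes by fitness.
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == n_genes:
            break  # expected result reached
        # Selection: only the fitter half survives.
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)        # crossover point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.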
Linear regression
Linear regression is another of the most widely used data mining techniques in a sector that continues to grow due to digital transformation. It relates two continuous variables: the predictor variable and the response variable.
We speak of simple linear regression when there is only one predictor variable and of multiple regression when there is more than one. In both cases the predictors are the independent variables, and the response variable depends on them.
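For the case of a single predictor, the line can be fitted with the classic least-squares formulas. The sketch below is a minimal illustration with invented data; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]   # predictor (independent) variable
ys = [3, 5, 7, 9]   # response (dependent) variable
slope, intercept = fit_line(xs, ys)
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # prediction for x = 5: 11.0
```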
Bayesian networks
They represent uncertainty as a graph whose nodes stand for random variables and whose links express how each variable depends on the others. On top of this there are so-called "Bayesian classifiers", which organize the variables and express the conditions in a way that is very easy to read.
They are widely used in medicine for diagnosing serious conditions, where Bayesian networks help to rule out diseases quickly.
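The smallest possible network of this kind has two nodes, a disease and a test that depends on it, and updating it is just Bayes' rule. The sketch below uses invented probabilities purely for illustration; they are not clinical figures, and `posterior` is a hypothetical helper.

```python
def posterior(prior, p_pos_if_sick, p_pos_if_healthy, test_positive):
    """P(disease | test result) for a two-node network: Disease -> Test."""
    if test_positive:
        like_sick, like_healthy = p_pos_if_sick, p_pos_if_healthy
    else:
        like_sick, like_healthy = 1 - p_pos_if_sick, 1 - p_pos_if_healthy
    # Bayes' rule: weight each hypothesis by how well it explains the evidence.
    evidence = prior * like_sick + (1 - prior) * like_healthy
    return prior * like_sick / evidence


# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false-positive rate.
after_positive = posterior(0.01, 0.90, 0.05, test_positive=True)
after_negative = posterior(0.01, 0.90, 0.05, test_positive=False)
print(after_positive, after_negative)
```

With these assumed figures a positive test raises the probability of disease to roughly 15%, while a negative test lowers it to about 0.1%, which is the sense in which such a network helps to rule a disease out.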
Data mining tools and techniques