What you need to know about data mining. What is Data Mining? Text information analysis tools

Traditional statistical methods can only confirm pre-formulated hypotheses (verification-driven data mining) and support "rough" exploratory analysis, which forms the basis of online analytical processing (OnLine Analytical Processing, OLAP), whereas one of the main tenets of Data Mining is the search for non-obvious patterns. Data Mining tools can find such patterns on their own and can also independently build hypotheses about relationships. Since formulating a hypothesis about dependencies is the most difficult task, the advantage of Data Mining over other analysis methods is obvious.

Most statistical methods for identifying relationships in data rely on averaging over a sample, which leads to operations on values that do not actually exist, whereas Data Mining operates on real observed values.

While OLAP is better suited to understanding historical data, Data Mining relies on historical data to answer questions about the future.

Prospects for Data Mining Technology

The potential of Data Mining gives a "green light" to expand the boundaries of technology application. Regarding the prospects of Data Mining, the following areas of development are possible:

  • identification of types of subject areas, together with heuristics specific to them, whose formalization will make it easier to solve the Data Mining tasks related to these areas;
  • creation of formal languages and logical tools with whose help reasoning can be formalized and whose automation will become an instrument for solving Data Mining problems in specific subject areas;
  • creation of data mining methods that can not only extract patterns from data, but also form theories based on empirical data;
  • overcoming the significant gap between the capabilities of Data Mining tools and the theoretical achievements in this area.

If we consider the future of Data Mining in the short term, it is obvious that the development of this technology is most directed towards business-related areas.

In the short term, data mining products may become as commonplace and essential as email, and, for example, be used by users to find the lowest prices for a particular product or the cheapest tickets.

In the long term, the future of Data Mining is truly exciting: it could mean intelligent agents searching for new treatments for various diseases or for a new understanding of the nature of the universe.

However, Data Mining is also fraught with potential danger: an increasing amount of information, including private information, is becoming available through the worldwide network, and more and more knowledge can be extracted from it:

Not so long ago, the large online store Amazon was at the center of a scandal over its patent "Methods and systems for helping users to purchase goods", which is nothing more than another Data Mining product designed to collect personal data about store visitors. The new technique makes it possible to predict future requests based on purchase history and to draw conclusions about their purpose. Its goal is exactly what was mentioned above: obtaining as much information as possible about customers, including private details (gender, age, preferences, etc.). In this way, data is collected about the private lives of shoppers and their family members, including children. The latter is prohibited by the laws of many countries: collecting information about minors is allowed there only with parental permission.

Research notes that there are both successful solutions using Data Mining and unsuccessful experiences with this technology. Areas where applications of Data Mining technology are most likely to be successful have the following features:

  • require knowledge-based decisions;
  • have a changing environment;
  • have accessible, sufficient and meaningful data;
  • provide high dividends from the right decisions.

Existing approaches to analysis

For a long time, Data Mining was not recognized as a full-fledged independent field of data analysis; it was sometimes called the "backyard of statistics" (Pregibon, 1997).

To date, several points of view on Data Mining have been defined. Proponents of one of them consider it a mirage that diverts attention from classical analysis.

Data Mining

Data Mining is a methodology and process for discovering, in the large volumes of data that accumulate in companies' information systems, previously unknown, non-trivial, practically useful and interpretable knowledge needed for decision-making in various areas of human activity. Data Mining is one of the stages of the larger Knowledge Discovery in Databases methodology.

The knowledge discovered in the process of Data Mining must be non-trivial and previously unknown. Non-triviality means that such knowledge cannot be discovered by simple visual analysis. It should describe relationships between the properties of business objects and predict the values of some attributes on the basis of others. The discovered knowledge should also be applicable to new objects.

The practical usefulness of the knowledge stems from the possibility of using it to support managerial decision-making and to improve the company's operations.

Knowledge should be presented in a form that is understandable to users without special mathematical training. For example, logical constructions of the form "if ..., then ..." are the most easily understood by a person. Moreover, such rules can be used in various DBMSs as SQL queries. When the extracted knowledge is not transparent to the user, there should be post-processing methods that bring it to an interpretable form.

Data mining is not a single method but a combination of a large number of different knowledge discovery methods. All tasks solved by Data Mining methods can be conditionally divided into six basic types.

Data mining is multidisciplinary in nature, as it includes elements of numerical methods, mathematical statistics and probability theory, information theory and mathematical logic, artificial intelligence and machine learning.

The tasks of business analysis are formulated in different ways, but the solution of most of them comes down to one or another Data Mining task or to a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are elements from which you can "assemble" the solution to most real business problems.

To solve the above problems, various methods and algorithms of Data Mining are used. In view of the fact that Data Mining has developed and is developing at the intersection of such disciplines as mathematical statistics, information theory, machine learning and databases, it is quite natural that most Data Mining algorithms and methods were developed based on various methods from these disciplines. For example, the k-means clustering algorithm was borrowed from statistics.

We welcome you to the Data Mining Portal - a unique portal dedicated to modern Data Mining methods.

Data Mining technologies are a powerful tool of modern business intelligence for discovering hidden patterns and building predictive models. Data Mining, or knowledge mining, is based not on speculative reasoning but on real data.

Fig. 1. Scheme of application of Data Mining

Problem Definition - Problem definition: data classification, segmentation, building predictive models, forecasting.
Data Gathering and Preparation - Data collection and preparation, cleaning, verification, removal of duplicate records.
Model Building - Building a model, assessing accuracy.
Knowledge Deployment - Application of the model to solve the problem.

Data Mining is used to implement large-scale analytical projects in business, marketing, the Internet, telecommunications, industry, geology, medicine, pharmaceuticals and other areas.

Data Mining makes it possible to search for significant correlations and relationships by sifting through huge amounts of data with modern pattern recognition methods and unique analytical technologies, including decision and classification trees, clustering, neural network methods, and others.

A user who discovers data mining technology for the first time is amazed at the abundance of methods and efficient algorithms that allow finding approaches to solving difficult problems related to the analysis of large amounts of data.

In general, Data Mining can be described as a technology designed to search large amounts of data for non-obvious, objective and practically useful patterns.

Data Mining is based on effective methods and algorithms designed to analyze unstructured data of large volume and dimension.

The key point is that data of large volume and high dimension appear to be devoid of structure and relationships. The goal of data mining technology is to identify these structures and find patterns where, at first glance, chaos and arbitrariness reign.

Here is an actual example of the application of data mining in the pharmaceutical and drug industries.

Drug interactions are a growing problem facing modern healthcare.

Over time, the number of drugs prescribed (including over-the-counter drugs and all kinds of supplements) increases, making it more and more likely that interactions between drugs will cause serious side effects of which doctors and patients are unaware.

This area belongs to post-clinical studies, when the drug is already on the market and is being used extensively.

Clinical studies evaluate the effectiveness of a drug, but they take poor account of its interactions with other drugs already on the market.

Researchers at Stanford University in California studied the FDA (Food and Drug Administration) database of drug side effects and found that two commonly used drugs - the antidepressant paroxetine and pravastatin, used to lower cholesterol levels - increase the risk of developing diabetes if taken together.

A similar analysis study based on FDA data identified 47 previously unknown adverse interactions.

This is remarkable, with the caveat that many negative effects noticed by patients remain undetected. It is precisely in such cases that data-driven search can show itself at its best.

Upcoming Data Mining courses at the StatSoft Academy of Data Analysis in 2020

We start our acquaintance with Data Mining using the wonderful videos of the Academy of Data Analysis.

Be sure to watch our videos and you will understand what Data Mining is!

Video 1. What is Data Mining?


Video 2: Data Mining Overview: Decision Trees, Generalized Predictive Models, Clustering, and More



Before launching a research project, we must organize the process of obtaining data from external sources; we will now show how this is done.

The video will introduce you to the STATISTICA In-place Database Processing technology and to connecting Data Mining with real data.

Video 3. Interacting with databases: a graphical interface for building SQL queries and the In-place Database Processing technology



Now we will get acquainted with interactive drilling technologies that are effective in conducting exploratory data analysis. The term drilling itself reflects the connection between Data Mining technology and geological exploration.

Video 4. Interactive Drilling: Exploration and Graphing Methods for Interactive Data Exploration



Now we will get acquainted with the analysis of associations (association rules), these algorithms allow you to find relationships that exist in real data. The key point is the efficiency of algorithms on large amounts of data.

The result of link (association) analysis algorithms, such as the Apriori algorithm, is a set of rules relating the objects under study with a given confidence, for example 80%.

In geology, these algorithms can be applied in the exploration analysis of minerals, for example, how feature A is related to features B and C.

You can find specific examples of such solutions in our links:

In retail, Apriori algorithms or their modifications make it possible to explore the relationships between different products, for example in the sale of cosmetics (perfume - nail polish - mascara, etc.) or of products of different brands.

The analysis of the most interesting sections on the site can also be effectively carried out using association rules.

So check out our next video.

Video 5. Association rules


Let us give examples of the application of Data Mining in specific areas.

Internet trading:

  • analysis of customer trajectories from visiting the site to purchasing goods
  • evaluation of service efficiency, analysis of failures due to lack of goods
  • linking products that are of interest to visitors

Retail: analysis of customer information based on credit cards, discount cards, etc.

Typical retail tasks solved by Data Mining tools:

  • shopping cart analysis;
  • creation of predictive models and classification models of buyers and purchased goods;
  • creation of buyer profiles;
  • CRM, assessment of customer loyalty of different categories, planning of loyalty programs;
  • time series research and time dependencies, selection of seasonal factors, evaluation of the effectiveness of promotions on a large range of real data.

The telecommunications sector opens up unlimited opportunities for the application of data mining methods, as well as modern big data technologies:

  • classification of clients based on key characteristics of calls (frequency, duration, etc.), SMS frequency;
  • identification of customer loyalty;
  • fraud detection, etc.

Insurance:

  • risk analysis. By identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a known case when an insurance company discovered that the amounts paid out on the applications of people who are married are twice the amounts on the applications of single people. The company responded to this by revising its discount policy for family customers.
  • fraud detection. Insurance companies can reduce fraud by looking for patterns in claims that characterize the relationships between lawyers, doctors, and claimants.

The practical application of data mining and the solution of specific problems is presented in our next video.

Webinar 1. Webinar "Practical tasks of Data Mining: problems and solutions"


Webinar 2. Webinar "Data Mining and Text Mining: Examples of Solving Real Problems"



You can get deeper knowledge on the methodology and technology of data mining at StatSoft courses.

Ministry of Education and Science of the Russian Federation

Federal State Budgetary Educational Institution of Higher Professional Education

"NATIONAL RESEARCH TOMSK POLYTECHNICAL UNIVERSITY"

Institute of Cybernetics

Direction Informatics and Computer Engineering

Department of VT

Test

in the discipline of informatics and computer engineering

Topic: Data Mining Methods

Introduction

1. Data Mining. Basic concepts and definitions

1.1 Steps in the Data Mining process

1.2 Components of data mining systems

1.3 Data Mining methods

2. Data Mining methods

2.1 Derivation of association rules

2.2 Neural network algorithms

2.3 Nearest neighbor and k-nearest neighbor methods

2.4 Decision trees

2.5 Clustering algorithms

2.6 Genetic algorithms

3. Applications

4. Producers of Data Mining tools

5. Criticism of methods

Conclusion

Bibliography

Introduction

The development of information technologies has produced a colossal amount of data accumulated in electronic form and growing at a rapid pace. At the same time, this data, as a rule, has a heterogeneous structure (texts, images, audio, video, hypertext documents, relational databases). Data accumulated over a long period of time may contain patterns, trends and relationships that are valuable information for planning, forecasting, decision making and process control. However, a person is physically unable to analyze such volumes of heterogeneous data effectively. The methods of traditional mathematical statistics have long claimed the role of the main tool for data analysis, but they cannot synthesize new hypotheses and can only be used to confirm pre-formulated hypotheses and to perform "rough" exploratory analysis, which forms the basis of online analytical processing (OLAP). Often it is precisely the formulation of a hypothesis that turns out to be the most difficult task when conducting analysis for subsequent decision making, since not all patterns in the data are obvious at first glance. Therefore, data mining technologies are considered one of the most important and promising topics for research and application in the information technology industry. Here, data mining is understood as the process of discovering new, correct and potentially useful knowledge in large data sets. Notably, MIT Technology Review described Data Mining as one of the ten emerging technologies that will change the world.

1. Data Mining. Basic concepts and definitions

Data Mining is the process of discovering previously unknown, non-trivial, practically useful and accessible knowledge in raw data, which is necessary for making decisions in various areas of human activity.

The essence and purpose of Data Mining technology can be formulated as follows: it is a technology that is designed to search for non-obvious, objective and practical patterns in large amounts of data.

Non-obvious patterns are patterns that cannot be detected by standard methods of information processing or by an expert.

Objective patterns should be understood as patterns that fully correspond to reality, in contrast to expert opinion, which is always subjective.

This concept of data analysis suggests that:

§ data may be inaccurate, incomplete (contain gaps), contradictory, heterogeneous, indirect, and at the same time have gigantic volumes; therefore, understanding data in specific applications requires significant intellectual effort;

§ the data analysis algorithms themselves may have “elements of intelligence”, in particular, the ability to learn from precedents, that is, to draw general conclusions based on particular observations; the development of such algorithms also requires considerable intellectual effort;

§ The processes of processing raw data into information and information into knowledge cannot be performed manually and require automation.

Data Mining technology is based on the concept of patterns (templates) reflecting fragments of multidimensional relationships in the data. These are regularities inherent in subsamples of the data that can be expressed concisely in a human-readable form.

The search for patterns is carried out by methods that are not constrained by a priori assumptions about the structure of the sample or the type of distributions of the values of the analyzed indicators.

An important feature of Data Mining is the non-standard and non-obviousness of the patterns being sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that instead of checking interdependencies that users presuppose, they are able to find such interdependencies on their own based on the available data and build hypotheses about their nature. There are five standard types of patterns identified by Data Mining methods:

· association - a high probability that events are connected with one another. An example of an association is items in a store that are often purchased together;

· sequence - a high probability of a chain of events connected in time. An example of a sequence is a situation where, within a certain period of time after the purchase of one product, another will be purchased with a high degree of probability;

· classification - there are features that characterize the group to which a particular event or object belongs;

· clustering - a pattern similar to classification and differing from it in that the groups themselves are not specified in advance; they are detected automatically in the course of data processing;

· temporal patterns - the presence of patterns in the dynamics of the behavior of certain data. A typical example of a temporal pattern is seasonal fluctuations in demand for certain goods or services.

1.1 Steps in the Data Mining Process

Traditionally, the following stages are distinguished in the process of data mining:

1. Study of the subject area, as a result of which the main goals of the analysis are formulated.

2. Data collection.

3. Data preprocessing:

a. Data cleaning - elimination of contradictions and random "noise" from the original data;

b. Data integration - combining data from several possible sources into a single repository;

c. Data transformation - conversion of the data to a form suitable for analysis; data aggregation, attribute discretization, data compression and dimensionality reduction are often used here.

4. Data analysis. At this stage, mining algorithms are applied to extract patterns.

5. Interpretation of the found patterns. This stage may include visualization of the extracted patterns and identification of the really useful ones based on some utility function.

6. Use of the new knowledge.
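
To make the preprocessing stage more concrete, here is a minimal sketch in Python using pandas; the table names, column names and fill rules are hypothetical and only illustrate typical cleaning, integration and transformation steps.

import pandas as pd

# Hypothetical raw extracts from two sources (integration step)
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [100.0, None, 250.0, 80.0]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 28]})

df = sales.merge(profiles, on="customer_id", how="left")    # data integration
df = df.drop_duplicates()                                   # removal of duplicate records
df["amount"] = df["amount"].fillna(df["amount"].median())   # cleaning: fill gaps
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],  # transformation: discretize an attribute
                         labels=["young", "middle", "senior"])
print(df)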

1.2 Components of mining systems

Typically, the following main components are distinguished in data mining systems:

1. Database, data warehouse or other repository of information. This can be one or more databases, a data warehouse, spreadsheets or other types of repositories over which cleaning and integration can be performed.

2. Database or data warehouse server. This server is responsible for extracting the relevant data based on the user's request.

3. Knowledge base. This is domain knowledge that indicates how to search for patterns and how to evaluate the usefulness of the resulting patterns.

4. Knowledge mining service. It is an integral part of the data mining system and contains a set of functional modules for tasks such as characterization, association search, classification, cluster analysis and deviation analysis.

5. Pattern evaluation module. This component computes measures of interestingness or usefulness of patterns.

6. Graphical user interface. This module is responsible for communication between the user and the data mining system and for visualization of patterns in various forms.

1.3 Data Mining Methods

Most of the analytical methods used in Data Mining technology are well-known mathematical algorithms and methods. What is new is the possibility of applying them to specific problems, made possible by the emerging capabilities of hardware and software. It should be noted that most Data Mining methods were developed within the framework of the theory of artificial intelligence. Let us consider the most widely used methods:

1. Derivation of association rules.

2. Neural network algorithms, whose idea is based on an analogy with the functioning of nervous tissue: the initial parameters are treated as signals that are transformed according to the existing connections between "neurons", and the response of the entire network is treated as the answer resulting from the analysis of the original data.

3. Selection of a close analogue of the original data from already available historical data, also called the nearest neighbor method.

4. Decision trees - a hierarchical structure based on a set of questions that require a "Yes" or "No" answer.

5. Cluster models, used to group similar events into groups based on similar values of several fields in a dataset.

In the next chapter, we will describe these methods in more detail.

2. Data mining methods

2.1 Derivation of association rules

Association rules are rules of the form "if...then...". Searching for such rules in a data set reveals hidden relationships in seemingly unrelated data. One of the most frequently cited examples of the search for association rules is the problem of finding stable relationships in a shopping cart. This problem is to determine which products are purchased together by the customers, so that marketers can appropriately place these products in the store to increase sales.

Association rules are defined as statements of the form (X1, X2, ..., Xn) -> Y, meaning that Y may be present in a transaction provided that X1, X2, ..., Xn are present in the same transaction. The word "may" indicates that the rule is not an identity but holds only with some probability. In addition, Y can be a set of items rather than a single item. The probability of finding Y in a transaction that contains the items X1, X2, ..., Xn is called confidence. The share of transactions containing the rule in the total number of transactions is called support. The confidence threshold that a rule's confidence must exceed is called interestingness.
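
To illustrate the definitions of support and confidence, here is a minimal sketch in Python over a toy list of transactions; the item names and the candidate rule are illustrative assumptions, and this is not the Apriori algorithm itself.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "yogurt"},
]

def support(itemset, transactions):
    # Share of transactions that contain every item of the itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    # Probability of seeing rhs in a transaction, given that lhs is present
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))   # 0.666...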

There are different types of association rules. In their simplest form, association rules report only the presence or absence of an association. Such rules are called Boolean Association Rules. An example of such a rule: "customers who purchase yogurt also purchase low-fat butter".

Rules that collect several association rules together are called Multilevel or Generalized Association Rules. When constructing such rules, the items are usually grouped according to a hierarchy, and the search is carried out at the highest conceptual level. For example, "customers who buy milk also buy bread". In this example, milk and bread contain a hierarchy of different types and brands, but a search at the lowest level may not turn up any interesting rules.

A more complex type of rule are Quantitative Association Rules. This type of rule is searched for using quantitative (e.g., price) or categorical (e.g., gender) attributes and is defined as {<attribute1: value>, <attribute2: value>, ..., <attributeN: value>} -> <target attribute: value>, where each value may also be an interval. For example, "customers who are between 30 and 35 years old with an income of more than 75,000 a year buy cars worth more than 20,000".

The above types of rules do not take into account the fact that transactions are, by their nature, time dependent. For example, searching over a period before a product is put on sale or after it has disappeared from the market will adversely affect the support threshold. With this in mind, the concept of attribute lifetime is introduced in algorithms for searching for Temporal Association Rules.

In general, the problem of finding association rules can be decomposed into two parts: finding frequently occurring itemsets, and generating rules based on the found frequent itemsets. Previous research has, for the most part, followed these lines and extended them in various directions.

Since the advent of the Apriori algorithm, it has been the one most commonly used for the first step. Many improvements, for example in speed and scalability, are aimed at refining the Apriori algorithm and at curbing its tendency to generate too many candidates for the most frequently occurring itemsets. Apriori generates itemsets using only the large itemsets found in the previous step, without revisiting the transactions. The modified AprioriTid algorithm improves on Apriori by using the database only on the first pass. The calculations in subsequent steps use only the data created in the first pass, which is much smaller than the original database; this results in a huge increase in performance. A further improved version of the algorithm, called AprioriHybrid, can be obtained by using Apriori on the first few passes and then, on later passes, when the k-th candidate itemsets can already fit entirely in the computer's memory, switching to AprioriTid.

Further efforts to improve the Apriori algorithm are related to parallelization of the algorithm (Count Distribution, Data Distribution, Candidate Distribution, etc.), its scaling (Intelligent Data Distribution, Hybrid Distribution) and the introduction of new data structures, such as trees of frequently occurring items (FP-growth).

The second step is mainly concerned with confidence and interestingness. New modifications add the quantitative, multilevel and temporal dimensions described above to traditional Boolean association rules. An evolutionary algorithm is often used to find rules.

2.2 Neural network algorithms

Artificial neural networks appeared as a result of applying the mathematical apparatus to the study of the functioning of the human nervous system with the aim of reproducing it, namely the nervous system's ability to learn and correct errors, which should make it possible to simulate, albeit rather roughly, the work of the human brain. The main structural and functional element of a neural network is the formal neuron, shown in Fig. 1, where x0, x1, ..., xn are the components of the vector of input signals, w0, w1, ..., wn are the weights of the neuron's input signals, and y is the neuron's output signal.

Fig. 1. Formal neuron: synapses (1), adder (2), converter (3).

A formal neuron consists of three types of elements: synapses, an adder and a converter. A synapse characterizes the strength of the connection between two neurons.

The adder sums the input signals, each previously multiplied by its corresponding weight. The converter implements a function of one argument - the output of the adder. This function is called the activation function or transfer function of the neuron.
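
As an illustration of the adder and converter just described, here is a minimal sketch of a formal neuron in Python; the input values, weights, bias and the choice of a sigmoid activation are illustrative assumptions rather than part of the original description.

import math

def formal_neuron(x, w, bias=0.0):
    # Adder: weighted sum of the input signals
    s = sum(xi * wi for xi, wi in zip(x, w)) + bias
    # Converter: activation (transfer) function, here a sigmoid
    return 1.0 / (1.0 + math.exp(-s))

y = formal_neuron(x=[0.5, 1.0, -0.2], w=[0.8, -0.4, 0.3], bias=0.1)
print(y)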

The formal neurons described above can be combined in such a way that the output signals of some neurons serve as inputs to others. The resulting set of interconnected neurons is called an artificial neural network (ANN) or, in short, a neural network.

There are the following three general types of neurons, depending on their position in the neural network:

· Input neurons, to which input signals are applied. Such neurons usually have one input with unit weight and no bias, and the output value of the neuron is equal to the input signal;

· Output neurons, whose output values represent the resulting output signals of the neural network;

· Hidden neurons, which have no direct connections to the input signals, while the values of their output signals are not output signals of the ANN.

According to the structure of interneuronal connections, two classes of ANNs are distinguished:

Feedforward ANNs, in which the signal propagates only from input neurons to output neurons.

Recurrent ANNs - ANNs with feedback. In such ANNs, signals can be transmitted between any neurons, regardless of their position in the ANN.

There are two general approaches to training ANNs:

Supervised learning (learning with a teacher).

Unsupervised learning (learning without a teacher).

Supervised learning involves the use of a pre-formed set of training examples. Each example contains a vector of input signals and a corresponding vector of reference output signals, which depend on the task at hand. This set is called the training set (training sample). Training the neural network consists in changing the weights of the ANN connections so that the values of the ANN's output signals differ as little as possible from the required output values for a given vector of input signals.

In unsupervised learning, the connection weights are adjusted either as a result of competition between neurons or taking into account the correlation of the output signals of the neurons between which there is a connection. In the case of unsupervised learning, the training sample is not used.

Neural networks are used to solve a wide range of problems, such as planning payloads for space shuttles and forecasting exchange rates. However, they are not often used in data mining systems because of the complexity of the model (knowledge fixed as the weights of several hundred interneuronal connections is practically impossible for a person to analyze and interpret) and the long training time on a large training set. On the other hand, neural networks have advantages for data analysis tasks such as robustness to noisy data and high accuracy.

2.3 Nearest neighbor and k-nearest neighbor methods

The nearest neighbor algorithm and the k-nearest neighbors algorithm (KNN) are based on the similarity of objects. The nearest neighbor algorithm selects, among all known objects, the object that is as close as possible (using a distance metric between objects, for example the Euclidean distance) to a new, previously unknown object. The main problem with the nearest neighbor method is its sensitivity to outliers in the training data.

The described problem can be avoided by the KNN algorithm, which selects the k nearest neighbors from all observations similar to the new object. A decision about the new object is made based on the classes of its nearest neighbors. An important task for this algorithm is choosing the coefficient k, the number of records that will be considered similar. A modification of the algorithm in which a neighbor's contribution is inversely proportional to its distance from the new object (the method of k weighted nearest neighbors) makes it possible to achieve greater classification accuracy. The k nearest neighbors method also makes it possible to estimate the confidence of the prediction: for example, if all k nearest neighbors have the same class, then the probability that the object being checked has the same class is very high.
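
A minimal sketch of distance-weighted k-nearest-neighbor classification, assuming numeric feature vectors and the Euclidean metric; the toy records, the value k = 3 and the inverse-distance weighting scheme are illustrative assumptions.

import math
from collections import defaultdict

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        votes[label] += 1.0 / (dist(features, query) + 1e-9)  # closer neighbors weigh more
    return max(votes, key=votes.get)

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.1], "B"), ([4.8, 5.3], "B")]
print(knn_predict(train, [1.1, 1.0], k=3))  # expected: "A"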

Among the features of the algorithm, it is worth noting its robustness to anomalous outliers, since the probability of such a record falling among the k nearest neighbors is small. If this does happen, the influence on the (especially weighted) vote for k > 2 is also likely to be insignificant, and consequently the influence on the classification result will be small as well. Other advantages are simplicity of implementation, ease of interpreting the algorithm's result, and the ability to modify the algorithm by using the most appropriate combination functions and metrics, which allows the algorithm to be adjusted to a specific task. The KNN algorithm also has a number of disadvantages. First, the data set used by the algorithm must be representative. Second, the model cannot be separated from the data: all examples must be used to classify a new example. This feature severely limits the applicability of the algorithm.

2.4 Decision trees

The term "decision trees" refers to a family of algorithms based on the representation of classification rules in a hierarchical, sequential structure. This is the most popular class of algorithms for solving data mining problems.

A family of algorithms for constructing decision trees makes it possible to predict the value of a parameter for a given case based on a large amount of data on other similar cases. Typically, algorithms of this family are used to solve problems that make it possible to divide all initial data into several discrete groups.

When decision tree algorithms are applied to a set of initial data, the result is displayed as a tree. Such algorithms make it possible to carry out several levels of such splitting, breaking the resulting groups (tree branches) into smaller ones based on other features. The splitting continues until the values to be predicted are the same (or, in the case of a continuous predicted parameter, close) for all resulting groups (leaves of the tree). It is these values that are used to make predictions based on the model.

The operation of algorithms for constructing decision trees is based on regression and correlation analysis methods. One of the most popular algorithms of this family is CART (Classification and Regression Trees), which splits the data in a tree branch into two child branches; whether a given branch is split further depends on how much of the initial data it describes. Some other similar algorithms allow a branch to be split into more child branches. In this case, the split is made on the basis of the highest correlation coefficient, for the data described by the branch, between the parameter on which the split is made and the parameter that must subsequently be predicted.
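
As an illustration, here is a minimal sketch of building a small classification tree with scikit-learn (an assumed tooling choice; the toy data, features and depth limit are hypothetical and chosen only to show how the resulting tree can be read as "if ... then" rules).

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [age, income]; target: 1 = bought the product, 0 = did not
X = [[25, 30000], [32, 54000], [47, 82000], [51, 110000], [23, 20000], [45, 95000]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # limit depth for readability
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # human-readable splitting rules
print(tree.predict([[40, 70000]]))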

The popularity of the approach is associated with its clarity and comprehensibility. But decision trees are fundamentally incapable of finding the "best" (most complete and accurate) rules in the data. They implement the naive principle of sequentially examining features and in effect find only parts of real patterns, creating merely the illusion of logical inference.

2.5 Clustering algorithms

Clustering is the task of partitioning a set of objects into groups called clusters. The main difference between clustering and classification is that the list of groups is not clearly defined and is determined in the course of the algorithm.

The application of cluster analysis in general terms is reduced to the following steps:

· selection of a sample of objects for clustering;

· definition of the set of variables by which the objects in the sample will be evaluated and, if necessary, normalization of the variable values;

· calculation of similarity measure values between objects;

· application of the cluster analysis method to create groups of similar objects (clusters);

· presentation of the results of the analysis.

After receiving and analyzing the results, it is possible to adjust the selected metric and clustering method until an optimal result is obtained.

Among the clustering algorithms, hierarchical and flat groups are distinguished. Hierarchical algorithms (also called taxonomy algorithms) do not build a single partition of the sample into disjoint clusters, but a system of nested partitions. Thus, the output of the algorithm is a tree of clusters, the root of which is the entire sample, and the leaves are the smallest clusters. Flat algorithms build one partition of objects into non-intersecting clusters.

Another classification of clustering algorithms is into crisp and fuzzy algorithms. Crisp (non-overlapping) algorithms assign a cluster number to each sample object, that is, each object belongs to only one cluster. Fuzzy (overlapping) algorithms assign each object a set of real values showing the degree to which the object relates to the clusters. Thus, each object belongs to each cluster with some degree of membership.

There are two main types of hierarchical clustering algorithms: ascending (agglomerative) and descending (divisive). Descending algorithms work top-down: first, all objects are placed in one cluster, which is then split into smaller and smaller clusters. More common are ascending algorithms, which initially place each object in a separate cluster and then merge the clusters into larger and larger ones until all objects in the sample are contained in a single cluster. In this way, a system of nested partitions is constructed. The results of such algorithms are usually presented in the form of a tree (dendrogram).
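
A minimal sketch of bottom-up (agglomerative) clustering using SciPy; the generated toy points, the Ward linkage and the cut into three flat clusters are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: three well-separated groups of two-dimensional points
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

Z = linkage(X, method="ward")                     # build the full merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into 3 flat clusters
print(labels)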

The disadvantage of hierarchical algorithms is the system of complete partitions, which may be redundant in the context of the problem being solved.

Let us now consider flat algorithms. The simplest in this class are quadratic-error algorithms. The clustering problem for these algorithms can be viewed as constructing an optimal partition of the objects into groups, where optimality is defined as the requirement to minimize the total squared partitioning error:

e = Σ_{j=1..k} Σ_{x_i ∈ X_j} || x_i − c_j ||²,

where c_j is the "center of mass" of cluster j (the point whose coordinates are the mean values of the characteristics for that cluster).

The most common algorithm in this category is the k-means method. This algorithm builds a given number of clusters located as far apart from each other as possible. The work of the algorithm is divided into several stages:

1. Randomly choose k points that serve as the initial "centers of mass" of the clusters.

2. Assign each object to the cluster with the nearest "center of mass".

3. Recalculate the "centers of mass" of the clusters as the means of the objects currently assigned to them.

4. If the stopping criterion is not satisfied, return to step 2.

As a criterion for stopping the operation of the algorithm, the minimum change in the mean square error is usually chosen. It is also possible to stop the algorithm if at step 2 there were no objects that moved from cluster to cluster. The disadvantages of this algorithm include the need to specify the number of clusters for splitting.
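
A minimal sketch of the k-means procedure described above, in plain NumPy; the toy data, k = 3 and the convergence tolerance are illustrative assumptions (a production implementation would also handle empty clusters).

import numpy as np

def k_means(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centers
    for _ in range(n_iter):
        # step 2: assign each object to the nearest "center of mass"
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute the centers as cluster means
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.4, size=(30, 2)) for loc in ([0, 0], [4, 4], [0, 4])])
labels, centers = k_means(X, k=3)
print(centers)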

The most popular fuzzy clustering algorithm is the c-means algorithm, a modification of the k-means method. Its steps are:

1. Choose an initial fuzzy partition of the n objects into k clusters by choosing a membership matrix U of size n x k.

2. Using the matrix U, compute the value of the fuzzy error criterion:

E = Σ_{i=1..n} Σ_{j=1..k} U_ij · || x_i − c_j ||²,

where c_j is the "center of mass" of fuzzy cluster j.

3. Regroup the objects so as to reduce this value of the fuzzy error criterion.

4. Return to step 2 until the changes in the matrix U become insignificant.

This algorithm may not be suitable if the number of clusters is not known in advance, or if it is necessary to uniquely attribute each object to one cluster.

The next group of algorithms are those based on graph theory. The essence of such algorithms is that the sample of objects is represented as a graph G = (V, E), whose vertices correspond to objects and whose edges have weights equal to the "distances" between objects. The advantages of graph clustering algorithms are their clarity, relative ease of implementation, and the possibility of various improvements based on geometric considerations. The main algorithms are the algorithm for extracting connected components, the algorithm for constructing a minimum spanning tree, and the layer-by-layer clustering algorithm.

To select the distance threshold R (used, for example, in the connected-components algorithm), a histogram of the distribution of pairwise distances is usually constructed. In problems with a well-defined cluster structure in the data, the histogram will have two peaks: one corresponding to intra-cluster distances, the other to inter-cluster distances. The parameter R is chosen from the zone of the minimum between these peaks. At the same time, it is rather difficult to control the number of clusters using a distance threshold.

The minimum spanning tree algorithm first builds a minimum spanning tree on the graph and then successively removes the edges with the highest weight. The layer-by-layer clustering algorithm is based on extracting the connected components of the graph at a certain level of distances between objects (vertices). The distance level is set by a distance threshold c: for example, two vertices are connected by an edge if the distance between the corresponding objects does not exceed c.

The layer-by-layer clustering algorithm generates a sequence of subgraphs of the graph G that reflect the hierarchical relationships between clusters:

G_0 ⊆ G_1 ⊆ … ⊆ G_m,

where G_t = (V, E_t) is the graph at distance level c_t,
c_t is the t-th distance threshold, m is the number of hierarchy levels,
G_0 = (V, ∅), with ∅ the empty set of graph edges, obtained at c_0 = 0,
G_m = G, that is, the graph of objects without restrictions on the distance (the length of the graph edges), since c_m = 1.

By changing the distance thresholds (c_0, …, c_m), where 0 = c_0 < c_1 < … < c_m = 1, it is possible to control the depth of the hierarchy of the resulting clusters. Thus, the layer-by-layer clustering algorithm is able to create both a flat partition of the data and a hierarchical one.
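
A minimal sketch of the minimum-spanning-tree approach described above, using SciPy; the toy points, the Euclidean metric and the removal of the two heaviest edges (to obtain three clusters) are illustrative assumptions.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(15, 2)) for loc in ([0, 0], [4, 0], [2, 4])])

dist = squareform(pdist(X))                    # pairwise distances = edge weights of the graph
mst = minimum_spanning_tree(dist).toarray()    # build the minimum spanning tree

# Remove the 2 heaviest MST edges; the remaining connected components are the clusters
edges = np.argwhere(mst > 0)
for i, j in edges[np.argsort(mst[mst > 0])[-2:]]:
    mst[i, j] = 0
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters, labels)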

Clustering achieves the following goals:

· it improves understanding of the data by identifying structural groups: dividing the sample into groups of similar objects makes it possible to simplify further processing and decision-making by applying a separate analysis method to each cluster;

· it allows compact storage of the data: instead of storing the entire sample, one typical observation from each cluster can be kept;

· it enables the detection of new, atypical objects that did not fall into any cluster.

Usually, clustering is used as an auxiliary method in data analysis.

2.6 Genetic algorithms

Genetic algorithms are among the universal optimization methods that allow solving problems of various types (combinatorial, general problems with and without restrictions) and varying degrees of complexity. At the same time, genetic algorithms are characterized by the possibility of both single-criteria and multi-criteria search in a large space, the landscape of which is not smooth.

This group of methods uses an iterative process of evolving a sequence of generations of models, including the operations of selection, mutation and crossing (crossover). At the start of the algorithm, the population is formed randomly. To assess the quality of the encoded solutions, a fitness function is used, which is needed to compute the fitness of each individual. Based on the results of evaluating the individuals, the fittest of them are selected for crossing. As a result of crossing the selected individuals by means of the genetic crossover operator, offspring are created whose genetic information is formed through the exchange of chromosomal information between the parent individuals. The created descendants form a new population, and some of the descendants mutate, which is expressed as a random change in their genotypes. The stage that includes the sequence "population evaluation - selection - crossing - mutation" is called a generation. The evolution of a population consists of a sequence of such generations.
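
A minimal sketch of the generation loop described above, for maximizing a simple fitness function over bit-string genotypes; the fitness function ("one-max": the number of 1s), the population size, the tournament selection and the mutation rate are illustrative assumptions.

import random

def evolve(fitness, n_bits=20, pop_size=30, generations=50, p_mut=0.02, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            # Evaluation + selection: the fitter of two random individuals becomes a parent
            a, b = random.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = random.randrange(1, n_bits)                 # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < p_mut else g for g in child]  # mutation
            offspring.append(child)
        pop = offspring                                       # the offspring form the new population
    return max(pop, key=fitness)

best = evolve(fitness=sum)
print(sum(best), best)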

The following algorithms for selecting individuals for crossing are distinguished:

Panmixia. Both individuals that make up the parent pair are randomly selected from the entire population. Any individual can become a member of several pairs. This approach is universal, but the efficiency of the algorithm decreases with the growth of the population.

· Selection. Individuals with fitness not lower than average can become parents. This approach provides faster convergence of the algorithm.

Inbreeding. The method is based on forming a pair on the basis of close kinship. Kinship here refers to the distance between members of the population, both in the sense of the geometric distance between individuals in the parameter space and in the sense of the Hamming distance between genotypes; accordingly, genotypic and phenotypic inbreeding are distinguished. The first member of the pair for crossing is chosen randomly, and the second is, with higher probability, the individual closest to it. Inbreeding can be characterized by the property of concentrating the search in local nodes, which in effect splits the population into separate local groups around regions of the landscape suspected of containing an extremum.

Outbreeding. Formation of a pair on the basis of distant relationship, for the most distant individuals. Outbreeding is aimed at preventing the convergence of the algorithm to already found solutions, forcing the algorithm to explore new, unexplored areas.

Algorithms for the formation of a new population:

Selection with displacement. Of all individuals with the same genotype, preference is given to the one whose fitness is higher. Thus, two goals are achieved: the best solutions found, with different chromosome sets, are not lost, and sufficient genetic diversity is constantly maintained in the population. Displacement forms a new population of widely differing individuals instead of individuals clustering around the currently found solution. This method is used for multi-extremal problems.

Elite selection. Elite selection methods ensure that the best members of a population are sure to survive when selected. At the same time, some of the best individuals pass without any changes to the next generation. The fast convergence provided by elite selection can be compensated by an appropriate method of selecting parent pairs. In this case, outbreeding is often used. It is this combination of "outbreeding - elite selection" that is one of the most effective.

· Tournament selection. Tournament selection implements n tournaments to select n individuals. Each tournament is built on a selection of k elements from the population, and the choice of the best individual among them. Tournament selection with k = 2 is the most common.

One of the most popular applications of genetic algorithms in the field of Data Mining is the search for the optimal model (a search for an algorithm that matches the specifics of a particular domain). Genetic algorithms are primarily used to optimize the topology and weights of neural networks, but they can also be used as a standalone tool.

3. Applications

Data Mining technology has a really wide range of applications, being, in fact, a set of universal tools for analyzing data of any type.

Marketing

One of the very first areas where data mining technologies were applied was the field of marketing. The task with which the development of Data Mining methods began is called shopping cart analysis.

This task consists in identifying products that buyers tend to purchase together. Knowledge of the shopping basket is needed for advertising campaigns, for forming personalized recommendations to customers, and for developing a strategy for stocking goods and laying them out on the sales floor.

Also in marketing, such tasks are solved as determining the target audience of a particular product for its more successful promotion; research on time patterns that helps businesses make inventory decisions; creation of predictive models, which enables enterprises to recognize the nature of the needs of various categories of customers with certain behavior; predicting customer loyalty, which allows you to identify in advance the moment of customer departure when analyzing his behavior and, possibly, prevent the loss of a valuable customer.

Industry

One of the important areas in this area is monitoring and quality control, where, using analysis tools, it is possible to predict equipment failure, the appearance of malfunctions, and plan repair work. Predicting the popularity of certain features and knowing which features are usually ordered together helps to optimize production, orienting it to the real needs of consumers.

Medicine

In medicine, data analysis is also used quite successfully. An example of tasks can be the analysis of examination results, diagnostics, comparison of the effectiveness of treatments and drugs, analysis of diseases and their spread, identification of side effects. Data mining technologies such as association rules and sequential patterns have been successfully used to identify relationships between drug use and side effects.

Molecular genetics and genetic engineering

Perhaps the most acute and at the same time clearly formulated task of discovering regularities in experimental data arises in molecular genetics and genetic engineering. Here it is formulated as the identification of markers, understood as genetic codes that control certain phenotypic traits of a living organism. Such codes may contain hundreds, thousands or more related items. Another result of analytical data processing is the relationship, discovered by geneticists, between changes in the human DNA sequence and the risk of developing various diseases.

Applied chemistry

Data mining methods are also used in the field of applied chemistry. Here, the question often arises of elucidating the features of the chemical structure of certain compounds that determine their properties. This task is especially relevant in the analysis of complex chemical compounds, the description of which includes hundreds and thousands of structural elements and their bonds.

Fight against crime

In the security field, Data Mining tools have been used only relatively recently, but practical results have already been obtained that confirm the effectiveness of data mining in this area. Swiss scientists have developed a system for analyzing protest activity in order to predict future incidents, and a system for tracking emerging cyber threats and hacker activity around the world; the latter makes it possible to forecast cyber threats and other information security risks. Data Mining methods are also successfully used to detect credit card fraud: by analyzing past transactions that later turned out to be fraudulent, a bank identifies certain patterns of such fraud.

Other applications

· Risk analysis. For example, by identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a well-known case in the United States when a large insurance company found that the amounts paid out on the applications of people who are married are twice the amount on the applications of single people. The company has responded to this new knowledge by revisiting its general family discount policy.

· Meteorology. Weather is predicted by neural network methods; in particular, Kohonen self-organizing maps are used.

· Personnel policy. Analysis tools help HR departments to select the most successful candidates based on the analysis of their resume data, model the characteristics of ideal employees for a particular position.

4. Producers of Data Mining Tools

Data Mining tools traditionally belong to expensive software products. Therefore, until recently, the main consumers of this technology were banks, financial and insurance companies, large trading enterprises, and the main tasks requiring the use of Data Mining were the assessment of credit and insurance risks and the development of a marketing policy, tariff plans and other principles of working with clients. In recent years, the situation has undergone certain changes: relatively inexpensive Data Mining tools and even free distribution systems have appeared on the software market, which has made this technology available to small and medium-sized businesses.

Among the paid tools and systems for data analysis, the leaders are SAS Institute (SAS Enterprise Miner), SPSS (SPSS, Clementine) and StatSoft (STATISTICA Data Miner). Well-known solutions also come from Angoss (Angoss KnowledgeSTUDIO), IBM (IBM SPSS Modeler), Microsoft (Microsoft Analysis Services) and Oracle (Oracle Data Mining).

The choice of free software is also varied. There are both universal analysis tools, such as JHepWork, KNIME, Orange and RapidMiner, and specialized tools, such as Carrot2, a framework for clustering text data and search results, Chemicalize.org, a solution in the field of applied chemistry, and NLTK (Natural Language Toolkit), a natural language processing tool.

5. Criticism of methods

The results of Data Mining depend largely on the level of data preparation, and not on the "wondrous capabilities" of some algorithm or set of algorithms. About 75% of the work in Data Mining consists of collecting data, which is done before analysis tools are even applied. Careless use of the tools leads to a waste of the company's potential, and sometimes of millions of dollars.

The opinion of Herb Edelstein, a world-renowned expert in Data Mining, data warehousing and CRM: "A recent study by Two Crows showed that Data Mining is still at an early stage of development. Many organizations are interested in this technology, but only a few are actively implementing such projects. Another important point became clear: the process of implementing Data Mining in practice turns out to be more complicated than expected. Teams have been seduced by the myth that Data Mining tools are easy to use. It is assumed that it is enough to run such a tool on a terabyte database and useful information will instantly appear. In fact, a successful data mining project requires an understanding of the essence of the business, knowledge of the data and tools, and of the data analysis process." Thus, before using Data Mining technology, it is necessary to carefully analyze the limitations imposed by the methods and the critical issues associated with it, and to soberly assess the capabilities of the technology. The critical issues include the following:

1. Technology cannot provide answers to questions that have not been asked. It cannot replace the analyst, but only gives him a powerful tool to facilitate and improve his work.

2. The complexity of the development and operation of the Data Mining application.

Because this technology is a multidisciplinary field, developing an application that includes Data Mining requires involving specialists from different fields and ensuring high-quality interaction between them.

3. User qualification.

Various Data Mining tools have a different degree of "friendliness" of the interface and require a certain user skill. Therefore, the software must correspond to the user's level of training. The use of Data Mining should be inextricably linked with the improvement of the user's skills. However, there are currently few Data Mining specialists who are well versed in business processes.

4. Extracting useful information is impossible without a good understanding of the essence of the data.

Careful model selection and interpretation of the dependencies or patterns that are found are required. Therefore, working with such tools requires close cooperation between a domain expert and a specialist in Data Mining tools. The resulting models must be well integrated into business processes so that they can be evaluated and kept up to date. Recently, Data Mining systems have been supplied as part of data warehousing technology.

5. Complexity of data preparation.

Successful analysis requires high-quality data preprocessing. According to analysts and database users, the preprocessing process can take up to 80% of the entire Data Mining process.

Thus, for the technology to pay off, considerable effort and time must be invested in preliminary data analysis, model selection and model tuning.

6. A large percentage of false, unreliable or useless results.

With the help of Data Mining technologies, you can find genuinely valuable information that can give a significant advantage in further planning, management and decision making. However, the results obtained by Data Mining methods quite often contain false and meaningless conclusions. Many experts argue that Data Mining tools can produce a huge number of statistically unreliable results. To reduce the share of such results, it is necessary to check the adequacy of the obtained models on test data. It is, however, impossible to avoid false conclusions completely.
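
To illustrate such a check, here is a minimal sketch of validating a model on held-out test data. It assumes Python with scikit-learn and a bundled demonstration dataset; neither is prescribed by this text.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset; in practice this would be the organization's own data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out part of the data: the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# A large gap between training and test accuracy signals an unreliable model.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))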

7. High cost.

A high-quality software product is the result of significant labor costs on the part of the developer. Therefore, Data Mining software traditionally belongs to the category of expensive software products.

8. Availability of sufficient representative data.

Data mining tools, unlike statistical ones, theoretically do not require a strictly defined amount of historical data. This feature can cause the detection of unreliable, false models and, as a result, making incorrect decisions based on them. It is necessary to control the statistical significance of the discovered knowledge.
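
One possible way to control the statistical significance of discovered knowledge is a permutation test. The sketch below assumes Python with scikit-learn; the dataset and parameters are illustrative only.

from sklearn.datasets import load_iris
from sklearn.model_selection import permutation_test_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# The class labels are shuffled many times; if the model scores just as well
# on shuffled labels, the "discovered knowledge" is likely spurious.
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, n_permutations=100, cv=5, random_state=0)

print("model score:", round(score, 3))
print("p-value under permuted labels:", round(p_value, 3))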


Conclusion

A brief description has been given of the areas of application of Data Mining technology and of its criticism, together with the opinions of experts in this field.

References

1. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2nd ed. University of Illinois at Urbana-Champaign.

2. Berry, Michael J. A. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 2nd ed.

3. Siu Ning Lam. Discovering Association Rules in Data Mining. Department of Computer Science, University of Illinois at Urbana-Champaign.

Data mining tools

Currently, Data Mining technology is represented by a number of commercial and freely distributed software products. A fairly complete and regularly updated list of these products can be found on the website www.kdnuggets.com, which is dedicated to Data Mining. Data Mining software products can be classified according to the same principles that underlie the classification of the technology itself, but such a classification would have little practical value. Due to high competition in the market and the desire for completeness of technical solutions, many Data Mining products cover literally all aspects of the application of analytical technologies. It is therefore more expedient to classify Data Mining products by how they are implemented and, accordingly, by what potential for integration they provide. Of course, this is also a convention, since such a criterion does not allow clear boundaries to be drawn between products. However, this classification has one undeniable advantage: it allows a quick decision to be made about choosing a ready-made solution when initiating projects in data analysis, decision support systems development, data warehouse creation, etc.

So, Data Mining products can be conditionally divided into three broad categories:

    included, as an integral part, in database management systems;

    libraries of Data Mining algorithms with related infrastructure;

    boxed or desktop solutions ("black boxes").

The products of the first two categories provide the greatest opportunities for integration and allow you to realize the analytical potential in almost any application in any field. Boxed applications, in turn, may provide some unique data mining advances or be specialized for a particular application. However, in most cases it is problematic to integrate them into broader solutions.

The inclusion of analytical capabilities in commercial database management systems is a natural trend with great potential. Indeed, where better to place data processing tools than where the data itself is concentrated? Based on this principle, Data Mining functionality is currently implemented in the following commercial databases:

    Microsoft SQL Server

Main points

    Data mining makes it possible to automatically generate, from a large amount of accumulated data, hypotheses that can be tested by other analysis tools (for example, OLAP).

    Data Mining is the discovery by a machine (algorithms, artificial intelligence) of hidden knowledge in raw data: knowledge that was previously unknown, non-trivial, practically useful and accessible to human interpretation.

    Data Mining methods solve three main tasks: classification and regression, the search for association rules, and clustering. By purpose, the tasks are divided into descriptive and predictive; by the way they are solved, the methods are divided into supervised learning (learning with a teacher) and unsupervised learning (learning without a teacher). A minimal code sketch of one supervised and one unsupervised task follows this list.

    The task of classification and regression is reduced to determining the value of the dependent variable of an object by its independent variables. If the dependent variable takes on numerical values, then one speaks of a regression problem, otherwise it is a classification problem.

    When searching for association rules, the goal is to find frequent dependencies (or associations) between objects or events. The found dependencies are presented in the form of rules and can be used both for a better understanding of the nature of the analyzed data and for predicting events.

    The task of clustering is to search for independent groups (clusters) and their characteristics in the entire set of analyzed data. Solving this problem helps to better understand the data. In addition, the grouping of homogeneous objects makes it possible to reduce their number and, consequently, facilitate analysis.

    Data mining methods are at the intersection of different areas of information technology: statistics, neural networks, fuzzy sets, genetic algorithms, etc.

    Intellectual analysis includes the following steps: understanding and formulating the analysis problem, preparing data for automated analysis, applying Data Mining methods and building models, checking the built models, interpreting models by a person.

    Before applying Data Mining methods, the original data must be transformed. The type of transformation depends on the applied methods.

    Data Mining methods can be effectively used in various areas of human activity: in business, medicine, science, telecommunications, etc.
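
The sketch referred to in the list above contrasts a supervised task (classification) with an unsupervised one (clustering). It assumes Python with scikit-learn and its bundled iris dataset, which are only one possible choice of tools.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: the dependent variable (class label) is known for the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy on test data:", clf.score(X_test, y_test))

# Clustering: no dependent variable; the algorithm itself finds groups of similar objects.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])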

3. Analysis of text information - Text Mining

Analysis of structured information stored in databases requires preliminary processing: database design, entry of information according to certain rules, placement of the information in special structures (for example, relational tables), etc. Thus, additional effort is required before this information can be analyzed directly and new knowledge obtained from it. This effort, however, is not always related to the analysis itself and does not necessarily lead to the desired result, which reduces the efficiency of analyzing structured information. In addition, not all types of data can be structured without losing useful information. For example, text documents can hardly be converted to a tabular form without losing the semantics of the text and the relationships between entities. For this reason, such documents are stored in the database without transformation, as text fields (BLOB fields). At the same time, a huge amount of information is hidden in such text, but its unstructured nature does not allow the use of Data Mining algorithms. The solution to this problem is provided by methods for analyzing unstructured text. In Western literature, such analysis is called Text Mining.

Analysis methods in unstructured texts lie at the intersection of several areas: Data Mining, natural language processing, information retrieval, information extraction and knowledge management.

Definition of Text Mining: Knowledge discovery in text is the non-trivial process of discovering truly new, potentially useful and understandable patterns in unstructured text data.

As you can see, it differs from the definition of Data Mining only in the new concept of "unstructured text data". Unstructured text data here means a set of documents that form a logically coherent text without any restrictions on its structure. Examples of such documents are web pages, e-mail messages, regulatory documents and the like. In general, such documents can be complex and large and can include not only text but also graphic information. Documents that use the Extensible Markup Language (XML), the Standard Generalized Markup Language (SGML) and other similar text structure conventions are called semi-structured documents. They can also be processed by Text Mining methods.

The process of analyzing text documents can be represented as a sequence of several steps:

    Search for information. The first step is to identify which documents need to be reviewed and make them available. As a rule, users can determine the set of documents to be analyzed on their own - manually, but with a large number of documents, it is necessary to use automated selection options according to specified criteria.

    Document preprocessing. At this step, the simplest but necessary transformations are performed on the documents to present them in the form that Text Mining methods work with. The purpose of these transformations is to remove unnecessary words and to give the text a more rigorous form. The preprocessing methods are described in more detail in the "Text Preprocessing" section below.

    Information extraction. Extracting information from selected documents involves highlighting key concepts in them, over which further analysis will be performed.

    Application of Text Mining methods. At this step, the patterns and relationships present in the texts are extracted. This step is the central one in the text analysis process; it is here that the practical tasks described below are solved.

    Interpretation of results. The last step in the knowledge discovery process involves interpreting the results. As a rule, interpretation consists either in presenting the results in a natural language, or in their visualization in a graphical form.

Visualization can also be used as a text analysis tool. To do this, key concepts are extracted, which are presented graphically. This approach helps the user to quickly identify the main topics and concepts, as well as determine their importance.

Text Preprocessing

One of the main problems of text analysis is the large number of words in a document. If each of these words is subjected to analysis, then the search time for new knowledge will increase dramatically and will hardly meet the requirements of users. At the same time, it is obvious that not all words in the text carry useful information. In addition, due to the flexibility of natural languages, formally different words (synonyms, etc.) actually mean the same concepts. Thus, the removal of non-informative words, as well as the reduction of words that are similar in meaning to a single form, significantly reduce the time of text analysis. The elimination of the described problems is performed at the stage of preprocessing the text.

The following methods are usually used to remove uninformative words and to make the text more rigorous:

    Removing stop words. Stop words are words that are auxiliary and carry little information about the content of the document.

    Stemming - morphological normalization. It consists in converting each word to its normal (base) form.

    N-grams - an alternative to morphological analysis and stop-word removal. They make the text stricter, but they do not solve the problem of reducing the number of non-informative words;

    Case folding. This technique consists in converting all characters to upper or lower case.

A combination of these methods is the most effective.
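
NLTK, one of the free tools mentioned earlier, can be used to sketch these preprocessing steps. The sample sentence, the English-language resources and the bigram size are assumptions made only for illustration.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# One-time downloads are required, e.g. nltk.download("punkt") and nltk.download("stopwords").
text = "Data mining tools discover hidden patterns in large collections of documents."

# Case folding and tokenization.
tokens = word_tokenize(text.lower())

# Removing stop words - auxiliary words that carry little information.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: reducing each word to its normal (base) form.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)

# Word N-grams (here bigrams) as an alternative text representation.
print(list(ngrams(stems, 2)))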

Tasks of Text Mining

At present, the literature describes many applied problems that can be solved by analyzing text documents: the classic Data Mining tasks of classification and clustering, and tasks typical only of text documents, such as automatic annotation and the extraction of key concepts.

Classification is a standard task from the field of Data Mining. Its purpose is to define for each document one or more predefined categories to which the document belongs. A feature of the classification problem is the assumption that the set of classified documents does not contain "garbage", i.e., each of the documents corresponds to some given category.

A special case of the classification problem is the task of determining the subject of a document.

The purpose of clustering documents is to automatically identify groups of semantically similar documents among a given fixed set. Note that groups are formed only on the basis of pairwise similarity of document descriptions, and no characteristics of these groups are specified in advance.
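
A rough sketch of document clustering: documents are represented as TF-IDF vectors and grouped with k-means. Python with scikit-learn, the toy documents and the number of clusters are assumptions made for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = [
    "credit risk and bank loans",
    "insurance premiums and credit scoring",
    "clustering of text documents",
    "text mining extracts patterns from documents",
]

vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise similarity of document descriptions: spatial proximity of the vectors.
print("similarity of documents 0 and 1:", cosine_similarity(vectors[0], vectors[1])[0][0])

# Groups are formed only from these pairwise similarities; no group characteristics are given in advance.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print("cluster labels:", labels)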

Automatic annotation (summarization) allows you to shorten the text while maintaining its meaning. The solution to this problem is usually controlled by the user by determining the number of sentences to be extracted or the percentage of text to be extracted in relation to the entire text. The result includes the most significant sentences in the text.
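
One simple extractive approach is sketched below: each sentence is scored by the frequency of its words in the whole text, and the top-scoring sentences form the annotation. The toy text and the two-sentence limit are illustrative assumptions.

import re
from collections import Counter

text = ("Data mining finds hidden patterns in data. "
        "Patterns help companies plan marketing. "
        "The weather was pleasant yesterday. "
        "Hidden patterns support decision making in companies.")

sentences = re.split(r"(?<=[.!?])\s+", text.strip())
freq = Counter(re.findall(r"[a-z]+", text.lower()))

def sentence_score(sentence):
    # The significance of a sentence is the summed frequency of its words.
    return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

# Keep the two most significant sentences, preserving their original order.
top = sorted(sentences, key=sentence_score, reverse=True)[:2]
print(" ".join(s for s in sentences if s in top))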

The primary goal of key concept extraction is to identify facts and relationships in a text. In most cases, such concepts are proper and common nouns: first and last names of people, names of organizations, and the like. Concept extraction algorithms can use dictionaries to identify some terms and linguistic patterns to detect others.
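
A minimal sketch of dictionary- and pattern-based concept extraction; the dictionary, the regular expression and the sample text are purely illustrative assumptions.

import re

text = ("John Smith, head of analytics at Acme Corp, met Mary Jones "
        "to discuss a contract with Globex Inc.")

# Dictionary lookup for known organization names.
organizations = {"Acme Corp", "Globex Inc"}
found_orgs = [org for org in organizations if org in text]

# A simple linguistic pattern: two consecutive capitalized words as a candidate person name.
person_pattern = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")
found_people = [c for c in person_pattern.findall(text) if c not in organizations]

print("organizations:", found_orgs)
print("possible person names:", found_people)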

Text-based navigation allows users to move through a collection of documents by topics and significant terms. This is done by identifying key concepts and some of the relationships between them.

Trend analysis allows you to identify trends in document sets over a period of time. A trend can be used, for example, to detect changes in a company's interests from one market segment to another.

The search for associations is also one of the main tasks of Data Mining. To solve it, in a given set of documents, associative relations between key concepts are identified.

There are a fairly large number of varieties of these problems, as well as methods for solving them. This once again confirms the importance of text analysis. The rest of this chapter discusses solutions to the following tasks: key concept extraction, classification, clustering, and automatic annotation.

Classification of text documents

Classification of text documents, as in the classification of objects, consists in assigning a document to one of a set of previously known classes. With respect to text documents, classification is often called categorization or rubrication. These names clearly come from the task of organizing documents into catalogs, categories and headings. The catalog structure can be either single-level or multi-level (hierarchical).

Formally, the task of classifying text documents is described by a set of sets: a set of documents D = {d1, ..., dn} and a set of categories (rubrics) C = {c1, ..., cm}.

In the classification problem, it is required to build a procedure based on these data, which consists in finding the most probable category from the set C for the document under study.

Most text classification methods are somehow based on the assumption that documents belonging to the same category contain the same features (words or phrases), and the presence or absence of such features in a document indicates its belonging or non-belonging to a particular topic.

Such a set of features is often called a dictionary, since it consists of lexemes that include words and/or phrases that characterize a category.

It should be noted that these feature sets are what distinguish the classification of text documents from the classification of objects in Data Mining, where objects are characterized by a set of attributes.

The decision to assign document d to category c is made on the basis of the overlap between the features of the document and the feature set of the category.

The task of classification methods is to select such features in the best possible way and to formulate the rules on the basis of which a decision will be made to assign a document to a rubric.
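
As an illustration of this scheme, here is a minimal sketch in which the feature dictionary is built automatically through TF-IDF weighting and the assignment rule is a naive Bayes classifier. Python with scikit-learn, the tiny training set and the two categories are assumptions made for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "interest rates and bank credit risks",
    "loan portfolio and insurance premiums",
    "football match ended with a late goal",
    "the team won the championship final",
]
train_labels = ["finance", "finance", "sport", "sport"]

# Words become weighted features; the classifier learns which features
# indicate membership in which rubric.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Assign a new document to the most probable category from the set C.
print(model.predict(["credit risk assessment for a new loan"]))  # expected: ['finance']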

Text information analysis tools

    Oracle tools - Oracle Text

Beginning with Oracle 7.3.3, text analysis tools have been an integral part of Oracle products. These tools have since been developed further and received a new name, Oracle Text: a software package integrated into the DBMS that allows efficient handling of queries over unstructured texts. Text processing is combined with the capabilities provided to the user for working with relational databases; in particular, it became possible to use SQL when writing text-processing applications.

The main task that Oracle Text tools are aimed at is the task of searching for documents by their content - by words or phrases, which, if necessary, are combined using Boolean operations. The search results are ranked by importance, taking into account the frequency of occurrence of query words in the found documents.

    IBM tools - Intelligent Miner for Text

The IBM Intelligent Miner for Text product is a set of separate utilities that run from the command line or from scripts, independently of each other. The system combines a number of utilities for solving text analysis problems.

IBM Intelligent Miner for Text brings together a powerful set of tools based primarily on information retrieval mechanisms, which is characteristic of the product as a whole. The system consists of a number of basic components that have independent value outside Text Mining technology.

    SAS Institute tools - Text Miner

The American company SAS Institute has released the SAS Text Miner system for comparing certain grammatical and verbal sequences in written speech. Text Miner is very versatile, because it can work with text documents of various formats - in databases, in file systems, and on the web.

Text Miner provides logical text processing within the SAS Enterprise Miner package environment. This allows users to enrich the data analysis process by integrating unstructured textual information with existing structured data such as age, income, and shopping patterns.

Main points

    Knowledge discovery in text is a non-trivial process of discovering really new, potentially useful and understandable patterns in unstructured text data.

    The process of analyzing text documents can be represented as a sequence of several steps: searching for information, preprocessing documents, extracting information, applying Text Mining methods, and interpreting the results.

    Usually, the following methods are used to remove non-informative words and make the text more rigorous: removing stop words, stemming, N-grams, case folding.

    The tasks of text information analysis are: classification, clustering, automatic annotation, extraction of key concepts, text navigation, trend analysis, association search, etc.

    Extraction of key concepts from texts can be considered both as a separate applied task and as a separate stage of text analysis. In the latter case, the facts extracted from the text are used to solve various problems of analysis.

    The process of extracting key concepts using templates is carried out in two stages: at the first stage, individual facts are extracted from text documents using lexical analysis; at the second stage, the integration of the extracted facts and/or the derivation of new facts is performed.

    Most text classification methods are somehow based on the assumption that documents belonging to the same category contain the same features (words or phrases), and the presence or absence of such features in a document indicates its belonging or non-belonging to a particular topic.

    Most clustering algorithms require that the data be represented in the vector space model, which is widely used in information retrieval and uses the metaphor of spatial proximity to express semantic similarity.

    There are two main approaches to automatic annotation of text documents: extraction (highlighting the most important fragments) and generalization (using pre-collected knowledge).

Conclusion

Data mining is one of the most relevant and popular areas of applied mathematics. Today's business and production processes generate huge amounts of data, and it is becoming increasingly difficult for people to interpret and react to large volumes of data that change dynamically at run time, let alone to prevent critical situations. Data Mining makes it possible to extract the maximum of useful knowledge from multidimensional, heterogeneous, incomplete, inaccurate, contradictory and indirect data. It helps to do this efficiently even when the amount of data is measured in gigabytes or terabytes, and it helps to build algorithms that can learn to make decisions in various professional fields.

Data mining protects people from information overload by turning operational data into useful information so that the right actions can be taken at the right time.

Applied developments are carried out in the following areas: forecasting in economic systems; automation of marketing research and analysis of client environments for manufacturing, trade, telecommunications and Internet companies; automation of credit decision making and credit risk assessment; monitoring of financial markets; automatic trading systems.

Bibliography

    A. A. Barseghyan, M. S. Kupriyanov, V. V. Stepanenko, I. I. Kholod. Data Analysis Technologies: Data Mining, Visual Mining, Text Mining, OLAP. 2nd ed., revised and expanded.

    http://inf.susu.ac.ru/~pollak/expert/G2/g2.htm - Internet article

    http://www.piter.com/contents/978549807257/978549807257_p.pdf - Data analysis technologies
