STATISTICA products. III. An overview of using the STATISTICA Neural Networks package


UDC BBK N45

N45 Neural networks. STATISTICA Neural Networks: Methodology and technologies of modern data analysis / Edited by V. P. Borovikov. 2nd ed., revised and expanded. M.: Hot Line Telecom, ill. ISBN

The book outlines neural network methods of data analysis based on the STATISTICA Neural Networks package (produced by StatSoft), fully adapted for Russian-speaking users. It presents the foundations of the theory of neural networks and pays much attention to solving practical problems. It comprehensively covers the methodology and technology of conducting research with STATISTICA Neural Networks, a powerful tool for data analysis and forecasting with wide applications in business, industry, management, and finance. The book contains many examples of data analysis and practical recommendations for analysis, forecasting, classification, pattern recognition, and production process control using neural networks.

For a wide range of readers involved in research in banking, industry, economics, business, geological exploration, management, transport, and other areas.

Address of the publisher on the Internet.
Mikhin
Preparation of the original layout: E. V. Kormakova
Cover artist: V. G. Sitnikov
Signed for printing. Format 70x100/16. Conv. print. sheets 32.5.
Published by the Scientific and Technical Publishing House "Hot Line Telecom"
Printed in the printing house "Til-2004". Order 05
ISBN
STATISTICA Neural Networks (SNN), 2008
V. P. Borovikov, 2008
Design by the publishing house "Hot Line Telecom", 2008

Preface to the Second Edition

The second edition of this acclaimed book has been significantly expanded and revised. New chapters have been written on introductory data analysis, probability theory, and neural network theory. The material in these chapters gives the reader a deep understanding of the methodology of using neural networks.

Neural networks are now used intensively in banking, industry, marketing, economics, medicine, and other areas where forecasting and an in-depth understanding of data are required. It is generally accepted that neural networks are a natural addition to classical methods of analysis and are used where standard procedures do not give the desired effect.

STATISTICA Neural Networks is the only software product in the world for conducting neural network research that is fully translated into Russian. This means that the entire interface (dozens of dialog boxes and study scenarios) and the STATISTICA Neural Networks help system are translated into Russian and are available to the user in a single environment.

We have included an additional chapter on classical methods of analysis, which allows the reader to compare different approaches. A separate chapter is devoted to data mining methods, modern data analysis technologies that combine classical and neural network models.

Employees of StatSoft Russia took part in the work on the book: V. S. Rastunkov, A. K. Petrov, and V. A. Panov. To all of them we express our sincere gratitude. Our special thanks go to Lyudmila Ekatova for her hard and painstaking work in preparing the manuscript for publication.

Scientific Director of StatSoft Russia V. P. Borovikov

Introduction

An invitation to neural networks

Over the past few years, interest in neural networks has increased significantly: they are used in finance, business, medicine, industry, engineering, geological exploration, and other fields. Neural networks are used wherever prediction, classification, or control problems must be solved, since they are applicable in almost any situation where there is a relationship between predictor (input) variables and predicted (output) variables, even if this relationship is complex and difficult to express in the usual terms of correlations or differences between groups.

Neural network methods can be used on their own or serve as a valuable addition to traditional data analysis methods. Most statistical methods build models based on certain assumptions and theoretical conclusions (for example, that the desired relationship is linear or that the variables are normally distributed). The neural network approach is free from such model constraints: it is equally suitable for linear and complex non-linear relationships, and it is especially effective in exploratory data analysis, when the task is to find out whether there are any relationships between variables at all.

The power of neural networks lies in their ability to learn. The training procedure consists of adjusting the synaptic weights so as to minimize the loss function.

This book uses the STATISTICA Neural Networks package to build neural networks; the package has a convenient interface and supports interactive research. All dialog boxes and prompts, including the online help system, are fully translated into Russian. STATISTICA Neural Networks is the world's only software product for neural network research that is fully translated into Russian.
A significant benefit of STATISTICA Neural Networks is that it is naturally built into the powerful arsenal of analysis tools in STATISTICA. It is the combination of classical and neural network methods that gives the desired effect.

This book consists of eleven chapters. In the first chapter we describe the basic concepts of data analysis; in the second we give an introduction to probability theory. The third chapter contains a theoretical introduction to neural networks. Note that probability theory is the foundation of neural networks. This chapter is necessary for an in-depth understanding of the methods and principles of neural networks. In it we describe the famous Bayes formula and the rule of optimal Bayesian classification.

The fourth chapter contains a general overview of the neural networks implemented in STATISTICA Neural Networks, introduces the reader to the program interface and options, and helps the reader understand the main directions of analysis. The fifth chapter teaches the reader how to take the first steps in STATISTICA Neural Networks.

The sixth chapter describes further capabilities of neural networks. Networks based on radial basis functions are considered in detail, and multilayer perceptrons, self-organizing maps, and probabilistic and generalized probabilistic models are described. The chapter explains how to build a network using the Decision Wizard, a handy neural network analysis tool for beginners, and introduces genetic algorithms for dimensionality reduction.

The seventh chapter presents practical tips for solving problems using neural networks. The eighth chapter contains solutions to specific problems (case studies). This chapter is of particular interest to a wide range of readers, as it shows neural network technology in action. The examples cover a wide range of applications, from geology and industry to finance; problems of classification, pattern recognition, forecasting, and production process control are considered.

In the ninth chapter, the reader will find a brief guide to using the STATISTICA Neural Networks package. The tenth chapter is devoted to statistical methods that are alternatives to neural networks: discriminant analysis, factor analysis, and logistic regression are described there. The user should, of course, be able to compare methods and choose the most appropriate ones. In the eleventh chapter, we briefly describe modern data mining technologies that combine neural network methods with classical analysis methods.

Let us give typical examples of the use of neural networks. In industry, the control of production processes (production installations) is an important task.
For example, in the gas industry, you can set up a neural network and automatically change parameters to control the quality of the output product. Similar problems arise in oil refining. It is possible to control the quality of gasoline from its spectral characteristics: by measuring the spectrum, the product is assigned to a certain class. Since the dependencies are non-linear, neural networks are a suitable tool for this classification.

In the financial sector, consumer lending is a pressing task. In recent years, consumer lending has developed rapidly and has become one of the fastest growing sectors of the banking business. The number of financial institutions providing goods and services on credit is growing day after day. The risk these institutions bear depends on how well they can distinguish "good" loan applicants from "bad" ones. By analyzing a borrower's credit history, you can predict how the borrower will behave and decide whether to grant or refuse a loan.

Other interesting problems include electronic signature verification, voice recognition, and a variety of tasks related to geological exploration. Neural networks can be used to solve these problems.

Next, we present a chain of dialog boxes in the STATISTICA Neural Networks package and show how the dialog with the user is organized. Note the user-friendly interface and the Decision Wizard and Network Builder tools, which allow users to design their own networks and choose the best one. So, first of all, let's run the STATISTICA Neural Networks package.

Step 1. Start with the launch panel (Fig. 1).

Fig. 1. Launch panel of neural networks

In this panel, you can select the kind of analysis to be performed: regression, classification, time series forecasting, or cluster analysis. Select, for example, time series if you want to build a forecast. Next, select a solution tool in the Tool section. Novice users are advised to select the Decision Wizard; advanced users can use the Network Builder. We select the Decision Wizard.

Step 2. Click the Data button to open the data file. If the file is already open, there is no need to press this button. When you click the Advanced button, a window appears with additional tools, in particular dimensionality reduction procedures, a code generator, etc. (Fig. 2).

Fig. 2. STATISTICA Neural Networks launch panel

Step 3. From the open file, select the variables for analysis. Variables can be continuous or categorical; in addition, observations may belong to different samples (Fig. 3).

Fig. 3. Variable selection window

Step 4. Set the duration of the analysis by specifying the number of networks to be tested or the solution time (Fig. 4).

Fig. 4. Decision Wizard, Quick tab

Step 5. Select the types of networks offered by the program that you want to work with: linear network, probabilistic network, network based on radial basis functions, multilayer perceptron. You can choose any single type or a combination of types (Fig. 5).

Fig. 5. Decision Wizard, Network type tab

Step 6. Set the format for presenting the final results (Fig. 6).

Fig. 6. Decision Wizard, Feedback tab

Step 7. Start the procedure for training neural networks by pressing the OK button (Fig. 7).

Fig. 7. Displaying the learning process

Step 8. In the results window, you can analyze the solutions obtained. The program selects the best networks and shows the quality of the solution (Fig. 8).

Fig. 8. Results window, Quick tab

Step 9. Select a specific network (Fig. 9).

Fig. 9. Model selection dialog box

Step 10. One way to check a network is to compare the observed values with the predicted results. A comparison of the observed and predicted values for the selected network is shown in Fig. 10.

Fig. 10. Table of observed and predicted values

Step 11. Save the best networks for further use, for example, for automatic forecasting (Fig. 11 and 12).

Fig. 11. Launch panel: selecting and saving networks/ensembles

Fig. 12. Standard window for saving a network file

This is a typical research scenario in the STATISTICA Neural Networks package. A more systematic presentation is contained in the remaining chapters of the book.

Chapter 9 QUICK GUIDE

In this chapter you will find a quick guide to working with the STATISTICA Neural Networks system. The package implements all types of neural networks currently used to solve practical problems, as well as the most advanced algorithms for fast training, automatic network construction, and selection of significant predictors.

DATA

Introduction

Recall once again that neural networks learn from examples and build a model from training data. The training data consist of a number of observations (cases), for each of which the values of several variables are given. Most of these variables are set as inputs, and the network learns the correspondence between the values of the input and output variables (most often there is only one output variable) using the information contained in the training data. Once the network is trained, it can be used to predict unknown output values from given input values. Thus, the first stage of working with a neural network is assembling a data set.

You can create a data table in the STATISTICA (Neural Networks) package using the New command on the File menu (or the corresponding toolbar button), specifying the number of variables and observations. The new data file will initially contain only empty cells, and the values of all variables in it will be set as missing (Fig. 9.1).

Fig. 9.1

The choice of input and output variables, as well as of the sets into which the observations are divided, is made inside the Neural Networks module (after the data table has been prepared). In practice, however, a data table is rarely created from scratch: usually the data file is imported from some other package using the Open command (you will need to specify the data format) or the External Data command of the File menu, which allows you to build complex queries against various databases (Fig. 9.2 and 9.3).

Fig. 9.2

Fig. 9.3

The Neural Networks module can read STATISTICA data files directly. Nominal variables are detected automatically (i.e., variables that can take one of several specified text values, for example, Gender = {Male, Female}), and data types such as dates and times are converted to a numeric representation (only numerical data can be fed to the input of a neural network).

If your data come from some other program (for example, a spreadsheet), you will first need to import them using the STATISTICA system. In addition to the import function, STATISTICA provides other ways of accessing external sources of information:
- the Windows clipboard (STATISTICA understands the clipboard data formats used in applications such as Excel and Lotus);
- access to various databases using the STATISTICA Query tool.

Text files delimited by tabs or commas can be imported directly into the STATISTICA package. If desired, the first line of the file can be reserved for variable names, and the first column for observation names (Fig. 9.4).

Fig. 9.4

Once a data file has been opened or newly created, its contents can be edited like a regular table in the STATISTICA environment. STATISTICA implements the basic data operations typical of spreadsheet programs, including editing, selecting a block of cells, copying to the clipboard, etc. In addition, there are special operations for setting the type and names of variables and observations and for adding, deleting, moving, and copying them.

Types of Variables and Observations

In the STATISTICA Neural Networks program, all observations from the data file are divided into four groups (sets): training, control, test, and ignored (unaccounted for). The training set is used to train the neural network, the control set is used for independent evaluation of the training progress, and the test set is used for the final evaluation after a series of experiments has been completed. The ignored set is not used at all (it may be needed if some of the data are corrupted or unreliable, or if there are simply too many data).

Similarly, all variables are divided into input, output, input/output (for example, in time series analysis), and ignored. The last group usually contains candidates for the role of input variables whose usefulness for the forecast is not clear in advance, so some of them are switched off in the course of experimentation.

The type of variables and observations is set in the Neural Networks module. The number of input and output variables, as well as of training, control, and test cases, is displayed in the corresponding fields at the top of the STATISTICA Neural Networks start window. The proportions between types can be changed by editing the values in these fields. This does not add or remove observations or variables: only the type of the existing observations or variables changes.

A similar operation is used to form an unbiased control set. First you specify the size of this set (usually half of the entire data set is allocated to it and the other half to the training set; if you also need a test set, the file must be divided into three parts). Then, using the Random selection option, all available observations are randomly distributed among the different types.
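The random distribution of observations among sets described above can be illustrated outside the package. The following is a minimal Python sketch, not STATISTICA code: the function name `assign_sets` and its parameters are hypothetical, and only the set names come from the text. It relabels existing observations without adding or removing any.

```python
import random

def assign_sets(n_observations, n_control, n_test=0, seed=0):
    """Randomly label each observation as 'training', 'control', or 'test'.

    Hypothetical helper for illustration; not part of STATISTICA.
    """
    if n_control + n_test > n_observations:
        raise ValueError("control and test sets exceed the number of observations")
    labels = (["control"] * n_control
              + ["test"] * n_test
              + ["training"] * (n_observations - n_control - n_test))
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(labels)         # only the label of each row changes, not the rows
    return labels

# Half of ten observations go to the control set, the rest to training.
labels = assign_sets(n_observations=10, n_control=5)
```

Fixing the seed is a design choice made here so that repeated experiments use the same split; the text does not say whether the package does this.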
The first time you read a data file in STATISTICA Neural Networks, you need to specify which variables will be inputs and which will be outputs; in the same way, for observations, you need to set the parameters of the samples for training, control, and testing. Settings related to variables are made in the start window of the Neural Networks module, and settings related to observations are made using the Selections tool in the dialog box for setting analysis parameters (you reach it after the start window). Note, however, that if a sample identification variable is used, it must be set in the start window together with the input/output variables.

Variable and Observation Names

Names can be assigned to individual variables and/or observations. This is done using the Variable Specifications, All Variable Specifications, and Observation Name Manager commands of the Data menu. Alternatively, you can simply double-click the name in the row or column heading and enter the name directly in the table.

In the STATISTICA Neural Networks program, it is not mandatory to assign names to observations or variables. If a name is not specified, the table displays a default name.

Variable Definition (Nominal Values)

STATISTICA Neural Networks has special features for working with categorical (nominal) variables. For nominal variables, there are special methods for converting values, and the type of the output variables distinguishes classification problems (where nominal variables are used) from regression problems (where numeric variables are used). A variable can be either numeric or nominal, but not both.

To define a nominal variable in the STATISTICA Neural Networks package, select it as a categorical variable (either on the Quick tab by clicking the Variables button, or on the Advanced tab by clicking the Variable Type button).

When importing files delimited by tabs or commas, if they contain nominal values (represented by text strings), the STATISTICA Neural Networks program automatically recognizes them and determines the required nominal values itself.

Adding and Removing Cases and Variables

You can add, delete, copy, and move cases and variables using the Data menu or directly in the table. The commands on the Data > Variables and Data > Observations menus are more powerful, while the tools for working with the table directly are more convenient to use.

There are two ways to add new observations:
1. Select an observation, left-click on its header, and select Add Observations. Alternatively, use the menu Data > Observations > Add.
2. Paste observations from the clipboard. To do this, left-click on the name of the observation or variable into which you want to insert data.

To delete an observation or a group of observations, select them in the usual way through the row headers and press Ctrl+X. This actually places the observations on the clipboard, so if you move the cursor to another location and press Ctrl+V, the observations will be inserted at the cursor position. With the keyboard shortcuts Ctrl+C and Ctrl+V, observations can be copied and pasted. Variables are moved and copied in a similar way.

Missing data

The STATISTICA Neural Networks module has special tools for processing missing data, similar to those used in other STATISTICA modules. Although the program can work with missing data by substituting reasonable estimates, it is best, where possible, to avoid missing values both when training the network and when running it. Sometimes, however, the number of available training observations is too small, and we are forced to use all of them.

STATISTICA Neural Networks can automatically mark all variables or observations containing missing data as ignored (so that they are not used in the analysis). Whether observations or variables are marked is the user's choice. If a variable has too many missing values, it should probably be excluded from consideration. If a variable is missing only a few values, it makes sense to mark the corresponding observations as ignored. We recommend the following sequence of actions: first mark the variable as ignored and see how many values are actually missing. If there are few such rows, make the variable an input again and mark the observations as ignored.

In a tab-delimited or comma-delimited import file, missing data can be indicated by an empty field.

NETWORKS

Introduction

After you have created or imported a data set, you can start building and training neural networks.
A network in the STATISTICA Neural Networks package can contain layers for pre- and post-processing, in which, respectively, the source data are converted to a form suitable for feeding to the network input, and the output data to a form convenient for interpretation. In this process, nominal values are converted to numerical form, numerical values are scaled to a suitable range, missing values are substituted, and in time series problems blocks of consecutive observations are formed. The pre- and post-processing data include a set of input and output variables, for each of which a name and type are specified, as in the original data set.
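The preprocessing steps listed above (nominal-to-numeric conversion, scaling, substitution of missing values) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, 1-of-N coding is one possible nominal conversion, and scaling to [0, 1] with mean substitution is assumed here only because the text does not say which range or substitute STATISTICA uses by default.

```python
def one_of_n(value, categories):
    """Encode a nominal value as a 1-of-N vector (one position per category)."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max_scale(values):
    """Scale numeric values linearly to [0, 1].

    None marks a missing value; it is replaced by the mean of the
    values that are present (an assumed substitution rule).
    """
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    mean = sum(present) / len(present)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [((mean if v is None else v) - lo) / span for v in values]

encoded = one_of_n("Female", ["Male", "Female"])
scaled = min_max_scale([0.0, None, 10.0])
```

Post-processing would apply the inverse operations to the network outputs, for example mapping a scaled output back to its original range.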

A note about input and output variables

The set of input and output variables in the STATISTICA Neural Networks package exists separately from the data file. To simplify the process of building a network, the program automatically copies the names and definitions of variables from the data set to the network being created and then separates the network and the data from each other. Thanks to this, the network can be used to analyze new data without resorting to the source file (because the network remembers the names and types of its variables, it will know what to do).

Building a network

To create a new network, use the Decision Wizard or the Network Builder tool. A series of dialog boxes in the Decision Wizard and Network Builder provides tools for setting and editing the settings of the pre- and post-processing variables. First, of course, you need to define the variables and choose an appropriate transformation method for them, as well as the network architecture.

To go to the dialog for setting analysis parameters, press the OK button in the start window. Depending on which tool is used, the option for selecting the network type is on the Network Type tab (Decision Wizard) or on the Quick tab (Network Builder).

If a time series is being modeled, the Time Series tab is available in either tool. In the Decision Wizard, this tab sets the bounds for the forecast window (i.e., the number of observations used to predict one observation ahead). In the Network Builder, this tab provides options for specifying the exact value of the forecast window and the number of steps forward. In tasks not related to time series, these options are not available.
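The roles of the forecast window and the number of steps forward can be illustrated by how a series is cut into training examples. The following Python sketch assumes the usual sliding-window construction; the function `make_windows` and its parameter names are hypothetical and are not taken from the package.

```python
def make_windows(series, window, steps_ahead=1):
    """Turn a time series into (inputs, target) pairs.

    Each input is `window` consecutive values; the target is the value
    `steps_ahead` positions after the last input value.
    """
    pairs = []
    last_start = len(series) - window - steps_ahead
    for start in range(last_start + 1):
        inputs = series[start:start + window]
        target = series[start + window + steps_ahead - 1]
        pairs.append((inputs, target))
    return pairs

# A window of 3 and a forecast one step ahead.
pairs = make_windows([1, 2, 3, 4, 5], window=3, steps_ahead=1)
```

Here the same variable supplies both the inputs and the target, which is why a time series variable is selected as input and output at the same time.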
In time series analysis problems, the number of steps forward is set to 1 or more (most often 1, which corresponds to a forecast one step ahead), and the time window is the number of previous values of the series from which its next value will be predicted. In addition, before running the corresponding tool in a time series task, you should select the variable containing the values of the time series as input and output at the same time, since you are going to predict the next values of the variable from its previous values.

If a multilayer perceptron is being built, the number of layers in the network can be changed; for networks of other types, this parameter cannot be changed (with one exception: a probabilistic network can consist of three or four layers, depending on whether it includes a loss matrix).

The Edit option (available for the Network Builder tool in the Network Parameters dialog) provides information about the pre- and post-processing variables, including their names and definitions, as well as the transformation function used to prepare data for input to the neural network. You can change the method of substituting missing values and the control parameters of the conversion. As a rule, the suggested default values are quite suitable.

The same dialog shows the current parameters of the network architecture: the number of elements in each layer and (if you scroll the table to the right) the width of the layers. The number of input and output elements is usually strictly determined by the number of input and output pre- and post-processing variables, the transformation function, and (in time series analysis problems) the size of the time window. The STATISTICA Neural Networks program determines the appropriate values itself and displays them in gray, showing that they cannot be edited. The number of elements in the intermediate layers can be changed at your discretion, but the program usually offers heuristically determined, reasonable default values for them. The layer width has no functional meaning except for the output layer of a Kohonen network and is usually ignored.

To create a network once a training data set has been loaded, it is usually enough to:
1) Set the types of variables in the start window (Input or Output).
2) Select the network type and, where applicable, the time series option.
3) Set the values of the Time window and Forecast forward parameters (only in time series analysis problems).
4) Set the number of layers (only for multilayer perceptrons).
5) Set the number of hidden elements (if using the Network Builder).
6) Set the number of elements and the width of the output layer (only for Kohonen networks).
7) Click OK.

Editing networks

Once a network has been built, its design can be modified using the Model Editor tool. You can change all the parameters used in its construction, as well as a number of additional characteristics.
The tool also allows you to change the names and definitions of input and output variables, their conversion functions and parameters, and the methods for replacing missing values. There are also options for adding new variables, deleting existing ones, and changing the parameters of the time series (Time Window and Forecast). These options are rarely used.

In addition, the pre- and post-processing editor makes it possible to change the classification parameters, which are not set when the network is built but may need to be adjusted during operation. The classification parameters are used only in classification problems, i.e., when at least one of the output variables is nominal. When the network is running, STATISTICA Neural Networks makes a classification decision based on the values of these output variables. Thus, if there is a nominal output variable with three possible values and 1-of-N coding is applied, the program must decide whether, for example, to interpret the output vector (0.03; 0.98; 0.02) as belonging to the second class (Fig. 9.5).

Fig. 9.5

This issue is resolved by setting acceptance and rejection thresholds. In 1-of-N coding, a classification decision is made if one of the N output values exceeds the acceptance threshold and the rest fall below the rejection threshold; if this condition is not met, the result is considered undefined (and is returned as a missing value). With the default acceptance threshold (0.95) and rejection threshold (0.05) set in the program, the above example will indeed be assigned to the second class. Choosing less stringent thresholds leaves fewer cases unclassified but may result in a higher error rate.

How the values of the Accept and Reject parameters are interpreted depends on the type of network. For some types of networks (for example, Kohonen networks), large output values correspond to large errors, so the classification decision is made if the output value is below the acceptance threshold (Fig. 9.6 and 9.7).
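The acceptance and rejection rule for 1-of-N coding can be written out directly. The Python sketch below follows the rule as stated in the text, with the default thresholds 0.95 and 0.05; the function name `classify` is hypothetical, and the sketch covers only network types where a large output means a confident class (not the Kohonen case).

```python
def classify(outputs, accept=0.95, reject=0.05):
    """Return the index of the winning class, or None if the result is undefined.

    A decision is made only if exactly one output exceeds `accept`
    and every other output falls below `reject`.
    """
    winners = [i for i, v in enumerate(outputs) if v > accept]
    if len(winners) != 1:
        return None
    winner = winners[0]
    others_rejected = all(v < reject
                          for i, v in enumerate(outputs) if i != winner)
    return winner if others_rejected else None

# The output vector from the text: index 1 is the second class.
decision = classify([0.03, 0.98, 0.02])
```

Returning `None` plays the role of the missing value that the program reports for an undefined result.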

Fig. 9.6

Fig. 9.7

The network editor allows you to change some other network parameters. For instance, you can change the type of error function used to train the network and to evaluate its performance. You can also select specific layers of the network and modify their activation functions and post-synaptic potential (PSP) functions.

It is also possible to add or remove network elements. Typically, this can only be done with intermediate layers, since the input and output elements are bound to the pre- and post-processing variables (when variables are added or removed, the corresponding elements are added or removed as well). The exception is Kohonen networks, where output elements can be added and removed.

To add or remove hidden elements, go to the Layers tab and add or delete elements of the hidden layer. You can also use the tools for cutting, copying, and pasting columns of the weight table, which can be edited on the Weights tab. All this allows you to experiment with different network architectures without creating a new network every time.

You can remove a whole layer from the network. This is required in rare cases, for example, to separate the preprocessing half of an auto-associative network during dimensionality reduction.

The weight table shows all weights and thresholds either for a selected layer or for the entire network. If desired, weights and thresholds can be edited directly, but this is very rarely done (weight values are set by the training algorithms). These data are displayed mainly so that the weight values can be sent to another program for further analysis.

TRAINING THE NETWORKS

Once the network is built, it needs to be trained on the available data. The STATISTICA Neural Networks package has special algorithms for training networks of each type, grouped by type in the Training menu (these options are available only when using the Network Builder).
Multilayer perceptron

Five different training algorithms for multilayer perceptrons are implemented in the STATISTICA Neural Networks package. These are the well-known backpropagation algorithm, the fast second-order methods of conjugate gradient descent and Levenberg-Marquardt, and the quick propagation and delta-bar-delta methods (variations of backpropagation that are faster in some cases). All these methods are iterative, and they are applied in largely the same way.

In most situations, you should choose the conjugate gradient method, since training with it is much faster (sometimes by an order of magnitude) than with backpropagation. Backpropagation should be preferred only when a satisfactory solution to a very complex problem must be found quickly, or when there is a great deal of data (on the order of tens of thousands of observations), possibly even a known excess. The Levenberg-Marquardt method can be much more efficient than the conjugate gradient method for some types of problems, but it can only be used in networks with a single output, an RMS error function, and a moderate number of weights, so its scope is in fact limited to small regression problems.

Iterative learning

An iterative training algorithm passes through a series of so-called epochs; in each epoch the entire training set is fed to the network input, the errors are calculated, and the network weights are adjusted accordingly. Algorithms of this class are subject to the undesirable phenomenon of overfitting (when the network learns to reproduce the output values of the training set well but is unable to generalize the pattern to new data). Therefore, the quality of the network should be checked at each epoch using a special control set (cross-validation).

The training progress can be monitored in the Training Error Graph window, where the graph shows the mean squared error on the training set at each epoch. If cross-validation is enabled, the error on the control set is also displayed. Using the controls located under the chart, you can change the scale of the image, and if the chart does not fit entirely in the window, scroll bars appear under it (Fig. 9.8).

Fig. 9.8

If you want to compare the results of different training runs, click the Advanced button in the training window and then click the Train button again (clicking the Train without initialization button will simply continue training the network from the point where it was interrupted). When training is finished, the graph can be sent to the STATISTICA system using the buttons located above the legend.

It is important that the effect of overfitting is easy to see on the graph. Initially, both the training error and the control error decrease. With the onset of overfitting, the training error continues to decrease while the control error begins to grow. An increase in the control error signals the beginning of overfitting and indicates that the training algorithm is becoming destructive (and also that a smaller network may be more suitable).

If overfitting is observed, training can be interrupted by clicking the Stop button in the training window or by pressing the Esc key. You can also have STATISTICA Neural Networks stop training automatically by means of stopping conditions. Stopping conditions are set in the window of the same name, which is accessed through the menu Training End of analysis. In addition to the maximum number of epochs allowed for training (which is set on the Fast tab), here you can require that training stop when a certain error level is reached or when the error stops decreasing by a certain amount. The target value and the minimum improvement can be set separately for the training error and the control error. The best way to deal with overfitting is to set the minimum improvement level to zero (i.e., to allow no deterioration at all). However, since training is noisy, it is usually not advisable to stop training just because the error has worsened in a single epoch.
Therefore, the system provides a special Window parameter, which specifies the number of epochs over which deterioration must be observed before training is stopped. In most cases, a value of 5 works well for this parameter.

Preserving the Best Network

Whether or not early stopping is used, overfitting can leave you with a network that has already deteriorated. In this case, you can restore the best network configuration obtained during training using the Best network command (Training Advanced menu) (Fig. 9.9).
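The stopping rule with a window and the retention of the best network can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name early_stopping, the decision to measure deterioration against the best control error seen so far, and the sample error values are all assumptions made for this example.

```python
def early_stopping(control_errors, window=5, min_improvement=0.0):
    """Return (stop_epoch, best_epoch, best_error) for a series of control errors.

    Assumption: an epoch counts as a deterioration when the control error fails
    to improve on the best value seen so far by at least min_improvement.
    """
    best_error = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, error in enumerate(control_errors):
        improvement = best_error - error
        if error < best_error:
            # Remember the best network seen so far (lowest control error).
            best_error, best_epoch = error, epoch
        if improvement >= min_improvement:
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= window:
                # Deterioration lasted for the whole window: stop training.
                return epoch, best_epoch, best_error
    # The stopping condition never fired; training ran to the last epoch.
    return len(control_errors) - 1, best_epoch, best_error


# Hypothetical control errors: they fall, then rise as overfitting sets in.
control_errors = [0.90, 0.70, 0.50, 0.40, 0.42, 0.45, 0.43, 0.47, 0.50, 0.55]
stop_epoch, best_epoch, best_error = early_stopping(control_errors, window=5)
print(stop_epoch, best_epoch, best_error)  # prints: 8 3 0.4
```

With a window of 5, the single noisy improvement at epoch 6 does not reset the counter because it does not beat the best error from epoch 3. Training stops at epoch 8, and the network from epoch 3 is the one worth restoring.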

Fig. 9.9

If the Best Network option is enabled, STATISTICA Neural Networks automatically saves the best network obtained during training (in terms of control error). All training runs are taken into account, so the program automatically keeps the best result of all your experiments. You can also set a Unit Penalty in order to penalize networks with a large number of elements when networks are compared (the best network usually represents a compromise between quality on the control set and network size).

Backpropagation

Before the backpropagation algorithm is applied, the values of a number of control parameters have to be set. The most important of them are the learning rate, the inertia (momentum), and the shuffling of observations during training (note that the advantage of the conjugate gradient method lies not only in its speed but also in its small number of control parameters) (Fig. 9.10).

The Learning rate parameter sets the step size used when the weights are changed: if the rate is too low, the algorithm converges slowly, and if it is too high, the algorithm is unstable and prone to oscillations. Unfortunately, the best rate depends on the specific task. For fast, rough training, values from 0.1 to 0.6 are suitable; much smaller values are required to achieve exact convergence (for example, 0.01 or even 0.001 if there are many thousands of epochs). Sometimes it is useful to reduce the rate during training.

In the STATISTICA Neural Networks program, you can set initial and final values of the learning rate; in this case, the rate is interpolated between them as training progresses. The initial rate is set in the left field and the final rate in the right field (Fig. 9.11).
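The interpolation between the initial and final learning rates, and the effect of the step size on convergence, can be sketched as follows. The source does not say which kind of interpolation is used, so a linear one is assumed here. The function names and the toy error function w**2 are hypothetical and serve only as an illustration.

```python
def learning_rate(epoch, n_epochs, initial, final):
    """Linearly interpolate the rate from initial (first epoch) to final (last epoch)."""
    if n_epochs <= 1:
        return initial
    fraction = epoch / (n_epochs - 1)
    return initial + fraction * (final - initial)


def descend(rate_for_epoch, n_epochs, w=1.0):
    """Minimize the toy error function w**2 by gradient descent; return all visited w."""
    path = [w]
    for epoch in range(n_epochs):
        gradient = 2.0 * w  # derivative of w**2
        w -= rate_for_epoch(epoch) * gradient
        path.append(w)
    return path


n_epochs = 20
slow = descend(lambda epoch: 0.1, n_epochs)       # steady but slow convergence
unstable = descend(lambda epoch: 1.1, n_epochs)   # step too large: diverges
scheduled = descend(lambda epoch: learning_rate(epoch, n_epochs, 0.6, 0.01), n_epochs)

print(abs(slow[-1]), abs(unstable[-1]), abs(scheduled[-1]))
```

On this toy function a constant rate of 0.1 shrinks w by a factor of 0.8 per epoch, while a rate of 1.1 multiplies it by -1.2 per epoch, so the iterates oscillate in sign and grow. The scheduled run starts with large steps and finishes with small ones.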

Inertia coefficient (Momentum) helps the algorithm avoid getting stuck in hollows and local minima. This coefficient can take values from zero to one. Some authors recommend changing it during training. Unfortunately, here too the "correct" value depends on the task and can be found only empirically.

When using backpropagation, it is generally recommended to change the order of observations from epoch to epoch, since this reduces the chance of the algorithm getting stuck in a local minimum and also reduces the effect of overfitting. To take advantage of this feature, enable the Shuffle Observations mode.

Assessing the quality of network performance

Once the network has been trained, it is worth checking how well it performs. The RMS error displayed in the Training Error Graph window is only a rough measure of performance. More useful characteristics are displayed in the Classification Statistics and Regression Statistics windows (both are accessed through the Analysis Results window).

The Classification Statistics window is used for nominal output variables. It shows how many observations of each class in the data file (each class corresponds to a nominal value) were classified correctly, how many were classified incorrectly, and how many were not classified, and it gives details of the classification errors. Once the network is trained, you just need to open the Descriptive Statistics window (Fig. 9.12).

Statistics can be obtained separately for the training, control and test sets. The upper part of the table shows summary statistics (the total number of observations in each class and the numbers classified correctly, classified incorrectly, and left unclassified), and the lower part shows the cross-classification results (how many observations from a given column were assigned to a given row) (Fig. 9.13).

If this table contains many Unknown answers but few or no Incorrect answers, the acceptance and rejection thresholds should probably be relaxed (Edit Pre/Post Processing menu) (Fig. 9.14).
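The summary and cross-classification tables, and the effect of relaxing the acceptance and rejection thresholds, can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's procedure: it assumes a two-class problem with a single network output, and the function names, class labels, and output values are hypothetical.

```python
from collections import Counter


def classify(output, accept, reject):
    """Assign a class from a single network output using two thresholds."""
    if output >= accept:
        return "positive"
    if output <= reject:
        return "negative"
    return "unknown"  # output falls between the thresholds


def classification_statistics(true_labels, outputs, accept, reject):
    """Return (summary, cross): per-class counts and an (assigned, true) table."""
    summary = {"positive": Counter(), "negative": Counter()}
    cross = Counter()
    for truth, output in zip(true_labels, outputs):
        assigned = classify(output, accept, reject)
        summary[truth]["total"] += 1
        if assigned == "unknown":
            summary[truth]["unknown"] += 1
        elif assigned == truth:
            summary[truth]["correct"] += 1
        else:
            summary[truth]["wrong"] += 1
        # Columns are true classes, rows are assigned classes.
        cross[(assigned, truth)] += 1
    return summary, cross


# Hypothetical data: the network ranks every case correctly but is not confident.
truth = ["positive", "positive", "positive", "negative", "negative", "negative"]
outputs = [0.97, 0.80, 0.60, 0.03, 0.30, 0.10]

strict, _ = classification_statistics(truth, outputs, accept=0.95, reject=0.05)
relaxed, cross = classification_statistics(truth, outputs, accept=0.5, reject=0.5)
print(strict["positive"]["unknown"], strict["negative"]["unknown"])
print(relaxed["positive"]["correct"], relaxed["negative"]["correct"])
```

With the strict thresholds, four of the six observations are left unclassified and none is wrong, which is the situation described above. Relaxing both thresholds to 0.5 classifies all six correctly.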

The Regression Statistics window is used for numeric output variables. It summarizes the accuracy of the regression estimates. The most important statistic is the standard deviation ratio (S.D. ratio), shown at the bottom of the table. It is the ratio of the standard deviation of the prediction error to the standard deviation of the original data. If we had no input data at all, the best prediction we could make for the output variable would be its mean over the available sample, and the error of such a prediction would have a standard deviation equal to that of the sample. If a neural network performs well, we can expect its mean error on the available observations to be close to zero and the standard deviation of this error to be smaller than the standard deviation of the sample values (otherwise the network would do no better than simple guessing). Thus, an S.D. ratio significantly less than one indicates that the network is effective. The value equal to one minus the S.D. ratio is treated as the fraction of variance explained by the model (strictly speaking, the explained fraction of variance is one minus the square of this ratio).

Kohonen Networks

The training algorithm for Kohonen networks is in some respects similar to the training algorithms for multilayer perceptrons: it is iterative and proceeds by epochs, and the mean squared training error can be plotted on the graph (although in fact it is the mean square of a completely different error measure than in multilayer perceptrons). However, Kohonen's algorithm has a number of distinctive features. The most significant of them is that training here is unsupervised, i.e. the data may contain no output values at all, and if there are any, they are ignored. The operation of the algorithm is determined by two parameters: Learning rate and Neighborhood.
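Returning briefly to the regression statistics described above, the S.D. ratio can be illustrated with a small worked example in Python. The function name sd_ratio and the sample values are hypothetical. The population standard deviation is used here as an assumption; the sample standard deviation would give the same ratio, because both terms are computed over the same number of observations.

```python
from statistics import mean, pstdev


def sd_ratio(observed, predicted):
    """Standard deviation of the prediction error divided by that of the data."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return pstdev(errors) / pstdev(observed)


observed = [1.0, 2.0, 3.0, 4.0, 5.0]

# Baseline with no inputs: always predict the sample mean.
baseline = [mean(observed)] * len(observed)

# Hypothetical outputs of a well-trained network.
network = [1.1, 1.9, 3.1, 3.9, 5.1]

print(sd_ratio(observed, baseline))  # the mean predictor gives a ratio of one
print(sd_ratio(observed, network))   # a useful network gives a ratio well below one
```

The baseline predictor reproduces the "no better than guessing" case, and the second call shows the kind of value that indicates an effective network.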
Training proceeds as follows: the next observation is fed to the network input and processed by the network, the winning (most active) radial element (i.e., an element of the second layer of the network) is selected, and then this element and its nearest neighbors are adjusted so as to reproduce the training observation better. The learning rate controls the degree of adaptation, and the neighborhood determines the number of elements to be adjusted.

Usually, the Kohonen algorithm runs in two stages, ordering and fine tuning, in each of which the learning rate and the size of the neighborhood change gradually from their initial values to their final ones. In STATISTICA Neural Networks, you can set start and end values for both the learning rate and the neighborhood size. The neighborhood size defines a square centered on the winning element: size 0 corresponds to the winning element alone, size 1 to a 3 × 3 square centered on the winning element, size 2 to a 5 × 5 square, and so on. If the winning element is close to the edge, the neighborhood is cut off (rather than wrapped around to the opposite edge). Although this parameter is by nature an integer, you can specify it as a real number in order to control it more precisely as the algorithm reduces the size of the neighborhood. In this case, STATISTICA Neural Networks first adjusts the real value and then rounds it to the nearest integer.

After the Kohonen training algorithm has finished, you need to label the radial elements with the labels of their corresponding classes (see the Topological Map section).

OTHER NETWORK TYPES

Other types of networks are fairly easy to train; in each case there are only a few training parameters to set, and all of them are described below.

Radial Basis Functions (RBFs)

Training consists of three stages: placing the centers of the radial elements, selecting their deviations, and optimizing the linear output layer. For the first two stages there are several variants of the algorithm, which are chosen in the Radial basis function window (accessed through the Training menu); the most popular combination is the K-means method for the first stage and the K-nearest neighbors method for the second. The linear output layer is optimized using the classic pseudo-inverse matrix algorithm (singular value decomposition).
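As an aside on the Kohonen training step described earlier in this section, the square neighborhood that is cut off at the edges of the map can be sketched in Python as follows. The function names, the grid layout, and the dictionary of weights are hypothetical. Rounding halves upward is also an assumption, since the text only says that the value is rounded to the nearest integer.

```python
def neighborhood(winner_row, winner_col, size, n_rows, n_cols):
    """Cells of the square of the given size around the winner, clipped to the map."""
    radius = int(size + 0.5)  # round a non-negative real size to the nearest integer
    rows = range(max(0, winner_row - radius), min(n_rows, winner_row + radius + 1))
    cols = range(max(0, winner_col - radius), min(n_cols, winner_col + radius + 1))
    return [(r, c) for r in rows for c in cols]


def update(weights, cells, observation, rate):
    """Move the weight vectors of the given cells toward the observation."""
    for cell in cells:
        weights[cell] = [w + rate * (x - w) for w, x in zip(weights[cell], observation)]


# A 5 x 5 map: size 1 around the center is a 3 x 3 square (9 cells),
# while the same size around a corner is cut off to 4 cells.
print(len(neighborhood(2, 2, 1, 5, 5)), len(neighborhood(0, 0, 1, 5, 5)))

# Adjust the winner and its neighbors halfway toward a sample observation.
weights = {(r, c): [0.0, 0.0] for r in range(5) for c in range(5)}
update(weights, neighborhood(0, 0, 1, 5, 5), [1.0, 2.0], rate=0.5)
print(weights[(1, 1)], weights[(2, 2)])
```

Cells outside the clipped neighborhood, such as (2, 2) in the last call, keep their weights unchanged.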
The STATISTICA Neural Networks program also allows you to build hybrid RBF networks by choosing other activation functions for the output layer (for example, logistic). In this case, the output layer can be trained with any of the training algorithms for multilayer perceptrons, for example, the conjugate gradient method.

Linear networks

Here an ordinary linear model is implemented in the guise of a two-layer network; it is optimized using the pseudo-inverse matrix algorithm in the Radial Basis Function window.

A linear network can also be used for principal component analysis, in an attempt to reduce the number of variables before the data is processed by a network of another type.

Probabilistic and Generalized Regression Neural Networks (PNN/GRNN)

Probabilistic (PNN) and generalized regression (GRNN) neural networks are based on statistical kernel estimation of probability densities and are intended for classification and regression problems, respectively. Their training algorithms are simple and fast, but the resulting neural network models are large and relatively slow.

Automatic Network Builder

Choosing the right network type and architecture can be a long and unproductive process, since it involves a lot of trial and error. Moreover, since training is noisy and the algorithm can get stuck in local minima, each experiment has to be repeated several times. This tedious work can be minimized by using the automatic network design facilities of the STATISTICA Neural Networks package, in which fairly sophisticated optimization algorithms automatically conduct large series of experiments and select the best network architecture and size.

The automatic network design functions are invoked when the Decision Wizard tool is selected. You only need to specify the types of architectures to be considered, set the number of iterations (or the analysis time) that determines the duration of the search (since the algorithm can run for a long time, it makes sense to set a small number of iterations first in order to estimate how long the entire search may take), and select the Criteria for choosing a retained network, which penalize networks with an unreasonably large number of elements. The algorithm will perform the required series of experiments and indicate the best of the resulting networks.
If this algorithm is used for a time series analysis problem, you must additionally set the value of the Time window parameter.

Genetic Algorithm for Input Selection

One of the most difficult questions in using neural networks is which input variables to use (it is rarely known in advance which of them are important for solving the problem and which are not). With the Dimension Down tool, available from the Advanced tab of the start menu, you can find a suitable set of input variables automatically. By building and testing a large number of PNN or GRNN networks (for classification or regression problems, respectively) with different sets of input variables, the genetic algorithm (as well as the stepwise inclusion and exclusion algorithms) selects combinations of inputs and searches for the best of them. As with the automatic network builder, this procedure can be time consuming, but it is often the only way to solve the problem.

In the course of its work, the genetic algorithm generates a large number of test bit strings (their number is set by the Population parameter) and artificially "crosses" them over a given number of generations, using the artificial selection operations of mutation and crossover, the intensity of which can be controlled. Each PNN or GRNN network is trained with a specified smoothing parameter (it is wise to run a few tests before applying the genetic algorithm in order to determine a suitable smoothing factor), and to give an advantage to small sets of input variables, you can set the Penalty per element parameter. The algorithm considers all input variables available in the dataset. To start the algorithm, click the OK button. When it finishes, the table at the bottom of the window shows the word Yes next to the useful variables and a dash next to the useless ones. To use the results of the algorithm, you must first select Run Network Builder for Selected Variables or Run Decision Wizard for Selected Variables from the Dimension Reductions menu on the End of Analysis tab.

WORKING WITH THE NETWORK

Obtaining output values

Once the network has been trained, it can be used to analyze data: you can run the network on individual observations from the current dataset, on the entire dataset, or on arbitrary user-defined observations.
The network can also process any other compatible data set that has input variables with the same names and definitions as in the network. This means that having built the network, we are no longer tied to the training set. If a dataset is being analyzed that, in addition to the inputs, has compatible output values, then the STATISTICA Neural Networks program will calculate the error values. When you open a network or dataset, STATISTICA Neural Networks checks to see if there are variables in the dataset that are compatible with the network's input variables. If there are, then their type in the data set is automatically set as needed, and all other variables are ignored. Thus, it is possible to have several networks (in the form of files) working with


Edited by V.P. Borovikov

2nd ed., revised and expanded

2008

Circulation 1000 copies.

Format 70x100/16 (170x240 mm)

Version: paperback

ISBN 978-5-9912-0015-8

BBK 32.973

UDC 004.8.032.26

Annotation

Neural network methods for data analysis based on the STATISTICA Neural Networks package (developed by StatSoft), fully adapted for the Russian user, are presented. The foundations of the theory of neural networks are given. Much attention is paid to solving practical problems, and the methodology and technology of conducting research with the STATISTICA Neural Networks package, a powerful tool for data analysis and forecasting with wide applications in business, industry, management, and finance, are covered comprehensively. The book contains many examples of data analysis and practical recommendations for analysis, forecasting, classification, pattern recognition, and production process control using neural networks.

For a wide range of readers involved in research in banking, industry, economics, business, exploration, management, transport and other areas.

Preface to the second edition

Introduction. An invitation to neural networks

Chapter 1. BASIC CONCEPTS OF DATA ANALYSIS

Chapter 2. INTRODUCTION TO PROBABILITY THEORY

Chapter 3. INTRODUCTION TO THE THEORY OF NEURAL NETWORKS

Chapter 4. OVERVIEW OF NEURAL NETWORKS
Parallels from biology
Basic artificial model
Application of neural networks
Pre- and post-processing
Multilayer perceptron
Radial basis function
Probabilistic neural network
Generalized Regression Neural Network
Linear network
Kohonen network
Classification tasks
Regression tasks
Time series forecasting
Selection of variables and dimension reduction

Chapter 5 FIRST STEPS IN STATISTICA NEURAL NETWORKS
Getting Started
Create a dataset
Create a new network
Create a dataset and network
Network training
Running a neural network
Carrying out classification

Chapter 6. FURTHER CAPABILITIES OF NEURAL NETWORKS
Classic example: Fisher's irises
Cross-validation training
Stop conditions
Solving regression problems
Radial basis functions
Linear Models
Kohonen networks
Probabilistic and generalized regression networks
Network constructor
Genetic algorithm for selection of input data
Time series

Chapter 7
Data representation
Extraction of useful input variables.
Dimensionality reduction
Choice of network architecture
Custom network architectures
Time series

Chapter 8 CASE STUDIES
Example 1. Dimensionality reduction in a geological survey
Example 2: Pattern Recognition
Example 3. Nonlinear classification of two-dimensional sets
Example 4. Segmentation of various fuel samples according to laboratory data
Example 5: Building a Behavioral Scoring Model
Example 6. Approximation of functions
Example 7: Forecasting oil sales
Example 8. Monitoring and prediction of temperature conditions at the installation
Example 9. Determining the validity of a digital signature

Chapter 9. QUICK GUIDE
Data
Networks
Network training
Other types of networks
Working with the network
Sending results to STATISTICA

Chapter 10. CLASSICAL METHODS ALTERNATIVE TO NEURAL NETWORKS
Classical discriminant analysis in STATISTICA
Classification
Logit regression
Factor analysis in STATISTICA

Chapter 11. DATA MINING IN STATISTICA

Appendix 1. Code Generator

Appendix 2. Integration of STATISTICA with ERP systems

Bibliography

Subject index

These books can be purchased from the StatSoft office.

A Popular Introduction to Modern Data Analysis and Machine Learning in STATISTICA

V.P. Borovikov

Volume: 354 pages

Price: 1000 rubles.

The current capabilities of data analysis and machine learning, a major trend in modern computer analytics, are covered in a popular and engaging way. The presentation focuses on understanding the methods and applying them to practical problems. "Follow along with us, and you will learn how to analyze data!" is the main theme of the book.

Classical statistical methods are described in detail, including multivariate methods: cluster analysis, discriminant analysis, multiple regression, factor analysis, principal component analysis, survival analysis, and Cox regression. Separate chapters present neural network methods, data mining methods, and classification and regression trees (CART models). Examples from various areas of human activity are considered: industry, retail, infocommunications, business, and medicine. Special chapters are devoted to the probability theory and optimization methods underlying machine learning.

For a wide range of readers: engineers, technologists, managers, analysts, doctors, researchers who are interested in modern analytical methods and technologies for data analysis and machine learning and their application in practice.

A Popular Introduction to Modern Data Analysis in the STATISTICA System

V.P. Borovikov

Volume: 288 pages

The unique book by StatSoft Scientific Director Vladimir Borovikov contains all the best that is known in the field of data analysis.

Using simple, clear examples from business, marketing, and medicine, modern methods of data analysis are described - visual analysis and graphical representation of data, descriptive statistics, classification and forecasting methods.

The book is an educational standard in the field of data analysis at leading Russian universities: NRU MIEM HSE, Moscow State University, Kuban State University, and others.

Much attention is paid to the systematics of data analysis, ranging from descriptive analysis, data cleaning and verification, visual representation, and grouping and classification methods to the latest technologies, neural networks and data mining, for finding patterns in your data.

Probability Theory, Mathematical Statistics and Data Analysis: Fundamentals of Theory and Practice on the Computer. STATISTICA. EXCEL. Over 150 Problem-Solving Examples

Khalafyan A.A., Borovikov V.P., Kalaidina G.V.

Volume: 320 pages

Price: 600 rubles.

You can send an application to

The current level of development of computer technology allows the study of probability theory and mathematical statistics to be brought to a new educational level, with an emphasis on the applied part of the discipline - mathematical statistics and computer data analysis.

The textbook outlines the elements of combinatorics and various methods for calculating probabilities, and introduces the concept of a random variable and its functional and numerical characteristics. The theoretical material is accompanied by examples and specially selected problems that allow you to study the material in depth. A separate chapter describes the use of Excel and STATISTICA for solving applied problems. Excel is part of Microsoft Office and is one of the most popular applications in the world today. STATISTICA occupies a leading position among data analysis programs and has more than a million users worldwide. The program has been fully localized into Russian, and an Intellectual Knowledge Portal has been created, a global multimedia resource for a wide range of users: schoolchildren, students, graduate students, and everyone who wants to develop their intellect and get acquainted with modern technologies of computer data analysis.

The textbook is addressed to a wide range of students and teachers, including undergraduates in the humanities and natural sciences on non-mathematical tracks who are studying higher mathematics.

Neural Networks. STATISTICA Neural Networks: Methodology and Technologies of Modern Data Analysis

Ed. V.P. Borovikov

Volume: 392 pages

You can send an application to

The book outlines neural network methods for data analysis based on the STATISTICA Neural Networks package, fully adapted for the Russian user.

The foundations of the theory of neural networks are given. Much attention is paid to solving practical problems and to the methodology and technology of conducting research with the STATISTICA Neural Networks package, a powerful tool for data analysis, dependency building, forecasting, and classification.

Currently, neural networks are intensively used in banking, industry, marketing, economics, medicine and other areas where forecasting and in-depth understanding of data is required. It is generally accepted that neural networks are a natural addition to classical methods of analysis and are used where standard procedures do not give the desired effect.

The book contains many examples of data analysis, practical recommendations for analysis, forecasting, classification, pattern recognition, production process control using neural networks.

The book will be useful for a wide range of readers involved in research in the banking sector, industry, business, exploration, management, transport and other areas.

STATISTICA: The Art of Computer Data Analysis (2nd edition)

+ StatSoft Multimedia Tutorial

V. P. Borovikov

Volume: 700 pages

The book is currently out of stock. A new edition of the book is planned for the near future. Please send your applications to:

The book is the most fundamental text on modern data analysis and includes about 700 pages of data analysis methods and procedures. The second edition is supplemented with new material not included in the previous version of the book; in particular, it describes power analysis, sample size estimation, partial correlations, principal component analysis, a new interpretation of neural networks, and much more. The book comes with a CD containing demo versions of StatSoft software products, data analysis examples, the acclaimed StatSoft electronic textbook, an industrial statistics textbook, course materials, and a wealth of data for study and research.

The main feature of the second edition is a new chapter on the STATISTICA Visual Basic (SVB) language, which extends the capabilities of the STATISTICA system and allows users to create their own applications.

The book describes in detail, using real data as examples, the basic concepts of data analysis in the STATISTICA system: descriptive and visual analysis, contingency table analysis, dependency building, multiple regression, survival analysis, non-parametric methods, correspondence analysis, neural networks, classification and prediction using neural networks, quality control, experiment design (including a wide variety of designs), and much more.

The distinguishing feature of the book is that you not only see the results of the analysis but can also reproduce them in the STATISTICA system. Thus, using the latest computer technology for data analysis from StatSoft, you learn to analyze and understand data step by step.

This fundamental publication is intended for the widest range of readers and users of the STATISTICA system who want to become professionals in data analysis in various fields: business, marketing, finance, management, economics, industry, insurance, medicine, and other applications.

Forecasting in the STATISTICA System in the Windows Environment

V.P. Borovikov, G.I. Ivchenko

Volume: 368 pages

The book is currently out of stock.

Forecasting secrets at first hand.

A feature of the book is the combination of two interrelated and complementary parts: a practical part, which describes forecasting in the modern version of the STATISTICA system in detail, with translations of the main options and dialog boxes, and a theoretical part, which outlines the main ideas, methods, and results of the theory of stochastic forecasting.

According to the authors, this synthesis of theory and practice should ensure that the reader not only masters forecasting methods and techniques mechanically but also gains a coherent picture of them, from the mathematical foundations to practical skills in the STATISTICA system.

The book is based on a course taught by the authors at the Moscow State Institute of Electronics and Mathematics (MGIEM - Technical University). The appendix contains a comprehensive English-Russian dictionary of basic forecasting terms.
The book is aimed at scientists, analysts and specialists who use forecasting methods in their daily activities, and can also be used by teachers of higher educational institutions when reading courses on forecasting and mathematical statistics.

Geostatistics. Theory and practice

V.V. Demyanov, E.A. Savelyeva

Volume: 327 pages

The book is currently out of stock.

This book will answer the questions:
- What is geostatistics?
- What are the methods of spatial interpolation?
- What is kriging?
- How useful is a variogram?
- Why is stochastic modeling necessary?
- and many others

The monograph describes in detail the methods of geostatistics and related sections of spatial modeling. The presentation of the theory is accompanied by examples of the use of models in various fields: ecology, geology, hydrogeology, oil production, energy, fish stock assessment, etc. The final section outlines the main directions in the development of modern geostatistical theory. The publication can be used as a teaching aid.

The material of the book is presented with gradual complication. To consolidate the knowledge gained, there are questions and exercises. The book includes appendices that allow it to be used as a reference book on geostatistics.

StatSoft Academy of Data Analysis also offers a wide range of courses on modern methods and technologies of data analysis in the field of geoanalytics.

Industrial Statistics. Quality Control, Process Analysis, and Experiment Planning in the STATISTICA Package

Khalafyan A.A.

Volume: 384 pages

The book is currently out of stock.

This publication describes statistical methods that make it possible, from a limited volume of analyzed products, to judge product quality with a given degree of accuracy and reliability. Statistical analysis of product quality supports sound management decisions based not on intuition but on scientific methods for identifying patterns in accumulated numerical data.

The textbook covers such sections of industrial statistics as quality control charts, process analysis, Six Sigma, and design of experiments in the world-renowned STATISTICA package. A detailed description of the technology of working with the program modules is given.

The publication is addressed to students in the fields of "Economics", "Quality Management", "Standardization and Metrology", and "Metrology, Standardization and Certification", as well as graduate students, researchers, university professors, analysts, managers, and everyone interested in statistical methods in quality management.

How to win the world championship. Methods of mathematical statistics in the management of national football

Petrunin Yu.Yu., Ryazanov M.A.

Volume: 56 pages

The book is currently out of stock.

Modern methods of statistics and data analysis have led to the creation of new scientific disciplines: footballonomy and footballmetry. Using the apparatus developed in them, it is possible to assess the quality of work of state bodies (the Ministry of Sports) and non-profit organizations (football associations and unions), and to develop and apply regulatory measures that can raise the level of national football and its prestige on the world stage.

STATISTICA: Quick User Guide

Volume: 250 pages

The book is currently out of stock.

The book outlines the basic principles of working with the system and discusses toolbars, the user interface, data files, and practical examples of using the package. A separate chapter is devoted to setting up the system. The book also contains a comprehensive reference, which summarizes the most commonly used conventions, functions, and capabilities of the STATISTICA system, and a subject index.

R

R is a free software environment for statistical computing and graphics. It is a GNU project similar to the S language and environment that was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be thought of as a different implementation of S. There are some important differences, but most of the code written for S runs unchanged under R.

Free Open source Mac Windows Linux BSD

RStudio

RStudio™ is an integrated development environment (IDE) for the R programming language. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R.

Free Open source Mac Windows Linux Xfce

PSPP

PSPP is a free software application for the analysis of sampled data. It has a graphical user interface and a conventional command-line interface. It is written in C, uses the GNU Scientific Library for its mathematical routines, and uses plotutils for generating graphs. It is intended as a free replacement for the proprietary program SPSS.

Free Open source Mac Windows Linux

IBM SPSS Statistics

The IBM SPSS software platform offers advanced statistical analysis, a rich library of machine learning algorithms, text analytics, open source extensibility, big data integration, and seamless application deployment.

Paid Mac Windows Linux

SOFA Statistics

SOFA Statistics is an open source statistical package with an emphasis on ease of use, learning as you go, and great results. The name stands for "Statistics Open For All". It has a graphical user interface and can connect directly to MySQL, SQLite, MS Access, and MS SQL Server.

Free Open source Mac Windows Linux

What's on this list?

The list contains programs that can be used to replace STATISTICA on Windows platforms. This list contains 6 apps similar to STATISTICA.