Kendall's rank correlation coefficient

Kendall's correlation coefficient is used when variables are represented on two ordinal scales, provided there are no tied ranks. Calculating the Kendall coefficient involves counting the numbers of matches and inversions. Let's walk through this procedure using the data of the previous problem.

The algorithm for solving the problem is as follows:

    We rearrange the data of Table 8.5 so that one of its rows (here, the row x_i) is ranked. In other words, we sort the pairs (x, y) in ascending order of x and enter the data in columns 1 and 2 of Table 8.6.

Table 8.6 (columns: x_i, y_i, "matches", "inversions")

2. Determine the degree of "rankedness" of the second row (y_i). This is done in the following sequence:

a) take the first value of the unranked series, "3". Count how many ranks below this number are larger than it. There are 9 such values (ranks 6, 7, 4, 9, 5, 11, 8, 12 and 10); enter the number 9 in the "matches" column. Then count how many values are smaller than three. There are 2 such values (ranks 1 and 2); enter the number 2 in the "inversions" column.

b) discard the number 3 (it has already been processed) and repeat the procedure for the next value, "6": the number of matches is 6 (ranks 7, 9, 11, 8, 12 and 10), the number of inversions is 4 (ranks 1, 2, 4 and 5). Enter the number 6 in the "matches" column and the number 4 in the "inversions" column.

c) the procedure is repeated in the same way up to the end of the series; remember that each "processed" value is excluded from further consideration (only the ranks lying below it are counted).

Note

To avoid calculation errors, keep in mind that with each "step" the sum of matches and inversions decreases by one; this is natural, since each time one value is excluded from consideration.

3. The sum of matches (P) and the sum of inversions (Q) are calculated; the data are entered into one of three interchangeable formulas for the Kendall coefficient (8.10), and the corresponding calculations are carried out.

τ = (P − Q) / [n(n − 1)/2]   (8.10)

In our case τ = 0.55.

Table XIV of the Appendix contains the critical values of the coefficient for this sample size: τ_cr = 0.45 and 0.59. The empirically obtained value is compared with the tabulated one.

Conclusion

τ = 0.55 > τ_cr = 0.45. The correlation is statistically significant at the first significance level.

Note:

If necessary (for example, if no table of critical values is available), the statistical significance of Kendall's τ can be determined by the following formula:

z = S* / √[n(n − 1)(2n + 5)/18],   (8.11)

where S* = P − Q + 1 if P < Q, and S* = P − Q − 1 if P > Q.

The values of z for the corresponding significance level are the same as for the Pearson measure and are found in the corresponding tables (not included in the Appendix). For the standard significance levels, z_cr = 1.96 (for β1 = 0.95) and 2.58 (for β2 = 0.99). Kendall's correlation coefficient is statistically significant if z > z_cr.

In our case S* = P − Q − 1 = 35 and z = 2.40, i.e. the initial conclusion is confirmed: the correlation between the characteristics is statistically significant at the first significance level.
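For readers who want to automate the count, here is a minimal Python sketch of the matches-and-inversions procedure described above. The sample series of y-ranks is borrowed from the seven-candidate testing example worked out in the "Brief theory" part at the end of this text, so the totals P = 16 and Q = 5 can be checked by hand.

    import math

    def kendall_matches_inversions(y_ranks):
        """Count matches (P) and inversions (Q) for a series whose first
        variable is already in ranked order, then compute tau and z."""
        n = len(y_ranks)
        P = Q = 0
        for i in range(n):
            for j in range(i + 1, n):      # only ranks lying below the current one
                if y_ranks[j] > y_ranks[i]:
                    P += 1                 # match
                else:
                    Q += 1                 # inversion
        tau = (P - Q) / (n * (n - 1) / 2)  # formula (8.10)
        s_star = P - Q - 1 if P > Q else P - Q + 1
        z = s_star / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)  # formula (8.11)
        return P, Q, tau, z

    # y-ranks ordered by the ranks of the first variable (candidate example)
    y = [1, 4, 3, 6, 2, 5, 7]
    print(kendall_matches_inversions(y))   # P=16, Q=5, tau ≈ 0.52, z ≈ 1.50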

To calculate Kendall's rank correlation coefficient r_k, the data must be ranked by one of the characteristics in ascending order and the corresponding ranks of the second characteristic determined. Then, for each rank of the second characteristic, the number of subsequent ranks larger in value than the given rank is determined, and the sum of these numbers is found.

Kendall's rank correlation coefficient is given by

r_k = 4 Σ R_i / [n(n − 1)] − 1,

where R_i is the number of ranks of the second variable, starting from position i + 1, whose value is greater than the value of the i-th rank of this variable.

There are tables of percentage points of the distribution of r_k that allow one to test the hypothesis about the significance of the correlation coefficient.

For large sample sizes, critical values of r_k are not tabulated and have to be calculated using approximate formulas, based on the fact that under the null hypothesis H0: r_k = 0, for large n, the random variable

z = r_k √[9n(n − 1) / (2(2n + 5))]

is distributed approximately according to the standard normal law.

40. Dependence between traits measured on a nominal or ordinal scale

Often the task arises of checking the independence of two characteristics measured on a nominal or ordinal scale.

Let some objects have two measured characteristics X and Y, with r and s levels respectively. It is convenient to present the results of such observations in the form of a table called a contingency table.

In the table, u_i (i = 1, …, r) and v_j (j = 1, …, s) are the values taken by the characteristics, and n_ij is the number of objects, out of the total number of objects, for which the characteristic X took the value u_i and the characteristic Y the value v_j.

Let us introduce the following quantities:

n_i· = Σ_j n_ij – the number of objects for which X takes the value u_i;

n_·j = Σ_i n_ij – the number of objects for which Y takes the value v_j.

In addition, the obvious equalities hold:

Σ_i n_i· = Σ_j n_·j = n.

Discrete random variables X and Y are independent if and only if

P(X = u_i, Y = v_j) = P(X = u_i) P(Y = v_j)

for all pairs i, j.

Therefore, the hypothesis of independence of the discrete random variables X and Y can be written as:

H0: p_ij = p_i· p_·j for all i, j.

As the alternative, as a rule, the hypothesis

H1: p_ij ≠ p_i· p_·j for at least one pair (i, j) is used.

The validity of the hypothesis H0 should be judged from the sample frequencies n_ij of the contingency table. In accordance with the law of large numbers, as n → ∞ the relative frequencies are close to the corresponding probabilities:

n_ij / n ≈ p_ij,  n_i· / n ≈ p_i·,  n_·j / n ≈ p_·j.

To test the hypothesis H0 the statistic

χ² = n Σ_i Σ_j (n_ij − n_i· n_·j / n)² / (n_i· n_·j)

is used, which, if the hypothesis is true, has a χ² distribution with rs − (r + s − 1) = (r − 1)(s − 1) degrees of freedom.

The χ² independence test rejects the hypothesis H0 at significance level α if χ² > χ²_{α; (r−1)(s−1)}, the upper α percentage point of this distribution.
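A small Python sketch of this criterion, written as a plain Pearson χ² computation over the contingency table; the 2 × 3 table of counts is hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def chi2_independence(table, alpha=0.05):
        """Pearson chi-square independence test for an r x s contingency table."""
        n_ij = np.asarray(table, dtype=float)
        n = n_ij.sum()
        n_i = n_ij.sum(axis=1, keepdims=True)       # row totals n_{i.}
        n_j = n_ij.sum(axis=0, keepdims=True)       # column totals n_{.j}
        expected = n_i * n_j / n                    # n_{i.} n_{.j} / n
        stat = ((n_ij - expected) ** 2 / expected).sum()
        df = (n_ij.shape[0] - 1) * (n_ij.shape[1] - 1)   # = rs - (r + s - 1)
        return stat, df, stat > chi2.ppf(1 - alpha, df)  # True -> reject H0

    # hypothetical 2 x 3 table of counts
    print(chi2_independence([[20, 30, 25], [15, 40, 10]]))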


41. Regression analysis. Basic concepts of regression analysis

For a mathematical description of the statistical relationships between the variables under study, the following tasks must be solved:

- select a class of functions in which it is advisable to look for the best (in a certain sense) approximation of the dependence of interest;

- find estimates of the unknown values of the parameters entering the equations of the desired dependence;

- establish the adequacy of the resulting equation for the desired relationship;

- identify the most informative input variables.

The totality of the listed tasks constitutes the subject of regression analysis.

The regression function (or regression) is the dependence of the mathematical expectation of one random variable on the value taken by another random variable that forms a two-dimensional system of random variables with the first.

Let there be a system of random variables (X, Y); then the regression function of Y on X is

f(x) = M[Y | X = x],

and the regression function of X on Y is

φ(y) = M[X | Y = y].

The regression functions f(x) and φ(y) are not mutually inverse unless the relationship between X and Y is functional.

For an n-dimensional vector with coordinates X_1, X_2, …, X_n one can consider the conditional mathematical expectation of any component. For example, for X_1,

f(x_2, …, x_n) = M[X_1 | X_2 = x_2, …, X_n = x_n],

called the regression of X_1 on X_2, …, X_n.

To fully define the regression function, it is necessary to know the conditional distribution of the output variable for fixed values ​​of the input variable.

Since in a real situation such information is not available, one is usually limited to searching for a suitable approximating function f_a(x) for f(x), based on statistical data of the form (x_i, y_i), i = 1, …, n. These data are the results of n independent observations y_1, …, y_n of the random variable Y at the input values x_1, …, x_n; regression analysis assumes that the values of the input variable are specified exactly.

The problem of choosing the best approximating function f_a(x), the central one in regression analysis, has no formalized procedures for its solution. Sometimes the choice is made from analysis of the experimental data, more often from theoretical considerations.

If the regression function is assumed to be sufficiently smooth, then the approximating function f_a(x) can be represented as a linear combination of a set of linearly independent basis functions ψ_k(x), k = 0, 1, …, m − 1, i.e. in the form

f_a(x; θ) = Σ_{k=0}^{m−1} θ_k ψ_k(x),

where m is the number of unknown parameters θ_k (in the general case this number is unknown and is refined during model construction).

Such a function is linear in its parameters, so in the case under consideration we speak of a regression function model that is linear in its parameters.

Then the task of finding the best approximation of the regression line f(x) reduces to finding the parameter values θ at which f_a(x; θ) is most adequate to the available data. One method that allows this problem to be solved is the least squares method.
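As an illustration of a model linear in its parameters, here is a hedged numpy sketch that estimates θ for a polynomial basis by least squares (anticipating the next section); all data are synthetic, and the choice of basis is an assumption made for the example.

    import numpy as np

    # basis functions psi_k(x); a polynomial basis is chosen for illustration
    basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 5.0, 30)
    y = 1.0 + 0.5 * x - 0.2 * x ** 2 + rng.normal(0.0, 0.1, x.size)  # synthetic

    Psi = np.column_stack([psi(x) for psi in basis])   # design matrix
    theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)    # least squares estimates
    print(theta)                                       # close to (1.0, 0.5, -0.2)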

42. Least squares method

Let the set of points (x_i, y_i), i = 1, …, n, lie on the plane along some straight line.

Then, as the function f_a(x) approximating the regression function f(x) = M[Y | x], it is natural to take a linear function of the argument x:

f_a(x) = θ_0 + θ_1 x.

That is, the basis functions chosen here are ψ_0(x) ≡ 1 and ψ_1(x) ≡ x. This type of regression is called simple linear regression.

If the set of points (x_i, y_i), i = 1, …, n, lies along some curve, then as f_a(x) it is natural to try a suitable family of curves, for example y = θ_0 e^{θ_1 x}. This function is nonlinear in the parameters θ_0 and θ_1; however, by a functional transformation (in this case taking logarithms) it can be reduced to a new function f′_a(x) that is linear in its parameters:

ln y = ln θ_0 + θ_1 x.
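A short sketch of such a linearization in Python, with synthetic data generated from an assumed exponential model:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0.5, 4.0, 20)
    y = 2.0 * np.exp(0.8 * x) * rng.lognormal(0.0, 0.05, x.size)  # theta0*exp(theta1*x)

    # after taking logs the model ln y = ln theta0 + theta1 * x is linear
    theta1, ln_theta0 = np.polyfit(x, np.log(y), 1)
    print(np.exp(ln_theta0), theta1)   # close to 2.0 and 0.8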


43. Simple Linear Regression

The simplest regression model is the simple (univariate, one-factor, paired) linear model, which has the following form:

y_i = a + b x_i + ε_i,  i = 1, …, n,

where ε_i are random variables (errors), uncorrelated with each other, with zero mathematical expectations and identical variances σ², and a and b are constant coefficients (parameters) that must be estimated from the measured response values y_i.

To find the estimates â and b̂ of the linear regression parameters a and b, i.e. the straight line ŷ = â + b̂x that best fits the experimental data, the least squares method is used.

According to the least squares method, the estimates of the parameters a and b are found from the condition of minimizing the sum of squared vertical deviations of the values y_i from the "true" regression line:

D = Σ_{i=1}^{n} (y_i − a − b x_i)² → min.

Let ten observations of the random variable Y be made at fixed values of the variable X.

To minimize D, we set the partial derivatives with respect to a and b equal to zero:

∂D/∂a = −2 Σ (y_i − a − b x_i) = 0,

∂D/∂b = −2 Σ x_i (y_i − a − b x_i) = 0.
As a result, we obtain the following system of equations for finding the estimates of a and b:

n a + b Σ x_i = Σ y_i,
a Σ x_i + b Σ x_i² = Σ x_i y_i.
Solving these two equations gives:

b̂ = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²),  â = ȳ − b̂ x̄.
The expressions for the parameter estimates â and b̂ can also be represented as

b̂ = cov(x, y) / s_x²,  â = ȳ − b̂ x̄.

Then the empirical equation of the regression line of Y on X can be written as

ŷ = â + b̂ x.
An unbiased estimate of the variance σ² of the deviations of the values y_i from the fitted regression line is given by

s_0² = (1 / (n − 2)) Σ (y_i − ŷ_i)².

Let's calculate the parameters of the regression equation


Thus, the regression line looks like:


and the estimate of the variance of the deviations of the values y_i from the fitted regression line is
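Since the numeric table of this worked example is not reproduced above, here is a sketch of the same least squares computation on hypothetical numbers:

    import numpy as np

    x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])   # hypothetical factor values
    y = np.array([92.0, 86.0, 71.0, 58.0, 45.0])  # hypothetical responses

    n = x.size
    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = y.mean() - b * x.mean()                   # normal-equation solution
    y_hat = a + b * x
    s0_sq = ((y - y_hat) ** 2).sum() / (n - 2)    # unbiased estimate of sigma^2
    print(a, b, s0_sq)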


44. Checking the significance of the regression line

The estimate b̂ ≠ 0 that has been found may be a realization of a random variable whose mathematical expectation is zero; that is, it may turn out that there is actually no regression dependence.

To deal with this situation, one should test the hypothesis H0: b = 0 against the competing hypothesis H1: b ≠ 0.

Testing the significance of a regression line can be done using analysis of variance.

Consider the following identity:

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i).

The quantity y_i − ŷ_i = ε_i is called the residual and is the difference between two quantities:

- the deviation of the observed value (response) from the overall average response;

- the deviation of the predicted response value ŷ_i from the same average.

Squaring both sides of the written identity and summing over i (the cross term vanishes for the least squares line), we get:

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)², i.e. SC_n = SC_p + SC_0,

where:

- the complete (total) sum of squares SC_n equals the sum of squared deviations of the observations from the average value of the observations;

- the sum of squares due to regression SC_p equals the sum of squared deviations of the values of the regression line from the average of the observations;

- the residual sum of squares SC_0 equals the sum of squared deviations of the observations from the values of the regression line.
Thus, the spread of the Y values about their mean can be attributed to some extent to the fact that not all observations lie on the regression line. If they all did, the residual sum of squares SC_0 would be zero. It follows that the regression is significant if the sum of squares SC_p is large compared with the sum of squares SC_0.

Calculations for testing the significance of the regression are carried out in the following ANOVA table.

If the errors ε_i are normally distributed, then when the hypothesis H0: b = 0 is true, the statistic

F = SC_p / (SC_0 / (n − 2))

has the Fisher distribution with 1 and n − 2 degrees of freedom.

The null hypothesis is rejected at significance level α if the calculated value of the statistic F is greater than the α percentage point f_{1; n−2; α} of the Fisher distribution.
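A compact Python sketch of this significance check under the stated assumptions (normal errors); the data are the same hypothetical numbers used earlier:

    import numpy as np
    from scipy.stats import f

    def regression_f_test(x, y, alpha=0.05):
        """ANOVA F-test of H0: b = 0 for simple linear regression."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
        a = y.mean() - b * x.mean()
        y_hat = a + b * x
        sc_p = ((y_hat - y.mean()) ** 2).sum()        # regression sum of squares
        sc_0 = ((y - y_hat) ** 2).sum()               # residual sum of squares
        F = sc_p / (sc_0 / (n - 2))
        return F, F > f.ppf(1 - alpha, 1, n - 2)      # True -> significant

    print(regression_f_test([5, 10, 15, 20, 25], [92, 86, 71, 58, 45]))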

45. Checking the adequacy of the regression model. Residual method

A constructed regression model is considered adequate if no other model provides a significant improvement in predicting the response.

If all response values are obtained at different values of x, i.e. there are no repeated response values at the same x_i, then only a limited check of the adequacy of the linear model can be carried out. The basis for such a check is the residuals

d_i = y_i − ŷ_i,

the deviations from the fitted relationship.
Since X is a one-dimensional variable, the points (x_i, d_i) can be depicted on a plane in the form of a so-called residual plot. This representation sometimes makes it possible to detect some pattern in the behavior of the residuals. In addition, residual analysis allows one to check the assumption about the error distribution law.

When the errors are normally distributed and an a priori estimate of their variance σ² is available (an estimate obtained from previously performed measurements), a more accurate check of the adequacy of the model is possible.

Fisher's F-test can be used to check whether the residual variance s_0² differs significantly from the a priori estimate. If it is significantly greater, there is inadequacy, and the model should be revised.

If there is no a priori estimate of σ², but the response Y was measured two or more times at the same values of X, then these repeated observations can be used to obtain another estimate of σ² (the first being the residual variance). Such an estimate is said to represent "pure" error, because if x is identical for two or more observations, only random changes can affect the results and create scatter between them.

The resulting estimate turns out to be a more reliable estimate of the variance than estimates obtained by other methods. For this reason, when planning experiments, it makes sense to perform experiments with repetitions.

Suppose there are m different values of X: x_1, x_2, …, x_m, and for each value x_i there are n_i observations of the response Y. The total number of observations is

n = Σ_{i=1}^{m} n_i.
Then the simple linear regression model can be written as:

y_ij = a + b x_i + ε_ij,  i = 1, …, m,  j = 1, …, n_i.
Let us find the variance of the "pure" errors. It is the pooled estimate of the variance σ² if the response values y_ij at x = x_i are treated as a sample of size n_i. As a result, the variance of the "pure" errors equals

s_e² = Σ_i Σ_j (y_ij − ȳ_i)² / (n − m),

where ȳ_i is the average of the observations at x_i.
This variance serves as an estimate of σ² regardless of whether the fitted model is correct.

Let us show that the sum of squares of the "pure" errors is part of the residual sum of squares (the sum of squares entering the expression for the residual variance). The residual for the j-th observation at x_i can be written as

y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i).
Squaring both sides of this equation and summing over j and over i, we get:

Σ_i Σ_j (y_ij − ŷ_i)² = Σ_i Σ_j (y_ij − ȳ_i)² + Σ_i n_i (ȳ_i − ŷ_i)².

On the left of this equality is the residual sum of squares. The first term on the right is the sum of squares of the "pure" errors; the second term can be called the sum of squares of inadequacy (lack of fit). The latter sum has m − 2 degrees of freedom, hence the variance of inadequacy is

s_lof² = Σ_i n_i (ȳ_i − ŷ_i)² / (m − 2).
The test statistic for testing the hypothesis H0 (the simple linear model is adequate) against the hypothesis H1 (the simple linear model is inadequate) is the random variable

F = s_lof² / s_e².

If the null hypothesis is true, F has the Fisher distribution with m − 2 and n − m degrees of freedom. The linearity hypothesis should be rejected at significance level α if the computed value of the statistic is greater than the α percentage point of the Fisher distribution with m − 2 and n − m degrees of freedom.
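A sketch of this lack-of-fit test in Python, assuming repeated observations at each x value; all numbers are hypothetical:

    import numpy as np
    from scipy.stats import f

    def lack_of_fit_test(x, y, alpha=0.05):
        """Pure-error lack-of-fit test for simple linear regression."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
        a = y.mean() - b * x.mean()
        levels = np.unique(x)                         # m distinct x values
        m, n = levels.size, x.size
        ss_pe = sum(((y[x == xi] - y[x == xi].mean()) ** 2).sum() for xi in levels)
        ss_res = ((y - (a + b * x)) ** 2).sum()
        ss_lof = ss_res - ss_pe                       # lack-of-fit sum of squares
        F = (ss_lof / (m - 2)) / (ss_pe / (n - m))
        return F, F > f.ppf(1 - alpha, m - 2, n - m)  # True -> model inadequate

    x = [1, 1, 2, 2, 3, 3, 4, 4]
    y = [2.1, 1.9, 3.2, 2.8, 3.9, 4.1, 5.2, 4.8]
    print(lack_of_fit_test(x, y))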

46. Checking the adequacy of the regression model (see 45). Analysis of variance

47. Checking the adequacy of the regression model (see 45). Determination coefficient

Sometimes the sample coefficient of determination R² is used to characterize the quality of the regression line; it shows what part (share) of the total sum of squares SC_n is made up by the sum of squares due to regression, SC_p:

R² = SC_p / SC_n.
The closer R² is to unity, the better the regression approximates the experimental data and the closer the observations are to the regression line. If R² = 0, then changes in the response are entirely due to unaccounted factors, and the regression line is parallel to the x-axis. In the case of simple linear regression, the coefficient of determination R² equals the square of the correlation coefficient, r².

The maximum value R² = 1 can be attained only when all observations were made at different values of x. If the data contain repeated experiments, R² cannot reach unity, no matter how good the model is.

48. Confidence intervals for simple linear regression parameters

Just as the sample mean is an estimate of the true mean (the population mean), the sample parameters a and b of the regression equation are nothing more than estimates of the true regression coefficients. Different samples produce different estimates of the mean, just as different samples produce different estimates of the regression coefficients.

Assuming that the errors ε_i are described by the normal law, the estimate b̂ will have a normal distribution with the parameters

M[b̂] = b,  D[b̂] = σ² / Σ (x_i − x̄)².
Since the estimate â is a linear combination of independent normally distributed quantities, it will also have a normal distribution, with mathematical expectation and variance

M[â] = a,  D[â] = σ² Σ x_i² / (n Σ (x_i − x̄)²).
In this case, the (1 − α) confidence interval for estimating the variance σ², taking into account that the ratio (n − 2)s_0²/σ² is distributed according to the χ² law with n − 2 degrees of freedom, is determined by the expression

(n − 2)s_0² / χ²_{α/2; n−2} < σ² < (n − 2)s_0² / χ²_{1−α/2; n−2}.
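A hedged Python sketch that assembles the confidence intervals of this section (t-based intervals for a and b, and the χ²-based interval for σ²) on hypothetical data:

    import numpy as np
    from scipy.stats import t, chi2

    def param_confidence_intervals(x, y, alpha=0.05):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
        a = y.mean() - b * x.mean()
        s0_sq = ((y - a - b * x) ** 2).sum() / (n - 2)
        sxx = ((x - x.mean()) ** 2).sum()
        tq = t.ppf(1 - alpha / 2, n - 2)
        ci_b = (b - tq * np.sqrt(s0_sq / sxx), b + tq * np.sqrt(s0_sq / sxx))
        se_a = np.sqrt(s0_sq * (x ** 2).sum() / (n * sxx))
        ci_a = (a - tq * se_a, a + tq * se_a)
        ci_var = ((n - 2) * s0_sq / chi2.ppf(1 - alpha / 2, n - 2),
                  (n - 2) * s0_sq / chi2.ppf(alpha / 2, n - 2))
        return ci_a, ci_b, ci_var

    print(param_confidence_intervals([5, 10, 15, 20, 25], [92, 86, 71, 58, 45]))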


49. Confidence intervals for the regression line. Confidence interval for dependent variable values

We usually do not know the true values of the regression coefficients a and b; we only know their estimates. In other words, the true regression line may run higher or lower, be steeper or flatter, than the one built from the sample data. We have calculated confidence intervals for the regression coefficients. One can also calculate a confidence region for the regression line itself.

Suppose for simple linear regression we need to construct a (1 − α) confidence interval for the mathematical expectation of the response Y at the value x = x_0. This mathematical expectation equals a + b x_0, and its estimate is

ŷ_0 = â + b̂ x_0.

Since M[â] = a and M[b̂] = b, this estimate is unbiased: M[ŷ_0] = a + b x_0.

The resulting estimate of the mathematical expectation is a linear combination of uncorrelated normally distributed quantities and therefore also has a normal distribution, centered at the true value of the conditional mathematical expectation, with variance

D[ŷ_0] = σ² (1/n + (x_0 − x̄)² / Σ (x_i − x̄)²).

Therefore, the confidence interval for the regression line at each value x_0 can be represented as

ŷ_0 ± t_{n−2; α/2} s_0 √(1/n + (x_0 − x̄)² / Σ (x_i − x̄)²).
As can be seen, the confidence interval is narrowest when x_0 equals the mean value x̄ and widens as x_0 moves away from the mean in either direction.
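A short sketch of this pointwise interval on hypothetical data:

    import numpy as np
    from scipy.stats import t

    def regression_line_ci(x, y, x0, alpha=0.05):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
        a = y.mean() - b * x.mean()
        s0 = np.sqrt(((y - a - b * x) ** 2).sum() / (n - 2))
        half = (t.ppf(1 - alpha / 2, n - 2) * s0
                * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()))
        y0 = a + b * x0
        return y0 - half, y0 + half   # narrowest when x0 equals the mean of x

    print(regression_line_ci([5, 10, 15, 20, 25], [92, 86, 71, 58, 45], x0=15))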

To obtain a set of joint confidence intervals suitable for the regression function over its entire length, in the expression above one must substitute √(2 f_{2; n−2; α}) instead of t_{n−2; α/2}.

The needs of economic and social practice require the development of methods for the quantitative description of processes that make it possible to record not only quantitative but also qualitative factors. Provided the values of qualitative characteristics can be ordered or ranked by the degree of decrease (increase) of the characteristic, it is possible to assess the closeness of the relationship between qualitative characteristics. A qualitative characteristic is one that cannot be measured exactly but that allows objects to be compared with one another and, therefore, arranged in order of decreasing or increasing quality. The real content of measurements on ranking scales is the order in which objects are arranged according to the degree of expression of the characteristic being measured.

Rank correlation is very useful for practical purposes. For example, if a high rank correlation is established between two qualitative characteristics of products, then it suffices to control the products by only one of the characteristics, which reduces cost and speeds up control.

As an example, consider whether there is a relationship between the commercial output of a number of enterprises and their overhead costs for sales. Over the course of 10 observations, the following table was obtained:

Let's order the values of X in ascending order, assigning each value its serial number (rank):

Thus,

Let's build the following table, in which the pairs X and Y obtained as a result of observation are recorded together with their ranks:

Denoting the rank difference by d_i, we write the formula for calculating the sample Spearman correlation coefficient:

ρ = 1 − 6 Σ d_i² / [n(n² − 1)],

where n is the number of observations, which is also the number of pairs of ranks.
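A minimal check of this formula in Python; the two score series are reused from the candidate-testing example near the end of the text, and since there are no tied ranks the hand formula agrees with scipy's spearmanr:

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    x = [31, 82, 25, 26, 53, 30, 29]   # two test-score series reused from the
    y = [21, 55, 8, 27, 32, 42, 26]    # candidate example at the end of the text

    d = rankdata(x) - rankdata(y)      # rank differences d_i
    n = len(x)
    rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
    print(rho, spearmanr(x, y)[0])     # the two values coincide (no tied ranks)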

The Spearman coefficient has the following properties:

If there is a complete direct relationship between the qualitative characteristics X and Y, in the sense that the ranks of the objects coincide for all values of i, then the sample Spearman correlation coefficient equals 1. Indeed, in that case all d_i = 0, and substituting into the formula gives 1.

If there is a complete inverse relationship between the qualitative characteristics X and Y, in the sense that rank i of one characteristic corresponds to rank n − i + 1 of the other, then the sample Spearman correlation coefficient equals −1.

Indeed, in this case Σ d_i² takes its maximum value, n(n² − 1)/3.

Substituting this value into the Spearman correlation coefficient formula, we get −1.

If there is neither a complete direct nor a complete inverse relationship, the sample Spearman correlation coefficient lies between −1 and 1, and the closer its value is to 0, the weaker the relationship between the characteristics.

Using the data of the example above, we find the value of ρ; to do this, we extend the table with the values d_i and d_i²:

Sample Kendall correlation coefficient. The relationship between two qualitative characteristics can also be evaluated using the Kendall rank correlation coefficient.

Let the ranks of the objects in a sample of size n be:

by characteristic X: 1, 2, …, n;

by characteristic Y: y_1, y_2, …, y_n.

For each y_i, let R_i⁺ be the number of ranks to its right that are larger than y_i. Let us introduce the notation for the sum of these counts:

R⁺ = R_1⁺ + R_2⁺ + … + R_{n−1}⁺.

Similarly, we introduce the notation R⁻ for the sum of the numbers of ranks lying to the right that are smaller.

The sample Kendall correlation coefficient is written as

τ = (R⁺ − R⁻) / [n(n − 1)/2],

where n is the sample size.

The Kendall coefficient has the same properties as the Spearman coefficient:

If there is a complete direct relationship between the qualitative characteristics X and Y, in the sense that the ranks of the objects coincide for all values of i, then the sample Kendall correlation coefficient equals 1. Indeed, to the right of the first rank there are n − 1 larger ranks, to the right of the second n − 2, and so on; hence R⁺ = n(n − 1)/2 and R⁻ = 0, and the Kendall coefficient equals 1.

If there is a complete inverse relationship between the qualitative characteristics X and Y, in the sense that rank i of one corresponds to rank n − i + 1 of the other, then the sample Kendall correlation coefficient equals −1. In this case no larger ranks lie to the right of any element, so R⁺ = 0; likewise R⁻ = n(n − 1)/2. Substituting R⁺ = 0 into the Kendall coefficient formula, we get −1.

For a sufficiently large sample size, and for values of the rank correlation coefficients not close to 1, the approximate equality holds:

ρ ≈ (3/2) τ.

The Kendall coefficient gives a more conservative estimate of the correlation than the Spearman coefficient (the numerical value of τ is always smaller than ρ). Although calculating τ is more labor-intensive than calculating ρ, τ is easier to recalculate when a new term is added to the series.

An important advantage of the Kendall coefficient is that it can be used to determine the partial rank correlation coefficient, which allows one to assess the degree of the "pure" relationship between two rank characteristics, eliminating the influence of a third:

τ_{12·3} = (τ_{12} − τ_{13} τ_{23}) / √[(1 − τ_{13}²)(1 − τ_{23}²)].

Significance of rank correlation coefficients. When the strength of rank correlation is determined from sample data, the following question arises: how confidently can one conclude that a correlation exists in the population, given the sample rank correlation coefficient obtained? In other words, the significance of the observed rank correlations should be tested against the hypothesis of statistical independence of the two rankings under consideration.

For a relatively large sample size n, the significance of the rank correlation coefficients can be checked using the normal distribution table (Appendix Table 1). To test the significance of the Spearman coefficient ρ (for n > 20), calculate the value

t_ρ = ρ √[(n − 2) / (1 − ρ²)],

and to test the significance of the Kendall coefficient τ (for n > 10), calculate the value

z_τ = S / √[n(n − 1)(2n + 5)/18],

where S = R⁺ − R⁻ and n is the sample size.

Next, one sets the significance level α, determines the critical value t_cr(α, k) from the table of critical points of the Student distribution, and compares the calculated value t_ρ or z_τ with it. The number of degrees of freedom is taken as k = n − 2. If t_ρ or z_τ > t_cr, the corresponding coefficient is considered significant.
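A small sketch of these significance checks, following the text's convention of comparing both statistics with a Student critical point with k = n − 2; the input values are illustrative:

    import math
    from scipy.stats import t

    def spearman_significance(rho, n, alpha=0.05):
        t_rho = rho * math.sqrt((n - 2) / (1 - rho ** 2))
        return t_rho, abs(t_rho) > t.ppf(1 - alpha / 2, n - 2)

    def kendall_significance(S, n, alpha=0.05):
        z_tau = S / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
        # following the text, compare with a Student critical point, k = n - 2
        return z_tau, abs(z_tau) > t.ppf(1 - alpha / 2, n - 2)

    print(spearman_significance(rho=0.68, n=25))   # illustrative values
    print(kendall_significance(S=110, n=25))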

Fechner correlation coefficient.

Finally, we should mention the Fechner coefficient, which characterizes the elementary degree of closeness of a relationship and which it is advisable to use to establish the existence of a relationship when there is little initial information. Its calculation is based on the direction of the deviations from the arithmetic mean in each variation series and on the consistency of the signs of these deviations for the two series whose relationship is measured.

This coefficient is determined by the formula

K_f = (n_a − n_b) / (n_a + n_b),

where n_a is the number of coincidences of the signs of the deviations of individual values from their arithmetic means, and n_b is the number of mismatches.

The Fechner coefficient can vary within −1.0 ≤ K_f ≤ +1.0.
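A minimal sketch of the Fechner coefficient; the two series are reused from the candidate example and are purely illustrative here:

    import numpy as np

    def fechner(x, y):
        """Sign-agreement coefficient: K = (n_a - n_b) / (n_a + n_b)."""
        sx = np.sign(np.asarray(x, float) - np.mean(x))
        sy = np.sign(np.asarray(y, float) - np.mean(y))
        n_a = np.sum(sx == sy)        # coinciding deviation signs
        n_b = np.sum(sx != sy)        # mismatching signs
        return (n_a - n_b) / (n_a + n_b)

    # ≈ 0.71 for these illustrative series
    print(fechner([31, 82, 25, 26, 53, 30, 29], [21, 55, 8, 27, 32, 42, 26]))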

Applied aspects of rank correlation. As already noted, rank correlation coefficients can be used not only for qualitative analysis of the relationship between two rank characteristics but also for determining the strength of the relationship between a rank characteristic and a quantitative one. In this case, the values of the quantitative characteristic are ordered and assigned the corresponding ranks.

There are also a number of situations in which calculating rank correlation coefficients is advisable for determining the strength of the relationship between two quantitative characteristics. Thus, if the distribution of one of them (or both) deviates significantly from the normal distribution, determining the significance level of the sample correlation coefficient r becomes incorrect, whereas the rank coefficients ρ and τ are not subject to such restrictions when the significance level is determined.

Another situation of this kind arises when the relationship between two quantitative characteristics is nonlinear (but monotonic). If the number of objects in the sample is small, or if the sign of the relationship matters to the researcher, using the correlation ratio η may be inadequate here. Calculating the rank correlation coefficient allows these difficulties to be circumvented.

Practical part

Task 1. Correlation and regression analysis

Statement and formalization of the problem:

An empirical sample is given, compiled from a number of observations of equipment failures and of the number of manufactured products. The sample implicitly characterizes the relationship between the volume of failed equipment and the number of manufactured products. The meaning of the sample makes it clear that products are made on the equipment remaining in service: the higher the percentage of failed equipment, the fewer products are manufactured. It is required to study the sample for a correlation-regression dependence, that is, to establish the form of the dependence, estimate the regression function (regression analysis), and also identify the relationship between the random variables and evaluate its closeness (correlation analysis). An additional task of the correlation analysis is to estimate the regression equation of one variable on the other. In addition, it is necessary to predict the number of products produced at 30% equipment failure.

Let us formalize the given sample in the table, designating the data “Equipment failure, %” as X, the data “Number of products” as Y:

Initial data. Table 1

From the physical meaning of the problem it is clear that the number of manufactured products Y depends directly on the percentage of equipment failure; that is, there is a dependence of Y on X. When performing regression analysis, a mathematical relationship (regression) connecting the values of X and Y must be found. In this case, regression analysis, in contrast to correlation analysis, assumes that the value X acts as the independent variable (factor) and the value Y as the dependent variable (resultant attribute). Thus, it is necessary to synthesize an adequate economic-mathematical model, i.e. to determine (find, select) a function Y = f(X) characterizing the relationship between X and Y, with which the value of Y at X = 30 can be predicted. This problem can be solved by correlation-regression analysis.

A brief overview of methods for solving correlation-regression problems and justification for the chosen solution method.

Regression analysis methods are divided, by the number of factors influencing the resultant characteristic, into single-factor and multifactor ones. Single-factor: the number of independent factors equals 1, i.e. Y = F(X);

multifactor: the number of factors is greater than 1, i.e. Y = F(X_1, …, X_m).

By the number of dependent variables (resultant characteristics) studied, regression problems can also be divided into problems with one and with many resultant characteristics. In general, a problem with many resultant characteristics can be written as (Y_1, …, Y_k) = F(X_1, …, X_m).

The method of correlation-regression analysis consists in finding the parameters of an approximating dependence of the form Y = f(X, a_0, a_1, …).

Since the problem above involves only one independent variable, i.e. the dependence on only one factor influencing the result is studied, the study of a one-factor dependence, or paired regression, should be used.

If there is only one factor, the dependence is defined as Y = f(X).

The form of the specific regression equation depends on the choice of the function that represents the statistical relationship between the factor and the resultant characteristic, and includes the following types:

linear regression, an equation of the form y = a_0 + a_1 x;

parabolic, an equation of the form y = a_0 + a_1 x + a_2 x²;

cubic, an equation of the form y = a_0 + a_1 x + a_2 x² + a_3 x³;

hyperbolic, an equation of the form y = a_0 + a_1 / x;

semilogarithmic, an equation of the form y = a_0 + a_1 lg x;

exponential, an equation of the form y = a_0 · a_1^x;

power, an equation of the form y = a_0 · x^{a_1}.

Finding the function comes down to determining the parameters of the regression equation and assessing the reliability of the equation itself. For determining the parameters, one can use either the least squares method or the method of least moduli.

The first consists in requiring that the sum of squared deviations of the empirical values Y_i from the calculated values Ŷ_i be minimal.

The method of least moduli consists in minimizing the sum of the absolute values of the differences between the empirical values Y_i and the calculated values Ŷ_i.

To solve the problem we choose the least squares method, as the simplest one that also gives estimates that are good in terms of their statistical properties.

Technology for solving the problem of regression analysis using the least squares method.

The type of relationship (linear, quadratic, cubic, etc.) between the variables can be determined by estimating the deviation of the actual values y from the calculated ones:

S = Σ (y_i − ŷ_i)²,

where y_i are the empirical values and ŷ_i the values calculated from the approximating function. By estimating S for the various functions and choosing the smallest value, we select the approximating function.

The type of the particular function is determined by finding its coefficients, which are found for each function as the solution of a certain system of equations:

linear regression, equation y = a_0 + a_1 x, system:

n a_0 + a_1 Σx = Σy,
a_0 Σx + a_1 Σx² = Σxy;

parabolic, equation y = a_0 + a_1 x + a_2 x², system:

n a_0 + a_1 Σx + a_2 Σx² = Σy,
a_0 Σx + a_1 Σx² + a_2 Σx³ = Σxy,
a_0 Σx² + a_1 Σx³ + a_2 Σx⁴ = Σx²y;

cubic, equation y = a_0 + a_1 x + a_2 x² + a_3 x³, with an analogous system of four normal equations containing powers of x up to x⁶.

Having solved the system, we find the coefficients, with the help of which we arrive at a specific expression of the analytical function; having that, we find the calculated values ŷ_i. Then all the data are available for estimating the deviation measure S and analyzing its minimum.
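In practice the three systems of normal equations can be solved with numpy's polyfit, and the deviation measure S compared directly; the data below are hypothetical:

    import numpy as np

    x = np.array([4.0, 9.0, 13.0, 21.0, 30.0])    # hypothetical failure percentages
    y = np.array([98.0, 83.0, 68.0, 54.0, 30.0])  # hypothetical product counts

    for degree in (1, 2, 3):                      # linear, parabolic, cubic
        coeffs = np.polyfit(x, y, degree)         # solves the normal equations
        S = ((y - np.polyval(coeffs, x)) ** 2).sum()   # deviation measure S
        print(degree, round(float(S), 3), coeffs.round(4))
    # higher degrees always lower S on the same data, so the simplest function
    # whose S is already small is usually preferred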

For a linear relationship, we estimate the closeness of the connection between the factor X and the resultant characteristic Y in the form of the correlation coefficient r:

r = Σ (x − x̄)(y − ȳ) / (n σ_x σ_y),

where:

ȳ is the average value of the indicator;

x̄ is the average value of the factor;

y is the experimental value of the indicator;

x is the experimental value of the factor;

σ_x is the standard deviation in x;

σ_y is the standard deviation in y.

If the correlation coefficient r = 0, the connection between the characteristics is considered insignificant or absent; if r = 1, there is a very close, functional connection between the characteristics.

Using the Chaddock table, you can make a qualitative assessment of the closeness of the correlation between the characteristics:

Chaddock table (Table 2): the closeness of the connection is assessed as weak for r = 0.1–0.3, moderate for 0.3–0.5, noticeable for 0.5–0.7, high for 0.7–0.9, and very high for 0.9–0.99.

For a nonlinear dependence, the correlation ratio η (0 ≤ η ≤ 1) and the correlation index R are determined, calculated from the dependence

R = √[1 − Σ (y − ŷ)² / Σ (y − ȳ)²],

where ŷ is the value of the indicator calculated from the regression dependence.

To assess the accuracy of the calculations, we use the average relative error of approximation

ε̄ = (1/n) Σ |(y_i − ŷ_i) / y_i| · 100%.

At high accuracy it lies in the range of 0–12%.

To evaluate the choice of the functional dependence, we use the coefficient of determination

R² = 1 − Σ (y − ŷ)² / Σ (y − ȳ)².
The coefficient of determination is used as a "generalized" measure of the goodness of fit of a functional model, since it expresses the ratio of the factor variance to the total variance, or more precisely, the share of the factor variance in the total variance.

To assess the significance of the correlation index R, Fisher's F-test is used. The actual value of the criterion is determined by the formula

F = R² (n − m) / [(1 − R²)(m − 1)],

where m is the number of parameters of the regression equation and n is the number of observations. The value is compared with the critical value determined from the F-test table for the accepted significance level α and the numbers of degrees of freedom k_1 = m − 1 and k_2 = n − m. If F > F_cr, the value of the correlation index R is considered significant.

For the selected form of regression, the coefficients of the regression equation are calculated. For convenience, the calculation results are entered into a table of the following structure (in general, the number of columns and their contents vary with the type of regression):

Table 3

The solution of the problem.

Observations were made of an economic phenomenon: the dependence of product output on the percentage of equipment failure. A set of values was obtained.

The selected values ​​are described in Table 1.

We build a graph of the empirical dependence based on the given sample (Fig. 1)

Based on the appearance of the graph, we determine that the analytical dependence can be represented as a linear function:

Let's calculate the pair correlation coefficient to assess the relationship between X and Y:

Let's build an auxiliary table:

Table 4

We solve the system of equations to find the coefficients a_0 and a_1:

substituting the value of a_0 from the first equation

into the second equation, we get:

We find:

We obtain the form of the regression equation:

9. To assess the closeness of the found connection, we use the correlation coefficient r:

Using the Chaddock table, we establish that for r = 0.90 the relationship between X and Y is very high, so the reliability of the regression equation is also high. To assess the accuracy of the calculations, we use the average relative error of approximation:

We consider that this value provides a high degree of reliability of the regression equation.

For a linear relationship between X and Y, the index of determination equals the square of the correlation coefficient r: R² = 0.81. Consequently, 81% of the total variation is explained by the variation of the factor characteristic X.

To assess the significance of the correlation index R, which in the case of a linear relationship equals the correlation coefficient r in absolute value, Fisher's F-test is used. We determine the actual value using the formula

F = R² (n − m) / [(1 − R²)(m − 1)],

where m is the number of parameters of the regression equation and n is the number of observations; here n = 5, m = 2.

Taking into account the accepted significance level α = 0.05 and the numbers of degrees of freedom, we obtain the critical table value F_cr. Since F > F_cr, the value of the correlation index R is considered significant.

Let's calculate the predicted value of Y at X = 30:

Let's plot the found function:

11. Determine the error of the correlation coefficient from the value of the standard deviation

and then determine the value of the normalized deviation.

Since the ratio r/σ_r > 2, with a probability of 95% we can speak of the significance of the resulting correlation coefficient.

Problem 2. Linear optimization

Option 1.

The regional development plan calls for bringing into operation 3 oil fields with a total production volume of 9 million tons. At the first field the production volume is at least 1 million tons, at the second 3 million tons, at the third 5 million tons. To achieve such productivity, at least 125 wells must be drilled. To implement this plan, 25 million rubles of capital investment (indicator K) and 80 km of pipes (indicator L) have been allocated.

It is necessary to determine the optimal (maximum) number of wells to ensure the planned productivity of each field. The initial data for the task are given in the table.

Initial data

The problem statement is given above.

Let us formalize the conditions and restrictions specified in the problem. The goal of this optimization problem is to find the maximum value of oil production with an optimal number of wells for each field, taking into account the existing constraints of the problem.

The objective function, in accordance with the requirements of the problem, will take the form:

where x_i is the number of wells at each field.

The problem's constraints concern:

pipe laying length:

number of wells in each field:

cost of building 1 well:

Linear optimization problems are solved, for example, by the following methods:

Graphically

Simplex method

The graphical method is convenient only for solving linear optimization problems with two variables. With a larger number of variables an algebraic apparatus is necessary. Let us consider a general method for solving linear optimization problems, the simplex method.

The simplex method is a typical example of the iterative calculations used in solving most optimization problems. Iterative procedures of this kind provide solutions to problems using operations research models.

To solve an optimization problem by the simplex method, the number of unknowns x_i must be greater than the number of equations; i.e., the system of equations (1)

must satisfy the relation m < n, and the rank of its matrix A must equal m.

Let us denote the j-th column of the matrix A by A_j and the column of free terms by A_0.

A basic solution of system (1) is a set of m unknowns that is a solution of system (1).

Briefly, the algorithm of the simplex method is described as follows:

An original constraint written as an inequality of type ≤ (≥) can be expressed as an equality by adding a residual (slack) variable to the left side of the constraint (subtracting a surplus variable from the left side).

For example, to the left side of the original constraint

a residual variable is introduced, as a result of which the original inequality turns into equality

If the original constraint describes the consumption of pipes, then this variable should be interpreted as the remainder, i.e. the unused portion of that resource.

Maximizing an objective function is equivalent to minimizing the same function taken with the opposite sign. That is, in our case maximizing Z is equivalent to minimizing −Z.

A simplex table is compiled for a basic solution, of the following form:

After the problem is solved, the corresponding cells of this table will contain the basic solution. The auxiliary column holds the quotients from dividing the free-term column by the resolving column; the additional multipliers serve to zero out the values in the table cells related to the resolving column. The bottom row contains the minimum value of the objective function −Z and the values of the coefficients of the unknowns in the objective function.

Among the coefficients of the bottom row, any positive value is sought. If there is none, the problem is considered solved. Otherwise, a column of the table containing a positive coefficient is selected; it is called the "resolving" column. If there are no positive numbers among the elements of the resolving column, the problem is unsolvable because the objective function is unbounded on the set of its solutions. If there are positive numbers in the resolving column, go to step 5.

The auxiliary column is filled with fractions whose numerators are the elements of the free-term column and whose denominators are the corresponding elements of the resolving column. The smallest of all these values is selected. The row producing the smallest value is called the "resolving" row. At the intersection of the resolving row and the resolving column, the resolving element is found, which is highlighted in some way, for example by color.

Based on the first simplex table, the next one is compiled, in which:

the basic variable of the resolving row is replaced by the variable of the resolving column;

the resolving row is replaced by the same row divided by the resolving element;

each of the remaining rows of the table is replaced by the sum of that row and the resolving row multiplied by a specially selected additional factor, so as to obtain 0 in the cell of the resolving column.

We return to step 4 with the new table.
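For comparison, a linear program of this kind can be handed to scipy's linprog; since the problem's data table is not reproduced above, all per-well coefficients and bounds below are assumptions made for the sketch:

    import numpy as np
    from scipy.optimize import linprog

    # assumed per-well data (the original table is not reproduced here):
    # production (thousand t), pipe length (km) and cost (mln rub.) per well
    prod = np.array([80.0, 60.0, 50.0])
    pipe = np.array([0.7, 0.6, 0.5])
    cost = np.array([0.25, 0.20, 0.15])

    c = -prod                              # maximize => minimize the negative
    A_ub = np.vstack([pipe, cost])         # resource constraints
    b_ub = np.array([80.0, 25.0])          # 80 km of pipe, 25 mln rub.
    bounds = [(10, None), (20, None), (30, None)]  # assumed per-field minimums

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print(res.x, -res.fun)                 # optimal well counts and production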

The solution of the problem.

Based on the formulation of the problem, we have the following system of inequalities:

and objective function

Let's transform the system of inequalities into a system of equations by introducing additional variables:

Let us reduce the objective function to its equivalent:

Let's build the initial simplex table:

Let's select the resolving column and calculate the auxiliary column of quotients:

We enter the values into the table. Using the smallest of them, equal to 10, we determine the resolving row. At the intersection of the resolving row and the resolving column we find the resolving element, equal to 1. We fill part of the table with additional factors such that the resolving row, multiplied by them and added to the remaining rows of the table, forms 0s in the elements of the resolving column.

Let's create the second simplex table:

In it, we select the resolving column, calculate the values, and enter them into the table. At the minimum we obtain the resolving row. The resolving element is 1. We find the additional factors and fill in the columns.

We create the following simplex table:

In a similar way, we find the resolving column, the resolving row, and the resolving element, equal to 2. We build the following simplex table:

Since there are no positive values in the −Z row, this table is final. The first column gives the desired values of the unknowns, i.e. the optimal basic solution:

In this case, the value of the objective function is −Z = −8000, which is equivalent to Z_max = 8000. The problem is solved.

Task 3. Cluster analysis

Formulation of the problem:

Partition the objects based on the data given in the table. Select the solution method yourself and plot the data.

Option 1.

Initial data

Review of methods for solving this type of problem. Justification of the solution method.

Cluster analysis problems are solved using the following methods:

The joining, or tree clustering, method forms clusters using a measure of "dissimilarity" or "distance between objects". These distances can be defined in a one-dimensional or multidimensional space.

Two-way joining is used (relatively rarely) in circumstances where the data are interpreted not in terms of "objects" and "object properties" but in terms of observations and variables. Both the observations and the variables are expected to contribute simultaneously to the discovery of meaningful clusters.

K-means method. Used when there is already a hypothesis about the number of clusters. You can tell the system to form exactly three clusters, for example, so that they are as distinct as possible. In general, the K-means method constructs exactly K distinct clusters located at the greatest possible distances from each other.

There are the following methods for measuring distances:

Euclidean distance. This is the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as:

d(x, y) = √[Σ_i (x_i − y_i)²].

Note that the Euclidean distance (and its square) is calculated from the original data, not standardized data.

City-block distance (Manhattan distance). This distance is simply the sum of the differences across the coordinates. In most cases this distance measure produces the same results as the ordinary Euclidean distance. Note, however, that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is calculated using the formula:

d(x, y) = Σ_i |x_i − y_i|.

Chebyshev distance. This distance can be useful when one wishes to define two objects as "different" if they differ in any one coordinate (in any one dimension). The Chebyshev distance is calculated using the formula:

d(x, y) = max_i |x_i − y_i|.

Power distance. Sometimes one wishes to progressively increase or decrease the weight assigned to a dimension on which the corresponding objects are very different. This can be achieved using the power distance, calculated by the formula:

d(x, y) = (Σ_i |x_i − y_i|^p)^{1/r},

where r and p are user-defined parameters. A few example calculations show how this measure "works". The parameter p controls the progressive weighting of differences along individual coordinates, the parameter r the progressive weighting of large distances between objects. If both parameters r and p equal two, this distance coincides with the Euclidean distance.

Percent disagreement. This measure is used when the data are categorical. The distance is calculated by the formula:

d(x, y) = (number of coordinates with x_i ≠ y_i) / (number of coordinates).

To solve the problem, we choose the joining (tree clustering) method as the one that best meets the conditions and formulation of the problem (partitioning objects). In turn, the joining method can use several variants of linkage rules:

Single linkage (nearest neighbor method). In this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters. That is, any two objects in two clusters are closer to each other than the corresponding linkage distance. This rule tends, in a sense, to string objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

Complete linkage (most distant neighbors method). In this method, the distances between clusters are determined by the largest distance between any two objects in different clusters (i.e., by the "most distant neighbors").

There are also many other cluster joining methods like these (for example, unweighted pairwise joining, weighted pairwise joining, etc.).
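A sketch of tree clustering with scipy, using the Euclidean distance and the "most distant neighbors" (complete linkage) rule chosen below in the text; the six two-feature objects are hypothetical:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # hypothetical objects with two features (x1: output, x2: cost of assets)
    X = np.array([[2.0, 9.0], [3.0, 10.0], [10.0, 3.0],
                  [11.0, 4.0], [10.0, 5.0], [10.5, 3.5]])

    d = pdist(X, metric="euclidean")     # condensed pairwise distance matrix
    Z = linkage(d, method="complete")    # "most distant neighbors" rule
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree: 2 clusters
    print(labels)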

Solution method technology. Calculation of indicators.

At the first step, when each object is a separate cluster, the distances between these objects are determined by the selected measure.

Since the problem does not specify the units of measurement of the features, they are assumed to coincide. Consequently, there is no need to normalize the source data, so we proceed immediately to calculating the distance matrix.

The solution of the problem.

Let's build a dependence graph based on the initial data (Figure 2)

We take the ordinary Euclidean distance as the distance between objects. Then, according to the formula

d(i, j) = √[Σ_l (x_il − x_jl)²],

where l indexes the features and k is the number of features, the distance between objects 1 and 2 equals:

We continue to calculate the remaining distances:

Let's build a table from the obtained values:

The shortest distance found means we combine elements 3, 6 and 5 into one cluster. We get the following table:

Next, by the shortest distance, elements 3, 6, 5 and 4 are combined into one cluster. We get a table of two clusters:

The minimum distance is between elements 3 and 6, so elements 3 and 6 are combined into one cluster. We select the maximum distance between the newly formed cluster and the remaining elements. For example, the distance between cluster 1 and cluster 3,6 is max(13.34166, 13.60147) = 13.60147. Let's create the following table:

In it, the minimum distance is the distance between clusters 1 and 2. Combining 1 and 2 into one cluster, we get:

Thus, using the "most distant neighbor" method, we obtained two clusters, 1,2 and 3,4,5,6, the distance between which is 13.60147.

The problem is solved.

Applications. Solving problems using application packages (MS Excel 7.0)

The task of correlation and regression analysis.

We enter the initial data into the table (Fig. 1)

Select the menu item "Tools / Data Analysis". In the window that appears, select the line "Regression" (Fig. 2).

In the next window we set the input intervals for X and Y, leave the confidence level at 95%, and place the output data on a separate sheet, "Report Sheet" (Fig. 3).

After the calculation, we receive the final regression analysis data on the “Report Sheet” sheet:

A scatter plot of the approximating function, or “Fit Graph”, is also displayed here:


The calculated values ​​and deviations are displayed in the table in the “Predicted Y” and “Residuals” columns, respectively.

Based on the initial data and deviations, a residual graph is constructed:

Optimization problem


We enter the initial data as follows:

We enter the required unknowns X1, X2, X3 in cells C9, D9, E9, respectively.

The coefficients of the objective function for X1, X2, X3 are entered into C7, D7, E7, respectively.

We enter the objective function in cell B11 as the formula: =C7*C9+D7*D9+E7*E9.

Existing task limitations

For pipe laying length:

enter in cells C5, D5, E5, F5, G5

Number of wells at each field:

X3 ≤ 100; enter in cells C8, D8, E8.

Cost of construction of 1 well:

enter in cells C6, D6, E6, F6, G6.

The formula for calculating the total length C5*C9+D5*D9+E5*E9 is placed in cell B5, the formula for calculating the total cost C6*C9+D6*D9+E6*E9 is placed in cell B6.


Select "Tools / Solver" in the menu and enter the parameters for the solution search in accordance with the entered initial data (Fig. 4):

Using the “Parameters” button, set the following parameters for searching for a solution (Fig. 5):


After searching for a solution, we receive a report on the results:

Microsoft Excel 8.0e Results Report

Report created: 11/17/2002 1:28:30 AM

Target Cell (Maximum)

Result

Total production

Changeable cells

Result

Number of wells

Number of wells

Number of wells

Restrictions

Value

Length

Binding

Project cost

Not binding

Number of wells

Not binding

Number of wells

Binding

Number of wells

Binding

The first table shows the initial and final (optimal) value of the target cell in which the objective function of the problem being solved was placed. In the second table we see the initial and final values of the optimized variables, which are contained in the changeable cells. The third table in the results report contains information about the constraints. The "Value" column contains the optimal values of the required resources and optimized variables. The "Formula" column contains the constraints on consumed resources and optimized variables, written as references to the cells containing these data. The "Status" column indicates whether a given constraint is binding or not binding; "binding" constraints are those realized in the optimal solution as strict equalities. The "Difference" column for resource constraints gives the balance of the resources used, i.e. the difference between the required amount of a resource and its availability.

Similarly, recording the result of the solution search in the form of a "Sensitivity Report", we obtain the following tables:

Microsoft Excel 8.0e Sensitivity Report

Worksheet: [Solving the optimization problem.xls]Solving the production optimization problem

Report created: 11/17/2002 1:35:16 AM

Changeable cells

(columns: final value, reduced cost, objective coefficient, allowable increase, allowable decrease)

Number of wells

Number of wells

Number of wells

Restrictions

(columns: final value, shadow price, constraint right-hand side, allowable increase, allowable decrease)

Length

Project cost

The sensitivity report contains information about the changed (optimized) variables and the model constraints. This information relates to the simplex method used to optimize linear problems, described above in the solution part. It allows one to evaluate how sensitive the obtained optimal solution is to possible changes in the model parameters.

The first part of the report contains information about the changeable cells holding the values for the number of wells in the fields. The "Final value" column gives the optimal values of the optimized variables. The "Objective coefficient" column contains the initial values of the coefficients of the objective function. The next two columns illustrate by how much these coefficients can be increased or decreased without changing the optimal solution found.

The second part of the sensitivity report contains information on the restrictions imposed on the optimized variables. The first column indicates the resource requirements of the optimal solution. The second contains the shadow prices of the types of resources used. The last two columns contain data on the possible increase or decrease in the volume of available resources.

Clustering problem.

A step-by-step method for solving the problem is given above. Here are Excel tables illustrating the progress of solving the problem:

"nearest neighbor method"

Solving the problem of cluster analysis - "NEAREST NEIGHBOR METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets.

"far neighbor method"

Solving the problem of cluster analysis - "FAR NEIGHBOR METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets.

Brief theory

Kendall's correlation coefficient is used when variables are represented on two ordinal scales, provided there are no tied ranks. Calculating the Kendall coefficient involves counting the numbers of matches and inversions.

This coefficient varies within −1 ≤ τ ≤ +1 and is calculated using the formula:

τ = (P − Q) / [n(n − 1)/2].

For the calculation, all units are ranked by one characteristic; along the row of the other characteristic, for each rank, the number of subsequent ranks exceeding the given one is counted (denote their sum by P), as is the number of subsequent ranks below the given one (denote their sum by Q).

It can be shown that

P + Q = n(n − 1)/2,

and Kendall's rank correlation coefficient can be written as

τ = 4P / [n(n − 1)] − 1.

In order to test, at significance level α, the null hypothesis that the population Kendall rank correlation coefficient equals zero against the competing hypothesis, it is necessary to calculate the critical point:

T_cr = z_cr √[2(2n + 5) / (9n(n − 1))],

where n is the sample size and z_cr is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ(z_cr) = (1 − α)/2.

If |τ| < T_cr, there is no reason to reject the null hypothesis. The rank correlation between the characteristics is insignificant.

If |τ| > T_cr, the null hypothesis is rejected. There is a significant rank correlation between the characteristics.

Example of problem solution

The task

During the recruitment process, seven candidates for vacant positions were given two tests. The test results (in points) are shown in the table:

Candidate:  1   2   3   4   5   6   7
Test 1:    31  82  25  26  53  30  29
Test 2:    21  55   8  27  32  42  26

Calculate the Kendall rank correlation coefficient between the results of the two tests and evaluate its significance at the given level.

The solution of the problem

Let's calculate the Kendall coefficient. The ranks of the factor characteristic are arranged strictly in ascending order, and the corresponding ranks of the resultant characteristic are recorded in parallel. For each rank, among the ranks following it, the number of ranks larger in value is counted (entered in the column P_i), as is the number of ranks smaller in value (entered in the column Q_i):

x rank:  1  2  3  4  5  6  7
y rank:  1  4  3  6  2  5  7
P_i:     6  3  3  1  2  1  0
Q_i:     0  2  1  2  0  0  0

Sum: P = 16, Q = 5.

Hence τ = (P − Q) / [n(n − 1)/2] = (16 − 5) / 21 ≈ 0.52.
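The result can be checked with scipy, whose kendalltau for data without ties computes the same τ = (P − Q) / [n(n − 1)/2]:

    from scipy.stats import kendalltau

    test1 = [31, 82, 25, 26, 53, 30, 29]
    test2 = [21, 55, 8, 27, 32, 42, 26]
    tau, p_value = kendalltau(test1, test2)
    print(tau, p_value)    # tau = 11/21 ≈ 0.524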
