Spearman and Kendall rank correlation coefficients

The rank correlation coefficient characterizes the general character of a monotonic relationship: whether the resultant attribute increases or decreases as the factor attribute increases. It is an indicator of the closeness of a monotonic (not necessarily linear) connection.

Purpose of the service. Using this online calculator you can compute the Kendall rank correlation coefficient by all the basic formulas, as well as assess its significance.

Instructions. Specify the amount of data (number of rows). The resulting solution is saved in a Word file.

The coefficient proposed by Kendall is based on relations of the "greater-less" type, whose validity was established when the scales were constructed.
Let us select a pair of objects and compare their ranks on one characteristic and on the other. If, for a given characteristic, the ranks form a direct order (i.e., the order of the natural series), the pair is assigned +1; if the reverse order, then −1. For the selected pair, the assigned values (for attribute X and for attribute Y) are multiplied. The result is +1 if the ranks of the pair are arranged in the same sequence on both features, and −1 if in opposite sequences.
If the rank orders for both characteristics coincide for all pairs, then the sum of the units assigned to all pairs of objects is maximal and equal to the number of pairs, C(N,2) = N(N−1)/2. If the rank orders of all pairs are reversed, the sum equals −C(N,2). In the general case, C(N,2) = P + Q, where P is the number of positive and Q the number of negative units assigned to pairs when comparing their ranks on both criteria.
The value τ = (P − Q) / C(N,2) = (P − Q) / (N(N−1)/2) is called the Kendall coefficient.
It is clear from the formula that the coefficient τ is the difference between the proportion of pairs of objects whose order coincides on both attributes (relative to the total number of pairs) and the proportion of pairs whose order does not coincide.
For example, a coefficient value of 0.60 means that 80% of pairs have the same ordering of objects and 20% do not (80% + 20% = 100%; 0.80 − 0.20 = 0.60). That is, τ can be interpreted as the difference between the probabilities of coinciding and non-coinciding orders on both characteristics for a randomly selected pair of objects.
In the general case, computing τ (more precisely, P or Q) is cumbersome even for N on the order of 10.
We will show below how to simplify the calculations.
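As a preliminary illustration, here is a minimal sketch (not part of the original text) that implements the pair-counting definition directly in Python; it assumes there are no tied ranks.

```python
def kendall_tau(x, y):
    """Kendall's tau via direct counting of P and Q (no tied ranks)."""
    n = len(x)
    # Sort the pairs by X so that the X-ranks form the natural series;
    # then every pair (i, j), i < j, is in direct order by X.
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    P = Q = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if y_by_x[j] > y_by_x[i]:
                P += 1   # ranks of the pair agree on both features: +1
            else:
                Q += 1   # ranks of the pair disagree: -1
    return (P - Q) / (n * (n - 1) / 2)
```

On the data of the first example below, this function returns 29/45 ≈ 0.644.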


Example. The relationship between the volume of industrial output and investment in fixed capital in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data:


Calculate the Spearman and Kendall rank correlation coefficients. Check their significance at α = 0.05. Draw a conclusion about the relationship between the volume of industrial output and investment in fixed capital for the regions of the Russian Federation under consideration.

Solution. Let us assign ranks to the feature Y and the factor X.


Let's sort the data by X.
In the Y-rank row, to the right of 3 there are 7 ranks greater than 3; therefore, 3 generates the term 7 in P.
To the right of 1 there are 8 ranks greater than 1 (namely 2, 4, 6, 9, 5, 10, 7, 8), so 1 contributes 8 to P, and so on. As a result, P = 37, and using the formulas we have:

X       Y       rank X, d_x   rank Y, d_y   P    Q
18.4    5.57    1             3             7    2
20.6    2.88    2             1             8    0
21.5    4.12    3             2             7    0
35.7    7.24    4             4             6    0
37.1    9.67    5             6             4    1
39.8    10.48   6             9             1    3
51.1    8.58    7             5             3    0
54.4    14.79   8             10            0    2
64.6    10.22   9             7             1    0
90.6    10.45   10            8             0    0
Sum                                         37   8


Using the simplified formulas:

τ = 4P/(n(n−1)) − 1 = 4·37/(10·9) − 1 ≈ 0.644, or τ = 1 − 4Q/(n(n−1)) = 1 − 4·8/(10·9) ≈ 0.644.

To test, at significance level α, the null hypothesis that the population Kendall rank correlation coefficient equals zero against the competing hypothesis H1: τ ≠ 0, the critical point is calculated:

T_kp = z_kp · √( 2(2n + 5) / (9n(n − 1)) ),

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ(z_kp) = (1 − α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative characteristics is not significant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative characteristics.
Let us find the critical point z_kp:
Φ(z_kp) = (1 − α)/2 = (1 − 0.05)/2 = 0.475
Using the Laplace table we find z_kp = 1.96.

Let us find the critical point:
T_kp = 1.96 · √( 2·(2·10 + 5) / (9·10·9) ) ≈ 0.49.

Since τ = 0.644 > T_kp = 0.49, we reject the null hypothesis: the rank correlation between the volume of industrial output and investment in fixed capital is significant.
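A short numerical check of this example (a sketch; scipy is used only to reproduce the Laplace-table value 1.96):

```python
from math import sqrt

from scipy.stats import norm

# Data of the example: n = 10 regions, P = 37, Q = 8.
n, P, Q = 10, 37, 8
tau = (P - Q) / (n * (n - 1) / 2)                        # 29/45 ≈ 0.644
z_kp = norm.ppf(1 - 0.05 / 2)                            # ≈ 1.96
T_kp = z_kp * sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))  # ≈ 0.49
print(tau > T_kp)  # True: reject H0, the rank correlation is significant
```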

Example. Using data on the volume of construction and installation work performed in-house and the number of employees in 10 construction companies of one of the cities of the Russian Federation, determine the relationship between these characteristics using the Kendall coefficient.

The solution is found using the calculator.
Let us assign ranks to the feature Y and the factor X.
Let us arrange the objects so that their X-ranks form the natural series. Since the scores assigned to each pair of this series are positive, the "+1" values included in P will be generated only by those pairs whose Y-ranks form a direct order.
They are easy to count by sequentially comparing the rank of each object in the Y row with the remaining ones.
Kendall coefficient:

τ = (P − Q) / (n(n − 1)/2).

In the general case, computing τ (more precisely, P or Q) is cumbersome even for N on the order of 10. The calculation can be simplified by using the equivalent formulas

τ = 4P/(n(n−1)) − 1, or τ = 1 − 4Q/(n(n−1)).
Solution.
Let's sort the data by X.
In the Y-rank row, to the right of 2 there are 8 ranks greater than 2; therefore, 2 generates the term 8 in P.
To the right of 4 there are 6 ranks greater than 4 (namely 7, 5, 6, 8, 9, 10), so 4 contributes 6 to P, and so on. As a result, P = 29, and using the formulas we have:

X      Y      rank X, d_x   rank Y, d_y   P    Q
38     292    1             2             8    1
50     302    2             4             6    2
52     366    3             7             3    4
54     312    4             5             4    2
59     359    5             6             3    2
61     398    6             8             2    2
66     401    7             9             1    2
70     298    8             3             1    1
71     283    9             1             1    0
73     413    10            10            0    0
Sum                                       29   16


Using the simplified formulas:

τ = 4P/(n(n−1)) − 1 = 4·29/(10·9) − 1 ≈ 0.289, or τ = 1 − 4Q/(n(n−1)) = 1 − 4·16/(10·9) ≈ 0.289.

In order to test, at significance level α, the null hypothesis that the population Kendall rank correlation coefficient equals zero against the competing hypothesis H1: τ ≠ 0, it is necessary to calculate the critical point:

T_kp = z_kp · √( 2(2n + 5) / (9n(n − 1)) ),

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ(z_kp) = (1 − α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative characteristics is not significant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative characteristics.
Let us find the critical point z_kp:
Φ(z_kp) = (1 − α)/2 = (1 − 0.05)/2 = 0.475
Using the Laplace table we find z_kp = 1.96.
Let us find the critical point:
T_kp = 1.96 · √( 2·(2·10 + 5) / (9·10·9) ) ≈ 0.49.

Since τ = 0.289 < T_kp = 0.49, there is no reason to reject the null hypothesis: the rank correlation between the volume of construction and installation work and the number of employees is not significant.

Kendall's correlation coefficient is used when variables are represented on two ordinal scales, provided that there are no tied ranks. The calculation of the Kendall coefficient involves counting the numbers of matches and inversions. Let us consider this procedure using the example of the previous problem.

The algorithm for solving the problem is as follows:

    1. We rearrange the data of Table 8.5 so that one of its rows (in this case the row x_i) is ranked. In other words, we rearrange the pairs x and y in the proper order and enter the data in columns 1 and 2 of Table 8.6.

Table 8.6

x_i   y_i   matches   inversions

2. Determine the "degree of ranking" of the second row (y_i). This procedure is carried out in the following sequence:

a) take the first value of the unranked series, "3". Count the number of ranks located below this number that are greater than the compared value. There are 9 such values (ranks 6, 7, 4, 9, 5, 11, 8, 12 and 10); enter the number 9 in the "matches" column. Then count the number of values that are less than three. There are 2 such values (ranks 1 and 2); enter the number 2 in the "inversions" column.

b) discard the number 3 (we have already worked with it) and repeat the procedure for the next value, "6": the number of matches is 6 (ranks 7, 9, 11, 8, 12 and 10), the number of inversions is 4 (ranks 1, 2, 4 and 5). Enter the number 6 in the "matches" column and the number 4 in the "inversions" column.

c) the procedure is repeated in the same way until the end of the row; remember that each "worked out" value is excluded from further consideration (only the ranks lying below this number are counted).

Note

To avoid mistakes in the calculations, keep in mind that with each "step" the sum of matches and inversions decreases by one; this is understandable, since each time one value is excluded from consideration.

3. The sum of matches (P) and the sum of inversions (Q) are calculated; the data are entered into one of the three interchangeable formulas for the Kendall coefficient (8.10), and the corresponding calculations are carried out:

τ = (P − Q) / (n(n − 1)/2)    (8.10)

In our case: τ = (P − Q) / (n(n − 1)/2) = 36/66 ≈ 0.55.

Table XIV of the Appendix contains the critical values of the coefficient for this sample size: τ_cr = 0.45; 0.59. The empirically obtained value is compared with the tabulated one.

Conclusion

τ = 0.55 > τ_cr = 0.45. The correlation is statistically significant at the first level.

Note:

If necessary (for example, if there is no table of critical values), the statistical significance of Kendall's τ can be determined by the following formula:

z = S* / √( n(n − 1)(2n + 5) / 18 )    (8.11)

where S* = P − Q + 1 if P < Q, and S* = P − Q − 1 if P > Q.

The values of z for the corresponding significance levels are the same as for the Pearson measure and are found in the corresponding tables (not included in the Appendix). For the standard significance levels, z_cr = 1.96 (for β₁ = 0.95) and 2.58 (for β₂ = 0.99). Kendall's correlation coefficient is statistically significant if z > z_cr.

In our case S* = P − Q − 1 = 35 and z = 2.40, i.e. the initial conclusion is confirmed: the correlation between the characteristics is statistically significant at the first significance level.
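A sketch of formula (8.11); the values P = 51 and Q = 15 below are inferred from n = 12 and S* = 35 and are an assumption, not given explicitly in the text:

```python
from math import sqrt

n, P, Q = 12, 51, 15                          # inferred: P - Q = 36, n(n-1)/2 = 66
S_star = P - Q - 1 if P > Q else P - Q + 1    # continuity correction; here 35
z = S_star / sqrt(n * (n - 1) * (2 * n + 5) / 18)
print(round(z, 2))                            # 2.40 > z_cr = 1.96: significant
```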

To calculate the Kendall rank correlation coefficient r_k, it is necessary to rank the data by one of the characteristics in ascending order and determine the corresponding ranks of the second characteristic. Then, for each rank of the second characteristic, the number of subsequent ranks greater in value than the given rank is determined, and the sum of these numbers is found.

Kendall's rank correlation coefficient is given by

r_k = 4·Σ R_i / (n(n − 1)) − 1,

where R_i is the number of ranks of the second variable, starting from the (i + 1)-th, whose value is greater than the value of the i-th rank of this variable.

There are tables of percentage points of the distribution of r_k that allow one to test the hypothesis that the correlation coefficient is significant.

For large sample sizes, critical values of r_k are not tabulated, and they have to be calculated using approximate formulas based on the fact that, under the null hypothesis H0: r_k = 0 and for large n, the random variable

z = r_k · √( 9n(n − 1) / (2(2n + 5)) )

is distributed approximately according to the standard normal law.

40. Dependence between traits measured on a nominal or ordinal scale

Often the task arises of checking the independence of two characteristics measured on a nominal or ordinal scale.

Let some objects have two measured characteristics X and Y with r and s levels respectively. It is convenient to present the results of such observations in the form of a table called a contingency table of characteristics.

In the table, u_i (i = 1, ..., r) and v_j (j = 1, ..., s) are the values taken by the characteristics, and n_ij is the number of objects (out of the total number n) for which the characteristic X took the value u_i and the characteristic Y took the value v_j.

Let us introduce the following random variables:

n_{i·} = Σ_{j=1..s} n_ij — the number of objects for which X took the value u_i;

n_{·j} = Σ_{i=1..r} n_ij — the number of objects for which Y took the value v_j.

In addition, the obvious equalities hold:

Σ_{i=1..r} n_{i·} = Σ_{j=1..s} n_{·j} = n.

The discrete random variables X and Y are independent if and only if

P(X = u_i, Y = v_j) = P(X = u_i) · P(Y = v_j)

for all pairs i, j.

Therefore, the hypothesis of independence of the discrete random variables X and Y can be written as: H0: p_ij = p_{i·} · p_{·j} for all i, j.

As the alternative, as a rule, the hypothesis H1: p_ij ≠ p_{i·} · p_{·j} for at least one pair i, j is used.

The validity of the hypothesis H0 should be judged on the basis of the sample frequencies n_ij of the contingency table. According to the law of large numbers, as n → ∞ the relative frequencies are close to the corresponding probabilities:

n_ij / n ≈ p_ij,  n_{i·} / n ≈ p_{i·},  n_{·j} / n ≈ p_{·j}.
To test the hypothesis H0, the statistic

χ² = n · Σ_{i=1..r} Σ_{j=1..s} (n_ij − n_{i·} n_{·j} / n)² / (n_{i·} n_{·j})

is used; if the hypothesis is true, it has a χ² distribution with rs − (r + s − 1) = (r − 1)(s − 1) degrees of freedom.

The χ² independence criterion rejects the hypothesis H0 at significance level α if

χ²_obs > χ²_{α; (r−1)(s−1)}.
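A hedged sketch of this criterion on a made-up 2×2 table (the statistic and degrees of freedom follow the formulas above; the numbers are illustrative only):

```python
import numpy as np
from scipy.stats import chi2

n_ij = np.array([[20, 15],                # hypothetical contingency table
                 [10, 25]])
n = n_ij.sum()
n_i = n_ij.sum(axis=1, keepdims=True)     # row totals n_{i.}
n_j = n_ij.sum(axis=0, keepdims=True)     # column totals n_{.j}
expected = n_i @ n_j / n                  # n_{i.} * n_{.j} / n
chi2_obs = ((n_ij - expected) ** 2 / expected).sum()
r, s = n_ij.shape
df = (r - 1) * (s - 1)                    # = rs - (r + s - 1)
print(chi2_obs > chi2.ppf(0.95, df))      # True -> reject independence (alpha = 0.05)
```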
41. Regression analysis. Basic concepts of regression analysis

For a mathematical description of statistical relationships between the variables under study, the following tasks should be solved:

– select a class of functions in which it is advisable to look for the best (in a certain sense) approximation of the dependence of interest;

– find estimates of the unknown values of the parameters entering the equations of the desired dependence;

– establish the adequacy of the resulting equation to the desired dependence;

– identify the most informative input variables.

Together, the listed tasks constitute the subject of regression analysis.

The regression function (or regression) is the dependence of the mathematical expectation of one random variable on the value taken by another random variable, which together with the first forms a two-dimensional system of random variables.

Let there be a system of random variables (X, Y); then the regression function of Y on X is

f(x) = M[Y | X = x],

and the regression function of X on Y is

φ(y) = M[X | Y = y].

The regression functions f(x) and φ(y) are not mutually invertible unless the relationship between X and Y is functional.

In the case of an n-dimensional vector with coordinates X₁, X₂, ..., X_n, one can consider the conditional mathematical expectation of any component, for example, for X₁:

M[X₁ | x₂, ..., x_n],

called the regression of X₁ on X₂, ..., X_n.

To fully define the regression function, it is necessary to know the conditional distribution of the output variable for fixed values ​​of the input variable.

Since in a real situation such information is not available, one usually limits oneself to searching for a suitable approximating function f_a(x) for f(x), based on statistical data of the form (x_i, y_i), i = 1, ..., n. These data are the result of n independent observations y₁, ..., y_n of the random variable Y at the values of the input variable x₁, ..., x_n; regression analysis assumes that the values of the input variable are specified exactly.

The problem of choosing the best approximating function f_a(x) is the central one in regression analysis, and there are no formalized procedures for solving it. Sometimes the choice is made on the basis of an analysis of the experimental data, more often from theoretical considerations.

If the regression function is assumed to be sufficiently smooth, then the approximating function f_a(x) can be represented as a linear combination of a set of linearly independent basis functions ψ_k(x), k = 0, 1, ..., m−1, i.e., in the form

f_a(x; θ) = Σ_{k=0..m−1} θ_k ψ_k(x),

where m is the number of unknown parameters θ_k (in the general case this number is unknown and is refined during the construction of the model).

Such a function is linear in its parameters, so in the case under consideration we speak of a regression function model that is linear in its parameters.

Then the task of finding the best approximation to the regression line f(x) reduces to finding the parameter values θ at which f_a(x; θ) is most adequate to the available data. One of the methods that allows this problem to be solved is the method of least squares.

42. Least squares method

Let the set of points (x_i, y_i), i = 1, ..., n be located on the plane along some straight line.

Then, as the function f_a(x) approximating the regression function f(x) = M[Y | x], it is natural to take a linear function of the argument x:

f_a(x) = θ₀ + θ₁x.

That is, the basis functions chosen here are ψ₀(x) ≡ 1 and ψ₁(x) ≡ x. This type of regression is called simple linear regression.

If the set of points (x_i, y_i), i = 1, ..., n lies along some curve, then as f_a(x) it is natural to try a suitable family of curves, for example the exponential family

f_a(x) = θ₀ · e^(θ₁x).

This function is nonlinear in the parameters θ₀ and θ₁; however, a functional transformation (in this case, taking logarithms) reduces it to a new function

f'_a(x) = ln f_a(x) = ln θ₀ + θ₁x,

which is linear in the new parameters θ'₀ = ln θ₀ and θ'₁ = θ₁.
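A sketch of this linearization; the exponential family here is an assumption used for illustration, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 5.0, 20)
y = 2.0 * np.exp(0.7 * x) * rng.lognormal(0.0, 0.05, x.size)  # synthetic data

# ln y = ln(theta0) + theta1 * x is linear in the transformed parameters,
# so an ordinary least-squares fit on (x, ln y) recovers both.
theta1, log_theta0 = np.polyfit(x, np.log(y), 1)
print(np.exp(log_theta0), theta1)         # close to 2.0 and 0.7
```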
43. Simple Linear Regression

The simplest regression model is the simple (univariate, single-factor, paired) linear model, which has the form

y_i = a + b·x_i + ε_i,  i = 1, ..., n,

where ε_i are random variables (errors), uncorrelated with one another, with zero mathematical expectations and identical variances σ², and a and b are constant coefficients (parameters) to be estimated from the measured response values y_i.

To find the estimates of the parameters a and b of the linear regression that determine the straight line best fitting the experimental data,

ŷ = a + b·x,

the method of least squares is used.

According to the method of least squares, the estimates of the parameters a and b are found from the condition of minimizing the sum of squared vertical deviations of the values y_i from the "true" regression line:

D = Σ_{i=1..n} (y_i − a − b·x_i)² → min.
Let ten observations of the random variable Y be made at fixed values of the variable X.

To minimize D, we set the partial derivatives with respect to a and b equal to zero:

∂D/∂a = −2 Σ (y_i − a − b·x_i) = 0,

∂D/∂b = −2 Σ x_i (y_i − a − b·x_i) = 0.
As a result, we obtain the following system of equations for finding the estimates a and b:

n·a + b·Σ x_i = Σ y_i,
a·Σ x_i + b·Σ x_i² = Σ x_i y_i.

Solving these two equations gives

b = (n Σ x_i y_i − Σ x_i · Σ y_i) / (n Σ x_i² − (Σ x_i)²),  a = ȳ − b·x̄.

The expressions for the parameter estimates a and b can also be represented as

b = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,  a = ȳ − b·x̄.

Then the empirical equation of the regression line of Y on X can be written as

ŷ = a + b·x.
An unbiased estimate of the variance σ² of the deviations of the values y_i from the fitted regression line is given by

s₀² = Σ (y_i − ŷ_i)² / (n − 2).
Let's calculate the parameters of the regression equation


Thus, the regression line looks like:


And the estimate of the variance of the deviations of the values y_i from the fitted regression line:


44. Checking the significance of the regression line

The found estimate b ≠ 0 may be a realization of a random variable whose mathematical expectation is zero; that is, it may turn out that there is actually no regression dependence.

To deal with this situation, one should test the hypothesis H0: b = 0 against the competing hypothesis H1: b ≠ 0.

Testing the significance of a regression line can be done using analysis of variance.

Consider the following identity:

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i).

The quantity y_i − ŷ_i = e_i is called the residual; it is the difference between two quantities:

– the deviation of the observed value (response) from the overall mean response;

– the deviation of the predicted response value ŷ_i from the same mean.

The identity written above can be put in the form

(y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i).

Squaring both sides and summing over i (the cross term vanishes for the least-squares line), we obtain:

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,
where the quantities are named:

– the complete (total) sum of squares SC_n = Σ (y_i − ȳ)², equal to the sum of squared deviations of the observations from the mean value of the observations;

– the sum of squares due to regression SC_p = Σ (ŷ_i − ȳ)², equal to the sum of squared deviations of the values of the regression line from the mean of the observations;

– the residual sum of squares SC_0 = Σ (y_i − ŷ_i)², equal to the sum of squared deviations of the observations from the values of the regression line.

Thus, the spread of the y-values about their mean can be attributed to some extent to the fact that not all observations lie on the regression line. If they all did, the residual sum of squares would be zero. It follows that the regression is significant if the sum of squares SC_p is large in comparison with the sum of squares SC_0.

Calculations to test the significance of regression are carried out in the following ANOVA table

If the errors ε_i are distributed according to the normal law, then, when the hypothesis H0: b = 0 is true, the statistic

F = SC_p / (SC_0 / (n − 2))

is distributed according to Fisher's law with 1 and n − 2 degrees of freedom.

The null hypothesis is rejected at significance level α if the calculated value of the statistic F is greater than the α-percentage point f_{1; n−2; α} of the Fisher distribution.
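A sketch of this F-test (placeholder data; the estimates a and b are computed as in the earlier snippet):

```python
import numpy as np
from scipy.stats import f as f_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # placeholder data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = x.size
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
y_hat = a + b * x

SC_p = ((y_hat - y.mean()) ** 2).sum()    # sum of squares due to regression
SC_0 = ((y - y_hat) ** 2).sum()           # residual sum of squares
F = SC_p / (SC_0 / (n - 2))               # 1 and n-2 degrees of freedom
print(F > f_dist.ppf(0.95, 1, n - 2))     # True -> H0: b = 0 is rejected
```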

45. Checking the adequacy of the regression model. Residual method

The adequacy of a constructed regression model means that no other model yields a significant improvement in predicting the response.

If all the response values are obtained at different values of x, i.e., there are no repeated response values at the same x_i, then only a limited check of the adequacy of the linear model can be carried out. The basis for such a check is the residuals — the deviations from the established pattern:

d_i = y_i − ŷ_i,  i = 1, ..., n.
Since X is a one-dimensional variable, the points (x_i, d_i) can be depicted on the plane as a so-called residual plot. This representation sometimes makes it possible to detect a pattern in the behavior of the residuals. In addition, the analysis of residuals allows one to examine the assumption about the error distribution law.

When the errors are distributed according to the normal law and an a priori estimate of their variance σ² is available (an estimate obtained from previously performed measurements), a more accurate assessment of the adequacy of the model is possible.

Using Fisher's F-test, one can check whether the residual variance s₀² differs significantly from the a priori estimate. If it is significantly greater, there is inadequacy and the model should be revised.

If there is no a priori estimate of σ², but the measurements of the response Y are repeated two or more times at the same values of X, then these repeated observations can be used to obtain another estimate of σ² (the first being the residual variance). Such an estimate is said to represent "pure" error, because if x is the same for two or more observations, then only random changes can affect the results and create the scatter between them.

The resulting estimate turns out to be a more reliable estimate of the variance than estimates obtained by other methods. For this reason, when planning experiments, it makes sense to perform experiments with repetitions.

Suppose there are m distinct values of X: x₁, x₂, ..., x_m, and for each of these values x_i there are n_i observations of the response Y. The total number of observations is

n = Σ_{i=1..m} n_i.
Then the simple linear regression model can be written as:

y_ij = a + b·x_i + ε_ij,  i = 1, ..., m;  j = 1, ..., n_i.
Let us find the variance of the "pure" errors. This variance is the pooled estimate of the variance σ², obtained by treating the response values y_ij at x = x_i as a sample of size n_i. As a result, the variance of the "pure" errors equals

s_pe² = Σ_i Σ_j (y_ij − ȳ_i)² / (n − m),

where ȳ_i is the mean of the n_i observations at x_i. This variance serves as an estimate of σ² regardless of whether the fitted model is correct.

Let us show that the sum of squares of the "pure" errors is part of the residual sum of squares (the sum of squares appearing in the expression for the residual variance). The residual for the j-th observation at x_i can be written as

y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i).
If we square both sides of this equation and then sum over j and over i, we obtain

Σ_i Σ_j (y_ij − ŷ_i)² = Σ_i Σ_j (y_ij − ȳ_i)² + Σ_i n_i (ȳ_i − ŷ_i)².

On the left of this equality is the residual sum of squares. The first term on the right side is the sum of squares of the "pure" errors; the second term can be called the sum of squares of inadequacy (lack of fit). The last sum has m − 2 degrees of freedom, hence the variance of inadequacy is

s_na² = Σ_i n_i (ȳ_i − ŷ_i)² / (m − 2).

The statistic for testing the hypothesis H0 (the simple linear model is adequate) against the hypothesis H1 (the simple linear model is inadequate) is the random variable

F = s_na² / s_pe².

If the null hypothesis is true, F has a Fisher distribution with m − 2 and n − m degrees of freedom. The hypothesis of linearity of the regression line should be rejected at significance level α if the obtained value of the statistic is greater than the α-percentage point of the Fisher distribution with m − 2 and n − m degrees of freedom.
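A sketch of this lack-of-fit test with repeated observations (hypothetical data with m = 3 distinct x values):

```python
import numpy as np
from scipy.stats import f as f_dist

groups = {1.0: [2.0, 2.2], 2.0: [3.1, 2.9, 3.0], 3.0: [4.2, 3.8]}  # y_ij by x_i
x = np.array([xi for xi, ys in groups.items() for _ in ys])
y = np.array([yij for ys in groups.values() for yij in ys])
n, m = y.size, len(groups)

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

ss_pe = sum(((np.array(ys) - np.mean(ys)) ** 2).sum() for ys in groups.values())
ss_res = ((y - (a + b * x)) ** 2).sum()
ss_lof = ss_res - ss_pe                       # sum of squares of inadequacy

F = (ss_lof / (m - 2)) / (ss_pe / (n - m))    # df: m-2 and n-m
print(F > f_dist.ppf(0.95, m - 2, n - m))     # True would signal inadequacy
```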

46. Checking the adequacy of the regression model (see 45). Analysis of variance

47. Checking the adequacy of the regression model (see 45). Determination coefficient

Sometimes, to characterize the quality of the regression line, the sample coefficient of determination R² is used, showing what part (share) of the total sum of squares SC_n is made up by the sum of squares due to regression, SC_p:

R² = SC_p / SC_n.
The closer R² is to unity, the better the regression approximates the experimental data and the closer the observations are to the regression line. If R² = 0, then the changes in the response are entirely due to the influence of unaccounted factors, and the regression line is parallel to the x-axis. In the case of simple linear regression, the coefficient of determination R² is equal to the square of the correlation coefficient, r².

The maximum value R² = 1 can be attained only when the observations were all carried out at different values of x. If the data contain repeated experiments, the value of R² cannot reach unity, no matter how good the model is.

48. Confidence intervals for simple linear regression parameters

Just as the sample mean is an estimate of the true mean (the population mean), the sample parameters a and b of the regression equation are nothing more than estimates of the true regression coefficients. Different samples yield different estimates of the mean, just as different samples yield different estimates of the regression coefficients.

Assuming that the distribution law of the errors ε_i is normal, the estimate of the parameter b will have a normal distribution with the parameters

M[b̂] = b,  D[b̂] = σ² / Σ (x_i − x̄)².
Since the estimate of the parameter a is a linear combination of independent normally distributed quantities, it will also have a normal distribution with mathematical expectation and variance

M[â] = a,  D[â] = σ² · Σ x_i² / (n · Σ (x_i − x̄)²).
In this case, taking into account that the ratio (n − 2)·s₀²/σ² is distributed according to the χ² law with n − 2 degrees of freedom, the (1 − α) confidence interval for the variance σ² is determined by the expression

(n − 2)·s₀² / χ²_{n−2; α/2} < σ² < (n − 2)·s₀² / χ²_{n−2; 1−α/2}.
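A sketch of this interval (n and s₀² are placeholder values):

```python
from scipy.stats import chi2

n, s0_sq, alpha = 10, 0.25, 0.05              # placeholder values
lower = (n - 2) * s0_sq / chi2.ppf(1 - alpha / 2, n - 2)
upper = (n - 2) * s0_sq / chi2.ppf(alpha / 2, n - 2)
print(lower, upper)                           # (1 - alpha) interval for sigma^2
```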
49. Confidence intervals for the regression line. Confidence interval for dependent variable values

Usually we do not know the true values of the regression coefficients a and b; we only know their estimates. In other words, the true regression line may run higher or lower, be steeper or flatter, than the one built from the sample data. We have calculated confidence intervals for the regression coefficients. One can also calculate a confidence region for the regression line itself.

Suppose that for simple linear regression we need to construct a (1 − α) confidence interval for the mathematical expectation of the response Y at the value x = x₀. This mathematical expectation equals a + b·x₀, and its estimate is

ŷ₀ = â + b̂·x₀.

Since â = ȳ − b̂·x̄, the estimate can also be written as ŷ₀ = ȳ + b̂·(x₀ − x̄).

The resulting estimate of the mathematical expectation is a linear combination of uncorrelated normally distributed quantities and therefore also has a normal distribution, centered at the true value of the conditional mathematical expectation, with variance

σ² · (1/n + (x₀ − x̄)² / Σ (x_i − x̄)²).

Therefore, the confidence interval for the regression line at each value x₀ can be represented as

ŷ₀ ± t_{n−2; α/2} · s₀ · √( 1/n + (x₀ − x̄)² / Σ (x_i − x̄)² ).
As can be seen, the confidence interval is narrowest when x₀ equals the mean value x̄ and widens as x₀ "moves away" from the mean in either direction.

To obtain a set of joint confidence intervals suitable for the regression function over its entire length, in the above expression t_{n−2; α/2} must be replaced by √(2·f_{2; n−2; α}).
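A sketch of both intervals (placeholder data; the factor √(2·f) for the whole line is shown last):

```python
import numpy as np
from scipy.stats import t, f as f_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # placeholder data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = x.size
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
s0 = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))

alpha, x0 = 0.05, 2.5
se = s0 * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
half = t.ppf(1 - alpha / 2, n - 2) * se       # pointwise half-width at x0
print(a + b * x0 - half, a + b * x0 + half)

w = np.sqrt(2 * f_dist.ppf(1 - alpha, 2, n - 2))  # factor for the whole line
print(a + b * x0 - w * se, a + b * x0 + w * se)
```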

The Kendall coefficient is used to identify the relationship between quantitative or qualitative indicators, provided that they can be ranked. The values of indicator X are arranged in ascending order and assigned ranks; the values of indicator Y are ranked accordingly, and the Kendall correlation coefficient is calculated:

τ = 2S / (n(n − 1)),

where S = P − Q.

P is the total number of observations that follow the current observation and have a larger rank value of Y.

Q is the total number of observations that follow the current observation and have a smaller rank value of Y (tied ranks are not counted!).

If the data under study contain repetitions (tied ranks), then the adjusted Kendall correlation coefficient is used in the calculations:

τ = S / √( [n(n − 1)/2 − T_x] · [n(n − 1)/2 − T_y] ),  where T = Σ t(t − 1)/2,

and t is the number of tied ranks in each group of ties in the series X and Y, respectively.
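For data with tied ranks, scipy's kendalltau computes the tie-adjusted coefficient (tau-b), which corresponds to the corrected formula above (a sketch with made-up data):

```python
from scipy.stats import kendalltau

x = [1, 2, 2, 3, 4, 5]                        # made-up data with ties
y = [2, 1, 3, 3, 5, 6]
tau, p_value = kendalltau(x, y)               # tau-b: adjusts for tied ranks
print(tau, p_value)
```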

19. What should we proceed from when determining the topic, object, subject, purpose, objectives and hypothesis of the study?

The research program, as a rule, has two sections: methodological and procedural. The first includes justification of the relevance of the topic, formulation of the problem, definition of the object and subject, goals and objectives of the study, formulation of basic concepts (the categorical apparatus), a preliminary systemic analysis of the object of study, and the formulation of a working hypothesis. The second section reveals the strategic design of the study, as well as the design and basic procedures for the collection and analysis of primary data.

First of all, when choosing a research topic, one must proceed from relevance. The justification of relevance includes an indication of the necessity and timeliness of studying and solving the problem for the further development of the theory and practice of teaching and upbringing. Relevant research provides answers to the most pressing questions of the time, reflects society's social order for pedagogical science, and reveals the most important contradictions occurring in practice. The criterion of relevance is dynamic and flexible: it depends on time and takes specific circumstances into account. In its most general form, relevance characterizes the degree of discrepancy between the demand for scientific ideas and practical recommendations (to satisfy a particular need) and what science and practice can offer at the present time.

The most convincing basis for defining a research topic is the social order, which reflects the most pressing, socially significant problems requiring urgent solution. The social order requires justification of the specific topic; usually this is an analysis of the degree to which the question has been developed in science.

If the social order follows from an analysis of pedagogical practice, then the scientific problem lies in a different plane. It expresses the main contradiction that must be resolved by the means of science. The solution of the problem usually constitutes the purpose of the study; the goal is the reformulated problem.

The formulation of the problem entails the selection of the object of research. It can be a pedagogical process, an area of pedagogical reality, or some pedagogical relation containing a contradiction. In other words, the object can be anything that explicitly or implicitly contains a contradiction and gives rise to a problem situation. The object is that at which the process of cognition is directed. The subject of research is a part or aspect of the object: the most significant properties, aspects, and features of the object, from a practical or theoretical point of view, that are subject to direct study.

In accordance with the purpose, object, and subject of the study, the research tasks are determined; they are usually aimed at testing the hypothesis. The latter is a set of theoretically grounded assumptions whose truth is subject to verification.

The criterion of scientific novelty is applicable to the assessment of the quality of completed studies. It characterizes new theoretical and practical conclusions, patterns of education, its structure and mechanisms, content, principles and technologies that, at the given moment, were not known and not recorded in the pedagogical literature. The novelty of research can have both theoretical and practical significance. The theoretical significance of research lies in creating a concept, obtaining a hypothesis, a pattern, a method, or a model for identifying a problem, trend, or direction. The practical significance of research lies in the preparation of proposals, recommendations, and the like. The criteria of novelty and of theoretical and practical significance vary depending on the type of research; they also depend on the time when the new knowledge is obtained.


