Causes and consequences of multicollinearity. Detection. The problem of multicollinearity of factors in regression models. The main consequences of multicollinearity

Note that in some cases, multicollinearity is not such a serious “evil” that significant efforts must be made to identify and eliminate it. Basically, it all depends on the goals of the study.
If the main task of the model is to predict future values of the dependent variable, then with a sufficiently large coefficient of determination R² (> 0.9) the presence of multicollinearity usually does not affect the predictive quality of the model (provided that in the future the same relationships between the correlated variables hold as before).

If it is necessary to determine the extent to which each explanatory variable affects the dependent variable, then multicollinearity, which inflates the standard errors, is likely to distort the true relationships between the variables. In this situation multicollinearity is a serious problem.
There is no single method for eliminating multicollinearity that is suitable in any case. This is because the causes and consequences of multicollinearity are ambiguous and largely depend on the results of the sample.
Excluding Variable(s) from the Model
The simplest method for eliminating multicollinearity is to exclude one or a number of correlated variables from the model. Some caution is required when using this method. In this situation, specification errors are possible, so in applied econometric models it is advisable not to exclude explanatory variables until multicollinearity becomes a serious problem.
Getting more data or a new sample
Since multicollinearity directly depends on the sample, it is possible that with a different sample there will be no multicollinearity or it will not be as serious. Sometimes, to reduce multicollinearity, it is enough to increase the sample size. For example, if you are using annual data, you can move to quarterly data. Increasing the amount of data reduces the variance of regression coefficients and thereby increases their statistical significance. However, obtaining a new sample or expanding an old one is not always possible or is associated with serious costs. In addition, this approach may increase autocorrelation. These problems limit the use of this method.
Changing Model Specification
In some cases, the problem of multicollinearity can be solved by changing the specification of the model: either changing the form of the model, or adding explanatory variables that were not taken into account in the original model, but significantly affect the dependent variable. If this method is justified, then its use reduces the sum of squared deviations, thereby reducing the standard error of the regression. This results in a reduction in the standard errors of the coefficients.
Using advance information about some parameters
Sometimes, when building a multiple regression model, one can use prior information, in particular the known values of some regression coefficients.

The values of the coefficients calculated for some preliminary (usually simpler) models, or for a similar model estimated on a previously obtained sample, are likely to be usable for the model currently being developed.
Selection of the most significant explanatory variables. The procedure of sequential inclusion of variables
Moving to a smaller number of explanatory variables can reduce the duplication of information carried by highly interdependent variables. This is exactly what we encounter in the case of multicollinearity of explanatory variables.

36. methods for identifying multicollinearity. partial correlation

The greatest difficulties in using the multiple regression apparatus arise in the presence of multicollinearity of the factor variables, when two or more factors are linked by a linear relationship.

Multicollinearity for linear multiple regression is the presence of a linear relationship between the factor variables included in the model.

Multicollinearity is a violation of one of the main conditions underlying the construction of a linear multiple regression model.

Multicollinearity in matrix form is the dependence between the columns of the matrix of factor variables X:

If the unit vector (the intercept column) is not taken into account, this matrix has as many columns as there are factor variables. If the rank of the matrix X is less than the number of its columns, then the model contains complete, or strict, multicollinearity. In practice, however, complete multicollinearity almost never occurs.

It can be concluded that one of the main reasons for the presence of multicollinearity in a multiple regression model is an ill-conditioned matrix of factor variables X.

The stronger the multicollinearity of factor variables, the less reliable is the estimation of the distribution of the amount of explained variation across individual factors using the least squares method.

The inclusion of multicollinear factors in the model is undesirable for several reasons:

1) the hypothesis that the individual multiple regression coefficients are insignificant may not be rejected, while the regression model as a whole, when tested with the F-test, turns out to be significant, which points to an overestimated value of the multiple correlation coefficient;

2) the obtained estimates of the coefficients of the multiple regression model may be unreasonably inflated or have incorrect signs;

3) adding or excluding one or two observations from the original data has a strong impact on the estimates of the model coefficients;

4) multicollinear factors included in the multiple regression model can make it unsuitable for further use.

There are no specific methods for detecting multicollinearity, but it is common to use a number of empirical techniques. In most cases, multiple regression analysis begins with consideration of the correlation matrix of factor variables R or matrix (XTX).

The correlation matrix of factor variables is a matrix of linear coefficients of pairwise correlation of factor variables that is symmetrical with respect to the main diagonal:

where rij is the linear coefficient of pair correlation between the i-th and j-th factor variables,

There are ones on the diagonal of the correlation matrix, because the correlation coefficient of the factor variable with itself is equal to one.

When considering this matrix in order to identify multicollinear factors, we are guided by the following rules:

1) if the correlation matrix of factor variables contains pairwise correlation coefficients in absolute value greater than 0.8, then they conclude that there is multicollinearity in this multiple regression model;

2) calculate the eigenvalues λmin and λmax of the correlation matrix of factor variables. If λmin < 10^-5, then there is multicollinearity in the regression model. If the ratio λmax/λmin is very large, this also indicates the presence of multicollinear factor variables;

3) calculate the determinant of the correlation matrix of factor variables. If its value is very small, then there is multicollinearity in the regression model.
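A minimal sketch of these three empirical checks, assuming the factor variables are collected column-wise in a NumPy array X (the function name and the simulated data are illustrative, not part of the original text):

import numpy as np

def multicollinearity_checks(X, r_threshold=0.8, eig_threshold=1e-5):
    """Apply the three empirical checks to the matrix of factor variables X (n x m)."""
    R = np.corrcoef(X, rowvar=False)                   # correlation matrix of the factors
    off_diag = np.abs(R[~np.eye(R.shape[0], dtype=bool)])
    eigvals = np.linalg.eigvalsh(R)                    # eigenvalues of the correlation matrix
    lam_min, lam_max = eigvals.min(), eigvals.max()
    return {
        "max |r_ij|": off_diag.max(),                  # rule 1: any |r_ij| > 0.8?
        "rule 1 triggered": off_diag.max() > r_threshold,
        "lambda_min": lam_min,                         # rule 2: lambda_min < 1e-5 or a huge ratio
        "lambda_max / lambda_min": lam_max / lam_min,
        "rule 2 triggered": lam_min < eig_threshold,
        "det(R)": np.linalg.det(R),                    # rule 3: determinant close to zero
    }

# Illustrative data: x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.7 * x1 + 0.3 * x2 + 0.01 * rng.normal(size=200)
print(multicollinearity_checks(np.column_stack([x1, x2, x3])))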

37. ways to solve the problem of multicollinearity

If the estimated regression model is to be used to study economic relationships, then eliminating multicollinear factors is mandatory because their presence in the model can lead to incorrect signs of regression coefficients.

When building a forecast based on a regression model with multicollinear factors, it is necessary to evaluate the situation based on the magnitude of the forecast error. If its value is satisfactory, then the model can be used despite multicollinearity. If the forecast error is large, then eliminating multicollinear factors from the regression model is one of the methods for increasing forecast accuracy.

The main ways to eliminate multicollinearity in a multiple regression model include:

1) one of the simplest ways of eliminating multicollinearity is to obtain additional data; in practice, however, this can sometimes be very difficult to implement;

2) transforming the variables: for example, instead of the values of all the variables in the model (including the resultant one), one can take their logarithms:

ln y = β0 + β1·ln x1 + β2·ln x2 + ε.

However, this method also cannot guarantee the complete elimination of the multicollinearity of the factors;

If the methods considered above do not eliminate the multicollinearity of the factors, one moves on to biased methods of estimating the unknown parameters of the regression model, or to methods that exclude variables from the multiple regression model.

If none of the factor variables included in the multiple regression model can be excluded, then one of the main biased methods for estimating the regression model coefficients is used: ridge regression.

When the ridge regression method is used, a small number τ is added to all diagonal elements of the matrix (X^T X), with 10^-6 < τ < 0.1. The unknown parameters of the multiple regression model are then estimated by the formula

β̂ = (X^T X + τ·I_n)^(-1)·X^T·Y,

where I_n is the identity matrix.

The result of applying ridge regression is a reduction in the standard errors of the coefficients of the multiple regression model, because the estimates are stabilized (shrunk toward certain values).
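A minimal sketch of this estimator, assuming the matrix X already includes the intercept column (the simulated data and the value τ = 0.05 are illustrative assumptions):

import numpy as np

def ridge_estimate(X, y, tau=0.01):
    """Ridge estimator: beta = (X'X + tau*I)^(-1) X'y, with 1e-6 < tau < 0.1."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + tau * np.eye(X.shape[1]), X.T @ y)

# Comparison with ordinary least squares on nearly collinear simulated data
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)                  # nearly collinear with x1
y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x2])
print("OLS:  ", np.linalg.lstsq(X, y, rcond=None)[0])
print("ridge:", ridge_estimate(X, y, tau=0.05))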

Principal component analysis is one of the main methods for eliminating variables from a multiple regression model.

This method is used to eliminate or reduce the multicollinearity of the factor variables in a regression model. Its essence is to reduce the set of factor variables to the components that influence the result most significantly. This is achieved by a linear transformation of all the factor variables xi (i = 0, ..., n) into new variables called principal components, i.e., a transition is made from the matrix of factor variables X to the matrix of principal components F. The first principal component is required to account for the maximum of the total variance of all the factor variables xi (i = 0, ..., n); the second component accounts for the maximum of the variance remaining after the influence of the first principal component has been removed, and so on.
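A minimal sketch of a principal-components regression along these lines (the number of retained components k, the simulated data and all names are illustrative assumptions):

import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of the standardized factors X."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the factors
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]                   # components by explained variance
    V = eigvecs[:, order[:k]]                           # loadings of the first k components
    F = Xs @ V                                          # matrix of principal components
    F1 = np.column_stack([np.ones(len(y)), F])
    gamma = np.linalg.lstsq(F1, y, rcond=None)[0]       # OLS of y on the components
    return gamma, V

# Illustrative usage: the fourth factor is nearly collinear with the first
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=150)
y = 1 + X[:, 0] - X[:, 1] + rng.normal(size=150)
print(pcr_fit(X, y, k=2)[0])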

The method of step-by-step inclusion of variables consists in selecting, from the entire possible set of factor variables, exactly those that have a significant impact on the outcome variable.

The step-by-step inclusion method is carried out according to the following algorithm:

1) of all the factor variables, the regression model first includes the variable with the largest absolute value of the linear coefficient of pairwise correlation with the outcome variable;

2) when a new factor variable is added to the regression model, its significance is checked using Fisher's F-test. The main hypothesis states that the inclusion of the factor variable xk in the multiple regression model is unjustified; the opposite hypothesis states that including xk in the model is advisable. The critical value of the F-criterion is defined as Fcrit(a; k1; k2), where a is the significance level, k1 = 1 and k2 = n - l are the degrees of freedom, n is the sample size, and l is the number of parameters estimated from the sample. The observed value of the F-criterion is calculated using the formula:

where q is the number of factor variables already included in the regression model.

When testing the main hypothesis, the following situations are possible.

If F_obs > F_crit, then the main hypothesis about the unjustified inclusion of the factor variable xk in the multiple regression model is rejected. Therefore, inclusion of this variable in the multiple regression model is justified.

If the observed value of the F-criterion (calculated from the sample data) is less than or equal to the critical value of the F-criterion (determined from the Fisher-Snedecor distribution table), i.e. F_obs ≤ F_crit, then the main hypothesis about the unjustified inclusion of the factor variable xk in the multiple regression model is accepted. This factor variable can therefore be left out of the model without compromising its quality;

3) candidate factor variables are checked for significance in this way until no variable remains for which the condition F_obs > F_crit is satisfied.
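A minimal sketch of this algorithm, where the F-criterion is taken as the partial F-statistic that compares the model with and without the candidate variable (an interpretation consistent with k1 = 1; all names and the simulated data are illustrative). At the first step, when no variables are yet included, the largest partial F corresponds to the variable with the largest absolute pairwise correlation with the outcome, as required by step 1:

import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of an OLS fit (X already contains the intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

def forward_selection(X, y, alpha=0.05):
    """Step-by-step inclusion: add the factor with the largest partial F while F_obs > F_crit."""
    n, m = X.shape
    included, remaining = [], list(range(m))
    while remaining:
        base = np.column_stack([np.ones(n)] + [X[:, j] for j in included])
        rss_base = rss(base, y)
        best_j, best_F, best_df2 = None, -np.inf, None
        for j in remaining:
            cand = np.column_stack([base, X[:, j]])
            df2 = n - cand.shape[1]                     # n minus the number of estimated parameters
            F = (rss_base - rss(cand, y)) / (rss(cand, y) / df2)
            if F > best_F:
                best_j, best_F, best_df2 = j, F, df2
        if best_F <= stats.f.ppf(1 - alpha, 1, best_df2):
            break                                       # no remaining variable passes the F-test
        included.append(best_j)
        remaining.remove(best_j)
    return included

# Illustrative usage: only the first two of five simulated factors matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
print(forward_selection(X, y))                          # expected to select columns 0 and 1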

38. dummy variables. Chow test

The term “dummy variables” is used in contrast to “meaningful” variables, which measure the level of a quantitative indicator and take values from a continuous interval. As a rule, a dummy variable is an indicator variable reflecting a qualitative characteristic. Most often binary dummy variables are used, taking the two values 0 and 1 depending on a certain condition. For example, in a survey of a group of people, 0 might denote that the respondent is a man and 1 that the respondent is a woman. Dummy variables sometimes include a regressor consisting only of ones (i.e., a constant, an intercept term), as well as a time trend.

Dummy variables, being exogenous, do not create any difficulties when using OLS. Dummy variables are an effective tool for building regression models and testing hypotheses.

Let's assume that a regression model has been built based on the collected data. The researcher is faced with the task of whether it is worth introducing additional dummy variables into the resulting model or whether the basic model is optimal. This task is solved using Chow's method or test. It is used in situations where the main sample population can be divided into parts or subsamples. In this case, you can test the assumption that subsamples are more effective than the overall regression model.

We will treat the general regression model as the unconstrained regression model and denote it by UN. Particular cases of the unconstrained regression model, estimated on separate subsamples, will be denoted by PR.

Let us introduce the following notation:

PR1 – first subsample;

PR2 – second subsample;

ESS(PR1) – sum of squared residuals for the first subsample;

ESS(PR2) – sum of squared residuals for the second subsample;

ESS(UN) – sum of squared residuals for the general regression model.

– the sum of squared residuals for observations of the first subsample in the general regression model;

– the sum of squared residuals for observations of the second subsample in the general regression model.

For the particular regression models the following inequality holds:

ESS(PR1) + ESS(PR2) ≤ ESS(UN).

The equality ESS(PR1) + ESS(PR2) = ESS(UN) is attained only when the coefficients of the partial regression models coincide with the coefficients of the general unconstrained regression model, but in practice such a coincidence is very rare.

The main hypothesis is the statement that the quality of the general unconstrained regression model is better than the quality of the particular regression models (subsamples).

The alternative (converse) hypothesis states that the quality of the general unconstrained regression model is worse than the quality of the particular regression models (subsamples).

These hypotheses are tested using the Fisher-Snedecor F test.

The observed F-test value is compared with the critical F-test value, which is determined from the Fisher-Snedecor distribution table.

The critical value is taken with degrees of freedom k1 = m + 1 and k2 = n - 2m - 2.

The observed value of the F-criterion is calculated using the formula

F_obs = ((ESS(UN) - ESS(PR1) - ESS(PR2)) / (m + 1)) / ((ESS(PR1) + ESS(PR2)) / (n - 2m - 2)),

where ESS(UN) - ESS(PR1) - ESS(PR2) is the value characterizing the improvement in the quality of the regression model after dividing it into subsamples;

m is the number of factor variables (including dummy ones);

n is the size of the total sample population.

If the observed F-test value (calculated from sample data) is greater than the critical F-test value (determined from the Fisher-Snedecor distribution table), i.e. Fob>Fcrit, then the main hypothesis is rejected, and the quality of the particular regression models exceeds the quality of the general regression model.

If the observed F-test value (calculated from the sample data) is less than or equal to the critical F-test value (determined from the Fisher-Snedecor distribution table), i.e. F_obs ≤ F_crit, then the main hypothesis is accepted, and it makes no sense to split the overall regression into subsamples.
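A minimal sketch of the Chow test with the degrees of freedom given above, k1 = m + 1 and k2 = n - 2m - 2 (function and variable names are illustrative; X1 and X2 are assumed to already contain the intercept column):

import numpy as np
from scipy import stats

def ess(X, y):
    """Sum of squared residuals of an OLS fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

def chow_test(X1, y1, X2, y2, alpha=0.05):
    """Does splitting the sample into two subsamples improve the regression model?"""
    X_un = np.vstack([X1, X2])
    y_un = np.concatenate([y1, y2])
    ess_un = ess(X_un, y_un)                   # ESS(UN): pooled (general) model
    ess_pr = ess(X1, y1) + ess(X2, y2)         # ESS(PR1) + ESS(PR2)
    n = len(y_un)
    m = X_un.shape[1] - 1                      # number of factor variables (without the intercept)
    k1, k2 = m + 1, n - 2 * m - 2
    F_obs = ((ess_un - ess_pr) / k1) / (ess_pr / k2)
    F_crit = stats.f.ppf(1 - alpha, k1, k2)
    return F_obs, F_crit, F_obs > F_crit       # True: the particular models are preferable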

If the significance of the basic (restricted) regression is tested, then the main hypothesis is put forward in the form:

The validity of this hypothesis is tested using the Fisher-Snedecor F test.

The critical value of the Fisher F-test is determined from the Fisher-Snedecor distribution table depending on the significance level α and two degrees of freedom k1 = m + 1 and k2 = n - k - 1.

The observed value of the F-criterion is converted to the form:

When testing the hypotheses, the following situations are possible.

If the observed F-test value (calculated from the sample data) is greater than the critical F-test value (determined from the Fisher-Snedecor distribution table), i.e. F_obs > F_crit, then the main hypothesis is rejected and additional dummy variables must be introduced into the regression model, because the quality of the unconstrained regression model is higher than the quality of the baseline (restricted) regression model.

If the observed F-test value (calculated from the sample data) is less than or equal to the critical F-test value (determined from the Fisher-Snedecor distribution table), i.e. F_obs ≤ F_crit, then the main hypothesis is accepted and the basic regression model is satisfactory; it makes no sense to introduce additional dummy variables into the model.

39. system of simultaneous equations (endogenous, exogenous, lagged variables). Economically significant examples of systems of simultaneous equations

So far, we have considered econometric models defined by equations that express the dependent (explained) variable in terms of explanatory variables. However, real economic objects studied using econometric methods lead to an expansion of the concept of an econometric model described by a system of regression equations and identities1.

1 Unlike regression equations, identities do not contain model parameters to be estimated and do not include a random component.

A feature of these systems is that each of the system equations, in addition to “its own” explanatory variables, can include explained variables from other equations. Thus, we have not one dependent variable, but a set of dependent (explained) variables related by the equations of the system. Such a system is also called a system of simultaneous equations, emphasizing the fact that in the system the same variables are simultaneously considered as dependent in some equations and independent in others.

Systems of simultaneous equations most fully describe an economic object containing many interconnected endogenous (formed within the functioning of the object) and exogenous (set from the outside) variables. In this case, lagged (taken at the previous point in time) variables can act as endogenous and exogenous.

A classic example of such a system is the model of demand Qd and supply Qs (see § 9.1), in which the demand for a product is determined by its price P and consumer income I, the supply of the product is determined by its price P, and a balance is reached between demand and supply:

Qd = β1 + β2·P + β3·I + ε1,
Qs = β4 + β5·P + ε2,
Qd = Qs.

In this system the exogenous variable is consumer income I, and the endogenous variables are the demand (= supply) of the product Qd = Qs = Q and the price of the product (the equilibrium price) P.

In another model of demand and supply, the variable explaining the supply Qs at the moment of time t can be not only the price of the product at that moment, Pt, but also the price of the product at the previous moment, Pt-1, i.e. a lagged endogenous variable:

Qs_t = β4 + β5·Pt + β6·Pt-1 + ε2.

Summarizing the above, we can say that an econometric model makes it possible to explain the behavior of the endogenous variables as a function of the values of the exogenous and lagged endogenous variables (in other words, as a function of the predetermined variables).

Concluding our consideration of the concept of an econometric model, the following should be noted. Not every economic and mathematical model that represents a mathematical and statistical description of the economic object under study can be considered econometric. It becomes econometric only if it reflects this object on the basis of empirical (statistical) data characterizing it.

40. indirect least squares method

If the i-th stochastic equation of the structural form is exactly identified, then the parameters of this equation (the equation coefficients and the variance of the random error) are uniquely recovered from the parameters of the reduced form. Therefore, to estimate the parameters of such an equation it is enough to estimate the coefficients of each equation of the reduced form by the least squares method (separately for each equation) and to obtain an estimate of the covariance matrix Ω of the reduced-form errors, and then to use the relations linking the structural parameters with the reduced-form parameters, substituting into them the estimated coefficient matrix of the reduced form and the estimated covariance matrix of the reduced-form errors. This procedure is called indirect least squares (ILS). The resulting estimates of the coefficients of the i-th stochastic equation of the structural form inherit the consistency of the reduced-form estimates. However, they do not inherit such properties of the reduced-form estimators as unbiasedness and efficiency, because they are obtained as a result of certain nonlinear transformations. Accordingly, with a small number of observations even these natural estimates may have a noticeable bias. For this reason, when various methods of estimating the coefficients of structural equations are considered, the main concern is to ensure the consistency of the resulting estimates.
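A minimal sketch of indirect least squares for the exactly identified supply equation of the demand-supply model discussed in question 39 (the structural parameter values, the simulated data and all names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
n = 500
income = rng.normal(10.0, 2.0, size=n)                 # exogenous variable I
u_d, u_s = rng.normal(size=n), rng.normal(size=n)

# Structural form: demand Q = a0 + a1*P + a2*I + u_d, supply Q = b0 + b1*P + u_s, Qd = Qs
a0, a1, a2, b0, b1 = 20.0, -1.5, 0.8, 2.0, 1.0
P = (a0 - b0 + a2 * income + u_d - u_s) / (b1 - a1)    # equilibrium price
Q = b0 + b1 * P + u_s                                  # equilibrium quantity

# Step 1: estimate the reduced form P = p10 + p11*I, Q = p20 + p21*I by OLS
Z = np.column_stack([np.ones(n), income])
p10, p11 = np.linalg.lstsq(Z, P, rcond=None)[0]
p20, p21 = np.linalg.lstsq(Z, Q, rcond=None)[0]

# Step 2: recover the structural coefficients of the (exactly identified) supply equation
b1_hat = p21 / p11
b0_hat = p20 - b1_hat * p10
print(b0_hat, b1_hat)                                  # close to b0 = 2.0 and b1 = 1.0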

41. problems of identifiability of systems of simultaneous equations

With a correct specification of the model, the task of identifying a system of equations reduces to a correct and unambiguous estimation of its coefficients. Direct estimation of the equation coefficients is possible only in systems of seemingly unrelated equations, for which the basic prerequisites for constructing a regression model are met, in particular the condition that the factor variables are uncorrelated with the residuals.

In recursive systems, it is always possible to get rid of the problem of correlation of residuals with factor variables by substituting as the values ​​of factor variables not actual, but model values ​​of endogenous variables acting as factor variables. The identification process is carried out as follows:

1. An equation is identified that does not contain endogenous variables as factors. The calculated value of the endogenous variable of this equation is found.

2. Consider the following equation, in which the endogenous variable found in the previous step is included as a factor. The model (estimated) values ​​of this endogenous variable provide the ability to identify this equation, etc.

In the system of equations in the reduced form, the problem of factor variables being correlated with deviations does not arise, since in each equation only predefined variables are used as factor variables. Thus, if other prerequisites are met, the recursive system is always identifiable.

When considering a system of simultaneous equations, an identification problem arises.

Identification in this case means determining the possibility of unambiguously recalculating the system coefficients in the reduced form into structural coefficients.

The structural model (7.3) in its full form contains a larger number of parameters to be determined than the reduced form of the model, so the system of equations that can be drawn up to determine the unknown parameters of the structural model from the reduced-form parameters is, in the general case, underdetermined, and the parameters of the structural model cannot be determined unambiguously.

To get the only possible solution it is necessary to assume that some of the structural coefficients of the model, due to their weak relationship with the endogenous variable from the left side of the system, are equal to zero. This will reduce the number of structural coefficients of the model. Reducing the number of structural coefficients of the model is also possible in other ways: for example, by equating some coefficients to each other, i.e., by assuming that their impact on the endogenous variable being formed is the same, etc.

From the point of view of identifiability, structural models can be divided into three types:

· identifiable;

· unidentifiable;

· overidentified.

A model is identifiable if all its structural coefficients are determined uniquely, in a unique way, from the coefficients of the reduced form of the model, that is, if the number of parameters of the structural model equals the number of parameters of the reduced form of the model.

A model is unidentifiable if the number of coefficients of the reduced model is less than the number of structural coefficients, so that the structural coefficients cannot be estimated from the coefficients of the reduced form of the model.

A model is overidentified if the number of coefficients of the reduced model is greater than the number of structural coefficients. In this case two or more values of the same structural coefficient can be obtained from the reduced-form coefficients. An overidentified model, unlike an unidentifiable one, is practically solvable but requires special methods for finding its parameters.

To determine the type of structural model, each of its equations must be checked for identifiability.

A model is considered identifiable if every equation of the system is identifiable. If at least one equation of the system is unidentifiable, the entire model is considered unidentifiable. An overidentified model contains at least one overidentified equation along with identifiable ones.

42. three-step least squares method

The most effective procedure for estimating systems of regression equations combines the simultaneous estimation method and the instrumental variables method. The corresponding method is called three-step least squares. It consists in the fact that in the first step the generalized least squares method is applied to the original model (9.2) in order to eliminate the correlation of random terms. The two-step least squares method is then applied to the resulting equations.

Obviously, if the random terms in (9.2) are uncorrelated, the three-step method reduces to the two-step one; at the same time, if the matrix B is the identity matrix, the three-step method becomes the procedure for the simultaneous estimation of the equations as seemingly unrelated.

Let's apply the three-step method to the model under consideration (9.24):

α1 = 19.31 (6.98); β1 = 1.77 (0.03); α2 = 19.98 (4.82); β2 = 0.05 (0.08); γ = 1.4 (0.016), with the standard errors given in parentheses.

Since the coefficient β2 is insignificant, the equation for the dependence of Y on X has the form:

y = 16.98 + 1.4x

Note that it practically coincides with equation (9.23).

As is known, purging an equation of the correlation of random terms is an iterative process. Accordingly, when the three-step method is used, the computer program asks for the number of iterations or the required precision. Let us note an important property of the three-step method, which ensures its high efficiency.

With a sufficiently large number of iterations, the three-stage least squares estimates coincide with the maximum likelihood estimates.

Maximum likelihood estimators are known to perform best on large samples.

43. concept of economic time series. General form of multiplicative and additive time series models.

44. modeling of time series trends, seasonal and cyclical fluctuations.

There are several approaches to analyzing the structure of time series containing seasonal or cyclical fluctuations.

1 APPROACH. Calculation of seasonal component values ​​using the moving average method and construction of an additive or multiplicative time series model.

General form of the additive model: Y = T + S + E (T is the trend component, S the seasonal component, E the random component).

General form of the multiplicative model: Y = T · S · E.

The model is chosen on the basis of the structure of the seasonal fluctuations: if their amplitude is approximately constant, the additive model is used; if the amplitude increases or decreases, the multiplicative model is used.

Building the model comes down to calculating the values of T, S and E for each level of the series.

Model building:

1. Alignment of the original series using the moving-average method.

2. Calculation of the values of the seasonal component S.

3. Removal of the seasonal component from the original levels of the series to obtain the aligned data (T + E) in the additive model or (T · E) in the multiplicative model.

4. Analytical alignment of (T + E) or (T · E) and calculation of the values of T from the obtained trend equation.

5. Calculation of the values obtained from the model, (T + S) or (T · S).

6. Calculation of the absolute and/or relative errors.

If the obtained error values ​​do not contain autocorrelation, they can be used to replace the original levels of the series and subsequently use the error time series E to analyze the relationship between the original series and other time series.

2 APPROACH. Construction of a regression model that includes the time factor and dummy variables. The number of dummy variables in such a model should be one less than the number of moments (periods) of time within one oscillation cycle. For example, when modeling quarterly data, the model must include four independent variables: a time factor and three dummy variables. Each dummy variable reflects the seasonal (cyclical) component of the time series for one particular period: it equals one (1) for this period and zero (0) for all the others. The disadvantage of a model with dummy variables is the large number of variables.
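A minimal sketch of the second approach, a regression on the time factor and three quarterly dummy variables (the simulated quarterly series and all names are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n_years, s = 6, 4                                      # six years of quarterly data
t = np.arange(1, n_years * s + 1)
quarter = (t - 1) % s + 1
seasonal = np.array([0.0, 3.0, -2.0, 1.0])[quarter - 1]
y = 10 + 0.5 * t + seasonal + rng.normal(0, 0.5, size=t.size)

# Regressors: intercept, time factor, and s - 1 = 3 quarterly dummies (quarter 1 is the base)
D = np.column_stack([(quarter == q).astype(float) for q in (2, 3, 4)])
X = np.column_stack([np.ones(t.size), t, D])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("intercept, trend, quarter 2-4 effects:", beta.round(2))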

45. autocorrelation function. Its use to identify the presence or absence of trend and cyclical components

Autocorrelation of time series levels.

If there are trends and cyclical fluctuations in a time series, each subsequent level of the series depends on the previous ones. The correlation dependence between successive levels of a time series is called autocorrelation of series levels.

Quantitatively, the autocorrelation of series levels is measured using a linear correlation coefficient between the levels of the original time series and the levels of this series, shifted by several steps in time.

Let, for example, a time series y_t be given. Let us determine the correlation coefficient between the series y_t and the same series shifted by one step in time, y_(t-1).

One of the working formulas for calculating this correlation coefficient (the first-order autocorrelation coefficient, i.e. at lag 1) is the ordinary linear correlation coefficient computed for the pairs (y_t, y_(t-1)).

Similarly, the second-order autocorrelation coefficient measures the correlation between the series y_t and y_(t-2), i.e. at lag 2, and is determined by the analogous formula (4).

Note that as the lag increases, the number of pairs of values ​​from which the correlation coefficient is calculated decreases. Typically, the lag is not allowed to be greater than a quarter of the number of observations.

Let us note two important properties of autocorrelation coefficients.

Firstly, autocorrelation coefficients are calculated by analogy with the linear correlation coefficient, i.e. they characterize only the closeness of the linear connection between the two levels of the time series under consideration. Therefore, the autocorrelation coefficient can only judge the presence of a linear (or close to linear) trend. For time series that have a strong nonlinear trend (for example, exponential), the level autocorrelation coefficient may approach zero.
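A minimal sketch of the level autocorrelation coefficient computed as an ordinary linear correlation between y_t and y_(t-k) (the illustrative series below contains a linear trend, so the low-order coefficients come out close to one):

import numpy as np

def autocorrelation(y, lag):
    """Linear correlation coefficient between y_t and y_(t-lag)."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[lag:], y[:-lag])[0, 1]

y = 2.0 * np.arange(50) + np.random.default_rng(6).normal(0, 3, size=50)
for k in range(1, 5):                                  # the lag is usually kept below n/4
    print(k, round(autocorrelation(y, k), 3))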

37. Definition of multicollinearity. Consequences of multicollinearity. Methods for detecting multicollinearity


1. In a model with two variables, one of the signs of multicollinearity is a value of the pairwise correlation coefficient close to unity. If the absolute value of at least one of the pairwise correlation coefficients is greater than 0.8, then multicollinearity is a serious problem.

However, in a model with more than two independent variables, the pairwise correlation coefficient may take on a small value even in the presence of multicollinearity. In this case, it is better to consider partial correlation coefficients.

2. To check for multicollinearity one can consider the determinant of the matrix of pairwise correlation coefficients, |r|, which is called the correlation determinant and takes values between 0 and 1. If |r| = 0, there is complete multicollinearity. If |r| = 1, there is no multicollinearity. The closer |r| is to zero, the more likely the presence of multicollinearity.

3. If the estimates have large standard errors, low significance, but the model as a whole is significant (has a high coefficient of determination), then this indicates the presence of multicollinearity.

4. If the introduction of a new independent variable into the model leads to a significant change in parameter estimates and a slight change in the coefficient of determination, then the new variable is linearly dependent on the other variables

65. Dummy variables: definition, purpose, types, meaning of names.

Dummy variables are variables with a discrete set of values that quantitatively describe qualitative characteristics. Econometric models typically use binary dummy variables of the “0-1” type.

Dummy variables are needed to assess the influence of qualitative characteristics on an endogenous variable. For example, when assessing the demand for a certain product, we built a regression model in which the regressors were quantitative variables: price and consumer income. One way to refine this model would be to include such qualitative characteristics as consumer taste, age, national characteristics, seasonality, etc. These indicators cannot be presented in numerical form, so the problem arises of reflecting their influence on the values of the endogenous variable; it is solved precisely by introducing dummy variables.

In the general case, when a qualitative characteristic has more than two values, several binary variables are introduced. When using several binary variables, it is necessary to exclude a linear relationship between the variables, since otherwise, when estimating parameters, this will lead to perfect multicollinearity. Therefore, the following rule applies: if a qualitative variable has k alternative values, then only (k-1) dummy variables are used in the modeling.
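A short illustration of the (k - 1) rule, assuming the data are prepared with pandas (the column name "season" and its values are illustrative); drop_first=True removes one category so that the dummies are not perfectly collinear with the intercept:

import pandas as pd

df = pd.DataFrame({"season": ["winter", "spring", "summer", "autumn", "winter", "summer"]})
dummies = pd.get_dummies(df["season"], prefix="season", drop_first=True)
print(dummies)     # k = 4 seasons produce only k - 1 = 3 binary columns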

Regression models use two types of dummy variables:

1. Shift dummy variables.

2. Slope dummy variables. A slope dummy variable is a variable that changes the slope of the regression line. Using such dummy variables, one can construct piecewise-linear models that take into account structural changes in economic processes (for example, the introduction of new legal or tax restrictions, changes in the political situation, etc.). Such variables are used when a change in a qualitative characteristic leads not to a parallel shift of the regression graph but to a change in its slope; this is precisely why such dummy variables are called slope variables.

66. Shift Dummy: Specification of a regression model with a shift dummy.

Shift dummy variables are used in dynamic models when, starting from a certain point in time, some qualitative factor comes into play (for example, when considering the productivity of a plant before and during a workers' strike). These variables are used when a change in a qualitative attribute leads to a parallel shift of the graph of the regression model, which is why they are called shift variables.

The specification of a paired regression model with a shift dummy variable is

y_t = α + β·x_t + δ·d_t + ε_t,

where α, β, δ are the model parameters; x_t is the value of the regressor in observation t; d_t is the dummy variable; δ is the parameter of the dummy variable.

The value of the dummy variable dt=0 is called the base (comparative) value. The base value can either be determined by the objectives of the study or chosen arbitrarily. If you replace the base value of the variable, the essence of the model will not change; the sign of the parameter δ will change to the opposite.

Let us consider a paired regression model with a shift dummy variable using an example.

Suppose that ice cream sales are influenced by the presence of advertising on the seller's van. Using a single regression equation with a dummy variable, one can obtain results both for sellers with advertising and for sellers without it.

Let the initial model be described by the specification

y_t = α + β·x_t + ε_t, t = 1, ..., n,

where n is the number of ice cream sellers, y_t is the number of sales for the t-th seller, and x_t is the value of the quantitative regressor for the t-th seller.

Let us introduce a shift dummy variable d_t, equal to one if the t-th seller's van carries advertising and zero otherwise.
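A minimal sketch of estimating such a model with a shift dummy variable (the simulated data, parameter values and names are illustrative; d_t = 1 when the seller's van carries advertising and 0 otherwise):

import numpy as np

rng = np.random.default_rng(7)
n = 120
price = rng.uniform(1.0, 3.0, size=n)                  # quantitative regressor x_t
adv = rng.integers(0, 2, size=n)                       # shift dummy variable d_t
sales = 50 - 8 * price + 12 * adv + rng.normal(0, 3, size=n)

X = np.column_stack([np.ones(n), price, adv])
alpha, beta, delta = np.linalg.lstsq(X, sales, rcond=None)[0]
print(f"alpha={alpha:.2f}, beta={beta:.2f}, delta={delta:.2f}")
# delta estimates the parallel shift in sales attributable to advertising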

An external sign of the presence of multicollinearity is that the elements of the matrix (X^T X)^(-1) are too large. For the definition of the matrix (X^T X)^(-1) and its use, see Chap. 4, Sec. 4.2.

The main sign of multicollinearity: the determinant of the correlation matrix R is close to zero. If all the explanatory variables are uncorrelated with each other, then |R| = 1; otherwise 0 < |R| < 1.

There are several signs by which the presence of multicollinearity can be determined.

  • 1. The coefficient of determination R^2 is quite high and the F-statistic is high, but some (sometimes all) of the coefficients of the multiple linear regression equation are statistically insignificant (have low t-statistics).
  • 2. High pairwise correlation coefficients and high partial correlation coefficients.

Definition 7.1. The partial correlation coefficient is the correlation coefficient between two explanatory variables, “cleared” of the influence of the other variables.

For example, with three explanatory variables X1, X2, X3, the partial correlation coefficient between X1 and X3, “purified” of X2, is calculated by the formula

r_13.2 = (r_13 - r_12·r_23) / sqrt((1 - r_12^2)·(1 - r_23^2)).

Remark 7.2. The partial correlation coefficient may differ significantly from the “usual” (paired) correlation coefficient. For a more reasonable conclusion about the correlation between pairs of explanatory variables, it is necessary to calculate all partial correlation coefficients.

The general expression for the partial correlation coefficient is

r_(x_i x_j . rest) = -c_ij / sqrt(c_ii · c_jj),

where c_ij are the elements of the matrix C = R^(-1), the inverse of the interfactor pair correlation matrix R (7.1).

  • 3. Strong regression between explanatory variables. Any of the explanatory variables is a combination of other explanatory variables (linear or nearly linear).
  • 4. The signs of the regression coefficients are opposite to those expected from economic premises.
  • 5. Adding or removing observations from the sample greatly changes the values ​​of the estimates.

Let's look at a few examples to illustrate the above.

Example 7.4

Production volume y is influenced by the following main factors: x1, the number of employees working at the enterprise; x2, the cost of fixed assets; x3, the average wage of employees. The linear multiple regression equation has the form y = b0 + b1·x1 + b2·x2 + b3·x3.

Matrix of pair correlation coefficients for this model

The determinant of this matrix equals 0.302. In this model the factors x1 and x2, as well as x1 and x3, are weakly related; on the contrary, the factors x2 and x3 are strongly related: r_(x2 x3) = 0.8. The strong relationship between the factors x2 and x3 is probably explained by the fact that highly skilled workers, who have higher wages, work on expensive equipment.

The pairwise correlation coefficients of the resulting variable with the factors turned out to be: r_(y x1) = 0.7; r_(y x2) = 0.8; r_(y x3) = 0.75. The complete matrix of pairwise correlation coefficients has the form

All factors have a significant impact on the result. Since the regression model should include factors that are closely related to the result and weakly related to each other, in this example two regression models are suitable at the same time: y1 = f(x1, x2) and y2 = f(x1, x3).

Example 7.5

Let us check for the presence of multicollinearity using the sample data given in Table 7.2.

Input data for example 7.2

Table 7.2


Solution. The pairwise correlation coefficients calculated by formula (7.2) are given in Table 7.3.

Table 7.3

Pairwise correlation coefficients

From the data given in the table it is clear that there is a strong correlation between the variables x1 and x2. The pairwise correlation coefficients can also be determined using the Microsoft Excel analysis tools (the Correlation tool).

Let us check the correlation between the explained and explanatory variables; for this we will use the “Correlation” tool of Microsoft Excel (the correlation coefficients r_(y x_i) can also be calculated using formula (7.2)). The results are presented in Fig. 7.1.


Fig. 7.1. Results of calculating the correlation between the explained and explanatory variables in Microsoft Excel

Let us calculate the partial correlation coefficients using formula (7.4), since in this example there are only three explanatory variables (the partial correlation coefficients can also be found using formula (7.5), having first found the inverse matrix C = R^(-1)):

The partial correlation coefficient between the variables x1 and x2 turned out to be the largest. The partial correlation coefficient r_(x1 x3 . x2) is the smallest and is opposite in sign to the pairwise coefficient r_(x1 x3).

Answer. In this model there is a strong correlation between the variables x1 and x2.
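A minimal sketch of the partial correlation coefficients computed through the inverse of the interfactor correlation matrix, r_ij.rest = -c_ij / sqrt(c_ii · c_jj) (the simulated data and names are illustrative):

import numpy as np

def partial_correlations(X):
    """Partial correlations of each pair of factors, 'cleared' of the remaining factors."""
    R = np.corrcoef(X, rowvar=False)
    C = np.linalg.inv(R)                               # C = R^(-1)
    d = np.sqrt(np.diag(C))
    P = -C / np.outer(d, d)                            # r_ij.rest = -c_ij / sqrt(c_ii * c_jj)
    np.fill_diagonal(P, 1.0)
    return P

rng = np.random.default_rng(8)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)             # strongly related to x1
x3 = rng.normal(size=100)
print(partial_correlations(np.column_stack([x1, x2, x3])).round(3))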

QUESTIONS FOR THE COURSE EXAM

"ECONOMETRICS (advanced level)"

1. Multiple regression model. Types of multiple regression models.

2. Matrix recording form and matrix formula for estimating multiple regression parameters.

3. Assessing the quality of the regression equation. Explained and unexplained components of the regression equation.

4. Coefficient of determination and correlation coefficient, their calculation in a paired regression model.

5. Selective multiple coefficient of determination and checking its significance using Fisher's criterion.

6. Checking the significance of a multiple regression equation using Fisher's test.

Testing the significance of the regression equation, i.e. of the fit of the econometric model Y = â0 + â1·X + e to the actual (empirical) data, allows us to determine whether the regression equation is suitable for practical use (for analysis and forecasting) or not.

To test the significance of the equation, Fisher's F-criterion is used. It is calculated from the actual data as the ratio of the variance explained by the regression to the unbiased residual variance. The significance of the coefficient of determination is checked with the Fisher criterion, whose calculated value is found by the formula

F = (R^2 / m) / ((1 - R^2) / (n - m - 1)),

where R is the multiple correlation coefficient, n is the number of observations, and m is the number of explanatory variables.

To test the hypothesis, the tabulated value of Fisher's F-test is determined from the table.

F(α; ν1; ν2) is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom ν1 = m, ν2 = n - m - 1 and significance level α. Here m is the number of arguments in the model.

The significance level α is the probability of rejecting the hypothesis when it is in fact true (a type I error). Typically α is taken equal to 0.05 or 0.01.

If F_fact > F_table, then the hypothesis H0 about the random nature of the estimated characteristics is rejected, and their statistical significance and reliability are recognized. Otherwise the hypothesis H0 is not rejected, and the regression equation is recognized as statistically insignificant and unreliable.
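A minimal sketch of this test, assuming that R², n and m are already known (the function name and the sample numbers are illustrative):

from scipy import stats

def f_test_significance(R2, n, m, alpha=0.05):
    """Overall significance: F = (R^2 / m) / ((1 - R^2) / (n - m - 1)) against F(alpha; m, n - m - 1)."""
    F_obs = (R2 / m) / ((1 - R2) / (n - m - 1))
    F_crit = stats.f.ppf(1 - alpha, m, n - m - 1)
    return F_obs, F_crit, F_obs > F_crit               # True: the equation is significant

print(f_test_significance(R2=0.76, n=30, m=2))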

7. Assessing the significance of linear correlation coefficients. Student's t-test.

To assess the statistical significance of the regression coefficients and of the correlation coefficient, Student's t-test is calculated. A hypothesis H0 about the random nature of the indicators, i.e. about their insignificant difference from zero, is put forward. The observed t-test values are calculated using the formulas

t_a = a / m_a, t_b = b / m_b, t_r = r / m_r,

where m_a, m_b and m_r are the standard (random) errors of the linear regression parameters a and b and of the correlation coefficient r.


For linear paired regression the equality t_b^2 = t_r^2 = F holds; therefore, testing the hypotheses about the significance of the regression coefficient of the factor and about the significance of the correlation coefficient is equivalent to testing the hypothesis about the statistical significance of the regression equation as a whole.

In the general case, the standard errors are calculated using the formulas

m_b = sqrt(S²_rest / Σ(x - x̄)²), m_a = sqrt(S²_rest · Σx² / (n · Σ(x - x̄)²)), m_r = sqrt((1 - r²) / (n - 2)),

where S²_rest is the residual variance per one degree of freedom:

S²_rest = Σ(y - ŷ)² / (n - 2).

The tabulated (critical) value of the t-statistic is found from tables of Student's t-distribution at a significance level of α = 0.05 and the corresponding number of degrees of freedom. If t_table < t_fact, then H0 is rejected, i.e. the regression coefficients differ from zero not by chance but under the influence of a systematically acting factor.
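A minimal sketch of these calculations for paired regression, following the formulas above (the data and names are illustrative):

import numpy as np
from scipy import stats

def paired_regression_t_tests(x, y, alpha=0.05):
    """OLS fit of y = a + b*x, standard errors of a, b, r and the observed t-statistics."""
    n = len(y)
    x_mean, y_mean = x.mean(), y.mean()
    b = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    a = y_mean - b * x_mean
    s2 = ((y - (a + b * x)) ** 2).sum() / (n - 2)      # residual variance per degree of freedom
    m_b = np.sqrt(s2 / ((x - x_mean) ** 2).sum())
    m_a = np.sqrt(s2 * (x ** 2).sum() / (n * ((x - x_mean) ** 2).sum()))
    r = np.corrcoef(x, y)[0, 1]
    m_r = np.sqrt((1 - r ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    return {"t_a": a / m_a, "t_b": b / m_b, "t_r": r / m_r, "t_crit": t_crit}

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])
print(paired_regression_t_tests(x, y))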

8. Analysis of the influence of factors based on multifactor regression models: elasticity coefficient; beta coefficient and delta coefficient.

9. Methods for calculating the parameters of the Cobb-Douglas production function.

10. Regression equations with variable structure. Dummy variables. Types of dummy variables. Advantages of using dummy variables when building regression models.

11. Using dummy variables to study structural change. Modeling seasonality. Number of binary variables at k gradations.

The concept of multicollinearity. Methods for detecting and eliminating multicollinearity.

Quantitative assessment of the parameters of the regression equation assumes that the condition of linear independence between the independent variables is met. However, in practice, explanatory variables often have a high degree of interrelationship with each other, which is a violation of this condition. This phenomenon is called multicollinearity.

The term collinearity denotes a linear relationship between two independent variables, while multicollinearity denotes a linear relationship between more than two independent variables. Usually the term multicollinearity is applied to both cases.

Thus, multicollinearity means there is a close linear relationship or strong correlation between two or more explanatory (independent) variables. One of the tasks of econometrics is to identify multicollinearity between independent variables.

A distinction is made between perfect and imperfect multicollinearity. Perfect multicollinearity means that the variation in one of the independent variables can be fully explained by the variation in the other variable(s).

In other words, the relationship between them is expressed by an exact linear function.

Graphical interpretation of this case.

Imperfect Multicollinearity can be defined as a linear functional relationship between two or more independent variables that is so strong that it can significantly affect the estimates of the coefficients of the variables in the model.

Imperfect multicollinearity occurs when two (or more) independent variables are related by a linear functional dependence described by an equation that, unlike the one discussed above, also includes a stochastic error term. This means that although the relationship between the variables may be quite strong, it is not so strong that the change in one variable can be fully explained by the change in the other, i.e. some unexplained variation remains.

Graphically this case can be represented in a similar way.


In what cases can multicollinearity occur? There are at least two of them.

1. There is a global trend of simultaneous changes in economic indicators. As an example, we can cite such indicators as production volume, income, consumption, accumulation, employment, investment, etc., the values ​​of which increase during periods of economic growth and decrease during periods of recession.

One of the reasons for multicollinearity is the presence of a trend (tendency) in the dynamics of economic indicators.

2. Use of lagged values ​​of variables in economic models.

As an example, we can consider models that use both the income of the current period and the consumption costs of the previous period.

In general, when studying economic processes and phenomena using econometric methods, it is very difficult to avoid dependence between indicators.

The consequences of multicollinearity boil down to

1. reduction in assessment accuracy, which manifests itself through

a. too large errors in some estimates,

b. high degree of correlation between errors,

c. A sharp increase in the dispersion of parameter estimates. This manifestation of multicollinearity may also be reflected in obtaining an unexpected sign when estimating parameters;

2. the insignificance of the parameter estimates for some model variables due, first of all, to their relationship with other variables rather than to the fact that they do not affect the dependent variable; that is, the t-statistics of the model parameters do not reach the significance level (the coefficients fail Student's t-test);

3. a strong increase in the sensitivity of the parameter estimates to the size of the set of observations, so that an increase in the number of observations can significantly affect the estimates of the model parameters;

4. widening of the confidence intervals;

5. increasing the sensitivity of estimates to changes in model specification (for example, to adding or excluding variables from the model, even if they have an insignificant effect).

Signs of multicollinearity:

1. when, among the pairwise correlation coefficients between the explanatory (independent) variables, there are those whose value approaches or is equal to the multiple correlation coefficient.

If there are more than two independent variables in the model, then a more detailed study of the relationships between the variables is necessary. This can be done, for example, using the Farrar-Glauber procedure;

2. when the determinant of the matrix of pairwise correlation coefficients between the independent variables approaches zero: if it equals 0, there is complete multicollinearity; if it equals 1, there is no multicollinearity;

3. if a small parameter estimate is obtained in the model together with a high coefficient of partial determination and, at the same time, the F-criterion differs significantly from zero;


