What is multicollinearity, how to identify it, and how to deal with it?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other. This means that one independent variable can be predicted from another independent variable in the model.
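As a quick, self-contained sketch (with synthetic data, so every name here is made up for illustration), the following shows two predictors that are almost perfectly correlated:

```python
# Minimal illustration of multicollinearity with synthetic data:
# x2 is nearly a linear function of x1, so the two predictors carry
# almost the same information.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.05, size=100)  # x2 is almost exactly 2 * x1

df = pd.DataFrame({"x1": x1, "x2": x2})
print(df.corr())  # off-diagonal entries close to 1.0 signal multicollinearity
```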
What causes multicollinearity?
A few common causes are listed below:
- Data-driven multicollinearity
- Insufficient data
- Dummy variables
Data-driven multicollinearity: Caused by poorly designed experiments, purely observational data, or data collection methods that cannot be manipulated. In some cases the variables end up highly correlated (usually because the data come from observational studies rather than controlled experiments) through no fault of the researcher. For this reason, you should run a designed experiment whenever possible, setting the levels of the predictor variables in advance.
Insufficient data: Sometimes too few observations can also lead to multicollinearity, because chance correlations between predictors are more likely in small samples.
Dummy variables: Dummy variables can be misused. For example, the researcher might add a dummy variable for every category instead of excluding one; the dummies then sum to a constant and become perfectly collinear with the model's intercept (the "dummy variable trap"), as illustrated in the sketch below.
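A minimal sketch of the dummy-variable trap, using a hypothetical `city` column (the column name and values are assumptions for illustration):

```python
# Encoding every category creates dummies that always sum to 1,
# which is perfectly collinear with the model's intercept.
import pandas as pd

df = pd.DataFrame({"city": ["delhi", "mumbai", "pune", "delhi"]})  # hypothetical data

trap = pd.get_dummies(df["city"])                   # one dummy per category -> trap
safe = pd.get_dummies(df["city"], drop_first=True)  # excludes one category

print(trap.sum(axis=1).unique())  # [1] for every row: redundant with the intercept
print(safe.columns.tolist())      # one fewer column, no redundancy
```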
How to identify multicollinearity?
A few common techniques are listed below:
- Variance Inflation Factor (VIF)
- Correlation matrix / heatmap plot
Variance Inflation Factor (VIF) — a measure of the degree of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a regression coefficient is the ratio of the variance of that coefficient in the full model to its variance in a model that includes only that single independent variable.
Now let's see how to calculate the VIF.
Step 1: Run an ordinary least squares regression in which Xi is a function of all the other explanatory variables. For i = 1, for example, the equation would be

X1 = α0 + α2X2 + α3X3 + … + αkXk + e

where α0 is a constant and e is the error term.
Step 2: Calculate the VIF for β̂i with the following formula:

VIFi = 1 / (1 − R2i)

where R2i is the coefficient of determination of the regression equation in step one, with Xi on the left-hand side and all the other predictor variables (all the other X variables) on the right-hand side.
Step 3: Interpret the VIF values using these rules of thumb:
- VIF = 1 (not collinear)
- 1 < VIF ≤ 5 (moderately collinear)
- VIF > 5 (highly collinear)
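Here is a sketch of computing VIF in Python using statsmodels' `variance_inflation_factor`. The housing-style features (`area`, `year_build`, `swimming_pool`) and the synthetic data are my assumptions, made to mirror the example discussed below:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
area = rng.normal(1500, 300, 200)
year_build = 1950 + 0.02 * area + rng.normal(0, 2, 200)  # strongly tied to area
swimming_pool = rng.integers(0, 2, 200).astype(float)    # independent of the others

# add_constant so VIFs are computed against a model with an intercept
X = add_constant(pd.DataFrame({"area": area,
                               "year_build": year_build,
                               "swimming_pool": swimming_pool}))

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)  # expect high VIF for area/year_build, near 1 for swimming_pool
```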
In the sketch above we can see that the "year_build" feature has high collinearity while "swimming_pool" has low collinearity.
Correlation matrix — a table showing the correlation coefficients between pairs of variables. Each variable (Xi) in the table is correlated with every other variable (Xj), which lets you see at a glance which pairs have the greatest correlation; plotting it as a heatmap makes this even easier to read.
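A minimal sketch of the heatmap approach with seaborn (synthetic data again; in practice you would pass your own DataFrame):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(0, 0.1, 50),  # highly correlated with x1
                   "x3": rng.normal(size=50)})          # independent

# Annotated heatmap of pairwise Pearson correlations
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heatmap")
plt.show()
```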
How to fix multicollinearity?
1. Dropping highly correlated features (useful when the dataset has few records; see the sketch at the end of this section)
2. PCA (Principal Component Analysis)
3. Ridge and Lasso regression
I will cover the last two topics (PCA and Ridge/Lasso) in my next post, so follow along for updates.
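For fix #1, here is a hedged sketch of dropping one feature from each highly correlated pair (the 0.9 threshold and the helper name `drop_correlated` are my assumptions; tune them for your data):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(2)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(0, 0.01, 100),  # near-duplicate of "a"
                   "c": rng.normal(size=100)})
print(drop_correlated(df).columns.tolist())  # ['a', 'c'] -- "b" is dropped
```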
- You can find the source code here 👈