Data Science Terminology (Part I)
Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. This simple idea can yield accurate predictions for a variety of time series problems.
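As a minimal sketch, an AR(2) model can be fit with statsmodels (the series here is synthetic, purely for illustration):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Synthetic example series: a noisy upward trend (illustrative only)
rng = np.random.RandomState(0)
series = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=100))

# AR(2): regress y_t on its two previous observations, y_{t-1} and y_{t-2}
model = AutoReg(series, lags=2).fit()

# Forecast the next 5 time steps
print(model.predict(start=len(series), end=len(series) + 4))
```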
Covariance is a measure of how much two random variables change together. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
Formula for covariance:
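$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ are their sample means, and $n$ is the number of pairs.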
Python implementation of covariance:
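Below is a minimal sketch in plain NumPy (the sample data is made up for illustration):

```python
import numpy as np

def covariance(x, y):
    """Sample covariance between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Made-up paired observations
x = [2.1, 2.5, 4.0, 3.6]
y = [8.0, 10.0, 12.0, 14.0]
print(covariance(x, y))      # manual implementation
print(np.cov(x, y)[0, 1])    # NumPy built-in, for comparison
```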
Correlation is a statistical measure that expresses the degree to which two variables are linearly related. Correlation values range from -1 to +1: +1 represents a perfect positive correlation, -1 represents a perfect negative correlation, and 0 means there is no linear relationship between the variables.
Formula for correlation:
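$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

This is Pearson's correlation coefficient: the covariance normalized by the product of the two standard deviations.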
Python implementation of correlation:
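A minimal NumPy sketch, again on made-up data:

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

# Made-up paired observations
x = [2.1, 2.5, 4.0, 3.6]
y = [8.0, 10.0, 12.0, 14.0]
print(correlation(x, y))          # manual implementation
print(np.corrcoef(x, y)[0, 1])    # NumPy built-in, for comparison
```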
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. For example, $r_1$ is the autocorrelation between $y_t$ and $y_{t-1}$; similarly, $r_2$ is the autocorrelation between $y_t$ and $y_{t-2}$.
Autocorrelation values also range from -1 to +1: +1 represents a perfect positive autocorrelation and -1 represents a perfect negative autocorrelation.
Formula for autocorrelation:
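$$r_k = \frac{\sum_{i=1}^{n-k} (Y_i - \bar{Y})(Y_{i+k} - \bar{Y})}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$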
where $Y_i$ is the $i$-th data point, $\bar{Y}$ is the mean of the series, $Y_{i+k}$ is the data point $k$ time steps ahead, and $n$ is the number of observations.
Python implementation of autocorrelation:
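A minimal NumPy sketch (the series is made up for illustration):

```python
import numpy as np

def autocorrelation(y, k):
    """Lag-k autocorrelation r_k of a one-dimensional series (k >= 1)."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    return np.sum(dev[:-k] * dev[k:]) / np.sum(dev ** 2)

# Made-up series
y = [3.0, 4.2, 5.1, 4.8, 6.0, 7.3, 6.9, 8.1]
print(autocorrelation(y, 1))  # r1: series vs. itself shifted by one step
print(autocorrelation(y, 2))  # r2: series vs. itself shifted by two steps
```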
Common applications of autocorrelation:
- Pattern recognition
- Signal detection
- Signal processing
- Technical analysis of stocks
Bias: According to Wikipedia, “bias is an error due to wrong assumptions in the learning algorithm.” High bias can cause the algorithm to miss the proper relationship between the features and the target outputs, which causes underfitting in your model.
Bias is an assumption made by the model to make the target function easier to learn. High-bias models are less flexible and cannot fully learn from the training data set.
“If a model has a high level of bias, it cannot learn enough from the training data; in general, models with low bias are preferred.”
Variance: According to Wikipedia, variance is an error from sensitivity to small fluctuations in the training set. High variance can cause the model to overfit. Variance determines how much the predictions of the model will change from one training data set to another.
“In general, we prefer a model with low variance.”
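To see both failure modes side by side, here is an illustrative sketch on a made-up sine-wave dataset: a degree-1 polynomial underfits (high bias), while a degree-15 polynomial fits the training points almost perfectly but generalizes poorly (high variance).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, size=30)

# Held-out points drawn from the same underlying curve
X_test = np.linspace(0, 1, 100)[:, None]
y_test = np.sin(2 * np.pi * X_test.ravel())

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # degree 1: both errors high (high bias, underfitting)
    # degree 15: tiny training error, large test error (high variance, overfitting)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```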
Bagging is an ensemble technique used when the goal is to reduce the variance of a decision tree model. The concept behind bagging is to create multiple subsets of the training data by sampling randomly with replacement; each subset is used to train its own decision tree, so we get a set of different models. To classify a new data point, the predictions of all the trees are combined by averaging (or, for classification, by majority vote).
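A minimal sketch with scikit-learn (the data is synthetic, and BaggingClassifier's default base estimator is already a decision tree); bagging often beats a single tree on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single, fully grown decision tree (high variance)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Bagging: 50 trees, each trained on a bootstrap sample (drawn with
# replacement); predictions are combined by majority vote
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", tree.score(X_te, y_te))
print("bagged trees accuracy:", bagged.score(X_te, y_te))
```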
Boosting is also an ensemble technique for creating a collection of predictors. If a given input data point is misclassified, its weight is increased so that the next hypothesis is more likely to classify it correctly; combining the whole set ultimately turns weak models into a more effective model.
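AdaBoost is the classic implementation of this re-weighting idea; a minimal sketch on synthetic data might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# After each round, AdaBoost increases the weights of the training points
# the current ensemble misclassifies, so the next weak learner (a shallow
# decision tree by default) concentrates on the hard cases.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("AdaBoost accuracy:", boosted.score(X_te, y_te))
```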
Gradient boosting is the best-known example of a boosting algorithm. It uses gradient descent and can optimize any differentiable loss function. Trees are built one at a time, and each new tree is fit to the errors of the ensemble built so far, then added to it sequentially.
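A minimal sketch with scikit-learn's GradientBoostingRegressor (synthetic data, squared-error loss, illustrative hyperparameters):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are added one at a time; each new tree is fit to the negative
# gradient of the loss (for squared error, simply the residuals) of the
# ensemble built so far, scaled by the learning rate.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", gbr.score(X_te, y_te))
```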
Multicollinearity occurs when two or more independent variables are highly correlated with each other in a regression model. This means that one independent variable can be predicted from another independent variable in the regression model. For more, please click here👈
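One common way to detect multicollinearity is the variance inflation factor (VIF); below is a minimal sketch using statsmodels on made-up predictors:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is almost a linear copy of x1, x3 is independent
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for predictor i measures how well it can be predicted from the other
# predictors; values far above ~5-10 are a common red flag.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```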
- See that 👏 icon? Send my article some claps
- Connect with me via LinkedIn, GitHub, and Medium👈, and buy me a coffee if you like this blog.
- You can find the source code here 👈