Linear Regression and its Assumptions

Vaishnavi Ambati
4 min read · Sep 7, 2021


What is Linear Regression?

Linear Regression is an algorithm used to model the linear relationship between independent variables and a dependent (target) variable. Simple Linear Regression is a way of predicting the target variable from a predictor variable by fitting the best line through the data. It is often referred to as Ordinary Least Squares or Linear Least Squares. Typically, you use regression to answer whether and how one phenomenon influences another, or how several variables are related. For example, we might examine the relationship between the age of used cars and their price.

Explanation

Imagine we have an independent variable (feature) called weight and we need to predict height. Typically, as weight increases, height also increases, with some outliers. So, in Linear Regression we try to find a line that fits the data as well as possible. In Simple Linear Regression, we can write the general equation as

y = W1*x + W0

which is called the Regression Equation. W1 and W0 are called Regression Coefficients.

In 2D space this equation takes the form of a straight line:

y = m*x + c

The objective of Linear Regression is to find the line (or plane) that best fits the data points. Now let's understand what best fit means. The best-fitted line minimizes the sum of errors across the training data. To obtain this line, we take the difference between the true value of y and the predicted value of y at every point, square it, sum these squares, and choose the coefficients that make this sum as small as possible, since we are trying to minimize the distance between the true and predicted values.

Mathematical Explanation

To find the best fit line, we minimize an error measure called the Sum of Squared Residuals (SSR). Let us consider two points P1(x1, y1) and P2(x2, y2); the error/residual at each point can be calculated as

e1 = y1 − (W1*x1 + W0),  e2 = y2 − (W1*x2 + W0)

In this way the error is calculated for every point, and the best line is the one with the lowest cumulative error. As the errors can be both positive and negative, they can cancel each other out. Hence, we square them.

SSR = Σ (yi − ŷi)² = Σ (yi − (W1*xi + W0))²
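As a quick sketch, the SSR above can be computed directly; the data points and the two candidate lines below are made up for illustration:

```python
import numpy as np

# Hypothetical data that lies exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def ssr(w1, w0, x, y):
    """Sum of Squared Residuals for the candidate line y_hat = w1*x + w0."""
    residuals = y - (w1 * x + w0)   # e_i = y_i - y_hat_i
    return float(np.sum(residuals ** 2))

print(ssr(2.0, 1.0, x, y))  # the true line: SSR = 0.0
print(ssr(1.0, 0.0, x, y))  # a worse line: SSR = 54.0
```

The line with the smaller SSR is the better fit; Linear Regression searches for the W1 and W0 that make this number as small as possible.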

Simple and Multiple Linear Regression

Simple Linear Regression is a way of predicting a target variable based on a single independent variable, by fitting a straight line through the data.

y = W0 + W1*x

But in Multiple Linear Regression, there can be more than one independent variable. A model with two X variables and one Y variable can be visualized as a 2D surface (a plane, in the simplest case) sitting in 3D space. The shape of the surface depends on the structure of the model.

y = W0 + W1*x1 + W2*x2 + … + Wn*xn
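Such a plane can be fitted with ordinary least squares. Here is a minimal sketch; the data is fabricated so that the true coefficients W0 = 1, W1 = 2, W2 = 3 are known in advance:

```python
import numpy as np

# Hypothetical data generated from the plane y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [1.0, 2.0],
              [3.0, 2.0],
              [2.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Add a column of ones so the intercept W0 is estimated too
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(coeffs, 6))  # recovers [W0, W1, W2] = [1, 2, 3]
```

With noise-free data the least-squares solution recovers the generating coefficients exactly; with real data it returns the plane minimizing the SSR.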

Assumptions of Linear Regression

In simple Linear Regression, the basic assumptions are

  • Additive: — the effect of each X variable on the response is independent of the other variables.
  • Linear: — the change in the dependent (Y) variable for a one-unit change in an independent (X) variable is constant.

Problems with Linear Regression

Following are the problems with Linear Regression:-

  • Non-constant variance of error terms -

This issue is called Heteroscedasticity. It means that the variance of the error terms varies across observations. This complicates analysis because standard regression inference is based on the assumption of equal variance (homoscedasticity).

  • Correlation of error terms -

We assume in the linear model that the error terms are uncorrelated. But in real-world scenarios, this might not be the case. Correlated errors lead to underestimated standard errors, and hence confidence intervals on the estimates that are narrower than they should be. This usually happens in time-series data.

  • Outliers -

Outliers are points whose true values lie far from the best fit line. These points have a high error (a large vertical distance from the line to the point). Outliers must be examined and handled carefully.

  • High Leverage points -

A data point has high leverage if it has extreme predictor values. If age is an independent variable in the range [18, 55] and one data point has an age of 200, then it is a high leverage point.
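One common way to quantify leverage is the diagonal of the hat matrix H = A (A^T A)^-1 A^T built from the design matrix. The ages below are hypothetical, echoing the example above:

```python
import numpy as np

# Hypothetical ages, with one extreme value far outside [18, 55]
age = np.array([18, 25, 30, 40, 55, 200], dtype=float)
A = np.column_stack([np.ones(len(age)), age])  # design matrix with intercept

# Leverage of each point = diagonal of the hat matrix H = A (A^T A)^-1 A^T
H = A @ np.linalg.inv(A.T @ A) @ A.T
leverage = np.diag(H)

print(np.round(leverage, 3))  # the age-200 row dominates
```

The leverages always sum to the number of fitted parameters (here 2), so a single point with leverage near 1 is absorbing almost one full parameter's worth of the fit.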

  • Multicollinearity -

When two or more predictors are highly correlated, we say that multicollinearity exists. Due to this effect, the standard errors of the coefficient estimates increase, making the individual coefficients unreliable.
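A standard way to detect multicollinearity is the Variance Inflation Factor (VIF). Here is a minimal sketch with synthetic predictors, where x2 is deliberately constructed to be almost a copy of x1:

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return float(1.0 / (1.0 - r2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)             # unrelated predictor
X = np.column_stack([x1, x2, x3])

print(vif(X, 0))  # large: x1 is almost perfectly explained by x2
print(vif(X, 2))  # close to 1: x3 is uncorrelated with the others
```

A common rule of thumb is that a VIF above 5 or 10 signals problematic multicollinearity.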

Code
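A minimal sketch of Simple Linear Regression using scikit-learn; the weight/height numbers are invented to echo the earlier example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weight (kg) -> height (cm) data
weight = np.array([50, 60, 70, 80, 90], dtype=float).reshape(-1, 1)
height = np.array([155, 162, 170, 176, 183], dtype=float)

model = LinearRegression()
model.fit(weight, height)

print(model.coef_[0], model.intercept_)  # W1 and W0 of the fitted line
print(model.predict([[75.0]]))           # predicted height for a 75 kg person
```

`model.coef_` and `model.intercept_` are the regression coefficients W1 and W0 that minimize the SSR on the training data.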

Conclusion

Linear Regression is one of the first algorithms that every Machine Learning beginner learns. Every Machine Learning enthusiast should understand both the algorithm and the math behind it.

