REGRESSION

Kübra Nur Akdoğan
Apr 3, 2022

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. In regression analysis, we investigate which events affect an observed event; there may be several such events, and they may act directly or indirectly. According to the number of variables in the model, regression analysis is divided into two types: simple linear regression and multiple linear regression.

Note!

Independent Variable : Usually denoted by x. It is the explanatory variable that is not affected by another variable but is the cause of y, or is thought to affect it.

Dependent Variable : Usually denoted by y. It is the variable that changes, or is affected (explained), depending on the variable x.

Assumptions of Regression Analysis

The regression model is based on the following assumptions.

  • The relationship between the independent variable(s) and the dependent variable is linear.
  • The expected value of the error term is zero.
  • The variance of the error term is constant for all values of the independent variable (the assumption of homoscedasticity).
  • There is no autocorrelation.
  • The independent variable is uncorrelated with the error term.
  • The error term is normally distributed.
  • On average, the difference between the observed value (yᵢ) and the predicted value (ŷᵢ) is zero.
  • On average, the estimated errors and the values of the independent variables are unrelated.
  • The squared differences between the observed and predicted values are of similar magnitude across observations.
  • There is some variation in the independent variable. If there is more than one variable in the equation, no two variables should be perfectly correlated.
  • If multiple regression analysis is performed and three or more parameters are estimated, the independent variables should not be related to each other. This is called the assumption of no multicollinearity.
  • The errors follow a normal distribution. If this normality assumption does not hold, a generalized linear model can be applied. Several of these assumptions can be checked from fitted residuals, as in the sketch after the definitions below.

Homoscedastic : Homoscedastic data have the same standard deviation across the different groups in the data.

Heteroscedastic : Heteroscedastic data have different standard deviations in different groups.

Autocorrelation : Autocorrelation refers to the degree of correlation between the values of the same variable across different observations in the data.
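Several of these assumptions can be checked informally from the residuals of a fitted line. Below is a minimal NumPy sketch; the data is simulated purely for illustration (it is not the article's example), and the checks are rough eyeball tests rather than formal hypothesis tests:

```python
import numpy as np

# Simulated data purely for illustration; substitute your own x and y.
rng = np.random.default_rng(42)
x = np.linspace(20, 70, 50)                  # e.g. ages
y = 85 + 1.1 * x + rng.normal(0, 5, 50)      # linear signal plus noise

# Fit a least-squares line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# 1. Expected value of the error term should be near zero.
print("mean residual:", residuals.mean())

# 2. Homoscedasticity: residual spread similar across the range of x.
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
print("std (low x):", low.std(), "std (high x):", high.std())

# 3. No autocorrelation: lag-1 correlation of residuals near zero.
print("lag-1 autocorr:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

# 4. Regressor uncorrelated with the error term.
print("corr(residuals, x):", np.corrcoef(residuals, x)[0, 1])
```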

REGRESSION MODEL

The relationship, expressed with the help of a function, is called the regression model. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine whether the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

A regression model relates Y to a function of X and β. Most regression models propose that Yᵢ is a function of Xᵢ and β, with εᵢ representing an additive error term that may stand in for unmodeled determinants of Yᵢ or random statistical noise: Yᵢ = f(Xᵢ, β) + εᵢ
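As a concrete and purely illustrative instance of this formulation, the sketch below takes f to be linear and generates observations by adding random noise to the systematic part. The parameter values and ranges are arbitrary assumptions, not estimates from data:

```python
import numpy as np

# Y_i = f(X_i, beta) + eps_i with a linear choice of f.
def f(x, beta):
    b0, b1 = beta
    return b0 + b1 * x                 # the hypothesized systematic part

rng = np.random.default_rng(0)
beta = (85.0, 1.14)                    # illustrative (intercept, slope)
x = rng.uniform(20, 70, size=5)        # illustrative regressor values
epsilon = rng.normal(0, 3.0, size=5)   # additive error term
y = f(x, beta) + epsilon               # the observed responses
print(np.column_stack([x, y]))
```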

[Figure: Steps to conduct a regression analysis]

Coefficient of Determination

The coefficient of determination, denoted R² (or r² in simple regression), is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

Calculating the coefficient of determination (R²)

R² = Explained Variation / Total Variation = Regression Sum of Squares (SSR) / Total Sum of Squares (SST)
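This ratio can be computed directly from the observed and fitted values. A minimal sketch follows; the helper name and the toy numbers are mine, not from the article:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = explained variation (SSR) / total variation (SST).
    Valid for least-squares fits that include an intercept."""
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
    return ssr / sst

# Toy usage with made-up observed and fitted values:
y = np.array([140.0, 150.0, 130.0, 145.0])
y_hat = np.array([141.0, 148.0, 132.0, 144.0])
print(r_squared(y, y_hat))
```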

LINEAR REGRESSION

Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

Simple Linear Regression And Multiple Linear Regression

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables. In a cause-and-effect relationship, the independent variable is the cause and the dependent variable is the effect. Least-squares linear regression is a method for predicting the value of a dependent variable y based on the value of an independent variable x. Mathematically, the regression model is represented by the following equation: y = β₀ + β₁x + ε

Where:

  • x: the independent variable
  • y: the dependent variable
  • β₁: the slope of the regression line
  • β₀: the intercept of the regression line with the y-axis

β₁ = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
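This closed-form expression, together with β₀ = ȳ − β₁x̄ used later in the example, translates directly into code. A minimal plain-Python sketch (the function name is mine):

```python
def slope_intercept(x, y):
    """Least-squares beta_0 and beta_1 via the closed-form sums above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n    # beta_0 = y_bar - beta_1 * x_bar
    return b0, b1
```

The blood-pressure example below plugs its summary totals into exactly these two expressions.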

[Figure: An example of simple linear regression, which has one independent variable]

Example: Linear Regression of Patients' Age and Blood Pressure

A study involving 10 patients is conducted to investigate the relationship between the patients' age and their blood pressure.

[Table: per-patient calculations for the linear regression of age and blood pressure; column totals: n = 10, ∑x = 491, ∑y = 1410, ∑xy = 71566, ∑x² = 26157]

Calculating the means (x̄, ȳ):

x̄ = ∑x / n = 491 / 10 = 49.1,  ȳ = ∑y / n = 1410 / 10 = 141

Calculating the regression coefficient:

β₁ = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

β₁ = (10 × 71566 − 491 × 1410) / (10 × 26157 − 491²)

β₁ = 23350 / 20489 ≈ 1.140

β₀ = ȳ − β₁x̄

β₀ = 141 − 1.140 × 49.1, so β₀ = 85.026

Then substitute the regression coefficients into the regression model:

Estimated blood pressure (ŷ) = 85.026 + 1.140 × age

Interpretation of the equation;

The constant (intercept) value β₀ = 85.026 indicates the estimated blood pressure at age zero.

The regression coefficient β₁ = 1.140 indicates that as age increases by one year, blood pressure increases by 1.140 units.

Applying each patient's age to the regression model gives the estimated blood pressure (ŷ) for each observation; these fitted values are then used to calculate the coefficient of determination (R²), as follows:

ANOVA equations for simple linear regression: SST = ∑(yᵢ − ȳ)² (total sum of squares), SSR = ∑(ŷᵢ − ȳ)² (regression sum of squares), and SSE = ∑(yᵢ − ŷᵢ)² (error sum of squares), with SST = SSR + SSE.

[Table: ANOVA values for this example, giving SSR = 2662.75 and SST = 3284]

Calculating the coefficient of determination (R²):

R² = Explained Variation / Total Variation = SSR / SST

Then substitute the values from the ANOVA table:

R² = 2662.75 / 3284 = 0.810

We can say that 81% of the variation in blood pressure is explained by age.
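The whole worked example can be verified in a few lines of Python using only the summary totals quoted above (the per-patient rows live in the original table). The age of 60 in the prediction step is an arbitrary illustrative input:

```python
# Summary totals from the article's table.
n, sum_x, sum_y = 10, 491, 1410
sum_xy, sum_x2 = 71566, 26157

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n
print(f"beta1 = {b1:.3f}")              # 1.140
print(f"beta0 = {b0:.3f}")              # 85.044 with the unrounded slope
                                        # (the article's 85.026 uses 1.140)

# Estimated blood pressure for a 60-year-old (illustrative age):
print(f"y_hat(60) = {b0 + b1 * 60:.1f}")

# R^2 from the ANOVA sums of squares quoted above.
ssr, sst = 2662.75, 3284
print(f"R^2 = {ssr / sst:.3f}")         # 0.811 (the article rounds to 0.810)
```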

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a dependent variable (target or criterion variable) based on the value of two or more independent variables (predictor or explanatory variables). Multiple regression allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time and lecture attendance "as a whole", but also the "relative contribution" of each independent variable in explaining the variance. Mathematically, the multiple regression model is represented by the following equation: Y = β₀ + β₁X₁ + ⋯ + βₙXₙ + εᵢ

Where:

  • X₁ to Xₙ: the independent variables
  • Y: the dependent variable
  • β₁: the regression coefficient of variable X₁
  • β₂: the regression coefficient of variable X₂
  • β₀: the intercept of the regression line with the y-axis

In both cases, εᵢ is an error term and the subscript i indexes a particular observation.

By using the method of deviations (with x₁, x₂, and y measured as deviations from their respective means):

  • β₁ = [(∑x₁y)(∑x₂²) − (∑x₂y)(∑x₁x₂)] / [(∑x₁²)(∑x₂²) − (∑x₁x₂)²]
  • β₂ = [(∑x₂y)(∑x₁²) − (∑x₁y)(∑x₁x₂)] / [(∑x₁²)(∑x₂²) − (∑x₁x₂)²]
  • β₀ = ȳ − β₁x̄₁ − β₂x̄₂
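These deviation-score formulas can be implemented directly. A minimal NumPy sketch follows; the function name and the demo data are fabricated for illustration, not taken from the article:

```python
import numpy as np

def two_predictor_ols(X1, X2, Y):
    """Two-predictor regression via the deviation-score formulas above."""
    X1, X2, Y = map(np.asarray, (X1, X2, Y))
    x1, x2, y = X1 - X1.mean(), X2 - X2.mean(), Y - Y.mean()  # deviations
    s11, s22, s12 = (x1**2).sum(), (x2**2).sum(), (x1*x2).sum()
    s1y, s2y = (x1*y).sum(), (x2*y).sum()
    denom = s11 * s22 - s12**2
    b1 = (s1y * s22 - s2y * s12) / denom
    b2 = (s2y * s11 - s1y * s12) / denom
    b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()
    return b0, b1, b2

# Demo: data generated from Y = 2 + 3*X1 - 1*X2, so the coefficients
# should be recovered exactly.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = 2 + 3 * X1 - 1 * X2
print(two_predictor_ols(X1, X2, Y))   # (2.0, 3.0, -1.0)
```

For more than two predictors these closed forms become unwieldy, which is why matrix methods or a library routine are normally used instead.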

