CORRELATION

Kübra Nur Akdoğan
6 min read · Feb 21, 2022

The relationships between variables can be analyzed with different statistical methods. Correlation and regression analysis are the most commonly used methods for statistically analyzing relationships between variables. In this article, correlation analysis will be examined; regression analysis is examined in my other article.

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data. In probability theory and statistics, correlation indicates the direction and strength of the linear relationship between two random variables. In general statistical use, correlation indicates how far two variables are from being independent. Correlation basically has two purposes:

  1. To help explain the relationship between variables.
  2. To predict future results based on that relationship.

TYPES OF CORRELATION

  1. Positive and Negative Correlation: Whether the correlation between the variables is positive or negative depends on the direction of change. The correlation is positive when both variables move in the same direction, i.e. when one variable increases the other, on average, also increases, and when one decreases the other also decreases. The correlation is said to be negative when the variables move in opposite directions, i.e. when one variable increases the other decreases, and vice versa.
  2. Simple, Partial and Multiple Correlation: Whether the correlation is simple, partial or multiple depends on the number of variables studied. The correlation is said to be simple when only two variables are studied. The correlation is either multiple or partial when three or more variables are studied. The correlation is said to be multiple when three or more variables are studied simultaneously; for example, studying the relationship between the yield of wheat per acre and both the amount of fertilizer used and rainfall is a problem of multiple correlation. In a partial correlation, by contrast, we study more than two variables but consider only two of them as influencing each other, with the effect of the other influencing variable held constant (the sketch after this list contrasts the two).
  3. Linear and Non-Linear (Curvilinear) Correlation: Whether the correlation between the variables is linear or non-linear depends on the constancy of the ratio of change between the variables. The correlation is said to be linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other. The correlation is called non-linear or curvilinear when the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.
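
To make the distinction between simple and partial correlation concrete, here is a minimal sketch. It assumes NumPy; the wheat-style variable names and the coefficients are invented for illustration. The partial correlation is computed the standard way, by correlating the residuals that remain after regressing out the controlled variable:

```python
import numpy as np

rng = np.random.default_rng(7)
# Invented data in the spirit of the wheat example: yield is driven
# by both fertilizer and rainfall, and fertilizer use tracks rainfall.
rainfall = rng.normal(size=500)
fertilizer = 0.5 * rainfall + rng.normal(size=500)
wheat_yield = 0.7 * fertilizer + 0.6 * rainfall + rng.normal(size=500)

def residuals(a, b):
    """The part of a not linearly explained by b."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

# Simple correlation: yield vs. fertilizer, with rainfall free to vary.
simple_r = np.corrcoef(wheat_yield, fertilizer)[0, 1]

# Partial correlation: the same pair with rainfall held constant,
# i.e. the correlation of the residuals after regressing out rainfall.
partial_r = np.corrcoef(residuals(wheat_yield, rainfall),
                        residuals(fertilizer, rainfall))[0, 1]

print(f"simple r = {simple_r:.3f}, partial r = {partial_r:.3f}")
```

The partial r comes out lower than the simple r here, because part of the simple association between yield and fertilizer is carried by rainfall.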

CORRELATION COEFFICIENT

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. As a result of the correlation analysis, whether there is a linear relationship, and if so the degree of that relationship, is expressed by the correlation coefficient. The correlation coefficient is denoted by “r”. Values of r range from -1 to +1. A correlation coefficient of 0 means that there is no linear relationship; a value of -1 is a perfect negative correlation, and a value of +1 indicates a perfect positive correlation.
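
As a quick illustration, the sketch below (assuming NumPy; the data are synthetic) produces a strongly positive, a strongly negative, and an essentially unrelated pair of variables and prints r for each:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Three illustrative cases: strong positive, strong negative,
# and no (linear) relationship at all.
y_pos = 2 * x + rng.normal(scale=0.5, size=100)
y_neg = -2 * x + rng.normal(scale=0.5, size=100)
y_none = rng.normal(size=100)

for label, y in [("positive", y_pos), ("negative", y_neg), ("none", y_none)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label:>8}: r = {r:+.3f}")
```

The first r lands near +1, the second near -1, and the third near 0, matching the interpretation above.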

METHODS OF MEASUREMENT OF CORRELATION

Quantifying the relationship between variables is essential to benefit from the study of correlation, and there are various methods for measuring it. A few of them: the Pearson product-moment coefficient, rank correlations, the point biserial correlation coefficient, and measures of association (C, V, lambda). In this article, I will explain the most important coefficient: the Pearson product-moment coefficient.
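
For orientation, here is a brief sketch of several of these coefficients using SciPy's implementations on synthetic data (the point biserial coefficient needs a dichotomous variable, so one is invented here; the C, V and lambda measures of association are not shown):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = x ** 3 + rng.normal(scale=0.5, size=200)   # monotone but nonlinear
group = (rng.random(200) > 0.5).astype(int)    # an invented dichotomous variable

r, _ = stats.pearsonr(x, y)            # Pearson product-moment coefficient
rho, _ = stats.spearmanr(x, y)         # a rank correlation
tau, _ = stats.kendalltau(x, y)        # another rank correlation
rpb, _ = stats.pointbiserialr(group, x)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, "
      f"Kendall tau = {tau:.3f}, point biserial = {rpb:.3f}")
```

On data like this the rank correlations come out higher than Pearson's r, because the relationship is monotone but not linear.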

Pearson Product-Moment Coefficient

The Pearson correlation coefficient, also known as Pearson’s r, the Pearson product-moment correlation coefficient, the bivariate correlation, or colloquially simply the correlation coefficient, is a measure of linear correlation between two sets of data. It expresses the relationship between variables measured on interval or ratio scales. It is obtained as the ratio of the covariance of the two variables to the product of their standard deviations, i.e. the covariance normalized by the square root of the product of their variances.

Formulation (for sample)

For a sample of n paired observations (x_i, y_i), the sample Pearson correlation coefficient is

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Note that \bar{x} and \bar{y} denote the sample means of x and y.
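
A direct implementation of this formula is easy to check against NumPy's built-in np.corrcoef; the sketch below assumes NumPy, and pearson_r is a helper defined here, not a library function:

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation, computed directly from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(size=50)

# The two values should agree up to floating-point error.
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```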

Figure: several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row). The plot in the center of the middle row has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

Symmetry property

The correlation coefficient is symmetric: corr(X, Y) = corr(Y, X). This is verified by the commutative property of multiplication inside the covariance.

Mathematical Property

A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.) Note that more general linear transformations do change the correlation.
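
The sketch below (assuming NumPy; the constants 1.8 and 32 echo a Celsius-to-Fahrenheit conversion and are chosen purely for illustration) checks both the symmetry property above and this affine invariance numerically:

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]

# Symmetry: corr(X, Y) == corr(Y, X).
assert np.isclose(r, np.corrcoef(y, x)[0, 1])

# Invariance under a + bX and c + dY with b, d > 0,
# e.g. re-expressing X in different units.
x_shifted = 32 + 1.8 * x
y_shifted = -5 + 3.0 * y
assert np.isclose(r, np.corrcoef(x_shifted, y_shifted)[0, 1])

print("r is unchanged:", round(r, 3))
```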

Assumptions

Of primary importance are linearity and normality. Pearson’s r requires interval data and measures only a linear relationship. Further assumptions must be met in order to establish statistical significance: “…for the test statistic to be valid the sample distribution has to be normally distributed”. Homoscedasticity assumes that the variance of the errors is constant at each level of the independent variable. A violation of the assumption of homoscedasticity increases the chance of obtaining a statistically significant result even though H0 is true.
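
As a sketch of how these checks might look in practice (assuming SciPy; the data are synthetic), one can test each sample for normality before trusting the significance test attached to r:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(scale=0.8, size=100)

# Shapiro-Wilk test of the normality assumption for each variable;
# a small p-value would cast doubt on the validity of the test below.
_, p_x = stats.shapiro(x)
_, p_y = stats.shapiro(y)
print(f"Shapiro p-values: x = {p_x:.3f}, y = {p_y:.3f}")

# Pearson r with a two-sided significance test of H0: no correlation.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
```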

Coefficient of Determination

Another value of use in correlation analysis is the coefficient of determination, represented as r². Because it is a square, it is always a positive number and varies between 0 and 1. Squaring the correlation coefficient r gives the proportion of the total variability in Y that is accounted for after regressing Y on X, so r² can be considered a measure of the strength of the linear relationship. Multiplying the resulting value by 100 gives a percentage of variance: e.g., if the correlation coefficient for X and Y is r = .50, then r² = (.50)² = .25, i.e. 25%, so X explains 25% of the variability in Y.
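
The sketch below (assuming NumPy; synthetic data) confirms numerically that squaring r gives the proportion of variance in Y explained by a least-squares regression of Y on X:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]

# Regress y on x with least squares, then compare r**2 to the
# proportion of variance in y explained by the fitted line.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
explained = 1 - np.var(y - y_hat) / np.var(y)

print(f"r^2 = {r ** 2:.4f}, explained variance = {explained:.4f}")
```

The two printed numbers match, which is exactly the identity used in the percentage interpretation above.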

SOME COMMON MISCONCEPTIONS ABOUT CORRELATION

Correlation and Causality

A phrase about correlation that every user of statistics should know is this: correlation does not imply causation. Many people seem to believe that when a relationship is established between two variables, one must be the cause and the other the effect. Indeed, causality and correlation are interconnected concepts: correlation is necessary to prove causality, but it is not sufficient to demonstrate it.

Correlation and Linearity

Normally distributed and uncorrelated does not imply independent. The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. This caveat becomes even more important when the data are not normally distributed. In particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y|X).
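
A classic illustration of this point (a minimal sketch, assuming NumPy): take X standard normal and Y = X². Then Y is completely determined by X, yet their Pearson correlation is approximately zero, because E(Y|X) = X² is not linear in X:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = x ** 2  # y is a deterministic function of x, hence dependent on it

# Pearson r is close to 0: the relationship is perfectly strong
# but nonlinear, so linear correlation misses it entirely.
print("corr(x, y) =", round(np.corrcoef(x, y)[0, 1], 4))
```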

REFERENCES

Charles Wheelan, Naked Statistics

Peerapat Wongchaiwat, Introduction to Linear Regression and Correlation Analysis

Kean University, Regression and Correlation
