Understand the relationship between two variables using correlation
It is important to discover the degree to which the variables in your dataset depend on each other. This helps us prepare a good dataset that meets the expectations of our machine learning algorithms, because a poorly prepared dataset will degrade model performance.
In this tutorial, we are going to discover correlation, a statistical summary of the relationship between two variables.
After completing this tutorial, you will know:
- How to calculate a covariance matrix to summarise the linear relationship between two or more variables.
- How to calculate the Pearson’s correlation coefficient to summarise the linear relationship between two variables.
- How to calculate the Spearman’s correlation coefficient to summarise the monotonic relationship between two variables.
What is correlation?
Variables in a dataset can be related to each other for many reasons. For example:
- One variable could cause or depend on the values of another variable
- One variable could be weakly associated with another variable
- Two variables could depend on a third unknown variable
It can be useful in data analysis and modelling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.
A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.
- Positive Correlation - Both variables change in the same direction.
- Neutral Correlation - No relationship in the change of the variables.
- Negative Correlation - Variables change in opposite directions.
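The three cases above can be demonstrated with contrived data. As a minimal sketch (using NumPy's `corrcoef`, an assumption on our part since the tutorial computes correlation with pandas later):

```python
import numpy as np

rng = np.random.RandomState(1)  # fixed seed for reproducibility
x = rng.randn(1000)

positive = x + 0.1 * rng.randn(1000)   # moves in the same direction as x
negative = -x + 0.1 * rng.randn(1000)  # moves in the opposite direction
neutral = rng.randn(1000)              # drawn independently of x

print('positive: %.2f' % np.corrcoef(x, positive)[0, 1])
print('negative: %.2f' % np.corrcoef(x, negative)[0, 1])
print('neutral:  %.2f' % np.corrcoef(x, neutral)[0, 1])
```

The positive and negative cases give coefficients near +1 and -1 respectively, while the neutral case gives a value near zero.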
Dataset:
We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be the values of the first variable with Gaussian noise added, with a mean of 50 and a standard deviation of 10.
The Python code to generate the test dataset is given below:
```python
import numpy as np
from matplotlib import pyplot

# prepare dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# summarise
print('data1: mean=%.3f stdv=%.3f' % (np.mean(data1), np.std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (np.mean(data2), np.std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()
```
A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables.
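The first objective listed above mentions the covariance matrix, which summarises the linear relationship of two or more variables in one symmetric matrix. A minimal sketch against the same contrived dataset (the fixed seed is our addition, so the numbers are reproducible):

```python
import numpy as np

np.random.seed(1)  # fixed seed, an addition for reproducibility
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)

# covariance matrix: the diagonal holds each variable's variance,
# the off-diagonal entries hold the covariance between data1 and data2
covariance = np.cov(data1, data2)
print(covariance)
```

A positive off-diagonal value confirms the variables change in the same direction, but its magnitude depends on the scale of the data, which is why the normalised correlation coefficient below is easier to interpret.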
Pearson’s Correlation
The Pearson correlation coefficient (named for Karl Pearson) can be used to summarise the strength of the linear relationship between two data samples.
The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalisation of the covariance between the two variables to give an interpretable score.
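That definition can be sketched directly: divide the sample covariance by the product of the two standard deviations, and check the result against NumPy's built-in `corrcoef` (the fixed seed is our addition):

```python
import numpy as np

np.random.seed(1)
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)

# Pearson's r = cov(x, y) / (std(x) * std(y))
cov = np.cov(data1, data2)[0, 1]  # sample covariance (ddof=1)
pearson = cov / (np.std(data1, ddof=1) * np.std(data2, ddof=1))

print('manual  : %.3f' % pearson)
print('builtin : %.3f' % np.corrcoef(data1, data2)[0, 1])
```

Note that `np.cov` uses the sample (ddof=1) convention by default, so the standard deviations must use `ddof=1` as well for the two numbers to agree exactly.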
The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution.
The result of the calculation, the correlation coefficient can be interpreted to understand the relationship.
The coefficient returns a value between -1 and 1 that represents the limits of correlation, from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted, where often a value below -0.5 or above 0.5 indicates a notable correlation, and values closer to zero suggest a less notable correlation.
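Those rules of thumb can be captured in a small helper. The thresholds and labels are the ones given above; the function name is ours:

```python
def interpret_correlation(r):
    """Classify a correlation coefficient using the +/-0.5 rule of thumb."""
    if r <= -0.5:
        return 'notable negative'
    if r >= 0.5:
        return 'notable positive'
    if r == 0:
        return 'no correlation'
    return 'less notable'

print(interpret_correlation(0.9))   # notable positive
print(interpret_correlation(-0.7))  # notable negative
print(interpret_correlation(0.2))   # less notable
```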
We can calculate the correlation between the two variables in our test problem. The pandas DataFrame provides a function to calculate the correlation between two variables. The sample code is given below:
```python
import pandas as pd
import numpy as np

# prepare dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# create data frame
data_set = pd.DataFrame(columns=['data1', 'data2'])
data_set['data1'] = data1
data_set['data2'] = data2
# calculate Pearson correlation
print('Pearson correlation =', data_set['data1'].corr(data_set['data2'], method='pearson'))
```
The result is given below:
```
Pearson correlation = 0.897870771565
```
We can see that the two variables are positively correlated and that the correlation is approximately 0.9. This suggests a high level of correlation, e.g. a value above 0.5 and close to 1.0.
The Pearson’s correlation coefficient can be used to evaluate the relationship between more than two variables.
This can be done by calculating a matrix of the relationships between each pair of variables in the dataset. The result is a symmetric matrix called a correlation matrix with a value of 1.0 along the main diagonal as each column always perfectly correlates with itself.
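A sketch of that correlation matrix for three variables (the third, `data3`, and the fixed seed are our additions for illustration): pandas computes all pairwise coefficients at once with `DataFrame.corr`:

```python
import pandas as pd
import numpy as np

np.random.seed(1)
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
data3 = np.random.randn(1000)  # an unrelated variable, for contrast

data_set = pd.DataFrame({'data1': data1, 'data2': data2, 'data3': data3})

# symmetric correlation matrix with 1.0 along the main diagonal
print(data_set.corr(method='pearson'))
```

The `data1`/`data2` entry is high, the entries involving `data3` are near zero, and the diagonal is all 1.0, as described above.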
Spearman’s Correlation
The Spearman’s correlation coefficient (named for Charles Spearman) can be used to summarise the strength of the monotonic relationship between two data samples. This test of relationship can also be used if there is a linear relationship between the variables, but will have slightly less power (e.g. may result in lower coefficient scores).
As with the Pearson correlation coefficient, the scores range from -1 for perfectly negatively correlated variables to 1 for perfectly positively correlated variables.
Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of values on each sample. This is a common approach used in non-parametric statistics, e.g. statistical methods where we do not assume a distribution of the data such as Gaussian.
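A sketch of that rank-based idea: Spearman’s coefficient is Pearson’s coefficient applied to the ranks of each sample. Below we rank with pandas `Series.rank` and check the result against the built-in `spearman` method (the fixed seed is our addition):

```python
import pandas as pd
import numpy as np

np.random.seed(1)
s1 = pd.Series(20 * np.random.randn(1000) + 100)
s2 = s1 + (10 * np.random.randn(1000) + 50)

# Spearman = Pearson computed on the ranks of the values
via_ranks = s1.rank().corr(s2.rank(), method='pearson')
builtin = s1.corr(s2, method='spearman')

print('via ranks: %.3f' % via_ranks)
print('builtin  : %.3f' % builtin)
```

Because only the ranks enter the calculation, the coefficient is unaffected by any monotonic transformation of the data, which is what makes it distribution-free.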
A linear relationship between the variables is not assumed, although a monotonic relationship is assumed. This is a mathematical name for an increasing or decreasing relationship between the two variables.
If you are unsure of the distribution and possible relationships between two variables, the Spearman correlation coefficient is a good tool to use. The sample code is given below:
```python
import pandas as pd
import numpy as np

# prepare dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# create data frame
data_set = pd.DataFrame(columns=['data1', 'data2'])
data_set['data1'] = data1
data_set['data2'] = data2
# calculate Spearman correlation
print('Spearman correlation =', data_set['data1'].corr(data_set['data2'], method='spearman'))
```
The result for the above code is given below:
```
Spearman correlation = 0.873144093144
```
We know that the data is Gaussian and that the relationship between the variables is linear. Nevertheless, the non-parametric rank-based approach shows a strong correlation of about 0.87 between the variables.
Thanks for reading this post.