References:
1. Pearson correlation coefficient in Python: https://www.cnblogs.com/lxnz/p/7098954.html
2. The three major correlation coefficients in statistics (Pearson, Spearman, Kendall): http://blog.sina.com.cn/s/blog_69e75efd0102wmd2.html
Pearson correlation coefficient
The Pearson correlation coefficient ρ(X,Y) of two variables (X, Y) is defined as their covariance cov(X, Y) divided by the product of their standard deviations σX and σY:
ρ(X,Y) = cov(X, Y) / (σX·σY)
The expanded forms that follow from this definition are just derivations and can be ignored for now.
The denominator of the formula is the product of the standard deviations, which means that when computing the Pearson correlation coefficient neither standard deviation may be 0 (the denominator cannot be 0); in other words, neither variable may take the same value for every observation. If a variable shows no variation at all, the Pearson correlation coefficient cannot tell you whether it is correlated with another variable.
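To see this caveat in practice, here is a minimal sketch (with made-up data and purely illustrative column names): pandas cannot produce a Pearson correlation for a column whose values are all identical, because its standard deviation is 0.
import pandas as pd
# hypothetical data: column 'C' is constant, so its standard deviation is 0
df_const = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 9], 'C': [5, 5, 5, 5]})
# the correlations involving 'C' come out as NaN
print(df_const.corr())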
The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient: a statistic that reflects the degree of linear correlation between two variables. The sample correlation coefficient is denoted r, where n is the sample size and xi, yi and x̄, ȳ are the observations and means of the two variables:
r = Σ(xi − x̄)(yi − ȳ) / sqrt(Σ(xi − x̄)² · Σ(yi − ȳ)²)
r describes the strength of the linear relationship between the two variables; the larger the absolute value of r, the stronger the correlation.
A simple classification of correlation strength by the absolute value of r:
0.8-1.0 very strong correlation
0.6-0.8 strong correlation
0.4-0.6 moderate correlation
0.2-0.4 weak correlation
0.0-0.2 very weak or no correlation
r describes the strength of the linear relationship between two variables and takes values between -1 and +1. If r > 0, the two variables are positively correlated: the larger the value of one variable, the larger the value of the other. If r < 0, the two variables are negatively correlated: the larger the value of one variable, the smaller the value of the other. The larger the absolute value of r, the stronger the correlation. Note that correlation says nothing about causation.
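As a quick sanity check of the definition above, the following sketch (with made-up data) computes r directly from the sample formula and compares it with scipy.stats.pearsonr; the two results should agree.
import numpy as np
from scipy.stats import pearsonr
# hypothetical sample data, chosen only to illustrate the formula
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 1.9, 3.5, 4.4, 5.2])
# r = sum((xi - mean_x)(yi - mean_y)) / sqrt(sum((xi - mean_x)^2) * sum((yi - mean_y)^2))
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)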
Spearman correlation coefficient
The Spearman correlation coefficient is usually also called the Spearman rank correlation coefficient. A "rank" can simply be understood as an ordering or sort position, so the coefficient is computed from the positions of the original data after sorting, which removes the restrictions that apply when computing the Pearson coefficient. Its formula is:
ρs = 1 − 6·Σdi² / (n·(n² − 1))
The procedure is: first sort the data of the two variables (X, Y) and record the sorted positions (X', Y'); these positions (X', Y') are called ranks. The differences between the ranks are the di in the formula above, n is the number of data points, and substituting into the formula gives the result.
Substituting the example values into the formula gives the Spearman correlation coefficient: ρs = 1 − 6×(1+1+1+9) / (6×35) = 0.657
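A minimal sketch of this procedure, using made-up data with no tied values, is shown below: rank both variables, form the rank differences di, apply the formula, and check the result against scipy.stats.spearmanr (the two agree when there are no ties).
import numpy as np
from scipy.stats import rankdata, spearmanr
# hypothetical data without ties, purely for illustration
x = np.array([3, 8, 6, 10, 1, 2])
y = np.array([9, 5, 6, 8, 2, 4])
# rank each variable, then take the rank differences di
d = rankdata(x) - rankdata(y)
n = len(x)
# rho_s = 1 - 6 * sum(di^2) / (n * (n^2 - 1))
rho_manual = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
rho_scipy, p_value = spearmanr(x, y)
print(rho_manual, rho_scipy)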
Moreover, even when one variable's values do not vary at all, we do not run into the Pearson-style situation where the denominator is 0 and the coefficient cannot be computed. In addition, even if there are outliers, the rank of an outlier usually does not change much (if it is extremely large or extremely small, it simply ranks first or last), so its effect on the Spearman correlation coefficient is very small.
Because the Spearman correlation coefficient imposes no such conditions on the data, it can be applied far more widely.
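The following sketch (again with made-up data) illustrates that robustness: replacing the largest value with an extreme outlier leaves every rank unchanged, so the Spearman coefficient is unaffected, while the Pearson coefficient drops noticeably.
import numpy as np
from scipy.stats import pearsonr, spearmanr
# hypothetical data with a clean linear relationship
x = np.arange(1.0, 11.0)
y = 2.0 * x
y_outlier = y.copy()
y_outlier[-1] = 1000.0  # the largest value becomes an extreme outlier, but its rank stays last
print(pearsonr(x, y)[0], pearsonr(x, y_outlier)[0])    # Pearson drops noticeably
print(spearmanr(x, y)[0], spearmanr(x, y_outlier)[0])  # Spearman stays at 1.0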
Kendall correlation coefficient
The Kendall correlation coefficient, also called the Kendall rank correlation coefficient, is another rank correlation coefficient, but it is computed on categorical variables.
A categorical variable can be understood as a variable whose values are categories; such variables can be
unordered, e.g. sex (male, female) or blood type (A, B, O, AB);
ordered, e.g. obesity level (severely obese, moderately obese, mildly obese, not obese).
The variables for which we usually want a correlation coefficient are the ordered categorical ones.
In the formula, Nc is the number of pairs on which the subjective and objective ratings agree (concordant pairs), and Nd is the number of pairs on which the subjective and objective ratings differ (discordant pairs):
τ = (Nc − Nd) / (n·(n − 1)/2)
For example, suppose judges rate contestants (excellent, fair, poor, and so on) and we want to know whether two (or more) judges apply consistent standards to the same contestants; or suppose several hospitals issue urine-sugar lab reports and we want to check whether their results are consistent. In such cases the Kendall correlation coefficient is the appropriate measure.
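To make Nc and Nd concrete, here is a minimal sketch (with made-up ratings and no ties) that counts concordant and discordant pairs by brute force, applies τ = (Nc − Nd) / (n(n−1)/2), and compares the result with scipy.stats.kendalltau.
from itertools import combinations
from scipy.stats import kendalltau
# hypothetical rankings of six contestants by two judges (no ties)
judge1 = [1, 2, 3, 4, 5, 6]
judge2 = [2, 1, 4, 3, 6, 5]
nc = nd = 0
for (a1, a2), (b1, b2) in combinations(zip(judge1, judge2), 2):
    # a pair of contestants is concordant if both judges order them the same way
    if (a1 - b1) * (a2 - b2) > 0:
        nc += 1
    else:
        nd += 1
n = len(judge1)
tau_manual = (nc - nd) / (n * (n - 1) / 2)
tau_scipy, p_value = kendalltau(judge1, judge2)
print(tau_manual, tau_scipy)  # the two agree when there are no ties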
pandas implementation
pandas.DataFrame.corr()
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values.
Parameters:
method : {‘pearson’, ‘kendall’, ‘spearman’}
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns:y : DataFrame
import pandas as pd
# a small example DataFrame with three columns
df = pd.DataFrame({'A':[5,91,3],'B':[90,15,66],'C':[93,27,3]})
# Pearson (default), Spearman and Kendall correlation matrices
print(df.corr())
print(df.corr('spearman'))
print(df.corr('kendall'))
# a second DataFrame with slightly shifted values, for comparison
df2 = pd.DataFrame({'A':[7,93,5],'B':[88,13,64],'C':[93,27,3]})
print(df2.corr())
print(df2.corr('spearman'))
print(df2.corr('kendall'))
numpy implementation
numpy.corrcoef(x, y=None, rowvar=True, bias=<no value>, ddof=<no value>)
Return Pearson product-moment correlation coefficients.
See the documentation for numpy.cov for more details. The relationship between the correlation coefficient matrix R and the covariance matrix C is
Rij = Cij / sqrt(Cii·Cjj)
The values of R lie between -1 and 1, inclusive.
Parameters:
x : array_like
A 1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same shape as x.
rowvar : bool, optional
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
bias : _NoValue, optional Has no effect, do not use. Deprecated since version 1.10.0.
ddof : _NoValue, optional Has no effect, do not use. Deprecated since version 1.10.0.
Returns:
R : ndarray The correlation coefficient matrix of the variables.
import numpy as np
vc = [1, 2, 39, 0, 8]
vb = [1, 2, 38, 0, 8]
# Pearson correlation by hand: covariance divided by the product of the standard deviations
print(np.mean(np.multiply((vc - np.mean(vc)), (vb - np.mean(vb)))) / (np.std(vb) * np.std(vc)))
# corrcoef returns the correlation coefficient matrix (how similar the vectors are)
print(np.corrcoef(vc, vb))
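To illustrate the rowvar parameter described above (with made-up numbers), corrcoef can also take a single 2-D array: by default each row is treated as a variable, and passing rowvar=False treats each column as a variable instead.
import numpy as np
# hypothetical data: 3 variables observed 5 times each, one variable per row
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [2.0, 4.0, 6.0, 8.0, 10.0],
                 [5.0, 3.0, 4.0, 1.0, 2.0]])
print(np.corrcoef(data))                   # 3x3 matrix, rows are variables (rowvar=True by default)
print(np.corrcoef(data.T, rowvar=False))   # same 3x3 matrix, columns are variables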
Spearman’s Rank Correlation
Spearman’s rank correlation is named for Charles Spearman.
It may also be called Spearman's correlation coefficient and is denoted by the lowercase Greek letter rho (ρ). As such, it may be referred to as Spearman's rho.
This statistical method quantifies the degree to which ranked variables are associated by a monotonic function, meaning an increasing or decreasing relationship. As a statistical hypothesis test, the method assumes under the null hypothesis (H0) that the samples are uncorrelated.
The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement.
— Page 124, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.
The intuition for the Spearman's rank correlation is that it calculates a Pearson's correlation (i.e. a parametric measure of correlation) using the rank values instead of the real values. The Pearson's correlation is the covariance between the two variables (the expected product of the deviations of the observations from their means) normalized by the spread, i.e. the standard deviations, of both variables.
Spearman’s rank correlation can be calculated in Python using the spearmanr() SciPy function.
The function takes two real-valued samples as arguments and returns both the correlation coefficient in the range between -1 and 1 and the p-value for interpreting the significance of the coefficient.
# calculate spearman's correlation
coef, p = spearmanr(data1, data2)
We can demonstrate the Spearman’s rank correlation on the test dataset. We know that there is a strong association between the variables in the dataset and we would expect the Spearman’s test to find this association.
The complete example is listed below.
# calculate the spearman's correlation between two variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
# calculate spearman's correlation
coef, p = spearmanr(data1, data2)
print('Spearmans correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.05
if p > alpha:
print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
print('Samples are correlated (reject H0) p=%.3f' % p)
Running the example calculates the Spearman’s correlation coefficient between the two variables in the test dataset.
The statistical test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, meaning the probability of observing this data if the samples were truly uncorrelated is very small, so we can reject the null hypothesis that the samples are uncorrelated at the 95% confidence level.
Spearmans correlation coefficient: 0.900
Samples are correlated (reject H0) p=0.000
Kendall’s Rank Correlation
Kendall’s rank correlation is named for Maurice Kendall.
It is also called Kendall's correlation coefficient, and the coefficient is often referred to by the lowercase Greek letter tau (τ). In turn, the test may be called Kendall's tau.
The intuition for the test is that it calculates a normalized score for the number of matching or concordant rankings between the two samples. As such, the test is also referred to as Kendall’s concordance test.
The Kendall’s rank correlation coefficient can be calculated in Python using the kendalltau() SciPy function. The test takes the two data samples as arguments and returns the correlation coefficient and the p-value. As a statistical hypothesis test, the method assumes (H0) that there is no association between the two samples.
# calculate kendall's correlation
coef, p = kendalltau(data1, data2)
We can demonstrate the calculation on the test dataset, where we do expect a significant positive association to be reported.
The complete example is listed below.
# calculate the kendall's correlation between two variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import kendalltau
# seed random number generator
seed(1)
# prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
# calculate kendall's correlation
coef, p = kendalltau(data1, data2)
print('Kendall correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.05
if p > alpha:
print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
print('Samples are correlated (reject H0) p=%.3f' % p)
Running the example calculates the Kendall's correlation coefficient as approximately 0.7, indicating a strong correlation.
The p-value is close to zero (and printed as zero), as with the Spearman’s test, meaning that we can confidently reject the null hypothesis that the samples are uncorrelated.
Kendall correlation coefficient: 0.709
Samples are correlated (reject H0) p=0.000
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- List three examples where calculating a nonparametric correlation coefficient might be useful during a machine learning project.
- Update each example to calculate the correlation between uncorrelated data samples drawn from a non-Gaussian distribution.
- Load a standard machine learning dataset and calculate the pairwise nonparametric correlation between all variables.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.
- Applied Nonparametric Statistical Methods, Fourth Edition, 2007.
- Rank Correlation Methods, 1990.
Articles
- Nonparametric statistics on Wikipedia
- Rank correlation on Wikipedia
- Spearman’s rank correlation coefficient on Wikipedia
- Kendall rank correlation coefficient on Wikipedia
- Goodman and Kruskal’s gamma on Wikipedia
- Somers’ D on Wikipedia
Summary
In this tutorial, you discovered rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.
Specifically, you learned:
- How rank correlation methods work and the methods that are available.
- How to calculate and interpret the Spearman’s rank correlation coefficient in Python.
- How to calculate and interpret the Kendall’s rank correlation coefficient in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Spearman's Rank Correlation
Preliminaries
import numpy as np
import pandas as pd
import scipy.stats
Create Data
# Create two lists of random values
x = [1,2,3,4,5,6,7,8,9]
y = [2,1,2,4.5,7,6.5,6,9,9.5]
Calculate Spearman’s Rank Correlation
Spearman’s rank correlation is the Pearson’s correlation coefficient of the ranked version of the variables.
# Create a function that takes in x's and y's
def spearmans_rank_correlation(xs, ys):
# Calculate the rank of x's
xranks = pd.Series(xs).rank()
    # Calculate the ranking of the y's
yranks = pd.Series(ys).rank()
# Calculate Pearson's correlation coefficient on the ranked versions of the data
return scipy.stats.pearsonr(xranks, yranks)
# Run the function
spearmans_rank_correlation(x, y)[0]
0.90377360145618091
Calculate Spearman’s Correlation Using SciPy
# Just to check our results, here is Spearman's using SciPy
scipy.stats.spearmanr(x, y)[0]
0.90377360145618102