Calculating Pearson correlation and significance in Python

QBoard » Artificial Intelligence & ML » AI and ML - Conceptual » Calculating Pearson correlation and significance in Python

User Dashboard

Calculating Pearson correlation and significance in Python

Back To Topics

Tags : statistics python numpy scipy

Sai Anirudh

24

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

December 14, 2020 1:22 PM IST

0

Viaan Prakash

461

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

December 15, 2020 11:34 AM IST

Laksh Nath

126

Rather than rely on numpy/scipy, I think my answer should be the easiest to code and understand the steps in calculating the Pearson Correlation Coefficient (PCC) .

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x: 
        scorex.append((i - mean(x))/sampleStandardDeviation(x)) 

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

The significance of PCC is basically to show you how strongly correlated the two variables/lists are. It is important to note that the PCC value ranges from -1 to 1. A value between 0 to 1 denotes a positive correlation. Value of 0 = highest variation (no correlation whatsoever). A value between -1 to 0 denotes a negative correlation.

December 29, 2020 3:17 PM IST

Tarun Reddy

If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

December 29, 2020 3:25 PM IST

Member Sign In

Member Sign In

Create Account

Calculating Pearson correlation and significance in Python

Connect With Us