What is a correlation coefficient? The r value in statistics, explained

Correlation is a great tool for learning how one thing changes with another. After reading this, you should understand what correlation is, how to think about it in your own work, and be able to code up a minimal implementation to calculate it.

Correlation is about how two things change with each other

Correlation is an abstract math concept, but you probably already have some idea of what it means. Here are examples of the three general categories of correlation.

As you eat more food, you will probably feel more full. This is a case where two things change together in the same way: one goes up (the amount of food eaten), and so does the other (how full you feel). This is a positive correlation.

When you drive your car faster, you arrive at your destination sooner and your total travel time is shorter. This is a case where two things change in opposite directions (more speed, but less travel time). This is a negative correlation.

There is also a third way two things can "change": by not changing at all. For example, if you looked at how your test scores changed as you gained weight, there would probably be no general pattern of change in the scores. This means there is no correlation.
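
To make these three categories a bit more concrete, here is a minimal sketch in Python (assuming the NumPy package, which we will also use later in this article). The numbers are made up purely for illustration.

import numpy as np

# Made-up numbers purely for illustration
food_eaten  = [1, 2, 3, 4, 5]          # more food eaten...
fullness    = [2, 4, 5, 7, 9]          # ...more fullness (positive correlation)
speed       = [20, 40, 60, 80, 100]    # faster speed...
travel_time = [60, 30, 20, 15, 12]     # ...shorter travel time (negative correlation)
weight_gain = [1, 2, 3, 4, 5]          # weight gain...
test_scores = [70, 95, 62, 88, 75]     # ...no clear pattern in scores (no correlation)

# np.corrcoef returns a 2x2 matrix; the [0, 1] entry is the correlation
print(np.corrcoef(food_eaten, fullness)[0, 1])      # close to +1
print(np.corrcoef(speed, travel_time)[0, 1])        # negative
print(np.corrcoef(weight_gain, test_scores)[0, 1])  # close to 0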

Knowing how two things change together is the first step to prediction

Being able to describe what is going on in the previous examples is great. But what's the point? The point is to apply this knowledge in a meaningful way and use it to help predict what happens next.

In our eating example, we might record how much food we eat over a week, along with how full we feel afterwards. As we found before, the more we eat, the fuller we feel.

After collecting all this information, we can ask more questions about why this happens so we can better understand the relationship. Here we might start asking which kinds of food leave us feeling fuller, or whether the time of day affects how full we feel as well.

Similar thinking applies to your job or business. If you notice that sales or some other key metric moves up or down along with another measure of your business (that is, the two are positively or negatively correlated), it may be worth digging into that relationship and learning more about it to improve your business.

Correlations come in different strengths

So far we have described some common correlations as either:

  • positive,
  • negative, or
  • non-existent

These descriptions are fine, but not all positive and negative correlations are created equal.

These descriptions can also be translated into numbers. A correlation value can take on any decimal value between negative one (\(-1\)) and positive one (\(+1\)).

Decimal values between \(-1\) and \(0\) are negative correlations, such as \(-0.32\).

Decimal values between \(0\) and \(+1\) are positive correlations, such as \(+0.63\).

A correlation of exactly zero means there is no correlation.

For each type of correlation, there is a range of strong and weak correlations. Correlation values closer to zero are weaker correlations, while values closer to negative or positive one are stronger correlations.

Strong correlations show more obvious trends in the data, while weak correlations look messier. For example, the strong, high positive correlation below looks much more like a line than the weak, low positive correlation.

Example of low, high, and perfect positive correlations between x and y

Similarly, strong negative correlations show a more obvious trend than weak, low negative correlations.

Example of low, high, and perfect negative correlations between x and y
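
As a rough sketch of how these strength labels could be expressed in code, the small Python helper below buckets an r value by direction and strength. The 0.3 and 0.7 cutoffs are common rules of thumb, not a formal standard.

def describe_correlation(r):
    """Give a rough, informal label for a correlation value r between -1 and 1."""
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < 0.3:        # illustrative rule-of-thumb cutoff
        strength = "weak"
    elif size < 0.7:      # illustrative rule-of-thumb cutoff
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction} correlation"

print(describe_correlation(-0.1))   # weak negative correlation
print(describe_correlation(0.85))   # strong positive correlation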

Where does the r value come from? And what values can it take?

The "r value" is a common way to refer to a correlation value. More specifically, it refers to the (sample) Pearson correlation, or Pearson's r. The "sample" note is there to emphasize that you can only claim a correlation for the data you have, and you should be cautious about making larger claims that go beyond that data.

The following table summarizes what we have covered about correlations so far.

Pearson's r value | The correlation between the two is... | Example
r = -1 | Perfectly negative | Hours of the day that have passed and hours remaining in the day
r < 0 | Negative | Faster car speed and shorter travel time
r = 0 | Independent, or uncorrelated | Weight gain and test scores
r > 0 | Positive | Eating more food and feeling more full
r = 1 | Perfectly positive | My age increasing and your age increasing

In the next few sections, we will

  • Break down the math equation to calculate correlations
  • Use example numbers to use this correlation equation
  • Code up the math equation in Python and JavaScript

Breaking down the math to calculate correlations

As a reminder, correlations can only be between \(-1\) and \(1\). Why is that?

The quick answer is that we adjust the amount of change in both variables to a common scale. In more technical terms, we normalize how much the two variables change together by how much each of them changes on its own.
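
Put another way, the correlation is the covariance of \(x\) and \(y\) divided by the product of their standard deviations, and it is this division that rescales the result to fall between \(-1\) and \(1\):

\[ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} \]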

From Wikipedia, we can grab the math definition of the Pearson correlation coefficient. It looks very complicated, but let's break it down together.

\[ \textcolor{lime}{r} _{ \textcolor{#4466ff}{x} \textcolor{fuchsia}{y} } = \frac{ \sum_{i=1}^{n} (x_i - \textcolor{green}{\bar{x}})(y_i - \textcolor{olive}{\bar{y}}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \textcolor{green}{\bar{x}})^2 \sum_{i=1}^{n} (y_i - \textcolor{olive}{\bar{y}})^2 } }\]

From this equation, to find the \(\textcolor{lime}{\text{correlation}}\) between an \( \textcolor{#4466ff}{\text{x variable}} \) and a \( \textcolor{fuchsia}{\text{y variable}} \), we first need to calculate the \( \textcolor{green}{\text{average value for all the } x \text{ values}} \) and the \( \textcolor{olive}{ \text{average value for all the } y \text{ values}} \).

Let's focus on the top of the equation, also known as the numerator. We need to find the distance of each \(x\) value from the average of \(x\), and do the same subtraction with the \(y\) values and their average.

Intuitively, comparing all these values to the average gives us a target point to see how much change there is in one of the variables.

This is seen in the math form: \(\textcolor{#800080}{\sum_{i=1}^{n}}(\textcolor{#000080}{x_i - \overline{x}})\) \(\textcolor{#800080}{\text{adds up all}}\) the \(\textcolor{#000080}{\text{differences between}}\) your values and the average value for your \(x\) variable.

In the bottom of the equation, also known as the denominator, we do a similar calculation. However, before we add up all of the distances between our values and their averages, we multiply them by themselves (that's what the \((\ldots)^2\) is doing).
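
As a small sketch of these two pieces in Python (with made-up numbers just to show the mechanics, using nothing beyond the standard library):

# Made-up example values, only to illustrate the pieces of the formula
x = [2, 4, 6, 8]
y = [1, 3, 2, 5]

n = len(x)
avg_x = sum(x) / n
avg_y = sum(y) / n

# Top of the equation: add up the paired differences from each average
numerator = sum((x[i] - avg_x) * (y[i] - avg_y) for i in range(n))

# Bottom of the equation: square each difference before adding,
# then multiply the two sums and take the square root
denom_x = sum((x[i] - avg_x) ** 2 for i in range(n))
denom_y = sum((y[i] - avg_y) ** 2 for i in range(n))
denominator = (denom_x * denom_y) ** 0.5

r = numerator / denominator  # always lands between -1 and 1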

This denominator is what "adjusts" the correlation so that the values are between \(-1\) and \(1\).
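
The reason this works is the Cauchy–Schwarz inequality: the numerator can never be larger in absolute value than the denominator, so the ratio always lands between \(-1\) and \(1\).

\[ \left| \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right| \le \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } \]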

Using numbers in our equation to make it real

To demonstrate the math, let's find the correlation between the ages of you and your siblings last year \([1, 2, 6]\) and your ages for this year \([2, 3, 7]\). Note that this is a small example. Typically you would want many more than three samples to have more confidence in your correlation being true.

Looking at the numbers, they appear to increase in the same way. You may also notice they are the same sequence of numbers, but the second set has one added to each value. This is as close to a perfect correlation as we'll get. In other words, we should get an \(r = 1\).

First we need to calculate the averages of each. The average of \([1, 2, 6]\) is \((1+2+6)/3 = 3\) and the average of \([2, 3, 7]\) is \((2+3+7)/3 = 4\). Filling in our equation, we get

\[ r _{ x y } = \frac{ \sum_{i=1}^{n} (x_i - 3)(y_i - 4) }{ \sqrt{ \sum_{i=1}^{n} (x_i - 3)^2 \sum_{i=1}^{n} (y_i - 4)^2 } }\]

Looking at the top of the equation, we need to find the paired differences of \(x\) and \(y\). Remember, the \(\sum\) is the symbol for adding. The top then just becomes

\[ (1-3)(2-4) + (2-3)(3-4) + (6-3)(7-4) \]

\[= (-2)(-2) + (-1)(-1) + (3)(3) \]

\[= 4 + 1 + 9 = 14\]

So the top becomes 14.

\[ r _{ x y } = \frac{ 14 }{ \sqrt{ \sum_{i=1}^{n} (x_i - 3)^2 \sum_{i=1}^{n} (y_i - 4)^2 } }\]

In the bottom of the equation, we need to do some very similar calculations, except focusing on the \(x\) and \(y\) values separately before multiplying.

Let's focus on just \( \sum_{i=1}^n (x_i - 3)^2 \) first. Remember, \(3\) here is the average of all the \(x\) values. This number will change depending on your particular data.

\[ (1-3)^2 + (2-3)^2 + (6-3)^2 \]

\[= (-2)^2 + (-1)^2 + (3)^2 = 4 + 1 + 9 = 14 \]

And now for the \(y\) values.

\[ (2-4)^2 + (3-4)^2 + (7-4)^2 \]

\[= (-2)^2 + (-1)^2 + (3)^2 = 4 + 1 + 9 = 14\]

With those numbers filled in, we can put them back into our equation and solve for our correlation.

\[ r _{ x y } = \frac{ 14 }{ \sqrt{ 14 \times 14 }} = \frac{14}{\sqrt{ 14^2}} = \frac{14}{14} = 1\]

We've successfully confirmed that we get \(r = 1\).

Although this was a simple example, simple examples are best for demonstration purposes. It shows that our equation does indeed work, which will be important when we code it up in the next section.
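
As a quick sanity check on the hand calculation (assuming NumPy is installed), an off-the-shelf routine gives the same answer:

import numpy as np

last_year = [1, 2, 6]
this_year = [2, 3, 7]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
print(np.corrcoef(last_year, this_year)[0, 1])  # 1.0 (up to floating-point rounding)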

Python and JavaScript code for the Pearson correlation coefficient

Math can sometimes be too abstract, so let's code this up for you to experiment with. As a reminder, here is the equation we are going to code up.

\[ r _{ x y } = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } }\]

After going through the math above and reading the code below, it should be a bit clearer how everything works together.

Below is the Python version of the Pearson correlation.

import math

def pearson(x, y):
    """
    Calculate the Pearson correlation coefficient of arrays of equal length.

    The numerator is the sum of the multiplication of (x - x_avg) and (y - y_avg).
    The denominator is the square root of the product between the sum of
    (x - x_avg)^2 and the sum of (y - y_avg)^2.
    """
    n = len(x)
    idx = range(n)

    # Averages
    avg_x = sum(x) / n
    avg_y = sum(y) / n

    numerator = sum([(x[i] - avg_x) * (y[i] - avg_y) for i in idx])
    denom_x = sum([(x[i] - avg_x)**2 for i in idx])
    denom_y = sum([(y[i] - avg_y)**2 for i in idx])
    denominator = math.sqrt(denom_x * denom_y)

    return numerator / denominator

Here's an example of our Python code at work. We can double-check the result using the Pearson correlation function from the SciPy package.

import numpy as np
import scipy.stats

# Create fake data
x = np.arange(5, 15)  # array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
y = np.array([24, 0, 58, 26, 82, 89, 90, 90, 36, 56])

# Use a package to calculate Pearson's r
# Note: the p variable below is the p-value for Pearson's r. It tests
# whether our correlation is significantly different from zero.
r, p = scipy.stats.pearsonr(x, y)
r
# 0.506862548805646

# Use our own function
pearson(x, y)
# 0.506862548805646

Below is the JavaScript version of the Pearson correlation.

function pearson(x, y) {
  let n = x.length;
  let idx = Array.from({length: n}, (x, i) => i);

  // Averages
  let avgX = x.reduce((a, b) => a + b) / n;
  let avgY = y.reduce((a, b) => a + b) / n;

  let numMult = idx.map(i => (x[i] - avgX) * (y[i] - avgY));
  let numerator = numMult.reduce((a, b) => a + b);

  let denomX = idx.map(i => Math.pow(x[i] - avgX, 2)).reduce((a, b) => a + b);
  let denomY = idx.map(i => Math.pow(y[i] - avgY, 2)).reduce((a, b) => a + b);
  let denominator = Math.sqrt(denomX * denomY);

  return numerator / denominator;
}

Here's an example of our JavaScript code at work so we can double-check our result.

x = Array.from({length: 10}, (x, i) => i + 5)
// Array(10) [ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ]
y = [24, 0, 58, 26, 82, 89, 90, 90, 36, 56]

pearson(x, y)
// 0.506862548805646

Feel free to experiment with either the Python or JavaScript version to better understand how the formula works.

In conclusion

Correlations are a helpful and accessible tool for better understanding the relationship between any two numerical measures. They can be thought of as a starting point for predictive problems, or simply as a way to better understand your business.

Correlation values, most commonly reported as Pearson's r, range from \(-1\) to \(+1\) and can be categorized as negative correlations (\(-1 \le r \lt 0\)), positive correlations (\(0 \lt r \le 1\)), and no correlation (\(r = 0\)).

A glimpse into the larger world of correlations

There is more than one way to calculate a correlation. Here we have touched on the case where both variables change in the same, linear way. There are other cases where one variable may change at a different rate but still have a clear relationship. This gives rise to what are called non-linear relationships.
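
For instance, a rank-based measure such as Spearman's correlation (available in SciPy as scipy.stats.spearmanr) captures relationships that consistently increase or decrease even when they are not straight lines. The sketch below uses made-up data where y always rises with x, but much faster; Pearson's r comes out below 1, while Spearman's correlation is exactly 1.

import numpy as np
import scipy.stats

# Made-up data: y always increases when x increases, but not at a constant rate
x = np.arange(1, 11)
y = x ** 3

pearson_r, _ = scipy.stats.pearsonr(x, y)     # below 1: the relationship is not a straight line
spearman_r, _ = scipy.stats.spearmanr(x, y)   # exactly 1: the ranks move together perfectly
print(pearson_r, spearman_r)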

Note, correlation does not imply causation. If you need quick examples of why, look no further.

Below is a list of other articles I came across that helped me better understand the correlation coefficient.

  • If you want to explore a great interactive visualization on correlation, take a look at this simple and fantastic site.
  • Using Python, there are multiple ways to implement a correlation, and there are multiple types of correlation. This excellent tutorial shows great examples of Python code to experiment with yourself.
  • A blog post by Sebastian Sauer goes over correlations using "average deviation rectangles", where each point forms a visual rectangle with the mean, illustrated using the R programming language.
  • And for the deeply curious people out there, take a look at this paper showing 13 ways to look at the correlation coefficient (PDF).

Follow me on Twitter and check out my personal blog where I share some other insights and helpful resources for programming, statistics, and machine learning.

Thanks for reading!