Pearson & Spearman correlation coefficients

metamong 2022. 5. 13.

🧓🏻 Let's take a close look at Pearson and Spearman, two kinds of correlation coefficients that are must-know concepts in data analysis

≫ I touched on them very briefly in an earlier posting about a Coursera course

 

 

🏄🏻 Let's go through the background that led to using these coefficients, and then the two kinds of coefficients in detail

from covariance>

🏄🏻 Covariance is 'a measure of how one variable changes in association with another variable as that variable's value changes'

🏄🏻 Variance = 'how spread out a single variable's data is'

🏄🏻 Covariance (shared variance) = 'how two variables are spread out together (measures how much one variable spreads while the other moves back and forth)'

- see the figure below -

① covariance > 0 (positive covariance): in the lower-left plot, the y value rises as x rises

② covariance < 0 (negative covariance): in the lower-middle plot, the y value falls as x rises

③ covariance ≒ 0: in the lower-right plot, there is no visible relationship between high or low x and high or low y

 

 

- How covariance works

☀️ In practice the covariance value is rarely computed or used on its own!

→ Because covariance changes with the scale of the variables, think of it as a stepping stone toward correlation

☀️ Covariance formula ($\bar{x}$ and $\bar{y}$ are the means of x and y)

 

${\huge{\sum(x-\bar{x})(y-\bar{y})\over n-1}}$

 

→ Looking at the numerator: if we split the plane into four quadrants around the means of x and y, points in quadrants 1 and 3 contribute positive terms to the covariance / points in quadrants 2 and 4 contribute negative terms

A trend that runs through quadrants 1 and 3 is a positive relationship / one that runs through quadrants 2 and 4 is a negative relationship
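The quadrant reading of the formula can be checked with a minimal sketch, assuming NumPy is installed. The toy `x` and `y` values are made up for illustration only:

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the means: the sign of each product tells which quadrant
# (relative to the two means) a point falls in.
dx = x - x.mean()
dy = y - y.mean()
products = dx * dy

# Sample covariance: sum of the products divided by n - 1.
cov_manual = products.sum() / (len(x) - 1)

# np.cov returns the 2x2 variance-covariance matrix; [0, 1] is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(products)               # per-point contributions
print(cov_manual, cov_numpy)  # both give the same covariance
```

No product is negative here, so no point sits in quadrant 2 or 4 and the covariance comes out positive.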

 

☀️ Apart from its sign, the covariance value itself is not useful. The value does not tell us how steep the line's slope is & it also does not tell us how close the actual points are to the relationship line

☀️ On top of that, simply raising the scale of the y values, with x left as it is, changes the covariance value
(even though only the scale changed and the relationship line is the same)

 

covariance values are sensitive to the scale of the data, and this makes them difficult to interpret
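A small sketch of this scale sensitivity, assuming NumPy. The data are made up, and multiplying `y` by 10 stands in for a unit change such as cm to mm:

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_scaled = y * 10  # same relationship, only the unit of y changes

cov_original = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x, y_scaled)[0, 1]

# The covariance grows by the same factor of 10 although the pattern is identical.
print(cov_original, cov_scaled)
```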

 

☀️ The fix? correlation! The concept of covariance serves as the foundation for computing correlation

- variance-covariance matrix

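import numpy as np
import pandas as pd

# a and b share one scale; c and d repeat the same pattern at twice the scale
a = b = np.arange(5, 50, 5)
c = d = np.arange(10, 100, 10)

fake_data = {"a": a, "b": b, "c": c, "d": d}

df = pd.DataFrame(fake_data)

# variance-covariance matrix: variances on the diagonal, covariances elsewhere
df.cov()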

 

 

The covariance of a variable with itself is its variance, and the code confirms that raising the scale (c and d) raises the covariance by the same amount: cov(a, c) is twice cov(a, b), and cov(c, d) is four times as large!

🔮 In this way the covariance matrix lets us grasp how the data are distributed (in two or more dimensions)!
(covariance matrix → a matrix holding each vector's variance and the covariances between vectors)

🔮 Each vector's variance (the diagonal entries of the matrix) tells us how scattered the data are for that vector

🔮 Covariance is easier to compute if each vector is first shifted so that its mean is 0, which puts the mean at the origin
(then the covariance only needs the products of the coordinates, summed and divided by n - 1 - product of coordinates)
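The centering trick can be sketched as follows, assuming NumPy. The toy vectors are made up for illustration:

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Shift each vector so that its mean sits at the origin.
xc = x - x.mean()
yc = y - y.mean()

# After centering, covariance is just the dot product divided by n - 1.
cov_from_dot = np.dot(xc, yc) / (len(x) - 1)

print(cov_from_dot, np.cov(x, y)[0, 1])  # the two values agree
```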

+ using vectors - orthogonality>

🔮 'Orthogonality' means that vectors are perpendicular to each other

🔮 We saw (through the covariance values) that vectors usually have at least some relationship with each other. If two vectors are perpendicular, we can read that as having no (linear) correlation at all! Strictly, this reading holds for mean-centered vectors, whose dot product is proportional to their covariance

🔮 Two vectors are perpendicular = the dot product of the two vectors is 0
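A minimal sketch of the orthogonality reading, assuming NumPy. The vectors `u`, `v`, `a`, `b` are hypothetical examples; the second pair is there to show why centering matters:

```python
import numpy as np

# Two mean-zero vectors that are perpendicular to each other.
u = np.array([1.0, -1.0, 1.0, -1.0])
v = np.array([1.0, 1.0, -1.0, -1.0])

print(np.dot(u, v))        # 0.0 -> orthogonal
print(np.cov(u, v)[0, 1])  # covariance is 0 as well

# Orthogonal but NOT mean-centered: the dot product is 0,
# yet after centering the two vectors move in exactly opposite directions.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

print(np.dot(a, b))             # 0.0
print(np.corrcoef(a, b)[0, 1])  # correlation of -1, not 0
```

So "perpendicular means uncorrelated" is safe to use once the means have been moved to the origin.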

1> Pearson Correlation Coefficient(r)>

🦋 Depending on how tightly the data cluster around the trend line, a relationship can be broadly classified as weak, moderate, or strong

🦋 The weaker the relationship, the smaller the correlation value (in absolute terms); the stronger the relationship, the larger the correlation value

🦋 When predicting the trend line's y value for a given x value, the stronger the relationship, the narrower the range of plausible y values

☆ Here, the number of data points and the steepness of the line's slope do not determine the strength of the correlation (neither do the data's mean or the size of its variance)
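The invariance to slope and mean can be illustrated with a short sketch, assuming NumPy and made-up data. It does not cover the sample-size part of the claim:

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r_base = np.corrcoef(x, y)[0, 1]
r_steeper = np.corrcoef(x, 7 * y)[0, 1]    # steeper trend line, larger variance
r_shifted = np.corrcoef(x, y + 100)[0, 1]  # different mean
r_flipped = np.corrcoef(x, -y)[0, 1]       # only the direction changes

print(r_base, r_steeper, r_shifted)  # all three are identical
print(r_flipped)                     # same strength, opposite sign
```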

 

🦋 * Confidence: how much we can trust the line to predict the paired value when new data arrive

→ ① the smaller the p-value, the more confident we can be in the line fitted to the given data points

→ ② the more data points we have, the higher that confidence

🦋 However, this confidence, which depends on the number of data points, is not a measure of how strongly the two variables are related.

→ Even when there are so many data points that confidence comes out high, the correlation itself is not affected

The decisive measure we want is the correlation!

 

- Assume there are only two variables, x and y (for example, measurements of two specific genes); the formula built on their covariance is below -

${\huge{r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\sqrt{\mathrm{var}(y)}}}}$

🦋 Dividing the covariance by the product of the square roots of the variances of x and y gives the correlation value

🦋 Interpretation 1> the more tightly the data cluster around the trend line, the closer the covariance (in absolute value) gets to the product of the square roots of the two variances, so r approaches 1 or -1

🦋 Interpretation 2> the farther the data drift from the trend line, the smaller the covariance becomes relative to that product, so the correlation value shrinks toward 0
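Both interpretations can be tried out with a rough sketch, assuming NumPy. The helper `pearson_r`, the noise levels, and the seed are all illustrative choices, not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = np.linspace(0, 10, 200)

def pearson_r(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return np.cov(a, b)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1))

y_tight = 2 * x + rng.normal(scale=0.5, size=x.size)   # points hug the line
y_loose = 2 * x + rng.normal(scale=20.0, size=x.size)  # points scatter widely

r_tight = pearson_r(x, y_tight)
r_loose = pearson_r(x, y_loose)

print(r_tight)  # close to 1
print(r_loose)  # much closer to 0
```

The underlying slope is the same in both cases; only the scatter around the line changes the result.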

vs. R^2(coefficient of determination)?

→ In short, the Pearson correlation coefficient shows the strength of the relationship between the variables

Q. Then how do R-Squared and the Pearson correlation differ?

① First, the Pearson coefficient is not affected by scale. It is determined only by the relative differences among the numbers. So, as mentioned above, even when there are few data points and confidence is low, the Pearson coefficient can still come out close to -1 or 1

② The coefficient of determination, R^2, shows how much of the variance in y is explained by the values predicted from the trend line.
→ For a least-squares fit it lies between 0 and 1, and the closer it is to 1, the more explanatory power the model has

Affected by scale: predictions that are shifted or rescaled relative to the observations lower R^2, even when they are perfectly correlated with them

★ To sum up, the Pearson coefficient is not for prediction; it is for finding a pattern or relationship in the given data

★ R^2 is for checking how well a model's predictions explain the variance of the actual observations
(and because it is sensitive to scale, it is also the more trustworthy measure of how good the predictions are)

+ Also, squaring the Pearson Correlation Coefficient does not always give R-Squared!
(it does only in a very specific linear model: SLR, simple linear regression fitted by least squares)

 

 

→ As the three graphs above show, even when the data lie far from the model's prediction line (middle and right), the Pearson value comes out close to 1 because the data still line up with each other. So to check whether the given model predicts well, use R^2
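A minimal sketch of that difference, assuming NumPy. The observations, the three sets of predictions, and the helper `r_squared` are made up for illustration:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical observations

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

predictions = {
    "good fit": y_true + np.array([0.1, -0.1, 0.0, 0.1, -0.1]),
    "shifted": y_true + 3.0,   # right shape, wrong level
    "rescaled": y_true * 2.0,  # right shape, wrong scale
}

for name, y_pred in predictions.items():
    r = np.corrcoef(y_true, y_pred)[0, 1]
    print(name, round(r, 3), round(r_squared(y_true, y_pred), 3))
```

The shifted and rescaled predictions keep a Pearson value of 1, while their R^2 drops below 0, which is why R^2 is the one to use when judging predictions.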

 

* For more on R^2, see the posting below

 

All About Evaluation Metrics(1/2) → MSE, MAE, RMSE, R^2

sh-avid-learner.tistory.com

2>Spearman Rank-Order Correlation Coefficient(ρ; rs)>

🤍 A measure of how weak or strong the relationship is between two ranked ordinal variables (continuous and discrete both work)!

🤍 Unlike Pearson, which needs numeric data from which quantities such as variance can be computed, it can be used when the data are ordered categories (ordinal) rather than measured numbers

🤍 A non-parametric method

(see the posting below for the difference between parametric & non-parametric!)

 

Parametric vs. Non-Parametric Tests

sh-avid-learner.tistory.com

 

🤍 'a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.'

🤍 Pearson evaluates the relationship against a linear function, whereas Spearman evaluates it against a monotonic function
(unlike Pearson, Spearman is not about linear correlation; it asks the qualitative question of whether one variable increases/decreases as the other increases)

🤍 The Spearman correlation between 2 variables = The Pearson correlation between the rank values of those 2 variables
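This identity can be checked with a short sketch, assuming SciPy is installed. The two small grade lists are made up and contain ties on purpose:

```python
import numpy as np
from scipy import stats

# Hypothetical grades with ties, only for illustration.
x = np.array([1, 5, 2, 3, 3, 5, 6, 6])
y = np.array([2, 5, 6, 4, 3, 5, 4, 2])

# rankdata assigns the average rank to tied values by default.
rx = stats.rankdata(x)
ry = stats.rankdata(y)

rho_from_ranks = stats.pearsonr(rx, ry)[0]  # Pearson on the ranks
rho_scipy = stats.spearmanr(x, y)[0]        # Spearman on the raw values

print(rho_from_ranks, rho_scipy)  # the two numbers match
```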

 

 

🤍 As the graphs above show, when the data follow a monotonic function the Spearman coefficient is 1. When an outlier is present, as in the right-hand graph above, Spearman is less sensitive to it than Pearson, so its coefficient comes out higher
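Both effects can be reproduced with a rough sketch using only NumPy. The helpers `ranks` and `pearson` and the data are illustrative assumptions, and `ranks` only handles data without ties:

```python
import numpy as np

def ranks(values):
    """Ranks 1..n for data WITHOUT ties (argsort of argsort)."""
    return np.argsort(np.argsort(values)) + 1

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

x = np.arange(1, 11, dtype=float)

# Case 1: strictly increasing but far from a straight line.
y_curve = np.exp(x)
print(pearson(x, y_curve))                # clearly below 1
print(pearson(ranks(x), ranks(y_curve)))  # Spearman: 1, the ranks agree

# Case 2: a straight line with one extreme outlier at the end.
y_outlier = x.copy()
y_outlier[-1] = 100.0
print(pearson(x, y_outlier))                # pulled down by the outlier
print(pearson(ranks(x), ranks(y_outlier)))  # Spearman is still 1
```

Ranking turns the outlier into just "the largest value", which is why Spearman barely notices it.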

Calculation>

🤍 As mentioned above, the Spearman coefficient is obtained by ranking the ordinal variables and computing the Pearson coefficient of those ranks R(X), R(Y)

🤍 For a sample of size n, the n raw scores $X_i, Y_i$ are converted to ranks $R(X_i), R(Y_i)$ and $r_s$ is computed as

${\huge{r_s = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}}}$

- just plug the ranks into the Pearson formula -

 

 

🤍 (Special case) If all n ranks are distinct integers, $r_s$ can be computed with the following formula

${\huge{r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}}}$

- $d_i = R(X_i) - R(Y_i)$ is the difference between the two ranks of each observation / n is the number of observations -
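For tie-free data the shortcut $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$ should agree with the Pearson coefficient of the ranks. A minimal sketch of that check, assuming NumPy; the numbers and the helper `ranks` are illustrative only:

```python
import numpy as np

def ranks(values):
    """Ranks 1..n for data WITHOUT ties (argsort of argsort)."""
    return np.argsort(np.argsort(values)) + 1

# Hypothetical paired observations with no repeated values.
x = np.array([86, 97, 99, 100, 101, 103, 106, 110, 112, 113])
y = np.array([2, 20, 28, 27, 50, 29, 7, 17, 6, 12])

rx, ry = ranks(x), ranks(y)
d = rx - ry  # rank difference per observation
n = len(x)

rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_pearson_on_ranks = np.corrcoef(rx, ry)[0, 1]

print(np.sum(d ** 2))                     # 194
print(rho_formula, rho_pearson_on_ranks)  # both about -0.176
```

With ties the shortcut no longer holds exactly, so the Pearson-on-ranks form is the safer default.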

 

interpretations & summary>

🤍 A Spearman coefficient close to 1 means that the paired $X_i, Y_i$ follow almost the same trend in rank, so their rank differences stay small

🤍 In other words, a value close to 1 tells us the two ordinal variables are related, a value close to 0 tells us they are almost unrelated, and a value close to -1 means they follow a negative monotonic function: when one (rank) increases the other decreases, and when one decreases the other increases.

👉 To compute (and test) Spearman's correlation coefficient..!

① the given sample data are all randomly sampled

② the test is run with 'a monotonic relationship exists' as the alternative hypothesis (Ha)

③ the variables are at least ordinal, or continuous numerical

④ the data must be paired samples - only then can we see whether the pairs follow a monotonic function

⑤ the observations must be independent of each other

🤍 ▒ Note! Unlike Pearson, this is a non-parametric method, so the data do not have to come from a normal distribution ▒

w/code

🧠 scipy.stats.spearmanr docs 🧠

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

 

scipy.stats.spearmanr(a, b=None, axis=0, nan_policy='propagate', alternative='two-sided')

 

'Calculate a Spearman correlation coefficient with associated p-value. The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.'

 

👩‍🦰 returns 1> the correlation coefficient & 2> the p-value

≫ The null hypothesis H0 is that there is no monotonic relationship between the two ordinal variables. If the p-value is at or below the usual 0.05 threshold, H0 is rejected! That is, the test supports the alternative hypothesis Ha, 'a relationship exists between the two ordinal variables'!

≫ Example

Q. Below are the English and math mock-exam grades of all 35 students in Class 1. Compute the Spearman correlation coefficient and run the statistical test to see whether the students' English and math grades show a significant relationship. (Assume each student's grades are independent of the other students'.)

- (below) part of the per-student grade table -

 

A.

 

import pandas as pd
English = [1, 5, 2, 3, 3, 5, 6, 6, 5, 4, 2, 4, 5, 3, 4, 5, 2, 3, 5, 4, 3, 4, 6, 5, 8, 6, 4, 7, 5, 4, 3, 4, 1, 3, 4]
Math = [2, 5, 6, 4, 3, 5, 4, 2, 4, 5, 3, 3, 5, 6, 4, 5, 6, 5, 6, 5, 7, 6, 5, 4, 5, 3, 5, 4, 2, 1, 1, 4, 3, 2, 2]
grade = pd.DataFrame({'English':English, 'Math': Math})

 

* Test result>

from scipy import stats  # provides spearmanr

# returns the Spearman coefficient and the two-sided p-value
stats.spearmanr(grade['English'].tolist(), grade['Math'].tolist())
#SpearmanrResult(correlation=0.03519639020891649, pvalue=0.8409139955046013)

* Conclusion> As expected, because I typed in arbitrary random numbers between 1 and 9 for the English and math grades, the Spearman value is 0.03, close to 0, and the p-value is far above 0.05, so the null hypothesis is not rejected: the data give no statistical evidence of a monotonic relationship!

 

++ More kinds of correlation - Kendall rank & point-biserial correlation - will be covered in a later posting! ++


* Source 1) https://programmathically.com/covariance-and-correlation/

* Source 2) covariance matrix https://www.youtube.com/watch?v=WBlnwvjfMtQ

* Source 3) r^2 vs. Pearson coefficient https://towardsdatascience.com/r%C2%B2-or-r%C2%B2-when-to-use-what-4968eee68ed3

* Source 4) Spearman explanation https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

* Source 5) Spearman youtube https://www.youtube.com/watch?v=JwNwbu-g2m0

* Source 6) comparison and explanation of Pearson, Spearman, r-squared https://m.blog.naver.com/istech7/50153288534

* Source 7) the great STATQUEST - correlation explained! 👩🏿‍🚀 https://www.youtube.com/watch?v=xZ_z8KWkhXE

* Source 8) the great STATQUEST - covariance explained! 🌻 https://www.youtube.com/watch?v=qtaqvPAeEJY&t=435s
