
Auto-correlation + Durbin-Watson test

metamong 2022. 6. 17.

🌷 Auto-correlation is an error-related property of the residual plot that almost always gets checked, so let's first see what this correlation actually is, and

🌷 then look at two methods for detecting whether auto-correlation is present: the Durbin-Watson test & the Breusch-Godfrey test.

* linear regression model assumptions → auto-correlation concepts

 

🍂 In assumption 2 above, while explaining residuals, we noted that if the error terms in a residual plot (a visualization of the residuals) form some kind of pattern, auto-correlation is said to exist.

 

🍂 Recapping what was briefly covered in an earlier Coursera course:

 

 

→ The distribution of the points in the residual plot tells us whether the linear model is appropriate.

 

🍂 In other words, when the error terms in the residual plots shown in the two graphs above are correlated with one another, auto-correlation is said to exist (it is also called serial correlation, a term used a lot with time-series data). If error terms a fixed time interval apart tend to have the same sign (positive with positive, or negative with negative), it is called 'positive auto-correlation'; if they tend to have opposite signs (negative and positive), it is called 'negative auto-correlation'. If neither holds and no correlation is visible, the errors are 'uncorrelated' and we judge that there is no auto-correlation.
(That is, when one error term $e_k$ is positive (+) and its neighbor $e_{k+1}$ is also positive (+), so that the residuals form a kind of pattern or trend, there is auto-correlation.)
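As a rough illustration of the sign patterns described above, the sketch below computes the correlation between each residual and its immediate neighbor. The helper name `lag1_autocorr` and the two toy residual lists are made up for this example and do not come from the original post.

```python
def lag1_autocorr(residuals):
    """Sample correlation between e_k and e_{k+1} (lag-1 autocorrelation)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    # Numerator: products of neighboring (centered) residuals
    num = sum(centered[k] * centered[k + 1] for k in range(n - 1))
    # Denominator: total sum of squares of the centered residuals
    den = sum(c * c for c in centered)
    return num / den


# Hypothetical residuals with long runs of the same sign (trend-like pattern)
same_sign_runs = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]
# Hypothetical residuals whose sign flips at every step
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

r_pos = lag1_autocorr(same_sign_runs)
r_neg = lag1_autocorr(alternating)
print(f"runs of same sign: r1 = {r_pos:.3f}")  # positive value
print(f"alternating signs: r1 = {r_neg:.3f}")  # negative value
```

A clearly positive value corresponds to 'positive auto-correlation', a clearly negative one to 'negative auto-correlation', and a value near 0 to 'uncorrelated'.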

 

🍂 There are two main ways to test for auto-correlation: ① the Durbin-Watson test and ② the Breusch-Godfrey test (covered below!)

 

🍂 In other words, the assumptions of a linear model always include the assumption that there is no auto-correlation. Naturally, if the error terms show a pattern of being related to one another, the premise under which regression modeling proceeds, namely that the errors are independent of each other, is violated!

(So always check whether auto-correlation is present using detection tests.)

 

🍂 This matters most for time-series data. Data points separated by a fixed time interval are very often related to each other, showing some kind of pattern or trend, so auto-correlation is usually checked before forecasting from a given time series.
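To make the idea of "data points a fixed interval apart being related" concrete, here is a minimal sketch that computes the sample autocorrelation at several lags. The function name `acf` and the synthetic 12-step seasonal series are assumptions for illustration only, not the datasets used later in this post.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [y - mean for y in series]
    den = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / den
        for k in range(max_lag + 1)
    ]


# Hypothetical "monthly" series with a 12-step seasonal cycle (10 full cycles)
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]

r = acf(series, max_lag=12)
print(f"lag 6:  {r[6]:.2f}")   # half a cycle apart: strongly negative
print(f"lag 12: {r[12]:.2f}")  # one full cycle apart: strongly positive
```

A repeating wave across the lags like this is the kind of pattern an autocorrelation plot makes visible.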

 

🍂 Q. What if auto-correlation exists?

→ The regression coefficient estimates themselves are not biased, but the standard errors and p-values can no longer be used. This means we cannot draw p-value-based conclusions such as 'this particular independent variable affects the dependent variable'!

→ Once auto-correlation is detected, we can try to remove it. It is sometimes caused by an omitted variable or a missing functional term, in which case adding that variable or term can resolve it.
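The "missing term" remedy can be sketched with synthetic data: the true relationship below is quadratic, so a straight-line fit leaves a smooth pattern in the residuals, while adding the squared term removes it. Everything here (the data-generating formula, the seed, the helper `lag1_corr`) is an assumption made for this example.

```python
import numpy as np


def lag1_corr(e):
    """Lag-1 sample autocorrelation of a residual vector."""
    e = e - e.mean()
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e * e))


rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
# Hypothetical data: quadratic signal plus independent noise
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(size=x.size)

# Model 1: straight line (the x^2 term is missing)
resid_linear = y - np.polyval(np.polyfit(x, y, deg=1), x)
# Model 2: the missing x^2 term is added
resid_quad = y - np.polyval(np.polyfit(x, y, deg=2), x)

r_linear = lag1_corr(resid_linear)
r_quad = lag1_corr(resid_quad)
print(f"linear fit:    lag-1 correlation = {r_linear:.2f}")
print(f"quadratic fit: lag-1 correlation = {r_quad:.2f}")
```

With the term missing, neighboring residuals are strongly positively correlated; once it is added, the correlation drops to roughly zero.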

1> Durbin-Watson test

🌹 One property of the Durbin-Watson test to keep in mind is that it only tests 'neighboring residual terms' against each other

(only looks at successive error terms)

 

๐ŸŒน Durbin-Watson d-statistic

 

$d_w = \cfrac{(e_2 - e_1)^2 + (e_3 - e_2)^2 + ... + (e_n - e_{n-1})^2}{e_1^2 + e_2^2 + ... + e_n^2}$
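The formula translates almost directly into code. The sketch below is a plain-Python version for illustration; the function name `durbin_watson_stat` and the toy residuals are assumptions, and `durbin_watson` from `statsmodels.stats.stattools` (used in the hands-on part of this post) computes the same quantity.

```python
def durbin_watson_stat(residuals):
    """Durbin-Watson d: sum of squared successive differences / sum of squares."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals with long runs of the same sign
same_sign_runs = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]
# Hypothetical residuals whose sign flips at every step
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

d_pos = durbin_watson_stat(same_sign_runs)
d_neg = durbin_watson_stat(alternating)
print(f"{d_pos:.3f}")  # 0.316 (12 / 38)
print(f"{d_neg:.3f}")  # 3.500 (28 / 8)
```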

 

🌹 $d_w$ ranges from 0 to 4. A value near 2 means the errors are judged uncorrelated, a value close to 4 means negatively correlated, and a value close to 0 means positively correlated.
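The 0-to-4 range follows from expanding the squared differences in the numerator of $d_w$. This is the standard approximation rather than something stated in the sources of this post; $\hat{\rho}$, the lag-1 sample autocorrelation of the residuals, is notation introduced here.

$d_w = \cfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} = \cfrac{\sum_{t=2}^{n} e_t^2 + \sum_{t=2}^{n} e_{t-1}^2 - 2\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2} \approx 2(1 - \hat{\rho}), \qquad \hat{\rho} = \cfrac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}$

Since $-1 \le \hat{\rho} \le 1$, $d_w$ lies between 0 and 4: $\hat{\rho} \approx 0$ gives $d_w \approx 2$, $\hat{\rho} \to 1$ gives $d_w \to 0$, and $\hat{\rho} \to -1$ gives $d_w \to 4$.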

 

🌹 range from 0 to 4 ($d_w$)

 

 

🌹 The lower and upper critical values $d_L$ and $d_u$ shown above are determined from the number of observations n, the number of regressors excluding the intercept k (i.e., the number of x variables), and the significance level.

(The Durbin-Watson table below can be found by googling; see the sources at the end.)

 

 

🌹 As in the table above, alpha = 0.05 is the most commonly used level, and cases with k of 10 or less are the most frequent. (Values are listed in the order $d_L, d_u$.)
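Once $d_L$ and $d_u$ have been read off the table, the conventional decision regions on the 0-to-4 scale can be written as a small helper. This is a sketch: the function name `dw_decision` is made up, and the bounds used in the example calls are illustrative placeholders, not values looked up for any dataset in this post.

```python
def dw_decision(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic given table bounds d_L and d_u."""
    if not 0 <= d <= 4:
        raise ValueError("the Durbin-Watson statistic must lie in [0, 4]")
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "inconclusive"
    if d < 4 - d_upper:
        return "no evidence of autocorrelation"
    if d <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"


# Placeholder bounds for illustration only; look up the real ones
# for your n, k and significance level in a Durbin-Watson table.
D_L, D_U = 1.65, 1.69

for d in (0.3, 1.67, 2.0, 2.33, 3.5):
    print(d, "->", dw_decision(d, D_L, D_U))
```

The two "inconclusive" bands are why the table lists a pair of bounds instead of a single critical value.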

* Hands-on>

① Prepare & import three time-series Kaggle datasets

- Monthly beer production in Australia

- Shampoo sales

- Electricity production

 

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from statsmodels.formula.api import ols  # OLS model - residual calculation
from statsmodels.stats.stattools import durbin_watson  # Durbin-Watson d statistic
import statsmodels.tsa.api as smt  # autocorrelation plot of residuals (smt.graphics.plot_acf)

 

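# Load the monthly beer production dataset (CSV downloaded from Kaggle)
df = pd.read_csv("monthly-beer-production-in-austr.csv")

# Count missing values in each column
print(df.isnull().sum())

# Parse the 'Month' column as datetime
df['Month'] = pd.to_datetime(df['Month'])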

 

② Visualization and dataframe creation

 

plt.figure(figsize=(10,5))
plt.title("Monthly beer production in Australia", fontsize=15)
plt.plot(df["Month"], df["Monthly beer production"], "-", color='#A69037')
plt.grid()
plt.xticks(rotation=90)
plt.show()

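# Integer time index 1..N (N = 476 for this dataset) used as the regressor
lst = list(range(1, len(df) + 1))

production = df["Monthly beer production"].to_list()

# Dataframe of (time, production) pairs for the regression
new = pd.DataFrame({'time': lst, 'production': production})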

 

③ Write a function that selects the independent and dependent variables of a given dataframe, prints the d-statistic, and plots the autocorrelation graph of the residuals

 

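def durbin_watson_func(y, X, df, n):
    # y  - name of the dependent variable column
    # X  - right-hand side of the formula (independent variable column(s))
    # df - dataframe containing y and X
    # n  - number of lags to plot (at most the number of data points - 1)

    # fit an OLS linear regression model y ~ X
    model = ols(formula=f'{y} ~ {X}', data=df).fit()

    # Durbin-Watson d statistic of the residuals
    print('durbin_watson d statistic: ', durbin_watson(model.resid))

    # autocorrelation plot of the residuals with a 95% confidence band
    acf = smt.graphics.plot_acf(model.resid, lags=n, alpha=0.05)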

 

④ Comparing the time-series plots and the Durbin-Watson results for all three datasets gives the following.

 

 

⑤ Analysis of the results

 

✔️ The beer and electricity production series each have more than 300 data points. Given the nature of time-series data, and since an overall trend can be read off the plots, each data point can be said to be influenced by the points before and after it. So we would expect high auto-correlation, and indeed the autocorrelation plots at the bottom show a distinct pattern. The Durbin-Watson d-statistic is also smaller than $d_L$, so we can say these series have 'positive auto-correlation'.

 

✔️ By contrast, the shampoo sales series in the middle has only 35 data points. It is still a time series, but with so few data points we might guess that auto-correlation will be hard to see. Indeed, no noticeable pattern stands out in the autocorrelation plot at the bottom, and the Durbin-Watson d-statistic is very close to 2, so we can say there is no evidence of auto-correlation.

 

++

Breusch-Godfrey test: to be covered later


* Source 1) autocorrelation + two detection tests + remedies explained  https://www.youtube.com/watch?v=UFvDSX3jsYg 

* Source 2) auto-correlation explained https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=yonxman&logNo=220960992282 

* Source 3) autocorrelation wiki https://en.wikipedia.org/wiki/Autocorrelation

* Source 4) autocorrelation concepts https://corporatefinanceinstitute.com/resources/knowledge/other/autocorrelation/

* Source 5) durbin-watson table https://www.real-statistics.com/statistics-tables/durbin-watson-table/

* Source 6) autocorrelation coefficient explained https://otexts.com/fppkr/graphics-autocorrelation.html

* Source 7) autocorrelation visualization reference https://www.youtube.com/watch?v=FiBBpscb6es 

* Source 8) formula markdown partly referenced https://github.com/bhattbhavesh91/durbin-watson-test-python/blob/master/durbin-watson-notebook.ipynb

 

* code partly referenced) https://yganalyst.github.io/etc/visual_2/

* time series dataset(beer production in Australia) source) https://www.kaggle.com/datasets/podsyp/time-series-starter-dataset
