
EXP001 - ≪Ridge Effect (1)≫ - Checking Coefficient Changes & Performance Improvement

metamong 2022. 4. 22.

🤙 In the previous post, we studied the Multiple Linear Regression (MLR) model.

Multiple Linear Regression Model (concepts+w/code)

sh-avid-learner.tistory.com

🤙 Then, in the post introducing Ridge, we applied Ridge regularization to an SLR model and demonstrated its effect experimentally, and briefly touched on the MLR model toward the end.

(L2 Regularization) → Ridge Regression (w/scikit-learn)

sh-avid-learner.tistory.com

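> In an MLR model, several variables act as features. Ridge regularization shrinks the coefficients, which reduces the influence of the less important features. Unlike with the SLR model, the ridge effect should be clearly noticeable here (because there are many features)!

> So after applying ridge, we will look at ① how each feature's coefficient changes, and then ② check numerically how much the model's performance improves.

Let's start!

① Changes in each feature's coefficients

1> Prepare the data / split train and test / split features and target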

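#<MLR model - find the best alpha and see how each feature's coef changes>
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score, mean_squared_error

path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/module_5_auto.csv'
df = pd.read_csv(path)
df.to_csv('module_5_auto.csv')  # keep a local copy
df = df._get_numeric_data()  # keep numeric columns only (private pandas helper)

# 75% of the rows for training, the remaining rows for testing
train = df.sample(frac=0.75, random_state=1)
test = df.drop(train.index)

# drop rows with missing values
train = train.dropna()
test = test.dropna()

target = 'price'

## split into X_train, y_train, X_test, y_test
X_train = train.drop(columns=target)
y_train = train[target]
X_test = test.drop(columns=target)
y_test = test[target]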
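To make the `sample` / `drop` split pattern above easier to verify, here is a minimal sketch on a made-up toy frame (the column values are random placeholders, not the auto dataset). It only checks that the two parts are disjoint and have the expected sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the auto dataset (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "width": rng.normal(65, 2, size=40),
    "length": rng.normal(170, 10, size=40),
    "price": rng.normal(13000, 4000, size=40),
})

# Same pattern as in the post: sample 75% of the rows for training ...
toy_train = toy.sample(frac=0.75, random_state=1)
# ... and keep every row whose index was NOT sampled as the test set.
toy_test = toy.drop(toy_train.index)

# The two parts should be disjoint and together cover the whole frame.
n_train, n_test = len(toy_train), len(toy_test)
overlap = toy_train.index.intersection(toy_test.index)
print(n_train, n_test, len(overlap))
```

`sklearn.model_selection.train_test_split` would be a common alternative for the same job; the pandas version keeps the original row index, which makes this kind of check straightforward.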

2> MLR model

#mlr model
#20 features
from sklearn.linear_model import LinearRegression

model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

3> Find the best ridge-regularized model with RidgeCV and compare the coefficients before and after applying RidgeCV

#ridgeCV
alphas = np.arange(1, 200, 1)

ridge_mlr = RidgeCV(alphas=alphas, cv=10)

#fitting 
ridge_mlr.fit(X_train, y_train) 

print(ridge_mlr.coef_, ridge_mlr.intercept_, ridge_mlr.alpha_)
print(ridge_mlr.best_score_)

#[-7.56491516e+00 -7.56491516e+00  2.78681974e+02 -1.06845870e+01
#  8.61802076e+01 -7.29448106e+01  1.27669085e+02  2.37005747e+02
#  2.26131771e+00  8.68613041e+01  8.27692719e+02 -2.19957856e+03
#  3.19741534e+02  5.01485386e+01  1.73810064e+00 -4.85339063e+01
#  2.47655001e+02  7.95687055e+02  7.05511891e+01 -7.05511891e+01] -49415.03407219711 5
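As a rough illustration of what `RidgeCV(alphas=..., cv=...)` does when it picks `alpha_`, the sketch below repeats the search by hand on synthetic data. The data, the small alpha grid, and `cv=5` are assumptions chosen for illustration; with an integer `cv` and the default scoring, each alpha is scored by k-fold cross-validated R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration, not the auto dataset).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=8, n_informative=4, noise=15.0, random_state=0
)

demo_alphas = np.array([0.1, 1.0, 10.0, 100.0])

# RidgeCV scores every alpha with 5-fold cross-validation and keeps the best.
ridge_cv = RidgeCV(alphas=demo_alphas, cv=5).fit(X_demo, y_demo)

# The same search written out by hand: mean CV R^2 for each alpha.
manual_scores = [
    cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=5).mean()
    for a in demo_alphas
]
manual_best_alpha = demo_alphas[int(np.argmax(manual_scores))]

print(ridge_cv.alpha_, manual_best_alpha)
print(ridge_cv.best_score_, max(manual_scores))
```

Read this way, `best_score_` is a mean cross-validation score on the training data, not a test-set score, so it is not directly comparable with the test metrics reported later.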

# plot MLR coefficients
coefficients = pd.Series(model_lr.coef_, X_train.columns)
plt.figure(figsize=(10,5))
coefficients.sort_values().plot.barh()
plt.show()

print(model_lr.coef_.mean())
#1251.874366308613

print(model_lr.coef_.var())
#63032739.06265261

# plot Ridge coefficients
coefficients = pd.Series(ridge_mlr.coef_, X_train.columns)
plt.figure(figsize=(10,5))
coefficients.sort_values().plot.barh()
plt.show()

print(ridge_mlr.coef_.mean())
#35.722544617003535

print(ridge_mlr.coef_.var())
#323993.716524918
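The mean of signed coefficients can understate how large they are, because positive and negative values cancel. A small helper like the one below (the name `coef_summary` is hypothetical, not part of scikit-learn) could report the mean absolute value and the L2 norm alongside the mean and variance printed above.

```python
import numpy as np

def coef_summary(coef):
    """Return signed and unsigned summaries of a coefficient vector."""
    coef = np.asarray(coef, dtype=float)
    return {
        "mean": coef.mean(),                # signed mean: signs can cancel
        "var": coef.var(),                  # population variance, as in ndarray.var()
        "mean_abs": np.abs(coef).mean(),    # average magnitude
        "l2_norm": np.linalg.norm(coef),    # the quantity the ridge penalty is built on
    }

# Hypothetical coefficients whose signs cancel out exactly.
example = coef_summary([3.0, -3.0, 1.0, -1.0])
print(example)
```

Calling `coef_summary(model_lr.coef_)` and `coef_summary(ridge_mlr.coef_)` would give a before/after comparison that does not depend on the signs.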

4> Comparing the coefficients

- (top) before Ridge / (bottom) after Ridge -

☝🏻 The coefficient values before ridge are noticeably larger than after. Before ridge, width and length dominate the target far more than any other feature, so the remaining features have almost no say in the prediction. After ridge, the dominant width and length coefficients drop sharply. It may look as if stroke has now become overwhelmingly influential, but the mean of all coefficients falls from about 1251 to about 35. In other words, the L2 penalty reduces every feature's influence on the target to some degree.

✌🏻 The variance of the coefficients is 63032739 before ridge and 323993 after, a huge reduction. The penalty sharply narrowed the spread between coefficients, which means relatively more features now have some influence on the target, and the situation where the target is determined by only a handful of features has been mitigated to some degree!

🤟🏻 A feature-selection effect - after applying ridge, stroke stands out as the decisive feature, so the ridge model tells us which feature matters most. (We could of course pick features from the coefficients before ridge as well, but those may be overfitted to the training set and are therefore less reliable. In this experiment the feature that stands out did in fact change!) We also use ridge because what we care about is good predictive performance on new test data rather than on the existing training set. The L2 penalty shrinks the coefficients so that the model does not become too complex while still predicting reasonably well. (The L1 penalty drives the coefficients of relatively uninfluential features all the way to 0, so if feature selection is the goal, LASSO looks like the better choice. A post on that is planned.)
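The contrast drawn above between L2 (shrink) and L1 (set to zero) can be sketched on synthetic data. Everything below is an assumption made for illustration: random features, a target driven by only the first two of them, and arbitrarily chosen penalty strengths.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 10 features, but only the first two drive the target.
rng = np.random.default_rng(42)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = (
    50.0 * X_demo[:, 0]
    - 30.0 * X_demo[:, 1]
    + rng.normal(scale=5.0, size=n_samples)
)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=50.0).fit(X_demo, y_demo)   # L2 penalty
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)    # L1 penalty

# Compare overall coefficient size and the number of exactly-zero coefficients.
for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    l2_norm = np.linalg.norm(model.coef_)
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"{name:5s} L2 norm={l2_norm:8.3f} exact zeros={n_zero}")
```

In a run like this, ridge should give a smaller coefficient norm than OLS while keeping every coefficient nonzero, whereas lasso sets the uninformative coefficients exactly to zero. That is why reading "feature selection" off ridge coefficients is only a soft ranking.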

② Checking numerically how much the model's performance improved after ridge

5> Evaluating the models on the test data (MLR & ridge)

#predicting 
y_test_pred_MLR = model_lr.predict(X_test)

print(r2_score(y_test, y_test_pred_MLR), mean_squared_error(y_test, y_test_pred_MLR))
#0.8977890607619405 5466979.56131857

#predicting (ridge)
y_test_pred_ridge = ridge_mlr.predict(X_test)

print(r2_score(y_test, y_test_pred_ridge), mean_squared_error(y_test, y_test_pred_ridge))
#0.8991869412385929 5392210.812963509

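👌🏻 Cross-validation picked alpha = 5 as the best regularization strength, and compared with the plain MLR model both the R² (0.8978 → 0.8992) and the MSE (5466979 → 5392210) improved, if only slightly. From this result we can say that we have seen the effect of L2 regularization!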
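To put "slightly" into numbers, the arithmetic below uses only the four test metrics printed above. The RMSE conversion is an added convenience so that the error reads in the same units as price.

```python
import math

# Test-set metrics reported above (MLR first, then ridge).
r2_mlr, mse_mlr = 0.8977890607619405, 5466979.56131857
r2_ridge, mse_ridge = 0.8991869412385929, 5392210.812963509

# Absolute gain in R^2 and relative reduction in MSE.
r2_gain = r2_ridge - r2_mlr
mse_drop_pct = 100.0 * (mse_mlr - mse_ridge) / mse_mlr

# RMSE is in the same units as the target (price).
rmse_mlr, rmse_ridge = math.sqrt(mse_mlr), math.sqrt(mse_ridge)

print(f"R^2 gain: {r2_gain:.4f}")
print(f"MSE reduction: {mse_drop_pct:.2f}%")
print(f"RMSE: {rmse_mlr:.1f} -> {rmse_ridge:.1f}")
```

The MSE falls by roughly 1.4%, and the RMSE moves from about 2338 to about 2322 in price units. That is a real but small gain on a single train/test split.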

- Finally, a copy-pasted comment about ridge from a user who summed it up 'really well' 👏🏻 -

'If our underlying data follows a relatively simple model, and the model we use is too complex for the task, what we are essentially doing is we are putting too much weight on any possible change or variance in the data. Our model is overreacting and overcompensating for even the slightest change in our data (i.e. overfitting). People in the field of statistics and machine learning call this phenomenon overfitting. When you have features in your dataset that are highly linearly correlated with other features, turns out linear models will be likely to overfit (in the example above, width in particular was so dominant that it hinted at overfitting - when one feature has a very large influence, the target reacts sensitively to even a small change in that feature - so that feature's influence needs to be softened to some degree). Ridge Regression, avoids over fitting by adding a penalty to models that have too large coefficients (as a result of the penalty, the influence of width dropped markedly).'
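For reference, the "penalty on too large coefficients" in the quote can be written out as follows. This is the standard ridge objective in the form scikit-learn's `Ridge` documents, with the intercept $\beta_0$ left unpenalized; the symbols $x_i$, $y_i$, $\beta$ and $n$ are notation introduced here, not taken from the post.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
    \;+\; \alpha \,\lVert \beta \rVert_2^2 ,
\qquad \alpha \ge 0 .

% With centered X and y, the minimizer has the closed form
\hat{\beta}^{\text{ridge}} = \bigl( X^{\top}X + \alpha I \bigr)^{-1} X^{\top} y .
```

Setting $\alpha = 0$ recovers ordinary least squares (the MLR model above), and a larger $\alpha$ pulls every coefficient toward zero. When features are highly correlated, $X^{\top}X$ is close to singular, and adding $\alpha I$ is what stabilizes the solution.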

* For the post on overfitting/underfitting, see below ↓↓↓↓

Overfitting/Underfitting & Bias/Variance Tradeoff

sh-avid-learner.tistory.com

* Reference) https://stats.stackexchange.com/questions/251708/when-to-use-ridge-regression-and-lasso-regression-what-can-be-achieved-while-us

License: Attribution / NonCommercial / NoDerivatives
