
feature selection (1) - selectKBest (+jointplot)

metamong 2022. 4. 20.

🤳 In a previous post, we briefly went over the concept of 'feature engineering'.

 

 

▷ previous post: FE - Feature Engineering (sh-avid-learner.tistory.com) ◁

 

๐Ÿƒ‍โ™‚๏ธ ์šฐ๋ฆฌ๊ฐ€ ์‹ค์ƒํ™œ์˜ data๋ฅผ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ๋งํ•˜๋Š” machine learning์˜ ์„ธ๊ณ„์—์„œ ๋ฌด์ˆ˜ํžˆ ๋งŽ์€ feature๋ฅผ ๋งŒ๋‚ ํ…๋ฐ, ์ด ๋ชจ๋“  feature๋ฅผ modeling์— ๋Œ€์ž…ํ•˜๋ฉด model์ด ํ„ฐ์ง„๋‹ค..? ๊ธฐ์กด ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ๋ฐฉ์‹์˜ model์˜ ์„ฑ๋Šฅ์ด ์•ˆ๋‚˜์˜ฌ ํ™•๋ฅ ์ด ๋†’๋‹ค. ์ตœ๋Œ€ํ•œ ์ค‘์š”ํ•œ feature๋งŒ ๋‚จ๊ฒจ๋‘๊ฑฐ๋‚˜, ๊ธฐ์กด์— ์ฃผ์–ด์ง„ feature๋“ค ์ผ๋ถ€๋ฅผ domain knowledge์— ์˜ํ•ด ์žฌ์กฐํ•ฉํ•ด ์ค‘์š”ํ•œ feature๋ฅผ ๋งŒ๋“œ๋Š” feature๋ฅผ ์ด์šฉํ•œ ์—ฌ๋Ÿฌ ์ž‘์—…์„ ํ•จ์œผ๋กœ์จ ์šฐ๋ฆฌ๋Š” modeling์˜ ์„ฑ๋Šฅ์„ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณธ๋‹ค.

 

👓 In this post we look at the first of the many feature selection methods, i.e. ways of picking out only the important features: using SelectKBest()!


▷ SelectKBest docu ◁

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

 

class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

 

🌂 sklearn's feature_selection module offers methods in many different forms. Among them, SelectKBest() belongs to univariate feature selection: it assumes every feature affects the target one at a time (each feature's association with the target is evaluated independently, hence 'univariate'; a multivariate method would instead assume, when scoring, that two or more variables affect the target at once). It rates each feature's influence with a scoring method of your choice and selects only the k features (10 by default) with the highest scores.

 

'Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method'

'SelectKBest removes all but the k highest scoring features'
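
To make 'univariate' concrete, the score function can also be called on its own: it rates every feature against the target independently. A minimal sketch on synthetic data (the toy dataset below is illustrative, not the car data used later):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# toy data: 5 features, only 2 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=10.0, random_state=0)

# f_regression rates each feature against y independently ("univariate"):
# one F-statistic and one p-value per feature
F_scores, p_values = f_regression(X, y)
print(F_scores)
print(p_values)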

 

👍 Which score_func to apply depends on the nature of the target, i.e. on whether the problem is regression or classification:

→ regression) f_regression / mutual_info_regression

→ classification) chi2 / f_classif / mutual_info_classif

'The methods based on F-test estimate the degree of linear dependency between two random variables'

= that is, score functions prefixed with f_ assume an F-distribution and measure the degree of linear dependency between the two variables.
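
For a classification target the usage is identical, just with a classification score function. A minimal sketch on sklearn's built-in iris data (chi2 requires non-negative feature values, which iris satisfies):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# keep the 2 features most associated with the class label
X_new = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X.shape, '->', X_new.shape)  # (150, 4) -> (150, 2)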

 

<<the various score_func options>>

 

 

Q. Can SelectKBest be run when the data contains categorical-type features?

 

A. Not easily.

→ After digging through various docs and many articles and Q&As: unless the categorical columns go through an encoding step (one-hot, ordinal, etc.), running SelectKBest is difficult.

→ So it's recommended to run an encoding preprocessing step first and only then apply SelectKBest!

→ Otherwise an error like the one below is raised:

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
  X = check_array(

(+) As long as the data is numeric, SelectKBest works on both discrete and continuous features.
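
A small sketch of that recommended order: encode first, then select. The tiny DataFrame and its columns below are made up purely for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_regression

# hypothetical frame with one categorical column
cars = pd.DataFrame({'fuel': ['gas', 'diesel', 'gas', 'diesel'],
                     'horsepower': [111, 102, 154, 95],
                     'price': [13495, 16500, 16500, 7295]})

# 1) encode the categorical column into numbers first
cars['fuel'] = OrdinalEncoder().fit_transform(cars[['fuel']]).ravel()

# 2) only then run SelectKBest on the all-numeric frame
X, y = cars.drop(columns='price'), cars['price']
X_best = SelectKBest(score_func=f_regression, k=1).fit_transform(X, y)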

 

-- now with a hands-on example! --

① Import the desired scoring method and SelectKBest

 

from sklearn.feature_selection import f_regression, SelectKBest

 

② Define the selector object (that is, create a kind of 'instruction sheet' (an object) for which method to use and how many features to pick)

 

selector = SelectKBest(score_func=f_regression, k=10)

 

③ Apply that instruction sheet to the train and test sets (and validation, if there is one), or to the whole data

(fit_transform first, then transform: fit only once so the learned selection is applied, and afterwards only transform needs to be run!)

 

## fit_transform on the training data
X_train_selected = selector.fit_transform(X_train, y_train)

## transform only on the test data
X_test_selected = selector.transform(X_test)

 

④ Check the shape of the resulting data with the reduced feature set

(From here on, just carry out the modeling with X_train_selected.)

 

X_train_selected.shape, X_test_selected.shape

++ a look at useful selector methods ++

 

get_support()

 

** Important! get_support() returns True for each selected feature and False otherwise, so this method tells you exactly which features were chosen.

 

selected_mask = selector.get_support()

all_names = X_train.columns

## the selected features
selected_names = all_names[selected_mask]

## the features that were not selected
unselected_names = all_names[~selected_mask]

 

Beyond that, fit() and transform() can be used directly: feed in X and y and the right features are selected, as sketched below.
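
A minimal sketch of that two-step form, reusing the selector, X_train, and y_train from above; together the two calls are equivalent to the fit_transform() shown earlier:

# fit() computes the per-feature scores from the training data...
selector.fit(X_train, y_train)

# ...and transform() then keeps only the k best-scoring columns
X_train_selected = selector.transform(X_train)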

 

++ extra) printing the attribute with the highest score

 

<<use numpy.argmax()!>>

 

selector.feature_names_in_[selector.scores_.argmax()]

 

<<to print the top-n scoring attributes, use .argsort()[-n:]>>

 

selector.feature_names_in_[selector.scores_.argsort()[-n:]]
# the n features with the top scores

 

<<to print the n attributes with the lowest scores, use .argsort()[:n]>>

 

selector.feature_names_in_[selector.scores_.argsort()[:n]]
# the n feature names with the lowest scores
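
As an alternative to indexing with argsort(), all scores can be inspected at once as a sorted pandas Series. A sketch, assuming the selector was fitted on a DataFrame so that feature_names_in_ is available:

import pandas as pd

# every feature with its score, highest first
scores = pd.Series(selector.scores_, index=selector.feature_names_in_)
print(scores.sort_values(ascending=False))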

++ visualizing the correlation between two variables with seaborn's jointplot ++

 

∽ jointplot docu ∽

https://seaborn.pydata.org/generated/seaborn.jointplot.html

 

seaborn.jointplot(*, x=None, y=None, data=None, kind='scatter', color=None, height=6, ratio=5, space=0.2, dropna=False, xlim=None, ylim=None, marginal_ticks=False, joint_kws=None, marginal_kws=None, hue=None, palette=None, hue_order=None, hue_norm=None, **kwargs)

 

🤖 "Draw a plot of two variables with bivariate and univariate graphs"

≫ Used mainly when you want to grasp the relationship between two continuous numeric variables at a glance!

 

"kind{ “scatter” | “kde” | “hist” | “hex” | “reg” | “resid” }"

≫ A jointplot can render the relationship between two variables as several kinds of graphs; to show the correlation of two numerical variables, the most common choice is a scatterplot with a regression line overlaid (kind='reg').
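
A minimal sketch on seaborn's built-in tips dataset (any two numeric columns would do):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

# scatterplot + fitted regression line in the joint axes,
# histograms of each variable in the margins
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')
plt.show()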


Example) <a performance experiment over the choice of k>

 

Q. Given the car price data and a multiple linear regression (MLR) model built on it, run feature selection with SelectKBest and f_regression to decide how many of the variables useful for determining car price are worth keeping. Then, with a chosen evaluation metric, check numerically how much the MLR model's performance improves before vs. after feature selection. Visualize how the error changes with the number of features, and find the number of features, and the specific features selected, that give the best model performance without overfitting.

(+ extra task: justify the features selected by SelectKBest on the basis of jointplot visualizations)

 

A. 

 

1> imports & dataset preparation

 

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/module_5_auto.csv'
df = pd.read_csv(path)
df.to_csv('module_5_auto.csv')

# keep only the numeric columns
df = df._get_numeric_data()

 

2> split into train / test sets & separate the target from the rest of the data

 

train = df.sample(frac=0.75, random_state=1)
test = df.drop(train.index)

train.dropna(inplace=True)
test.dropna(inplace=True)

target = 'price'  # y is the car price

## split into X_train, y_train, X_test, y_test
X_train = train.drop(columns=target)
y_train = train[target]
X_test = test.drop(columns=target)
y_test = test[target]

 

3> with R² score & MAE as evaluation metrics, run SelectKBest for k from 1 up to the full number of columns and visualize the resulting metrics on the training & test sets

 

๋”๋ณด๊ธฐ
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

training = []
testing = []
feature_names = []
all_names = X_train.columns  # needed below to look up the selected names

x = range(1, len(X_train.columns) + 1)

for num in range(1, len(X_train.columns) + 1):
    selector = SelectKBest(score_func=f_regression, k=num)

    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    selected_mask = selector.get_support()
    selected_names = all_names[selected_mask]
    feature_names.append(selected_names)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)

    y_train_pred = model.predict(X_train_selected)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    training.append(train_mae)

    y_test_pred = model.predict(X_test_selected)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    testing.append(test_mae)

plt.plot(x, training, label='Training MAE', color='b')
plt.plot(x, testing, label='Testing MAE', color='g')
plt.axvline(6.0, 0, 1, color='red', linestyle='--', linewidth=2)
plt.ylabel("MAE ($)")
plt.xlabel("Number of Features")
plt.title('Validation Curve')
plt.legend()
plt.show()

print('------selected feature names------')
for a in feature_names[5].values[0:6]:  # feature_names[5] holds the k=6 selection
    print(a)

 

 

4> interpreting the results

🧸 Looking at the validation-curve visualization (MAE), the error keeps falling as the number of features grows and the model fits them ever more closely. But given the overfitting risk and the efficiency gain of shorter training time, fewer features are preferable, so the point where the error has already dropped by a large margin, namely 6 features, can be judged the optimal selection (going by SelectKBest alone). One might argue that 20 features, where the test-set error is lowest, is best; but the test set here really plays the role of a validation set, there is no guarantee that a genuinely new test set would show equally low error, and in the real world such strong predictability is rare, so selecting within a small, adequate number of features seems optimal. The R² visualization points the same way: the test-set coefficient of determination jumps at 6 features, which looks like the best spot, and although R² keeps inching up afterwards, the increase is weak enough to disregard. 🧸
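
The elbow at 6 features was read off the plot; as a rough numerical cross-check (a sketch reusing the testing list from the loop above), the per-step drop in test MAE can be printed directly and should shrink sharply past the elbow:

import numpy as np

test_mae = np.asarray(testing)

# k with the absolute minimum test MAE (not automatically the best choice)
print('lowest-error k:', test_mae.argmin() + 1)

# drop in test MAE each time one more feature is added;
# the elbow is where these drops stop being meaningful
print(np.round(-np.diff(test_mae), 1))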

 

(++ for details on the R² & MAE error metrics, see the post below ++ the adjusted-R² metric will also be covered later ✌️)

 

 

▷ All About Evaluation Metrics (1/2) → MSE, MAE, RMSE, R² (sh-avid-learner.tistory.com) ◁

 

→ per the code output, the optimal 6 features are width, curb-weight, engine-size, horsepower, highway-mpg & city-L/100km

 

5> visualizing the regression line between a selected feature and the target with jointplot (scatterplot included)

 

<jointplot results for horsepower (one of the selected features) and compression-ratio (one of the unselected features)>

 

# code source: Bhavesh Bhatt
from scipy.stats import pearsonr

def plot_join_plot(df, feature, target):
    # scatterplot + regression line in the joint axes, histograms in the margins
    j = sns.jointplot(x=feature, y=target, data=df, kind='reg')
    # j.annotate(stats.pearsonr) is gone in recent seaborn versions..
    return plt.show()

train_df = pd.concat([X_train, y_train], axis=1)

plot_join_plot(train_df, 'compression-ratio', 'price')
plot_join_plot(train_df, 'horsepower', 'price')

 

 

👉 horsepower, fittingly for a selected feature, shows a broadly positive correlation with price, whereas compression-ratio has a regression-line slope of nearly 0 (price barely moves with compression-ratio), which is presumably why SelectKBest left it out of the selected features.
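
To put numbers behind that reading, the Pearson correlation of each feature with price can be checked directly (a sketch reusing the pearsonr import and train_df from the block above):

for col in ['horsepower', 'compression-ratio']:
    r, p = pearsonr(train_df[col], train_df['price'])
    print(f'{col}: r = {r:.3f} (p = {p:.3g})')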

- ๋” ๋‹ค์–‘ํ•œ feature selection ๊ธฐ๋ฒ• ์ถ”ํ›„ ํฌ์ŠคํŒ… ์˜ˆ์ • ๐Ÿฆพ -


* source) https://www.kaggle.com/jepsds/feature-selection-using-selectkbest

* kBest source) https://www.youtube.com/watch?v=UW9U0bYJ-Ys

* jointplot source) https://www.geeksforgeeks.org/python-seaborn-jointplot-method/

