
PCA(w/code)

metamong 2022. 5. 31.

๐Ÿ‘ ์ €๋ฒˆ ์‹œ๊ฐ„์— PCA์˜ ๊ฐœ๋… ๋ฐ ์ฃผ์ถ•์„ ์ฐพ๊ธฐ๊นŒ์ง€์˜ ์ž์„ธํ•œ ๊ณผ์ •์„ ์—ฌ๋Ÿฌ ๋ฐฉ๋ฉด์œผ๋กœ ๋ถ„์„ํ•ด๋ณด๊ณ  ์•Œ์•„๋ณด์•˜๋‹ค.

 

▶ previous post: PCA(concepts) | sh-avid-learner.tistory.com ◀

๐Ÿ‘ ์ด์ œ๋Š” ์ง์ ‘ code๋กœ ์‹คํ–‰ํ•ด scree plot์œผ๋กœ ์‹œ๊ฐํ™”ํ•ด๋ณด๊ณ  ์ฃผ์–ด์ง„ unsupervised data๋ฅผ ์•Œ๋งž๊ฒŒ clusteringํ•ด ์‹ค์ œ data๊ฐ€ PC ์ถ•์— ๋งž๊ฒŒ ์ž˜ ๋ถ„๋ฆฌ๊ฐ€ ๋˜๋Š”์ง€ ์ฒดํฌํ•ด๋ณด๋Š” ๊ณผ์ •๊นŒ์ง€ ํ•ด ๋ณด๋ ค ํ•œ๋‹ค!

 

๐Ÿ‘ PCA์— ๋Œ€ํ•ด์„œ ๋ฐฐ์› ๋˜ ๊ฐœ๋…์„ ์•„๋ž˜ ์ž์„ธํžˆ 6๊ฐ€์ง€์˜ step์œผ๋กœ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋‹ค.

 

โ‘  standardization

(can be computed by hand with np.mean and np.std, or with sklearn's StandardScaler())

โ‘ก constructing a covariance matrix

(use np.cov, or let sklearn's PCA compute it internally)

โ‘ข performing eigendecomposition of covariance matrix (decompose the matrix into its eigenvectors & eigenvalues)

(use np.linalg.eig, or let the sklearn PCA class mentioned above do the computation)

โ‘ฃ selection of most important eigenvectors & eigenvalues

(์ง์ ‘ eigenvalue๋ฅผ ๋‚˜์—ดํ•ด PC๋ฅผ ๊ณ ๋ฅด๊ฑฐ๋‚˜ / sklearn์˜ PCA parameter๋กœ ์ •ํ•ด๋†“์€ ์ˆ˜๋งŒํผ PC๋ฅผ ์•Œ์•„์„œ ๊ณจ๋ผ์ค€๋‹ค)

⑤ constructing a projection matrix (using selected eigenvectors) - we called this the feature matrix!

โ‘ฅ transformation of training/test dataset
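👍 Before turning to sklearn, here is a minimal NumPy sketch of all six steps in one place (the toy array X and the variable names are purely illustrative, not from the car data below):

# minimal NumPy sketch of the six steps above (toy, illustrative data)
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]]) # rows = samples, cols = features

# ① standardization
X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# ② covariance matrix (np.cov expects variables in rows, hence the transpose)
cov_mat = np.cov(X_std.T)

# ③ eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# ④ sort eigenvalues in descending order and keep the top-k eigenvectors
order = np.argsort(eig_vals)[::-1]
k = 1

# ⑤ projection (feature) matrix built from the selected eigenvectors
W = eig_vecs[:, order[:k]]

# ⑥ transform the data onto the new axis
X_pca = X_std @ W
print(X_pca)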


Q. We have a dataframe containing information on cars that use diesel as fuel and cars that do not. Pick only the feature columns appropriate for PCA and perform feature reduction, leaving just the two principal axes PC1 and PC2. Then, through pca plot and scree plot visualizations, show whether the target we set up (diesel use) can easily be clustered using only the two new features (PC1, PC2), justifying whether the clustering problem is solved.

 

A. 

1> preparing the dataset (removing the null column)

#1) load the dataset
import pandas as pd

path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/module_5_auto.csv'
car = pd.read_csv(path)
car.to_csv('module_5_auto.csv') # keep a local copy
car = car._get_numeric_data() # keep only the numeric columns

#1) + check if there is null data in the dataset
nulls_checked = car.isnull().sum()
nulls_col = nulls_checked[nulls_checked != 0].index[0] # the 'stroke' col is the only one with nulls

car = car.loc[:, car.columns != nulls_col] # drop the null column
car = car.iloc[:, 3:19] # keep the 16 columns used below (15 features + the 'diesel' target)

 

2> separating features from the target, then standardizing the features

(Here diesel use was chosen as the target. Strictly speaking, PCA has no notion of a target and aims to reduce the data as-is, but to demonstrate the effect of PCA we separate the target now, so that later we can visualize whether the data separates well in just two dimensions!)

#2) standardization
from sklearn.preprocessing import StandardScaler
target = 'diesel'
# Separating out the features
x = car.loc[:, car.columns != target].values #15 dimensions

# Separating out the target
y = car.loc[:,[target]].values

# Standardizing the features
x = StandardScaler().fit_transform(x)

print(x)

 

- the standardized feature values (printed output) -
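Since the printed output is not reproduced here, a quick sanity check can confirm the scaling worked (each column of x should now have mean ≈ 0 and std ≈ 1):

# sanity check on the standardized array x (mean ≈ 0, std ≈ 1 per column)
import numpy as np
print(np.round(x.mean(axis=0), 3))
print(np.round(x.std(axis=0), 3))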

 

 

3> performing PCA (n_components set to 2!)

 

▶ sklearn.decomposition.PCA documentation ◀

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

 

class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)

 

#3) perform PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2) #dimensionality 15 to 2

principalComponents = pca.fit_transform(x) #perform PCA

principalDf = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])

finalDf = pd.concat([principalDf, car[[target]]], axis = 1)

finalDf

 

 

4> visualizing the pca plot & scree plot (the two targets are given different colors to tell them apart)

#4) visualize 2D projection - pca plot
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('PC1', fontsize = 15)
ax.set_ylabel('PC2', fontsize = 15)
ax.set_title('PCA w/ two components', fontsize = 20)

labels = [0, 1] # renamed from `target` so the target column name above is not shadowed
colors = ['r', 'b']
for label, color in zip(labels, colors):
    indicesToKeep = finalDf['diesel'] == label
    ax.scatter(finalDf.loc[indicesToKeep, 'PC1']
               , finalDf.loc[indicesToKeep, 'PC2']
               , c = color
               , s = 50)
ax.legend(['no diesel', 'yes diesel'])
ax.grid()

 

import numpy as np

def scree_plot(pca):
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_

    ax = plt.subplot()
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals, color = ['#00da75', '#f1c40f', '#ff6f15', '#3498db']) # bar plot: variance explained per PC
    ax.plot(ind, cumvals, color = '#c0392b') # line plot: cumulative variance explained

    for i in range(num_components):
        # annotate each bar with its explained variance in percent
        ax.annotate(f"{vals[i]*100:.1f}", (ind[i], vals[i]), va = "bottom", ha = "center", fontsize = 13)

    ax.set_xlabel("PC")
    ax.set_ylabel("Variance")
    plt.title('Scree plot')

scree_plot(pca)

 

 

5> analyzing the visualization results

👉 We took the car data, which has 15 features in total, projected it onto just the two axes that best explain its variance, PC1 and PC2, and visualized the result. From the pca plot we can infer that PC1 alone cannot accurately discriminate diesel use, but once the PC2 axis is added, diesel vs. non-diesel can be clustered quite distinctly.

 

👉 The scree plot likewise shows that PC1 alone explains only about 54% of the total variance. With PC2's 17% added, the combined explanatory power of about 70% is enough for diesel use to be clustered.
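The percentages read off the scree plot can also be checked directly on the fitted pca object (the numbers in the comments are the approximate values read from the plot above, not exact figures):

# explained variance ratio per PC and the cumulative sum
print(pca.explained_variance_ratio_)           # roughly [0.54, 0.17] here
print(pca.explained_variance_ratio_.cumsum())  # cumulative: roughly 0.70 in total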

 

👉 Looking at the raw amounts of data, the diesel cars account for an overwhelmingly small share, so there is a target imbalance problem to begin with. With a more even target distribution and a sufficient number of samples, I'd expect an even better clustering result.
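A one-line check on the same dataframe makes this imbalance concrete:

# share of non-diesel (0) vs. diesel (1) cars in the dataset
print(car['diesel'].value_counts(normalize=True))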

 

🕵️ follow-up

→ We demonstrated the data reduction (extraction) effect of PCA through the pca plot and the scree plot! Next, I'd like to demonstrate the difference in elapsed time and model accuracy between feeding pre-PCA and post-PCA data into an actual ML model 🙌


* Source 1) https://vitalflux.com/feature-extraction-pca-python-example/

* Source 2) https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
