
K-Means Clustering (concepts + w/code)

metamong 2022. 6. 8.

💙 With that K in the name it feels like something we're obliged(?) to know out of Korean pride... and either way, it's a concept you absolutely must know!

 

💙 Earlier we learned about clustering, the representative kind of Unsupervised Learning; among clustering methods, let's now look at the most representative one, K-Means!

 

「Unsupervised Learning」 sh-avid-learner.tistory.com

 

๐Ÿ’™ ์ผ๋‹จ k-means๋Š” unsupervised learning ๊ธฐ๋ฒ• ์ค‘ clustering - centroid-based clustering (hard clustering)๊ธฐ๋ฒ•์ด๋‹ค.

 

step-by-step>

① Decide the number of clusters k (how to pick k is explained below!)

② Pick k random points from the given data as the initial cluster centroids.

③ For the first data point, compute its distance to each cluster's centroid, cluster by cluster.

④ Assign that data point to the cluster whose distance came out smallest.

⑤ Repeat ③ and ④ for every data point.

(※ in scikit-learn, the number of full assign-and-update rounds is capped by the max_iter argument of KMeans())

⑥ Recompute the centroid (mean) of each changed cluster.

⑦ With the updated clusters, repeat ③ and ④ again for all the data.

⑧ When the clusters stop changing, done!

 

 

→ The per-cluster distance mentioned in step ④ above is the Euclidean distance between each data point and each cluster's centroid. The point joins the cluster whose centroid gives the smallest distance.
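
To make the loop concrete, here is a minimal NumPy sketch of the assign-and-update cycle (steps ③–⑦); the helper name kmeans_step and the toy data are mine for illustration, not scikit-learn API.

import numpy as np

def kmeans_step(X, centroids):
    # steps ③-④: squared Euclidean distance from every point to every centroid,
    # then assign each point to its nearest centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # step ⑥: move each centroid to the mean of its assigned points
    # (a real implementation must also handle clusters that end up empty)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

# toy run: step ②, then repeat ③-⑦ until assignments stop changing (step ⑧)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centroids = X[rng.choice(len(X), size=3, replace=False)]
labels = None
while True:
    new_labels, centroids = kmeans_step(X, centroids)
    if labels is not None and (new_labels == labels).all():
        break
    labels = new_labels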

 

👑 Evaluating the clustering?

 

→ The centroids no longer moving and k-means wrapping up does not mean we're finished. To judge whether the clustering is any good, we look at the sum, over all data points, of the squared Euclidean distance between each point and the centroid of the cluster it belongs to.

(※ this total is known for short as the SSE, the Sum of Squared Errors)
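
As a sketch, computing the SSE by hand is nearly a one-liner (the helper sse() is hypothetical, not a library function). This is the same quantity scikit-learn exposes as the inertia_ attribute, covered below.

import numpy as np

def sse(X, labels, centroids):
    # sum of squared Euclidean distances from each point to its own centroid
    return sum(((X[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))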

→ As mentioned among k-means' characteristics above, k-means is non-deterministic, so different initial centroids can produce different clustering results. In other words, the SSE differs from one initial state to another, and in practice researchers run the initialization several times and take the run whose SSE comes out lowest as the final k-means clustering result.

* KMeans() (from scikit-learn)>

「sklearn.cluster.KMeans() docs」

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

 

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

 

- parameter descriptions -

๐ŸŒ n_clusters - cluster ๊ฐœ์ˆ˜ 

 

๐ŸŒ init - 'k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details. / random’: choose n_clusters observations (rows) at random from data for the initial centroids.'

→ Setting init to 'k-means++' starts the clustering from a well-chosen initial state, so the centroids can be found faster.

→ With 'random', the clustering algorithm starts from completely random points rather than a well-suited initial state.

 

๐ŸŒ n_init - ์ดˆ๊ธฐํ™” ์นด์šดํŒ…: ์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด ์‹คํ—˜ํ•  ๋•Œ n_init๋ฒˆ clusteirng์„ ์ง„ํ–‰ํ•˜๊ณ , n_init๋ฒˆ์˜ clusteirng ์ค‘ ๊ฐ€์žฅ SSE๊ฐ€ ์ ๊ฒŒ ๋‚˜์˜จ ๊ฒฐ๊ณผ์˜ clustering ๊ฒฐ๊ณผ๋ฅผ ์ฑ„ํƒํ•˜๊ฒŒ ๋œ๋‹ค

 

๐ŸŒ max_iter - clustering์—์„œ ์ตœ๋Œ€ max_iter๋ฒˆ ํ•œ ๋’ค์˜ ๊ฒฐ๊ณผ๋ฅผ cluster์˜ centroid๋กœ ์ •ํ•œ๋‹ค. (max_iter๋ฒˆ ํ•˜๊ธฐ ์ „์— ์ด๋ฏธ centroid๊ฐ€ ๋” ์ด์ƒ ๋ณ€ํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด ๋ฐ”๋กœ clustering ๊ฒฐ๊ณผ๋กœ ์ฑ„ํƒ)

 

- attributes -

๐ŸŒ cluster_centers_ - clusteringํ•˜๊ณ  ๋‚œ ๋’ค์˜ ๊ฐ๊ฐ์˜ cluster centroid ์ขŒํ‘œ๋ฅผ returnํ•ด์ค€๋‹ค.

 

๐ŸŒ inertia_ - ์—ฌ๋Ÿฌ ๋ฒˆ์˜ initialization์œผ๋กœ ์‹คํ—˜ํ•œ, ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์ตœ์ข… cluster ์ค‘ SSE๊ฐ€ ๊ฐ€์žฅ ์ตœ์†Ÿ๊ฐ’์œผ๋กœ ๋‚˜์˜จ ๊ฒฐ๊ณผ๋ฅผ return

 

๐ŸŒ labels_ - ๋ถ„๋ฅ˜ํ•œ label ์ด๋ฆ„ return

 

๐ŸŒ n_iter_ - ์œ„์—์„œ max_iter ์ตœ๋Œ€ ํšŸ์ˆ˜ ๋‚ด์—์„œ ์ตœ์ข…์ ์œผ๋กœ n_iter๋ฒˆ๋งŒํผ repeatํ•œ ๋’ค centroid๋ฅผ ํ˜•์„ฑํ–ˆ๋Š” ์ง€, n_iter ๊ฒฐ๊ด๊ฐ’์„ ์•Œ๋ ค์ค€๋‹ค

(โ€ป ๋‹น์—ฐํžˆ init๊ฐ’์„ 'k-means++'๋กœ ์„ค์ •ํ–ˆ์„ ๊ฒฝ์šฐ n_iter๊ฐ’์ด ๋” ์ ์€ ๊ฒฐ๊ด๊ฐ’์œผ๋กœ ๋‚˜์˜จ๋‹ค)

 

★ Then just feed your data to kmeans.fit() and you're done! ★
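
Putting the parameters and attributes together, a minimal end-to-end run might look like this (the make_blobs toy data is purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300).fit(X)

print(kmeans.cluster_centers_)  # (3, 2) array of centroid coordinates
print(kmeans.inertia_)          # SSE of the best of the 10 initializations
print(kmeans.labels_[:10])      # cluster index assigned to each sample
print(kmeans.n_iter_)           # iterations actually used (<= max_iter)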

how to select the best K?>

* First, set all the KMeans() arguments except n_clusters however you want.

 

kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300
}

 

1> using elbow method (+KneeLocator)

ใ€‹ SSE๊ฐ’์˜ ๋ณ€ํ™”๋ฅผ ์‚ดํŽด๋ณด๋Š” method์ด๋‹ค. cluster ๊ฐœ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก SSE๊ฐ€ ๊ฐ์†Œํ•˜๋Š”๋ฐ, SSE๊ฐ€ ์ตœ์†Œ๊ฐ€ ๋˜๋Š” ์ง€์ ์€ '๋ชจ๋“  data๊ฐ€ ๊ฐ๊ฐ์˜ cluster๋ฅผ ํ˜•์„ฑํ–ˆ์„ ๋•Œ'์ด๋‹ค. ๋”ฐ๋ผ์„œ, ์–ด๋Š ์ ์ ˆํ•œ ์‹œ์ ์— cut ํ•ด ์ฃผ์–ด์•ผ ํ•œ๋‹ค. cut ํ•ด์ฃผ๋Š” ์ง€์ ์„ SSE๊ฐ€ ํฐ ํญ์œผ๋กœ ๊ฐ์†Œํ•˜๋Š” ์ง€์ ์œผ๋กœ ์ •ํ•จ!

 

》 The SSE is the kmeans inertia_ attribute mentioned above!

 

》 1) Method one: plot it yourself and spot the cluster count where the big drop happens!

 

# A list holds the SSE values for each k
sse = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    sse.append(kmeans.inertia_)

plt.style.use("fivethirtyeight") # switch back to the default style for later plots!
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()

 

 

→ The point at 3 clusters is where the curve has its last big drop, so setting k to 3 is the right call.

 

》 2) Using KneeLocator - print the elbow attribute

 

from kneed import KneeLocator

kl = KneeLocator(
    range(1, 11), sse, curve="convex", direction="decreasing"
)

kl.elbow  # 3

 

→ prints 3 right away

2> using silhouette coefficient

🥑 The silhouette coefficient evaluates these two things: (1) how close a data point is to the other points within its cluster / (2) how far the point is from the other clusters — in short, it's a measure of how well each point clumps: far from the other clusters, close to the members of its own cluster! (For a point, if a is the mean distance to the other members of its cluster and b is the mean distance to the points of the nearest other cluster, the coefficient is s = (b - a) / max(a, b).)

 

🥑 It takes values between -1 and 1, and since it comes out as a single number, it makes deciding on k very convenient.

🥑 The closer the value is to 1, the better the choice!

 

🥑 It is obtained through the silhouette_score function; to produce a result it naturally needs at least two clusters (since it measures how far apart the other clusters are), and you just feed it the data (coordinates) together with the labels.

 

from sklearn.metrics import silhouette_score

 

# A list holds the silhouette coefficients for each k
silhouette_coefficients = []

# Notice you start at 2 clusters for silhouette coefficient
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    silhouette_coefficients.append(score)
    
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

 

 

→ ๊ฐฏ์ˆ˜๋กœ ์ •ํ•œ 3์—์„œ ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜๊ฐ’์ด ๊ฐ€์žฅ ํฐ ๊ฐ’์ด ๋‚˜์˜ค๋ฏ€๋กœ ์—ญ์‹œ (ํ•ด๋‹น data์˜ ๊ฒฝ์šฐ) 3์„ k๋กœ ์ •ํ•˜๋ฉด ๋œ๋‹ค.

w/code(using scikit-learn)>

① imports & preparing random data for clustering (using make_blobs)

(Clustering, as a kind of unsupervised learning, normally works on data that has no labels at all; this data, however, comes with labels already given. Here we simply use them to check whether KMeans() actually clustered the data well.)

 

import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

import numpy as np
import pandas as pd
import seaborn as sns

 

▣ make_blobs() docs ▣

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

'generates random data for clustering'

 

sklearn.datasets.make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)

 

โ–ถ 500๊ฐœ์˜ data๊ฐ€ 3๊ฐœ์˜ cluster(cluster ํ‘œ์ค€ํŽธ์ฐจ๋Š” 3.75)๋ฅผ ํ˜•์„ฑํ•˜๋Š” randomํ•œ data๋ฅผ ์„ ํƒํ•œ๋‹ค.

 

#data prepared
features, true_labels = make_blobs(
    n_samples=500,
    centers=3,
    cluster_std=3.75
)

 

② standardization (*using StandardScaler())

▶ By standardizing so that every axis has the same scale, we keep the result from being skewed toward any one variable of the data.

 

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
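
As a quick sanity check (optional), after scaling each column should have roughly zero mean and unit standard deviation:

print(scaled_features.mean(axis=0).round(3))  # approximately [0. 0.]
print(scaled_features.std(axis=0).round(3))   # approximately [1. 1.]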

 

③ choosing the number of clusters k

 

▶ Using the two methods above, the elbow method & silhouette coefficients, we set k to 3 (see above).

 

④ running KMeans() & fitting the data

 

โ–ถ ์ด ๋•Œ scaling๋œ data๋ฅผ ์ง‘์–ด๋„ฃ์–ด์•ผ ํ•จ์„ ์žŠ์ง€ ๋ง๊ธฐ!

 

kmeans = KMeans(
    init='random',
    n_clusters=3,
    n_init=10,
    max_iter=300
    # random_state=42
)

kmeans.fit(scaled_features)

 

โ‘ค ์ •๋ฆฌ๋œ dataframe ๋งŒ๋“ค๊ธฐ

 

▶ Since the data is 2-dimensional, we line up the x and y coordinates of every scaled point alongside each point's true label & the label KMeans assigned.

 

# making a dataframe
df = pd.DataFrame(scaled_features)
df.columns = ['x', 'y']
df['true_label'] = true_labels.tolist()
df['KMeans_label'] = kmeans.labels_.tolist()

df

 

 

▶ You can see that the true_label and KMeans_label numbers don't match up; the numbers are arbitrary values that come out differently with each kmeans initialization, so there's no need to worry.
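
If you do want a single agreement score that ignores this arbitrary numbering, one option is the adjusted Rand index from sklearn.metrics, which is permutation-invariant:

from sklearn.metrics import adjusted_rand_score

# 1.0 = identical groupings, no matter how the labels are numbered
print(adjusted_rand_score(df['true_label'], df['KMeans_label']))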

 

โ‘ฅ ์‹œ๊ฐํ™”

 

▶ Visualize in a 2D graph to check with our own eyes whether kmeans clustered the data well.

 

centroids = kmeans.cluster_centers_

plt.style.use("default")

def plot_clusters(df, column_header, centroids):
    colors = {0 : '#42f557', 1 : '#427ef5', 2 : '#f54842'}
    plt.rcParams["figure.figsize"] = (6,6)  # set the size before creating the figure
    fig, ax = plt.subplots()
    ax.plot(centroids[0][0], centroids[0][1], "ok")  # each centroid as a black dot
    ax.plot(centroids[1][0], centroids[1][1], "ok")
    ax.plot(centroids[2][0], centroids[2][1], "ok")
    
    grouped = df.groupby(column_header)
    
    for key, group in grouped:
        group.plot(ax = ax, kind = 'scatter', x = 'x', y = 'y', label = key, color = colors[key])
    plt.show()

plot_clusters(df, 'KMeans_label', centroids)

 

 

→ ์‹œ๊ฐํ™” ๊ฒฐ๊ณผ clustering๋œ data๋ฅผ ์ƒ‰๊น”๋ณ„ ํ™•์ธ ๊ฐ€๋Šฅ!


* Source 1) the GOAT, StatQuest: https://www.youtube.com/watch?v=4b5d3muPQmA

* Source 2) K-means & hierarchical clustering: https://www.youtube.com/watch?v=QXOkPvFM6NU

* Source 3) kmeans code reference: https://realpython.com/k-means-clustering-python/

* Source 4) kmeans visualization & elbow method: https://www.askpython.com/python/examples/plot-k-means-clusters-python
