
vector similarity

metamong 2023. 2. 9.

🧙🏻 The scalar & vector fundamentals post briefly covered the concepts of scalars and vectors.

 


 

🧙🏻 Now let's look at several notions of vector similarity!

 

🧙🏻 In data analysis, each observation is represented as a single vector built from the values of its variables (features).

ex)

Observation  A    B
1            50   1
2            60   2
3            100  50

→ There are two independent variables, A and B, so each observation can be expressed as a 2-dimensional vector: (50,1), (60,2), and (100,50).

The position of a vector is determined by its elements.

Similarity between vectors is tied to the notion of distance between vectors, and this notion of similarity can be applied in ML.


Euclidean Distance

🧙🏻 Given two 2-dimensional vectors, each vector can be drawn as a point in the plane, so the distance between the vectors is simply the distance between the points.

🧙🏻 In the figure above, with $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$, the distance between p1 and p2 can be written as the formula below ($v_1 = x_2 - x_1, v_2 = y_2 - y_1$)

 

$$\overline{p_1p_2} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{v_1^2 + v_2^2} = \|p_2 - p_1\|$$

 

🧙🏻 So the Euclidean distance between two vectors can be computed with np.linalg.norm(), the L2 norm function, as shown below.

import numpy as np

a = np.array([1, 2])
b = np.array([2, 2])
c = np.array([-3, -3])

# Euclidean distance = L2 norm of the difference vector
print(np.linalg.norm(b - a))  # 1.0
print(np.linalg.norm(c - a))  # 6.4031242374328485
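🧙🏻 As a rough sketch, the same idea can be applied to the three observations from the table at the top of the post. The helper name `euclidean` and the variable names are my own choices for illustration; the check at the end only confirms that the hand-written formula agrees with np.linalg.norm().

```python
import numpy as np

# Observations from the table above: one row per observation, columns A and B.
X = np.array([[50, 1],
              [60, 2],
              [100, 50]], dtype=float)

def euclidean(p, q):
    """Square root of the summed squared coordinate differences (works in n dimensions)."""
    return np.sqrt(np.sum((q - p) ** 2))

d12 = euclidean(X[0], X[1])  # observation 1 vs 2
d13 = euclidean(X[0], X[2])  # observation 1 vs 3
d23 = euclidean(X[1], X[2])  # observation 2 vs 3
print(d12, d13, d23)

# The hand-written formula matches the L2 norm of the difference vector.
print(np.isclose(d12, np.linalg.norm(X[1] - X[0])))
```

Observations 1 and 2 end up much closer to each other than either is to observation 3, which matches what the raw table suggests.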

Cosine Similarity / Cosine Distance

🧙🏻 Above, Euclidean distance served as the representative example of a distance-based similarity that uses the magnitude of vectors. In contrast, a direction-based similarity that uses the other property of a vector, its direction, is also available.

🧙🏻 The direction of a vector is determined by the values of its elements, and the more similar the directions of two vectors are, the higher their similarity.

 

🧙🏻 How similar the directions of two vectors are is expressed through the angle between them.

 

※ The smaller the angle (the closer to 0°), the more similar the directions.

※ The larger the angle (the closer to 180°), the more opposite the directions.

 

🧙🏻 The cosine function is used to express this numerically.

🧙🏻 The more similar the directions → the smaller the angle → the larger the cosine (the cosine is computed as $\cfrac{v_1 \cdot v_2}{|v_1||v_2|}$)

(np.dot() computes the dot product of the vectors / np.linalg.norm() computes each vector's L2 norm)
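🧙🏻 A minimal sketch of that computation, assuming non-zero input vectors (the cosine is undefined when either norm is 0). The function name `cosine_similarity` and the conversion to degrees are additions of mine, included only to make the link between the cosine value and the angle visible.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (|v1| |v2|); assumes neither vector is all zeros."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 2.0])

cos_ab = cosine_similarity(a, b)
# Clipping guards against tiny floating-point overshoot outside [-1, 1].
angle_deg = np.degrees(np.arccos(np.clip(cos_ab, -1.0, 1.0)))
print(cos_ab, angle_deg)
```

For these two vectors the angle comes out at roughly 18 degrees, so the cosine is close to 1.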

 

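🧙🏻 Generalized to two n-dimensional vectors $v_1$ and $v_2$, the cosine similarity is as follows.

$$\cos\theta = \cfrac{v_1 \cdot v_2}{|v_1||v_2|} = \cfrac{\sum_{i=1}^{n} v_{1,i}\, v_{2,i}}{\sqrt{\sum_{i=1}^{n} v_{1,i}^2}\,\sqrt{\sum_{i=1}^{n} v_{2,i}^2}}$$

🧙🏻 From this, the cosine distance is computed as follows.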

 

cosine distance = $1 - \cos\theta$

 

🧙🏻 The more similar the directions, the larger the cosine, so the cosine distance formula was created to turn this into a notion of distance.

🧙🏻 The more similar the directions → the smaller the angle → the larger the cosine → the smaller the cosine distance

(So the formula expresses that the more similar the directions are, the smaller the cosine distance, i.e. the closer the vectors are.)
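🧙🏻 To see the chain above in numbers, here is a small sketch that measures the cosine distance from one reference vector to vectors at angles of 0°, 45°, 90°, and 180°. The vectors and the helper name `cosine_distance` are made up for illustration.

```python
import numpy as np

def cosine_distance(v1, v2):
    """1 - cos(theta); assumes neither vector is all zeros."""
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

ref = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # angle 0 (length differs, direction does not)
diagonal = np.array([1.0, 1.0])         # angle 45 degrees
orthogonal = np.array([0.0, 2.0])       # angle 90 degrees
opposite = np.array([-1.0, 0.0])        # angle 180 degrees

distances = [cosine_distance(ref, v)
             for v in (same_direction, diagonal, orthogonal, opposite)]
print(distances)
```

The distance grows from 0 (same direction) through 1 (orthogonal) up to 2 (opposite), and the length of the vector plays no role.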

 

🧙🏻 In code there are two options: compute it directly from the cosine formula (dot product over the norms, which follows from the law of cosines), or use the function that scipy provides.

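import numpy as np
import scipy.spatial.distance as dst

a = np.array([1, 2])
b = np.array([2, 2])

# Method 1: 1 - cos(theta), computed directly from the dot product and the L2 norms
print(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# 0.05131670194948623

# Method 2: scipy's built-in cosine distance
print(dst.cosine(a, b))
# 0.05131670194948623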

 

🧙🏻 ex) Comparing Euclidean distance and cosine similarity

By Euclidean distance, the agriculture corpus and the history corpus look more similar than the food corpus and the agriculture corpus. By cosine similarity, however, the food corpus and the agriculture corpus are the more similar pair (because the angle between them is smaller).

 

🧙🏻 Cosine similarity is used when comparing by plain distance, such as Euclidean distance, is not recommended, for example when the data points are sparse. Let's get a feel for this later by analyzing several cases!
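🧙🏻 The corpus comparison above can be imitated with a few invented numbers. The word counts below are hypothetical (they are not taken from the figure or from any real corpus) and were chosen only so that the two measures disagree in the way the example describes.

```python
import numpy as np

# Hypothetical counts of two words in each corpus (invented for illustration).
agriculture = np.array([20.0, 40.0])
history = np.array([30.0, 20.0])
food = np.array([5.0, 15.0])   # a much smaller corpus, so the counts are small

def euclidean(p, q):
    return np.linalg.norm(q - p)

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Euclidean distance: smaller means more similar.
d_agri_hist = euclidean(agriculture, history)
d_food_agri = euclidean(food, agriculture)

# Cosine similarity: larger means more similar.
s_agri_hist = cosine_similarity(agriculture, history)
s_food_agri = cosine_similarity(food, agriculture)

print("euclidean:", d_agri_hist, d_food_agri)
print("cosine   :", s_agri_hist, s_food_agri)
```

With these counts, Euclidean distance ranks agriculture and history as the closer pair, mostly because both corpora are large, while cosine similarity ranks food and agriculture as the more similar pair because their word proportions point in nearly the same direction.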

Jaccard Distance / Jaccard Index

🧙🏻 It is mostly applied to binary arrays made up only of 0s and 1s.

 

🧙🏻 Jaccard distance = $\cfrac{C_{1,0} + C_{0,1}}{C_{1,1} + C_{1,0} + C_{0,1}}$

($C_{1,0}$ is the number of positions where the element of the first vector is 1 and the element of the second vector is 0; $C_{0,1}$ and $C_{1,1}$ are defined in the same way)

(This formula applies only to binary arrays made up of 0s and 1s)

 

🧙🏻 Just as the cosine distance was built from the cosine similarity so that higher similarity means a shorter distance ($1 - \cos\theta$),

🧙🏻 the Jaccard distance is built from the Jaccard index: the higher the similarity between the two, the shorter the Jaccard distance.

 

→ Given A and B below, the size of the union of A and B is $M_{10} + M_{01} + M_{11}$.

→ The proportion of this union taken up by the intersection of A and B is called the Jaccard similarity coefficient (J).

→ Conversely, the Jaccard distance is written $d_j = 1 - J$ (both are proportions, so both take values between 0 and 1)

 

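🧙🏻 Formulas for $J(A,B)$ and $d_j(A,B)$

$$J(A,B) = \cfrac{M_{11}}{M_{01} + M_{10} + M_{11}}, \qquad d_j(A,B) = 1 - J(A,B) = \cfrac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$$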

 

ex) Take two column vectors A = [1,0,0] and B = [0,1,0]. Their Jaccard index is 0/2: there is no position where both are 1, so it is 0. Conversely, the Jaccard distance is 1.

import scipy.spatial.distance as dst

print(dst.jaccard([1, 0, 0], [1, 1, 0]))  # 0.5
print(dst.jaccard([1, 0, 0], [1, 2, 0]))  # 0.5
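🧙🏻 The A = [1,0,0], B = [0,1,0] example can also be worked through by counting positions by hand. This is only a sketch for binary input: the helper `jaccard_counts` is a name I made up, and it assumes at least one of the two vectors contains a 1 (otherwise the denominator is 0).

```python
import numpy as np
import scipy.spatial.distance as dst

def jaccard_counts(u, v):
    """Return (C11, C10, C01) for two equal-length binary vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    c11 = int(np.sum(u & v))    # both 1
    c10 = int(np.sum(u & ~v))   # first is 1, second is 0
    c01 = int(np.sum(~u & v))   # first is 0, second is 1
    return c11, c10, c01

A = [1, 0, 0]
B = [0, 1, 0]

c11, c10, c01 = jaccard_counts(A, B)
jaccard_index = c11 / (c11 + c10 + c01)
jaccard_distance = (c10 + c01) / (c11 + c10 + c01)

print(jaccard_index, jaccard_distance)
print(dst.jaccard(A, B))  # scipy's result for the same pair, for comparison
```

The hand count gives an index of 0 and a distance of 1, in line with the worked example, and the two values always sum to 1.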

Hamming Distance

🧙🏻 Put simply, the Hamming distance is the number of mismatching pairs out of the total number of elements, i.e. the fraction of positions at which the two vectors differ. The more mismatching pairs there are, the lower the similarity and the greater the distance, and the formula expresses exactly that.

 

🧙🏻 It applies to 1-D arrays. Given two arrays u and v that are both boolean vectors, the Hamming distance can be written as below.

(n is the number of elements)

$$\cfrac{C_{1,0} + C_{0,1}}{n}$$

import scipy.spatial.distance as dst

print(dst.hamming([1, 0, 0], [0, 1, 0]))  # 0.6666666666666666
print(dst.hamming([1, 0, 0], [1, 1, 0]))  # 0.3333333333333333
print(dst.hamming([1, 0, 0], [3, 0, 0]))  # 0.3333333333333333
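🧙🏻 The same values can be reproduced without scipy by counting mismatches directly. A minimal sketch, assuming two 1-D arrays of equal length; the function name `hamming_distance` is my own.

```python
import numpy as np

def hamming_distance(u, v):
    """Fraction of positions at which u and v differ (equal-length 1-D input assumed)."""
    u = np.asarray(u)
    v = np.asarray(v)
    return np.mean(u != v)

h1 = hamming_distance([1, 0, 0], [0, 1, 0])  # 2 of 3 positions differ
h2 = hamming_distance([1, 0, 0], [1, 1, 0])  # 1 of 3 positions differs
h3 = hamming_distance([1, 0, 0], [3, 0, 0])  # 1 of 3 positions differs
print(h1, h2, h3)
```

Note that the last pair counts as a single mismatch even though 1 and 3 are far apart numerically: only whether the values differ matters, not by how much.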

Cityblock Distance(Manhattan Distance)

🧙🏻 Just as the L2 norm gave the Euclidean distance, the L1 norm gives the cityblock distance.

 

🧙🏻 It applies to 1-D arrays. Given two arrays u and v, the cityblock distance can be written as below.

 

$$\sum_{i}^{} |u_i - v_i|$$

🧙🏻 In the figure above, every colored line except the green one represents a cityblock distance. It is a distance devised on the premise that you can travel between two points only horizontally and vertically, block by block.

 

🧙🏻 In vector terms, it is the result of taking the distance between the corresponding values on each axis and summing those distances over all axes.

import scipy.spatial.distance as dst

print(dst.cityblock([1, 0, 0], [0, 1, 0]))  # 2
print(dst.cityblock([1, 0, 0], [0, 2, 0]))  # 3
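🧙🏻 As a quick cross-check, the axis-by-axis sum can be written by hand and compared with the L1 norm of the difference vector. This is a sketch under the assumption of equal-length numeric input; the local function name `cityblock` is mine and is separate from scipy's function.

```python
import numpy as np

def cityblock(u, v):
    """Sum of absolute per-axis differences (L1 distance)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(u - v))

u = [1, 0, 0]
v = [0, 2, 0]

l1 = cityblock(u, v)                                # |1-0| + |0-2| + |0-0| = 3.0
l1_norm = np.linalg.norm(np.subtract(u, v), ord=1)  # same value via the L1 norm
l2 = np.linalg.norm(np.subtract(u, v))              # Euclidean distance for comparison
print(l1, l1_norm, l2)
```

The cityblock distance is never smaller than the Euclidean distance between the same two points, since the straight line is the shortest path.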

* Source 1) https://en.wikipedia.org/wiki/Cosine_similarity

* Source 2) Jaccard Index https://en.wikipedia.org/wiki/Jaccard_index

* Source 3) Graduate school pre-course: Math fundamentals

 
