Statistics/Concepts(+codes)

Maximum Likelihood Estimation (MLE)

metamong 2022. 6. 26.

🌟 In the logistic regression post, we said that the model is determined via the MLE technique. Going deeper into the mathematics of the logistic regression equation, MLE is the core tool in the computation that decides which model to pick; this time, let's examine MLE as a mathematical concept in more detail (fairly deeply).

 

🌟 In addition, as the figure below suggests, a future post will tie logistic regression and MLE together!

* ์ •์˜ & concepts>

🌟 A parametric density-estimation method: given an observed sample $x = (x_1, x_2, \ldots, x_n)$ drawn from some probability density function $P(x|\theta)$ with parameters $\theta = (\theta_1, \ldots, \theta_m)$, MLE estimates the parameters $\theta = (\theta_1, \ldots, \theta_m)$ from those samples.

 

🌟 Simply put: we are given many x data points, and we look for the optimal distribution that best explains those x (finding the probability density function - i.e., finding the parameters $\theta$ of that density function).

 

🌟 ex) For example, given scattered x data points as in the figure below, if we look for the optimal 'normal distribution' that best explains those points, MLE finds the orange curve. (The MLE procedure written out in equations is explained in detail below.)

 

 

๐ŸŒŸ ์ง๊ด€์ ์œผ๋กœ ๋ณด์ž๋ฉด, ํŠน์ • ๋ชจ์ˆ˜ parameter๋“ค์˜ ์ง‘ํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ํŠน์ • ๋ถ„ํฌ๊ฐ€ ์ฃผ์–ด์ง„ x๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•˜๋Š” ๋ถ„ํฌ์ธ์ง€ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด, ํ•ด๋‹น x์—์„œ์˜ ๋ถ„ํฌ๊นŒ์ง€์˜ ๋†’์ด๋ฅผ ๋ชจ๋“  x๋ณ„๋กœ ๋‹ค ๊ณ„์‚ฐํ•ด์„œ ๊ฐ๊ฐ ๊ณฑํ•œ๋‹ค. ์ด ๊ณฑํ•œ ๊ฒฐ๊ณผ๋ฅผ 'likelihood(๊ฐ€๋Šฅ๋„)'๋ผ ํ•˜๋ฉฐ, ํ•ด๋‹น likelihood๊ฐ€ ์ตœ๋Œ€๊ฐ€ ๋˜๋Š” ๋ถ„ํฌ๋ฅผ ์ตœ์ข…์ ์œผ๋กœ ์ •ํ•œ๋‹ค.

 

🌟 Equations

① $L = L_1 \times L_2 \times L_3 \times \cdots \times L_N$

(There are N x data points in total; call the product of each point's individual likelihood the final likelihood.)

 

② $\prod_{i=1}^{N} L_{i} = p(x_1|\hat{\theta}) \times p(x_2|\hat{\theta}) \times p(x_3|\hat{\theta}) \times \cdots \times p(x_N|\hat{\theta})$

(This means the likelihood is the product, over every x, of the probability-density value of the distribution carrying the parameter set $\theta$.)

 

③ $L = \prod_{i=1}^{N} p(x_i|\hat{\theta})$

(In short, the above can be summarized as this single product.)

 

④ $\ln L = \sum_{i=1}^{N} \ln p(x_i|\hat{\theta})$

(Next, take the log of both sides for convenience when differentiating - the log-likelihood function.)
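Beyond easier differentiation, there is a practical reason for the log: the raw product of many densities underflows in floating point, while the sum of logs stays finite and has the same maximizer. A small sketch with made-up data:

```python
import math

# Hypothetical data: with many points, the raw product of densities underflows
xs = [0.1 * i for i in range(1000)]

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density at x."""
    return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

# The raw product of 1000 density values collapses to 0.0 in floating point...
product = 1.0
for x in xs:
    product *= math.exp(log_normal_pdf(x, 50.0, 30.0))
print(product)  # 0.0

# ...while the sum of logs remains a perfectly usable finite number.
log_sum = sum(log_normal_pdf(x, 50.0, 30.0) for x in xs)
print(log_sum)
```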

 

⑤ $\frac{\partial}{\partial \theta} \ln L(\theta|x) = \sum_{i=1}^{N} \frac{\partial}{\partial \theta} \ln p(x_i|\hat{\theta}) = 0$

(Take the partial derivative with respect to the distribution parameter $\theta$, and find the value of $\theta$ at which the derivative equals 0!)

 

∴ The distribution built from the $\theta$ obtained this way is the distribution we want for the given x data.
(If $\theta$ consists of two or more parameters, set the derivative with respect to each parameter - i.e., each partial derivative - to 0 and solve for that parameter.)
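For the normal distribution, the derivative-equals-zero step has a well-known closed form: $\hat{\mu}$ is the sample mean and $\hat{\sigma}^2$ is the mean squared deviation. A sketch with a made-up sample, cross-checked against a coarse grid search over $\mu$:

```python
import math

xs = [2.0, 3.5, 4.0, 4.5, 6.0]  # hypothetical sample

# Closed-form normal MLE: set each partial derivative of ln L to 0 and solve.
mu_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))

def log_likelihood(mu, sigma):
    """ln L = sum of log-densities over the sample."""
    n = len(xs)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# A coarse grid search over mu agrees with the closed form.
grid = [i * 0.01 for i in range(100, 700)]
mu_grid = max(grid, key=lambda m: log_likelihood(m, sigma_hat))
print(mu_hat, mu_grid)  # both near 4.0
```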

* w/ Logistic Regression?>

 

Logistic Regression Model (concepts)

** The ML overview post covered the procedure that 'supervised learning' follows (see the linked post: ML Supervised Learning → Regression → Linear Regression)

sh-avid-learner.tistory.com

๐ŸŒŸ ์ผ๋ช… '์˜ค์ฆˆ๋น„'๋กœ ๋ถˆ๋ฆฐ odds๋Š” ํ™•๋ฅ ๋กœ 0์ด์ƒ 1์ดํ•˜์˜ ๊ฐ’์„ ๊ฐ€์ง€๋‚˜, ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜ ํ˜•ํƒœ๋กœ ๊ณก์„ ์„ ๋ณด์ด๊ธฐ์—, ์šฐ๋ฆฌ๋Š” '์˜ค์ฆˆ๋น„'์— ๋กœ๊ทธ๋ฅผ ์ทจํ•ด ์ง์„  ํ•จ์ˆ˜ ํ˜•ํƒœ๋กœ ๋ฐ”๊ฟ”์ฃผ์—ˆ๋‹ค.

 

๐ŸŒŸ ์ง์„ ์œผ๋กœ ๋ฐ”๋€Œ์—ˆ์ง€๋งŒ, ln ๋กœ๊ทธ๋ฅผ ์ทจํ•จ์œผ๋กœ์จ y๊ฐ’์˜ range๊ฐ€ $-\infty$ ~ $+\infty$์˜ ๋ฒ”์œ„๋ฅผ ๋ณด์—ฌ OLS -least squares method๋กœ ์ฃผ์–ด์ง„ data์™€ ์ง์„ ๊ฐ„์˜ ๊ฑฐ๋ฆฌ ์ œ๊ณฑํ•ฉ์˜ ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•ด ์ตœ์ ์˜ fitting line์„ ์ฐพ์„ ์ˆ˜๊ฐ€ ์—†๋‹ค. (OLS๋ฅผ logisitc์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์—†๋Š” ์ด์œ )

 

 

๐ŸŒŸ ๋”ฐ๋ผ์„œ ์ตœ์ ์˜ logistic model์„ ์ฐพ๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ MLE(Maximum Likelihood Estimaiton) ๊ธฐ๋ฒ•์„ ํƒํ•จ. ์ฃผ์–ด์ง„ x data๋ฅผ ์ด์šฉํ•ด data๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•˜๋Š” logistic model์„ ์ฐพ์•„๋ณด์ž.

 

 

๐ŸŽˆ MLE์˜ likelihood๋Š” Odds์—์„œ์˜ p ์„ฑ๊ณตํ™•๋ฅ ์„ ๋œปํ•œ๋‹ค.

 

๐ŸŽˆ ๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” ์ž„์˜์˜ LR ์ง์„ ์„ ์„ธ์šฐ๊ณ , logit transformationํ•œ ๋‹ค์Œ, ๊ฐ data๋ณ„ p ์„ฑ๊ณตํ™•๋ฅ (likelihood)์„ ๋ชจ๋‘ ๊ตฌํ•ด์•ผ ํ•œ๋‹ค.

(Here, for data points classified into class 0 rather than 1, the likelihood is (1 - the value on the sigmoid curve).)

 

🎈 《Logistic model selection procedure》

 

① Taking $\theta^T = [\theta_0, \theta_1, \ldots, \theta_n]$ as parameters, for data classified into two classes (A, B), first express the linear part of LR as an equation (it models the log-odds of class A, not the probability itself):

 

$\ln\cfrac{P(y_i = A | x_i;\hat{\theta})}{1 - P(y_i = A | x_i;\hat{\theta})} = \theta^T \bar{x}_i$

$\theta^T = [\theta_0, \theta_1, \ldots, \theta_n], \quad \bar{x} = [1, x_1, \ldots, x_n]^T$

 

② To obtain the likelihood, transform to the sigmoid curve with range [0, 1] - i.e., invert the logit transformation:

 

$h(x) = \sigma(\theta^{T}\bar{x}) = \cfrac{1}{(1+e^{-\theta^{T}\bar{x}})}$

→ i.e., written as a likelihood: $P(y_i = A | x_i;\hat{\theta}) = \cfrac{1}{1+e^{-\theta^{T}\bar{x}_i}}$

→ for class B, the likelihood is $P(y_i = B | x_i;\hat{\theta}) = 1 - \cfrac{1}{1+e^{-\theta^{T}\bar{x}_i}}$
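A minimal sketch of the two class likelihoods, assuming a hypothetical 1-D model with made-up parameters $\theta = [\theta_0, \theta_1]$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D logistic model: theta = [theta_0, theta_1], x_bar = [1, x]
theta = [-3.0, 1.5]

def h(x):
    """P(y = A | x): sigmoid of the linear part theta^T x_bar."""
    z = theta[0] * 1.0 + theta[1] * x
    return sigmoid(z)

x = 4.0
p_A = h(x)        # likelihood if this point is labeled class A
p_B = 1.0 - h(x)  # likelihood if labeled class B
print(p_A, p_B)   # the two always sum to 1
```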

 

③ Since this is a binary problem, the two class likelihoods (A and B) can be merged into one expression (by the Bernoulli distribution - post coming soon):

 

$P(y_i | x_i;\hat{\theta}) = h(x_i)^{y_i} (1-h(x_i))^{1-y_i}$

 

→ Expressed for the whole y at once rather than each individual $y_i$:

 

$P(y|x;\theta)$ = $\prod_{i=1}^{m} h(x_i)^{y_i} (1-h(x_i))^{1-y_i}$ = likelihood of $\theta$ = $L(\theta)$

(๊ฐ data์˜ likelihood๋ฅผ ๋ชจ๋‘ ๊ณฑํ•œ๋‹ค๋Š” ๋ถ€๋ถ„์—์„œ๋Š” ๊ฐ data์˜ ๋…๋ฆฝ์„ฑ์ด ์ „์ œ๋จ)

 

④ Maximizing $L(\theta)$ - taking the log

 

$\ln L(\theta) = \sum_{i=1}^{m} \left[ y_i \ln(\sigma(\theta^{T}\bar{x}_i)) + (1-y_i)\ln(1-\sigma(\theta^{T}\bar{x}_i)) \right]$

 

⑤ Take the partial derivative with respect to $\theta$

 

$\cfrac{\partial \ln L(\theta)}{\partial \theta} = \sum_{i=1}^{m}\left[\cfrac{y_i}{\sigma(\theta^{T}\bar{x}_i)}\cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial \theta} - \cfrac{1 - y_i}{1 - \sigma(\theta^{T}\bar{x}_i)}\cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial \theta}\right]$

 

+ Here $\cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial \theta}$ can be decomposed by the chain rule:

$\cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial \theta} = \cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial (\theta^T\bar{x}_i)}\cfrac{\partial (\theta^T\bar{x}_i)}{\partial \theta}$

 

+ Rewriting using the sigmoid's conveniently simple derivative, $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$\cfrac{\partial \sigma(\theta^{T}\bar{x}_i)}{\partial (\theta^T\bar{x}_i)}\cfrac{\partial (\theta^T\bar{x}_i)}{\partial \theta} = \sigma(\theta^{T}\bar{x}_i)(1-\sigma(\theta^T\bar{x}_i))\,\bar{x}_i$

 

+ Substituting this expanded term back into the main expression (the two terms combine: $y_i(1-\sigma)\bar{x}_i - (1-y_i)\sigma\bar{x}_i = (y_i-\sigma)\bar{x}_i$), we finally obtain

$\cfrac{\partial \ln L(\theta)}{\partial \theta} = \sum_{i=1}^{m}(y_i - \sigma(\theta^T\bar{x}_i))\,\bar{x}_i$
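The final gradient formula can be sanity-checked numerically: the analytic $\sum_i (y_i - \sigma(\theta^T\bar{x}_i))\bar{x}_i$ should match a finite-difference estimate of $\partial \ln L / \partial \theta$. The data and $\theta$ below are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(1.0, 0), (3.0, 1), (4.0, 1)]  # hypothetical (x, y) pairs

def log_likelihood(theta):
    ll = 0.0
    for x, y in data:
        p = sigmoid(theta[0] + theta[1] * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def analytic_gradient(theta):
    # d lnL / d theta = sum_i (y_i - sigma(theta^T x_bar_i)) * x_bar_i
    g = [0.0, 0.0]
    for x, y in data:
        err = y - sigmoid(theta[0] + theta[1] * x)
        g[0] += err * 1.0  # x_bar = [1, x]
        g[1] += err * x
    return g

theta = [0.5, -0.2]
eps = 1e-6
numeric = [
    (log_likelihood([theta[0] + eps, theta[1]]) - log_likelihood([theta[0] - eps, theta[1]])) / (2 * eps),
    (log_likelihood([theta[0], theta[1] + eps]) - log_likelihood([theta[0], theta[1] - eps])) / (2 * eps),
]
print(analytic_gradient(theta), numeric)  # the two should agree closely
```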

 

⑥ Apply a gradient-based update to $\theta$ (gradient *ascent*, since we are maximizing the log-likelihood - the GD update with the sign flipped):

$\theta^{+} = \theta^{-} + \alpha\cfrac{\partial \ln L(\theta)}{\partial \theta}$

 

⑦ Iterate the update above to find the optimal $\theta$.
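Steps ⑥-⑦ amount to a short training loop. A sketch on made-up 1-D data (the learning rate and iteration count are chosen arbitrarily):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: small x -> class 0, large x -> class 1
data = [(1.0, 0), (2.0, 0), (2.5, 0), (3.5, 1), (4.0, 1), (5.0, 1)]

theta = [0.0, 0.0]   # initial guess for [theta_0, theta_1]
alpha = 0.05         # learning rate

for _ in range(10000):  # repeat the gradient-ascent update
    g0 = g1 = 0.0
    for x, y in data:
        err = y - sigmoid(theta[0] + theta[1] * x)  # (y_i - sigma(theta^T x_bar_i))
        g0 += err        # component for x_bar = [1, x]
        g1 += err * x
    theta[0] += alpha * g0
    theta[1] += alpha * g1

# The fitted sigmoid now separates the two classes
print(sigmoid(theta[0] + theta[1] * 1.0))  # should be near 0
print(sigmoid(theta[0] + theta[1] * 5.0))  # should be near 1
```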

 

⑧ The optimal logistic model - the sigmoid function built from that optimal $\theta$ - is then obtained.

 

- ๋! -

 

๐Ÿ™Œ๐Ÿป ์œ„ ์ˆ˜์‹ ์ง„ํ–‰๊ณผ์ •์ฒ˜๋Ÿผ ์™„์ „ํžˆ ๋˜‘๊ฐ™์€ ๊ณผ์ •์œผ๋กœ MLE ๊ธฐ๋ฒ•์„ normal distribution์—๋„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. MLE ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ์ฃผ์–ด์ง„ x data์— ๊ฐ€์žฅ ์ž˜ ๋งž๋Š” normal distribution์˜ ๋ชจ์ˆ˜ $\mu$์™€ $\sigma$๋ฅผ ์•Œ์•„๋ณด๋Š” ๊ณผ์ •์„ ์ˆ˜ํ•™์ ์œผ๋กœ ํ’€์–ด๋ณด์ž. (์ถ”ํ›„ ํฌ์ŠคํŒ…)


* Thumbnail source) https://programmathically.com/maximum-likelihood-estimation/

* Source 1) MLE concept explained https://www.youtube.com/watch?v=XhlfVtGb19c

* Source 2) MLE reference document http://www.sherrytowers.com/mle_introduction.pdf

* Source 3) Using MLE - logistic regression https://www.youtube.com/watch?v=TM1lijyQnaI

* Source 4) STATQUEST - MLE w/ logistic regression https://www.youtube.com/watch?v=BfKanl1aSG0

* Source 5) STATQUEST - MLE explained https://www.youtube.com/watch?v=XepXtl9YKwc
