
Central Limit Theorem (CLT)

metamong 2022. 5. 5.

๐Ÿ• ํ†ต๊ณ„์ ์œผ๋กœ ๊ผญ ์•Œ์•„์•ผ ํ•  ๊ฐœ๋…(๊ธฐ์ดˆ)์ธ CLT - ์ค‘์‹ฌ๊ทนํ•œ์ •๋ฆฌ์— ๋Œ€ํ•ด์„œ ์งš๊ณ  ๋„˜์–ด๊ฐ€๋ณด์ž!

concepts> * getting the terminology right

๐Ÿ• โ‘ sample ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก โ‘กsample์˜ ํ‰๊ท ์€ โ‘ข๋ชจ์ง‘๋‹จ์— ๊ด€๊ณ„์—†์ด โ‘ฃ์ •๊ทœ๋ถ„ํฌ์— ๊ทผ์‚ฌํ•œ ํ˜•ํƒœ๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค

 

- A closer look -

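① As the sample size grows

🙆🏾‍♀️ A common rule of thumb is a sample size of at least 30. This is a guideline, not part of the theorem: heavily skewed populations can need more, and nearly symmetric ones can get by with less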

 

② The sample mean

🙆🏾‍♀️ The population must be a distribution whose mean (and variance) actually exists and is finite. The Cauchy distribution is the classic exception; most distributions you meet in practice are fine!

(sample mean = the average of the data points in one sample; there are as many of them as the sample size)
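To see why the Cauchy distribution is singled out, here is a minimal sketch (not from the original post) that compares how the spread of the sample mean behaves for Cauchy and normal populations. The helper name `iqr_of_sample_means`, the seed, and the repetition count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

def iqr_of_sample_means(draw, size, reps=2000):
    """Interquartile range of `reps` sample means, each from a sample of `size` draws."""
    means = draw((reps, size)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    return q3 - q1

for size in (10, 100, 1000):
    cauchy_iqr = iqr_of_sample_means(rng.standard_cauchy, size)
    normal_iqr = iqr_of_sample_means(rng.standard_normal, size)
    print(f"size={size:5d}  Cauchy IQR={cauchy_iqr:.3f}  normal IQR={normal_iqr:.3f}")
```

The mean of independent standard Cauchy draws is itself standard Cauchy, so its interquartile range should stay near 2 however large the sample gets, while the normal column keeps shrinking.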

 

③ Regardless of the population

🙆🏾‍♀️ The shape of the population distribution doesn't matter. The theorem covers any distribution with a finite mean and variance, given independent draws

 

④ Normal distribution

🙆🏾‍♀️ The sample mean ends up (approximately) normally distributed! Its distribution takes on the bell shape of the normal curve as the sample size grows!
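Points ① to ④ can be written compactly. Below is a sketch of the classical (Lindeberg-Lévy) statement; the symbols $\mu$, $\sigma^2$, $n$, and $\bar{X}_n$ are notation introduced here, not taken from the post.

```latex
X_1, \dots, X_n \ \text{i.i.d.}, \quad \mathbb{E}[X_i] = \mu, \quad \operatorname{Var}(X_i) = \sigma^2 < \infty

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad (n \to \infty)
```

Read informally: for large $n$, $\bar{X}_n$ behaves roughly like $\mathcal{N}(\mu, \sigma^2 / n)$.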

 

๐Ÿ• ํ™œ์šฉ?

โ˜… ๊ทธ ์–ด๋–ค ๋ชจ์ง‘๋‹จ ๋ถ„ํฌ์ด๋“ , ๋ถ„ํฌ ์ข…๋ฅ˜์— ์ƒ๊ด€์—†์ด sample ํ‰๊ท ์€ ๋ฌด์กฐ๊ฑด์ ์œผ๋กœ normally distributed๋จ!

(๋ชจ์ง‘๋‹จ ๋ถ„ํฌ์— ์‹ ๊ฒฝ ์“ธ ํ•„์š” ์—†์Œ)

โ˜… ์ฆ‰! mean's normal distribution์œผ๋กœ ์‹ ๋ขฐ๊ตฌ๊ฐ„, t-test, ANOVA ๋“ฑ๋“ฑ sample mean์„ ์ด์šฉํ•˜๋Š” ๋‹ค์–‘ํ•œ statistical tests์— ์ ์šฉ ๊ฐ€๋Šฅ!
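As one concrete use, here is a hedged sketch of a CLT-based 95% confidence interval for a mean. The function name `normal_ci`, the sample size of 100, and the seed are assumptions for illustration; the Binomial(6, 0.5) population is borrowed from the example later in the post.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

def normal_ci(sample, z=1.96):
    """Approximate 95% confidence interval for the mean, via the CLT normal approximation."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(sample.size)  # estimated standard error of the mean
    return mean - z * se, mean + z * se

# One sample of 100 draws from a Binomial(6, 0.5) population (true mean 3).
low, high = normal_ci(rng.binomial(n=6, p=0.5, size=100))
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")

# Coverage check: how often does the interval contain the true mean?
reps = 2000
hits = 0
for _ in range(reps):
    lo, hi = normal_ci(rng.binomial(n=6, p=0.5, size=100))
    hits += int(lo <= 3 <= hi)
coverage = hits / reps
print(f"coverage over {reps} samples: {coverage:.3f}")
```

If the normal approximation is doing its job, the printed coverage should land close to 0.95 even though the population itself is discrete and not normal.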

 

🙉 Careful!

★ The number of data points inside one sample drawn from the population is called the sample size! It is not how many times you repeat the sampling.

(i.e., the larger the size of each sample, the closer the distribution of the sample mean gets to a normal distribution)
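The difference between sample size and number of repetitions is easy to check numerically. A minimal sketch, assuming the Binomial(6, 0.5) population used later in the post; the helper `sample_means` and its parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility

def sample_means(sample_size, n_repeats):
    """Means of `n_repeats` samples, each holding `sample_size` draws from Binomial(6, 0.5)."""
    draws = rng.binomial(n=6, p=0.5, size=(n_repeats, sample_size))
    return draws.mean(axis=1)

# Growing the sample size narrows the distribution of the sample mean ...
sd_small = sample_means(sample_size=30, n_repeats=10_000).std()
sd_large = sample_means(sample_size=300, n_repeats=10_000).std()
# ... while repeating the sampling more often leaves its spread about the same.
sd_more_repeats = sample_means(sample_size=30, n_repeats=40_000).std()
print(sd_small, sd_large, sd_more_repeats)
```

More repetitions only give a smoother picture of the same distribution; it is the sample size that changes the distribution of the mean itself.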

 

🙉 Conditions the sample should meet

 

★ The samples should come from random sampling with replacement, so that the draws are independent

★ The sample should be representative of the population

★ As a rule of thumb, the sample size should be 30 or more to draw conclusions

★ When sampling without replacement, the sample should contain no more than 10% of the population (if the sample is too large a share of the population, the draws noticeably affect one another and can no longer be treated as independent!)
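The 10% condition can be illustrated with the finite population correction. The sketch below assumes a made-up finite population of 1,000 values and a hypothetical helper `sd_of_means_without_replacement`; it compares the observed spread of the sample mean with what independent draws would give.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility
population = rng.binomial(n=6, p=0.5, size=1000)  # a made-up finite population of 1,000 values
N = population.size
sigma = population.std()  # population standard deviation (divisor N)

def sd_of_means_without_replacement(sample_size, reps=4000):
    """Standard deviation of `reps` sample means, each drawn without replacement."""
    means = [rng.choice(population, size=sample_size, replace=False).mean()
             for _ in range(reps)]
    return np.std(means)

for sample_size in (100, 500):  # 10% and 50% of the population
    observed = sd_of_means_without_replacement(sample_size)
    independent = sigma / np.sqrt(sample_size)  # what independent draws would give
    corrected = independent * np.sqrt((N - sample_size) / (N - 1))  # finite population correction
    print(sample_size, round(observed, 4), round(independent, 4), round(corrected, 4))
```

At 10% of the population the correction factor is about 0.95, so treating the draws as independent is a mild approximation; at 50% it is about 0.71 and the independent-draw formula is clearly off.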

w/code

Q. Suppose the population is a binomial distribution in which an event with success probability 0.5 is run 6 times. We want to estimate the mean of this binomial random variable (i.e., the population mean). Assuming we know nothing about the population, we draw a sample of x data points from it and repeat that sampling 10,000 times (with replacement). Show that, as x grows, the distribution of the mean of the x data points (the mean of the random variable X within each sample) looks more and more like a normal distribution.

(Assume the samples are independent of one another & replacement within a sample is allowed)

 

A.

 

1> Build a visualization function

 

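import numpy as np
import pandas as pd

def viz_CLT_binomial(size):
    """Plot the distribution of 10,000 sample means, each from a sample of `size` draws."""
    sample_means = []

    for _ in range(10000):  # repeat the sampling 10,000 times
        coinflips = np.random.binomial(n=6, p=0.5, size=size)  # one sample from Binomial(6, 0.5)
        sample_means.append(coinflips.mean())  # keep only that sample's mean

    pd.DataFrame(sample_means).hist(color='#4000c7', bins=1000)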

 

2> Apply sample sizes in turn, from 30 (the usual rule-of-thumb minimum) to 300, 3000, and 300000, and plot each one

 

for s in [30, 300, 3000, 300000]:
    viz_CLT_binomial(s)

 

3> Results

 

- Distribution of the sample mean for size 30, 300, 3000, and 300000 (clockwise from top left) -

 

4> Interpreting the results

We can estimate the mean of a binomial distribution in which an event with success probability 0.5 is run 6 times. Even if we did not know the population was binomial, simply plotting the distribution of the sample means while increasing the sample size shows them concentrating around 3, so we can estimate that the population mean is close to 3. (The tightening around 3 is the law of large numbers at work; the CLT adds that the spread around 3 is approximately normal.)

(Indeed, for an event with probability 0.5 run 6 times, the mean of the random variable is n × p = 6 × 0.5 = 3, so the population really is a binomial distribution with mean 3!)
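The value 3 follows from the standard binomial formulas. A short worked derivation, using $n = 6$ trials and $p = 0.5$ from the question; $m$ (the sample size, called x or `size` in the post) is a symbol introduced here for illustration.

```latex
\mathbb{E}[X] = np = 6 \times 0.5 = 3, \qquad \operatorname{Var}(X) = np(1-p) = 6 \times 0.5 \times 0.5 = 1.5

\bar{X}_m \approx \mathcal{N}\!\left(3, \frac{1.5}{m}\right) \quad \text{for large } m

\text{e.g. } m = 30: \quad \operatorname{sd}(\bar{X}_{30}) \approx \sqrt{1.5 / 30} \approx 0.224
```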

 

We can also see from the plots that the distribution of the sample mean forms a bell-shaped, approximately normal distribution
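Eyeballing a bell shape can be backed up with a quick numeric check. The sketch below is an addition, not the post's method: it tests whether roughly 68% and 95% of the sample means fall within one and two standard deviations of the mean, as a normal distribution predicts. The sample size of 300, the seed, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed for reproducibility
size, reps = 300, 10_000

# The sum of `size` independent Binomial(6, 0.5) draws is Binomial(6 * size, 0.5),
# so each sample mean can be simulated with a single draw.
sample_means = rng.binomial(n=6 * size, p=0.5, size=reps) / size

mu = 6 * 0.5                        # population mean
sd = np.sqrt(6 * 0.5 * 0.5 / size)  # standard deviation of the sample mean
within_1sd = np.mean(np.abs(sample_means - mu) <= 1 * sd)
within_2sd = np.mean(np.abs(sample_means - mu) <= 2 * sd)
print(f"within 1 sd: {within_1sd:.3f}, within 2 sd: {within_2sd:.3f}")
```

The single-draw shortcut also avoids the long loop needed for very large sample sizes such as 300000.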

 

In this way, the sample mean lets us approximate the mean of an unknown population, and the CLT tells us how that estimate is distributed!

 

In addition, although we did not know the population's distribution, sampling has shown that the sample mean is approximately normally distributed, so we can use that distribution to run various statistical tests! 👍

(A post explaining the binomial distribution will come later!)


* Source 1) https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1

* Source 2) KhanAcademy☀️ https://www.youtube.com/watch?v=JNm3M9cqWyc

* Source 3) STATQUEST https://www.youtube.com/watch?v=YAlJCEDH2uY
