Python/Pandas&Numpy

pandas Tricks_12๐Ÿ‘‰๐Ÿป 'pivot_table()' (Kevin by DataSchool)

metamong 2022. 4. 14.

๐Ÿ‘† tidy data ํฌ์ŠคํŒ…์—์„œ ์šฐ๋ฆฌ๋Š” tidyํ•œ data๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๊ธฐ์กด dataframe์„ meltํ•œ๋‹ค๊ณ  ํ•˜์˜€๊ณ , ๋‹ค์‹œ ์›์œ„์น˜ ์‹œํ‚ค๊ธฐ ์œ„ํ•ด pivot_table()์„ ํ™œ์šฉํ•œ๋‹ค๊ณ  ๋ฐฐ์› ๋‹ค.

 

Tidy Data

* ์‹ค์ œ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๋Š” ํ•˜๋‚˜์˜ ์™„๋ฒฝํ•œ dataset์ด ์•„๋‹Œ ์—ฌ๋Ÿฌ ๊ฐœ๋กœ ์ชผ๊ฐœ์ง„ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค! → ๋”ฐ๋ผ์„œ ๋ถ„์„ํ•˜๊ธฐ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๋‚˜์˜ dataset์œผ๋กœ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ๊ณผ์ •์ด ํ•„์š” 1. Tidy Data๋ž€? "๊ฐ ๋ณ€์ˆ˜๊ฐ€ ์—ด์ด

sh-avid-learner.tistory.com

๐Ÿ‘€ ๊ทผ๋ฐ ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•œ ๊ฑด, pivot_table()์€ ๊ธฐ์กด messy data๋กœ ์›์œ„์น˜์‹œํ‚ค๋Š” ๊ธฐ๋Šฅ๋ณด๋‹ค๋Š”, ์ฃผ์–ด์ง„ dataframe์˜ ์ผ๋ถ€ ํ•„์š”ํ•œ ์ •๋ณด๋งŒ ๊ณจ๋ผ (์ง‘๊ณ„ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ) dataframe ํ˜•ํƒœ๋กœ ๋‹ค์‹œ ๋ณด์—ฌ์ฃผ๊ณ  ์‹ถ์„ ๋•Œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ฃผ ๊ธฐ๋Šฅ์ด ๋‹ด๊ฒจ ์žˆ๋‹ค!

(์ฃผ๋กœ EDA ๊ณผ์ •์—์„œ ๋ณธ dataframe์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ  ์‹ถ์„ ๋•Œ pivot_table์„ ๋งŽ์ด ์‚ฌ์šฉํ•œ๋‹ค)

 

- ์œ„ ํฌ์ŠคํŒ…์—์„œ ์ด๋ ‡๊ฒŒ ๋ฐฐ์›€ -

 

 

๐Ÿ‘ ๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” pivot_table์ด ์‹ค์ œ dataframe์—์„œ ์–ด๋–ป๊ฒŒ ์œ ์šฉํ•˜๊ฒŒ ์“ฐ์ด๋Š” ์ง€ ์ด๋ฒˆ ํฌ์ŠคํŒ…์„ ํ†ตํ•ด ๋” ์ƒ์„ธํ•˜๊ฒŒ ์•Œ์•„๋ณด๊ณ ์ž ํ•จ!

 

๐Ÿ‘ฉ‍๐Ÿฆณ pivot_table() ๐Ÿ‘ฉ‍๐Ÿฆณ

https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

 

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

 

โœ๏ธ data = pivot_table๋กœ ๋ฐ”๊ฟ€ ์› data

โœ๏ธ values = ์–ด๋–ค column์„ value๋กœ ํ•˜์—ฌ ๊ฒฐ๊ณผ๋กœ ํ‘œ์‹œํ•  ์ง€ ๋ณด์—ฌ์คŒ

โœ๏ธ index = index๋กœ ๋‚˜ํƒ€๋‚ผ column ์ด๋ฆ„ (์„ธ๋กœ๋ฐฉํ–ฅ)

โœ๏ธ columns = column์œผ๋กœ ๋‚˜ํƒ€๋‚ผ data์˜ column ์ด๋ฆ„ ์ข…๋ฅ˜๋“ค (๊ฐ€๋กœ๋ฐฉํ–ฅ)

โœ๏ธ aggfunc = values๋ฅผ ์ •ํ–ˆ์œผ๋ฉด ์–ด๋–ค aggfunc์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜ํƒ€๋‚ผ ์ง€ ์ •ํ•  ์ˆ˜ ์žˆ์Œ

โœ๏ธ fill_value = ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๋Š” ๋ฐฉ๋ฒ•

โœ๏ธ margins = True๋กœ ์„ค์ •๋˜๋ฉด aggfunc์— ์˜ํ•ด ๊ตฌํ•œ ๊ฐ index์™€ columns์˜ ๊ฒฐ๊ณผ๋ฅผ ์ƒˆ๋กœ์šด ํ–‰๊ณผ ์—ด๋กœ ํ‘œ์‹œ

โœ๏ธ dropna = True๋กœ ์„ค์ •ํ•˜๋ฉด ๋ชจ๋“  ๊ฐ’์ด NaN์ธ column์€ aggfunc ์—ฐ์‚ฐ์— ํฌํ•จ์‹œํ‚ค์ง€ ์•Š๋Š”๋‹ค๋Š” ๋œป

โœ๏ธ margins_name = margins๋ฅผ True๋กœ ์„ค์ •์‹œ ์ƒˆ๋กœ์šด ์ง‘๊ณ„ํ•จ์ˆ˜๋กœ ๋‚˜์˜ฌ column ์ด๋ฆ„ ์„ค์ • (default๋Š” All)

โœ๏ธ observed = categorical variable์— ๊ด€ํ•˜์—ฌ ์ง‘๊ณ„ํ•จ์ˆ˜ ์‚ฌ์šฉ ์‹œ ์‚ฌ์šฉํ•œ column๋“ค๋งŒ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ (True๋กœ ์„ค์ • ์‹œ)

โœ๏ธ sort = ์ถœ๋ ฅ๊ฒฐ๊ณผ๋ฅผ sorting

 

๐Ÿฆธ‍โ™‚๏ธ ์ฆ‰! ์šฐ๋ฆฌ๋Š” dataframe์„ EDAํ•˜๋Š” ๊ณผ์ •์—์„œ ์–ด๋–ค ์ง‘๊ณ„ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๋‘ column ์‚ฌ์ด์˜ ์—ฐ๊ด€์„ฑ์„ ๋”ฐ์ง€๊ณ  ์‹ถ์„ ๋•Œ ์‚ฌ์šฉํ•œ๋‹ค! ๐Ÿฆธ‍โ™‚๏ธ

 

- titanic dataset ์˜ˆ์‹œ -

 

Q. ๊ฐ ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ class ๋“ฑ๊ธ‰๋ณ„ ์ƒ์กด ์—ฌ๋ถ€์˜ ํ‰๊ท ์„ ์•Œ๊ณ  ์‹ถ๋‹ค๋ฉด?

(์ด ๋•Œ ๊ฐ ์„ฑ๋ณ„ & ๊ทธ๋ฆฌ๊ณ  ๊ฐ class ๋ณ„ ์ƒ์กด ์—ฌ๋ถ€ ํ‰๊ท ๊นŒ์ง€๋„ ์ƒˆ๋กœ์šด column ๋ฐ row๋ฅผ ์ด์šฉํ•ด์„œ ์ƒ์„ฑํ•œ๋‹ค)

 

pd.pivot_table(data=titanic, index='sex', columns='pclass', values='survived', aggfunc='mean', margins_name = 'Total', margins=True)

 

Q. ๊ฐ ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ class ๋“ฑ๊ธ‰๋ณ„ ์ƒ์กดํ•œ ์‚ฌ๋žŒ ์ˆ˜๋ฅผ ์•Œ๊ณ  ์‹ถ๋‹ค๋ฉด?

(์ด ๋•Œ ๊ฐ ์„ฑ๋ณ„ & ๊ทธ๋ฆฌ๊ณ  ๊ฐ class ๋ณ„ ์ƒ์กด ์—ฌ๋ถ€ ํ•ฉ๊ณ„๊นŒ์ง€๋„ ์ƒˆ๋กœ์šด column ๋ฐ row๋ฅผ ์ด์šฉํ•ด์„œ ์ƒ์„ฑํ•œ๋‹ค)

 

pd.pivot_table(data = titanic, index='sex', columns='pclass', values='survived', aggfunc='count', margins_name = 'Total', margins=True)

 

- ์ถœ๋ ฅ๊ฒฐ๊ณผ (์ขŒ) mean / (์šฐ) count -

 

 

→ ์ด๋ ‡๊ฒŒ pivot_table()์„ ์‚ฌ์šฉํ•˜๋ฉด ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ์ •๋ณด๋งŒ ๊ณจ๋ผ ํ•œ๋ˆˆ์— ์ง‘๊ณ„ํ•จ์ˆ˜๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํ‘œํ˜„ํ•˜๊ณ  ๋ฐ”๋žŒ์งํ•œ EDA ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ!

 

** pivot_table() ์™„๋ฃŒ ๐Ÿ‘ฑ**


* ๋‚ด์šฉ ์ถœ์ฒ˜) https://www.youtube.com/watch?v=RlIiVeig3hc&t=1338s 

* ์ธ๋„ค์ผ ์ถœ์ฒ˜) https://www.pngegg.com/en/png-eyrdq

๋Œ“๊ธ€