Data Science Fundamentals/Pandas&Numpy and Questions

🥰 StrataScratch Python Pandas Medium I - 18 Solved

metamong 2025. 3. 9.

02099 Election Results

# Import your libraries
import pandas as pd

# Start writing code
voting_results.head() #voter / candidate

#boolean indexing - filtering out voters who didn't vote
voting_results = voting_results[~voting_results['candidate'].isna()] #boolean indexing (condition included)
#voting_results = voting_results.dropna()

#voter rate
voting_results['voter_rate'] = voting_results['voter'].apply(lambda x : 1 / (voting_results['voter'] == x).sum())

result = voting_results.groupby('candidate')['voter_rate'].sum() 
#print(result)
result = result.reset_index() #index(new numbers), original index -> new column
# print(result)

result[result['voter_rate'] == result['voter_rate'].max()] #row


result[result['voter_rate'] == result['voter_rate'].max()]['candidate'] #final ans

 

🥠

Q. ๊ฐ voter๊ฐ€ candidate์—๊ฒŒ ํˆฌํ‘œ ์ง„ํ–‰. ์ด ๋•Œ, ์–ด๋–ค voter๋Š” ์—ฌ๋Ÿฌ ๋ช…์˜ candidate์—๊ฒŒ ํˆฌํ‘œํ•˜์˜€๊ณ , ๋‘๋ช…์—๊ฒŒ ํˆฌํ‘œํ•˜๋ฉด, ๊ฐ candidate์€ 1ํ‘œ๊ฐ€ ์•„๋‹Œ 0.5ํ‘œ๋ฅผ ๋‚˜๋ˆ  ๊ฐ€์ง. ์ตœ๊ณ  ๋“ํ‘œ candidate ์ถœ๋ ฅํ•˜๊ธฐ

 

★ Some voters did not vote for any candidate (null handling required)

★ When a voter votes for two or more candidates, each candidate is granted a correspondingly divided share of the vote

 

* Solving the first ★: use boolean indexing with the condition ~voting_results['candidate'].isna() to keep only the non-null rows. Alternatively, dropna() solves it in one shot.

 

* Solving the second ★: create a new voter_rate column and apply a function to each row with apply(). The function is the lambda x : 1 / (voting_results['voter'] == x).sum(). Evaluating voting_results['voter'] == x produces one True per row where that voter appears, so a voter listed two or more times yields that many Trues. Since True is 1 and False is 0, applying sum() tells us, for each row, how many times that voter appears. If the same name (e.g. Sally) appears 3 times, Sally's vote rate is 1/3 rather than 1. In other words, applying 1 / (voting_results['voter'] == x).sum() gives each row's voter rate.

 

* Apply groupby() to get the sum of voter_rate per candidate (groupby must be followed by an aggregation function)

 

* Apply reset_index() to change candidate from the index into a new column

 

* Finish with boolean indexing to get the candidate with the maximum voter_rate!
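The split-vote idea above can be sketched on a toy frame (hypothetical data, not the StrataScratch table); groupby().transform('count') is a vectorized alternative to the apply/lambda version:

```python
import pandas as pd

# Hypothetical mini version of the voting_results table
votes = pd.DataFrame({
    'voter': ['Sally', 'Sally', 'Tom', 'Ann'],
    'candidate': ['A', 'B', 'A', 'A'],
})

# Each ballot is worth 1 / (number of rows that voter appears in)
votes['voter_rate'] = 1 / votes.groupby('voter')['voter'].transform('count')

totals = votes.groupby('candidate')['voter_rate'].sum().reset_index()
winner = totals[totals['voter_rate'] == totals['voter_rate'].max()]['candidate']
```

Sally's two ballots are worth 0.5 each, so candidate A totals 2.5 and B totals 0.5.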


10060 Top Cool Votes

# Import your libraries
import pandas as pd

# Start writing code
yelp_reviews.head()

#boolean indexing
res = yelp_reviews[yelp_reviews['cool'] == yelp_reviews['cool'].max()]

#shown in dataframe along with selected columns
res[['business_name', 'review_text']]

 

🥠 Q. Output the business_name and review_text of the row(s) that received the highest number of cool votes (at least 1), gathered into a dataframe

 

* To output two or more selected columns as a dataframe, nest another [] inside the brackets and list the column names, as in res[['business_name', 'review_text']] - i.e. dataframe_name[[...]].


10353 Workers With The Highest Salaries

# Import your libraries
import pandas as pd

# Start writing code
worker.head()

#inner-merge based on the same column 'id'
worker_merged = worker.merge(title, how = 'inner', left_on = 'worker_id' , right_on = 'worker_ref_id')
#inner-merge: show only intersected rows

#boolean indexing
worker_merged[worker_merged['salary'] == worker_merged['salary'].max()]['worker_title']

 

🥠 Find the worker_title with the highest Salary

 

* The problem itself is simple, but there are two tables, so a merge is required. The code worker.merge(title, how = 'inner', left_on = 'worker_id', right_on = 'worker_ref_id') joins on the shared keys worker_id and worker_ref_id, keeping only rows whose id exists in both tables (an inner merge). left_on and right_on let you specify each table's key column name

 

* boolean indexing to get the worker_title where salary is the maximum
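A minimal sketch of an inner merge on differently named key columns, using hypothetical stand-ins for the worker and title tables:

```python
import pandas as pd

# Toy stand-ins (hypothetical values, not the real tables)
worker = pd.DataFrame({'worker_id': [1, 2, 3], 'salary': [100, 300, 200]})
title = pd.DataFrame({'worker_ref_id': [1, 2], 'worker_title': ['Analyst', 'Manager']})

# left_on / right_on handle the differently named key columns;
# worker_id 3 has no match, so the inner merge drops that row
merged = worker.merge(title, how='inner', left_on='worker_id', right_on='worker_ref_id')
top_title = merged[merged['salary'] == merged['salary'].max()]['worker_title']
```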


02102 Flags per Video

# Import your libraries
import pandas as pd

# Start writing code
user_flags.head()

user_flags = user_flags[user_flags['flag_id'].notnull()]

user_flags['name'] = user_flags['user_firstname'].astype(str) + ' ' + user_flags['user_lastname'].astype(str)

answer = user_flags.groupby(by='video_id')['name'].nunique().reset_index()

 

🥠 A problem to count, per video, how many distinct users (by name) there are, and output the summary

 

(1) First remove the null values

(2) Build the name column by joining firstname and lastname; use astype(str) to convert each column to a string before concatenating them into the new name column

(3) Group the given user_flags dataframe by video_id and use nunique() on name to get the actual number of people per video. Since we grouped, reset_index() is required
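The concatenate-then-nunique steps can be sketched with hypothetical data:

```python
import pandas as pd

# Hypothetical user_flags-style rows
flags = pd.DataFrame({
    'video_id': ['v1', 'v1', 'v1', 'v2'],
    'user_firstname': ['Ann', 'Ann', 'Bob', 'Ann'],
    'user_lastname': ['Lee', 'Lee', 'Kim', 'Lee'],
})

# astype(str) before concatenation, then count distinct names per video
flags['name'] = flags['user_firstname'].astype(str) + ' ' + flags['user_lastname'].astype(str)
answer = flags.groupby('video_id')['name'].nunique().reset_index()
```

v1 was flagged twice by Ann Lee and once by Bob Kim, so it counts 2 distinct users.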


02005 Share of Active Users

# Import your libraries
import pandas as pd

# Start writing code
fb_active_users.head()

df = fb_active_users[fb_active_users['status'] == 'open']
df_us = df[df['country'] == 'USA']
output = df_us.shape[0] * 100/ fb_active_users.shape[0]

 

🥠 Filter the rows matching the conditions, then output the number of matching rows as a percentage of all rows


09610 Find Students At Median Writing

# Import your libraries
import pandas as pd

# Start writing code
sat_scores.head()

sat_scores[sat_scores['sat_writing'] == sat_scores['sat_writing'].median()]['student_id']

#sat_scores['sat_writing'].quantile(0.25)

 

🥠 Output the student_id of the rows whose sat_writing equals the median value

: just use median() directly. For the quartiles Q1 and Q3, use quantile(0.25) and quantile(0.75)
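A quick check of median() and quantile() on a hypothetical five-row table:

```python
import pandas as pd

# Hypothetical sat_scores-style data
sat = pd.DataFrame({'student_id': [1, 2, 3, 4, 5],
                    'sat_writing': [400, 500, 600, 700, 800]})

median_ids = sat[sat['sat_writing'] == sat['sat_writing'].median()]['student_id']
q1 = sat['sat_writing'].quantile(0.25)  # first quartile
q3 = sat['sat_writing'].quantile(0.75)  # third quartile
```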


09650 Top 10 Songs 2010

# Import your libraries
import pandas as pd

# Start writing code
billboard_top_100_year_end.head()

billboard_top_100_year_end = billboard_top_100_year_end[(billboard_top_100_year_end['year'] == 2010) & (billboard_top_100_year_end['year_rank'].between(1, 10))]

billboard_top_100_year_end = billboard_top_100_year_end.sort_values(by='year_rank', ascending=True)
#sort_values returns a new frame, so assign the result to keep the ordering

billboard_top_100_year_end[['year_rank', 'group_name', 'song_name']].drop_duplicates()

 

🥠 Output the 2010 Billboard top 10 list, sorted ascending by rank and with duplicate songs excluded

(1) between(1, 10) filters ranks 1 through 10 in a single condition

(2) sort_values(by = 'year_rank', ascending = True) sorts ascending on the given column

(3) drop_duplicates() removes rows duplicated across the selected columns
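The filter-sort-deduplicate pipeline, sketched on toy rows (hypothetical songs; note that sort_values must be assigned to keep its effect):

```python
import pandas as pd

# Toy billboard-style rows, including a duplicate and an off-year row
bb = pd.DataFrame({
    'year': [2010, 2010, 2010, 2009],
    'year_rank': [1, 2, 1, 1],
    'group_name': ['G1', 'G2', 'G1', 'G3'],
    'song_name': ['S1', 'S2', 'S1', 'S3'],
})

top = bb[(bb['year'] == 2010) & (bb['year_rank'].between(1, 10))]
top = top.sort_values(by='year_rank', ascending=True)  # assign to keep the sort
res = top[['year_rank', 'group_name', 'song_name']].drop_duplicates()
```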


10182 Number of Streets Per Zip Code

# Import your libraries
import pandas as pd

# Start writing code
sf_restaurant_health_violations.head()

sf_restaurant_health_violations['street_str']=sf_restaurant_health_violations['business_address'].str.split()

sf_restaurant_health_violations['street_str']=sf_restaurant_health_violations['street_str'].apply(lambda x: x[1].lower() if x[0].isnumeric() else x[0].lower())

sf_restaurant_health_violations_grouped = sf_restaurant_health_violations.groupby('business_postal_code')['street_str'].nunique().reset_index()

sf_restaurant_health_violations_grouped.sort_values(by=['street_str','business_postal_code'],ascending=[False,True])

 

🥠 Extract only the street name from each address, then count the number of distinct streets per postal_code. Afterwards sort descending/ascending on the two columns

(1) .str.split() first treats the column as strings and splits each value on whitespace

(2) apply(lambda) so that if x[0].isnumeric() we take x[1].lower(), otherwise x[0].lower(), selecting the street name

(3) groupby - nunique() - reset_index()

(4) sort_values() with by on the two columns for descending/ascending order; pass both by and ascending as lists ([])
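The street-name extraction can be sketched with hypothetical addresses:

```python
import pandas as pd

# Hypothetical addresses: the street name follows the number when one is present
sf = pd.DataFrame({
    'business_postal_code': ['94102', '94102', '94103'],
    'business_address': ['10 Market St', 'Mission St', '350 Howard St'],
})

parts = sf['business_address'].str.split()
sf['street'] = parts.apply(lambda x: x[1].lower() if x[0].isnumeric() else x[0].lower())
grouped = sf.groupby('business_postal_code')['street'].nunique().reset_index()
out = grouped.sort_values(by=['street', 'business_postal_code'],
                          ascending=[False, True])
```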


10351 Activity Rank

# Import your libraries
import pandas as pd

# Start writing code
google_gmail_emails.head()

google_gmail_emails = google_gmail_emails.groupby('from_user').size().reset_index(name='total_emails').sort_values(by=['total_emails','from_user'],ascending=[False,True])

google_gmail_emails['Rank'] = range(1, len(google_gmail_emails) + 1)
#google_gmail_emails['rank'] = google_gmail_emails['total_emails'].rank(method='first', ascending=False)

google_gmail_emails

 

🥠 Sort by most emails sent (with two or more sort keys) and create a rank column

(1) groupby with size() groups by the number of rows in each group - the name argument inside reset_index() names the new per-group count column created by size() as 'total_emails'.

(2) sort_values sorts on two or more columns

(3) The new Rank column can be created with range(1, len(df) + 1), or with .rank(method = 'first', ascending = False).
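A sketch of the count-sort-rank steps on hypothetical senders:

```python
import pandas as pd

# Toy email senders
emails = pd.DataFrame({'from_user': ['a', 'b', 'a', 'c', 'a', 'b']})

counts = (emails.groupby('from_user').size()
          .reset_index(name='total_emails')
          .sort_values(by=['total_emails', 'from_user'], ascending=[False, True]))

# After sorting, a plain running number is the rank;
# rank(method='first', ascending=False) gives the same numbers here
counts['rank'] = range(1, len(counts) + 1)
```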


09915 Highest Cost Orders

 

# Import your libraries
import pandas as pd

# Start writing code
customers.head()

df = customers.merge(orders, how="inner", left_on="id", right_on="cust_id")

df = df[('2019-02-01' <= df['order_date']) & (df['order_date'] <= '2019-05-01')]
# df = df[df['order_date'].between('2019-02-01', '2019-05-01')]

df = df.groupby(['first_name','order_date'])['total_order_cost'].sum().reset_index(name='max_cost').sort_values('max_cost',ascending = False)
#df = df.groupby(['first_name','order_date'])['total_order_cost'].sum().to_frame('max_cost').reset_index().sort_values('max_cost',ascending = False)

res = df[df['max_cost'] == df['max_cost'].max()]

 

🥠 From the data within the given period, output the customer info with the highest daily total_order_cost per customer

(1) First filter the data to the given date range, with between() or plain inequalities on the datetime

(2) groupby can take two or more columns as grouping keys. Passing the name argument inside reset_index() sets a new column name for the total_order_cost sum column. / Alternatively, skip the name argument: the groupby result is a Series anyway, so apply to_frame() with the new column's name inside it. to_frame() converts a Series into a DataFrame.

(3) boolean indexing to filter the maximum
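The two equivalent spellings, reset_index(name=...) versus to_frame(...).reset_index(), on a toy frame:

```python
import pandas as pd

# Both lines produce the same dataframe with a renamed sum column
df = pd.DataFrame({'k': ['x', 'x', 'y'], 'v': [1, 2, 4]})

a = df.groupby('k')['v'].sum().reset_index(name='total')
b = df.groupby('k')['v'].sum().to_frame('total').reset_index()
```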


02064 Difference Between Times

# Import your libraries
import pandas as pd

# Start writing code
marathon_male.head()

male=(abs(marathon_male['gun_time']-marathon_male['net_time'])).mean()
female=(abs(marathon_female['gun_time']-marathon_female['net_time'])).mean()

abs(female-male)

 

🥠 The mean of the (absolute) difference between two columns can be computed in one shot, as in the two lines of code above.
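The one-shot absolute-difference mean, on hypothetical times:

```python
import pandas as pd

# Toy gun/net times (hypothetical numbers)
m = pd.DataFrame({'gun_time': [100.0, 200.0], 'net_time': [95.0, 210.0]})
male = (abs(m['gun_time'] - m['net_time'])).mean()  # (5 + 10) / 2
```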


(★) 02014 Hour With The Highest Order Volume

# Import your libraries
import pandas as pd

# Start writing code
postmates_orders.head()

postmates_orders["hour"] = postmates_orders["order_timestamp_utc"].dt.hour
#.apply(lambda x: x.hour)

postmates_orders["date"] = postmates_orders["order_timestamp_utc"].dt.date

result = postmates_orders.groupby(['date', 'hour'])['id'].count().reset_index(name='n_orders')

result = result.groupby('hour')['n_orders'].mean().reset_index() #name x -> n_orders

result[result['n_orders'] == result['n_orders'].max()]

 

🥠 A problem to find which hour of the day has the highest average order volume

* (1) Create hour and date columns (using dt.hour and dt.date), then group by date and hour, in that order, to get the order count for each hour of each date

(pass the name of the newly created column inside reset_index() via its name argument)

* (2) Group again by hour to get the average order count per hour

* (3) Filter the maximum

 

Q. Instead of grouping twice, couldn't we group by hour right away?

A. If 3/11 has 3 orders at 17:00 and 3/12 has 1 order at 17:00, grouping by hour alone gives a count of 4. But since we want the maximum of the per-day, per-hour average, we must first group by day and hour: 3/11 17:00 gives 3 and 3/12 17:00 gives 1, and then grouping by hour yields the average, 2. So be very careful: because of the repeated (date, hour) data, groupby must be done twice.
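The 3-versus-1 example above, run as code on hypothetical rows:

```python
import pandas as pd

# 3/11 has three 17:00 orders, 3/12 has one
orders = pd.DataFrame({
    'date': ['03-11', '03-11', '03-11', '03-12'],
    'hour': [17, 17, 17, 17],
    'id': [1, 2, 3, 4],
})

# A single groupby by hour alone gives count 4 for 17:00
single = orders.groupby('hour')['id'].count()

# Two groupbys: per (date, hour) counts first, then the per-hour average
per_day_hour = orders.groupby(['date', 'hour'])['id'].count().reset_index(name='n_orders')
per_hour = per_day_hour.groupby('hour')['n_orders'].mean().reset_index()
```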


02042 Employees' Years In Service

# Import your libraries
import pandas as pd

# Start writing code
uber_employees.head()

uber_employees['still_employed'] = uber_employees['termination_date'].apply(lambda x: 'yes' if pd.isna(x) else "no")

uber_employees['termination_date'] = uber_employees['termination_date'].fillna(pd.to_datetime('2021-05-01'))
uber_employees['years_worked'] = ((uber_employees['termination_date'] - uber_employees['hire_date']).dt.days)/365

result = uber_employees[uber_employees['years_worked'] > 2][['first_name', 'last_name', 'years_worked', 'still_employed']]

 

🥠 For employees who have worked more than 2 years, find out whether they are still employed

(1) First apply() a lambda that uses pd.isna(x) to fill in 'yes' or 'no'

(2) Use fillna() with pd.to_datetime() to substitute 2021-05-01 for nulls. Then compute years_worked (dt.days / 365) and output the relevant columns for the rows where it is > 2
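A sketch of the fillna/years_worked steps on two hypothetical employees, one terminated and one still employed:

```python
import pandas as pd

# Hypothetical uber_employees-style rows
emp = pd.DataFrame({
    'hire_date': pd.to_datetime(['2015-01-01', '2020-01-01']),
    'termination_date': pd.to_datetime(['2018-01-01', None]),  # None -> NaT
})

emp['still_employed'] = emp['termination_date'].apply(lambda x: 'yes' if pd.isna(x) else 'no')
emp['termination_date'] = emp['termination_date'].fillna(pd.to_datetime('2021-05-01'))
emp['years_worked'] = (emp['termination_date'] - emp['hire_date']).dt.days / 365

over_two = emp[emp['years_worked'] > 2]
```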


02153 Average On-Time Order Value

 

# Import your libraries
import pandas as pd

# Start writing code
delivery_details.head()

delivery_details['duration_time_minute'] = (delivery_details['delivered_to_consumer_datetime'] - delivery_details['customer_placed_order_datetime']).dt.total_seconds()/60

delivery_details_under_45 = delivery_details[delivery_details['duration_time_minute'] <= 45]

delivery_filtered = delivery_details[delivery_details['driver_id'].isin(delivery_details_under_45['driver_id'].drop_duplicates())]

delivery_filtered.groupby('driver_id')['order_total'].mean().reset_index(name='avg_order_value')

 

🥠 Compute the average order value of drivers who have at least one order whose duration time (as defined) is 45 minutes or less.

 

★ Caution: do not simply average only the orders with duration ≤ 45 minutes. A driver whose orders include some over 45 minutes still counts, as long as at least one of that driver's orders is ≤ 45 minutes, and that driver's average order value is computed over all of their orders, including the ones over 45 minutes.

 

(1) Build the duration time column (datetime minus datetime gives the timedelta data type, and total_seconds()/60 converts it to minutes)

(2) Build a separate under_45 dataframe holding only the rows where the new column is 45 minutes or less, and take the driver_id values from it.

(3) Apply drop_duplicates() to those driver_ids and use isin() to test whether each original driver_id is among the deduplicated set

(4) Keep only the rows whose driver_id is included, as the final delivery_filtered dataframe.

(5) Finally run groupby and mean() to get avg_order_value.
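The at-least-one-fast-order filter, sketched with hypothetical deliveries:

```python
import pandas as pd

# Driver 1 has one fast order, driver 2 has none
d = pd.DataFrame({
    'driver_id': [1, 1, 2],
    'duration_time_minute': [30.0, 60.0, 50.0],
    'order_total': [10.0, 20.0, 30.0],
})

under_45 = d[d['duration_time_minute'] <= 45]
kept = d[d['driver_id'].isin(under_45['driver_id'].drop_duplicates())]

# Driver 1's 60-minute order is still included in the average
avg = kept.groupby('driver_id')['order_total'].mean().reset_index(name='avg_order_value')
```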

 

★ Time-related data types: clearly distinguish the two, datetime and timedelta


02113 Extremely Late Delivery

# Import your libraries
import pandas as pd
from datetime import timedelta

# Start writing code
delivery_orders.head()
delivery_orders = delivery_orders[delivery_orders['actual_delivery_time'].notnull()]
delivery_orders['is_extreme'] = (delivery_orders['actual_delivery_time'] > (delivery_orders['predicted_delivery_time'] + timedelta(minutes=20))).astype(int)

#delivery_orders['month'] = (delivery_orders['order_placed_time'].dt.month)
delivery_orders['month'] = delivery_orders['order_placed_time'].dt.to_period('M')

res = delivery_orders.groupby('month')['is_extreme'].agg(lambda x: x.mean()*100).reset_index(name='perc_extremely_delayed')
res

 

🥠 Express the ratio of extremely late orders per year-month as a percentage

 

(1) First, create the extreme-or-not column: datetime + timedelta (a duration) is valid arithmetic, and using timedelta(minutes=20) requires from datetime import timedelta

(2) To split into year-month objects, use dt.to_period('M'); to_period() converts to period-based objects

★ dt.month is literally just the month, not including the year, so it cannot output year-month

(3) groupby the ['month'] column, now converted to month-period objects, then run agg() on is_extreme with x.mean()*100 to get the actual average percentage (apply works instead of agg, too)
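A sketch of the timedelta comparison and to_period('M') bucketing on hypothetical orders (only the first is more than 20 minutes past its prediction):

```python
import pandas as pd
from datetime import timedelta

d = pd.DataFrame({
    'placed': pd.to_datetime(['2023-01-05 10:00', '2023-01-20 10:00', '2023-02-03 10:00']),
    'predicted': pd.to_datetime(['2023-01-05 10:30', '2023-01-20 10:30', '2023-02-03 10:30']),
    'actual': pd.to_datetime(['2023-01-05 11:00', '2023-01-20 10:40', '2023-02-03 10:45']),
})

d['is_extreme'] = (d['actual'] > d['predicted'] + timedelta(minutes=20)).astype(int)
d['month'] = d['placed'].dt.to_period('M')  # year-month Period, not just the month number

res = d.groupby('month')['is_extreme'].mean().mul(100).reset_index(name='perc')
```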


09601 Find the Best Day for Trading AAPL Stock

# Import your libraries
import pandas as pd

# Start writing code
aapl_historical_stock_price.head()

aapl_historical_stock_price['day_of_month'] = aapl_historical_stock_price['date'].dt.day

res1 = aapl_historical_stock_price.groupby('day_of_month')['open'].mean().reset_index()
res2 = aapl_historical_stock_price.groupby('day_of_month')['close'].mean().reset_index()

res = res1.merge(res2, how='inner', left_on = 'day_of_month', right_on = 'day_of_month')
res['diff'] = res['close']-res['open']

res[res['diff'] == res['diff'].max()][['day_of_month', 'open', 'close']]

 

🥠 Output the day of the month on which the open-close difference is largest

* The caveat here: 'day' means the day-of-month across all months, so March 1st and April 1st count as the same day 1. That is why .dt.day can be used


09762 Find the Day of the Week that most People check-in

# Import your libraries
import pandas as pd

# Start writing code
airbnb_contacts.head()

airbnb_contacts['dayofweek'] = airbnb_contacts['ds_checkin'].dt.dayofweek

res = airbnb_contacts.groupby('dayofweek').size().reset_index(name='size')

res[res['size'] == res['size'].max()]

 

🥠 Find the day of the week with the most check-ins

(1) For day-of-week, use dt.dayofweek

(2) groupby().size() gives the number of rows in each group


02154 Top 2 Sales Time Combinations

def categorize_time(hour):
    if hour < 12:
        return 'Morning'
    elif 12 <= hour <= 15:
        return 'Early Afternoon'
    else:
        return 'Late Afternoon'

# Import your libraries
import pandas as pd

# Start writing code
sales_log.head()

sales_log['day_of_week'] = sales_log['timestamp'].dt.day_name()

sales_log['time_of_day'] = sales_log['timestamp'].dt.hour.apply(categorize_time)

res = sales_log.groupby(['day_of_week','time_of_day']).size().reset_index(name='total_orders')

top_orders = res['total_orders'].nlargest(2)

res[res['total_orders'].isin(top_orders)].sort_values(by='total_orders', ascending = False)

 

🥠 Output the top-2 order counts (including all ties) across day_of_week and time_of_day combinations

(1) First pull day_of_week from the datetime column; to get weekday names like Saturday or Friday rather than the numbers 0-6, use dt.day_name()

(2) For time_of_day, apply() a separate function, categorize_time, which handles the several branches

(3) groupby on the two columns / since we want each group's size, apply size()

(4) To pick the top-ranked data including duplicates, extract the values with nlargest(x), then use isin() to bring back all matching rows
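The tie-inclusive top-2 selection via nlargest() plus isin(), on hypothetical counts with a tie for second place:

```python
import pandas as pd

res = pd.DataFrame({
    'combo': ['Mon Morning', 'Tue Morning', 'Wed Late Afternoon', 'Thu Morning'],
    'total_orders': [5, 3, 3, 2],
})

top_orders = res['total_orders'].nlargest(2)          # values 5 and 3
out = res[res['total_orders'].isin(top_orders)].sort_values(
    by='total_orders', ascending=False)               # both tied 3s are kept
```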


 

 

 

 

 


* Key code keywords from the problems above

(1) boolean indexing with ~isna() to delete rows containing null data

(2) apply() with lambda x : 1 / (voting_results['voter'] == x).sum() to compute each row's share

(3) groupby + aggregation

(4) reset_index(): assigns a fresh 0-based index, and the old index becomes a new column

(5) [[]] to get the selection back as a dataframe

(6) merge - inner merge / left_on / right_on

(7) to join two string columns, convert each with astype(str) and concatenate them into a new column

(8) nunique() counts each column's distinct values with duplicates removed

(9) df.shape gives the number of rows and columns

(10) median(), quantile(0.25), quantile(0.75)

(11) between() to filter rows within a range of a column / sort_values(by = 'column name') for ascending/descending sort on a column / drop_duplicates() to remove rows duplicated across the selected columns

(12) .str.split() to treat each value in a column as a string and split on whitespace / apply(lambda x) to extract content with an if expression / sort_values() can sort on two or more columns (list the by and ascending values in [] in order of sort priority)

(13) to group by the number of rows in each group, use groupby('column name').size(); since this creates a new count column, name it via the name argument of reset_index() (or convert the resulting groupby Series to a dataframe with to_frame(), setting the name inside to_frame())

(14) to build a rank column for some data, use .rank(method = 'first', ascending = False); method = 'first' gives the earlier row the better rank on ties

(15) datetime filtering with inequalities or between()

(16) groupby can take two or more columns as grouping keys

(17) the mean of the absolute difference between two columns can be computed in one shot

(18) remember from problem 02014 why groupby is needed twice / use .dt.hour & .dt.date

 

 

 

 

 
