
concat & append & merge & join

metamong 2022. 4. 9.

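👋 Data is usually split across many separate tables (for security and efficiency), so combining them is a step you will inevitably go through during data preprocessing.

Which function is the best fit for which case? This post walks through

four of them: concat, append, merge, and join.

👉 Documentation links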

- concat) https://pandas.pydata.org/docs/reference/api/pandas.concat.html

- append - dataframe) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html

- append - series) https://pandas.pydata.org/docs/reference/api/pandas.Series.append.html#pandas.Series.append

- merge) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

- join) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

1. concat

→ Unlike the other three methods, concat can be used on any pandas object.

→ As the name says, it 'concatenates', i.e. sticks objects together.

→ Works on both Series and DataFrame.

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

"Concatenate pandas objects along a particular axis with optional set logic along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number."

axis (default 0(index))

"The axis to concatenate along."

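→ By default (axis=0) the objects are stacked along the index, i.e. the second Series (DataFrame) is attached below the first (think of physically gluing them together).

→ With axis=1 (columns) it is attached to the right of the first Series (DataFrame); since this creates multiple columns, a DataFrame is returned.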

import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2], axis=0)

'''
0    a
1    b
0    c
1    d
dtype: object
'''
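For comparison, here is a minimal sketch of the axis=1 case described above, reusing the same two Series. The variable name side_by_side is only for illustration.

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

# axis=1 aligns the two Series on their index and places them side by side,
# so the result is a DataFrame with one column per input Series.
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side)

'''
   0  1
0  a  c
1  b  d
'''
```

Because the Series are unnamed, the new columns are simply labeled 0 and 1.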

ignore_index (default False)

"If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join."

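→ When set to True, the index is relabeled 0 through n-1. As the code above shows, with the default the original labels are kept (0, 1, 0, 1), so the index does not run from 0 to 3.

→ So if the resulting index is messy and the original index carries no meaningful information, set ignore_index=True.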
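A short sketch of the same concat with ignore_index=True, to make the relabeling concrete. The name renumbered is only for illustration.

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

# ignore_index=True discards the original labels (0, 1, 0, 1)
# and relabels the result 0..n-1.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered)

'''
0    a
1    b
2    c
3    d
dtype: object
'''
```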

keys (default None)

"If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level."

→ Used to build a hierarchical (multi-level) index.

→ Pass the labels you want for the outermost level as a list.

→ Below, 's1' and 's2' become the outer index level, so the result has a MultiIndex.

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2], keys=['s1', 's2'])

'''
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object
'''

pd.concat([s1, s2], keys=['s1', 's2']).index

'''
MultiIndex([('s1', 0),
            ('s1', 1),
            ('s2', 0),
            ('s2', 1)],
           )
'''

names (default None)

"Names for the levels in the resulting hierarchical index."

→ When keys has produced a MultiIndex, pass a list to names to give each index level a name.

→ (Example below) the two levels are named 'Series name' and 'Row ID'.

pd.concat([s1, s2], keys=['s1', 's2'], names=['Series name', 'Row ID'])

'''
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object
'''

verify_integrity (default False)

"Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation."

→ Checks whether the concatenated axis ends up with duplicate index values.

→ For example, if both objects share the index label 'a' as below, concat raises an error instead of blindly concatenating.

(Raises ValueError: Indexes have overlapping values)

df5 = pd.DataFrame([1], index=['a'])
df6 = pd.DataFrame([2], index=['a'])

pd.concat([df5, df6], verify_integrity=True)

'''
ValueError: Indexes have overlapping values: Index(['a'], dtype='object')
'''

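join (default 'outer')

"How to handle indexes on other axis (or axes)."

→ The default is outer: whether the inputs are Series or DataFrames, the union of the labels on the other axis is kept and every cell with no data is filled with NaN.

→ In particular, when concatenating two DataFrames whose columns differ, the rows are simply stacked and the missing cells become NaN.

→ For example, of the two DataFrames below, the one being attached has a new column, animal. The rows coming from the original DataFrame get NaN in the animal column.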

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
pd.concat([df1, df3])

→ With join='inner', the animal column is dropped (only the shared columns letter and number survive).

→ That is, inner keeps only the intersection of the columns, so the alignment never introduces NaN.

pd.concat([df1, df3], join='inner')

- only the two shared columns remain -
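A minimal sketch putting the two calls side by side, reusing df1 and df3 from above. The names outer and inner are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])

# join='outer' (default): union of the columns, gaps filled with NaN
outer = pd.concat([df1, df3])
print(outer)

'''
  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog
'''

# join='inner': only the columns present in every input
inner = pd.concat([df1, df3], join='inner')
print(inner)

'''
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
'''
```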

sort (default False)

"Sort non-concatenation axis if it is not already aligned when join is ‘outer’. This has no effect when join='inner', which already preserves the order of the non-concatenation axis."

→ When True, the column names are sorted in the output.

→ With join='inner' it has no effect, because inner already preserves the existing order of the non-concatenation axis.
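A small sketch of sort=True with an outer concat. The two frames here (left_df, right_df) are hypothetical and deliberately list their columns in different orders so the effect is visible.

```python
import pandas as pd

# Hypothetical frames whose columns are not in the same order
left_df = pd.DataFrame([[1, 'a']], columns=['number', 'letter'])
right_df = pd.DataFrame([['cat', 'c', 3]], columns=['animal', 'letter', 'number'])

# sort=False (default): the columns are not sorted
unsorted_cols = pd.concat([left_df, right_df])
print(list(unsorted_cols.columns))

# sort=True: the column labels are sorted
sorted_cols = pd.concat([left_df, right_df], sort=True)
print(list(sorted_cols.columns))  # ['animal', 'letter', 'number']
```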

2. append (DataFrame.append & Series.append)

→ A special case of concat: append is concat with join='outer' and axis=0.

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

"Append rows of other to the end of caller, returning a new object. Columns in other that are not in the caller are added as new columns."

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])

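df.append(df2)  # works only on pandas < 2.0

!-- But this emits a warning --!

😅 The FutureWarning kindly tells you that append will be removed soon and that you should use concat instead 😅

~~(Same for Series..!)~~

(append was deprecated in pandas 1.4 and removed in pandas 2.0, where the call above raises AttributeError.)

So use concat.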
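A minimal sketch of the concat equivalent of the append call above, reusing df and df2. The single-row example (new_row) is an illustrative addition, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])

# Equivalent of df.append(df2): stack the rows of df2 below df
combined = pd.concat([df, df2])
print(combined)

'''
   A  B
x  1  2
y  3  4
x  5  6
y  7  8
'''

# Appending one row: wrap it in a one-row DataFrame first (illustrative)
new_row = pd.DataFrame([{'A': 9, 'B': 10}])
with_row = pd.concat([df, new_row], ignore_index=True)
print(with_row)
```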

3. merge

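→ Depending on the how argument, the merge is inner, outer, left, or right.

→ concat attaches objects as they are, whereas merge matches rows on a key and returns 'the shared part + alpha' (what the alpha is depends on the arguments).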

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

"Merge DataFrame or named Series objects with a database-style join. A named Series object is treated as a DataFrame with a single named column. The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed."

how(default 'inner')

"left: use only keys from left frame, similar to a SQL left outer join; preserve key order"

"right: use only keys from right frame, similar to a SQL right outer join; preserve key order"

"outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically"

"inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys"

ex) Build two DataFrames and use column 'a' as the key to match on!

(Try every kind of merge, and compare with concat!)

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

df1.merge(df2, how='inner', on='a')
df1.merge(df2, how='left', on='a')
df1.merge(df2, how='right', on='a')
df1.merge(df2, how='outer', on='a')

# compare with concat
pd.concat([df1, df2], axis=1, join='inner')

(Figure: bottom row, left to right: inner - left - right - outer; top right: concat with join='inner')

→ The figure above shows the difference. Unlike an inner merge, concat with join='inner' literally glues the frames side by side, so the duplicated column name 'a' appears twice in the result.

→ With merge, the key shared by the two DataFrames appears only once, and the merge types differ in whether the non-matching rows are kept.
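A rough sketch that summarizes which key values each merge type keeps, using the same df1 and df2. The keys are sorted before printing because the row order of the result can differ between pandas versions; the dictionary keys_by_how is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

# Collect the key values that survive each kind of merge
keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = df1.merge(df2, how=how, on='a')
    keys_by_how[how] = sorted(result['a'])
    print(how, keys_by_how[how])

# inner ['foo']
# left  ['bar', 'foo']
# right ['baz', 'foo']
# outer ['bar', 'baz', 'foo']

# concat with axis=1 just glues the frames together, so 'a' appears twice
glued = pd.concat([df1, df2], axis=1, join='inner')
print(list(glued.columns))  # ['a', 'b', 'a', 'c']
```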

on, left_on, right_on(default None)

"on)label or list - column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on) label or list, or array-like - Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on) label or list, or array-like - Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns."

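→ As the example above shows, these set the key the merge is performed on.

→ Using on means the key column has the same name in both DataFrames.

→ If the names differ, use left_on and right_on to name the key column in each DataFrame!

(When merging with left_on and right_on, both key columns are returned, and the non-key columns that share a name come back as value_x and value_y for the left and right sides.)

→ ex)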

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

df1.merge(df2, left_on='lkey', right_on='rkey')

- both key columns appear in the result -
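A minimal sketch that inspects the result of the merge above. Dropping the redundant rkey column afterwards is an optional extra step added here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

merged = df1.merge(df2, left_on='lkey', right_on='rkey')

# Both key columns are kept, and the overlapping 'value' gets _x / _y
print(list(merged.columns))  # ['lkey', 'value_x', 'rkey', 'value_y']

# 'foo' appears twice on each side, so it yields 2 x 2 = 4 rows;
# 'bar' and 'baz' yield one row each, for 6 rows in total.
print(len(merged))  # 6

# Optional: lkey and rkey hold the same values after an inner merge,
# so one of them can be dropped.
deduped = merged.drop(columns='rkey')
```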

suffixes (default ("_x", "_y"))

"A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None."

→ When columns that are not merge keys have the same name in both DataFrames, they would be hard to tell apart, so a suffix is appended to each column name to distinguish them. (If on is not set, every identically named column becomes a merge key, so suffixes are not needed; once on is set, an identically named column may not be a merge key, and suffixes are what distinguish it.)

→ In the example above both DataFrames have a value column, so here '_1' and '_2' are appended to tell them apart.

(The defaults are _x and _y, which is why the previous result has value_x and value_y.)

df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_1', '_2'))

- split into value_1 and value_2 -
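To make the point about on and suffixes concrete, here is a small sketch with two hypothetical frames (left and right) that share both a key column and a value column. The suffix strings '_left' and '_right' are arbitrary choices.

```python
import pandas as pd

# Hypothetical frames sharing both column names
left = pd.DataFrame({'key': ['foo', 'bar'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'value': [1, 9]})

# No 'on': every shared column ('key' and 'value') becomes a merge key,
# so no suffix is needed and only the fully matching row survives.
both_keys = left.merge(right)
print(both_keys)

'''
   key  value
0  foo      1
'''

# on='key': 'value' is no longer a key, so the two 'value' columns
# must be told apart with suffixes.
one_key = left.merge(right, on='key', suffixes=('_left', '_right'))
print(list(one_key.columns))  # ['key', 'value_left', 'value_right']
```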

4. join

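→ The principle is basically the same as merge: it combines two DataFrames on a key.

→ So how does it differ from merge? merge combines on whichever columns you choose, while join combines on the index by default!

(That said, join is not limited to index-on-index: the on argument names a column of the calling DataFrame to match against the index of other.)

(And merge can also combine on the index if left_index and right_index are each set to True.)

→ join covers only a subset of what merge can do, so merge is recommended over join...!

→ The DataFrame that join is called on is the left one, and the DataFrame passed as the other parameter is the right one.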
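A minimal sketch of the on argument and its merge counterpart. The frames left and right and their labels (K0, A0, B0, ...) are hypothetical; the point is only that on names a column of the caller and matches it against the index of other.

```python
import pandas as pd

# Hypothetical data: 'key' is a column on the left, the index on the right
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'B': ['B0', 'B1']}, index=['K0', 'K1'])

# join: match left['key'] against right's index (left join by default)
via_join = left.join(right, on='key')
print(via_join)

'''
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2  NaN
'''

# The same result spelled out with merge
via_merge = left.merge(right, how='left', left_on='key', right_index=True)
print(via_join.equals(via_merge))  # True
```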

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

"Join columns of another DataFrame. Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list."

ex) A join where the index automatically becomes the join key!

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'valuel': [1, 2, 3, 5]}, index=['1','4','5','6'])
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'valuer': [5, 6, 7, 8]}, index=['2','3','5','7'])
                    
df1.join(df2)

- a left join on the index: only the shared index label '5' finds a match! -
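A short sketch of what this join returns, plus the how='inner' variant and the equivalent merge call. The names left_join, inner_join, and same_as_merge are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'valuel': [1, 2, 3, 5]}, index=['1', '4', '5', '6'])
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'valuer': [5, 6, 7, 8]}, index=['2', '3', '5', '7'])

# Default how='left': every row of df1 is kept; only index '5' has a match
left_join = df1.join(df2)
print(left_join)

'''
  lkey  valuel rkey  valuer
1  foo       1  NaN     NaN
4  bar       2  NaN     NaN
5  baz       3  baz     7.0
6  foo       5  NaN     NaN
'''

# how='inner': only the index labels present in both frames
inner_join = df1.join(df2, how='inner')
print(inner_join.index.tolist())  # ['5']

# The same left join written with merge
same_as_merge = df1.merge(df2, how='left', left_index=True, right_index=True)
print(left_join.equals(same_as_merge))  # True
```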

* Source 1) https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/

* Source 2) https://stackoverflow.com/questions/15819050/pandas-dataframe-concat-vs-append
