
One-Hot encoding

metamong 2022. 4. 17.

≫ In the ML data preprocessing part, I mentioned that several encoding techniques are applied to turn data into a form the model can understand. Today I want to study one of them: 'One-Hot encoding'

intro. Machine Learning

1. Overview → ML is a powerful tool for analyzing big data. It makes up for limits that traditional statistics and visualization cannot get past! 👏 A technique for predicting the future from data 👋 Given ..

sh-avid-learner.tistory.com

concepts

'One hot encoding can be defined as the essential process of converting the categorical data variables to be provided to machine and deep learning algorithms which in turn improve predictions as well as classification accuracy of a model. One Hot Encoding is a common way of preprocessing categorical features for machine learning models.'

→ In other words, one-hot encoding is the process of encoding a categorical variable so that it can be used in a machine learning model!

Intro + Aesthetics, Data Type & Scales (source from <Fundamentals of DV by Claus O.Wilke>)

* Intro "Data visualization is part art and part science. The challenge is to get the art right without getting the science wrong and vice versa. A data visualization first and foremost has to accu..

sh-avid-learner.tistory.com

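👆 As the post above shows, categorical variables fall into two broad types 👉 nominal & ~~ordinal~~

→ nominal variables are categorical variables with no inherent order, and ordinal variables are categorical variables with a defined order (one-hot encoding is not for ordinal ❌)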

🤦 The woes of qualitative categorical variables... → how do we get past this ML modeling limitation? 🤦

→ Nominal data is qualitative, i.e. not a quantitative variable, so some ML models cannot accept it as input! That is why we use one-hot encoding (as one step of data preprocessing)

- The one-hot encoding process -

: As in the figure above, every category found in a column gets its own column, filled with 1 where the row belongs to that category and 0 otherwise. In this numerical form, the ML model can accept it as input!
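To spell the process out by hand, here is a minimal sketch. The `color` values and the `color_` prefix are my own illustration (assumed, not taken from the figure); each new column is just a 0/1 answer to "is this row that category?".

```python
import pandas as pd

# Illustrative data (assumed): one nominal column
colors = pd.Series(["red", "yellow", "green", "red"], name="color")

# One new column per distinct category
categories = sorted(colors.unique())

# Each column holds 1 where the row matches the category, else 0
one_hot = pd.DataFrame(
    {f"color_{c}": (colors == c).astype(int) for c in categories}
)

print(one_hot)
```

Every row ends up with exactly one 1, which is where the name "one-hot" comes from.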

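→ However, one-hot encoding adds one dimension (column) per category. So it is a poor fit ~~when there are too many categories~~ (high cardinality): the number of columns balloons, and the extra data volume becomes a burden

(Here, cardinality means the number of distinct categories in a categorical column. For example, the color column above has a cardinality of 3: red, yellow, green)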
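A quick way to check cardinality before encoding is `nunique()`. The sketch below uses made-up data; the `user_id` column is a hypothetical example of a high-cardinality feature, added only to show how the column count grows.

```python
import pandas as pd

# Made-up data: "color" has few categories, "user_id" is unique per row
df = pd.DataFrame({
    "color": ["red", "yellow", "green", "red", "green", "yellow"],
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

# Cardinality = number of distinct values per column
cardinality = df.nunique()
print(cardinality)

# One-hot encoding creates one column per distinct value
encoded = pd.get_dummies(df, columns=["color", "user_id"])
print(encoded.shape)
```

Here two input columns turn into nine, and a column like `user_id` would keep growing with the number of rows.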

Q. When should you use one-hot encoding, and when is it not recommended?

1> Recommended

- The categorical features present in the data are not ordinal, i.e. they are nominal

(use it when the values are a plain list of items with no implied order)

- When the number of categorical features present in the dataset is small, so that the one-hot encoding technique can be effectively applied while building the model. (when there are few categorical columns and few categories, so modeling stays effective even after every category is one-hot encoded)

2> Not recommended

- When the categorical features present in the dataset are ordinal, i.e. for data like Junior, Senior, Executive, Owner.

(do not use it for ordinal data: the order carries meaning, so splitting the values into 1s and 0s as if the order meant nothing throws that information away)

- When the number of categories in the dataset is quite large. One Hot Encoding should be avoided in this case as it can lead to high memory consumption. (with too many categories, modeling performance drops and memory usage becomes a real concern)
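For the ordinal case, one common alternative (not covered in this post, so take it as a side sketch) is to map each level to its rank. Below I assume the order Junior < Senior < Executive < Owner and use an ordered pandas `Categorical`; the `title_rank` name is my own.

```python
import pandas as pd

# Assumed order of the levels, lowest to highest
levels = ["Junior", "Senior", "Executive", "Owner"]

titles = pd.Series(["Senior", "Junior", "Owner", "Executive", "Junior"])

# An ordered Categorical keeps the ranking; .codes gives the integer rank
ordered = pd.Categorical(titles, categories=levels, ordered=True)
title_rank = pd.Series(ordered.codes, index=titles.index, name="title_rank")

print(title_rank.tolist())
```

Unlike one-hot columns, the single `title_rank` column keeps the information that Owner ranks above Junior.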

↓ Excerpted from the article below, which explains one-hot encoding in great detail

'As a machine can only understand numbers and cannot understand the text in the first place, this essentially becomes the case with Deep Learning & Machine Learning algorithms. One hot encoding can be defined as the essential process of converting the categorical data variables to be provided to machine and deep learning algorithms which in turn improve predictions as well as classification accuracy of a model. One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category. One hot encoding is a highly essential part of the feature engineering process in training for learning techniques. For example, we had our variables like colors and the labels were “red,” “green,” and “blue,” we could encode each of these labels as a three-element binary vector as Red: [1, 0, 0], Green: [0, 1, 0], Blue: [0, 0, 1]. The Categorical data while processing, must be converted to a numerical form. One-hot encoding is generally applied to the integer representation of the data. Here the integer encoded variable is removed and a new binary variable is added for each unique integer value. During the process, it takes a column that has categorical data, which has been label encoded and then splits the following column into multiple columns. The numbers are replaced by 1s and 0s randomly, depending on which column has what value. While the method is helpful ~~for some ordinal situations~~, some input data does not have any ranking for category values, and this can lead to issues with predictions and poor performance.'
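The "integer representation first, then binary vectors" route from the excerpt can be sketched with NumPy. The category order red, green, blue is taken from the excerpt's example; the `labels` list and variable names are my own.

```python
import numpy as np

# Category order as in the excerpt: Red, Green, Blue
categories = ["red", "green", "blue"]
labels = ["red", "green", "blue", "green"]

# Step 1: integer (label) encoding, red -> 0, green -> 1, blue -> 2
index_of = {cat: i for i, cat in enumerate(categories)}
integer_encoded = np.array([index_of[label] for label in labels])

# Step 2: pick row i of the identity matrix for integer i
one_hot = np.eye(len(categories), dtype=int)[integer_encoded]

print(one_hot)
```

Note that the 1s and 0s are fully determined by the category of each sample: red always becomes [1, 0, 0], green [0, 1, 0], blue [0, 0, 1].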

(w/ code)

There are two main ways to do it: using the pandas get_dummies function, or using the category_encoders library

pandas 👉 get_dummies()

♪ pandas.get_dummies() docu ♪

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

'Convert categorical variable into dummy/indicator variables'

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

- Suppose we have the following DataFrame

import pandas as pd

df = pd.DataFrame({
    # City: categorical, unordered (nominal)
    'City': ['Seoul', 'Seoul', 'Seoul', 'Busan', 'Busan', 'Busan', 'Incheon', 'Incheon', 'Seoul', 'Busan', 'Incheon'],
    # Room: numerical, discrete
    'Room': [3, 4, 3, 2, 3, 3, 3, 3, 3, 3, 2],
    # Price: numerical, continuous
    'Price': [55000, 61000, 44000, 35000, 53000, 45000, 32000, 51000, 50000, 40000, 30000],
})

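- Pass the column you want to one-hot encode through the columns argument and you are done! (prefix only sets the prefix of the new column names; with columns=None, get_dummies encodes every object/string/category column, which here is only City. dtype=int gives 0/1 values, since pandas 2.0+ returns True/False by default)

df_oh = pd.get_dummies(df, columns=['City'], dtype=int)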

- df, and df_oh after one-hot encoding

- Now the City column can take part in ML modeling too 👏 -
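For reference, a self-contained sketch that rebuilds the DataFrame and inspects the result. Selecting the column with `columns=["City"]` and asking for `dtype=int` are my choices here: the first makes the encoded column explicit, the second gives 0/1 instead of the True/False that recent pandas versions return by default.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Seoul", "Seoul", "Seoul", "Busan", "Busan", "Busan",
             "Incheon", "Incheon", "Seoul", "Busan", "Incheon"],
    "Room": [3, 4, 3, 2, 3, 3, 3, 3, 3, 3, 2],
    "Price": [55000, 61000, 44000, 35000, 53000, 45000,
              32000, 51000, 50000, 40000, 30000],
})

# City is replaced by one 0/1 column per city; Room and Price are untouched
df_oh = pd.get_dummies(df, columns=["City"], dtype=int)

print(df_oh.columns.tolist())
print(df_oh.head(4))
```

The numeric columns keep their place and the new `City_*` columns are appended after them.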

category_encoders library

👍 Import the category_encoders library and use its OneHotEncoder class

** OneHotEncoder docu **

https://contrib.scikit-learn.org/category_encoders/onehot.html

class category_encoders.one_hot.OneHotEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', use_cat_names=False)

- Let's encode the City column of the same DataFrame as above

1> Install and import

!pip install category_encoders

## import OneHotEncoder
from category_encoders import OneHotEncoder

2> Create an OneHotEncoder object

encoder = OneHotEncoder(use_cat_names = True)

→ Set use_cat_names=True (the default is False, so True is recommended for convenience)

'if True, category values will be included in the encoded column names. Since this can result in duplicate column names, duplicates are suffixed with ‘#’ symbol until a unique name is generated. If False, category indices will be used instead of the category values.'

- The categorical values themselves become the names of the encoded columns, which makes the columns easy to recognize!

3> Use fit_transform & transform

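# learn the categories from the training data and encode it in one step
X_train = encoder.fit_transform(X_train)
# reuse the already-fitted encoder on the test data (no refitting)
X_test = encoder.transform(X_test)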

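→ fit_transform) 'Fit to data, then transform it' - fitting teaches the encoder how to encode (which categories exist), and transform then one-hot encodes that same data

→ transform) 'Perform the transformation to new categorical data' - once fitting is done, transform uses the fitted encoder to one-hot encode categorical data

(※ The key point: once the encoder has been fitted, use only transform from then on. The encoder has already learned the categories, so there is no need to fit it again, and refitting on the test data could produce a different set of columns)

(The df above was split into train & test data and each part was encoded separately; the split is only there to show the difference between fit_transform & transform)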
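To make the fit / transform split concrete without needing category_encoders installed, here is a toy encoder of my own. `TinyOneHot` is hypothetical and is not part of any library; it only mimics the idea that fitting remembers the categories and transform reuses them. The train/test rows are made up.

```python
import pandas as pd


class TinyOneHot:
    """Toy stand-in for a fit/transform encoder (hypothetical)."""

    def __init__(self, col):
        self.col = col
        self.categories_ = None

    def fit(self, X):
        # "Learning" here just means remembering the categories seen
        self.categories_ = sorted(X[self.col].unique())
        return self

    def transform(self, X):
        # Always emit one column per remembered category
        out = X.drop(columns=[self.col])
        for cat in self.categories_:
            out[f"{self.col}_{cat}"] = (X[self.col] == cat).astype(int)
        return out

    def fit_transform(self, X):
        return self.fit(X).transform(X)


train = pd.DataFrame({"City": ["Seoul", "Busan", "Incheon", "Seoul"],
                      "Room": [3, 2, 3, 4]})
test = pd.DataFrame({"City": ["Busan", "Busan"], "Room": [3, 2]})

enc = TinyOneHot("City")
train_enc = enc.fit_transform(train)  # fit on train only
test_enc = enc.transform(test)        # reuse the learned categories

# Encoding the test set on its own only sees "Busan"
naive = pd.get_dummies(test, columns=["City"], dtype=int)

print(test_enc.columns.tolist())
print(naive.columns.tolist())
```

The fitted encoder gives the test set the same columns as the training set (with zeros for cities it does not contain), while encoding the test set independently yields fewer columns, which a model trained on `train_enc` could not consume.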

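4> The result has the same one-hot columns as get_dummies above (the column order may differ)!

(cf. dropping one of the encoded columns is recommended (a degrees-of-freedom idea): when the other two columns are 0, 0, the remaining column is necessarily 1, so it carries no extra information)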
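With pandas, dropping one encoded column is what `drop_first=True` does. A small sketch on made-up rows, also checking that the dropped column can be recovered from the remaining ones:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Seoul", "Busan", "Incheon", "Seoul"]})

# All three columns
full = pd.get_dummies(df, columns=["City"], dtype=int)

# Drop the first category (Busan, alphabetically first)
reduced = pd.get_dummies(df, columns=["City"], drop_first=True, dtype=int)

# The dropped column is implied: it is 1 exactly when all others are 0
recovered_busan = 1 - reduced.sum(axis=1)

print(reduced.columns.tolist())
print(recovered_busan.tolist())
```

Since `recovered_busan` matches `full["City_Busan"]`, the third column adds no new information, which matters most for linear models, where such redundant columns are perfectly collinear.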

That's a wrap ✌️

** Thumbnail source) https://www.crunchbase.com/organization/one-zero

** Source 1) https://analyticsindiamag.com/when-to-use-one-hot-encoding-in-deep-learning/

** Source 2) https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe

License: Attribution, NonCommercial, NoDerivatives
