
Ordinal Encoding

metamong 2022. 4. 20.

👀 Machine learning models generally assume that every input and output is numeric. So if the data we are given contains categorical variables, we need to convert them to numeric values before feeding them to the model.

 

🕵️‍♂️ As a representative example of such a conversion, we already learned about One-Hot Encoding.

 

 

Previous post: One-Hot encoding (sh-avid-learner.tistory.com), an introduction to one of the encoding techniques used in ML data preprocessing to convert data into a form the model can understand.

 

- The two kinds of categorical data types, out of six data types in total -

 

→ We also learned that for qualitative categorical variables that are unordered, i.e. data with no defined order, one-hot encoding is the most suitable technique.

 

!-- So what about qualitative categorical variables that are ordered? --!

 

→ For 'ordered data', which we mentioned earlier as a limitation of one-hot encoding, we can use a separate technique called ordinal encoding!

 

> Let's take a closer look at ordinal encoding <

w/scikit-learn

👨‍🏫 In ordinal encoding, integers of increasing size are assigned following the order inherent in the data. Let's work through an example using the category_encoders library.
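Before turning to the library, here is a minimal, library-free sketch of the idea. It assumes a hypothetical size feature whose order (small < medium < large) is known in advance; the names `SIZE_ORDER` and `ordinal_encode` are made up for illustration and are not part of category_encoders.

```python
# Hypothetical ordered categories, listed from lowest to highest.
SIZE_ORDER = ["small", "medium", "large"]


def ordinal_encode(values, order):
    """Map each label to its 1-based rank in `order`."""
    rank = {label: i for i, label in enumerate(order, start=1)}
    return [rank[v] for v in values]


encoded = ordinal_encode(["medium", "small", "large", "small"], SIZE_ORDER)
print(encoded)  # [2, 1, 3, 1]
```

The single integer column keeps the ranking: a larger code always means a larger size.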

category_encoders library

** OrdinalEncoder docu **

https://contrib.scikit-learn.org/category_encoders/

https://contrib.scikit-learn.org/category_encoders/ordinal.html

 

class category_encoders.ordinal.OrdinalEncoder(verbose=0, mapping=None, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')

 

'Encodes categorical features as ordinal, in one ordered feature. Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.'

 

 

🧤 As with the one-hot encoder, the ordinal encoder is applied with fit_transform on the first data, and then with the transform method on any subsequent data.

 

1> import

 

from category_encoders import OrdinalEncoder

 

2> Create an OrdinalEncoder object

 

enc = OrdinalEncoder()

 

→ Hold on! Let's look at the main arguments of OrdinalEncoder() ←

 

① cols = by default, the encoder encodes all string columns of the data passed in. To encode only specific columns, pass a list of the desired column names to cols.

'a list of columns to encode, if None, all string columns will be encoded.'

 

② return_df = determines the return type of the encoded result. The default is True, which returns the result as a DataFrame. Setting it to False returns a numpy array.

'boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).'

 

③ mapping = explicitly specifies the order of the values instead of relying on the order found in the data. You can assign the desired code to every value of the desired column, one by one!

'a mapping of class to label to use for the encoding, optional. the dict contains the keys ‘col’ and ‘mapping’. the value of ‘col’ should be the feature name. the value of ‘mapping’ should be a dictionary of ‘original_label’ to ‘encoded_label’. example mapping:

[{‘col’: ‘col1’, ‘mapping’: {None: 0, ‘a’: 1, ‘b’: 2}}, {‘col’: ‘col2’, ‘mapping’: {None: 0, ‘x’: 1, ‘y’: 2}}]'

 

④ handle_unknown = sets how values that were not seen during fit (unknown categories) are handled. The default is 'value', which encodes such values as -1. Missing values are controlled separately by the handle_missing argument.

'options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which will impute the category -1.'

 

3> fitting

 

👂 If we call only fit here instead of fit_transform, the sample data is used solely to tell the encoder object 'when this kind of value appears in this column, encode it like this'.

(To also get the encoded result for that data, use fit_transform as mentioned above.)

 

X = [['Male', 1, 'Yes'], ['Female', 3, 'No'], ['Female', 2, 'None']]
enc.fit(X)

 

4> transforming

 

👐 Pass in the data you actually want to encode, such as a nested list, dict, DataFrame, or Series, and the encoded result is returned!

 

enc.transform([['Male', 1, 'No'], ['Female', 10]])  # the second row omits its third value, so that cell is encoded with a negative sentinel (-1 in this run)

 

- Encoded! -

 

5> (Extra) category_mapping attribute

> This attribute shows the details of which values in which columns the encoder mapped to which codes!

 

enc.category_mapping

 

- Detailed information -

 

- Kept very simple. That's the end of the concept explanation! 🖐

Q&A

Q. So when should we use One-hot encoding, and when should we use Ordinal encoding?

 

A. We have to ask whether there is an inherent order among the values. In other words, if some pairs of values are more closely related than others, ordinal encoding is the better choice. For example, movie ratings, bed sizes, and quality scores of a work clearly have a high and a low, i.e. an inherent order, so they call for ordinal encoding. Conversely, a plain list of cities (as long as geographic proximity between cities is not taken into account) or a list of colors is better suited to one-hot encoding.

(If it is still hard to tell them apart, simply ask whether the data has a direction; that makes the distinction easier!)
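The contrast can be sketched with plain pandas. This assumes a hypothetical frame with an ordered `rating` column and an unordered `color` column; the order bad < ok < good is my own choice for the example.

```python
import pandas as pd

# Hypothetical toy data: "rating" has an inherent order, "color" does not.
df = pd.DataFrame({
    "rating": ["bad", "good", "ok", "good"],
    "color": ["red", "blue", "green", "red"],
})

# Ordered feature: one integer column that preserves bad < ok < good.
rating_order = {"bad": 1, "ok": 2, "good": 3}
df["rating_encoded"] = df["rating"].map(rating_order)

# Unordered feature: one indicator column per category, no implied ranking.
color_onehot = pd.get_dummies(df["color"], prefix="color")

print(df[["rating", "rating_encoded"]])
print(color_onehot)
```

Encoding `color` as 1, 2, 3 instead would suggest to the model that green lies "between" red and blue, which is exactly the false ordering that one-hot encoding avoids.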

 

Q. Are there cases where data with a real inherent order actually gave better model performance with one-hot encoding, or the other way around?

 

A. I plan to post about that experiment later.


* Thumbnail source) https://www.shareicon.net/silhouettes-stand-up-standing-firm-order-row-people-700053

* Source 1) https://stackoverflow.com/questions/69052776/ordinal-encoding-or-one-hot-encoding

* Source 2) https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
