EDA - Exploratory Data Analysis

metamong 2022. 3. 22.

1. concepts & goals

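→ exploratory data analysis

- before analyzing collected data in earnest, we need a step where we look at the data intuitively; this is where EDA is used

- a kind of data analysis that literally explores the data, without complex modeling or formulas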

- it allows you to examine the data as they are without making any assumptions..!

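- exploration! i.e. a broad concept that covers missing values, outliers, dtypes, shape, creating new data (from existing data), and fillna()

- with tools such as visualization: discover patterns & check peculiarities of the data & test hypotheses through statistics and graphics (or visualization)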
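The items listed above map onto a handful of pandas calls. A minimal sketch, assuming pandas is installed; the toy DataFrame and its column names (`age`, `city`, `income`) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical toy data (column names and values are assumptions)
df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "city": ["Seoul", "Busan", "Seoul", None, "Incheon"],
    "income": [3200, 4100, 3900, 5200, 98000],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column

# Fill missing values: median for the numeric column, a label for the categorical one
filled = df.assign(
    age=df["age"].fillna(df["age"].median()),
    city=df["city"].fillna("unknown"),
)

# Create a new column from existing ones
filled["income_per_age"] = filled["income"] / filled["age"]

print(filled.describe())  # summary statistics of the numeric columns
```

Filling with the median is only one possible choice; which fill value is appropriate depends on the data.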

* [EDA methods & targets]

≫ *1) EDA method - Graphic or Non-Graphic

- Graphic) inspect the data using charts or figures

- Non-Graphic) inspect the data mainly through summary statistics (ex. describe())

≫ *2) Univariate or Multi-variate

- Multi-variate) looks at the relationships among several variables

- Univariate) looks at the distribution of a single variable

{1} Uni + Non-Graphic

→ uses the 'distribution'

- numeric data) check summary statistics (center, spread, modality, shape, outliers)

- categorical data) occurrence, frequency, tabulation
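For {1}, pandas gives the numeric summary and the frequency table in one call each. A rough sketch, assuming pandas is installed; the columns `height` and `blood_type` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy columns (names and values are assumptions)
df = pd.DataFrame({
    "height": [158, 162, 165, 165, 170, 171, 174, 180, 199],
    "blood_type": ["A", "O", "B", "A", "AB", "O", "A", "B", "A"],
})

# Numeric column: count, mean, std, min, quartiles, max
height_summary = df["height"].describe()
print(height_summary)

# Shape of the distribution: positive skewness means a longer right tail
height_skew = df["height"].skew()
print(height_skew)

# Categorical column: frequency table
blood_counts = df["blood_type"].value_counts()
print(blood_counts)
```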

{2} Uni + Graphic

- use various visualization methods (see the posts in the visualization category!)

- histogram, pie chart, stem-and-leaf plot, boxplot, qqplot, etc. (if the values are too varied, binning or tabulation can be used)
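A stem-and-leaf plot is simple enough to build by hand, and it doubles as an example of binning (here by tens). The helper below is a hypothetical sketch using only the standard library, for non-negative integers; the scores are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = ones)."""
    groups = defaultdict(list)
    for value in sorted(values):
        groups[value // 10].append(value % 10)
    rows = []
    # Include empty stems so gaps in the distribution stay visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows

# Hypothetical test scores
scores = [42, 45, 51, 53, 53, 58, 60, 61, 64, 67, 67, 69, 72, 75, 91]
for row in stem_and_leaf(scores):
    print(row)
```

Each row is one bin, and the length of the row plays the role of a histogram bar.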

{3} Multi + Non-Graphic

→ focus on the relationships between variables!

- mainly check cross-tabulation <categorical data> & cross-statistics (correlation, covariance) <numerical data>

- applies to both categorical & numeric data
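The checks in {3} can be sketched with pandas as follows, assuming pandas is installed. The survey-style columns (`gender`, `smoker`, `study_hours`, `score`) and their values are invented for the example.

```python
import pandas as pd

# Hypothetical survey data (names and values are assumptions)
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "smoker": ["no", "yes", "no", "no", "yes", "yes", "no", "no"],
    "study_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 71, 80, 83],
})

# Categorical x categorical: cross-tabulation of counts
table = pd.crosstab(df["gender"], df["smoker"])
print(table)

# Numeric x numeric: covariance and Pearson correlation
covariance = df["study_hours"].cov(df["score"])
correlation = df["study_hours"].corr(df["score"])
print(covariance, correlation)

# Categorical x numeric: a statistic of the numeric variable per category
mean_score_by_gender = df.groupby("gender")["score"].mean()
print(mean_score_by_gender)
```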

{4} Multi + Graphic

- as with {2} above, see the posts in the visualization category! Likewise, use various visualization methods

- for categorical & numeric data: boxplot, stacked bar, parallel coordinate, heatmap

- for numeric & numeric data: scatterplot

(↑ again, see the visualization posts for details!)

* [EDA Goals]

* 1. understanding data using statistical & visualization tools

- initial interaction with data

- understand the distribution of the variables / the relationships among variables

- might create histograms or scatter plots

* 2. to assess and validate assumptions (on which future inferences will be based)

- ex.) which variables are normally distributed? / whether or not a variable is biased toward a particular value

* 3. want to understand data before forming an intelligent hypothesis

- can be the source of an idea for an experiment

- not ~~a formal process of hypothesis testing or predictive modelling~~

→ ultimately develop our intuition of our data set & how it came into existence

→ by examining the data, can generate better hypotheses & determine which variables have the most predictive power & select appropriate statistical tools (to build our predictive models)

********** before using descriptive statistics with visualization, 'data preprocessing' is highly recommended!*******

2. [in detail] Types of data → descriptive statistics

[1] identifying the data

→ the first & critical step of the analysis

{1} categorical data

→ nominal

- values represent discrete units

- changing the order of units does not change their value

ex) male - female = female - male

→ ordinal

- also discrete units but values are inherently ordered

- distance between units is not the same

ex) 1st place, 2nd place, and so on...

{2} continuous data

→ interval - the number itself

- ordered units with intermediate values

- distance between units is the same

- no absolute zero

- origin is arbitrary

ex) a person with an IQ score of 160 is NOT twice as smart as someone with an IQ score of 80

→ ratio - a quantity that has an absolute zero

- ordered units with intermediate values

- distance between units is the same

- absolute zero

- origin is at zero

ex) a 12 inch long sandwich is twice the length of a 6 inch sandwich
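The interval vs. ratio distinction can be checked with a little arithmetic. The temperature example below is an addition for illustration (it is not in the original notes): Celsius has an arbitrary zero, Kelvin does not, and length has a true zero in any unit.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15  # same temperatures, true zero

# Differences survive the change of origin ...
diff_celsius = celsius_b - celsius_a
diff_kelvin = kelvin_b - kelvin_a

# ... but ratios do not
ratio_celsius = celsius_b / celsius_a  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_kelvin = kelvin_b / kelvin_a     # roughly 1.035

# Ratio scale: length has a true zero, so "twice as long" holds in any unit
inches_a, inches_b = 6.0, 12.0
ratio_inches = inches_b / inches_a
ratio_cm = (inches_b * 2.54) / (inches_a * 2.54)

print(diff_celsius, diff_kelvin)
print(ratio_celsius, ratio_kelvin)
print(ratio_inches, ratio_cm)
```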

[2] Descriptive Statistics (using identified data) (+visualization)

* depending on the type of data (nominal, ordinal, interval, ratio), EDA can be carried out with a different approach..!

<we can summarize data with descriptive statistics - for EDA>

→ Central Tendency) refers to the location of distribution

- mean

- median (P50, Q2)

- mode (the most frequent value)

→ Variability) (i.e. scale)

- range = max - min

- standard deviation (s)

- interquartile range IQR = Q3-Q1
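The measures listed above can be computed with Python's standard `statistics` module. A minimal sketch on a made-up sample; note that quartile conventions differ between tools, and `method="inclusive"` is just one of them, so Q1 and Q3 may not match other software exactly.

```python
import statistics

# Hypothetical sample
data = [2, 4, 4, 5, 7, 9, 10, 12, 13]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)  # P50 / Q2
mode = statistics.mode(data)      # most frequent value

# Variability
value_range = max(data) - min(data)
std_dev = statistics.stdev(data)  # sample standard deviation (s)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode)
print(value_range, std_dev, iqr)
```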

{1} nominal data

* summarize the nominal data using....

→ frequencies

- count the number of events of interest

→ proportion (relative frequency)

- divide frequency by total number of events

→ percentage

- multiply proportion by 100

→ illustrate with bar chart / pie chart
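Frequency, proportion, and percentage for a nominal variable, sketched with `collections.Counter`. The `transport` values are hypothetical.

```python
from collections import Counter

# Hypothetical nominal variable
transport = ["bus", "car", "bus", "bike", "car", "bus", "subway", "bus", "car", "subway"]

frequencies = Counter(transport)  # count of each category
total = sum(frequencies.values())
proportions = {k: v / total for k, v in frequencies.items()}  # relative frequency
percentages = {k: 100 * p for k, p in proportions.items()}

# Print one row per category, most frequent first
for category, count in frequencies.most_common():
    print(f"{category:<7} {count:>2} {proportions[category]:.2f} {percentages[category]:5.1f}%")
```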

{2} ordinal data

* summarize the ordinal data using...

→ frequencies, proportions, and percentages

→ percentiles

→ mode, median (central tendency)

→ interquartile range (variability)

→ illustrate with bar / pie chart
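For ordinal data the usual trick is to work on the ranks of the levels and translate the result back to labels. A sketch under assumptions: the five levels and the responses are invented, and with these 9 responses the quartiles happen to fall on whole ranks (in general they can fall between two levels).

```python
import statistics

# Hypothetical ordinal variable with an explicit order
levels = ["very low", "low", "medium", "high", "very high"]
rank = {level: i for i, level in enumerate(levels)}

responses = ["low", "medium", "high", "medium", "very high",
             "low", "medium", "high", "very low"]

# Work on ranks, then translate back to labels
ranks = sorted(rank[r] for r in responses)

mode_level = levels[statistics.mode(ranks)]
median_level = levels[statistics.median_low(ranks)]  # median_low returns an observed value

# Interquartile range, reported as the pair of levels at Q1 and Q3
q1, _, q3 = statistics.quantiles(ranks, n=4, method="inclusive")
iqr_levels = (levels[int(q1)], levels[int(q3)])

print(mode_level, median_level, iqr_levels)
```

The mean is left out on purpose: the distance between ordinal levels is not the same, so averaging ranks is hard to interpret.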

{3} continuous data

* summarize continuous data using...

→ percentile, median, interquartile range

→ mean, median, or mode (central tendency)

→ standard deviation, range, or IQR (variability)

→ illustrate with histogram / boxplot

ex) bar chart examples (nominal data, ordinal data, and continuous data, in that order)

- unimodal distribution

- for ordinal data, the bars should be arranged in order

ex) a 'boxplot' is an alternative and more robust way to illustrate a 'continuous variable'

→ variability(interquartile range) is marked in the box

→ can detect outliers (unlike bar plots)

→ useful way to illustrate central tendency & variability & skewness of a distribution

→ but a box plot cannot show ~~the number of modes~~ in a distribution (use a histogram for that)

(+) bimodal distribution - use histogram
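Why a box plot hides bimodality can be seen without drawing anything. In the sketch below (made-up sample, standard library only), the five-number summary that a box plot draws says nothing about two clusters, while simple binned counts, i.e. what a histogram draws, show two peaks with a gap in between. The bin width of 10 is an arbitrary choice.

```python
import statistics
from collections import Counter

# Hypothetical bimodal sample: one cluster near 20, another near 60
sample = [18, 19, 20, 20, 21, 21, 22, 23, 58, 59, 60, 60, 61, 61, 62, 63]

# What a box plot draws: the five-number summary
q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
five_numbers = (min(sample), q1, median, q3, max(sample))
print(five_numbers)

# What a histogram draws: counts per bin
bin_width = 10
counts = Counter((value // bin_width) * bin_width for value in sample)
for start in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * counts[start]}")
```

Here the median even lands in the empty region between the two clusters, which the box alone would not reveal.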

** Relationships between Two Variables are KEYS

Q) how do values of one variable change as values on another variable change?

Q) do low scores on one variable correspond to low values on another variable?

Q) do large values of one variable correspond to large values on another variable?

→ depends on the type of data each variable represents

ex) number of computers in home by country

ex) PISA test score

ex) mathematics score by number of computers

→ two categorical variables

→ two continuous variables

→ one continuous & one categorical variable

[3] (+visualization → spotting Outliers)

= unusually small or large values

* are easy to spot with a box plot

- values smaller than lower inner fence (Q1 - 1.5IQR)

- values larger than upper inner fence (Q3 + 1.5IQR)

- apply to continuous data
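The inner fences above translate directly into code. `inner_fences` is a hypothetical helper name, the measurements are made up, and the quartiles use the `"inclusive"` convention of `statistics.quantiles` (other tools may place Q1 and Q3 slightly differently).

```python
import statistics

def inner_fences(values):
    """Return (lower, upper) inner fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical continuous measurements with one suspicious value
values = [12, 14, 15, 15, 16, 17, 18, 19, 45]

lower, upper = inner_fences(values)
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper, outliers)
```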

1. → correct it

2. (if it is a legitimate extreme value) consider the use of robust statistics (median, interquartile range)

- the median & IQR are influenced much less by outliers than the ~~mean & standard deviation~~
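A quick way to see the robustness claim: summarize the same made-up sample with and without one extreme value. `summarize` is a hypothetical helper; only the standard library is used.

```python
import statistics

def summarize(values):
    """Return classical and robust summaries of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }

# Hypothetical sample, with and without one extreme value
clean = [12, 14, 15, 15, 16, 17, 18, 19, 20]
with_outlier = clean[:-1] + [200]

before = summarize(clean)
after = summarize(with_outlier)

print(before)
print(after)
```

In this sample the median and IQR do not move at all, while the mean and standard deviation change substantially.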

+) bivariate outliers

= a value that is unusually large or small on both variables

- can be identified with a scatterplot

- affects the choice of correlation coefficient

- better to use the Spearman rank-order correlation
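The effect of a bivariate outlier on the two coefficients can be sketched by hand. The functions below are illustrative implementations written for this example (Spearman computed as the Pearson correlation of the ranks, with average ranks for ties), and the data are made up: a moderate trend plus one point that is extreme on both variables.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks from 1..n; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data; the last point is extreme on both variables
x = [1, 2, 3, 4, 5, 6, 7, 50]
y = [3, 1, 4, 2, 6, 5, 7, 60]

print("without outlier:", pearson(x[:-1], y[:-1]), spearman(x[:-1], y[:-1]))
print("with outlier:   ", pearson(x, y), spearman(x, y))
```

The single extreme point pushes the Pearson coefficient close to 1, while the Spearman coefficient changes far less because only the rank of that point matters.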

"The best single device for suggesting, and at times answering, questions beyond those originally posed is the graphical display"

by John Tukey.

→ so we'd better use visualization for EDA!

** source) https://www.youtube.com/watch?v=zHcQPKP6NpM

** thumbnail source) https://www.fingent.com/blog/a-simple-guide-on-understanding-exploratory-data-analysis/
