
Data Analysis with Python (1/2) (from Coursera)

metamong 2022. 4. 22.

1. Importing Datasets

* Why Data Analysis?

- data is everywhere

- helps us get answers from data

- discovering useful info / answering questions / predicting the future or the unknown

 

* Understanding the Data

 

- target(label) = the name of the attribute that we want to predict

- CSV(Comma Separated Value) file = source of data

 

* Python Packages for DS

 

(1) Scientific Computing Libraries

* Pandas) data structures & tools - dataframe

(Offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table called a DataFrame, consisting of rows and labeled columns, and it is designed to provide easy indexing; a short DataFrame sketch follows this list)

 

(related posts: 'Python/Pandas' category on sh-avid-learner.tistory.com)

 

* Numpy) arrays for inputs and outputs

 

(related posts: 'Python/Numpy' category on sh-avid-learner.tistory.com)

 

* Scipy) functions for advanced math problems & data visualization
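
As a quick illustration of the DataFrame idea above (a minimal sketch with made-up toy data, not taken from the course):

import pandas as pd

# toy data, purely for illustration
df_demo = pd.DataFrame({
    "make":  ["audi", "bmw", "mazda"],
    "price": [13950, 16430, 5195],
})

print(df_demo)           # a two-dimensional labeled table (rows x columns)
print(df_demo["price"])  # columns are accessed by their label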

 

(2) Visualization Libraries (create graphs, charts)

* Matplotlib) most popular

* Seaborn) based on Matplotlib

 

(3) Algorithmic Libraries (tackle ML tasks)

* Scikit-learn) tools for statistical modeling & machine learning (regression, classification, clustering, ...)

* Statsmodels) explore data, estimate statistical models & perform tests

 

* Importing & Exporting Data in Python

 

# importing data = process of loading and reading data into Python from various resources

- format) .csv, .json, .xlsx ...

 

ex) importing a CSV in Python

import pandas as pd

url = "~"
df = pd.read_csv(url, header = None)
#header = None -> pandas automatically sets the column headers to a list of integers

df
df.head(5)
df.tail(n)

#adding headers (replace default header - df.columns = headers)
headers = ["~","~", ... ]
df.columns = headers
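
Alternatively (a small sketch; it reuses the url and headers placeholders defined above), the column names can be passed to read_csv directly through its names parameter, so the separate assignment is not needed:

df = pd.read_csv(url, names=headers)   # names= assigns the column headers while reading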

 

ex) exporting a Pandas dataframe to CSV

(path is the file location where you want the dataframe saved)

import pandas as pd

path = ".../.../~.csv"
df.to_csv(path)
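
One optional tweak (index=False is a standard to_csv argument): by default to_csv also writes the row index as an extra column, which can be omitted:

df.to_csv(path, index=False)   # do not write the row index column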

 

 

# data types

 

 

- why check data types?) to catch potential info/type mismatches & to check compatibility with Python methods

df.dtypes #check the data type in each column
df.describe() #returns a statistical summary
df.describe(include = "all") #full summary statistics including object-typed attributes
# unique / top (the most frequent object name) / freq (the number of times the top object appears)
df.info() #prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

 

ex) df.describe()

 

 

+ we can replace the '?' symbol with NaN and then drop rows with missing 'price' values using the dropna() method

(for more info see '# how to deal with missing data?' in 2. Data Wrangling)

 

import numpy as np

df1 = df.replace('?', np.nan)   # mark '?' entries as missing (NaN)

df = df1.dropna(subset=["price"], axis=0)   # drop rows where 'price' is missing
df.head(20)

2. Data Wrangling

- Data Pre-processing = the process of converting or mapping data from the initial "raw" form into another format, in order to prepare the data for further analysis (=data cleaning, data wrangling)

 

* Identify & handle missing values

 

- missing values occur when no data value is stored for a variable(features) in an observation

- usually appears as "?", "N/A", 0, or just a blank cell

 

# identify missing data

.isnull()

- the output is a boolean for each entry, indicating whether that entry is missing data
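
For example (a minimal sketch; the name missing_data is also the one used in the counting loop below):

missing_data = df.isnull()   # DataFrame of booleans: True where a value is missing (NaN)
missing_data.head()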

 

ex) counting missing values in each column

- value_counts() counts the number of "True" values

 

for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")   

 

 

# how to deal with missing data?

 

(1) check if the person or group that collected the data can go back and find what the actual value should be

 

(2) just remove the data where that missing value is found

(drop the whole variable / drop the single data entry)

 

(3) replacing the missing values

- replace it with an average (of similar datapoints)

- replace it by frequency (for categorical data)

- replace it based on other functions (other variables may know hidden info about missing data)

 

(4) leave it as missing data

 

<1> dropna()

 

dataframe.dropna()

 

- drop rows or columns that contain missing values like NaN
- axis = 0 drops the entire row / axis = 1 drops the entire column
- inplace = True: allows the modification to be done on the data set directly

 

ex)

df.dropna(subset = ["price"], axis = 0, inplace = True)
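
After dropping rows it is often convenient to renumber them (reset_index is a standard pandas method; shown here as an optional follow-up step):

df.reset_index(drop=True, inplace=True)   # re-index rows 0..n-1 after the drop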

 

<2> replace()

 

dataframe.replace(missing_value, new_value)

 

ex) replace by mean value

 

 

import numpy as np

mean = df["normalized-losses"].mean()   # average of the non-missing values
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

 

ex) replace by frequency

- using value_counts() or idxmax() we can see which value is the most common

 

df['num-of-doors'].value_counts()

# or

df['num-of-doors'].value_counts().idxmax()

 

#replace the missing 'num-of-doors' values by the most frequent 
df["num-of-doors"].replace(np.nan, "four", inplace=True)

 

* Data Formatting in Python

 

= bringing data into a common standard of expression allows users to make meaningful comparisons

- ensures the data is consistent and easily understandable

 


 

ex) convert "mpg" to "L/100km" in Car dataset

 

df["city-mpg"] = 235/df["city-mpg"]

df.rename(columns = {"city-mpg": "city-L/100km"}, inplace = True)

 

- incorrect data types) sometimes the wrong data type is assigned to a feature

(it should be corrected so that models developed later on do not behave strangely)

- identify the types using df.dtypes

 

ex)

df["price"] = df["price"].astype("int")

 

* Data Normalization in Python (centering / scaling)

= bring feature values that have different ranges onto a uniform, comparable scale

 

# Not-normalized

 

age    income
20     100000
30     2000
40     50000

 

→ 'age' and 'income' are in different range

→ hard to compare

→ the 'income' attribute influences the result more due to its larger values

 

solution)

 

age    income
0.2    0.2
0.3    0.04
0.4    1

→ similar value range

→ similar intrinsic influence on analytical model

 

<1> Simple Feature Scaling

 

'divides each value by the maximum value for that feature'
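
Written as a formula (the standard simple-feature-scaling definition):

x_new = x_old / x_max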

 

 

- the resulting new values range btw 0 and 1

 

ex) (with pandas)

 

df["length"] = df["length"]/df["length"].max()

 

<2> Min-Max
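
Written as a formula (the standard min-max definition, matching the pandas line below):

x_new = (x_old - x_min) / (x_max - x_min)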

 

 

- the resulting new values range btw 0 and 1

 

ex) (with pandas)

 

df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())

 

<3> Z-score (standard score)

 

- for each value, subtract mu (the average of the feature), and then divide by the standard deviation (sigma)
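
Written as a formula (the standard z-score definition):

x_new = (x_old - mu) / sigma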

 

 

- the resulting values hover around zero and typically range btw -3 and +3 (but can be higher or lower)

 

ex) (with pandas)

 

df["length"] = (df["length"]-df["length"].mean())/df["length"].std()

 

* Data Binning (pd.cut)

 

 

(related post: pandas functions cut & qcut on sh-avid-learner.tistory.com / 'cut' documentation: https://pandas.pydata.org/docs/reference/api/pandas.cut.html / 'qcut' documentation: https://pandas.pydata.org/docs/reference/api/pandas.qcut.html)

 

= grouping values into "bins"

- converts numeric into categorical variables

- group a set of numerical values into a set of "bins"

- to have a better understanding of the data distribution

 

ex) (with pandas)

 

import numpy as np

bins = np.linspace(min(df["price"]), max(df["price"]), 4)   # 4 equally spaced edges -> 3 bins

 

→ list group_names that contain the different bin names

 

group_names = ["Low", "Medium", "High"]

 

→ use pandas cut to segment and sort the values into bins

 

df["price-binned"] = pd.cut(df["price"],bins,labels=group_names,include_lowest=True)

 

- from the binned counts it is clear that most cars fall in the lower price range -

 

* Categorical Values → to numeric variables

 

Q) most statistical models cannot take objects/strings as input

*** for model training, they only take numbers as inputs

 

S) Add dummy variables (indicator variables) for each unique category (assign 0 or 1 to each category)


"One-hot encoding"

 

(related post: One-Hot encoding on sh-avid-learner.tistory.com)

 

ex) (with pandas)

 

 

pd.get_dummies(df['fuel'])

 

- the get_dummies method automatically generates an indicator (0/1) column for each unique category -
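
To actually attach the indicator columns to the DataFrame (a minimal sketch; the column name 'fuel' follows the snippet above and may differ in your dataset):

dummies = pd.get_dummies(df['fuel'])      # one 0/1 column per fuel category
df = pd.concat([df, dummies], axis=1)     # append the indicator columns
df.drop('fuel', axis=1, inplace=True)     # optionally drop the original categorical column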

3. Exploratory Data Analysis (EDA)

→ summarize main characteristics of the data

→ gain better understanding of the data set

→ uncover relationships btw variables

→ extract important variables

 

* Descriptive Statistics

 

(related post: descriptive statistics & inferential statistics on sh-avid-learner.tistory.com)

= describe features of data

= giving short summaries about the sample and measures of the data

 

<1> describe() method

(any NaN values are automatically skipped in these statistics)

- applies to all continuous variables

 

df.describe()

 

- this will show:

→ the count of that variable

→ the mean

→ the standard deviation (std)

→ the minimum value

→ the quartiles (the 25%, 50% and 75% values)

→ the maximum value

 

(++ if we code describe(include=['object']), it applies to all the variables of type 'object')

 

df.describe(include=['object'])

 


 

<2> value_counts() method

(summarizes categorical data by counting how often each unique value appears)

 

* Don’t forget the method "value_counts" only works on Pandas series, not Pandas Dataframes. As a result, we only include one bracket "df['drive-wheels']" not two brackets "df[['drive-wheels']]"

 

ex)

 

df['drive-wheels'].value_counts()

 

 

drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()

drive_wheels_counts.rename(columns={'drive-wheels':'value_counts'}, inplace = True)
drive_wheels_counts

 

              value_counts
drive-wheels
fwd                    118
rwd                     75
4wd                      8

 

<3> visualizing using Box Plots

 

 

 

(related post: box plot (+seaborn) on sh-avid-learner.tistory.com)

 

ex)

 

import seaborn as sns
sns.boxplot(x="drive-wheels", y="price", data=df)

 

<4> visualizing using Scatter Plots

 

- each observation represented as a point

- shows the relationship between two variables

* predictor/independent variable on the x-axis (horizontal axis) = the variable you are using to predict an outcome

* target/dependent variable on the y-axis (vertical axis) = the variable you are trying to predict

 

ex)

 

import matplotlib.pyplot as plt

y = df["price"]
x = df["engine-size"]
plt.scatter(x, y)

plt.title("Scatterplot of Engine Size vs Price")
plt.xlabel("Engine Size")
plt.ylabel("Price")

 

- a positive linear relationship between these two variables -

 

* GroupBy

 

# Pandas dataframe.groupby() method

- can be applied on categorical variables

- group data into categories

- group by single or multiple variables

 

ex)

 

df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
df_grp

 

- but this form isn't the easiest to read and also not very easy to visualize

 

* Pandas pivot() method

- one variable displayed along the columns & the other variable displayed along the rows -

 

ex)

df_pivot = df_grp.pivot(index = 'drive-wheels', columns = 'body-style') 
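
If some drive-wheels / body-style combinations never occur, the pivoted table contains NaN cells; these can be filled with 0 (fillna is a standard pandas method, an optional step):

df_pivot = df_pivot.fillna(0)   # replace missing combinations with 0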

 

 

easier to read and visualize -

 

(related post: pandas Tricks_12 'pivot_table()' (Kevin by DataSchool) on sh-avid-learner.tistory.com)

 

* Heatmap

 

= takes a rectangular grid of data and assigns a color intensity based on the data value at each grid point

- great way to plot a target variable over multiple variables

- we can get visual clues about the relationship between the variables

 

ex)

plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

 

 

* Correlations (+Statistics)

= statistical metric for measuring to what extent different variables are interdependent

 

*** correlation doesn't imply causation (caution!)

 

# Positive Linear Relationship

 

ex) using regplot

sns.regplot(x="engine-size",y="price",data=df)
plt.ylim(0,)

 

 

# Negative Linear Relationship

 

ex)

sns.regplot(x="highway-mpg",y="price",data=df)
plt.ylim(0,)

 

 

(+ weak correlation btw two features)

 

ex)

sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)

 

- we cannot use peak-rpm to predict price -

 

**** <Pearson Correlation> ****

 

= measures the strength of the correlation btw two features

 

① correlation coefficient

② p-value

 

* Correlation coefficient

 

- close to +1: Large Positive relationship -

- close to -1: Large Negative relationship -

- close to 0: No relationship -

 

* P-value (gives us the certainty about the correlation coefficient that we calculated) *

 

- P-value < 0.001: Strong certainty in the result -

- P-value < 0.05: Moderate certainty in the result -

- P-value < 0.1: Weak certainty in the result -

- P-value > 0.1: No certainty in the result -

 

※ Strong Correlation = Correlation Coefficient close to 1 or -1 / P value less than 0.001

 

 

ex) (using the SciPy stats package)

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

# ---- RESULT ----
# Pearson Correlation: 0.81 (close to +1)
# P-value: 9.35 e-48 (much smaller than 0.001)

 

→ based on the result, we can conclude that we are certain about the strong positive correlation (coefficient 0.81, p-value far below 0.001)

 

**** <Association btw two categorical variables> ****

 

the Chi-square Test for Association

 

- the test is intended to assess how likely it is that an observed distribution is due to chance

- measures how well the observed distribution of the data fits the distribution that would be expected if the variables were independent

- tests the null hypothesis that the variables are independent

 

ex)

 

- cross table (crosstab) = a table showing the relationship between two (or more) variables

- contingency table = a cross table that shows the relationship between exactly two categorical variables

 

 

- the expected counts are calculated from the row and column totals, under the assumption that the two variables are independent

 

- a chi-square statistic of 29.6 corresponds to a p-value smaller than 0.05, so we reject the null hypothesis -

 

(we can conclude there is an association between two variables)

 

from scipy import stats

# cont_table: a contingency table of observed counts (see the sketch below)
stats.chi2_contingency(cont_table, correction = True)

 

* result)

 

- 1st value) the chi-square test statistic

- 2nd value) the exact p-value (here, very close to 0)

- 3rd value) the degrees of freedom

- 4th value) the expected frequencies (table)

 

(since the p-value is very close to 0, we reject the null hypothesis that the variables are independent)
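
A minimal end-to-end sketch (the categorical column names 'fuel-type' and 'aspiration' are assumptions chosen only for illustration; any two categorical columns work):

import pandas as pd
from scipy import stats

# contingency table of observed counts for two categorical variables
cont_table = pd.crosstab(df['fuel-type'], df['aspiration'])

chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=True)
print(chi2, p_value, dof)   # test statistic, p-value, degrees of freedom
print(expected)             # expected counts if the variables were independent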


* degrees of freedom: www.investopedia.com/terms/d/degrees-of-freedom.asp

* Reference 1) statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/

 

* Reference 2) statisticsbyjim.com/glossary/significance-level/

* Source for all of the above content: IBM <Data Analysis with Python> (Coursera)
