Overfitting/Underfitting & Bias/Variance Tradeoff

metamong 2022. 4. 17.

1. Generalization

"In machine learning, generalization is a definition to demonstrate how well is a trained model to classify or forecast unseen data. Training a generalized machine learning model means, in general, it works for all subset of unseen data. An example is when we train a model to classify between dogs and cats. If the model is provided with dogs images dataset with only two breeds, it may obtain a good performance. But, it possibly gets a low classification score when it is tested by other breeds of dogs as well. This issue can result to classify an actual dog image as a cat from the unseen dataset. Therefore, data diversity is very important factor in order to make a good prediction. In the sample above, the model may obtain 85% performance score when it is tested by only two dog breeds and gains 70% if trained by all breeds. However, the first possibly gets a very low score (e.g. 45%) if it is evaluated by an unseen dataset with all breed dogs. This for the latter can be unchanged given than it has been trained by high data diversity including all possible breeds."

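≫ 'Generalization' describes how well a finished model predicts data it has never seen. We gauge it by evaluating the model on the test set and comparing that result with its performance on the training data

≫ In other words, we need to train 'a model that generalizes well'!

≫ To make the model generalize well on the test set, we need to prevent the overfitting/underfitting problems!

- Things to consider to build a model that generalizes well -

1> data diversity - the dataset used must be diverse for generalization performance to be high when new data comes in

2> ML algorithm - the choice of algorithm matters for preventing overfitting/underfitting

3> model complexity - choose an appropriate level of complexity, so that the chosen algorithm is neither too simple nor too complex.

(Next, let's look at the two problems we always need to avoid: overfitting and underfitting 👩‍🦱)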

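2. Overfitting/Underfitting

👩‍🏫 Overfitting = 'the model learns properties specific to the training data too closely, fails to generalize, and ends up with a large error on the test data'

👨‍🏫 Underfitting = 'the model neither fits the training data closely nor learns the general pattern, so the error is large on both training and test data'

(Let's understand it at a glance with a figure)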

'An underfit model results in high prediction errors for both training and test data. An overfit model gives a very low prediction error on training data, but a very high prediction error on test data. Both types of models result in poor accuracy.'

≫ In the end, both underfit and overfit are bad. An underfit model does not fit even the training data well, so naturally it cannot fit new data either. An overfit model is fitted far. too. closely. to the training data, so it cannot adapt to new data. Either way, accuracy on new data drops :( (neither is desirable)

≫ For overfit in particular: 'Poor test accuracy may occur when the model is highly complex, i.e., the input feature combinations are in a large number and affect the model’s flexibility.' - that is, when too many input features (the X's) are fed into the model, it becomes so complex that it may fit the train data almost perfectly but cannot adapt to new data
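To make the two failure modes concrete, here is a minimal sketch that is not taken from the sources quoted above. It assumes a made-up ground-truth function (`true_fn`), Gaussian noise, and polynomial degrees 1, 3, and 9 as stand-ins for "too simple", "about right", and "too flexible"; all of these names and numbers are illustrative assumptions. It fits each polynomial with NumPy and prints the train and test MSE side by side.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(np.pi * x)


def make_data(n):
    # Noisy samples of true_fn on [-1, 1]; the noise level 0.2 is an assumption.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_fn(x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same distribution

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training data only.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),
        mse(y_test, np.polyval(coefs, x_test)),
    )

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

Training MSE can only go down as the degree grows, since a higher-degree polynomial contains the lower-degree ones. What to watch is whether the test MSE follows it down (better fit) or turns back up (overfit), and whether both stay high (underfit).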

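3. Bias/Variance Tradeoff

** Let's explain the overfitting/underfitting from section 2 in terms of bias and variance **

→ Variance - 'high variance means the model fits the noise in the training data too sensitively and therefore generalizes poorly to the test data, i.e. it is overfit'

→ Bias - 'high bias means the model fails to capture the relationship between the features and the target variable in the training data, i.e. it is underfit'

'Variance and bias are two important terms in machine learning. Variance means the variety of predictions values made by a machine learning model (target function). Bias means the distance of the predictions from the actual (true) target values. A high-biased model means its prediction values (average) are far from the actual values. Also, high-variance prediction means the prediction values are highly varied.'

> In other words, bias is the difference between the actual values and the (average) predicted values, and variance (in the sense used here) is a measure of how far apart the model's predictions are from one another!!

> Split into a variance error and a bias error, the error can be described as follows

- sigma squared is the error that comes from the data itself (feel free to skip it) -

→ When we evaluate a model after building it, the error can be split into three parts: ① the error that comes from the data itself (sigma squared - this is noise and cannot be reduced, so it is called the irreducible error), ② the bias error (BiasD), and ③ the variance error (VarD).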

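→ BiasD = the difference between the model's average prediction (averaged over training sets D) and the true function (that is, the gap between actual and predicted described above!)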

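→ VarD = the variance of the model's predictions, obtained from the identity Var[X] = E[X^2] - (E[X])^2..! (Anything beyond this gets mathematical, so we stop here. In practice it shows up as the gap between the errors on the train and validation data)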

(The derivation is hard, so it is omitted here. See further below)
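For reference, the three error terms can be written out as follows. This is the standard textbook form rather than a formula copied from the post's figure, and the notation is an assumption: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and \hat{f}_D is the model fitted on training set D.

```latex
% Assumed setup: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% \hat{f}_D = model fitted on training set D, \varepsilon independent of D.
\begin{aligned}
\mathrm{Bias}_D\big[\hat{f}_D(x)\big] &= \mathbb{E}_D\big[\hat{f}_D(x)\big] - f(x) \\
\mathrm{Var}_D\big[\hat{f}_D(x)\big] &= \mathbb{E}_D\big[\hat{f}_D(x)^2\big] - \big(\mathbb{E}_D\big[\hat{f}_D(x)\big]\big)^2 \\
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big] &= \sigma^2 + \mathrm{Bias}_D\big[\hat{f}_D(x)\big]^2 + \mathrm{Var}_D\big[\hat{f}_D(x)\big]
\end{aligned}
```

The last line follows by adding and subtracting f(x) and E_D[\hat{f}_D(x)] inside the square and expanding: the cross terms vanish because the noise has zero mean and is independent of D, and because the deviation of \hat{f}_D(x) from its own mean averages to zero.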

* Final recap 🙋‍♂️*

* Variance error = the variance among the model's predicted values, i.e. how much the predictions change when the data the model was fitted on changes.

<<It is easier to think of variance as a single direction: does the model move a lot or only a little? >>

- An overfit model is fitted to the particulars of the data it saw, so when new data comes in its predictions swing widely. The variance among its predictions is therefore large: overfit → high variance

- An underfit model barely responds to the data, so new data hardly changes the distribution of its predictions. The variance among its predictions stays small: underfit → low variance

≫ Low Bias + High Variance - Overfitting

≫ High Bias + Low Variance - Underfitting

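→ Before handing the finished model to the test data, we keep checking it against validation data and asking: is my model underfit? overfit? The model with the highest validation score is the one to pick in the end (obviously). As in the figure above, training accuracy keeps climbing as complexity increases, because the model fits the training data ever more closely. Validation accuracy, however, starts to fall once model complexity passes a certain point (overfitting sets in)

(Apply regularization to reduce complexity! See a later post)

▣ So the best model is the one where the validation score peaks while the training score is reasonably high - that model goes on to the test data ▣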
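The selection rule above can be sketched in a few lines. This is only an illustration under assumptions: the synthetic data, the polynomial model family, and the candidate degrees 1 to 9 are all made up for the example, and validation MSE is used as the score, so "best" here means lowest error rather than highest accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    # Hypothetical noisy data; the true function and noise level are assumptions.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


x_train, y_train = make_data(30)   # used to fit each candidate
x_val, y_val = make_data(30)       # used to compare candidates
x_test, y_test = make_data(200)    # touched once, after the choice is made

models = {}
val_errors = {}
for degree in range(1, 10):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    models[degree] = coefs
    val_errors[degree] = mse(y_val, np.polyval(coefs, x_val))

# Pick the complexity with the best (lowest) validation error.
best_degree = min(val_errors, key=val_errors.get)

# Only the selected model is evaluated on the test set.
test_error = mse(y_test, np.polyval(models[best_degree], x_test))
print(f"best degree={best_degree}, "
      f"validation MSE={val_errors[best_degree]:.3f}, test MSE={test_error:.3f}")
```

Keeping the test set out of the loop is the point of the design: if the test data were used to pick the degree, the final test score would no longer be an honest estimate of performance on unseen data.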

※ Regularization, boosting, and bagging are used to reach that sweet spot between underfitting and overfitting! ※

(See later posts covering these techniques)

+ bias/variance tradeoff decomposition (thanks to Giorgos Papachristoudis)

👏 In a post on Medium, data scientist Giorgos Papachristoudis shows that the generalization error splits into three parts, variance + bias squared + irreducible error, and proves mathematically how these three add up to the error.

👏 In addition, the figure above shows that the bias (the distance between the red line and the black line) decreases as the degree (d) increases.

👏 The variance of the histogram is exactly the variance discussed in the bias-variance tradeoff, and the visualization shows that it is larger in the higher-degree d = 2 plot (the histogram is more spread out..!)
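The experiment behind such a figure can be imitated with a small simulation. What follows is a rough sketch and not the code from the Medium post: the true function, the query point `x0`, the noise level, and the degrees 0, 1, and 3 are all assumptions picked for illustration. For each degree it refits the model on many fresh training sets, collects the prediction at `x0`, and summarizes that collection by its squared bias, its variance, and a histogram.

```python
import numpy as np

rng = np.random.default_rng(2)


def true_fn(x):
    # Hypothetical ground-truth function (an assumption for this sketch).
    return np.sin(np.pi * x)


x0 = 0.5          # fixed query point where predictions are compared
noise_std = 0.2   # assumed noise level (the sigma of the irreducible error)
n_train = 20      # size of each simulated training set
n_repeats = 500   # number of independent training sets D

summary = {}
for degree in (0, 1, 3):
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw a fresh training set and refit the polynomial.
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = true_fn(x) + rng.normal(0.0, noise_std, size=n_train)
        preds[i] = np.polyval(np.polyfit(x, y, deg=degree), x0)

    bias_sq = (preds.mean() - true_fn(x0)) ** 2          # squared bias at x0
    variance = preds.var()                               # spread of predictions
    expected_sq_err = np.mean((preds - true_fn(x0)) ** 2)
    counts, _ = np.histogram(preds, bins=10)             # the "histogram" view
    summary[degree] = (bias_sq, variance, expected_sq_err, counts)
    print(f"d={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}, "
          f"bias^2+variance={bias_sq + variance:.4f}")
```

In each row, bias² + variance equals the mean squared distance between the predictions and the noise-free true value, which is the reducible part of the error; adding the noise variance gives the full generalization error at `x0`. How the two terms trade off across degrees depends on the assumed function and sample size, so the exact numbers will differ from the figure in the post.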

Make sure to master these most basic of basic machine learning concepts! 🙆‍♂️

* Thumbnail source) https://ko.wikipedia.org/wiki/%EA%B3%BC%EC%A0%81%ED%95%A9

* Source 1) https://deepai.space/what-is-generalization-in-machine-learning/

* Source 2) https://www.educative.io/edpresso/overfitting-and-underfitting

* Source 3) https://medium.com/towards-data-science/the-bias-variance-tradeoff-8818f41e39e9

* Source 4) https://bywords.tistory.com/entry/%EB%B2%88%EC%97%AD-%EC%9C%A0%EC%B9%98%EC%9B%90%EC%83%9D%EB%8F%84-%EC%9D%B4%ED%95%B4%ED%95%A0-%EC%88%98-%EC%9E%88%EB%8A%94-biasvariance-tradeoff
