Bias-Variance Trade-off

Before getting into the bias-variance trade-off, it helps to pin down what bias and variance each mean.

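Bias: the assumptions a model makes about the target function. Linear regression has a much stronger bias than a multi-layer perceptron, since it assumes the target is linear in the inputs.
Variance: how much the estimate of the target function changes when the training data changes. The more complex the model, the more the estimate of the target function changes in response to changes in the training set.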

The bias-variance trade-off refers to the fact that bias tends to decrease and variance tends to increase as model complexity grows, so lowering one of them raises the other.

Bias and variance can be examined mathematically by decomposing the expected loss.

$$ \mathbb{E}[L] = \iint L(t, y(\mathbf{x})) p(\mathbf{x}, t) d\mathbf{x} dt $$

Taking the loss function to be the squared error, the expected loss becomes:

$$ \mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 p(\mathbf{x}, t) d\mathbf{x} dt $$
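The double integral above can be approximated by sampling $(\mathbf{x}, t)$ pairs and averaging the squared error. The sketch below is only an illustration: the joint distribution (uniform inputs, a sine-shaped conditional mean, Gaussian noise) and the names `conditional_mean`, `sample_joint`, `expected_squared_loss`, and `NOISE_STD` are assumptions made here, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.3  # hypothetical noise level, chosen for illustration


def conditional_mean(x):
    # Hypothetical E[t | x] for the illustrative joint distribution.
    return np.sin(2 * np.pi * x)


def sample_joint(n):
    # Draw (x, t) pairs from p(x, t): x ~ Uniform(0, 1), t = E[t | x] + Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    t = conditional_mean(x) + rng.normal(0.0, NOISE_STD, size=n)
    return x, t


def expected_squared_loss(y, n=200_000):
    # Monte Carlo estimate of the double integral: average {y(x) - t}^2 over samples.
    x, t = sample_joint(n)
    return np.mean((y(x) - t) ** 2)


loss_constant = expected_squared_loss(lambda x: np.zeros_like(x))  # predictor y(x) = 0
loss_cond_mean = expected_squared_loss(conditional_mean)           # predictor y(x) = E[t | x]
print(f"y(x) = 0      : {loss_constant:.3f}")
print(f"y(x) = E[t|x] : {loss_cond_mean:.3f}")
```

Under these assumptions, the predictor $y(\mathbf{x}) = \mathbb{E}[t | \mathbf{x}]$ should give an estimate close to `NOISE_STD ** 2`, because the only error left is the noise in $t$.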

Expand the squared error by adding and subtracting $\mathbb{E}[t | \mathbf{x}]$:

$$ \{y(\mathbf{x}) - t\}^2 = \{y(\mathbf{x}) - \mathbb{E}[t | \mathbf{x}] + \mathbb{E}[t | \mathbf{x}] - t\}^2 \\= \{y(\mathbf{x}) - \mathbb{E}[t | \mathbf{x}] \}^2 + 2\{y(\mathbf{x}) - \mathbb{E}[t | \mathbf{x} ] \} \{ \mathbb{E} [t | \mathbf{x} ] - t\} + \{ \mathbb{E}[t | \mathbf{x} ] - t \}^2$$

Substituting this expansion into the expected loss and integrating over $t$, the cross term vanishes and the expected loss can be written as:

$$ \mathbb{E} [ L ] = \int \{ y(\mathbf{x}) - \mathbb{E} [ t | \mathbf{x} ] \}^2 p(\mathbf{x})d\mathbf{x} + \int var[t | \mathbf{x}]p(\mathbf{x})d\mathbf{x}$$
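To see why only two terms survive, it may help to write out the integration over $t$ explicitly, using $p(\mathbf{x}, t) = p(t | \mathbf{x}) p(\mathbf{x})$. This is a worked step with the same symbols as above, not an additional result. The factor $\{ y(\mathbf{x}) - \mathbb{E}[t | \mathbf{x}] \}$ does not depend on $t$, so in the cross term it can be pulled out of the inner integral:

$$ \int \{ \mathbb{E}[t | \mathbf{x}] - t \} p(t | \mathbf{x}) dt = \mathbb{E}[t | \mathbf{x}] - \mathbb{E}[t | \mathbf{x}] = 0 $$

The last term of the expansion gives the conditional variance by definition:

$$ \int \{ \mathbb{E}[t | \mathbf{x}] - t \}^2 p(t | \mathbf{x}) dt = var[t | \mathbf{x}] $$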

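The second term is the noise inherent in the data, so it is irreducible. Our goal is to use the given data $\mathcal{D}$ to make $y(\mathbf{x})$ as close as possible to $\mathbb{E}[t | \mathbf{x}]$. Since we need to account for how much the fitted function changes as the dataset changes (the variance), consider many datasets and average over them. For a given $\mathbf{x}$, the integrand of the first term, averaged over datasets, can be written as follows.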

$$ \mathbb{E}_{\mathcal{D}}[ \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}[ t | \mathbf{x} ] \}^2 ] \\ = \{ \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] - \mathbb{E}[t | \mathbf{x}] \}^2 + \mathbb{E}_{\mathcal{D}} [ \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] \}^2 ]$$
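The equality follows from the same add-and-subtract trick used earlier, this time with $\mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ]$. A sketch of the intermediate step:

$$ \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}[t | \mathbf{x}] \}^2 = \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] \}^2 + 2 \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] \} \{ \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] - \mathbb{E}[t | \mathbf{x}] \} + \{ \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] - \mathbb{E}[t | \mathbf{x}] \}^2 $$

Taking $\mathbb{E}_{\mathcal{D}}$ of both sides removes the cross term, because $\mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] - \mathbb{E}[t | \mathbf{x}]$ does not depend on $\mathcal{D}$ and

$$ \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[ y(\mathbf{x}; \mathcal{D}) ] ] = 0 $$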

The first term is the squared bias and the second term is the variance. With a simple model the second term (variance) is small, but it becomes hard to fit $\mathbb{E}[t | \mathbf{x}]$ in the first term well. With a complex model the first term (squared bias) is small, but $y(\mathbf{x}; \mathcal{D})$ can change a lot when $\mathcal{D}$ changes.
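The two terms can also be estimated by simulation. The following is a minimal sketch under stated assumptions, not a general recipe: $\mathbb{E}[t | \mathbf{x}]$ is taken to be a sine curve (the hypothetical function `h`), the training inputs are held fixed so that only the noisy targets differ between datasets, and the models are least-squares polynomials of a few degrees fitted with NumPy's `Polynomial.fit`. All names and constants are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 0.3  # hypothetical noise level


def h(x):
    # Hypothetical E[t | x]; chosen only for illustration.
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 25)   # fixed inputs; only the noisy targets change per dataset
x_test = np.linspace(0.0, 1.0, 101)   # points at which the fitted functions are compared


def bias_variance(degree, n_datasets=200):
    # Row i holds y(x; D_i) evaluated on x_test for the i-th simulated dataset.
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        t_train = h(x_train) + rng.normal(0.0, NOISE_STD, size=x_train.size)
        preds[i] = Polynomial.fit(x_train, t_train, degree)(x_test)
    mean_pred = preds.mean(axis=0)                    # estimate of E_D[y(x; D)]
    bias_sq = np.mean((mean_pred - h(x_test)) ** 2)   # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))             # variance, averaged over x_test
    total = np.mean((preds - h(x_test)) ** 2)         # E_D[{y(x; D) - E[t|x]}^2], averaged over x_test
    return bias_sq, variance, total


results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance, total) in results.items():
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}  total={total:.4f}")
```

In this setup the squared bias and the variance add up to the total for every degree, and the degree-1 fit should show the larger squared bias while the degree-9 fit shows the larger variance.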


Han's smoke filled room
