1.1 Example: Polynomial Curve Fitting

Now suppose that we are given a training set comprising $N$ observations of $x$, written $\textbf{x} = (x_1, …, x_N)^{T}$, together with corresponding observations of the values of $t$, denoted $\textbf{t} = (t_1, …, t_N)^T$. Our goal is to exploit this training set in order to make predictions of the value $\hat{t}$ of the target variable for some new value $\hat{x}$ of the input variable. As we shall see later, this implicitly involves trying to discover the underlying function $\sin(2\pi x)$.
For the moment, however, we shall proceed rather informally and consider a simple approach based on curve fitting. In particular, we shall fit the data using a polynomial function of the form
$$
y(x, \textbf{w}) = w_0 + w_1x + w_2x^2 + … + w_Mx^M = \sum_{j=0}^{M} w_jx^j \tag{1.1}
$$
where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0, …, w_M$ are collectively denoted by the vector $\textbf{w}$. Note that, although the polynomial function $y(x, \textbf{w})$ is a nonlinear function of $x$, it is a linear function of the coefficients $\textbf{w}$. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models.
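
As a concrete illustration of (1.1), here is a minimal NumPy sketch (our own example, not from the text) that evaluates $y(x, \textbf{w})$ for a scalar input; the coefficient values are arbitrary.

```python
import numpy as np

def polynomial(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j for coefficients w = (w_0, ..., w_M)."""
    powers = np.arange(len(w))         # exponents 0, 1, ..., M
    return np.sum(w * x ** powers)     # linear in w, nonlinear in x

# Example with M = 2 and w = (1, -2, 3), i.e. y(x) = 1 - 2x + 3x^2
print(polynomial(0.5, np.array([1.0, -2.0, 3.0])))   # 0.75
```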

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x, \textbf{w})$, for any given value of $\textbf{w}$, and the training set data points. One simple choice of error function, which is widely used, is the sum of the squares of the errors between the predictions $y(x_n, \textbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$, so that we minimize

$$E(\textbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{ y(x_n, \textbf{w}) - t_n\}^2 \tag{1.2}$$

It is a nonnegative quantity that would be zero if, and only if, the function $y(x, \textbf{w})$ were to pass exactly through each training data point. The geometrical interpretation of the sum-of-squares error function is illustrated below.

The error function corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function $y(x, \textbf{w})$.

Because the error function is a quadratic function of the coefficients $\textbf{w}$, its derivatives with respect to the coefficients will be linear in the elements of $\textbf{w}$, and so the minimization of the error function has a unique solution, denoted by $\textbf{w}^{*}$, which can be found in closed form.
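
To make the closed-form solution concrete, here is a minimal sketch (our own illustration, not the book's code) that builds the design matrix $\Phi$ with $\Phi_{nj} = x_n^j$ and solves the least-squares problem for $\textbf{w}^*$; the noisy $\sin(2\pi x)$ data simply mimic the running example.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimize the sum-of-squares error (1.2) in closed form via least squares."""
    Phi = np.vander(x, M + 1, increasing=True)        # Phi[n, j] = x_n ** j
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # unique minimizer of (1.2)
    return w_star

# Toy data: noisy samples of sin(2*pi*x), as in the running example
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
print(fit_polynomial(x, t, M=3))
```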

As we noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. For each choice of $M$, we can then evaluate the residual value of $E(\textbf{w}^*)$ given by (1.2) for the training data, and we can also evaluate $E(\textbf{w}^*)$ for the test dataset. It is sometimes more convenient to use the root-mean-square (RMS) error defined by

$$E_{RMS} = \sqrt{2E(\textbf{w}^*)/N} \tag{1.3}$$

in which the division by $N$ allows us to compare data sets of different sizes on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable $t$.
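
A small companion sketch for (1.3), continuing the illustrative `fit_polynomial` setup above (again our own naming, not the book's):

```python
import numpy as np

def rms_error(w, x, t):
    """Root-mean-square error (1.3) for coefficients w on a data set (x, t)."""
    Phi = np.vander(x, len(w), increasing=True)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)    # sum-of-squares error (1.2)
    return np.sqrt(2.0 * E / len(x))

# Typically evaluated with the same w* on both the training set and a
# held-out test set, and plotted against the polynomial order M.
```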

Graphs of the root-mean-square error, evaluated on the training set and on an independent test set, for various values of $M$.
Table of the coefficients $\textbf{w}^*$ for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

We can see that, as $M$ increases, the magnitude of the coefficients typically gets larger. In particular, for the $M = 9$ polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits large oscillations. Intuitively, what is happening is that the more flexible polynomials with larger values of $M$ are becoming increasingly tuned to the random noise on the target values.

Why do the coefficients take on large positive and negative values as $M$ increases? (Why do higher-degree polynomials tend to have larger coefficients?)

In short, to fit closely to every (noisy) data point the polynomial needs steep slopes and sharp turns, which require large coefficients. More specifically (a small numerical check follows the list below):

  1. Mathematically, each term in a polynomial is of the form $w_ix^{i}$, where $w_i$ is the coefficient of the $i$-th term. To achieve sharp turns and fit closely to data points, the polynomial might need to have significant contributions from high-degree terms ($x^i$ for large $i$). Because these terms grow rapidly with $x$, the coefficients $w_i$ must be large (either positive or negative) to counterbalance this effect and to ensure that the polynomial passes through or near the data points.
  2. A high-degree polynomial doesn't just model the underlying trend; it also starts to model the random noise. To "wiggle" through all the data points, including the noise, the polynomial needs steep slopes and sharp turns, which require large coefficients.
  3. Low-degree polynomials have high bias because they can’t fully capture complex relationships (they’re too “rigid”). High-degree polynomials have low bias because they can capture very complex relationships, but they have high variance because they’re sensitive to the exact data points they’re trained on, including noise. High variance manifests as large swings in the polynomial’s shape, necessitated by large coefficient values.
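
As promised above, here is a minimal numerical check of this effect (illustrative only; the exact numbers depend on the noise seed): fit the same noisy $\sin(2\pi x)$ sample with increasing order $M$ and print the largest coefficient magnitude.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

for M in (1, 3, 9):
    Phi = np.vander(x, M + 1, increasing=True)        # design matrix
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # unregularized fit
    print(f"M = {M}: max |w_j| = {np.max(np.abs(w)):.1f}")
# The M = 9 fit interpolates all 10 points and its coefficients blow up.
```
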
Plots of the solutions obtained by minimizing the sum-of-squares error function using the $M=9$ polynomial for $N=15$ data points (left plot) and $N=100$ data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model.

We shall see that the least-squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

$$\tilde{E}(\textbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{ y(x_n, \textbf{w}) - t_n \}^{2} + \frac{\lambda}{2} \| \textbf{w} \|^{2} \tag{1.4}$$

where $\|\textbf{w}\|^2 = \textbf{w}^{T}\textbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term. Again, the error function in (1.4) can be minimized exactly in closed form. Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression. In the context of neural networks, this approach is known as weight decay.
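
Setting the gradient of (1.4) to zero gives the closed-form minimizer $\textbf{w}^* = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\textbf{t}$, with $\Phi_{nj} = x_n^j$. A minimal sketch of this (our own helper name; note it penalizes $w_0$ too, exactly as (1.4) is written here):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize the regularized error (1.4): w* = (Phi^T Phi + lam*I)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Larger lam shrinks the coefficients and tames the M = 9 oscillations;
# lam = 0 recovers the unregularized least-squares fit.
```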

Can perpendicular distance be used to measure the misfit between the polynomial function $y(x, \textbf{w})$ and the training data points?

Yes. This approach is somewhat different from the more commonly used method of measuring the vertical distance (the difference in the $y$ direction, illustrated above). The perpendicular distance takes into account both the vertical and horizontal discrepancies between the data points and the fitted curve, potentially offering a more geometrically faithful measure of the misfit. However, this approach is more computationally intensive and mathematically complex, because the perpendicular distance from a point to a curve depends on both the point's $x$ and $y$ coordinates in a nonlinear way, especially for higher-degree polynomials. The minimization problem in this case becomes more complicated and might not have a straightforward analytical solution, often requiring numerical optimization techniques.

While using perpendicular distance could theoretically provide a more "natural" measure of fit for certain applications, especially where errors in both the $x$ and $y$ directions matter, the increased complexity and computational cost make this approach less common in practice.
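
To make the contrast concrete, the sketch below (our own illustration; it assumes SciPy is available and restricts the search to an arbitrary window around the data point) computes the perpendicular distance from a single point to a polynomial curve by one-dimensional numerical minimization, which is exactly the extra work the vertical-distance formulation avoids.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def perpendicular_distance(x_n, t_n, w):
    """Shortest distance from the point (x_n, t_n) to the curve y(x, w)."""
    def squared_distance(x):
        y = np.polyval(w[::-1], x)       # np.polyval expects highest power first
        return (x - x_n) ** 2 + (y - t_n) ** 2
    # Arbitrary bounded search window; assumes the nearest point lies inside it.
    res = minimize_scalar(squared_distance, bounds=(x_n - 1.0, x_n + 1.0), method="bounded")
    return np.sqrt(res.fun)

# Example with y(x) = 1 - 2x + 3x^2 and the data point (0.5, 2.0)
print(perpendicular_distance(0.5, 2.0, np.array([1.0, -2.0, 3.0])))
```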

1.2 Probability Theory

The Rules of Probability

sum rule $p(X) =\sum_{Y}p(X, Y)$

product rule $p(X, Y) = p(Y|X)p(X)$

Here $p(X, Y)$ is a joint probability and is verbalized as “the probability of $X$ and $Y$”. Similarly, the quantity $p(Y|X)$ is a conditional probability and is verbalized as “the probability of $Y$ given $X$”, whereas the quantity $p(X)$ is a marginal probability and is simply “the probability of $X$”. From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities

$$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)} \tag{1.12}$$

which is called Bayes’ theorem and which plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator

$$p(X)=\sum_{Y}p(X|Y)p(Y) \tag{1.13}$$

The denominator is the sum over all values of $Y$ of the numerator. We can view the denominator as the normalization constant required to ensure that the conditional probability on the left-hand side of (1.12), summed over all values of $Y$, equals one.
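
A tiny worked example of (1.12) and (1.13) with two discrete variables (the probability values below are arbitrary illustrative numbers):

```python
# Prior p(Y) and conditional p(X|Y) for two-valued X and Y
p_Y = {"red": 0.4, "blue": 0.6}
p_X_given_Y = {"red":  {"apple": 0.25, "orange": 0.75},
               "blue": {"apple": 0.75, "orange": 0.25}}

# Sum rule (1.13): p(X = orange) = sum_Y p(X = orange | Y) p(Y)
p_orange = sum(p_X_given_Y[y]["orange"] * p_Y[y] for y in p_Y)

# Bayes' theorem (1.12): p(Y = red | X = orange)
p_red_given_orange = p_X_given_Y["red"]["orange"] * p_Y["red"] / p_orange

print(p_orange)             # 0.45
print(p_red_given_orange)   # 0.666...
```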

Because probabilities are nonnegative, and because the value of $x$
