Machine Learning and Data Mining Lecture Notes

machine learning and data mining methods in diabetes research, machine learning and data mining in pattern recognition machine learning and data mining approaches to climate science pdf
Dr.JakeFinlay Profile Pic
Published Date:22-07-2017
Your Website URL(Optional)
Machine Learning and Data Mining Lecture Notes CSC 411/D11 Computer Science Department University of Toronto Version: February 6, 2012 c Copyright 2010 Aaron Hertzmann and David FleetCSC 411 / CSC D11 Introduction to Machine Learning 1 Introduction to Machine Learning Machine learning is a set of tools that, broadly speaking, allow us to “teach” computers how to perform tasks by providing examples of how they should be done. For example, suppose we wish to write a program to distinguish between valid email messages and unwanted spam. We could try to write a set of simple rules, for example, flagging messages that contain certain features (such as the word “viagra” or obviously-fake headers). However, writing rules to accurately distinguish which text is valid can actually be quite difficult to do well, resulting either in many missed spam messages, or, worse, many lost emails. Worse, the spammers will actively adjust the way they send spam in order to trick these strategies (e.g., writing “vigr”). Writing effective rules — and keeping them up-to-date — quickly becomes an insurmountable task. Fortunately, machine learning has provided a solution. Modern spam filters are “learned” from examples: we provide the learning algorithm with example emails which we have manually labeled as “ham” (valid email) or “spam” (unwanted email), and the algorithms learn to distinguish between them automatically. Machine learning is a diverse and exciting field, and there are multiple ways of defining it: 1. The Artifical Intelligence View. Learning is central to human knowledge and intelligence, and, likewise, it is also essential for building intelligent machines. Years of effort in AI has shown that trying to build intelligent computers by programming all the rules cannot be done; automatic learning is crucial. For example, we humans are not born with the ability to understand language — we learn it — and it makes sense to try to have computers learn language instead of trying to program it all it. 2. The Software Engineering View. Machine learning allows us to program computers by example, which can be easier than writing code the traditional way. 3. The Stats View. Machine learning is the marriage of computer science and statistics: com- putational techniques are applied to statistical problems. Machine learning has been applied to a vast number of problems in many contexts, beyond the typical statistics problems. Ma- chine learning is often designed with different considerations than statistics (e.g., speed is often more important than accuracy). Often, machine learning methods are broken into two phases: 1. Training: A model is learned from a collection of training data. 2. Application: The model is used to make decisions about some new test data. For example, in the spam filtering case, the training data constitutes email messages labeled as ham or spam, and each new email message that we receive (and which to classify) is test data. However, there are other ways in which machine learning is used as well. Copyright c 2011 Aaron Hertzmann and David Fleet 1CSC 411 / CSC D11 Introduction to Machine Learning 1.1 Types of Machine Learning Some of the main types of machine learning are: 1. Supervised Learning, in which the training data is labeled with the correct answers, e.g., “spam” or “ham.” The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued). 2. Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering. 3. Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based the outcomes of past actions. There are many other types of machine learning as well, for example: 1. Semi-supervised learning, in which only a subset of the training data is labeled 2. Time-series forecasting, such as in financial markets 3. Anomaly detection such as used for fault-detection in factories and in surveillance 4. Active learning, in which obtaining data is expensive, and so an algorithm must determine which training data to acquire and many others. 1.2 A simple problem Figure 1 shows a 1D regression problem. The goal is to fit a 1D curve to a few points. Which curve is best to fit these points? There are infinitely many curves that fit the data, and, because the data might be noisy, we might not even want to fit the data precisely. Hence, machine learning requires that we make certain choices: 1. How do we parameterize the model we fit? For the example in Figure 1, how do we param- eterize the curve; should we try to explain the data with a linear function, a quadratic, or a sinusoidal curve? 2. What criteria (e.g., objective function) do we use to judge the quality of the fit? For example, when fitting a curve to noisy data, it is common to measure the quality of the fit in terms of the squared error between the data we are given and the fitted curve. When minimizing the squared error, the resulting fit is usually called a least-squares estimate. Copyright c 2011 Aaron Hertzmann and David Fleet 2CSC 411 / CSC D11 Introduction to Machine Learning 3. Some types of models and some model parameters can be very expensive to optimize well. How long are we willing to wait for a solution, or can we use approximations (or hand- tuning) instead? 4. Ideally we want to find a model that will provide useful predictions in future situations. That is, although we might learn a model from training data, we ultimately care about how well it works on future test data. When a model fits training data well, but performs poorly on test data, we say that the model has overfit the training data; i.e., the model has fit properties of the input that are not particularly relevant to the task at hand (e.g., Figures 1 (top row and bottom left)). Such properties are refered to as noise. When this happens we say that the model does not generalize well to the test data. Rather it produces predictions on the test data that are much less accurate than you might have hoped for given the fit to the training data. Machine learning provides a wide selection of options by which to answer these questions, along with the vast experience of the community as to which methods tend to be successful on a particular class of data-set. Some more advanced methods provide ways of automating some of these choices, such as automatically selecting between alternative models, and there is some beautiful theory that assists in gaining a deeper understanding of learning. In practice, there is no single “silver bullet” for all learning. Using machine learning in practice requires that you make use of your own prior knowledge and experimentation to solve problems. But with the tools of machine learning, you can do amazing things Copyright c 2011 Aaron Hertzmann and David Fleet 3CSC 411 / CSC D11 Introduction to Machine Learning 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −1.5 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 6 1.5 4 1 2 0.5 0 0 −2 −0.5 −4 −6 −1 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Figure 1: A simple regression problem. The blue circles are measurements (the training data), and the red curves are possible fits to the data. There is no one “right answer;” the solution we prefer depends on the problem. Ideally we want to find a model that provides good predictions for new inputs (i.e., locations on thex-axis for which we had no training data). We will often prefer simple, smooth models like that in the lower right. Copyright c 2011 Aaron Hertzmann and David Fleet 4CSC 411 / CSC D11 Linear Regression 2 Linear Regression In regression, our goal is to learn a mapping from one real-valued space to another. Linear re- gression is the simplest form of regression: it is easy to understand, often quite effective, and very efficient to learn and use. 2.1 The 1D case We will start by considering linear regression in just 1 dimension. Here, our goal is to learn a mappingy = f(x), wherex andy are both real-valued scalars (i.e.,x∈R,y∈R). We will take f to be an linear function of the form: y =wx+b (1) where w is a weight and b is a bias. These two scalars are the parameters of the model, which we would like to learn from training data. n particular, we wish to estimate w and b from the N N training pairs(x,y ) . Then, once we have values for w and b, we can compute the y for a i i i=1 newx. Given 2 data points (i.e., N=2), we can exactly solve for the unknown slope w and offset b. (How would you formulate this solution?) Unfortunately, this approach is extremely sensitive to noise in the training data measurements, so you cannot usually trust the resulting model. Instead, we can find much better models when the two parameters are estimated from larger data sets. WhenN 2 we will not be able to find unique parameter values for whichy = wx +b for all i i i, since we have many more constraints than parameters. The best we can hope for is to find the parameters that minimize the residual errors, i.e.,y −(wx +b). i i The most commonly-used way to estimate the parameters is by least-squares regression. We define an energy function (a.k.a. objective function): N X 2 E(w,b) = (y −(wx +b)) (2) i i i=1 To estimate w and b, we solve for the w and b that minimize this objective function. This can be done by setting the derivatives to zero and solving. X dE = −2 (y −(wx +b)) = 0 (3) i i db i Solving forb gives us the estimate: P P y x i i ∗ i i b = −w (4) N N = y¯−wx ¯ (5) Copyright c 2011 Aaron Hertzmann and David Fleet 5CSC 411 / CSC D11 Linear Regression 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 x Figure 2: An example of linear regression: the red line is fit to the blue data points. ∗ where we definex ¯ andy¯ as the averages of thex’s andy’s, respectively. This equation forb still depends onw, but we can nevertheless substitute it back into the energy function: X 2 E(w,b) = ((y −y¯)−w(x −x ¯)) (6) i i i Then: X dE = −2 ((y −y¯)−w(x −x ¯))(x −x ¯) (7) i i i dw i dE Solving = 0 then gives: dw P (y −y¯)(x −x ¯) i i ∗ i w = P (8) 2 (x −x ¯) i i ∗ ∗ The valuesw andb are the least-squares estimates for the parameters of the linear regression. 2.2 Multidimensional inputs D Now, suppose we wish to learn a mapping fromD-dimensional inputs to scalar outputs:x∈R , 1 y∈R. Now, we will learn a vector of weightsw, so that the mapping will be: D X T f(x) =w x+b = w x +b . (9) j j j=1 1 Above we used subscripts to index the training set, while here we are using the subscript to index the elements of the input and weight vectors. In what follows the context should make it clear what the index denotes. Copyright c 2011 Aaron Hertzmann and David Fleet 6 yCSC 411 / CSC D11 Linear Regression For convenience, we can fold the bias b into the weights, if we augment the inputs with an addi- tional 1. In other words, if we define     w x 1 1     . . . .     . . w ˜ = , x ˜ = (10)         w x D D b 1 then the mapping can be written: T f(x) =w ˜ x ˜ . (11) GivenN training input-output pairs, the least-squares objective function is then: N X T 2 E(w ˜) = (y −w ˜ x ˜ ) (12) i i i=1 If we stack the outputs in a vector and the inputs in a matrix, then we can also write this as: 2 ˜ E(w ˜) =y−Xw ˜ (13) where     T y x 1 1 1     . . ˜ . . y = , X = (14)     . . T y x 1 N N P 2 2 and· is the usual Euclidean norm, i.e.,v = v . (You should verify for yourself that i i Equations 12 and 13 are equivalent). Equation 13 is known as a linear least-squares problem, and can be solved by methods from linear algebra. We can rewrite the objective function as: T ˜ ˜ ˜ ˜ E(w) = (y−Xw) (y−Xw) (15) T T T T ˜ ˜ ˜ = w ˜ X Xw ˜−2y Xw ˜ +y y (16) We can optimize this by setting all values of dE/dw = 0 and solving the resulting system of i equations (we will cover this in more detail later in Chapter 4). In the meantime, if this is unclear, start by reviewing your linear algebra and vector calculus). The solution is given by: ∗ T −1 T ˜ ˜ ˜ w = (X X) X y (17) (You may wish to verify for yourself that this reduces to the solution for the 1D case in Section + ˜ 2.1; however, this takes quite a lot of linear algebra and a little cleverness). The matrixX ≡ T −1 T ˜ ˜ ˜ ˜ (X X) X is called the pseudoinverse ofX, and so the solution can also be written: ∗ + ˜ w ˜ =X y (18) Copyright c 2011 Aaron Hertzmann and David Fleet 7CSC 411 / CSC D11 Linear Regression In MATLAB, one can directly solve the system of equations using the slash operator: ∗ ˜ w ˜ =X\y (19) There are some subtle differences between these two ways of solving the system of equations. We will not concern ourselves with these here except to say that I recommend using the slash operator rather than the pseudoinverse. 2.3 Multidimensional outputs In the most general case, both the inputs and outputs may be multidimensional. For example, with K D-dimensional inputs, andK-dimensional outputsy∈R , a linear mapping from input to output can be written as T ˜ y =W x ˜ (20) (D+1)×K ˜ ˜ whereW∈R . It is convenient to expressW in terms of its column vectors, i.e.,   w ... w 1 K ˜ ˜ ˜ W = w ... w ≡ . (21) 1 K b ... b 1 K T In this way we can then express the mapping from the inputx ˜ to thej element ofy asy =w ˜ x. th j j N Now, givenN training samples, denotedx ˜ ,y a natural energy function to minimize in order i i i=1 ˜ to estimateW is just the squared residual error over all training samples and all output dimensions, i.e., N K XX T 2 ˜ ˜ ˜ E(W) = (y −w x ) . (22) i,j i j i=1 j=1 There are several ways to conveniently vectorize this energy function. One way is to express ′ E solely as a sum over output dimensions. That is, lety be theN-dimensional vector comprising j th ′ T thej component of each output training vector, i.e.,y = y ,y ,...,y . Then we can write 1,j 2,j N,j j K X ′ 2 ˜ ˜ E(W) = y −Xw ˜ (23) j j j=1 T ˜ ˜ ˜ ˜ where X = x x ... x . With a little thought you can see that this really amounts to K 1 2 N ∗ + ′ ˜ distinct estimation problems, the solutions for which are given byw ˜ =X y . j j Another common convention is to stack up everything into a matrix equation, i.e., 2 ˜ ˜ ˜ E(W) =Y−XW (24) F P ′ ′ 2 2 whereY = y ...y , and· denotes the Frobenius norm:Y = Y . You should F 1 K F i,j i,j verify that Equations (23) and (24) are equivalent representations of the energy function in Equa- tion (22). Finally, the solution is again provided by the pseudoinverse: ∗ + ˜ ˜ W =X Y (25) ∗ ˜ ˜ or, in MATLAB,W =X\Y. Copyright c 2011 Aaron Hertzmann and David Fleet 8CSC 411 / CSC D11 Nonlinear Regression 3 Nonlinear Regression Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic form, i.e., y =f(x) (26) In linear regression, we havef(x) =Wx+b; the parametersW andb must be fit to data. What nonlinear function do we choose? In principle,f(x) could be anything: it could involve linear functions, sines and cosines, summations, and so on. However, the form we choose will make a big difference on the effectiveness of the regression: a more general model will require more data to fit, and different models are more appropriate for different problems. Ideally, the form of the model would be matched exactly to the underlying phenomenon. If we’re modeling a linear process, we’d use a linear regression; if we were modeling a physical process, we could, in principle, modelf(x) by the equations of physics. In many situations, we do not know much about the underlying nature of the process being modeled, or else modeling it precisely is too difficult. In these cases, we typically turn to a few models in machine learning that are widely-used and quite effective for many problems. These methods include basis function regression (including Radial Basis Functions), Artificial Neural Networks, andk-Nearest Neighbors. There is one other important choice to be made, namely, the choice of objective function for learning, or, equivalently, the underlying noise model. In this section we extend the LS estimators introduced in the previous chapter to include one or more terms to encourage smoothness in the estimated models. It is hoped that smoother models will tend to overfit the training data less and therefore generalize somewhat better. 3.1 Basis function regression 2 A common choice for the functionf(x) is a basis function representation : X y =f(x) = w b (x) (27) k k k for the 1D case. The functions b (x) are called basis functions. Often it will be convenient to k T express this model in vector form, for which we define b(x) = b (x),...,b (x) and w = 1 M T w ,...,w whereM is the number of basis functions. We can then rewrite the model as 1 M T y =f(x) =b(x) w (28) Two common choices of basis functions are polynomials and Radial Basis Functions (RBF). A simple, common basis for polynomials are the monomials, i.e., 2 3 b (x) = 1, b (x) =x, b (x) =x , b (x) =x , ... (29) 0 1 2 3 2 In the machine learning and statistics literature, these representations are often referred to as linear regression, since they are linear functions of the “features”b (x) k Copyright c 2011 Aaron Hertzmann and David Fleet 9CSC 411 / CSC D11 Nonlinear Regression Polynomial basis functions Radial Basis Functions 8 2 0 x 1 6 x 2 x 1.5 3 x 4 1 2 0 0.5 −2 0 −4 −0.5 −6 −1 −8 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 x x Figure 3: The first three basis functions of a polynomial basis, and Radial Basis Functions With a monomial basis, the regression model has the form X k f(x) = w x , (30) k Radial Basis Functions, and the resulting regression model are given by 2 (x−c ) k − 2 2σ b (x) =e , (31) k 2 X (x−c ) k − 2 2σ f(x) = w e , (32) k 2 wherec is the center (i.e., the location) of the basis function andσ determines the width of the k basis function. Both of these are parameters of the model that must be determined somehow. In practice there are many other possible choices for basis functions, including sinusoidal func- tions, and other types of polynomials. Also, basis functions from different families, such as mono- mials and RBFs, can be combined. We might, for example, form a basis using the first few poly- nomials and a collection of RBFs. In general we ideally want to choose a family of basis functions such that we get a good fit to the data with a small basis set so that the number of weights to be estimated is not too large. To fit these models, we can again use least-squares regression, by minimizing the sum of squared residual error between model predictions and the training data outputs: 2 X X X 2 E(w) = (y −f(x )) = y − w b (x) (33) i i i k k i i k To minimize this function with respect tow, we note that this objective function has the same form as that for linear regression in the previous chapter, except that the inputs are now theb (x) values. k Copyright c 2011 Aaron Hertzmann and David Fleet 10CSC 411 / CSC D11 Nonlinear Regression In particular, E is still quadratic in the weightsw, and hence the weightsw can be estimated the same way. That is, we can rewrite the objective function in matrix-vector form to produce 2 E(w) =y−Bw (34) where· denotes the Euclidean norm, and the elements of the matrixB are given byB =b (x ) i,j j i ∗ (for rowi and columnj). In Matlab the least-squares estimate can be computed asw =B\y. Picking the other parameters. The positions of the centers and the widths of the RBF basis functions cannot be solved directly for in closed form. So we need some other criteria to select them. If we optimize these parameters for the squared-error, then we will end up with one basis center at each data point, and with tiny width that exactly fit the data. This is a problem as such a model will not usually provide good predictions for inputs other than those in the training set. The following heuristics instead are commonly used to determine these parameters without overfitting the training data. To pick the basis centers: 1. Place the centers uniformly spaced in the region containing the data. This is quite simple, but can lead to empty regions with basis functions, and will have an impractical number of data points in higher-dimensinal input spaces. 2. Place one center at each data point. This is used more often, since it limits the number of centers needed, although it can also be expensive if the number of data points is large. 3. Cluster the data, and use one center for each cluster. We will cover clustering methods later in the course. To pick the width parameter: 1. Manually try different values of the width and pick the best by trial-and-error. 2. Use the average squared distances (or median distances) to neighboring centers, scaled by a constant, to be the width. This approach also allows you to use different widths for different basis functions, and it allows the basis functions to be spaced non-uniformly. In later chapters we will discuss other methods for determining these and other parameters of models. 3.2 Overfitting and Regularization Directly minimizing squared-error can lead to an effect called overfitting, wherein we fit the train- ing data extremely well (i.e., with low error), yet we obtain a model that produces very poor pre- dictions on future test data whenever the test inputs differ from the training inputs (Figure 4(b)). Overfitting can be understood in many ways, all of which are variations on the same underlying pathology: Copyright c 2011 Aaron Hertzmann and David Fleet 11CSC 411 / CSC D11 Nonlinear Regression 1. The problem is insufficiently constrained: for example, if we have ten measurements and ten model parameters, then we can often obtain a perfect fit to the data. 2. Fitting noise: overfitting can occur when the model is so powerful that it can fit the data and also the random noise in the data. 3. Discarding uncertainty: the posterior probability distribution of the unknowns is insuffi- ciently peaked to pick a single estimate. (We will explain what this means in more detail later.) There are two important solutions to the overfitting problem: adding prior knowledge and handling uncertainty. The latter one we will discuss later in the course. In many cases, there is some sort of prior knowledge we can leverage. A very common as- sumption is that the underlying function is likely to be smooth, for example, having small deriva- tives. Smoothness distinguishes the examples in Figure 4. There is also a practical reason to prefer smoothness, in that assuming smoothness reduces model complexity: it is easier to estimate smooth models from small datasets. In the extreme, if we make no prior assumptions about the nature of the fit then it is impossible to learn and generalize at all; smoothness assumptions are one way of constraining the space of models so that we have any hope of learning from small datasets. One way to add smoothness is to parameterize the model in a smooth way (e.g., making the width parameter for RBFs larger; using only low-order polynomial basis functions), but this limits the expressiveness of the model. In particular, when we have lots and lots of data, we would like the data to be able to “overrule” the smoothness assumptions. With large widths, it is impossible to get highly-curved models no matter what the data says. Instead, we can add regularization: an extra term to the learning objective function that prefers smooth models. For example, for RBF regression with scalar outputs, and with many other types of basis functions or multi-dimensional outputs, this can be done with an objective function of the form: 2 2 E(w) =y−Bw + λw (35) z z data term smoothness term This objective function has two terms. The first term, called the data term, measures the model fit to the training data. The second term, often called the smoothness term, penalizes non-smoothness (rapid changes inf(x)). This particular smoothness term (w) is called weight decay, because it 3 tends to make the weights smaller. Weight decay implicitly leads to smoothness with RBF basis functions because the basis functions themselves are smooth, so rapid changes in the slope of f (i.e., high curvature) can only be created in RBFs by adding and subtracting basis functions with large weights. (Ideally, we might directly penalize smoothness, e.g., using an objective term that directly penalizes the integral of the squared curvature off(x), but this is usually impractical.) 3 Estimation with this objective function is sometimes called Ridge Regression in Statistics. Copyright c 2011 Aaron Hertzmann and David Fleet 12CSC 411 / CSC D11 Nonlinear Regression This regularized least-squares objective function is still quadratic with respect tow and can be optimized in closed-form. To see this, we can rewrite it as follows: T T E(w) = (y−Bw) (y−Bw)+λw w (36) T T T T T T = w B Bw−2w B y+λw w+y y (37) T T T T T = w (B B+λI)w−2w B y+y y (38) To minimizeE(w), as above, we solve the normal equations∇E(w) =0 (i.e.,∂E/∂w = 0 for i alli). This yields the following regularized LS estimate forw: ∗ T −1 T w = (B B+λI) B y (39) 3.3 Artificial Neural Networks Another choice of basis function is the sigmoid function. “Sigmoid” literally means “s-shaped.” The most common choice of sigmoid is: 1 g(a) = (40) −a 1+e Sigmoids can be combined to create a model called an Artificial Neural Network (ANN). For K K 1 regression with multi-dimensional inputsx∈R , and multidimensional outputsy∈R : 2 X X (1) (2) (2) (1) y =f(x) = w g w x +b +b (41) k j k,j j j k ( This equation describes a process whereby a linear regressor with weightsw 2) is applied tox. The output of this regressor is then put through the nonlinear Sigmoid function, the outputs of (2) which act as features to another linear regressor. Thus, note that the inner weightsw are distinct (1) parameters from the outer weightsw . As usual, it is easiest to interpret this model in the 1D j case, i.e.,   X (1) (2) (2) (1) y =f(x) = w g w x+b +b (42) j j j j Figure 5(left) shows plots ofg(wx) for different values ofw, and Figure 5(right) showsg(x+b) for different values of b. As can be seen from the figures, the sigmoid function acts more or less like a step function for large values ofw, and more like a linear ramp for small values ofw. The biasb shifts the function left or right. Hence, the neural network is a linear combination of shifted (smoothed) step functions, linear ramps, and the bias term. To learn an artificial neural network, we can again write a regularized squared-error objective function: 2 2 E(w,b) =y−f(x) +λw (43) Copyright c 2011 Aaron Hertzmann and David Fleet 13CSC 411 / CSC D11 Nonlinear Regression 1.5 1.5 training data points original curve estimated curve 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 training data points original curve estimated curve −1 −1.5 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 (a) (b) 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 training data points −1 training data points original curve original curve estimated curve estimated curve −1.5 −1.5 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 (c) (d) Figure 4: Least-squares curve fitting of an RBF. (a) Point data (blue circles) was taken from a sine curve, and a curve was fit to the points by a least-squares fit. The horizontal axis isx, the vertical axis is y, and the red curve is the estimated f(x). In this case, the fit is essentially perfect. The curve representation is a sum of Gaussian basis functions. (b) Overfitting. Random noise was added to the data points, and the curve was fit again. The curve exactly fits the data points, which does not reproduce the original curve (a green, dashed line) very well. (c) Underfitting. Adding a smoothness term makes the resulting curve too smooth. (In this case, weight decay was used, along with reducing the number of basis functions). (d) Reducing the strength of the smoothness term yields a better fit. Copyright c 2011 Aaron Hertzmann and David Fleet 14CSC 411 / CSC D11 Nonlinear Regression 1 0.9 g(x−4) 0.8 g(x) g(x+4) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 −10 −8 −6 −4 −2 0 2 4 6 8 10 −wx Figure 5: Left: Sigmoidsg(wx) = 1/(1+e ) for various values ofw, ranging from linear ramps −x−b to smooth steps to nearly hard steps. Right: Sigmoids g(x +b) = 1/(1 +e ) with different shiftsb. wherew comprises the weights at both levels for allj. Note that we regularize by applying weight decay to the weights (both inner and outer), but not the biases, since only the weights affect the smoothness of the resulting function (why?). Unfortuntely, this objective function cannot be optimized in closed-form, and numerical opti- mization procedures must be used. We will study one such method, gradient descent, in the next chapter. 3.4 K-Nearest Neighbors At heart, many learning procedures — especially when our prior knowledge is weak — amount to smoothing the training data. RBF fitting is an example of this. However, many of these fitting procedures require making a number of decisions, such as the locations of the basis functions, and can be sensitive to these choices. This raises the question: why not cut out the middleman, and smooth the data directly? This is the idea behindK-Nearest Neighbors regression. The idea is simple. We first select a parameterK, which is the only parameter to the algorithm. Then, for a new inputx, we find the K nearest neighbors tox in the training set, based on their 2 Euclidean distancex−x . Then, our new outputy is simply an average of the training outputs i Copyright c 2011 Aaron Hertzmann and David Fleet 15CSC 411 / CSC D11 Nonlinear Regression for those nearest neigbors. This can be expressed as: X 1 y = y (44) i K i∈N (x) K where the setN (x) contains the indicies of theK training points closest tox. Alternatively, we K might take a weighted average of theK-nearest neighbors to give more influence to training points close tox than to those further away: P w(x )y i i 2 2 i∈N (x) K −x −x /2σ i y = P , w(x ) =e (45) i w(x ) i i∈N (x) K 2 whereσ is an additional parameter to the algorithm. The parametersK andσ control the degree of smoothing performed by the algorithm. In the extreme case ofK = 1, the algorithm produces a piecewise-constant function. K-nearest neighbors is simple and easy to implement; it doesn’t require us to muck about at all with different choices of basis functions or regularizations. However, it doesn’t compress the data at all: we have to keep around the entire training set in order to use it, which could be very expensive, and we must search the whole data set to make predictions. (The cost of searching can be mitigated with spatial data-structures designed for searching, such ask-d-trees and locality- sensitive hashing. We will not cover these methods here). Copyright c 2011 Aaron Hertzmann and David Fleet 16CSC 411 / CSC D11 Quadratics 4 Quadratics The objective functions used in linear least-squares and regularized least-squares are multidimen- sional quadratics. We now analyze multidimensional quadratics further. We will see many more uses of quadratics further in the course, particularly when dealing with Gaussian distributions. The general form of a one-dimensional quadratic is given by: 2 f(x) =w x +w x+w (46) 2 1 0 This can also be written in a slightly different way (called standard form): 2 f(x) =a(x−b) +c (47) 2 where a = w ,b =−w /(2w ),c = w − w /4w . These two forms are equivalent, and it is 2 1 2 0 2 1 easy to go back and forth between them (e.g., given a,b,c, what are w ,w ,w ?). In the latter 0 1 2 form, it is easy to visualize the shape of the curve: it is a bowl, with minimum (or maximum) at b, and the “width” of the bowl is determined by the magnitude of a, the sign of a tells us which direction the bowl points (a positive means a convex bowl,a negative means a concave bowl), and c tells us how high or low the bowl goes (at x = b). We will now generalize these intuitions for higher-dimensional quadratics. The general form for a 2D quadratic function is: 2 2 f(x ,x ) =w x +w x x +w x +w x +w x +w (48) 1 2 1,1 1,2 1 2 2,2 1 1 2 2 0 1 2 and, for anN-D quadratic, it is: X X f(x ,...x ) = w xx + wx +w (49) 1 N i,j i j i i 0 1≤i≤N,1≤j≤N 1≤i≤N P P Note that there are three sets of terms: the quadratic terms ( w xx ), the linear terms ( wx ) i,j i j i i and the constant term (w ). 0 Dealing with these summations is rather cumbersome. We can simplify things by using matrix- T vector notation. Letx be anN-dimensional column vector, writtenx = x ,...x . Then we can 1 N write a quadratic as: T T f(x) =x Ax+b x+c (50) where   w ... w 1,1 1,N   . . . . A = (51)   . w . i,j w ... w N,1 N,N T b = w ,...,w (52) 1 N c = w (53) 0 Copyright c 2011 Aaron Hertzmann and David Fleet 17CSC 411 / CSC D11 Quadratics You should verify for yourself that these different forms are equivalent: by multiplying out all the elements off(x), either in the 2D case or, using summations, the generalN−D case. For many manipulations we will want to do later, it is helpful forA to be symmetric, i.e., to havew = w . In fact, it should be clear that these off-diagonal entries are redundant. So, if we i,j j,i are a given a quadratic for whichA is asymmetric, we can symmetrize it as: 1 T T T T T ˜ f(x) =x ( (A+A ))x+b x+c =x Ax+b x+c (54) 2 1 T ˜ and useA = (A +A ) instead. You should confirm for yourself that this is equivalent to the 2 original quadratic. As before, we can convert the quadratic to a form that leads to clearer interpretation: T f(x) = (x−μ) A(x−μ)+d (55) 1 −1 T −1 whereμ =− A b,d = c−μ Aμ, assuming thatA exists. Note the similarity here to the 2 1-D case. As before, this function is a bowl-shape in N dimensions, with curvature specified by 4 the matrixA, and with a single stationary point μ. However, fully understanding the shape of f(x) is a bit more subtle and interesting. 4.1 Optimizing a quadratic Suppose we wish to find the stationary points (minima or maxima) of a quadratic T T f(x) =x Ax+b x+c. (56) The stationary points occur where all partial derivatives are zero, i.e., ∂f/∂x = 0 for all i. The i gradient of a function is the vector comprising the partial derivatives of the function, i.e., T ∇f≡ ∂f/∂x ,∂f/∂x ,...,∂f/∂N . (57) 1 2 T At stationary points it must therefore be true that∇f = 0,...,0 . Let us assume that A is symmetric (if it is not, then we can symmetrize it as above). Equation 56 is a very common form of cost function (e.g. the log probability of a Gaussian as we will later see), and so the form of its gradient is important to examine. Due to the linearity of the differentiation operator, we can look at each of the three terms of Eq.56 separately. The last (constant) term does not depend onx and so we can ignore it because its derivative is zero. Let us examine the first term. If we write out the individual terms within the 4 A stationary point means a setting ofx where the gradient is zero. Copyright c 2011 Aaron Hertzmann and David Fleet 18CSC 411 / CSC D11 Quadratics vectors/matrices, we get:    a ... a x 11 1N 1    . . . . . . . . (x ...x ) (58) 1 N  .   . . . a ... a x N1 NN N =(x a +x a +...+x a x a +x a +... (59) 1 11 2 21 N N1 1 12 2 22   x 1   . . ...+x a +x a +...+x a ) (60)   1 1N 2 2N N NN . x N 2 2 =x a +x x a +...+x x a +x x a +x a +...+x x a +... (61) 11 1 2 21 1 N N1 1 2 12 22 N 2 N2 1 2 2 ...x x a +x x a +...+x a (62) 1 N 1N 2 N 2N NN N X = a xx (63) ij i j ij th The i element of the gradient corresponds to ∂f/∂x . So in the expression above, for the i terms in the gradient corresponding to each x , we only need to consider the terms involving x i i (others will have derivative zero), namely X 2 x a + xx (a +a ) (64) ii i j ij ji i j6=i The gradient then has a very simple form:  T X ∂ x Ax = 2xa + x (a +a ). (65) i ii j ij ji ∂x i j= 6 i We can write a single expression for all of thex using matrix/vector form: i T ∂x Ax T = (A+A )x. (66) ∂x You should multiply this out for yourself to see that this corresponds to the individual terms above. If we assume thatA is symmetric, then we have T ∂x Ax = 2Ax. (67) ∂x T This is also a very helpful rule that you should remember. The next term in the cost function,b x, has an even simpler gradient. Note that this is simply a dot product, and the result is a scalar: T b x =b x +b x +...+b x . (68) 1 1 2 2 N N Copyright c 2011 Aaron Hertzmann and David Fleet 19

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.