Advanced Machine Learning Lecture Notes

machine learning lecture notes in artificial intelligence. and advanced machine learning technologies, machine learning pattern recognition difference pdf free
AbbieBenson Profile Pic
AbbieBenson,United States,Professional
Published Date:13-07-2017
Your Website URL(Optional)
Advanced Machine Learning Jacobs University Bremen Herbert Jaeger Lecture Notes Version status: Mar 29, 2013: corrected some URLs Sep 6, 2011: created by copy from the 2009 version Sep 26, 2011: corrected math typos in Section 2.9 Oct 4, 2011: extended and corrected Sections 4.4.1, 4.4.2 Oct 7, 2011: improved notation and explanations in 4.4.2 Oct 9, 2011: thorough workover and extension of entire 4.4 Oct 13, 2011: added paragraphs on linear regression in 4.4 Nov 1, 2011: small add-ons and improvements in beginnings of HMM section Nov 3, 2011: replaced "measure space" by "observation space" throughout Nov 11, 2011: added Section 10.6 on sequence model comparison Nov 24, 2011: corrected and improved equations 8.10, 8.11 Jan 11, 2012: corrected typos in Sections 1 5 1 1. Introduction 1.1 What this lecture is – unfortunately – NOT about What do you see? (From an ad for "Fiddler on the Roof", Don't behave like an image processing algorithm. I think you can "see" much more: What's sticking out in the left lower corner? What's the expression on the face of that man? What will be his next movement? How does the hat smell? Other questions you could answer... How did your answers get there where they come from? Human learning: uncomprehensively rich and complex, involving aspects of • body growth • brain development • motion control • exploration, curiosity, play • creativity • social interaction • drill and exercise and rote learning 2 • reward and punishment, pleasure and pain • evolution • dreaming • remembering • forgetting • & 1000 more ... • and all is integrated into your person, makes up your individual personality You are what you learnt. What must be in place for you to become yourself through learning? • The universe, the earth, the atmosphere, water, food, caves • Your body, brain, sensor & motor apparatus • Physiology and neurophysiology • Evolution • Other people, living • Other people, long dead • Machines, tools, buildings, toys • Words and sentences • Concepts and meanings • Letters and books • Traditions • Schools Your and your learning are part of the world's development By the way, does evolution learn? Is, what is learnt by an individual, affecting evolution? Can the results of learning be passed on to siblings genetically? The Baldwin effect says, yes it can, albeit indirectly. Excerpt from At the turn of the century, it was unclear whether Darwin's theory or Lamarck's better explained evolution. Lamarck believed in direct inheritance of characteristics acquired by individuals during their lifetime. Darwin proposed that natural selection coupled with diversity could largely explain evolution. Darwin himself believed that Lamarckian evolution might play a small role in life, but most Darwinians rejected Lamarckism. One potentially verifiable difference between the two theories was that Darwinians were committed to gradualism (evolution in tiny, incremental steps), while Lamarckians expected occasional rapid change. Lamarckians cited the gaps in the fossil record (which are now associated with punctuated equilibria) as supporting evidence. Lamarckism was a viable theory until August Weismann's (1893) work was widely accepted. Weismann argued that higher organisms have two types of cells, germ cells that pass genetic information to offspring and somatic cells that have no direct role in reproduction. He argued that there is no way for information acquired by somatic cells to be transmitted to germ cells. 3 In the context of this debate, James Mark Baldwin (1896) proposed "a new factor in evolution", whereby acquired characteristics could be indirectly inherited. Morgan (1896) and Osborn (1896) independently proposed similar ideas. The "new factor" was phenotypic plasticity: the ability of an organism to adapt to its environment during its lifetime. The ability to learn is the most obvious example of phenotypic plasticity, but other examples are the ability to tan with exposure to sun, to form a callus with exposure to abrasion, or to increase muscle strength with exercise. Baldwin (1896) pointed out that, among other things, the new factor could explain punctuated equilibria. The Baldwin effect works in two steps. First, phenotypic plasticity allows an individual to adapt to a partially successful mutation, which might otherwise be useless to the individual. If this mutation increases inclusive fitness, it will tend to proliferate in the population. However, phenotypic plasticity is typically costly for an individual. For example, learning requires energy and time, and it sometimes involves dangerous mistakes. Therefore there is a second step: given sufficient time, evolution may find a rigid mechanism that can replace the plastic mechanism. Thus a behavior that was once learned (the first step) may eventually become instinctive (the second step). On the surface, this looks the same as Lamarckian evolution, but there is no direct alteration of the genotype, based on the experience of the phenotype. This effect is similar to Waddington's (1942) "canalization". The Baldwin effect came to the attention of computer scientists with the work of Hinton and Nowlan (1987). The Baldwin effect may arise in evolutionary computation when a genetic algorithm is used to evolve a population of individuals that also employ a local search algorithm. Local search is the computational analog of phenotypic plasticity in biological evolution. In computational terms, in the first step of the Baldwin effect, local search smooths the fitness landscape, which can facilitate evolutionary search. In the second step, as more optimal genotypes arise in the population, there is selective pressure for reduction in local search, driven by the intrinsic costs associated with the search. 1.2 What this lecture is about This lecture is about inductive learning: given observations / experiences / examples, derive a model / description / concept / rule (we'll stick to model) Most important examples: • Categorization (I see a man who wears a hat) • Second-to-second action result expectations (if I let go of this pen, it will fall down) • Skills, professional and everyday • Everyday theories ("plants grow because rain falls and sun shines") • Models in the natural sciences Some might claim that induction is learning is induction and that Science is The Big Learning, but we have learnt otherwise a few minutes ago. This lecture is about inductive learning performed by algorithms: given a data set ("training data", "sample"), condense it into a data structure that can be used for various purposes, such as simulation, prediction, filtering (de-noising, transformation), classification, failure monitoring, pattern completion. 4 Examples: • Given a set of plant descriptions with dozens to hundreds anatomical details per plant, derive an efficient classification scheme (as you find it in botanic field guides) • Given a sample of profile data records of customers, classify them into target groups for marketing. • Given a sample of profile data records of startup companies, predict the risk of insolvency for a new startup company, together with predicting the inaccuracy of your prediction. • Given 100,000 spam emails and 1,000 non-spam emails, create a spam filter. • Given audiorecordings of properly working gearboxes and of gearboxes shortly before failure, develop a monitoring system that warns you of imminent failure. • Given some thousand samples of hand-written postal codes, train an automated postal code recognizer. • Given a corrupted radio signal with echo and noise and amplifier distortion, distil a de- distortion filter (an equalizer) which undoes these effects and returns the "clean" signal; and, do this within milliseconds (your cellphone does it). • Read a long text aloud into a microphone. From that recording, train a personal dictating system that transforms your speech into text. When the system makes mistakes, speak the mistyped word a few times and let the system adapt. • Given test data from a combustion engine run under many different working conditions, learn an automated control device that minimizes fuel consumption. • Given a team of soccer robots and a lot of test game recordings, develop a behavior control module that lets the team win. In this lecture, you will learn techniques that enable you to start coping with most of the examples. You see, it's useful. But it's a far cry from the totality of human learning. And it involves statistics... Don't be afraid, though, it's so powerful that statistics turns into a good feeling. 1.3 What this lecture isn't about either This lecture is mostly about inductive learning of numerical models. The methods we will see have mostly been developed in the fields of pattern recognition, signal processing, neural networks and statistical decision making. The books of Bishop and Duda/Hart/Stork are representative (see references list on course homepage). 5 This is one big portion of the field of machine learning. Another big portion is concerned with learning symbolic models, for instance, deriving sets of logical rules or even little computer programs from training data. Such techniques often have a typical "artificial intelligence" flavour. Here the book of Mitchell is a good source. Finally, there are two further big fields of machine learning that we will not touch: genetic algorithms / evolutionary optimization, and reinforcement learning. The former installs artificial evolutionary processes to finding (sometimes amazingly "creative") solutions to complex optimization tasks, the latter deals with leraning optimal action policies for autonomous agents when during training only scarce reward / punishment signals are provided. Mitchell's book gives short intros for both. 1.4. Technical remarks The course homepage is at There you can find this script, references, exercise sheets, software etc. 6 2. A first tour of topics 2.1. Introducing the Digits example The digits dataset and all Matlab routines used in this section are available at If you want to play around with this example, go ahead In order to get a feeling for the issues of ML, in the first few sessions we will informally consider a simple learning task and learn about some crucial themes of that will turn up again and again throughout the course. Assume you want to train an algorithm to distinguish the written symbols "1" and "0". Part of the training material might look like the first two lines in the following figure: Figure 2.1: part of the digits training sample (created from a benchmark dataset donated by Robert Duin, orginally retrieved from databases/mfeat/mfeat-pix, now also at and Technically speaking, these samples are two-dimensional arrays of size 15x16, or alternatively vectors x of length 240, where i = 1, ..., N and N is the size of the training set. i The values of the vector components indicate greyscale values. In our digits dataset, these 7 values are integers ranging from 0 to 6, and for "zero" and "one" patterns this data set contains 200 samples, so here N = 400. In addition, this training set contains the information whether some input vector is a zero or a one. We code this by introducing a class label y ∈ 1, 2, where y = 1 if x represents a i i i "zero" and y = 2 if x represents a "one". Thus, your training data is a set of labelled samples: i i (2.1) (x , y ) i i i = 1,..., N What in the end you want to have is an algorithm that accepts as input some (new) alternatively vectors x and produces as output a "1" or a "2" according to whether the input vector represents a one or a zero. Formally, such an algorithm implements a binary decision function ˆ (2.2) y = f (x), where y ∈ 1, 2. We generally use the hat, , to mark functions or variables that are learnt, or − as statisticians ˆ would say with equal right − estimated from data. In writing y = f (x), we indicate that we ˆ believe that there is a correct (but unknown) decision function f, of which f is an estimate obtained through some learning method (or even by inspired hand design). You might find it easy to explicitly write a little algorithm, inspired by insight into the nature ˆ of 0's and 1's, that implements some reasonable f . The zero and one pictures have some very distinctive characteristics that you might be able to exploit. But you might find it much more difficult to come up with a program that distinguishes between ones and twos, or between ones an sevens Even humans are sometimes unsure about a classification. Look at the last "one" example in Figure 2.1. If you wouldn't know it is a "one", you might be unsure whether it might not also be a "zero". Facing this intrinsic uncertainty of classifications, a more adequate type of classification algorithm would not return on input x a single class label, but should rather return a hypothesis vector: (P(y = 1 x), P(y = 2 x)). Because P(y = 2 x) = 1 − P(y = 1 x), one of the two hypothesis components is redundant, and it is enough if the probability p = P(y = 1 x) is returned. That is, such an algorithm would implement a hypothesis function ˆ (2.3) p = f (x), where p ∈ 0,1. The explicit design of a classification algorithm soon becomes tedious if not impractical, and it would be hard to write an algorithm that returns reasonable hypotheses instead of just a binary decision. Bright idea: write a learning algorithm instead, that reads the training samples (2.1) and automatically generates a function f of the kind (2.2) or (2.3) 8 2.2. The curse of dimensionality We will now try to design an ad hoc, brute-force learning algorithm L for a decision function f of type (2.2). Since in the dataset we are using, vectors x coding for digits have integer values ranging between 0 and 6, the input space for f is a 240-dimensional, integer-valued cube X with edge length 6. Our (first, crude) approach to design L is learning by frequency counts: • Partition X into a number of subcubes X , where j = 1, ..., d. j • For each subcube X , L considers all training samples (x , y ) with x ∈ X and counts j i i i j how many of them are "zeros" (= N (j) ) and how many are "ones" (= N (j) ). 0 1 • Store all these N (j), N (j). 0 1 ˆ • f (x) is then defined as follows: o Determine index j of the subcube with x ∈ X . j o If N (j) N (j) return 1. 0 1 o If N (j) N (j) return 2. 0 1 o If N (j) = N (j) return ?. 0 1 Hopeless Why? The coarsest reasonable partition of X into subcubes X would halve the edges of X and create j 240 subcubes of edge length 3. This would give 2 subcubes That is, only in a vanishingly small number of subcubes would we find a training sample at all. In the vast majority of subcubes, N (j) and N (j) would be 0 and f wouldn't be defined. Even if we had a number N of 0 1 training samples amounting to the number of atoms in the universe, the situation would hardly improve (plus, we would run into storage problems). So our naive approach fails completely. This situation is common to many ML tasks and is known as the Curse of dimensionality: Training samples (x , y ) that are high-dimensional vectors populate i i their input vector space X so thinly that it is meaningless to use them directly to define a ˆ decision or hypothesis function f over X. 2.3 Using features A way to escape from the curse of dimensionality: use a few features instead of original input vectors. The function of features is to reduce the dimensionality of the input space. Technically, a feature F is a one-dimensional function over the original input vectors x. Examples in the digit domain: 9 T • F (x) = sum of components of x / 240 = 1 x / 240, where 1 is the 240-vector of all 1 T ones and denotes transpose (in this script, all vectors are assumed to be column vectors unless noted otherwise). This is the average brightness of the digit picture. Might be useful to distinguish between "ones" and "zeros" because "zeros" should use more ink than "ones". • F (x) = 0 if in the middle row of the digit picture there are some black pixels followed 2 by some white pixels followed again by some black pixels, and = 1 else. Might single out "zeros" from "ones" because F(x) = 0 seems typical for "zeros". (But look at the fourth "one" image in Fig. 2.1). • Construct F as follows. First, create a pixel image with a clean, clear, "one". 3 T Transform this image into a vector x . Put F(x) = x , x = x x. This is best1 best1 best1 the inner product between x and x . Intuitively, F measures the overlap of a best1 3 pattern x with the prototype x , and thus should be large for patterns x representing best1 "ones" and small for patterns representing "zeros". There are very many ways to define useful features, and we will learn about more of them. Features are a tool for dimensionality reduction. If you have some few (hopefully relevant) features F , ..., F , rewrite your high-dimensional input patterns x into low-dimensional 1 m feature vectors F(x) = (F (x), ..., F (x)). This transformation is called feature extraction. Use 1 m (F(x ), y ) as training material. The learning task is then to distil from this material a i i i = 1,..., N ˆ function y = f (F(x)) that gets as input feature vectors and returns classification labels. Finding "good" features for dimensionality reduction is a key factor in many ML tasks. It is also something of an art, because the "good" features are often task-specific and require some insight into the "nature" of the data (compare F from the list above). Unfortunately, there is 2 no known universal way to define the "best" features for a given learning task (but there are good working solutions that are quite generally applicable). Using low-dimensional features, our ad hoc approach of learning by frequency counts suddenly makes sense. In the simplest case, we only use a single feature. The space X on which f operates is now one-dimensional (the dimension of the feature value). Using feature F and 10 subcubes (= subintervals here) we get the following histogram for 3 N (j), N (j): 0 1 10 N (j) (blue, left columns) 0 and N (j)(red) 1 F (x) 3 Figure 2.2. Frequency counts for "zero" and "one" training patterns w.r.t. feature F 3 ˆ It becomes clear from this figure that a decision function f based on this feature alone would classify the majority of the training examples correctly but would also yield some misclassifications. In fact, 5.25 per cent of the training samples would be misclassified – apparently more misclassifications than a human would make. We will learn to do much better. 2.4 Regression and time series prediction. Introducing the Mackey-Glass attractor example. The digit recognition task is a classification task. The training patterns for a classification task are of the kind (x , y ), where y is a discrete-valued class label. "Discrete-valued" here means i i i that there are only finitely many possible values y can take. i If the y in the training patterns are allowed to take "analog", real-number values, then the i training patterns (x , y ) are best seen as argument-value pairs of some function, that is, y = i i i f(x ), and the learning task becomes one of estimating the function from training examples, i ˆ that is, to obtain some y = f (x). Estimating real-valued functions from training data is called regression. In a sense, classification is just a special case of regression, where the function f is intended to take only discrete (class label) values. Often, the training patterns come from noisy measurements. Figure 2.3 shows a noisy "measurement" of a linear and the square function; the crosses mark the (x , y ) training i i patterns. 11 Figure 2.3: A linear and the square function (solid lines), represented by "noisy observations" (crosses). If one wishes to obtain a linear function (as in the left panel of Fig. 2.3), one speaks of a linear regression; otherwise, of a nonlinear regression. The term "regression" typically means that one deals with noisy training patterns, but sometimes it is also used for noise-free training samples. (While we are at it: the term "sample" may refer to a single training instance (one red cross in Fig. 2.3), but is likewise used to denote the complete training set – don't blame me for this confusion.) Regression tasks abound in practical applications. Here are some examples: • Predict the time a cancer patient will live. Training samples (x , y ): x is a vector i i i containing diagnostic measurements of patient i, y is the time this patient lived after i diagnosis. • Assess the quality of a production batch leaving a chemical factory. In training samples (x , y ), x is a vector of measurements made during the production process, y i i i i is a vector of quality criteria describing the end product. This is not only interesting to predict the quality of an end product currently being processed, but also (applied in reverse) for finding "manufacturing tuning parameters" x that will yield a product with certain desired characteristics y. • Image restoration: Training samples (x , y ) are pairs of a corrupted image x and its i i i non-corrupted form y . Here, both x and y are grayscale vectors. i i i Regression is also key in a type of problem where one would not, at first sight, expect it: time series prediction. We will see now how time series prediction can be seen as a regression task. By way of example, we consider a time series which is generated by a deterministic equation, namely, the Mackey-Glass (MG) equation. The MG equation was first introduced as a model 1 for the onset of leukaemia . It describes a chaotic system and has been used extensively as a benchmark system in time series prediction research, so there is a big number of articles available that treat the prediction task for the MG system and which allow one to compare one's own prediction performance with the state of the art. Originally, the MG equation is a 1 J. McNames, J. A. K. Suykens, J. Vandewalle, Int. J. of Bifurcation and Chaos 9, 1485 (1999) 12 so-called delay differential equation. When it is discretized with stepsize δ (that is, changed from continuous time t to discrete time n), one gets an approximate update equation that looks like this:  0.2x(n −τ /δ)  (2.4) x(n+1)=x(n)+δ −0.1x(n)   10 1+x(n −τ /δ)   This equation describes how the value x(n+1) can be computed from previous values. The delay parameter τ is usually set to 17 or to 30, that is, x(n+1) depends on the previous value x(n) and the value 17 update steps before n. The larger τ , the more chaotic the resulting time series. Fig. 2.4 shows how Eq. (2.4) evolves for τ = 17 and τ = 30 (a stepsize of δ = 0.1 was used). For values of τ less than 17, the resulting time series is not chaotic but only periodic. Figure 2.4: Evolution in time of the MG system for delays τ = 17 and τ = 30. Time series prediction tasks. In the case of discrete time n, and a one-dimensional time series (x ) , a time series prediction task is set up as follows: n n =1,2,3,... • Given: an initial sequence x , x , ..., x . generated by some dynamical system (in our 1 2 N case, Eq. (2.4)). • NOT given: knowledge of the generating system x ˆ x ˆ • Wanted: an estimated continuation of this sequence , , ... N+1 N+2 For the MG example, the initial sequence might for instance consist in the first 400 points of Fig. 2.4 (left), and the prediction task would be to generate the next points. Here are some other important time series prediction tasks: • Predict the currency exchange rate, given the currency charts up to NOW. • Predict the weather. • Predict the next move of your opponent in a game (for instance, chess or real-life warfare – yes, that's being tried) • Predict where a comet is heading (will it hit us?) 13 • Predict the next sounds in a speech utterance (this is important for automated speech processing systems, because such predictions are required to "filter" hypotheses about what has actually been said). ˆ • This is a learning task in the sense that in order to create the continuation x , N+1 ˆ x , has to induce (= "learn") from the given initial sequence x , x , ..., x a N+2 1 2 N model M of the (unknown) generating system, and then use M to compute the continuation. What kind of mathematical thing is M? There are many, many possibilities. In fact, M might take the form of any mathematical/algorithmical object capable of generating a time series. For instance, M might be some kind of deterministic automaton, or a random system like a Markov chain, or a game ˆ strategy, or simply an update equation of the form x(n+1) = f (x(n)). A good deal of ingenuity and intuitive insight is typically required to settle for a "suitable" type of algorithmical object for a given time series prediction task. Now we will see how the task of time series prediction can be cast as a regression task. The ˆ idea is to establish M as a regression function f (which has to be learnt from the training data x , x , ..., x ). Consider a short section of the MG series (Fig. 2.5) for an illustration: the idea 1 2 N ˆ is to learn a function f which takes as arguments some previous, already known values of the time series and computes the next value from those. Figure 2.5: One way to specify a mechanism M capable of creating a time series: compute next value (square) as a function (arrows) of some previous values (circles). 14 Typically, when one uses this approach, one takes a relatively small number of preceding points (order of 3 to 10), which are spaced at regular intervals (in Fig. 2.5., with 2 points in ˆ between). Formally, M is an update equation f of the following kind: ˆ (2.5) x(n +1) = f (x(n), x(n − d), x(n − 2d), ... x(n − kd)) At this point we make a little digression and describe a curious and extremely useful and 2 relatively recent mathematical insight – the Takens theorem . Consider some dynamical system that is governed by a differential equation of the form  (2.6) x=h(x) , where x is a vector from a possibly high-dimensional vector space X. Assume further that the dynamics described by this equation is confined to a compact submanifold A of X. Basically, this means that the dynamics does not drive the system state to infinite values.This is typically given for real-life physical systems. Furthermore, let the dimension of A be d . A typical A occurence in very-high-dimensional physical systems is that dim(X) d . This is due to the A fact that often the very many componentes variables of x are coupled through the dynamics (2.6), a condition which makes them change in time in a coordinated fashion, which effectively shrinks the dimension of the manifold A of X where the action x(t) happens. This would for instance be expected in brain states or in many biomechanical systems. But still, even with the dimensionality reduction afforded by mutual couplings of variables, d still may A be quite large in many systems of interest. In many cases where researchers face complex dynamical systems of the kind (2.6), they face a situation where the following two facts are at odds: 1) The researcher wants to understand the dynamics of the system – for instance, whether it is periodic, chaotic (and if so, how "badly" chaotic), whether it has stable equilibria and so forth – in short, s/he wants to know about the qualitative properties of h. 2) The researcher cannot measure the high-dimensional state vector x, nor some lower- dimensional transform of it that covers the dynamics on A. (Imagine, for instance, that x contains the zillions of neural activations of a brain state, even given that many neurons act in some coordinated way, d still would be quite large; or imagine the hundreds of A concentrations of reactants in a complex chemical reaction). In many cases, all that the researcher has in her hands is just the value of one single measurement (or observation) variable y(x(t)) which is a function of the underlying high- dimensional physical state. (For instance, y(x(t)) might be a voltage reading of an extracranial electrode for measuring brain activity, or it might be the concentration of a single, easy-to- measure reactant in a chemical reaction). 2 Original paper: F. Takens (1991), Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence (Rand, D.A. and Young, L.-S., eds.), Springer LN Mathematics 898, 366-381. More recent extensions: Stark, J., Broomhead, D.S., Davies, M.E., Huke, J., Takens embedding theorems for forced and stochastic systems. Nonlinear Analysis, Theory, Methods & Applications 30 (8), 1997, 5303-5314 Websites: try "Takens theorem" on Google and you will immediately find pages with a rigorous statement of Takens theorem. 15 Now the Takens theorem comes to your help It says that under certain conditions, the dynamics x(t) of the original state vector is essentially identical to the dynamics of a delay embedding of the observation variable, that is, to the dynamics of a vector of the kind (2.7) y(t) = (y(t), y(t - τ), y(t - 2τ), ..., y(t – (k−1)τ)). Some explanations: • The phrase, "the dynamics is essentially identical", intuitively means that the dynamics of x(t) and of y(t) exhibit the same qualitative properties, for instance same number of equilibrium points, same number of possible oscillation patterns, etc. Technically speaking, the dynamics of y(t) evolves on some compact manifold y(A), and there is a differentiable embedding of y(A) into A which maps the dynamics of y(t) on the dynamics of the original states x(t). • The dimension k of y(t) is called the embedding dimension. Takens theorem states that if k is chosen to be at least 2 d + 1, one obtains "essentially identical dynamics". A • Since Takens' pioneering paper, numerous embedding theorems of the same flavour 3 have been found. Specifically, a theorem that goes back to Sauer et al states that if the dynamical system in question is a chaotic attractor of (fractal) dimension d, then the system can be reconstructed generically in delay embedding coordinates of dimension k 2d. • The value of the delay τ must be chosen with some care to get good results in practical applications. • For many dynamical systems (even when the system equation is known), it is not possible to compute the embedding dimension analytically. Approximate numerical methods to estimate this dimension from one-dimensional observation sequences y(t) have been developed; they are often somewhat hairy and need a good deal of mathematical understanding to yield reliable estimates. Takens theorem can be re-phrased as follows: If you observe a high-dimensional dynamical system over time through a single measurement variable, you can reach full knowledge about the "hidden" high-dimensional dynamics if you take a delay embedding of your observation variable instead of the original state. Takens theorem has been mis-used in the last decade in sometimes extreme ways, which disregarded the "small print" of the technical prerequisites for its application, and often disregarded the malicious influence of noise in observations (the theorem holds strictly only for noise-free observations). It is a seductive but also dangerous theorem ˆ And this is the connection of Takens/Sauer's theorem with our update function f (Eq. (2.5)) for the Mackey-Glass system. The original MG equation is a so-called delay differential equation. Technically, this implies that the dynamics governed by the MG equation uses an infinite-dimensional state space X However, after the dynamics converges on the final MG attractor (that is what we see in Fig. 2.4), the (fractal) dimension of the attractor is small (about 3 for a delay of τ = 30). Because in Eq. (2.4) we took a discrete-time approximation to the original continuous-time MG equation, the dimension of X in this discrete-time approximate system is not infinite but equals the delay, that is, it is dim(X) = 17 or 30. If we 3 T. Sauer, J. Yorke, and M. Casdagli, Embedology, J. Stat. Phys. 65, 579 (1991). Preprint at 16 consider the dynamics only on the final attractor, it again is a fractal of dimension 3 for τ = 30. Although Eq. (2.4) superficially looks like a one-dimensional difference equation of the type x(n+1) = h(x(n)), it actually specifies a high-dimensional system (of dim 17 or 30) due to the delay that appears in it. The variable x of Eq. (2.4) should actually be seen not as a state variable, but as an observation variable of the kind y(x(t)) as described above. For the MG attractor, embedding dimensions of k = 7 or 8 are often used. ˆ The reason why it works out to implement M as a function f that takes some previous values to compute the next, as in Figure 2.5., now becomes clear: the previous values provide a delay embedding state that is "essentially equivalent" to the true (unknown) 3-dimensional attractor state. So it is no coincidence that in Fig. 2.5, the number of previous points is 7: it is actually the minimal number required (for τ = 30). Tricky question: Equation (2.4) uses only 2 previous points to compute the next. So why is the embedding dimension not just 2? Assuming τ = 30 and δ = 1, the form of Eq. (2.4) is x(n+1) = g(x(n), x(n−30)). So why shouldn't one be able to learn some g ˆ ( x(n), x(n−30)), that is, why shouldn't an embedding dimension of 2 be sufficient? Answer: yes, it is possible to learn g ˆ ( x(n), x(n−30)) from training data. BUT one has to divine the correct delay value of 30 first in order to make this work Eq. (2.4) has no equivalent version that uses a different delay value. By contrast, using the 7-dimensional delay embedding suggested by Takens' theorem is in principle insensitive to the choice of the embedding delay duration (although some delays work more efficiently than others). Because there is no known way of guessing the correct original delay from training data, one has to apply Takens' theorem. 2.5 Analytical vs. blackbox modelling Things are relatively simple if one knows, from insight or analysis, that the given initial time series has been generated by a mathematical object of a particular type. For instance, in predicting the trajectory of celestial objects the type of governing equations is known since Newton's days. If one knows the form of the equation, the learning task boils down to fix the values of the few "free parameters" in the equation (for instance, the comet's mass). Or in a chess game prediction, one might settle for one's favourite game-theoretic formalism to write up a template for M; this would then have to be filled with the current opponent's "personality parameters", which would have to be induced from the known training data. In many engineering and chemical processing tasks – and in weather prediction – , a well-guided guess about the form of the governing equations can be made. In such cases where the basic form of M is known, one speaks of analytical models (or physical models). Learning boils down to fitting a small number of parameters to the individual case at hand. In many cases, however, analytical models are ruled out. For instance, it is close to impossible to invent "suitable" equation templates for stock market developments, or for brain or cardial recordings (medical and biological time series in general), generally, biological, economical, social systems are hard to press into an equation. In other cases, analytical models might be obtainable, but would be too complex for any practical dealings. This is a typical situation when one is confronted with complete, complex technical systems, like 17 power plants, combustion engines, manufacturing lines and the like. Besides the sheer complexity of such systems, analytical models would present difficulties because they would have to integrate several types of formalisms: For instance, modeling a combustion engine would require an amalgamation of chemical, thermodynamic, and mechanical formalisms. If analytical models are inaccessible, the model M has to be established in some "generic" formalism that is capable of mimicking basically any kind of dynamical system, without paying respect to the underlying physical mechanism. The only requirement about M here is that M can produce the same input-output time series as the original generator. This is the approach of blackbox modeling: the original system is seen as an intransparent ("black") device whose internal workings cannot be understood; only the externally observable measurements are available, and only they are "modeled" by M. The best known type of blackbox model M is probably neural networks, and we will learn a lot about them. However, there are other important types of blackbox models, which I will at least name here: support vector machines (very fashionable these days), mixtures of Gaussians, Fourier decompositions, Taylor expansions, Volterra expansions, Markov chains and hidden Markov models. All of these models can accomodate to a wide range of systems, and all of them typically require a very large number of free parameters to be fixed through the learning process. The general picture of a blackbox modeling task is shown in Fig. 2.6. Figure 2.6: General scheme of blackbox modeling (here, for time series). 18 2.6 The bias-variance dilemma (Adapted from Bishop Sec. 1.5) We have encountered two elementary learning tasks, • classification tasks with training data (x , y ), where y ∈ 1, ..., n are taken from a i i i ˆ finite set of class labels and a classification function f (x) is sought, • regression tasks with training data (x , y ), where y is a real-valued vector and a i i i ˆ regression function f (x) is sought. Both types of tasks have a similar basic structure, and both types of tasks have to face the same fundamental problem of machine learning: the bias-variance dilemma. We will introduce it with the simple regression task of polynomial curve fitting. Let's consider a one-dimensional input, one-dimensional output regression task of the kind where the training data are of form (x , y ). Assume that there is some systematic relationship i i y = f(x) that we want to recover from the training data. We consider a simple artificial case where the x range in 0, 1 and the to-be-discovered relationship is y = sin(2 π x). The training i data, however, contain a noise component, that is, y = sin(2 π x ) + ν , where ν is drawn from i i i i a normal distribution with zero mean and standard deviation σ. Fig. 2.7 shows a sample (x , i y ), where eleven x are chosen equidistantly. i i Fig. 2.7: An example of training data (red squares) obtained from a noisy observation of an underlying "correct" function sin(2 π x) (dashed blue line). ˆ We now want to solve the task of learning a good approximation f for f from the training data (x , y ) by applying polynomial curve fitting, an elementary technique you might be i i surprised to meet here as a case of machine learning. Consider an M-th order polynomial M M j (2.8) . p(x)=w +wx++w x = wx ∑ 0 1 M j j=0 19 We want to approximate the function given to us via the training data (x , y ) by a polynomial, i i that is, we want to find ("learn") a polynomial p(x) such that p(x ) ≈ y . More precisely, we i i want to minimize the mean square error on the training data 1 N 2 (2.9) MSE = (p(x)−y) train ∑ i i i=1 N by a good choice of p(x) (here N is the number of training samples, in our example N = 11). If we assume that the order M of the polynomial is given, minimizing (2.9) boils down to finding polynomial coefficients w such that j N M 1 j 2 (2.10) MSE = ( w x −y) train ∑ ∑ j i i=1 j=0 N is minimized. At this moment we don't bother how this task is solved computationally but simply rely on the Matlab function polyfit which does exactly this job for us: given training data (x , y ) and polynomial order M, find the polynomial coefficients that minimize i i (2.10). Fig. 2.8 shows the polynomials found in this way for M = 1, 3, 10. Fig. 2.8: Fitting polynomials (green lines) for polynomial orders 1, 3, 10 (from left to right) If we compute the MSE's (2.10) for the three orders 1, 3, 10, we get MSE = 0.4852, train 0.0703, 0.0000. Some observations: • If we increase the order M, we get increasingly better MSE . train • For M = 1, we get a linear polynomial, which apparently does not represent our original sine function well. • For M = 3, we get a polynomial that hits our target sine apparently quite well. • For M = 10, we get a polynomial that perfectly matches the training data, but apparently misses the target sine function. We have stumbled over a phenomenon that will haunt us for the rest of this lecture: we ˆ apparently do NOT get the optimal estimate f = p(x) for f if we try to optimally fit the training data. This is our first encounter with the bias-variance dilemma, one of the fiercest enemies of machines in machine learning. We take a closer look at it. First we discuss what actually makes the M = 3 solution "look better" than the other two. If we knew the correct target function f(x) = sin(2 π x), it would be clear what "better" means: a 20

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.