



Supervised Machine Learning

[Course topic map: high-dimensional data (locality-sensitive hashing, clustering, dimensionality reduction), graph data (PageRank, SimRank, community detection), infinite data (filtering data streams, web advertising, queries on streams), machine learning (SVM, decision trees, perceptron and kNN, association rules), and applications (recommender systems, duplicate document detection, spam detection).]

Many algorithms are today classified as machine learning algorithms. These algorithms extract information from data and produce a summary of the data from which decisions are made. Machine learning algorithms learn a model (or classifier) from the data: they discover something about data that will be seen in the future.

For example, clustering algorithms allow us to classify future data into one of the clusters. Machine learning enthusiasts call this unsupervised learning. Unsupervised means that the input data does not tell the clustering algorithm what the clusters should be.

In supervised machine learning, the available data includes information about the correct way to classify at least some of the data. The data that is already classified is called the training set. Supervised learning is the main subject of today's lecture.

We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be:
- a real number (regression),
- a categorical value (classification),
- a complex object (a ranking of items, etc.).

The data is labeled: we have many pairs (x, y), where x is a vector of binary, categorical, or real-valued features and y is a class (+1, -1, or a real number). We estimate y = f(x) on the training set (X, Y) and hope that the same f(x) also works on unseen test data (X', Y').

Example: plot the height and weight of dogs in three classes: Beagles, Chihuahuas, and Dachshunds. Each pair (x, y) in the training set consists of a feature vector x of the form [height, weight], and the associated label y is the variety of the dog. An example of a training-set pair would be ([5 inches, 2 pounds], Chihuahua).

In the plot, the horizontal line represents a height of 7 inches and separates Beagles from Chihuahuas and Dachshunds. The vertical line represents a weight of 3 pounds and separates Chihuahuas from Beagles and Dachshunds. The algorithm that implements the function f tests a dog's height and weight against these two thresholds (a code sketch is given below). Is this supervised or unsupervised learning? Here we are performing supervised learning: the same data (weight and height) is augmented by classifications (the variety) for the training data.

Depending on the type of the label y:
- y is a real number: the ML problem is called regression.
- y is a boolean value, true or false (+1 and -1): the problem is binary classification.
- y is a member of some finite set (classes): the problem is multiclass classification.
- y is a member of some potentially infinite set, e.g., a ranking or other complex object.

Assume four data points: (1, 2), (2, 1), (3, 4), and (4, 3). Let these points be a training dataset, where the feature vectors are one-dimensional. That is, (1, 2) can be thought of as a pair where 1 is the feature vector x and 2 is the associated label y; the other points are interpreted accordingly.
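As promised above, here is a minimal sketch of the dog-variety classifier. The thresholds (7 inches, 3 pounds) come from the slides, but the exact form of the rule is an assumption reconstructed from the description of the two separating lines:

```python
def classify_dog(height_in: float, weight_lb: float) -> str:
    """Assumed decision rule implied by the two separating lines:
    height > 7 in                      -> Beagle
    height <= 7 in and weight <= 3 lb  -> Chihuahua
    height <= 7 in and weight > 3 lb   -> Dachshund
    (Hypothetical reconstruction; the rule drawn on the original slide may differ.)
    """
    if height_in > 7:
        return "Beagle"
    if weight_lb <= 3:
        return "Chihuahua"
    return "Dachshund"

# The training-set pair from the slides: ([5 inches, 2 pounds], Chihuahua)
print(classify_dog(5, 2))  # -> "Chihuahua"
```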
Suppose we want to learn the linear function f(x) = ax + b that best represents the points of the training set. What are the appropriate values of a and b? A natural interpretation of "best" is that the root-mean-square error (RMSE) of f(x) compared with the given values of y is minimized. That is, we want to minimize the sum of squared errors (which has the same minimizer as the RMSE):

$$(a + b - 2)^2 + (2a + b - 1)^2 + (3a + b - 4)^2 + (4a + b - 3)^2$$

Simplifying, the sum is:

$$30a^2 + 4b^2 + 20ab - 56a - 20b + 30$$

If we then take the derivatives with respect to a and b and set them to 0, we get:

$$60a + 20b - 56 = 0, \qquad 20a + 8b - 20 = 0$$

Therefore a = 3/5 and b = 1, i.e., f(x) = (3/5)x + 1. For these values the sum of squared errors is 3.2. (A short numerical check is given below.)

We will talk about the following methods: decision trees, perceptrons, support-vector machines, neural nets (neural networks), and instance-based learning. The main question is how to efficiently train, i.e., build a model and find its parameters.

Decision trees: the form of the function f is a tree. Each node of the tree has a function of x that determines to which child or children the search must proceed. Decision trees are suitable for binary and multiclass classification, especially when the dimension of the feature vector is not too large; large numbers of features can lead to overfitting.

Perceptrons: perceptrons are threshold functions applied to the vector x = (x_1, x_2, ..., x_i, ..., x_n). A weight w_i is associated with the i-th component, and there is a threshold θ. The output is +1 if $\sum_i w_i x_i \ge \theta$, and -1 otherwise. Perceptrons are suitable for binary classification.

Neural nets: neural nets are acyclic networks of perceptrons, where the outputs of some perceptrons are used as inputs to others. They are suitable for binary or multiclass classification, since there can be several perceptrons used as outputs, with one or more indicating each class.

Instance-based learning: uses the entire training set to represent the function f. An example is k-nearest-neighbor; for instance, 1-nearest-neighbor classifies data by giving it the same class as that of its nearest training example.

Support-vector machines: SVMs are an advance over the algorithms traditionally used to select the weights and threshold. The result is a classifier that tends to be more accurate on unseen data.

A decision tree is a decision support tool that uses a tree-like graph.

The formal measure of how concentrated into relatively few documents the occurrences of a given word are is called TF.IDF (Term Frequency times Inverse Document Frequency). Suppose we have a collection of N documents. Define f_ij to be the frequency (number of occurrences) of term (word) i in document j. Then define the term frequency TF_ij to be:

$$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

Suppose term i appears in n_i of the N documents in the collection. Then TF.IDF is defined as:

$$IDF_i = \log_2 \frac{N}{n_i}, \qquad \text{TF.IDF}_{ij} = TF_{ij} \times IDF_i$$

Example: suppose items are news articles, and features are the high-TF.IDF words in those documents. Also suppose there is a user who likes articles about baseball, except articles about the New York Yankees. The row of the utility matrix for this user has 1 if he has read the article and is blank if not. We shall take the 1's as "like" and the blanks as "doesn't like." The predicates will be boolean expressions of words.
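To make these definitions concrete, here is a minimal sketch that computes TF, IDF, and TF.IDF for a tiny collection of documents; the three example documents are made up purely for illustration:

```python
import math
from collections import Counter

# Hypothetical toy collection of N documents (for illustration only).
docs = [
    "the yankees won the game",
    "baseball news and the baseball scores",
    "the stock markets fell again",
]
N = len(docs)
term_counts = [Counter(d.split()) for d in docs]

def tf(term: str, j: int) -> float:
    """TF_ij = f_ij / max_k f_kj: frequency normalized by the most frequent term in doc j."""
    counts = term_counts[j]
    return counts[term] / max(counts.values())

def idf(term: str) -> float:
    """IDF_i = log2(N / n_i), where n_i is the number of documents containing the term."""
    n_i = sum(1 for counts in term_counts if term in counts)
    return math.log2(N / n_i) if n_i else 0.0

def tf_idf(term: str, j: int) -> float:
    return tf(term, j) * idf(term)

print(tf_idf("baseball", 1))  # ~1.58: frequent within doc 1, appears in only one document
print(tf_idf("the", 0))       # 0.0: appears in every document, so IDF = 0
```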
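Returning to the regression example worked earlier, here is a short numerical check of the least-squares fit on the four training points; numpy.polyfit is used only to verify the closed-form result a = 3/5, b = 1:

```python
import numpy as np

# Training set from the slides: one-dimensional feature x with label y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# Least-squares fit of f(x) = a*x + b (the same criterion as minimizing the RMSE).
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # -> 0.6 1.0, i.e. f(x) = (3/5)x + 1

# Sum of squared errors at the optimum.
sse = np.sum((a * x + b - y) ** 2)
print(sse)   # -> 3.2
```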
[Figure: a decision tree for the news-article example above, with YES/NO branches on word predicates.]

Example: spam filtering. The instance space is x ∈ X (n data points), with a binary or real-valued feature vector x of word occurrences: d features (words plus other things), d ≈ 100,000. The class is y ∈ Y, with y = +1 for spam and y = -1 for not spam.

Binary classification with a linear decision boundary:

$$f(\mathbf{x}) = \begin{cases} +1 & \text{if } w_1 x_1 + w_2 x_2 + \cdots + w_d x_d \ge \theta \\ -1 & \text{otherwise} \end{cases}$$

The input is vectors x^(j) and labels y^(j), where the vectors x^(j) are real-valued. The goal is to find the vector w = (w_1, w_2, ..., w_d), where each w_i is a real number. The decision boundary is the hyperplane w · x = 0 when the threshold is absorbed into w.

(Very) loose motivation: a neuron. Inputs are feature values, and each feature has a weight w_i. The activation is the sum f(x) = Σ_i w_i x_i = w · x. If f(x) is positive, predict +1; if negative, predict -1.

Perceptron: y' = sign(w · x). How do we find the parameters w?
- Start with w^(0) = 0.
- Pick training examples x^(t) one by one (from disk).
- Predict the class of x^(t) using the current weights: y' = sign(w^(t) · x^(t)).
- If y' is correct (i.e., y^(t) = y'), make no change: w^(t+1) = w^(t).
- If y' is wrong, adjust w^(t): w^(t+1) = w^(t) + η · y^(t) · x^(t), where η is the learning-rate parameter, x^(t) is the t-th training example, and y^(t) is the true t-th class label (+1 or -1).

(A code sketch of this training loop is given below.)

Perceptron Convergence Theorem: if there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge. How long would it take to converge?

Perceptron Cycling Theorem: if the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. How can we provide robustness and more expressivity?

Separability: some parameters classify the training set perfectly. Convergence: if the training set is separable, the perceptron will converge.

If the data is not separable, the perceptron will oscillate and won't converge. When should we stop learning?
(1) Use a fixed learning rate η.
(2) Slowly decrease the learning rate η; a classic way is η = c_1 / (t + c_2), but we also need to determine the constants c_1 and c_2.
(3) Stop when we have reached some maximum number of passes over the data.

What if there are more than 2 classes? Keep a weight vector w_c for each class c and train one class vs. the rest. Example: 3-way classification with y ∈ {A, B, C}. Train 3 classifiers: w_A: A vs. B, C; w_B: B vs. A, C; w_C: C vs. A, B. Calculate the activation w_c · x for each class; the highest activation wins.

Example: we want to classify Web pages into a number of topics, such as sports, politics, medicine, and so on. We can represent Web pages by a vector with 1 for each word present in the page and 0 otherwise. Each topic has certain words that tend to indicate that topic; for instance, sports pages would be full of words like "win," "goal," "played," and so on. The weight vector for that topic would give higher weights to the words that characterize it. A new page can be classified as belonging to the topic that gives the highest score when the dot products of the page's vector with the weight vectors for the topics are computed.
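A minimal sketch of the perceptron training loop described above; the fixed learning rate, the cap on the number of passes (stopping rule (3)), and the toy dataset are choices made for illustration:

```python
import numpy as np

def train_perceptron(X: np.ndarray, y: np.ndarray,
                     eta: float = 0.5, max_passes: int = 100) -> np.ndarray:
    """Perceptron update rule: if sign(w . x) != y, set w <- w + eta * y * x."""
    n, d = X.shape
    w = np.zeros(d)                      # start with w = 0
    for _ in range(max_passes):          # stopping rule (3): max passes over the data
        mistakes = 0
        for i in range(n):
            y_pred = 1 if w @ X[i] >= 0 else -1
            if y_pred != y[i]:           # misclassified: adjust w
                w += eta * y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # no mistakes in a full pass: converged
            break
    return w

# Tiny linearly separable example (hypothetical); a constant 1 feature absorbs the threshold.
X = np.array([[1.0, 2.0, 1.0], [2.0, 0.5, 1.0], [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # reproduces the training labels on this separable data
```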
Issues with the perceptron:
- Overfitting.
- No regularization: if the data is not separable, the weights "dance around."
- Mediocre generalization: it finds a "barely" separating solution.

Evaluation idea: pretend we do not know the data/labels we actually do know. Build the model f(x) on the training data, then see how well f(x) does on the test data. If it does well, then apply it also to the unseen X'. [Figure: the data (X, Y) split into a training set, a validation set, and a test set X'.]

Refinement: cross-validation. Splitting into a single training/validation set is brutal. Instead, split the data (X, Y) into 10 folds (buckets), take out 1 fold for validation and train on the remaining 9, repeat this 10 times, and report the average performance.

Perceptron properties:
- Online: it can adjust to a changing target over time.
- Advantages: simple; guaranteed to learn a linearly separable problem; an advantage when there are few relevant features per training example.
- Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features.

New setting: online learning. It allows for modeling problems where we have a continuous stream of data, and we want an algorithm that learns from it and slowly adapts to changes in the data. Idea: do slow updates to the model. The perceptron makes an update when it misclassifies an example. So: first train the classifier on training data; then, for every example from the stream, if we misclassify it, update the model (using a small learning rate).

Example protocol (shipping service): a user comes and tells us the origin and destination. We offer to ship the package for some amount of money ($10 to $50). Based on the price we offer, sometimes the user uses our service (y = +1), sometimes they don't (y = -1). Task: build an algorithm to optimize what price we offer to the users. The features x capture information about the user and about the origin and destination. Problem: will the user accept the price?

Model whether the user will accept our price: y = f(x; w), with accept: y = +1 and not accept: y = -1. Build this model with, say, a perceptron. The website runs continuously, so an online learning algorithm would do something like this: a user comes; she is represented as an (x, y) pair, where x is a feature vector including the price we offer, the origin, and the destination, and y records whether she chose to use our service or not. The algorithm updates w using just this (x, y) pair. Basically, we update the parameters w every time we get some new data. (A sketch of this update loop is given below.)

For a major website with a massive stream of data, this kind of algorithm is pretty reasonable. Why? We don't need to deal with all the training data at once. An online algorithm can also adapt to changing user preferences; for example, over time users may become more price sensitive. The algorithm adapts and learns this, so the system is dynamic.

We now want to separate "+" from "-" using a line. Data: training examples (x_1, y_1), ..., (x_n, y_n). Each example i has x_i = (x_i^(1), ..., x_i^(d)) with real-valued components x_i^(j), and y_i ∈ {-1, +1}. Inner product: $\mathbf{w} \cdot \mathbf{x} = \sum_{j=1}^{d} w^{(j)} x^{(j)}$. Which is the best linear separator (defined by w)?

The distance from the separating hyperplane corresponds to the "confidence" of the prediction. For example, for three points A, B, and C near the boundary, we are more sure about the class of A and B than of C.
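Returning to the shipping-price example, here is a minimal sketch of the per-example online update. The feature encoding, the learning rate, and the stream itself are hypothetical, and a real system would also need a strategy for choosing which price to offer:

```python
import numpy as np

def online_update(w: np.ndarray, x: np.ndarray, y: int, eta: float = 0.01) -> np.ndarray:
    """One online perceptron-style update: adjust w only if the example is misclassified."""
    if np.sign(w @ x) != y:
        w = w + eta * y * x
    return w

# w would first be trained offline on historical data (not shown),
# then updated one example at a time as the stream arrives.
w = np.zeros(4)  # hypothetical features: [price offered, origin code, destination code, bias]
stream = [
    (np.array([12.0, 1.0, 3.0, 1.0]), +1),   # user accepted the offered price
    (np.array([45.0, 2.0, 5.0, 1.0]), -1),   # user declined
]
for x, y in stream:
    w = online_update(w, x, y)
print(w)
```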
Margin γ: the distance of the closest example from the decision line/hyperplane. The reason we define the margin this way is theoretical convenience and the existence of generalization-error bounds that depend on the value of the margin.

Prediction = sign(w · x + b). "Confidence" = (w · x + b) · y. For the i-th datapoint, $\gamma_i = (\mathbf{w} \cdot \mathbf{x}_i + b)\, y_i$. We want to solve:

$$\max_{\mathbf{w},\,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge \gamma$$

Maximizing the margin is good according to intuition, theory (VC dimension), and practice. Here γ is the margin, the distance from the separating hyperplane, and we are maximizing it.

We want to estimate w and b. The standard way is to use a solver: software for finding solutions to "common" optimization problems, e.g., WEKA. Use a quadratic solver: minimize a quadratic function subject to linear constraints. Alternatively, use stochastic gradient descent (SGD); a sketch of the SGD approach is given below.

Example: the Reuters RCV1 document corpus. The task is to predict the category of a document using one-vs-the-rest classification, with n = 781,000 training examples (documents), 23,000 test examples, and d = 50,000 features (one feature per word, after removing stop-words and low-frequency words).

Questions: (1) Is SGD successful at minimizing f(w, b)? (2) How quickly does SGD find the minimum of f(w, b)? (3) What is the error on a test set? [Table: training time, value of f(w, b), and test error for a standard SVM, a "fast SVM," and SGD-SVM.] Findings: (1) SGD-SVM is successful at minimizing the value of f(w, b); (2) SGD-SVM is super fast; (3) the SGD-SVM test-set error is comparable.

Instance-based learning. Example: nearest neighbor. Keep the whole training dataset {(x, y)}. When a query example (vector) q comes, find the closest example(s) x and predict the corresponding y. This works both for regression and classification. Collaborative filtering is an example of a kNN classifier: find the k most similar people to user x that have rated movie y, and predict the rating of x as the average of their ratings.

To make nearest neighbor work we need four things: a distance metric, how many neighbors to look at, a weighting function (optional), and how to fit with the local points.

- 1-nearest neighbor: distance metric: Euclidean; neighbors: one; weighting function: unused; fit: just predict the same output as the nearest neighbor.
- k-nearest neighbors (e.g., k = 9): distance metric: Euclidean; neighbors: k; weighting function: unused; fit: just predict the average output among the k nearest neighbors.
- Kernel-weighted regression: distance metric: Euclidean; neighbors: all of them (∞); weighting function: $w_i = \exp\!\left(-\frac{d(\mathbf{x}_i, q)^2}{K_w}\right)$, so points near the query q are weighted more strongly (K_w is the kernel width); fit: predict the weighted average $\frac{\sum_i w_i y_i}{\sum_i w_i}$.
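A minimal sketch of the kernel-weighted prediction just described, reusing the four one-dimensional training points from the regression example; the query point and kernel width are arbitrary choices for illustration:

```python
import numpy as np

def kernel_regression(query, X, y, kernel_width):
    """Predict a weighted average of all training labels,
    with weights w_i = exp(-d(x_i, q)^2 / K_w)."""
    d2 = np.sum((X - query) ** 2, axis=1)   # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / kernel_width)          # nearby points get larger weights
    return np.sum(w * y) / np.sum(w)        # weighted-average prediction

# 1-D example reusing the four regression points from earlier in the lecture.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 1.0, 4.0, 3.0])
print(kernel_regression(np.array([2.5]), X, y, kernel_width=1.0))  # ~2.5
```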
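As for the SGD alternative to a quadratic solver mentioned above, here is a minimal sketch of stochastic gradient descent on a soft-margin SVM objective. The particular objective, f(w, b) = ½‖w‖² + C · Σᵢ max(0, 1 − yᵢ(w · xᵢ + b)), the hyperparameters, and the toy data are assumptions, since the lecture does not spell them out here:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=50):
    """SGD on a soft-margin SVM objective (hinge loss plus L2 regularization)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):        # visit examples in random order
            margin = y[i] * (w @ X[i] + b)
            # Subgradient of 0.5*||w||^2 + C*hinge on this single example.
            if margin < 1:
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:
                w -= eta * w
    return w, b

# Hypothetical tiny dataset, just to exercise the code.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = sgd_svm(X, y)
print(np.sign(X @ w + b))  # should match y on this separable toy data
```

Processing one example at a time is what makes this approach attractive for a corpus the size of RCV1 (hundreds of thousands of documents), which is the setting of the experiment described above.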
Nearest-neighbor search. Given a set P of n points and a query point q:
- NN: find the nearest neighbor p of q in P.
- Range search: find one/all points in P within distance r from q.

Algorithms in main memory: linear scan; tree-based: quadtree; hashing: locality-sensitive hashing. In secondary storage: R-trees.

Exercise: consider training data for spam emails, where +1 indicates spam. Apply the perceptron algorithm with learning rate η = ½, visiting each training example once (the stopping criterion). What is the precision of the algorithm given the computed weights w? (The training-data table from the slide is not reproduced here; a worked sketch of the procedure on made-up data is given at the end of these notes.)

In summary:

Nearest-neighbor learning: in this approach to machine learning, the entire training set is used as the model. For each ("query") point to be classified, we search for its k nearest neighbors in the training set. The classification of the query point is some function of the labels of these k neighbors. The simplest case is k = 1, in which case we can take the label of the query point to be the label of the nearest neighbor.

Perceptrons: this machine-learning method assumes the training set has only two class labels, positive and negative. Perceptrons work when there is a hyperplane that separates the feature vectors of the positive examples from those of the negative examples. We converge to that hyperplane by adjusting our estimate of the hyperplane by a fraction (the learning rate) of the direction that is the average of the currently misclassified points.

Support-vector machines: the SVM improves upon perceptrons by finding a separating hyperplane that not only separates the positive and negative points, but does so in a way that maximizes the margin, the distance perpendicular to the hyperplane to the nearest points. The points that lie exactly at this minimum distance are the support vectors. Alternatively, the SVM can be designed to allow points that are too close to the hyperplane, or even on the wrong side of the hyperplane, but to minimize the error due to such misplaced points.
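The training-data table for the exercise above is not reproduced in these notes, so the sketch below runs the procedure on a small hypothetical spam dataset purely to illustrate the steps: one pass over the examples with η = ½, then precision computed as true positives over all predicted positives:

```python
import numpy as np

# Hypothetical word-occurrence features and labels (+1 = spam); NOT the table from the slides.
X = np.array([
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])
y = np.array([1, -1, -1, 1])
eta = 0.5

# One pass of the perceptron over the training examples.
w = np.zeros(X.shape[1])
for i in range(len(y)):
    y_pred = 1 if w @ X[i] >= 0 else -1
    if y_pred != y[i]:
        w += eta * y[i] * X[i]

# Precision on the training data with the computed weights.
pred = np.where(X @ w >= 0, 1, -1)
tp = np.sum((pred == 1) & (y == 1))
precision = tp / np.sum(pred == 1)
print(w, precision)  # on this made-up data, precision happens to be 1.0
```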