



Data Mining Anomaly Detection

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


Anomaly/Outlier Detection

• What are anomalies/outliers?
  – The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  – Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  – Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection


Importance of Anomaly Detection: Ozone Depletion History

• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded
• Sources: http://exploringdata.cqu.edu.au/ozone.html, http://www.epa.gov/ozone/science/hole/size.html


Anomaly Detection

• Challenges
  – How many outliers are there in the data?
  – The method is unsupervised, so validation can be quite challenging (just as for clustering)
  – Finding a needle in a haystack
• Working assumption:
  – There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data


Anomaly Detection Schemes

• General steps
  – Build a profile of the “normal” behavior; the profile can be patterns or summary statistics for the overall population
  – Use the “normal” profile to detect anomalies; anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  – Graphical and statistical-based
  – Distance-based
  – Model-based


Graphical Approaches

• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  – Time consuming
  – Subjective


Convex Hull Method

• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?


Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  – Data distribution
  – Parameters of the distribution (e.g., mean, variance)
  – Number of expected outliers (confidence limit)


Grubbs’ Test

• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time, removes the outlier, and repeats
  – H0: There is no outlier in the data
  – HA: There is at least one outlier
• Grubbs’ test statistic:  G = \frac{\max_i |X_i - \bar{X}|}{s}
• Reject H0 if:  G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/N,\,N-2}}{N - 2 + t^2_{\alpha/N,\,N-2}}}
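To make the test concrete, here is a minimal sketch of a single Grubbs' test iteration in Python. The helper name, toy data, and significance level are made up for illustration; only SciPy's Student-t quantile function is assumed.

```python
# Minimal sketch of one Grubbs' test iteration (illustrative helper, not from the slides).
# The data and alpha below are made-up examples.
import numpy as np
from scipy import stats

def grubbs_statistic_and_critical(x, alpha=0.05):
    """Return Grubbs' statistic G = max|x_i - mean| / s and its critical value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t distribution, as in the formula above:
    # G_crit = ((N-1)/sqrt(N)) * sqrt(t^2 / (N - 2 + t^2)), with t = t_{alpha/N, N-2}
    t = stats.t.ppf(1 - alpha / n, n - 2)   # upper-tail quantile
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit

data = [5.1, 4.9, 5.3, 5.0, 5.2, 9.7, 5.1, 4.8]   # 9.7 is an obvious candidate outlier
g, g_crit = grubbs_statistic_and_critical(data)
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, reject H0: {g > g_crit}")
```

In a full procedure the flagged point would be removed and the test repeated on the remaining data until no further outlier is detected.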
Statistical-based – Likelihood Approach

• Assume the data set D contains samples from a mixture of two probability distributions:
  – M (majority distribution)
  – A (anomalous distribution)
• General approach:
  – Initially, assume all the data points belong to M
  – Let L_t(D) be the log likelihood of D at time t
  – For each point x_t that belongs to M, move it to A
      - Let L_{t+1}(D) be the new log likelihood
      - Compute the difference Δ = L_t(D) − L_{t+1}(D)
      - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A


Statistical-based – Likelihood Approach (continued)

• Data distribution: D = (1 − λ) M + λ A
• M is a probability distribution estimated from the data
  – Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
• A is initially assumed to be a uniform distribution
• Likelihood at time t:

  L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left[ (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right] \left[ \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right]

  LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)


Limitations of Statistical Approaches

• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution


Distance-based Approaches

• Data is represented as a vector of features
• Three major approaches
  – Nearest-neighbor based
  – Density based
  – Clustering based


Nearest-Neighbor Based Approach

• Approach:
  – Compute the distance between every pair of data points
  – There are various ways to define outliers:
      - Data points for which there are fewer than p neighboring points within a distance D
      - The top n data points whose distance to the kth nearest neighbor is greatest
      - The top n data points whose average distance to the k nearest neighbors is greatest


Outliers in Lower-Dimensional Projections

• Divide each attribute into φ equal-depth intervals
  – Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  – If attributes are independent, we expect the region to contain a fraction f^k of the records
  – If there are N points, we can measure the sparsity of a cube D as:  S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}, where n(D) is the number of points actually falling in D
  – Negative sparsity indicates that the cube contains fewer points than expected


Example

• N = 100, φ = 5, f = 1/5 = 0.2, N · f² = 4 expected points per 2-dimensional cell


Density-based: LOF Approach

• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of its nearest neighbors to the density of p
• Outliers are points with the largest LOF value
• In the NN approach, p₂ is not considered an outlier, while the LOF approach finds both p₁ and p₂ as outliers
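To illustrate the density-ratio idea, the sketch below computes a simplified LOF-style score with plain NumPy: density is taken as the inverse of the average distance to the k nearest neighbors, and a point's score is its neighbors' average density divided by its own density. This is a simplification of the full LOF definition; the toy data, function name, and choice of k are made up for illustration.

```python
# Simplified LOF-style score (illustrative sketch, not the full LOF definition):
#   density(p) = 1 / (average distance from p to its k nearest neighbors)
#   score(p)   = mean(density(o) for o in kNN(p)) / density(p)
# Scores well above 1 mean p lies in a sparser region than its neighbors.
import numpy as np

def simplified_lof(X, k=3):
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Indices of the k nearest neighbors of each point (excluding the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    avg_knn_dist = np.take_along_axis(d, nn, axis=1).mean(axis=1)
    density = 1.0 / (avg_knn_dist + 1e-12)
    # Ratio of the neighbors' average density to the point's own density
    return density[nn].mean(axis=1) / density

# Toy data: a tight cluster plus one point sitting in a sparse region
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05], [2.0, 2.0]])
print(np.round(simplified_lof(X, k=3), 2))   # the last point gets by far the largest score
```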
Clustering-Based

• Basic idea:
  – Cluster the data into groups of different density
  – Choose points in small clusters as candidate outliers
  – Compute the distance between candidate points and non-candidate clusters
      - If candidate points are far from all other non-candidate points, they are outliers


Base Rate Fallacy

• Bayes’ theorem:  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
• More generally:  P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)}


Base Rate Fallacy (Axelsson, 1999)

• Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than that of sick people


Base Rate Fallacy in Intrusion Detection

• I: intrusive behavior, ¬I: non-intrusive behavior; A: alarm, ¬A: no alarm
• Detection rate (true positive rate): P(A | I)
• False alarm rate: P(A | ¬I)
• Goal is to maximize both
  – Bayesian detection rate, P(I | A)
  – P(¬I | ¬A)


Detection Rate vs. False Alarm Rate

• Applying Bayes’ theorem:  P(I \mid A) = \frac{P(A \mid I)\, P(I)}{P(A \mid I)\, P(I) + P(A \mid \neg I)\, P(\neg I)}
• The false alarm rate becomes more dominant if P(I) is very low
• Axelsson: we need a very low false alarm rate to achieve a reasonable Bayesian detection rate
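As a numerical illustration of Axelsson's point, the snippet below plugs hypothetical values (not the ones from the original slides) into the Bayesian detection rate formula above.

```python
# Hypothetical numbers illustrating the base rate fallacy in intrusion detection;
# they are NOT the values from the original slides.
def bayesian_detection_rate(p_intrusion, detection_rate, false_alarm_rate):
    """P(I|A) via Bayes' theorem."""
    p_alarm = detection_rate * p_intrusion + false_alarm_rate * (1 - p_intrusion)
    return detection_rate * p_intrusion / p_alarm

p_i = 1e-5              # intrusions are assumed very rare among audit records
for far in (1e-2, 1e-4, 1e-6):
    print(f"false alarm rate {far:.0e} -> P(I|A) = "
          f"{bayesian_detection_rate(p_i, 0.99, far):.4f}")
# Even with a 99% detection rate, a 1% false alarm rate gives P(I|A) of about 0.001;
# only a very low false alarm rate yields a usable Bayesian detection rate.
```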