Introduction to Data Mining

Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining by Tan, Steinbach, Kumar
(modified for I211 by P. Radivojac)
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as feature, variable, variate
A collection of attributes describes a data point
– A data point is also known as object, record, instance, or example

In the table below, the columns are attributes (Cheat is the class) and the rows are data points.

Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
  Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
  Example: Attribute values for ID and age are integers
  But properties of attribute values can be different
  – ID has no limit but age has a maximum and minimum value
  – average age is interesting to know, but average ID is meaningless

Measurement of Length
The way you measure an attribute may not match the attribute's properties.
Object  order  actual length
A       5      1
B       7      2
C       8      3
D       10     4
E       15     5

Types of Attributes
There are different types of attributes
– Nominal
  Examples: ID numbers, eye color, zip codes
– Ordinal
  Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness and order
– Interval attribute: distinctness, order and addition
– Ratio attribute: all 4 properties

Attribute types: description, examples, operations

Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test
Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests
Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Attribute levels: permissible transformations

Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?
Ordinal
  Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval
  Transformation: new_value = a * old_value + b where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.

Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
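
The permissible transformations for interval and ratio attributes can be checked numerically. The following is a minimal Python sketch (the slides themselves use Matlab); the sample temperatures and lengths, the helper names, and the conversion factor of about 3.28084 feet per meter are assumptions chosen for illustration. It suggests that an interval transformation (Celsius to Fahrenheit) keeps ratios of differences but not ratios of values, while a ratio transformation (meters to feet) keeps ratios of values.

```python
import math


def celsius_to_fahrenheit(c):
    """Interval-scale transformation: new_value = a * old_value + b."""
    return 9.0 / 5.0 * c + 32.0


def meters_to_feet(m, feet_per_meter=3.28084):
    """Ratio-scale transformation: new_value = a * old_value."""
    return feet_per_meter * m


def ratio_of_differences(v):
    # (third - second) / (second - first) for a list of three values
    return (v[2] - v[1]) / (v[1] - v[0])


def ratio_of_values(v):
    # second / first for a list of values
    return v[1] / v[0]


# Illustrative sample values (not from the slides).
celsius = [10.0, 20.0, 40.0]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]
meters = [1.5, 3.0, 6.0]
feet = [meters_to_feet(m) for m in meters]

# Interval scale: ratios of differences survive, ratios of values do not.
interval_diffs_preserved = math.isclose(
    ratio_of_differences(celsius), ratio_of_differences(fahrenheit))
interval_ratios_preserved = math.isclose(
    ratio_of_values(celsius), ratio_of_values(fahrenheit))

# Ratio scale: ratios of values survive.
ratio_ratios_preserved = math.isclose(
    ratio_of_values(meters), ratio_of_values(feet))

print(interval_diffs_preserved, interval_ratios_preserved, ratio_ratios_preserved)
```

This is why "twice as hot" is not a meaningful statement in Celsius or Fahrenheit, whereas "twice as long" is meaningful in either meters or feet.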

Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

Important Characteristics of Structured Data
– Dimensionality
  Curse of Dimensionality
– Sparsity
  Only presence counts
– Resolution
  Patterns depend on the scale
– Attribute and Class Imbalance
  small number of nonzero elements (related to sparsity)
[Figure: a sparse matrix in which only a few entries are 1]

Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data
Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.
[Table: document-term matrix with one column per term: team, coach, play, ball, score, game, win, lost, timeout, season]

Example

Matlab Code

    % Read words from pre-prepared files and count some dictionary words
    dictionary = {'gaza', 'fuel', 'the', 'patriots'};

    fid = fopen('1.txt', 'rt');
    s = textscan(fid, '%s');    % s{1} is a cell array of whitespace-separated words
    fclose(fid);
    s{1} = lower(s{1});

    fid = fopen('2.txt', 'rt');
    t = textscan(fid, '%s');
    fclose(fid);
    t{1} = lower(t{1});

    D = zeros(2, length(dictionary));    % one row per document, one column per term
    for i = 1 : length(dictionary)
        % 'exact' counts whole-word matches only; plain strmatch also matches prefixes
        D(1, i) = length(strmatch(dictionary{i}, s{1}, 'exact'));
        D(2, i) = length(strmatch(dictionary{i}, t{1}, 'exact'));
    end

Transaction Data
A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
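
One way to hold transaction data in a program is as a set of items per transaction, which can then be converted into a binary (presence/absence) row per transaction. The following is a minimal Python sketch (the slides themselves use Matlab) using the five grocery transactions from this slide; the variable and function names are illustrative assumptions.

```python
# The five grocery transactions from this slide, keyed by transaction ID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# All distinct items, in a fixed (alphabetical) column order.
items = sorted(set().union(*transactions.values()))

# Binary representation: 1 if the item is present in the transaction, else 0.
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}


def count_containing(item):
    """Number of transactions whose item set contains the given item."""
    return sum(item in basket for basket in transactions.values())


print(items)
for tid, row in binary_rows.items():
    print(tid, row)
print("Transactions containing Milk:", count_containing("Milk"))
```

The binary rows turn the transactions into ordinary record data with one asymmetric binary attribute per item, at the cost of a mostly-zero (sparse) matrix when there are many items.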

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Machine Learning Repository at UCI
contains a number of user deposited ML problems
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
Discussion:
– Pima Indians diabetes example (link)
– Boston housing example (link)
– German credit example (link)

How to load certain data formats
From web sites
– use urlread function
From Excel files
– use xlsread function
From text files
– use textscan and related functions
From CSV files
– use csvread function
Many files are unstructured, so parsing is needed

Reading Custom File Types
Need standard I/O for this
– use fopen, fclose, fgetl, fgets for text files
– use fread, fwrite, fseek, ftell for binary files
Examples
– reading protein sequence data
– reading and writing binary data

Graph Data
Examples: Generic graph and HTML Links
[Figure: generic graph with numbered nodes and edges]

    <a href="papers/papers.html#bbbb">
    Data Mining </a>
    <li>
    <a href="papers/papers.html#aaaa">
    Graph Partitioning </a>
    <li>
    <a href="papers/papers.html#aaaa">
    Parallel Solution of Sparse Linear System of Equations </a>
    <li>
    <a href="papers/papers.html#ffff">
    N-Body Computation and Dense Linear System Solvers

Chemical Data
Benzene Molecule: C6H6
[Figure: structure of the benzene molecule]

Ordered/Sequential Data
Sequences of transactions
[Figure: a sequence of transactions, labeled "Items/Events" and "An element of the sequence"]

Ordered/Sequential Data
Genomic sequence data

1: ...GGTTCCGCCTTCAGCCCCCCGCC...  0
2: ...GGTTCCGCGTTCAGCCCCGCGCC...  1
3: ...GGTTCCGCCTTCAGCCCCCCGCC...  0
4: ...GGTTCCGCCTTCAGCCCCGCGCC...  0
5: ...GGTTCCGCCTTCAGCCCCTCGCC...  0
6: ...GGTTCCGCCTTCAGCCCCGCGCC...  0
7: ...GGTTCCGCCTTCAGCCCCTCGCC...  0
8: ...GGTTCCGCATTCAGCCCCCCGCC...  1
9: ...GGTTCCGCCTTCAGCCCCGCGCC...  0

Ordered Data
Spatio-Temporal Data
[Figure: Average Monthly Temperature of land and ocean]

Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data

Noise
Noise refers to modification of original values
– Examples: distortion of a person's voice when talking on a poor phone and "static" on television screen
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

Outliers
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

Missing Values
Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
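
The first three handling strategies above can be sketched in a few lines. The following is a minimal Python illustration (the slides themselves use Matlab), assuming missing values are stored as None; the records, attribute names, and helper functions are hypothetical, and estimating with the attribute mean is only one of many possible estimators.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 40, "weight": None},
    {"age": 31, "weight": 64.0},
]


def eliminate_objects(rows):
    """Strategy 1: drop every data object that has any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def estimate_with_mean(rows, attr):
    """Strategy 2: replace missing values of attr by the mean of observed ones."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]


def mean_ignoring_missing(rows, attr):
    """Strategy 3: compute a statistic over the observed values only."""
    return mean(r[attr] for r in rows if r[attr] is not None)


complete = eliminate_objects(records)
imputed = estimate_with_mean(records, "age")
print(len(complete), "complete records")
print("imputed ages:", [r["age"] for r in imputed])
print("mean weight (observed only):", mean_ignoring_missing(records, "weight"))
```

Eliminating objects is simple but shrinks the data set (here from four records to two), which is the trade-off that motivates estimating or ignoring missing values instead.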

Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with noise and duplicate data issues

Understanding data collection process
Sometimes, the whole population is available for labeling (say, we can provide features and the class, with limited resources, for any object in the population). In such a case we'd like to select a sample that is a good representative of the population.
However, there are situations where we do not have control of the data at hand. It is very important to understand the mechanism by which the data was generated.
Example 1: Bank loan data
– all people eligible to apply for a loan → all people who apply → all people who are accepted → all people who take the loan
– Banks can only study behavior of the people who take the loan. Banks can only make inferences about a subset of the overall population.
Example 2: 1936 Presidential elections in the USA (Roosevelt vs. Landon)
Example 3: 2008 Democratic primary in New Hampshire (Obama vs. Clinton)
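
The bank loan example can be made concrete with a toy calculation. The following is a minimal Python sketch (the slides themselves use Matlab) in which every number is made up: each hypothetical person has an income (in thousands) and the furthest stage they reached in the loan process. It shows how a statistic computed only on people who took the loan can differ from the same statistic on the eligible population.

```python
from statistics import mean

# Stages of the loan process, from the broadest group to the narrowest.
STAGES = ["eligible", "applied", "accepted", "took_loan"]

# Hypothetical people: income in thousands and the furthest stage reached.
people = [
    {"income": 20, "stage": "eligible"},
    {"income": 35, "stage": "applied"},
    {"income": 50, "stage": "applied"},
    {"income": 65, "stage": "took_loan"},
    {"income": 80, "stage": "took_loan"},
    {"income": 95, "stage": "took_loan"},
    {"income": 110, "stage": "accepted"},
    {"income": 125, "stage": "eligible"},
]


def reached(stage):
    """Everyone who got at least as far as the given stage."""
    cutoff = STAGES.index(stage)
    return [p for p in people if STAGES.index(p["stage"]) >= cutoff]


# Mean income of the group observed at each stage.
mean_income = {
    stage: mean(p["income"] for p in reached(stage)) for stage in STAGES
}

for stage in STAGES:
    print(stage, len(reached(stage)), mean_income[stage])
```

In this made-up data the mean income of the eight eligible people is 72.5, while the three who took the loan average 80, so conclusions drawn from borrowers alone would not describe the whole eligible population.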