Introduction to Data Mining

ZoeTabbot, Germany, Professional
Published Date: 13-07-2017
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining by Tan, Steinbach, Kumar (modified for I211 by P. Radivojac)
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

What is Data?

Collection of data objects and their attributes.

An attribute is a property or characteristic of an object.
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a feature, variable, or variate.

A collection of attributes describes a data point.
- A data point is also known as an object, record, instance, or example.

In the table, each column is an attribute (Cheat is the class) and each row is a data point.

Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Attribute Values

Attribute values are numbers or symbols assigned to an attribute.

Distinction between attributes and attribute values:
- The same attribute can be mapped to different attribute values.
  Example: height can be measured in feet or meters.
- Different attributes can be mapped to the same set of values.
  Example: attribute values for ID and age are integers.
  But the properties of the attribute values can be different:
  - ID has no limit, but age has a maximum and minimum value.
  - Average age is interesting to know, but average ID is meaningless.

Measurement of Length

The way you measure an attribute may not match the attribute's properties. The figure maps five line segments to numbers in two ways: one mapping captures only their order, the other captures their actual length.

Segment  Order  Actual length
A        1      5
B        2      7
C        3      8
D        4      10
E        5      15

Types of Attributes

There are different types of attributes:
- Nominal
  Examples: ID numbers, eye color, zip codes
- Ordinal
  Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval
  Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio
  Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties

Attribute Types: Description, Examples, Operations

Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Attribute Levels: Transformation, Comments

Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.

Discrete and Continuous Attributes

Discrete Attribute
- Has only a finite or countably infinite set of values
- Examples: zip codes, counts, or the set of words in a collection of documents
- Often represented as integer variables
- Note: binary attributes are a special case of discrete attributes

Continuous Attribute
- Has real numbers as attribute values
- Examples: temperature, height, or weight
- Practically, real values can only be measured and represented using a finite number of digits.
- Continuous attributes are typically represented as floating-point variables.
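
The permitted transformations listed for each attribute level above can be checked numerically. The following is only a minimal sketch under stated assumptions: the sample values and variable names are made up for illustration, and the cube function stands in for an arbitrary monotonic f.

```matlab
% Interval level: new_value = a * old_value + b (Celsius to Fahrenheit).
celsius    = [0 10 20 40];            % illustrative values, not from the slides
fahrenheit = 1.8 * celsius + 32;

% Differences keep their meaning: they are only rescaled by a = 1.8.
diff_c = diff(celsius);
diff_f = diff(fahrenheit);
disp(diff_f ./ diff_c);

% Ratios are not preserved, because the zero point moved.
ratio_c = celsius(4) / celsius(3);
ratio_f = fahrenheit(4) / fahrenheit(3);
fprintf('Celsius ratio: %.2f, Fahrenheit ratio: %.2f\n', ratio_c, ratio_f);

% Ratio level: new_value = a * old_value (meters to feet).
meters   = [1 2 4];
feet     = meters / 0.3048;
ratio_m  = meters(3) / meters(2);
ratio_ft = feet(3) / feet(2);         % same ratio as in meters

% Ordinal level: any monotonic f keeps the ordering of the objects.
scores = [3 1 2];
f = @(x) x .^ 3;                      % assumed example of a monotonic function
[~, idx_before] = sort(scores);
[~, idx_after]  = sort(f(scores));
disp(isequal(idx_before, idx_after));
```

The interval example shows why an average temperature is meaningful in Celsius or Fahrenheit, while a statement such as "twice as hot" is meaningful only on a ratio scale such as Kelvin.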
Types of data sets

Record
- Data Matrix
- Document Data
- Transaction Data

Graph
- World Wide Web
- Molecular Structures

Ordered
- Spatial Data
- Temporal Data
- Sequential Data
- Genetic Sequence Data

Important Characteristics of Structured Data

- Dimensionality
  Curse of Dimensionality
- Sparsity
  Only presence counts
- Resolution
  Patterns depend on the scale
- Attribute and Class Imbalance
  Small number of non-zero elements (related to sparsity)

[Figure: a sparse matrix in which only a few entries are 1]

Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Data Matrix

If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.

Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data

Each document becomes a `term' vector,
- each term is a component (attribute) of the vector,
- the value of each component is the number of times the corresponding term occurs in the document.

[Figure: document-term matrix with columns team, coach, play, ball, score, game, win, lost, timeout, season]

Example: Matlab Code

    % read from pre-prepared files and count some words
    dictionary = {'gaza', 'fuel', 'the', 'patriots'};

    % read the first document as a cell array of lower-case words
    fid = fopen('1.txt', 'rt');
    s = textscan(fid, '%s');
    fclose(fid);
    s1 = lower(s{1});

    % read the second document the same way
    fid = fopen('2.txt', 'rt');
    t = textscan(fid, '%s');
    fclose(fid);
    t1 = lower(t{1});

    % D(k, i) = number of occurrences of dictionary word i in document k
    D = zeros(2, length(dictionary));
    for i = 1 : length(dictionary)
        % strcmp counts exact matches only; strmatch without 'exact'
        % would also count words that merely start with the term
        D(1, i) = sum(strcmp(dictionary{i}, s1));
        D(2, i) = sum(strcmp(dictionary{i}, t1));
    end

Transaction Data

A special type of record data, where
- each record (transaction) involves a set of items.
- For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Machine Learning Repository at UCI

Contains a number of user-deposited ML problems.

ftp://ftp.ics.uci.edu/pub/machine-learning-databases

Discussion:
- Pima Indians diabetes example (link)
- Boston housing example (link)
- German credit example (link)

How to load certain data formats

From web sites
- use urlread function
From Excel files
- use xlsread function
From text files
- use textscan and related functions
From CSV files
- use csvread function

Many files are unstructured, so parsing is needed.

Reading Custom File Types

Need standard I/O for this
- use fopen, fclose, fgetl, fgets for text files
- use fread, fwrite, fseek, ftell for binary files

Examples
- reading protein sequence data
- reading and writing binary data
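
The two examples named on the last slide might look roughly like the sketch below. It is an illustration under stated assumptions, not the lecture's own code: the FASTA-like layout, the sequence letters, the temporary file names, and all variable names are hypothetical, and the files are created inside the script so that it is self-contained.

```matlab
% --- Text file: read a (hypothetical) protein sequence line by line ---
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '>seq1 hypothetical protein\nMKTAYIAK\nQRQISFVK\n');
fclose(fid);

fid = fopen(fname, 'rt');
header   = '';
sequence = '';
tline = fgetl(fid);                 % fgetl returns -1 (not a char) at end of file
while ischar(tline)
    if ~isempty(tline) && tline(1) == '>'
        header = strtrim(tline(2:end));          % description line
    else
        sequence = [sequence strtrim(tline)];    % append residues
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

% --- Binary file: write doubles, then seek past the first one and read ---
bname  = [tempname '.bin'];
values = [3.14 2.71 1.41];
fid = fopen(bname, 'w');
fwrite(fid, values, 'double');
fclose(fid);

fid = fopen(bname, 'r');
fseek(fid, 8, 'bof');               % a double occupies 8 bytes
pos  = ftell(fid);                  % current offset from the start of the file
rest = fread(fid, 2, 'double');     % column vector with the remaining values
fclose(fid);
delete(bname);

fprintf('%s: %s\n', header, sequence);
disp(rest');
```

Testing the result of fgetl with ischar is the usual way to detect the end of a text file, and fseek with ftell allows a binary file to be read from an arbitrary byte offset rather than from the beginning.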