Supervised and Unsupervised
Ay/Bi 199 – April 2011

Summary
• KDD and Data Mining Tasks
• Finding the optimal approach
• Supervised Models
– Neural Networks
– Multi Layer Perceptron
– Decision Trees
• Unsupervised Models
– Different Types of Clustering
– Distances and Normalization
– Self Organizing Maps
• Combining different models
– Committee Machines
– Introducing a Priori Knowledge
– Sleeping Expert Framework

Knowledge Discovery in Databases
• KDD may be defined as: "The non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data".
• KDD is an interactive and iterative process involving
several steps.

You got your data: what’s next?
What kind of analysis do you need? Which model is more appropriate for it? …

Clean your data
• Data preprocessing transforms the raw data
into a format that will be more easily and
effectively processed for the purpose of the
• Some tasks
• sampling: selects a representative subset
from a large population of data;
• Noise treatment
• strategies to handle missing data: sometimes
your rows will be incomplete, not all
parameters are measured for all samples.
• feature extraction: pulls out specified data
that is significant in some particular context.

Missing Data
• Missing data are a part of almost all research, and we all have to
decide how to deal with them.
• Complete Case Analysis: use only rows with all the values
• Available Case Analysis
– Mean Value: replace the missing value with the
mean value for that particular attribute
– Regression Substitution: we can replace the
missing value with a historical value from similar cases
– Matching Imputation: for each unit with a missing y,
find a unit with similar values of x in the observed
data and take its y value
– Maximum Likelihood, EM, etc
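A few of the strategies listed above can be sketched in plain Python. This is a minimal illustration, not a recommended implementation: the toy table, the `(x, y)` row layout, and the function names are all assumptions made for the example, and "similar values of x" is taken to mean the smallest absolute difference in x.

```python
from statistics import mean

# Hypothetical toy table: each row is (x, y); None marks a missing y.
rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, 8.0), (2.1, 4.4)]

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing values."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Mean value: replace each missing y with the mean of the observed y."""
    y_mean = mean(y for _, y in rows if y is not None)
    return [(x, y_mean if y is None else y) for x, y in rows]

def matching_impute(rows):
    """Matching imputation: copy y from the observed row with the closest x."""
    observed = complete_cases(rows)
    filled = []
    for x, y in rows:
        if y is None:
            # Assumption: similarity is measured as |x_observed - x|.
            _, y = min(observed, key=lambda r: abs(r[0] - x))
        filled.append((x, y))
    return filled

print(complete_cases(rows))
print(mean_impute(rows))
print(matching_impute(rows))
```

Complete case analysis shrinks the sample, mean imputation keeps every row but flattens the variance of y, and matching imputation preserves local structure at the cost of a search over the observed rows.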
• Some DM models can deal with missing data better than others.
• Which technique to adopt really depends on your data

Data Mining
• Crucial task within the KDD
• Data Mining is about automating the process of
searching for patterns in the data.
• In more detail, the most relevant DM tasks are:
– sequence or path analysis
– visualization

Finding Solution via Purposes
• You have your data, what kind of analysis do you need?
– predict new values based on the past, inference
– compute the new values for a dependent variable based on the
values of one or more measured attributes
– divide samples into classes
– use a trained set of previously labeled data
– partitioning of a data set into subsets (clusters) so that data in
each subset ideally share some common characteristics
• Classification is in some ways similar to clustering, but requires
that the analyst know ahead of time how classes are defined.

Cluster Analysis
How many clusters do you expect?

Search for Outliers

Classification
• Data mining technique used to predict group membership for
data instances. There are two ways to assign a new value to a
• Crisp classification
– given an input, the classifier returns its label
• Probabilistic classification
– given an input, the classifier returns its probabilities to belong to
each class
– useful when some mistakes can be more
costly than others (give me only data 90%)
– winner-take-all and other rules
• assign the object to the class with the
highest probability (WTA)
• …but only if its probability is greater than 40%
(WTA with thresholds)

Regression / Forecasting
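For the winner-take-all rules listed under probabilistic classification, a minimal Python sketch might look like this. The function name, the dictionary layout of the classifier output, and the class names are hypothetical; only the two rules themselves (highest probability wins, optionally subject to a 40% threshold) come from the slide.

```python
def winner_take_all(probs, threshold=None):
    """Return the most probable class label, or None if it fails the threshold.

    probs: dict mapping class label -> probability (assumed layout).
    threshold: optional value the winning probability must exceed.
    """
    label = max(probs, key=probs.get)
    if threshold is not None and probs[label] <= threshold:
        return None  # no class is confident enough; leave the object unclassified
    return label

# Hypothetical probabilistic classifier output for one object.
probs = {"star": 0.35, "galaxy": 0.38, "quasar": 0.27}

print(winner_take_all(probs))        # plain WTA
print(winner_take_all(probs, 0.40))  # WTA with a 40% threshold
```

With the threshold in place the classifier can abstain, which is the behavior that helps when some mistakes are more costly than others.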
• Data table statistical correlation
– mapping without any prior assumption on the functional
form of the data distribution;
– machine learning algorithms well suited for this.
• Curve fitting
– find a well defined and known
function underlying your data;
– theory / expertise can help.

Machine Learning
• To learn: to get knowledge of by study, experience,
or being taught.
• Types of Learning
• Unsupervised

Unsupervised Learning
• The model is not provided with the correct results
during the training.
• Can be used to cluster the input data into classes on
the basis of their statistical properties only.
• Cluster significance and labeling.
• The labeling can be carried out even if the labels are
only available for a small number of objects
representative of the desired classes.
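
The two ideas above, clustering without labels and then naming the clusters from a handful of labeled objects, can be illustrated with a short Python sketch. The slides do not name an algorithm here, so k-means on 1-D data is used purely as an assumption; the data, the starting centers, the label names, and the majority-vote rule are likewise hypothetical.

```python
from collections import Counter

def kmeans_1d(values, centers, n_iter=20):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    assign = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest center.
        assign = [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, a in zip(values, assign) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assign, centers

def label_clusters(assign, known_labels):
    """Name each cluster by majority vote among its few labeled members.

    known_labels: dict mapping object index -> label, for a small subset.
    """
    votes = {}
    for idx, label in known_labels.items():
        votes.setdefault(assign[idx], Counter())[label] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}

# Hypothetical unlabeled measurements forming two groups.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
assign, centers = kmeans_1d(values, centers=[0.0, 6.0])

# Only two objects carry a label, one representative per desired class.
names = label_clusters(assign, {0: "faint", 5: "bright"})
predicted = [names[a] for a in assign]
print(predicted)
```

The model never sees the labels during training; they are used only afterwards to attach names to clusters it found from the statistical properties of the data.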