Supervised and Unsupervised Learning
Ciro Donalek
Ay/Bi 199 – April 2011
Summary
• KDD and Data Mining Tasks
• Finding the optimal approach
• Supervised Models
  – Neural Networks
  – Multi Layer Perceptron
  – Decision Trees
• Unsupervised Models
  – Different Types of Clustering
  – Distances and Normalization
  – K-means
  – Self Organizing Maps
• Combining different models
  – Committee Machines
  – Introducing a Priori Knowledge
  – Sleeping Expert Framework

Knowledge Discovery in Databases
• KDD may be defined as: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
• KDD is an interactive and iterative process involving several steps.
You got your data: what's next?
What kind of analysis do you need?
Which model is more appropriate for it?
…
Clean your data
• Data preprocessing transforms the raw data into a format that will be more easily and effectively processed for the purpose of the user. Use standard formats.
• Some tasks:
  • sampling: selects a representative subset from a large population of data (see the sketch below);
  • noise treatment;
  • strategies to handle missing data: sometimes your rows will be incomplete, as not all parameters are measured for all samples;
  • normalization;
  • feature extraction: pulls out specified data that is significant in some particular context.
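A minimal sketch of the sampling task; the dataset below is a hypothetical placeholder.

```python
import numpy as np

# Draw a representative random subset from a large population of rows.
rng = np.random.default_rng(seed=0)
population = rng.normal(size=(100000, 4))   # hypothetical raw dataset

subset_size = 1000
idx = rng.choice(population.shape[0], size=subset_size, replace=False)
sample = population[idx]                    # representative subset
```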

Missing Data
• Missing data are a part of almost all research, and we all have to decide how to deal with them.
• Complete Case Analysis: use only rows with all the values.
• Available Case Analysis.
• Substitution (two of these strategies are sketched below):
  – Mean Value: replace the missing value with the mean value for that particular attribute.
  – Regression Substitution: we can replace the missing value with a historical value from similar cases.
  – Matching Imputation: for each unit with a missing y, find a unit with similar values of x in the observed data and take its y value.
  – Maximum Likelihood, EM, etc.
• Some DM models can deal with missing data better than others.
• Which technique to adopt really depends on your data.
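A hedged sketch of mean-value substitution and matching imputation; the small matrix and the helper function are illustrative, not from the slides.

```python
import numpy as np

# X is a hypothetical data matrix where np.nan marks missing values.
X = np.array([[5.1, 3.5], [4.9, np.nan], [6.2, 2.9], [5.0, 3.4]])

# Mean Value: replace each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Matching Imputation (nearest observed case): for a row with missing y,
# copy y from the observed row whose x is most similar.
def matching_impute(x_obs, y_obs, x_query):
    """Return the y of the observed unit whose x is closest to x_query."""
    nearest = np.argmin(np.abs(x_obs - x_query))
    return y_obs[nearest]

x_obs = np.array([5.1, 6.2, 5.0])   # rows where y is observed
y_obs = np.array([3.5, 2.9, 3.4])
y_filled = matching_impute(x_obs, y_obs, x_query=4.9)  # -> 3.4
```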
Data Mining
• A crucial task within the KDD process.
• Data Mining is about automating the process of searching for patterns in the data.
• In more detail, the most relevant DM tasks are:
  – association
  – sequence or path analysis
  – clustering
  – classification
  – regression
  – visualization
Finding Solutions via Purposes
• You have your data: what kind of analysis do you need?
• Regression
  – predict new values based on the past, inference
  – compute the new values for a dependent variable based on the values of one or more measured attributes
• Classification:
  – divide samples into classes
  – use a training set of previously labeled data
• Clustering
  – partitioning of a data set into subsets (clusters) so that data in each subset ideally share some common characteristics
• Classification is in some way similar to clustering, but requires that the analyst know ahead of time how the classes are defined.
Cluster Analysis
How many clusters do you expect?
Search for outliers.
Classification
• Data mining technique used to predict group membership for data instances. There are two ways to assign a new value to a given class.
• Crisp classification
  – given an input, the classifier returns its label
• Probabilistic classification
  – given an input, the classifier returns its probabilities of belonging to each class
  – useful when some mistakes can be more costly than others (e.g. "give me only data above 90%")
  – winner-take-all and other rules (sketched below)
    • assign the object to the class with the highest probability (WTA)
    • …but only if its probability is greater than 40% (WTA with thresholds)
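A minimal sketch of the two decision rules, assuming a probabilistic classifier has already produced per-class probabilities; the probability vector and the 0.4 threshold are illustrative.

```python
import numpy as np

def winner_take_all(probs, threshold=None):
    """Return the index of the most probable class, or None (abstain)
    if a threshold is given and the winner does not exceed it."""
    winner = int(np.argmax(probs))
    if threshold is not None and probs[winner] <= threshold:
        return None          # too uncertain to classify
    return winner

probs = np.array([0.35, 0.38, 0.27])
winner_take_all(probs)                 # plain WTA -> class 1
winner_take_all(probs, threshold=0.4)  # WTA with threshold -> None
```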
Regression / Forecasting
• Data table statistical correlation
  – mapping without any prior assumption on the functional form of the data distribution;
  – machine learning algorithms are well suited for this.
• Curve fitting
  – find a well-defined and known function underlying your data;
  – theory / expertise can help.
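A sketch of the curve-fitting case: the exponential form and its parameters are assumptions standing in for whatever theory suggests for your data.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)     # assumed underlying function

x = np.linspace(0, 4, 50)
y = model(x, 2.5, 1.3) + 0.05 * np.random.default_rng(0).normal(size=x.size)

params, cov = curve_fit(model, x, y)  # least-squares parameter estimates
```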
Machine Learning
• To learn: to get knowledge of by study, experience, or being taught.
• Types of Learning
  • Supervised
  • Unsupervised
Unsupervised Learning
• The model is not provided with the correct results during the training.
• Can be used to cluster the input data into classes on the basis of their statistical properties only.
• Cluster significance and labeling.
• The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
Supervised Learning
• Training data includes both the input and the desired results.
• For some examples the correct results (targets) are known and are given in input to the model during the learning process.
• The construction of a proper training, validation and test set (Bok) is crucial.
• These methods are usually fast and accurate.
• Have to be able to generalize: give the correct results when new data are given in input without knowing the target a priori.
Generalization
• Refers to the ability to produce reasonable outputs for inputs not encountered during the training.
In other words: NO PANIC when "never seen before" data are given in input.
A common problem: OVERFITTING
• Learn the "data" and not the underlying function.
• Performs well on the data used during the training and poorly with new data.
How to avoid it: use proper subsets, early stopping.

Datasets
• Training set: a set of examples used for learning, where the target value is known.
• Validation set: a set of examples used to tune the architecture of a classifier and estimate the error.
• Test set: used only to assess the performance of a classifier. It is never used during the training process, so that the error on the test set provides an unbiased estimate of the generalization error.
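A minimal sketch of building the three subsets; the placeholder data and the 60/20/20 proportions are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(150, 4))        # placeholder features
y = rng.integers(0, 3, size=150)     # placeholder labels

perm = rng.permutation(len(X))
n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
train, val, test = np.split(perm, [n_train, n_train + n_val])

X_train, y_train = X[train], y[train]  # used for learning
X_val,   y_val   = X[val],   y[val]    # tunes the architecture / estimates error
X_test,  y_test  = X[test],  y[test]   # touched only once, for the final estimate
```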

IRIS dataset
• IRIS
  – consists of 3 classes, 50 instances each
  – 4 numerical attributes (sepal and petal length and width, in cm)
  – each class refers to a type of Iris plant (Setosa, Versicolor, Virginica)
  – the first class is linearly separable from the other two, while the 2nd and the 3rd are not linearly separable
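The IRIS data described above ships with scikit-learn, so it is easy to load for the exercises:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 3 classes x 50 instances, 4 attributes
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```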
Artifacts Dataset
• PQ Artifacts
  – 2 main classes and 4 numerical attributes
  – classes are: true objects, artifacts
Data Selection
• "Garbage in, garbage out": training, validation and test data must be representative of the underlying model.
• All eventualities must be covered.
• Unbalanced datasets
  – since the network minimizes the overall error, the proportion of types of data in the set is critical;
  – inclusion of a loss matrix (Bishop, 1995);
  – often, the best approach is to ensure even representation of different cases, then to interpret the network's decisions accordingly.
Artificial Neural Network
An Artificial Neural Network is an information processing paradigm that is inspired by the way biological nervous systems process information:
"a large number of highly interconnected simple processing elements (neurons) working together to solve specific problems"
A simple artificial neuron
• The basic computational element is often called a node or unit. It receives input from some other units, or from an external source.
• Each input has an associated weight w, which can be modified so as to model synaptic learning.
• The unit computes some function of the weighted sum of its inputs: y = f( Σj wj xj )
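A sketch of such a unit; the logistic sigmoid as the activation f and the toy numbers are assumptions for illustration.

```python
import numpy as np

def neuron(x, w, b=0.0):
    """y = f(sum_j w_j * x_j + b), with f the logistic sigmoid."""
    a = np.dot(w, x) + b                 # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-a))     # activation function

x = np.array([0.5, -1.0, 2.0])   # inputs from other units / external source
w = np.array([0.8,  0.2, 0.1])   # modifiable synaptic weights
y = neuron(x, w)
```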
Neural Networks
A Neural Network is usually structured into an input layer of neurons, one or more hidden layers and one output layer.
Neurons belonging to adjacent layers are usually fully connected, and the various types and architectures are identified both by the different topologies adopted for the connections and by the choice of the activation function. The values of the functions associated with the connections are called "weights".
The whole game of using NNs is in the fact that, in order for the network to yield appropriate outputs for given inputs, the weights must be set to suitable values. The way this is obtained allows a further distinction among modes of operation.
Neural Networks: types
Feedforward: Single Layer Perceptron, MLP, ADALINE (Adaptive Linear Neuron), RBF
Self-Organized: SOM (Kohonen Maps)
Recurrent: Simple Recurrent Network, Hopfield Network
Stochastic: Boltzmann machines, RBM
Modular: Committee of Machines, ASNN (Associative Neural Networks), Ensembles
Others: Instantaneously Trained, Spiking (SNN), Dynamic, Cascades, NeuroFuzzy, PPS, GTM
Multi Layer Perceptron
• The MLP is one of the most used supervised models: it consists of multiple layers of computational units, usually interconnected in a feed-forward way.
• Each neuron in one layer has direct connections to all the neurons of the subsequent layer.

Learning Process
• Back Propagation
  – the output values are compared with the target to compute the value of some predefined error function
  – the error is then fed back through the network
  – using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function
After repeating this process for a sufficiently large number of training cycles, the network will usually converge.
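A minimal sketch of the cycle just described, for a one-hidden-layer MLP with sigmoid units and squared error; the shapes, random data, and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))             # training inputs
T = rng.integers(0, 2, size=(20, 1))     # known targets
W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(5, 1))
lr = 0.1

for epoch in range(1000):                # repeat for many training cycles
    H = sigmoid(X @ W1)                  # forward pass: hidden layer
    Y = sigmoid(H @ W2)                  # forward pass: output
    err = Y - T                          # compare outputs with targets
    # feed the error back and adjust each connection's weight
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY
    W1 -= lr * X.T @ dH
```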

Hidden Units
• The best number of hidden units depends on:
  – number of inputs and outputs
  – number of training cases
  – the amount of noise in the targets
  – the complexity of the function to be learned
  – the activation function
• Too few hidden units = high training and generalization error, due to underfitting and high statistical bias.
• Too many hidden units = low training error but high generalization error, due to overfitting and high variance.
• Rules of thumb don't usually work.
Activation and Error Functions

Activation Functions

Results: confusion matrix

Results: completeness and contamination
Exercise: compute completeness and contamination for the previous confusion matrix (test set).
Decision Trees
• Decision trees are another classification method.
• A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa."
• Decision trees are also nonparametric because they do not require any assumptions about the distribution of the variables in each class.
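A sketch fitting a tree on IRIS and printing its "if/then" rules, which resemble the sepal-length rule quoted above; the depth limit is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```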
Unsupervised Learning
Types of Clustering
• Types of clustering:
  – HIERARCHICAL: finds successive clusters using previously established clusters
    • agglomerative (bottom-up): start with each element in a separate cluster and merge them according to a given property
    • divisive (top-down)
  – PARTITIONAL: usually determines all clusters at once
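A sketch of the agglomerative (bottom-up) case using SciPy; the two-blob data and Ward linkage are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method='ward')                    # successive merge history
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
```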
Distances
• Determine the similarity between two clusters and the shape of the clusters.

In case of strings…
• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
  – measures the minimum number of substitutions required to change one string into the other
• The Levenshtein (edit) distance is a metric for measuring the amount of difference between two sequences.
  – it is defined as the minimum number of edits needed to transform one string into the other.

1001001
1000100  (HD = 3)

LD(BIOLOGY, BIOLOGIA) = 2
BIOLOGY - BIOLOGI (substitution)
BIOLOGI - BIOLOGIA (insertion)
Normalization
VAR: the mean of each attribute of the transformed set of data points is reduced to zero by subtracting the mean of each attribute from the values of the attributes and dividing the result by the standard deviation of the attribute.

RANGE (Min-Max Normalization): subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. It has the advantage of preserving exactly all relationships in the data, without adding any bias.

SOFTMAX: is a way of reducing the influence of extreme values or outliers in the data without removing them from the data set. It is useful when you have outlier data that you wish to include in the data set while still preserving the significance of data within a standard deviation of the mean.
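Sketches of the three normalizations applied column-wise; the data matrix is illustrative, and the softmax variant shown is one common form (a logistic squashing of the z-score), assumed rather than taken from the slides.

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 5000.0]])  # 5000: an outlier
mu, sigma = X.mean(axis=0), X.std(axis=0)

# VAR (z-score): zero mean, unit standard deviation per attribute.
X_var = (X - mu) / sigma

# RANGE (min-max): map each attribute linearly onto [0, 1].
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# SOFTMAX: nearly linear within ~1 standard deviation of the mean,
# while outliers are pulled smoothly into (0, 1) instead of dominating.
X_softmax = 1.0 / (1.0 + np.exp(-(X - mu) / sigma))
```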
K-means

K-means: how it works

K-means: pro and cons
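The "how it works" figures are not reproduced here; as a stand-in, a minimal sketch of the standard Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points), which glosses over empty clusters for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # assignment step: each point joins its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```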
Learning K
• Find a balance between two variables: the number of clusters (K) and the average variance of the clusters.
• Minimize both values.
• As the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0), as the sketch below illustrates.
• Some criteria:
  – BIC (Bayesian Information Criterion)
  – AIC (Akaike Information Criterion)
  – Davies-Bouldin Index
  – Confusion Matrix
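A sketch of the K-vs-variance trade-off, reusing the kmeans function sketched above; the three-blob data are illustrative, and a criterion such as BIC or AIC would replace eyeballing the curve with a penalty term that grows with K.

```python
import numpy as np

def avg_variance(X, labels, centers):
    """Average within-cluster variance over all clusters."""
    return np.mean([((X[labels == j] - c) ** 2).sum(axis=1).mean()
                    for j, c in enumerate(centers)])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, (30, 2)) for m in (0, 3, 6)])

for k in range(1, 7):
    labels, centers = kmeans(X, k)
    print(k, avg_variance(X, labels, centers))  # look for the "elbow"
```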
Self Organizing Maps

SOM topology

SOM Prototypes

SOM Training

Competitive and Cooperative Learning

SOM Update Rule

Parameters

DM with SOM

SOM Labeling

Localizing Data

Cluster Structure

Cluster Structure - 2

Component Planes

Relative Importance

How accurate is your clustering?

Trajectories

Combining Models

Committee Machines

A priori knowledge

Sleeping Experts