market basket business model and market-basket model support and confidence
Dr.GordenMorse,France,Professional
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Big Data Analytics CSCI 4030Supermarket shelf management – Market-basket
model:
Goal: Identify items that are bought together by
sufficiently many customers
Approach: Process the sales data collected with
barcode scanners to find dependencies among
items
A classic rule:
If someone buys diaper and milk, then he/she is
likely to buy beer
Don’t be surprised if you find six-packs next to diapers
Big Data Analytics CSCI 4030
2Input:
A large set of items
TID Items
1 Bread, Coke, Milk
e.g., things sold in a
2 Beer, Bread
supermarket
3 Beer, Coke, Diaper, Milk
A large set of baskets
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Each basket is a
small subset of items
Output:
e.g., the things one
Rules Discovered:
customer buys on one day
Milk Coke
Want to discover
Diaper, Milk Beer
association rules
People who bought x,y,z tend to buy v,w
Amazon
Big Data Analytics CSCI 4030
3 Items = products; Baskets = sets of products
someone bought in one trip to the store
Real market baskets: Chain stores keep data
about what customers buy together
Tells how typical customers navigate stores, lets
them position tempting items
Suggests tie-in “tricks”, e.g., run sale on diapers
and raise the price of beer
Need the rule to occur frequently, or no ’s
Amazon’s people who bought X also bought Y
Big Data Analytics CSCI 4030
4 Baskets = sentences; Items = documents
containing those sentences
Items that appear together too often could
represent plagiarism
Baskets = patients; Items = drugs & side-effects
Has been used to detect combinations
of drugs that result in particular side-effects
Big Data Analytics CSCI 4030
5 A general many-to-many mapping
(association) between two kinds of things
But we ask about connections among “items”,
not “baskets”
For example:
Finding communities in graphs (e.g., Twitter)
Big Data Analytics CSCI 4030
6 Finding communities in graphs (e.g., Twitter)
Baskets = nodes; Items = outgoing neighbors
Searching for complete bipartite subgraphs K of a
s,t
big graph
How?
View each node i as a
basket B of nodes i it points to
i
K = a set Y of size t that
s,t
occurs in s buckets B
i
Looking for K set of
s,t
support s and look at layer t –
A dense 2-layer graph
all frequent sets of size t
Big Data Analytics CSCI 4030
7
s nodes
…
…
t nodesFirst: Define
Frequent itemsets
Association rules:
Confidence, Support, Interestingness
Then: Algorithms for finding frequent itemsets
Finding frequent pairs
A-Priori algorithm
PCY algorithm + 2 refinements
Big Data Analytics CSCI 4030
8 Simplest question: Find sets of items that
appear together “frequently” in baskets
Support for itemset I: Number of baskets
containing all items in I
TID Items
1 Bread, Coke, Milk
(Often expressed as a fraction
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
of the total number of baskets)
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Given a support threshold s,
Support of
then sets of items that appear
Beer, Bread = 2
in at least s baskets are called
frequent itemsets
Big Data Analytics CSCI 4030
9 Items = milk, coke, pepsi, beer, juice
Support threshold = 3 baskets
B = m, c, b B = m, p, j
1 2
B = m, b B = c, j
3 4
B = m, p, b B = m, c, b, j
5 6
B = c, b, j B = b, c
7 8
Frequent itemsets: m, c, b, j,
, b,c , c,j.
m,b
Big Data Analytics CSCI 4030
10 Association Rules:
If-then rules about the contents of baskets
i , i ,…,i → j means: “if a basket contains
1 2 k
all of i ,…,i then it is likely to contain j”
1 k
In practice there are many rules, want to find
significant/interesting ones
Confidence of this association rule is the
probability of j given I = i ,…,i
1 k
support(I j)
conf(I j)
support(I)
Big Data Analytics CSCI 4030
11 Not all high-confidence rules are interesting
The rule X → milk may have high confidence for
many itemsets X, because milk is just purchased very
often (independent of X) and the confidence will be
high
Interest of an association rule I → j:
difference between its confidence and the
fraction of baskets that contain j
Interest(I j) conf(I j) Pr j
Interesting rules are those with high interest values
(usually above 0.5)
Big Data Analytics CSCI 4030
12B = m, c, b B = m, p, j
1 2
B = m, b B = c, j
3 4
B = m, p, b B = m, c, b, j
5 6
B = c, b, j B = b, c
7 8
Association rule: m, b →c
Confidence = 2/4 = 0.5
Interest = 0.5 – 5/8 = 1/8
Item c appears in 5/8 of the baskets
Rule is not very interesting
Big Data Analytics CSCI 4030
13 Problem: Find all association rules with
support ≥s and confidence ≥c
Note: Support of an association rule is the support
of the set of items on the left side
Hard part: Finding the frequent itemsets
If i , i ,…, i → j has high support and
1 2 k
confidence, then both i , i ,…, i and
1 2 k
i , i ,…,i , j will be “frequent”
1 2 k
support(I j)
conf(I j)
support(I)
Big Data Analytics CSCI 4030
14 Step 1: Find all frequent itemsets I
(we will explain this next)
Step 2: Rule generation
For every subset A of I, generate a rule A → I \ A
Since I is frequent, A is also frequent
Variant 1: Single pass to compute the rule confidence
confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
Variant 2:
Observation: If A,B,C→D is below confidence, so is A,B→C,D
Can generate “bigger” rules from smaller ones
Output the rules above the confidence threshold
Big Data Analytics CSCI 4030
15B = m, c, b B = m, p, j
1 2
B = m, c, b, n B = c, j
3 4
B = m, p, b B = m, c, b, j
5 6
B = c, b, j B = b, c
7 8
Support threshold s = 3, confidence c = 0.75
1) Frequent itemsets:
b,m b,c c,m c,j m,c,b
2) Generate rules:
b→m: c=4/6 b→c: c=5/6 b,c→m: c=3/5
m→b: c=4/5 … b,m→c: c=3/4
b→c,m: c=3/6
Big Data Analytics CSCI 4030
16 To reduce the number of rules we can
post-process them and only output:
Maximal frequent itemsets:
No immediate superset is frequent
Gives more pruning
or
Closed itemsets:
No immediate superset has the same count ( 0)
Stores not only frequent information, but exact counts
Big Data Analytics CSCI 4030
17Frequent, but
superset BC
also frequent.
Support Maximal(s=3) Closed
A 4 No No
Frequent, and
its only superset,
B 5 No Yes
ABC, not freq.
C 3 No No
Superset BC
has same count.
AB 4 Yes Yes
Its only super-
AC 2 No No
set, ABC, has
BC 3 Yes Yes
smaller count.
ABC 2 No Yes
Big Data Analytics CSCI 4030
18Item
Back to finding frequent itemsets
Item
Item
Data is often kept in flat files
Item
Item
rather than in a database system:
Item
Item
Stored on disk
Item
Item
Stored basket-by-basket
Item
Item
Baskets are small but we have
Item
many baskets and many items
Expand baskets into pairs, triples, etc.
Etc.
as you read baskets
Use k nested loops to generate all
sets of size k
Items are positive integers,
Note: We want to find frequent itemsets. To find them, we and boundaries between
baskets are –1.
have to count them. To count them, we have to generate them.
Big Data Analytics CSCI 4030
20
Advise:Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.