
Abstract:

Association rule mining is the second most widely used technique in data mining. It searches for interesting relationships among items in a given data set, especially in transactional databases. This paper investigates what association rule mining is, its application areas, its variants, and related topics. The problem of discovering association rules has received considerable research attention, and several fast algorithms for mining association rules have been developed. In practice, users are often interested in only a subset of the association rules. For example, they may want only rules that contain a specific item, or rules that contain children of a specific item in a hierarchy. While such constraints can be applied as a post-processing step, integrating them into the mining algorithm can dramatically reduce the execution time.


1. Introduction

 

 

Data mining, also called knowledge discovery in databases (KDD), has emerged as a new area of database research. It is used to find interesting rules in large sets of data.

Given a set of transactions, where each transaction is a set of items, an association rule is an expression X => Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions which contain the items in X also tend to contain the items in Y. For example, 98% of the people who buy tires and auto accessories also buy some automotive services; the 98% here is called the confidence of the rule. The percentage of transactions that contain both X and Y is called the support of the rule X => Y. The problem of mining association rules is to find all the rules that satisfy a user-specified minimum support and minimum confidence. Applications linked with association rules include attached mailing, catalog design, cross marketing, store layout, loss-leader analysis, and customer segmentation based on buying patterns.
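
To make these definitions concrete, here is a minimal Python sketch (the transactions and item names are made up for illustration) that computes the support and confidence of a rule X => Y:

    # Toy transactions, invented for illustration only.
    transactions = [
        {"tires", "auto accessories", "automotive services"},
        {"tires", "auto accessories", "automotive services"},
        {"tires", "auto accessories"},
        {"batteries", "wipers"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions that contain every item in itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y, transactions):
        # confidence(X => Y) = support(X and Y together) / support(X)
        return support(x | y, transactions) / support(x, transactions)

    x = {"tires", "auto accessories"}
    y = {"automotive services"}
    print(support(x | y, transactions))    # 0.5  -> support of the rule
    print(confidence(x, y, transactions))  # 0.67 -> confidence of the rule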

 

In most cases, taxonomies (is-a hierarchies) over the items are available. In the simple taxonomy shown below, jackets and ski pants are kinds of outerwear, outerwear and shirts are kinds of clothes, and shoes and hiking boots are kinds of footwear. With such a taxonomy we may find, for instance, that people who buy outerwear tend to buy hiking boots, because many people bought those items together. "Outerwear => hiking boots" may then be a valid rule, while "ski pants => hiking boots" and "clothes => hiking boots" may not be: the former may not have minimum support, and the latter may not have minimum confidence.

 

    Clothes                        Footwear
    ├── Outerwear                  ├── Shoes
    │   ├── Jackets                └── Hiking Boots
    │   └── Ski pants
    └── Shirts

The rules found in the raw data mostly involve the leaf-level nodes of the taxonomy rather than the interior nodes. However, finding rules across different levels of the taxonomy is valuable because:

1.) Taxonomies can be used to prune uninteresting or redundant rules.

2.) Rules at lower levels may not have minimum support. For example, too few people may buy ski pants together with hiking boots for that leaf-level rule to qualify, even though the corresponding rule one level up does. If we are limited to leaf-level comparisons, we may therefore find very few association rules. Consider a supermarket with hundreds of products: only some pairs of individual items are bought together often enough to matter, while many more patterns reach minimum support at the category level.

 

2. What is Association Rule Mining?

Association rule mining is a technique used to find frequent patterns, correlations, associations, or causal structures in data sets held in different kinds of databases, such as relational databases, transactional databases, and other forms of data storage. Given a set of transactions, association rule mining aims to find the rules that enable us to predict the occurrence of a specific item based on the occurrences of the other items in the transaction.

Association rule mining is the data mining process of finding the rules that may govern associations and causal relationships between sets of items. Given transactions with multiple items, it tries to find the rules that govern how or why such items are often bought together. For example, peanut butter and jelly are often bought together because a lot of people like to make peanut butter and jelly sandwiches. More surprisingly, diapers and beer are often bought in combination because, as it turns out, dads are often tasked with buying the groceries while moms stay near the baby.

The main applications of association rule mining:

•   Basket data analysis – analysing the co-occurrence of purchased items in a single basket or single purchase.

•   Cross marketing – working with other organizations that complement your own, not with competitors. For example, vehicle dealerships and manufacturers run cross-marketing campaigns with oil and gas companies for obvious reasons.

•   Catalog design – the items in a business' catalog are often chosen to complement each other, so that shopping for one item leads to buying another; such items are often complements or closely related. (techopedia, n.d.)

 

3. Apriori

Mining for associations among items in a large database of sales transactions is an important database mining function. For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented by the association rule below:

    Keyboard => Mouse [support = 6%, confidence = 70%]

 

•       Apriori pruning principle:

If there is any itemset which is infrequent, its superset should not be generated or tested!

•       Method (a minimal code sketch follows this list):

–       Initially, scan the DB once to get the frequent 1-itemsets

–       Generate length-(k+1) candidate itemsets from the length-k frequent itemsets

–       Test the candidates against the DB

–       Terminate when no frequent or candidate set can be generated
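
A minimal Python sketch of this level-wise method is given below; it is illustrative only, as real implementations use hash trees or prefix trees to count candidate supports efficiently.

    from itertools import combinations

    def apriori(transactions, min_sup):
        # Level-wise Apriori; returns {frozenset: support count}.
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        # Scan 1: frequent 1-itemsets.
        freq = {frozenset([i]): c
                for i in items
                if (c := sum(i in t for t in transactions)) >= min_sup}
        all_freq, k = dict(freq), 1
        while freq:
            # Self-join: length-(k+1) candidates from length-k frequent itemsets.
            cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
            # Prune: drop candidates with an infrequent k-subset (Apriori principle).
            cands = {c for c in cands
                     if all(frozenset(s) in freq for s in combinations(c, k))}
            # Test the surviving candidates against the DB.
            freq = {c: n for c in cands
                    if (n := sum(c <= t for t in transactions)) >= min_sup}
            all_freq.update(freq)
            k += 1
        return all_freq

    # On the example database below (minimum support = 2) this yields,
    # among others, {B, C, E} with support 2, matching the worked example.
    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(db, 2))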

 

Example (with minimum support = 2):

Transactional database:

    TID | Items
    ----+------------
    10  | A, C, D
    20  | B, C, E
    30  | A, B, C, E
    40  | B, E

1st scan – count the candidate 1-itemsets C1:

    Itemset | sup
    --------+----
    {A}     | 2
    {B}     | 3
    {C}     | 3
    {D}     | 1
    {E}     | 3

Pruning {D} (support 1 < 2) gives the frequent 1-itemsets L1:

    Itemset | sup
    --------+----
    {A}     | 2
    {B}     | 3
    {C}     | 3
    {E}     | 3

Self-joining L1 gives the candidate 2-itemsets C2:

    {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan – count the candidates in C2:

    Itemset | sup
    --------+----
    {A, B}  | 1
    {A, C}  | 2
    {A, E}  | 1
    {B, C}  | 2
    {B, E}  | 3
    {C, E}  | 2

Pruning {A, B} and {A, E} gives L2:

    Itemset | sup
    --------+----
    {A, C}  | 2
    {B, C}  | 2
    {B, E}  | 3
    {C, E}  | 2

Self-joining L2 gives C3 = { {B, C, E} } ({A, B, C} and {A, C, E} are pruned because {A, B} and {A, E} are infrequent).

3rd scan – count C3 to get L3:

    Itemset   | sup
    ----------+----
    {B, C, E} | 2

DETAILS OF APRIORI

•       Generate candidates

–       Step 1: self-joining Lk

–       Step 2: pruning

•       Count supports of candidates

•       Example of candidate generation (a code sketch follows this list):

–       L3 = {abc, abd, acd, ace, bcd}

–       Self-joining: L3*L3

•       abcd from abc and abd

•       acde from acd and ace

–       Pruning:

•       acde is removed because ade is not in L3

–       C4 = {abcd}
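
The following short Python sketch reproduces this candidate-generation example; itemsets are represented as sorted strings purely for readability.

    from itertools import combinations

    L3 = {"abc", "abd", "acd", "ace", "bcd"}

    # Self-join: merge two 3-itemsets that share their first two items.
    C4 = {"".join(sorted(set(a) | set(b)))
          for a in L3 for b in L3
          if a < b and a[:-1] == b[:-1]}        # -> {'abcd', 'acde'}

    # Prune: discard candidates that have an infrequent 3-subset.
    C4 = {c for c in C4
          if all("".join(s) in L3 for s in combinations(c, 3))}
    print(C4)  # {'abcd'} -- 'acde' is removed because 'ade' is not in L3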

                                                                                                                                                          

BOTTLENECK OF APRIORI

•       Challenges

–       Multiple scans of the transaction database

–       Huge number of candidates

–       Tedious workload of support counting for candidates

•       Improving Apriori: general ideas

–       Reduce the number of passes over the transaction database

–       Shrink the number of candidates

–       Facilitate support counting of candidates

Possible ways of improving the performance of the algorithms:

•       Implementation techniques

–       Use of good data structures

–       Fast implementation of basic operations

•       Algorithm improvement

–       Finding algorithms that are more efficient

•       Use of parallel processing

•       Sampling the transaction databases

Interactive Discovery

In ARM, the user plays an important role in the process:

•       The user is responsible for setting the initial minimum support and confidence thresholds

•       During the discovery, the user may decide to further fine-tune the thresholds

•       The user can specify which items are to appear on either or both sides of the resulting rules (for different purposes), e.g. {X} -> {nappies}, {nappies} -> {X}, or {beer} -> {nappies}; a filtering sketch follows this list

•       The user can exploit a category hierarchy of some kind among the items, e.g. "Bakery products" -> "Soft drinks" instead of "Bread" -> "Coke"
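
As a simple illustration, such item constraints can be applied as a post-processing filter over the generated rules; the rules and numbers below are hypothetical, and, as noted in the abstract, pushing the constraint into the mining algorithm itself is far more efficient.

    # Hypothetical rules: (antecedent, consequent, support, confidence).
    rules = [
        ({"beer"}, {"nappies"}, 0.05, 0.70),
        ({"bread"}, {"coke"}, 0.08, 0.60),
    ]

    def with_item(rules, item, side="either"):
        # Keep rules mentioning item on the chosen side: 'lhs', 'rhs' or 'either'.
        def ok(lhs, rhs):
            if side == "lhs":
                return item in lhs
            if side == "rhs":
                return item in rhs
            return item in lhs or item in rhs
        return [r for r in rules if ok(r[0], r[1])]

    print(with_item(rules, "nappies", side="rhs"))  # keeps {beer} -> {nappies}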

 

Measures of Interestingness

•       play basketball => eat cereal [40%, 66.7%] is misleading

–       The overall percentage of students eating cereal is 75%, which is greater than 66.7%.

•       play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

•       Another measure of interestingness: lift

               | Basketball | Not basketball | Sum (row)
    Cereal     | 2000       | 1750           | 3750
    Not cereal | 1000       | 250            | 1250
    Sum (col.) | 3000       | 2000           | 5000
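
Lift compares the rule's observed joint frequency with what independence would predict: lift(X => Y) = P(X and Y) / (P(X) * P(Y)). A value below 1 indicates negative correlation. Computing it from the table above, as a small Python sketch:

    n = 5000
    p_b = 3000 / n        # P(basketball) = 0.6
    p_c = 3750 / n        # P(cereal) = 0.75
    p_bc = 2000 / n       # P(basketball and cereal) = 0.4

    print(p_bc / (p_b * p_c))  # 0.89 < 1: basketball and cereal are
                               # negatively correlated

    p_nc = 1250 / n       # P(not cereal) = 0.25
    p_bnc = 1000 / n      # P(basketball and not cereal) = 0.2
    print(p_bnc / (p_b * p_nc))  # 1.33 > 1: positively correlated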


4. Algorithms

The problem of discovering association rules can be decomposed into three steps:

1.)    Find all sets of items whose support is greater than the user-specified minimum support; these are called the frequent itemsets.

2.)    Use the frequent itemsets to generate the association rules. The general idea is that if AB and ABCD are frequent itemsets, then we can determine whether the rule AB => CD holds by computing the ratio conf = support(ABCD) / support(AB). If conf >= minconf, the rule holds. (A small sketch of this step follows the list.)

3.)    Prune all uninteresting rules from this set.
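
A minimal sketch of step 2, with illustrative support values:

    # Does AB => CD hold? conf = support(ABCD) / support(AB)
    support = {frozenset("AB"): 0.10, frozenset("ABCD"): 0.08}  # made-up values
    minconf = 0.7

    conf = support[frozenset("ABCD")] / support[frozenset("AB")]
    if conf >= minconf:
        print("AB => CD holds with confidence", round(conf, 2))  # 0.8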

Consider the problem of checking whether a transaction T supports an itemset X. With the raw transaction, we need to check, for each item x in X, whether x or some descendant of x is present in the transaction. The task becomes easier if we add all the ancestors of each item in T to T; call the result the extended transaction T'. Then T supports X exactly when T' is a superset of X. A straightforward way to find generalized association rules is therefore to run the existing algorithms on these extended transactions.
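
A small sketch of this preprocessing step, using a toy version of the clothes/footwear taxonomy from the introduction (the parent links are assumptions for illustration):

    # Extend each transaction with the ancestors of its items, so that an
    # ordinary frequent-itemset algorithm also finds generalized rules.
    parent = {
        "jackets": "outerwear", "ski pants": "outerwear",
        "outerwear": "clothes", "shirts": "clothes",
        "shoes": "footwear", "hiking boots": "footwear",
    }

    def ancestors(item):
        while item in parent:
            item = parent[item]
            yield item

    def extend(transaction):
        # T' = T plus the ancestors of every item in T.
        return set(transaction) | {a for i in transaction for a in ancestors(i)}

    t = {"ski pants", "hiking boots"}
    print(extend(t))  # adds outerwear, clothes and footwear
    # T supports X exactly when X is a subset of T'.
    print({"outerwear", "hiking boots"} <= extend(t))  # True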