Parallel
Data Mining

Data mining aims at
finding meaningful patterns or rules in large datasets. It is
an interdisciplinary field, which combines research from areas
such as machine learning, statistics, high performance computing,
and neural networks.

A common
feature of most data mining tasks is that they are resource intensive
and operate on large sets of data. Data sources measuring in gigabytes
or terabytes are now quite common in data mining. For example,
WalMart records 20 million transactions and AT&T produces
275 million call records every day. This calls for fast data mining
algorithms that can mine huge databases in a reasonable amount
of time.

However,
despite the many algorithmic improvements proposed in many serial
algorithms, the large size and dimensionality of many databases
makes data mining tasks too slow and too big to be run on a single
processor machine. It is therefore a growing need to develop efficient
parallel data mining algorithms that can run on a distributed
system. In this project, several parallel data mining algorithms,
based on the famous Apriori algorithm, was implemented and their
performance evaluated on a parallel machine.

**Apriori**

Apriori employs a bottom-up, breadth-first search that enumerates
every single frequent pattern in a database. It starts by finding
all frequent patterns of size 1, which is then used to find all
frequent patterns of size 2 etc.

What
makes Apriori so popular is that uses the downward closure property
of pattern support (all subsets of a frequent pattern must themselves
be frequent) to prune the search space. Thus only frequent patterns
of size k are used to generate patterns of size k+1.

This
may seem trivial but using this observation can significantly
reduce the search space. This is illustrated in the diagram below.

**Parallel
Algorithms**

Many
parallel data mining algorithms inherits this property from Apriori,
which is why most parallel data mining algorithms are said to
be a variation of Apriori.

Writing
parallel data mining algorithms are a non-trivial task. The main
challenges associated with parallel data mining include

- minimizing I/O

- minimizing synchronization and communication

- effective load balancing

- effective data layout (horizontal vs. vertical)

- deciding on the best search procedure to use (complete
vs. greedy)

- good data decomposition

- minimizing/avoiding duplication of work

How
we solved these challenges when writing our algorithms are described
in our report.

Four
Parallel Algorithms were used.
download
algorithms (171kb)

The
algorithms were implemented in Java. Instructions on how to run
the algorithms are included in the compressed file.

The
algorithms can run on any computer that has the java runtime environment
installed, which can be downloaded
from Sun Microsystems.

*Abstract:*
This paper studies the performance of several parallel algorithms
for finding frequent patterns in large data sets using different
number of processors on a symmetric multiprocessor (SMP) machine.

The
experimental data consisted of several synthetic transaction databases
of different sizes and with different average transaction length.
Instead of using a hash-tree structure, binary databases were
used to speed-up the task of counting itemsets and for easy comparison
of different algorithms. Java was used as programming language
for all algorithms.

The
finding of the study is that the performance of various algorithms
is highly dependent upon the database, especially as regards width
of the transactions and the data distribution. No generic algorithm
was found that was optimal for all database types.