Science and EngineeringThe University of Edinburgh
informatics logo university logo
spacer   spacer   spacer
 
....................
arrow Algorithms
....................
arrow Report
....................
arrow Slides
....................
arrow Links
....................
 

Parallel Data Mining

Data mining aims at finding meaningful patterns or rules in large datasets. It is an interdisciplinary field, which combines research from areas such as machine learning, statistics, high performance computing, and neural networks.

A common feature of most data mining tasks is that they are resource intensive and operate on large sets of data. Data sources measuring in gigabytes or terabytes are now quite common in data mining. For example, WalMart records 20 million transactions and AT&T produces 275 million call records every day. This calls for fast data mining algorithms that can mine huge databases in a reasonable amount of time.

However, despite the many algorithmic improvements proposed in many serial algorithms, the large size and dimensionality of many databases makes data mining tasks too slow and too big to be run on a single processor machine. It is therefore a growing need to develop efficient parallel data mining algorithms that can run on a distributed system. In this project, several parallel data mining algorithms, based on the famous Apriori algorithm, was implemented and their performance evaluated on a parallel machine.

 

Apriori

Apriori employs a bottom-up, breadth-first search that enumerates every single frequent pattern in a database. It starts by finding all frequent patterns of size 1, which is then used to find all frequent patterns of size 2 etc.

What makes Apriori so popular is that uses the downward closure property of pattern support (all subsets of a frequent pattern must themselves be frequent) to prune the search space. Thus only frequent patterns of size k are used to generate patterns of size k+1.

This may seem trivial but using this observation can significantly reduce the search space. This is illustrated in the diagram below.

Pruning the search space

 

Parallel Algorithms

Many parallel data mining algorithms inherits this property from Apriori, which is why most parallel data mining algorithms are said to be a variation of Apriori.

Writing parallel data mining algorithms are a non-trivial task. The main challenges associated with parallel data mining include
  - minimizing I/O
  - minimizing synchronization and communication
  - effective load balancing
  - effective data layout (horizontal vs. vertical)
  - deciding on the best search procedure to use (complete vs. greedy)
  - good data decomposition
  - minimizing/avoiding duplication of work

How we solved these challenges when writing our algorithms are described in our report.

 

Four Parallel Algorithms were used.
  1. Count Distribution – parallelizing the task of measuring the frequency of a pattern inside a database
  2. Candidate Distribution – parallelizing the task of generating longer patterns
  3. Hybrid Count and Candidate Distribution – a hybrid algorithm that tries to combine the strengths of the above algorithms
  4. Sampling with Hybrid Count and Candidate Distribution – an algorithm that tries to only use a sample of the database.

A brief comparison of the algorithms when running on various numbers of processors is shown in the graph below. A more detail performance evaluation and comparison can be found in our report.

graph figure

 


 

Algorithms

download algorithms (171kb)

The algorithms were implemented in Java. Instructions on how to run the algorithms are included in the compressed file.

The algorithms can run on any computer that has the java runtime environment installed, which can be downloaded from Sun Microsystems.


Report

download reportpdf (274kb)

Abstract: This paper studies the performance of several parallel algorithms for finding frequent patterns in large data sets using different number of processors on a symmetric multiprocessor (SMP) machine.

The experimental data consisted of several synthetic transaction databases of different sizes and with different average transaction length. Instead of using a hash-tree structure, binary databases were used to speed-up the task of counting itemsets and for easy comparison of different algorithms. Java was used as programming language for all algorithms.

The finding of the study is that the performance of various algorithms is highly dependent upon the database, especially as regards width of the transactions and the data distribution. No generic algorithm was found that was optimal for all database types.


Slides

download slidespdf (158kb)

The slides were used to give a short introduction on my research and complement my report.


Links

The following links have been added to provide interested reader with more information about data mining.

Rakesh Agrawal's Publications
Lots of data mining articles by Rakesh Agrawal from IBM

KDD Cup 2003
Information and data about the KDD CUP, the most famous data mining competition.

Data Mining and Exploration
Course on data mining provided by the University of Edinburgh.

 
spacer    

Valid HTML 4.01!