Background material and the data mining process

A data mining project involves more than applying intelligent algorithms to available data. It is a process that starts with stating the goals of the end-user (what does the user want to know?) and ends with the deployment of the newly found knowledge (e.g. in the form of a computer program for the business user). In between lie steps such as collecting the data, cleaning and preprocessing them, choosing data mining tasks and techniques, mining the data and interpreting the results. More information about this can be found in the following texts.

From Data Mining to Knowledge Discovery: An Overview by U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth in Advances in Knowledge Discovery and Data Mining by U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy.
From Data Mining to Knowledge Discovery in Databases is a different version of the same article, which appeared in AI Magazine.

They see data mining as one step in the whole process of knowledge discovery in databases (which others call the data mining process or a data mining project). They give the following definition (cited in most subsequent publications on data mining):

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.

Their article is a comprehensive overview of all aspects of the data mining (or KDD) process. It's a good general introduction.
The Process of Knowledge Discovery in Databases by R. J. Brachman and T. Anand in Advances in Knowledge Discovery and Data Mining by U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy
This article describes the data mining process with special attention to the user and the process. The authors stress that end-users are usually not data analysts. The data miner has to interact with them to gain insight into the questions that have to be answered, the data that are available, the domain in which the project is set, and the final product that is wanted. They also point out that data mining is a complex and iterative process.
I assume papers of this kind are not the immediate concern of this course, but I think they cannot be totally ignored. After all, the aim is to bring techniques for learning from data into a practical context.
http://www.crisp-dm.org
CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a joint effort by several companies to arrive at an industry- and tool-neutral data mining process model. The full process model can be found here.
Data Exploration as a Process, chapter 1 in Data Preparation for Data Mining by Dorian Pyle
Gives yet another overview of the different steps of a data mining project. Pyle's book is not very concise, though (it takes him 5 pages to say what others say in 5 lines), so I would not strongly recommend it. What may be interesting is how strongly he stresses the importance of the preliminary stages of the data mining process, the ones that come even before data preparation: exploring the actual problem that needs to be solved and the set of possible solutions. Although they take up only 20% of the time spent on a data mining project, they account for 80% of the importance to success (no idea how he measures this, though). He certainly has a point when he says that these are important aspects that should not be overlooked.
A funny paper about one of the reasons why data mining is becoming so important: How much information is there in the world.
Here is a short article about how American national security wants to use data mining after 11 September 2001. The company that does this work for national security is Virtual Gold. On their website you can ask for a password to view the articles about this, but they ask you to keep it strictly confidential. Apart from their work for the American government, they also built an application for NBA basketball teams (to investigate match statistics) together with IBM. This story is available here.
A good introductory lecture is given in the slides for the first chapter of Han and Kamber's book (available on-line in ppt format). I have printed this out.
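
To make the steps named at the top of this page a bit more concrete (cleaning, preprocessing, mining, interpreting, and going back when a first attempt gives nothing), here is a minimal sketch in Python. It is not taken from any of the texts above. The basket data, the function names and the simple pair-counting "mining" step are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data: one shopping basket per record (made up for illustration).
RAW_BASKETS = [
    ["bread", "milk", ""],
    ["Bread", "butter", "milk"],
    None,  # a record that failed to load
    ["milk", "butter"],
    ["bread", "milk", "butter"],
]


def clean(records):
    """Cleaning: drop missing records and empty item names."""
    return [[item for item in record if item] for record in records if record]


def preprocess(records):
    """Preprocessing: lower-case item names and remove duplicates within a basket."""
    return [sorted({item.lower() for item in record}) for record in records]


def mine(baskets, min_count):
    """Mining: keep item pairs that occur together in at least min_count baskets."""
    counts = Counter(pair for basket in baskets for pair in combinations(basket, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def interpret(patterns, total):
    """Interpretation: turn raw counts into statements an end-user can read."""
    return [f"{a} and {b} occur together in {n} of {total} baskets"
            for (a, b), n in sorted(patterns.items())]


def run_project(raw):
    """Run the steps in order, relaxing the threshold until some pattern is found."""
    baskets = preprocess(clean(raw))
    patterns, min_count = {}, len(baskets)
    while min_count > 0:
        patterns = mine(baskets, min_count)
        if patterns:
            break
        min_count -= 1  # iterate: the previous threshold was too strict
    return baskets, patterns, min_count


if __name__ == "__main__":
    baskets, patterns, min_count = run_project(RAW_BASKETS)
    for line in interpret(patterns, len(baskets)):
        print(line)
```

The loop in run_project is only a stand-in for the iterative character of a real project, where going back usually means revisiting the goals or the data with the end-user, not just lowering a threshold.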
