A data mining project involves more than applying intelligent algorithms to available data. It is a process that starts with stating the goals of the end user (what do they want to know?) and ends with the deployment of the newly found knowledge (e.g. in the form of a computer program for the business user). In between lie steps such as collecting the data, cleaning and preprocessing them, choosing data mining tasks and techniques, mining the data, and interpreting the results. More information can be found in the following texts.
They see data mining as one step in the whole process of knowledge discovery in databases (which others call data mining process or data mining project). They give the following definition (cited in most subsequent publications on data mining):
Their article is a comprehensive overview of all aspects of the data mining (or KDD) process, and a good general introduction.
This article describes the data mining process with special attention to the end user. The authors stress that end users are usually not data analysts. The data miner has to interact with them to gain insight into the questions that have to be answered, the data that are available, the domain in which the project is set, and the final product that is wanted. They also point out that data mining is a complex and iterative process.
I assume papers of this kind are not the immediate concern of this course, but I think they cannot be ignored entirely. After all, the aim is to bring techniques for learning from data into a practical context.
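To make the iterative character of the process a little more concrete, here is a minimal sketch in Python. It only mirrors the steps named in the introduction (collect, clean, mine, interpret, deploy). The `Project` class, the toy data, the plausibility cap of 50 and the "mean below 10" acceptance rule are all hypothetical stand-ins chosen for illustration, not something taken from the texts discussed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Project:
    goal: str  # the question stated by the end user
    raw: List[Optional[float]] = field(default_factory=list)
    cleaned: List[float] = field(default_factory=list)
    result: Optional[float] = None
    history: List[str] = field(default_factory=list)


def collect(p: Project) -> None:
    # Toy data standing in for the project's real sources (hypothetical).
    p.raw = [3.0, None, 4.5, 100.0, 5.0]
    p.history.append("collect")


def clean(p: Project, max_value: float) -> None:
    # Drop missing values and anything above the current plausibility cap.
    p.cleaned = [x for x in p.raw if x is not None and x <= max_value]
    p.history.append("clean")


def mine(p: Project) -> None:
    # Stand-in for a real mining technique: just the mean of the data.
    p.result = sum(p.cleaned) / len(p.cleaned)
    p.history.append("mine")


def interpret(p: Project) -> bool:
    # Stand-in for the end user judging the result with domain knowledge
    # (assumed rule: a plausible answer is below 10).
    p.history.append("interpret")
    return p.result is not None and p.result < 10.0


def run(p: Project, max_rounds: int = 3) -> bool:
    collect(p)
    cap = float("inf")  # first round: no plausibility cap yet
    for _ in range(max_rounds):
        clean(p, cap)
        mine(p)
        if interpret(p):
            p.history.append("deploy")
            return True
        # Simulated user feedback: values above 50 are recording errors.
        cap = 50.0
    return False


project = Project(goal="typical order size")
accepted = run(project)
print(accepted, project.history)
```

In this toy run the first pass is rejected at the interpretation step, the cleaning rule is tightened on the basis of the (simulated) user feedback, and only the second pass reaches deployment. That loop back from interpretation to preprocessing is the point of the sketch; everything else is deliberately trivial.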
CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a joint effort by several companies to arrive at an industry- and tool-neutral data mining process model. The full process model can be found here.
Pyle's book gives yet another overview of the different steps of a data mining project. It is not very concise, though (it takes him 5 pages to say what others say in 5 lines), so I would not particularly recommend it. What may be interesting is how strongly he stresses the importance of the preliminary stages of the data mining process, the ones that come even before data preparation: exploring the actual problem that needs to be solved and the set of possible solutions. Although these stages make up only 20% of the time spent on a data mining project, he claims they account for 80% of the importance to success (no idea how he measures this, though). He certainly has a point when he says that these are important aspects that should not be overlooked.