MATH3346 Data Mining Assignment 3

Change Log

Version 080922: Change "fMisc" to "fBasics". Reword Task 2 dataset suggestions.

Version 080919: Original release.


Background

Please submit your answers to the following tasks by emailing two documents, as attachments named your_name.pdf and your_name.R (as detailed below), to me at Graham.Williams@togaware.com by 17 October 2008.

The first document will be a .PDF file (PDF can be generated from Word and OpenOffice documents using OpenOffice). PDF ensures I see the document in the form you submitted. This will be a report on what you have done for the assignment.

The second document will be a .R file containing a self-contained and runnable script that includes your R code and demonstrations for both Tasks 1 and 2. This should be a commented script that I am able to run in R, and if it requires any packages be sure to use "require". The script should generate output that explains and then illustrates any results, summaries of models, and visualisations. Use the "readline()" function to stop the script each time some output is generated or some plot is displayed, and wait for the user to press Enter.
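As a minimal sketch of this pattern, a script might be laid out as follows. The helper name "pause", the choice of the rpart package and the use of the built-in cars dataset are illustrative assumptions only, not requirements of the assignment.

```r
# Load a package the script depends on; stop early if it is not installed.
require(rpart) || stop("Please install the 'rpart' package.")

# Hypothetical helper: wait for the user to press Enter before continuing.
# (When R is run non-interactively, readline() returns immediately.)
pause <- function(msg = "Press Enter to continue: ") invisible(readline(msg))

# Explain, then show, a piece of output.
cat("Summary of the built-in 'cars' dataset:\n")
print(summary(cars))
pause()

# Explain, then show, a plot.
cat("Scatter plot of stopping distance against speed:\n")
plot(cars, main = "Stopping distance against speed")
pause()
```

Wrapping readline() in a small function keeps the prompt consistent throughout the script.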

Name both submitted files using your name (so that if I were submitting the documents I would be naming the files "graham_williams.pdf" and "graham_williams.R").


Task 1

The basic "standard" algorithms we introduced in lectures included decision trees (rpart), boosting (ada), random forests (randomForest), support vector machines (kernlab), generalised linear models (glm) and neural networks (nnet). Rattle can illustrate the function calls for each of these.

In lectures we mentioned the Netflix competition and the KorBell submission, which developed a linear combination of other models.

For this task, we will explore the concept of a linear combination of other models. Using two datasets, one being the audit.csv dataset that is supplied with Rattle and the other of your own choice (there are data mining and machine learning repositories on the Internet), build a collection of classification or regression models and then explore how to combine the models into one model.

Report on how you went about this and how the ensemble performance compares to the individual models.
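To make the idea of a linear combination concrete, here is a rough sketch of one possible approach, not a prescribed method. It builds two base models (glm and rpart) and then fits the combination weights with "lm" on a separate portion of the data. The simulated dataset, the three-way split and all variable names are assumptions made purely for illustration; your own work should use audit.csv and your chosen dataset, and may combine more models or combine them differently.

```r
# Sketch: linear combination of two classifiers on simulated data.
require(rpart)  # recommended package included with standard R installations

set.seed(42)
n <- 1500
ds <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
# Hypothetical binary target with a linear and a step component.
p <- plogis(1.5 * ds$x1 - ds$x2 + 2 * (ds$x3 > 0.7))
ds$y <- factor(rbinom(n, 1, p))

# Three-way split: build base models, fit combination weights, evaluate.
part  <- sample(rep(c("train", "blend", "test"), each = n / 3))
train <- ds[part == "train", ]
blend <- ds[part == "blend", ]
test  <- ds[part == "test", ]

# Base models.
m.glm   <- glm(y ~ x1 + x2 + x3, data = train, family = binomial)
m.rpart <- rpart(y ~ x1 + x2 + x3, data = train, method = "class")

# Predicted probability of class "1" from each base model.
base.preds <- function(newdata) {
  data.frame(p.glm   = predict(m.glm, newdata, type = "response"),
             p.rpart = predict(m.rpart, newdata, type = "prob")[, "1"])
}

# Fit the linear combination weights on data the base models have not seen.
blend.x <- base.preds(blend)
blend.x$y01 <- as.numeric(blend$y == "1")
m.blend <- lm(y01 ~ p.glm + p.rpart, data = blend.x)
print(coef(m.blend))

# Compare error rates of the base models and the ensemble on the test set.
test.x <- base.preds(test)
test.x$ensemble <- predict(m.blend, test.x)
actual <- as.numeric(test$y == "1")
err <- sapply(test.x, function(pr) mean((pr > 0.5) != actual))
print(round(err, 3))
```

Fitting the weights on a separate portion of the data avoids rewarding whichever base model overfits the training set most.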


Task 2

Often (e.g., in practice), we build models on one dataset and then apply the model to new data as it arrives. The models might be applied to data that arrives one or two years after the original model was built.

Research and discuss the issues with applying a model to new datasets that might be more recent (and so look different --- i.e., different distributions) to that on which the model was built.

Using R, explore options for testing how different the two datasets are, and provide guidance as to whether a model might still be able to be applied to the second dataset. Illustrate with the audit dataset (e.g., compare the first 1000 rows against the second 1000 rows) and, separately, with another dataset of your choice (e.g., the psid1/psid2/psid3 or cps1/cps2/cps3 or nswdemo or covsample datasets from the DAAGxtras package). As a starting point, perhaps consider locationTest from the fBasics package in R.
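As a rough illustration of the kind of comparison intended, the sketch below tests each numeric variable in two batches of data for a shift in location and for a difference in overall distribution. It uses t.test and ks.test from base R as stand-ins rather than locationTest, and the simulated "old" and "new" data frames, the variable names and the helper "compare" are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical stand-ins for an older and a more recent batch of the same
# variables. Here income has drifted upwards while age has not changed.
old <- data.frame(age = rnorm(1000, 40, 10), income = rlnorm(1000, 10.0, 0.5))
new <- data.frame(age = rnorm(1000, 40, 10), income = rlnorm(1000, 10.3, 0.5))

# For each variable, return p-values from a two-sample t-test (location)
# and a two-sample Kolmogorov-Smirnov test (whole distribution).
compare <- function(a, b) {
  t(sapply(names(a), function(v) {
    c(t.p  = t.test(a[[v]], b[[v]])$p.value,
      ks.p = ks.test(a[[v]], b[[v]])$p.value)
  }))
}

res <- compare(old, new)
print(signif(res, 3))
```

Small p-values flag variables whose distribution appears to have changed. With many variables or large samples, even trivial differences can be statistically significant, so the size of the difference matters as well as the p-value.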