Version 100905: Original release.
Version 100913: Suggest using weatherAUS rather than weather.
Version 100920: Add alternative to 1 (d).
Please submit your answers to the following tasks by emailing two documents, as attachments named in the form u123456.pdf and u123456.R (replacing u123456 with your own student ID), to me (Graham.Williams@togaware.com) by 24 September 2010.
OpenOffice can export both MS Word and OpenOffice documents to PDF. Submitting a PDF ensures I see the document exactly as you submitted it. The PDF document will be your report on what you have done for the assignment.
The R file will contain a self-contained and runnable script that includes your R code and demonstrations for the two tasks. This should be a commented script that I will run in R. If it requires any packages, please be sure to load them with "require". The script should generate output that explains and then illustrates any results, summaries of models, and visualisations.
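As a rough illustration of what "self-contained and runnable" might look like, here is a minimal sketch of a script layout. The package names and the small data frame are placeholders chosen so the sketch runs anywhere. They are assumptions for illustration, not part of the assignment.

```r
# Sketch of a self-contained script layout (hypothetical; adapt to your own work).

# Load every package the script needs, stopping with a clear message if one is missing.
# "stats" and "graphics" are placeholders: list the packages you actually use.
pkgs <- c("stats", "graphics")
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    stop("Package '", p, "' is needed to run this script; please install it.")
  }
}

# Explain each result before showing it.
cat("Task 1 (a): summary of the dataset\n")

# Stand-in data so the sketch runs on its own; a real script would load weatherAUS.
ds <- data.frame(x = c(1.2, 3.4, 2.2),
                 y = factor(c("No", "Yes", "No")))
print(summary(ds))
```

Printing a short explanatory line with cat() before each result keeps the output readable when the script is run from top to bottom.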
Task 1 [12 marks]
(a) Rattle provides the weatherAUS dataset with a binary target variable (RainTomorrow). Provide a descriptive overview of the dataset using R to generate appropriate summaries and plots.
(b) Using R, choose one predictive model building algorithm (perhaps one we have talked about in class: rpart, randomForest, ada, glm, nnet, svm) and build a predictive model on 70% of the dataset.
(c) Evaluate the performance of the predictive model on a remaining 15% subset of the dataset. Why would we partition the data into 70%/15%/15% for training/validation/testing?
(d) Write an R function that accepts a data frame and a column number to identify the target variable (you can assume it is a binary-valued variable), and returns a vector which, for each numeric input variable, records the information gain (as used for decision tree building) that would result if the data were split by the mean value of that variable.
(d alternative on request) Write down the steps that you would follow to perform the operation described in (d). Then create a small subset (e.g., 20 rows) of the weather or weatherAUS dataset containing four columns, one being the binary target variable (RainTomorrow) and the others being 3 columns of your choice from the full dataset. Calculate and document the original value of entropy and the value of entropy after partitioning by each of the 3 columns. Which variable offers the greatest gain?
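To make the entropy and information gain calculation concrete, the following is a minimal sketch on made-up data. It handles a single numeric variable and is not a full solution to (d). The names entropy, info_gain and toy, and the values in toy, are assumptions for illustration only, and the sketch ignores missing values, which the real dataset contains.

```r
# Entropy (in bits) of a categorical vector.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]              # drop empty classes so 0 * log2(0) is never evaluated
  -sum(p * log2(p))
}

# Information gain from splitting y into two groups given by a logical vector.
# Assumes both groups are non-empty and that there are no missing values.
info_gain <- function(y, left) {
  n <- length(y)
  after <- sum(left) / n * entropy(y[left]) +
           sum(!left) / n * entropy(y[!left])
  entropy(y) - after
}

# Made-up data: a hypothetical numeric input and a binary target.
toy <- data.frame(
  humidity = c(30, 35, 40, 45, 70, 75, 80, 85),
  rain     = factor(c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"))
)

# Split at the mean value of the input variable.
left <- toy$humidity <= mean(toy$humidity)

cat("Entropy before split:", entropy(toy$rain), "\n")
cat("Information gain:    ", info_gain(toy$rain, left), "\n")
```

The gain is the entropy of the target before the split minus the weighted average of the entropies of the two groups after the split, with each group weighted by its share of the rows.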
Task 2 [8 marks]
(a) Using the same dataset as in Task 1, build any three additional models, giving four models in total. How do the models compare in terms of communicating "knowledge"?
(b) Apply each model to the testing dataset and compare the agreement between the models and their agreement with the actual values of the target variable. Use the capabilities of R (perhaps correlation analysis and appropriate plots) to support your observations.
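One possible way to measure agreement is sketched below on made-up predictions. The model names and all values are invented for illustration; real predictions would come from applying your own models to the testing dataset.

```r
# Made-up actual values and predictions from three hypothetical models.
actual <- c("No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")
preds <- data.frame(
  model_a = c("No", "No",  "Yes", "No", "No",  "Yes", "No", "No",  "Yes", "No"),
  model_b = c("No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No",  "No",  "No"),
  model_c = c("No", "No",  "Yes", "No", "No",  "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Agreement with the actual values: proportion of correct predictions per model.
accuracy <- sapply(preds, function(p) mean(p == actual))

# Agreement between models: proportion of identical predictions for each pair.
agreement <- sapply(preds, function(p) sapply(preds, function(q) mean(p == q)))

print(accuracy)
print(round(agreement, 2))

# Confusion matrix for one model against the actual values.
print(table(Predicted = preds$model_a, Actual = actual))
```

Two models can have the same accuracy yet disagree with each other on many observations, which is why the pairwise agreement matrix is worth examining alongside the per-model accuracy.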