hw_gmm, hw_step, and adult. The last of these is
from the UC Irvine
Machine Learning Repository, under the 'Summary Page' tab. The
data set has been cleaned so that there are no missing values. Each
data set has one or more training data files and one test data file:
adult data files
adult.data training file
adult.test test file
adult.names informational file
hw_gmm data files
hw_gmm_25.arff 25 training examples
hw_gmm_50.arff 50 training examples
hw_gmm_100.arff 100 training examples
hw_gmm_250.arff 250 training examples
hw_gmm_500.arff 500 training examples
hw_gmm_test.arff test data file
hw_step data files
hw_step-25.arff 25 training examples
hw_step-50.arff 50 training examples
hw_step-100.arff 100 training examples
hw_step-250.arff 250 training examples
hw_step-500.arff 500 training examples
hw_step_test.arff test data file
You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.
adult.arff and adult_test.arff. You should
keep the attributes and instances in the same order as they are in the
original files.
TURN IN:
You should turn in the top 50 lines of your
adult.arff and adult_test.arff files.
For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in (a) and (b). See the end of the homework for how to do these runs and obtain the accuracies. I suggest using the command line so that you can do the runs as a batch.
TURN IN:
A table in the following format:
-------------------------------------------------------
hw_gmm:
N      Perceptron  LogReg  J48  kNN-1  kNN-5
25     xxx         yyy     zzz  kkk1   kkk5
50     xxx         yyy     zzz  kkk1   kkk5
100    xxx         yyy     zzz  kkk1   kkk5
250    xxx         yyy     zzz  kkk1   kkk5
500    xxx         yyy     zzz  kkk1   kkk5

hw_step:
N      Perceptron  LogReg  J48  kNN-1  kNN-5
25     xxx         yyy     zzz  kkk1   kkk5
50     xxx         yyy     zzz  kkk1   kkk5
100    xxx         yyy     zzz  kkk1   kkk5
250    xxx         yyy     zzz  kkk1   kkk5
500    xxx         yyy     zzz  kkk1   kkk5

adult:
N      Perceptron  LogReg  J48  kNN-1  kNN-5
30163  xxx         yyy     zzz  kkk1   kkk5
-------------------------------------------------------
Where
xxx gives the error rate of the perceptron,
yyy gives the error rate of LogisticRegression,
etc.
hw_gmm and
hw_step plotting the performance of the five algorithms
as a function of the size of the training data set (known as a
"learning curve"). I recommend using gnuplot, Excel, or MATLAB for
constructing the graphs, as WEKA does not provide an easy way to
do this.
For gnuplot, you need to create a separate file for each learner.
Each file should consist of x,y pairs, where x is the training
set size and y is the accuracy. You can then plot these files
using the plot command.
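The gnuplot data files described above could be generated with a short script. The following is only a sketch in Python: the function names, the ".dat" extension, and the shape of the `results` dictionary are assumptions made for illustration, and the accuracies must come from your own runs. It writes each pair as "x y" separated by a space, which is the column layout gnuplot reads by default.

```python
import os


def format_curve(points):
    """Return one 'x y' line per (training_size, accuracy) pair, sorted by size."""
    return "".join(f"{n} {acc}\n" for n, acc in sorted(points))


def write_curves(results, out_dir="."):
    """Write one data file per learner and return the paths written.

    `results` is assumed to map a learner name to a list of
    (training_size, accuracy) pairs, e.g. {"J48": [(25, ...), (50, ...)]}.
    """
    paths = []
    for learner, points in results.items():
        path = os.path.join(out_dir, f"{learner}.dat")  # hypothetical file name
        with open(path, "w") as f:
            f.write(format_curve(points))
        paths.append(path)
    return paths
```

With files written this way, a gnuplot command along the lines of `plot "J48.dat" with linespoints, "Logistic.dat" with linespoints` should draw one curve per learner.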
For Excel, you can plot the graphs from the table above using the chart wizard.
hw_gmm_25 and hw_step_50
training sets and the kind of decision boundaries that logistic
regression found. To compute the decision boundary for Logistic
Regression, recall that the logistic regression model has the form
log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2
WEKA produces a table that looks like
Variable Coeff.
1 w1
2 w2
Intercept w0
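The decision boundary is the set of points where the two classes are equally likely, i.e., where the log-odds above equal zero: w0 + w1*x1 + w2*x2 = 0. Solving for x2 gives x2 = -(w0 + w1*x1) / w2, a straight line. Because the boundary is the zero set of the log-odds, it is the same line regardless of which class is treated as y=1. The sketch below (in Python, with hypothetical function names) computes two endpoints of that line for plotting; the coefficients are the ones you read from the WEKA table.

```python
def boundary_x2(w0, w1, w2, x1):
    """Return x2 on the line w0 + w1*x1 + w2*x2 = 0 for a given x1."""
    if w2 == 0:
        # With w2 == 0 the boundary is the vertical line x1 = -w0/w1.
        raise ValueError("w2 is zero: the boundary is vertical at x1 = -w0/w1")
    return -(w0 + w1 * x1) / w2


def boundary_segment(w0, w1, w2, x1_min, x1_max):
    """Return the two endpoints of the boundary over [x1_min, x1_max]."""
    return [
        (x1_min, boundary_x2(w0, w1, w2, x1_min)),
        (x1_max, boundary_x2(w0, w1, w2, x1_max)),
    ]
```

Drawing a line between the two returned endpoints on top of the scatter plot of the training data gives the requested figure.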
TURN IN:
(i, 10 points) Plot of the data points for hw_gmm_25
with lines showing the decision boundary learned by Logistic
Regression. That is, you should plot the data as points in the x/y
plane and then plot the decision boundary learned by the algorithm.
(ii, 10 points) Plot of the data points for hw_step_50
with a line showing the learned decision boundary for Logistic
Regression.
Now, let us consider the hw_gmm_250 and
hw_step_250 training sets and the kind of decision
boundaries found by J48. This will require that you read the decision
tree and understand the decision boundary. J48 displays the tree in
the following format:
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
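If the tree is large, reading the thresholds by hand gets tedious. The following Python sketch pulls the splits out of tree text in the format shown above. It is an illustration only: the regular expression assumes simple attribute names and plain decimal thresholds, and the function names are made up for this example.

```python
import re

# Optional leading "|" bars, then: feature, "<=" or ">", threshold, optional leaf label.
LINE_RE = re.compile(
    r"^((?:\|\s*)*)(\w+)\s*(<=|>)\s*(-?\d+(?:\.\d+)?)(?::\s*(\w+))?"
)


def parse_j48(text):
    """Return (depth, feature, op, threshold, leaf_label_or_None) per tree line."""
    nodes = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            bars, feature, op, threshold, label = m.groups()
            nodes.append((bars.count("|"), feature, op, float(threshold), label))
    return nodes


def split_lines(nodes):
    """Return the distinct (feature, threshold) splits, one per axis-parallel line."""
    return sorted({(f, t) for _, f, op, t, _ in nodes if op == "<="})
```

Each (feature, threshold) pair corresponds to one axis-parallel line. Keep in mind that a split at depth 1 or deeper only applies inside the region selected by its ancestors; in the example above, the x2 = 5.0 line is only relevant where x1 > 1.0.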
TURN IN:
(i, 10 points): Plot of the data points for hw_gmm_250
with lines showing the decision boundary learned by J48.
Note: You should plot all separating lines chosen by J48.
Further, highlight the line segments that separate the two classes.
(ii, 10 points): Plot of the data points for hw_step_250
with lines showing the decision boundary learned by J48.
Reconsider the hw_gmm_250 and hw_step_250
training sets and the kind of decision boundaries found by IBk, for
K=1,5. To assist you with this, I have provided an additional file
grid.arff. This file
contains 10201 points on a 0.1 grid for x in [-5,5] and y in [-5,5].
To compute the decision boundary for IBk, you should use the WEKA GUI.
Select grid.arff as your "Supplied test set" in WEKA.
Then after the training is complete, you can right-click on the last
entry in the Result list and select "Visualize classifier errors".
You can visualize the decision boundary by selecting "X: x (Num)" and
"Y: y (Num)". All of the points in grid.arff are labeled
Positive. Incorrectly classified points are plotted by WEKA as red
squares, correctly classified points are plotted as blue x's. This
will allow you to see the boundary. However, to determine the points
on the boundary, click the "Save" button and choose a file name in
which to save the outputs. If you examine this file, you will see
that it contains five comma-separated values per line. The second and
third values give the X and Y coordinates of the points. The fourth
value is the predicted class and the fifth value is the correct class.
You should write a program (or perl script) to find pairs of lines
where the predicted class changes from one line to the next and where
the X coordinate does not change. These points will give an
approximation to the decision boundary.
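One possible shape for such a program is sketched below in Python. It assumes the saved file is laid out as described above (five comma-separated values per line, with X, Y, and the predicted class in positions two to four) and that lines sharing an X value are adjacent in the file. Lines that do not fit this layout, such as header lines, are skipped. Reporting the midpoint between the two Y values is a choice made for this sketch, not something the file format dictates.

```python
def parse_rows(lines):
    """Return (x, y, predicted_class) for every line with five comma-separated values."""
    rows = []
    for line in lines:
        parts = [p.strip() for p in line.strip().split(",")]
        if len(parts) != 5:
            continue  # header or other non-data line
        try:
            x, y = float(parts[1]), float(parts[2])
        except ValueError:
            continue  # second and third values are not numeric
        rows.append((x, y, parts[3]))
    return rows


def boundary_points(rows):
    """Return (x, midpoint_y) wherever the predicted class flips at constant x."""
    points = []
    for (x0, y0, c0), (x1, y1, c1) in zip(rows, rows[1:]):
        if x0 == x1 and c0 != c1:
            points.append((x0, (y0 + y1) / 2))
    return points
```

To use it, open the file you saved from WEKA, pass the file object (or `f.readlines()`) to `parse_rows`, and plot the result of `boundary_points` on top of the training data.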
TURN IN:
(i, 5 points): Plot of the data points for hw_gmm_250
with the decision boundary found by IBk for K=1
(ii, 5 points): Plot of the data points for hw_gmm_250
with the decision boundary found by IBk for K=5
(iii, 5 points): Plot of the data points for hw_step_250
with the decision boundary found by IBk for K=1
(iv, 5 points): Plot of the data points for hw_step_250
with the decision boundary found by IBk for K=5
You can obtain WEKA by visiting the WEKA Project Webpage
and clicking on the appropriate link for your operating system.
Alternatively, if you are on one of the CS systems, you may be able to
access WEKA by connecting to /usr/usc/weka (I have asked
that it be installed there). You can either run it by using the
"run-weka" command in that directory, or by typing in
"java -jar /usr/usc/weka/weka.jar" (make sure you are using
java 1.4+).
Using the GUI
These instructions describe how to apply the learning algorithms to the adult data set. The others can be processed in exactly the same way, of course. When you start up WEKA, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.
Click on "Open file...", then click on the "data" folder, and then select the "adult.arff" file. The "Current relation" window should now show "Relation" as ADULT with 30163 instances and 15 attributes. The table and bar plot on the right-hand side of the window should show 7508 examples in one class and 22654 in the other class (depending on how you converted the adult data set into arff format).
Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "adult_test.arff". The Test Instances window should now show the relation "ADULT" with 15062 instances and 15 attributes. You may close this window at this point.
You don't need to specify which attribute to predict, because you will predict the default--namely the last variable (assuming you converted the adult data set into arff using the same ordering of attributes). Otherwise, you can tell Weka which of the 15 attributes is the class variable. Below the Test options panel, there is a drop down menu with the entry "(Nom) XXX" selected, where XXX is the name of the last variable. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete]
Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.
Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains several general kinds of classifiers, including "bayes", "functions", "lazy", "meta", "trees", and "rules". For example, to choose Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".
Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Logistic Regression, this output consists of several sections:
- Run Information: Details of the data set
- Classifier model: The learned model.
- Evaluation on test set: This gives various statistics. The key item is the second one: Incorrectly Classified Instances will be expressed as a count and a percentage. You should report the percentages in your answer. One other item of interest comes at the very end: The Confusion Matrix. This shows how many false positive and false negative errors were made.
Using the command-line
You can also run weka from the command line. This is probably easier to do, as you can then run things in a batch script, which will make it easier to do all the runs.
Basic Run of a Learner on a training and test set:
java -cp WEKAJAR LEARNER -t TRAINFILE -T TESTFILE
This command has four parameters:
- WEKAJAR: This should point to the weka.jar that you have installed.
- LEARNER: This is the class name of the learner you want to run. You will use four learners in this homework:
weka.classifiers.lazy.IBk
weka.classifiers.functions.Logistic
weka.classifiers.functions.VotedPerceptron
weka.classifiers.trees.J48
- TRAINFILE: This is the name of the file that contains the training set
- TESTFILE: This is the name of the file that contains the test set
Here are two example runs using the nearest neighbor learner (the first is a single nearest neighbor, the second is a 5-nearest neighbor, using the '-K' option specific to nearest neighbor):
java -cp weka.jar weka.classifiers.lazy.IBk -t hw_step_10.arff -T hw_step_test.arff
java -cp weka.jar weka.classifiers.lazy.IBk -t hw_step_10.arff -T hw_step_test.arff -K 5
The output of these commands consists of the learned model as well as test and training statistics. You may want to save this output to a file and then later extract what you need.
If you do not provide any options, then you get a general usage output for that particular learner.
Saving output and extracting accuracy:
The first command below saves the output into a file called 'OUTFILE', and the second command extracts the accuracy on the test set from OUTFILE.
java -cp WEKAJAR LEARNER -t TRAINFILE -T TESTFILE > OUTFILE
grep Correctly OUTFILE | tail -1 | awk '{print $5}'
Wrapping around a script:
You can wrap the above commands inside a shell or Perl script to do all the training runs and extract all the accuracies. From there on, generating the plots and tables should be straightforward.
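As one possible starting point, here is a sketch of such a wrapper in Python rather than shell or Perl. The function names are made up for this example. `extract_accuracy` mirrors the grep/tail/awk pipeline above: it takes the last line containing "Correctly" and returns its fifth whitespace-separated field. `run_learner` assumes that java is on your path and that the jar and data file names you pass in exist.

```python
import subprocess


def weka_command(weka_jar, learner, train_file, test_file, extra_options=()):
    """Build the argument list for: java -cp WEKAJAR LEARNER -t TRAINFILE -T TESTFILE."""
    return ["java", "-cp", weka_jar, learner,
            "-t", train_file, "-T", test_file, *extra_options]


def extract_accuracy(output):
    """Return the fifth field of the last line containing 'Correctly' as a float."""
    lines = [line for line in output.splitlines() if "Correctly" in line]
    if not lines:
        raise ValueError("no 'Correctly' line found in the output")
    return float(lines[-1].split()[4])


def run_learner(weka_jar, learner, train_file, test_file, extra_options=()):
    """Run one learner and return its test-set accuracy in percent."""
    cmd = weka_command(weka_jar, learner, train_file, test_file, extra_options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return extract_accuracy(result.stdout)
```

Calling `run_learner` in a loop over the learner class names and your training files gives the accuracies for the learning curves. For the 5-nearest-neighbor runs, pass `extra_options=("-K", "5")`. Since the table asks for error rates, report 100 minus the returned accuracy there.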