**Learning Algorithms**. We will compare Perceptron, Logistic Regression, Decision Trees (J48), and k-nearest neighbors (IBk, in two variations: 1-NN and 5-NN).

**Data Sets**. We will apply these five learners to the data sets `hw_gmm`, `hw_step`, and `statlog`. The latter data set is from the UC Irvine machine learning repository and has been cleaned so that there are no missing values. The artificial data sets have one or more training data files and one test data file; the statlog data is one large file:

- statlog files (in the data folder directory): `Index`, `australian.dat`, `australian.doc`
- hw_gmm data files: `hw_gmm_25.arff` (25 training examples), `hw_gmm_50.arff` (50), `hw_gmm_100.arff` (100), `hw_gmm_250.arff` (250), `hw_gmm_500.arff` (500), and `hw_gmm_test.arff` (test data file)
- hw_step data files: `hw_step-25.arff` (25 training examples), `hw_step-50.arff` (50), `hw_step-100.arff` (100), `hw_step-250.arff` (250), `hw_step-500.arff` (500), and `hw_step_test.arff` (test data file)

You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.

**Exercises / What to turn in**.

**[Data Handling, 10 points]:** Your first task is to download the statlog data and convert it to WEKA format. This task will make you familiar with how to download and handle data.

- Go to the webpage and follow the link to 'Data Folder' at the top (right under the main name).
- Download the `australian.dat` file and split it into two files: 490 instances for the training set (name it `statlog.arff`) and 200 instances for the test set (name it `statlog_test.arff`).
- Add a WEKA arff header to these two files, using the details of the attribute information on the web-page. Look at the artificial data sets to help you get the syntax correct. Make sure you keep the attributes and instances in the same order as they are in the original files.
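For reference, the top of an arff file follows this general shape. This is a sketch only: the attribute names `a1`, `a2`, ... and the class value set are placeholders, and must be replaced with the actual attribute information from the statlog web-page.

```text
@relation statlog

% One @attribute line per input column, in the original order.
% Categorical attributes are declared with their value set,
% numeric attributes with the keyword "numeric".
@attribute a1 numeric
@attribute a2 numeric
% ... remaining input attributes ...
@attribute class {0,1}

@data
% The instances follow, one per line, with values separated
% by commas (arff data sections are comma-separated).
```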

**TURN IN:**

You should turn in the top 50 lines of your `statlog.arff` and `statlog_test.arff` files.

**[Training Set Sensitivity, 30 points]:** How sensitive are the various learners to the training set size? We will have each learner train on each of the training files (sizes 25 to 500) and record its accuracy. This exercise gives us insight into the behavior of each learner and how sensitive it is to the training set size. This knowledge is useful when deciding which learner to use for a specific problem.

For each classifier and each problem domain, you should learn using each of the training files (e.g., `hw_step-25.arff`) and test the learned model on the given test file (e.g., `hw_step_test.arff`). Record the accuracy of the learned model and report it in a table and a graph as specified in the two parts below. See the end of the homework for how to do these runs and get the accuracies; I suggest you use the command line to do these in batch.

**[Tabular comparison, 20 points]**

**TURN IN:**

A table in the following format:

    -------------------------------------------------------
    hw_gmm:
    N    Perceptron  LogReg  J48  kNN-1  kNN-5
    25   xxx         yyy     zzz  kkk1   kkk5
    50   xxx         yyy     zzz  kkk1   kkk5
    100  xxx         yyy     zzz  kkk1   kkk5
    250  xxx         yyy     zzz  kkk1   kkk5
    500  xxx         yyy     zzz  kkk1   kkk5

    hw_step:
    N    Perceptron  LogReg  J48  kNN-1  kNN-5
    25   xxx         yyy     zzz  kkk1   kkk5
    50   xxx         yyy     zzz  kkk1   kkk5
    100  xxx         yyy     zzz  kkk1   kkk5
    250  xxx         yyy     zzz  kkk1   kkk5
    500  xxx         yyy     zzz  kkk1   kkk5

    statlog:
    N    Perceptron  LogReg  J48  kNN-1  kNN-5
    490  xxx         yyy     zzz  kkk1   kkk5
    -------------------------------------------------------

Where `xxx` gives the accuracy of the Perceptron, `yyy` gives the accuracy of Logistic Regression, etc.

**[Graph Comparison, 10 points]**

**TURN IN:**

Graphs of the results for `hw_gmm` and `hw_step`, plotting the performance of the five algorithms as a function of the training set size (known as a "learning curve"). I recommend using gnuplot, Excel, or MATLAB for constructing the graphs, as WEKA does not provide an easy way to do this.

For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the `plot` command. For Excel, you can plot the graphs using the table above and draw them with the chart wizard.
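As a sketch, the per-learner input files for gnuplot can be generated with a short script; the learner names and accuracy values below are placeholders, to be replaced with the numbers from your table:

```python
# Write one gnuplot-ready file per learner: each line is
# "<training set size> <accuracy>". The values here are
# placeholders, not real results.
results = {
    "perceptron": {25: 80.0, 50: 82.5, 100: 85.0, 250: 87.0, 500: 88.0},
    "logreg":     {25: 78.0, 50: 83.0, 100: 86.0, 250: 88.5, 500: 89.0},
}

def gnuplot_lines(accuracies):
    """Format a {size: accuracy} dict as sorted 'x y' lines."""
    return "\n".join(f"{n} {acc}" for n, acc in sorted(accuracies.items()))

for learner, accuracies in results.items():
    with open(f"{learner}.dat", "w") as f:
        f.write(gnuplot_lines(accuracies) + "\n")
```

Each file can then be drawn in gnuplot with, e.g., `plot 'perceptron.dat' with linespoints`.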

**[Decision Boundaries, 60 points]**

Each learner creates decision boundaries, and we often would like to know what these boundaries are. In some cases, such as Logistic Regression and J48, computing these boundaries is straightforward. In other cases, such as VotedPerceptron and Nearest Neighbor, this is not so easy and we need to use other means. These exercises are meant to help you understand how to get the decision boundaries from the learned models.

**[Logistic Regression, 20 points]** Let us consider the `hw_gmm_25` and `hw_step_50` training sets and the kind of decision boundaries that Logistic Regression found. To compute the decision boundary for Logistic Regression, recall that the logistic regression model has the form

    log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table of coefficients that looks like:

    Variable   Coeff.
    1          w1
    2          w2
    Intercept  w0
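The boundary itself is the set of points where both classes are equally likely, i.e. where the log-odds is zero: w0 + w1\*x1 + w2\*x2 = 0, which gives x2 = -(w0 + w1\*x1)/w2. A minimal sketch (the coefficient values here are made up for illustration, not taken from any actual run):

```python
def boundary_x2(x1, w0, w1, w2):
    """x2 coordinate of the logistic regression boundary at x1.

    The boundary is where w0 + w1*x1 + w2*x2 = 0, so
    x2 = -(w0 + w1*x1) / w2 (this assumes w2 != 0).
    """
    return -(w0 + w1 * x1) / w2

# Made-up coefficients, as they would be read off WEKA's table:
w0, w1, w2 = 1.0, 2.0, 4.0

# Two endpoints suffice to draw the (straight) boundary line:
endpoints = [(x1, boundary_x2(x1, w0, w1, w2)) for x1 in (-5.0, 5.0)]
```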

**TURN IN:**

(i, 10 points) Plot of the data points for `hw_gmm_25` with a line showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.

(ii, 10 points) Plot of the data points for `hw_step_50` with a line showing the learned decision boundary for Logistic Regression.

**[J48, 20 points]:** Now, let us consider the `hw_gmm_250` and `hw_step_250` training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:

    x1 <= 1.0: positive (75.0/17.0)
    x1 > 1.0
    |   x2 <= 5.0: negative (42.0/12.0)
    |   x2 > 5.0: positive (33.0/10.0)
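As a sanity check when reading the boundary off the tree, the example tree above translates directly into nested if-statements:

```python
def classify(x1, x2):
    """Direct transcription of the example J48 tree above."""
    if x1 <= 1.0:
        return "positive"   # leaf (75.0/17.0)
    if x2 <= 5.0:           # reached only when x1 > 1.0
        return "negative"   # leaf (42.0/12.0)
    return "positive"       # leaf (33.0/10.0)
```

The corresponding boundary consists of axis-parallel segments: the vertical line x1 = 1.0, plus the horizontal line x2 = 5.0 restricted to the region x1 > 1.0.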

The first line of the tree indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points, of which 17 were misclassified. Indentation indicates child nodes; the vertical bars are there to make the indentation easier to see.

**TURN IN:**

(i, 10 points): Plot of the data points for `hw_gmm_250` with lines showing the decision boundary learned by J48. **Note:** You should plot all separating lines chosen by J48. Further, highlight the line segments that actually separate the two classes.

(ii, 10 points): Plot of the data points for `hw_step_250` with lines showing the decision boundary learned by J48.

**[Nearest Neighbor, 20 points]:** Reconsider the

`hw_gmm_250` and `hw_step_250` training sets and the kind of decision boundaries found by IBk, for K=1 and K=5. To assist you with this, I have provided an additional file, `grid.arff`. This file contains 10201 points on a 0.1 grid for x in [-5,5] and y in [-5,5].

To compute the decision boundary for IBk, you should use the WEKA GUI. Select `grid.arff` as your "Supplied test set". After the training is complete, right-click on the last entry in the Result list and select "Visualize classifier errors". You can visualize the decision boundary by selecting "X: x (Num)" and "Y: y (Num)". All of the points in `grid.arff` are labeled Positive; incorrectly classified points are plotted by WEKA as red squares, and correctly classified points as blue x's. This will allow you to see the boundary.

However, to determine the points on the boundary, click the "Save" button and choose a file name in which to save the outputs. This file contains five comma-separated values per line: the second and third values give the X and Y coordinates of the point, the fourth value is the predicted class, and the fifth value is the correct class. You should write a program (or perl script) to find pairs of consecutive lines where the predicted class changes and the X coordinate does not change. These points will give an approximation to the decision boundary.

**TURN IN:**

(i, 5 points): Plot of the data points for

`hw_gmm_250` for IBk, for K=1.

(ii, 5 points): Plot of the data points for `hw_gmm_250` for IBk, for K=5.

(iii, 5 points): Plot of the data points for `hw_step_250` for IBk, for K=1.

(iv, 5 points): Plot of the data points for `hw_step_250` for IBk, for K=5.
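The boundary-finding program described above can be sketched as follows. It assumes the save-file layout described earlier (five comma-separated values per line, with X, Y, and the predicted class in positions two, three, and four) and that the saved lines keep the grid order:

```python
def boundary_points(lines):
    """Approximate the decision boundary from WEKA's saved grid output.

    Each line holds five comma-separated values; fields 2 and 3
    (1-based) are the X and Y coordinates, field 4 is the predicted
    class. A boundary point is reported midway between two consecutive
    grid points that share an X coordinate but differ in prediction.
    """
    rows = [line.split(",") for line in lines if line.strip()]
    points = []
    for prev, cur in zip(rows, rows[1:]):
        same_x = prev[1] == cur[1]
        changed = prev[3] != cur[3]
        if same_x and changed:
            x = float(prev[1])
            y = (float(prev[2]) + float(cur[2])) / 2.0
            points.append((x, y))
    return points
```

Plotting the returned (x, y) pairs on top of the training data gives the approximate boundary. (Swapping the roles of the X and Y fields in the comparison additionally catches boundary segments that run parallel to the Y axis.)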

## Obtaining Weka

You can obtain WEKA by visiting the WEKA Project Webpage and clicking on the appropriate link for your operating system. Alternatively, if you are on one of the CS systems, you may be able to access WEKA in `/usr/usc/weka` (I have asked that it be installed there). You can either run it using the "run-weka" command in that directory, or by typing `java -jar /usr/usc/weka/weka.jar` (make sure you are using java 1.4+).

## Using Weka

**Using the GUI**. These instructions describe how to apply the learning algorithms to the adult data set, used here as an example; the others can be processed in exactly the same way. When you start up Weka, you will first see the WEKA GUI Chooser, which has a picture of a bird (a weka) and four buttons. You should click on the Explorer button. This opens a large panel with several tabs, and the Preprocess tab will already be selected.

Click on "Open file...", then click on the "data" folder, and then select the "adult.arff" file. The "Current relation" window should now show "Relation" as ADULT with 30163 instances and 15 attributes. The table and bar plot on the right-hand side of the window should show 7508 examples in one class and 22654 in the other class (depending on how you converted the adult data set into arff format).

Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. First we will load in the test data. Click on the radio button "Supplied test set". Then click on the "Set..." button. A small "Test Instances" pop-up window should appear. Click on "Open file...", navigate to the "data" folder, and select "adult_test.arff". The Test Instances window should now show the relation "ADULT" with 15062 instances and 15 attributes. You may close this window at this point.

You don't need to specify which attribute to predict if the class is the last variable, which is the default (this will be the case if you converted the adult data set into arff using the same ordering of attributes). Otherwise, you can tell Weka which of the 15 attributes is the class variable: below the Test options panel, there is a drop-down menu with the entry "(Nom) XXX" selected, where XXX is the name of the last variable. Click on this and choose "(Nom) class" instead. [Num means numeric; Nom means nominal, i.e., discrete.]

Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top) which initially shows two buttons: "Choose" and "ZeroR". ZeroR is a very simple rule-learning algorithm (which we do not want). The general idea of this user interface is that if you click on "Choose" you can choose a different algorithm. If you click on "ZeroR" (or whatever algorithm name is displayed there), you can set the parameters for the algorithm.

Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains the general kinds of classifiers: "bayes", "functions", "lazy", "meta", "trees", "rules", and so on. For example, to choose Logistic Regression, choose "functions" and then "Logistic". To select the Perceptron algorithm, choose "functions" and then "VotedPerceptron".

Once we have chosen an algorithm, we are ready to run it. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Logistic Regression, this output consists of several sections:

- Run Information: Details of the data set
- Classifier model: The learned model.
- Evaluation on test set: This gives various statistics. The key item is the second one: Incorrectly Classified Instances will be expressed as a count and a percentage. You should report the percentages in your answer. One other item of interest comes at the very end: The Confusion Matrix. This shows how many false positive and false negative errors were made.

**Using the command-line**. You can also run weka from the command line. This is probably easier, as you can then run things in a batch script, which makes it easy to do all the runs.

**Basic Run of a Learner on a training and test set**:

    java -cp WEKAJAR LEARNER -t TRAINFILE -T TESTFILE

This command has four parameters:

- __WEKAJAR:__ This should point to the weka.jar that you have installed.
- __LEARNER:__ This is the class name of the learner you want to run. You will use four learner classes in this homework: `weka.classifiers.lazy.IBk`, `weka.classifiers.functions.Logistic`, `weka.classifiers.functions.VotedPerceptron`, and `weka.classifiers.trees.J48`.
- __TRAINFILE:__ This is the name of the file that contains the training set.
- __TESTFILE:__ This is the name of the file that contains the test set.

For example:

    java -cp weka.jar weka.classifiers.lazy.IBk -t hw_step-25.arff -T hw_step_test.arff

The output of these commands consists of the learned model as well as training and test statistics. You may want to save this output to a file and later extract what you need.

To run IBk with K=5 neighbors, add the `-K 5` option:

    java -cp weka.jar weka.classifiers.lazy.IBk -t hw_step-25.arff -T hw_step_test.arff -K 5

If you do not provide any options, you get a general usage message for that particular learner.

**Saving output and extracting accuracy**:

    java -cp WEKAJAR LEARNER -t TRAINFILE -T TESTFILE > OUTFILE
    grep Correctly OUTFILE | tail -1 | awk '{print $5}'

**Wrapping in a script**: You can wrap the above commands inside a shell or perl script to do all the training runs and extract all the accuracies. From there, generating the plots and tables should be straightforward.
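A sketch of such a wrapper in Python (file names follow the hw_gmm data listing above; `parse_accuracy` assumes WEKA's "Correctly Classified Instances" output line and takes the last occurrence, which is the test-set figure):

```python
import re
import subprocess

WEKAJAR = "weka.jar"  # path to your weka.jar

# Learner class names, with any per-learner options:
LEARNERS = {
    "Perceptron": ["weka.classifiers.functions.VotedPerceptron"],
    "LogReg":     ["weka.classifiers.functions.Logistic"],
    "J48":        ["weka.classifiers.trees.J48"],
    "kNN-1":      ["weka.classifiers.lazy.IBk", "-K", "1"],
    "kNN-5":      ["weka.classifiers.lazy.IBk", "-K", "5"],
}
SIZES = [25, 50, 100, 250, 500]

def parse_accuracy(output):
    """Return the last 'Correctly Classified Instances' percentage.

    WEKA prints this line once for the training data and once for
    the test data; the last occurrence is the test-set accuracy.
    """
    matches = re.findall(
        r"Correctly Classified Instances\s+\S+\s+(\S+)\s*%", output)
    return float(matches[-1])

def run_all():
    """Run every learner on every hw_gmm training file and print accuracies."""
    for size in SIZES:
        for name, learner in LEARNERS.items():
            cmd = ["java", "-cp", WEKAJAR, *learner,
                   "-t", f"hw_gmm_{size}.arff", "-T", "hw_gmm_test.arff"]
            result = subprocess.run(cmd, capture_output=True, text=True)
            print(size, name, parse_accuracy(result.stdout))

# Call run_all() once WEKAJAR and the data files are in place.
```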