In this lab, we will experiment with linearly separable and linearly inseparable data using an open-source implementation of support vector machines in C.
We will be using the open-source package SVMlight for this assignment. The website for the package is http://svmlight.joachims.org/. This code allows one to train SVMs with both linearly separable data and linearly inseparable data. You can choose from the three kernel functions given in class for linearly inseparable data.
Create a directory called svm_light and change into that directory. Use the wget command to download the current source code for SVMlight and follow the directions on the author's page to unpack the tarball and compile the code.
Create an additional directory called data and change into that directory. Copy the two programs that create datasets for this lab (svm_dataset.c and holdout.c) into that directory, then compile and run them:
    gcc -o svm_dataset svm_dataset.c
    gcc -o holdout holdout.c
    ./svm_dataset > plain_data
    ./holdout plain_data
    ./svm_dataset -i 50 > insep_data
    ./holdout insep_data
Make several datasets, both linearly separable and linearly inseparable. Copy the resulting *.train and *.test files up one directory to the svm_light directory.
Run the svm_learn program on the *.train datasets and the svm_classify program on the *.test datasets (do this one dataset at a time, not on all datasets at once). Note the accuracy, precision, and recall of svm_classify on the testing datasets.
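svm_classify reports these statistics itself, but it can help to see how they are defined. The following is a minimal sketch in C and is not part of SVMlight: the names confusion_t, tally, accuracy, precision, and recall are made up for illustration. It assumes class labels of +1 and -1, and it assumes that a positive decision value in the predictions file counts as a +1 prediction (how a value of exactly zero is handled is an assumption here).

```c
#include <assert.h>
#include <stddef.h>

/* Confusion-matrix counts for a two-class problem with labels +1 and -1. */
typedef struct {
    size_t tp; /* predicted +1, actually +1 */
    size_t fp; /* predicted +1, actually -1 */
    size_t tn; /* predicted -1, actually -1 */
    size_t fn; /* predicted -1, actually +1 */
} confusion_t;

/* Tally the counts from the true labels and the real-valued decision
 * values. Assumption: a value greater than zero is a +1 prediction. */
confusion_t tally(const int *labels, const double *scores, size_t n)
{
    confusion_t c = {0, 0, 0, 0};
    assert(n == 0 || (labels != NULL && scores != NULL));
    for (size_t i = 0; i < n; i++) {
        int predicted_pos = scores[i] > 0.0;
        int actual_pos = labels[i] > 0;
        if (predicted_pos && actual_pos)
            c.tp++;
        else if (predicted_pos)
            c.fp++;
        else if (actual_pos)
            c.fn++;
        else
            c.tn++;
    }
    return c;
}

/* Fraction of all examples that were classified correctly. */
double accuracy(confusion_t c)
{
    size_t total = c.tp + c.fp + c.tn + c.fn;
    return total ? (double)(c.tp + c.tn) / (double)total : 0.0;
}

/* Fraction of +1 predictions that were actually +1. */
double precision(confusion_t c)
{
    size_t denom = c.tp + c.fp;
    return denom ? (double)c.tp / (double)denom : 0.0;
}

/* Fraction of actual +1 examples that were predicted +1. */
double recall(confusion_t c)
{
    size_t denom = c.tp + c.fn;
    return denom ? (double)c.tp / (double)denom : 0.0;
}
```

Each function returns 0.0 when its denominator is zero; that is a convention chosen for this sketch, not something svm_classify is known to do.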
Try different options for svm_learn, as specified on the program's website. In particular, see how the kernel selection affects the accuracy, precision, and recall for the testing datasets.
Report the datasets used (including the options passed to svm_dataset), the svm_learn options used to generate model files, and the resulting statistics for svm_classify on those model files.
If you had a dataset that showed particular difficulty with classification, convert that dataset file into a CSV file using the following vi substitution command (create a copy of the file first and edit the copy, e.g. cp insep_data insep.csv):
    :1,$s/ [0-9][0-9]*:/, /g
Use a simple visualization technique, such as OpenOffice scatter plots, to visualize the overlap in the data. Here is an example of the visualization for a randomly generated linearly inseparable file: nonsep.pdf. Upload your visualization file and the dataset.
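For reference, the vi substitution used for the CSV conversion replaces every " index:" prefix (a space, one or more digits, and a colon) with ", ". A rough C equivalent for a single line might look like the sketch below. The function name svmlight_line_to_csv is hypothetical, and the sketch assumes the data lines have the form "label index:value index:value ..." with no trailing comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rewrite one data line as CSV, mirroring the vi command
 *     :1,$s/ [0-9][0-9]*:/, /g
 * The caller must supply an output buffer of at least strlen(in) + 1
 * bytes; each match is at least three characters long and is replaced
 * by two, so the result is never longer than the input. */
void svmlight_line_to_csv(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == ' ') {
            /* Count the digits that follow the space. */
            size_t digits = strspn(in + 1, "0123456789");
            if (digits > 0 && in[1 + digits] == ':') {
                *out++ = ',';
                *out++ = ' ';
                in += 1 + digits + 1; /* skip space, digits, and colon */
                continue;
            }
        }
        *out++ = *in++; /* copy everything else unchanged */
    }
    *out = '\0';
}
```

Applied to each line of a copy of the dataset, this should give the same comma-separated columns as the vi command: the class label first, followed by one column per feature.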