paqclass


Classification using PAQ8

Made by Byron Knoll in 2013.

PAQclass is a classification algorithm based on a powerful compression program called PAQ8. PAQ8 is slow but gets state of the art compression rates. For more information about PAQ8, see http://cs.fit.edu/~mmahoney/compression/#paq

So, how does data compression help with classification? If two sets of data compress well together, they probably share common patterns. Compression rate can therefore act as a similarity metric between two sets of data. PAQclass uses this similarity metric to efficiently measure the distance between test and training data to create classifications. In general, better compression rates lead to better classifications (so PAQ8 outperforms gzip on classification tasks). For more details on how PAQclass works and how it compares to other classification algorithms, see my master's thesis: http://www.byronknoll.com/thesis.pdf

PAQclass works well for classifying sequential data - such as text documents or time series data. It supports binary or multi-class classification. If your data is continuous, discretizing the data points into a small number of buckets will probably improve performance.

PAQclass is designed for Linux. To compile, just type "make". In order to use PAQclass you will need to organize your dataset into a specific folder structure:

Training data: a separate folder should be created for training data. For each label/class in your data, create a subfolder in this training folder. For example, if you have a binary classification task you could have the folder structure: "train/class0" and "train/class1". Put your training data files into the folders that correspond with their label.

Test data: a separate folder should be created for test data. Each data point/document that you want classified should be an individual file in this folder.

To run PAQclass, you just need to specify the location of the training and test folders: "./runner /path/to/train /path/to/test". The output gets printed to standard output, so you can direct it to a file with: "./runner train test > output.tsv". The output is a tab separated file:

The first row contains the column names. Subsequent rows contain data for each test file. The first column is the file name. The second column is the classification. In case you want more control over the output class (such as choosing a threshold for a binary classification task), the corresponding weights for each label are output in the subsequent columns. The weights are the cross-entropy rate of PAQ8 - so a lower value means that the label is a better match. The class that is chosen for the second column is simply the label with the lowest value.

In case you want more control over the performance of PAQclass, you can specify an optional third parameter for the compression level: "./runner train test 2 > output.tsv". This parameter should be between 1 and 8: 1 = fastest, worst compression, least memory usage 8 = slowest, best compression, most memory usage

Project Information

The project was created on May 19, 2013.