GenLASSO_C: the Generalized LASSO Classifier

Updated version V1.1 !!

License:

This software is free only for non-commercial use. It must not be modified and distributed without prior permission of the author. The author is not responsible for implications from the use of this software. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY.

General Description:

This is an implementation of the Generalized LASSO method for classification. Details can be found in
Roth, V. (2002), The Generalized LASSO: a wrapper approach to gene selection for microarray data. University of Bonn, Dep. Computer Science III, Tech. Report IAI-TR-2002-8.
Please cite this reference if you use this program.

The user must provide a file which contains the feature values for each sample (one feature per line, see below). Standardizing each feature to zero mean and unit variance is usually a very good idea. The optimal l_1 constraint value kappa is found automatically by averaging over randomly chosen 80%/20% training/test splits of the dataset. The prediction performance of the finally selected model is estimated on the basis of new randomly chosen splits.
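The model selection described above can be pictured with the following minimal Python sketch. It is not the code used inside GenLASSO_C: the function names are made up for illustration, test_error is a hypothetical callback standing in for "train with constraint kappa on the training part, return the error on the test part", and drawing fresh splits for every kappa value is an assumption.

```python
import random

def kappa_grid(initial, increment, n_increments):
    # Assumption: the grid starts at `initial` and contains `n_increments` values.
    return [initial + i * increment for i in range(n_increments)]

def random_split(n_samples, train_fraction=0.8, rng=random):
    # Shuffle the sample indices and cut them into a training and a test part.
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def select_kappa(n_samples, grid, n_splits, test_error, rng=random):
    # test_error(kappa, train_idx, test_idx) -> float is a hypothetical callback.
    mean_error = {}
    for kappa in grid:
        errors = []
        for _ in range(n_splits):
            train_idx, test_idx = random_split(n_samples, rng=rng)
            errors.append(test_error(kappa, train_idx, test_idx))
        mean_error[kappa] = sum(errors) / len(errors)
    # The selected kappa is the one with the smallest average test error.
    best = min(mean_error, key=mean_error.get)
    return best, mean_error
```

With 72 samples, random_split yields 58 training and 14 test indices per split.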

A demo application for microarray data (sample classification) is included.

In order to obtain nonlinear variants of the Generalized LASSO, the features may also be preprocessed by a Mercer kernel, resulting in an (N x N) input file, where N is the number of samples. RBF kernels usually work best.
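As an illustration of the kernel preprocessing, here is a small Python sketch that computes an RBF kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2) from N samples. The function name and the width parameter gamma are assumptions for this example; the package itself does not prescribe a kernel width. Since K is symmetric, it does not matter whether its rows or its columns are written as the "features".

```python
import math

def rbf_kernel_matrix(samples, gamma):
    # samples: list of N feature vectors (one per sample), all of equal length.
    # gamma: kernel width parameter (hypothetical name, must be chosen by the user).
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            squared_distance = sum((a - b) ** 2
                                   for a, b in zip(samples[i], samples[j]))
            # The matrix is symmetric, so fill both entries at once.
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * squared_distance)
    return kernel
```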

Installation:

Download GenLASSO-1.1.tar.gz. Unzip/untar this file:
gunzip GenLASSO-1.1.tar.gz
tar xfv GenLASSO-1.1.tar
This will install the sources into a directory called GenLASSO-1.1.

The Generalized LASSO classifier uses part of the donlp2 optimization package by P. Spellucci (to be exact, the donlp2_ansi_c.tar.gz version rewritten in ANSI C by Serge Schoeffert). The latter can be downloaded from http://plato.la.asu.edu/donlp2.html. Please unzip/untar donlp2_ansi_c.tar.gz and copy donlp2.c, user_eval.c and all header files *.h into your GenLASSO-1.1 directory before running the make program.

Edit the Makefile and, if necessary, adjust your compiler settings.

Type make, close your eyes, stop breathing and hope for the best. If you are lucky, this should create the executable GenLASSO_C. Otherwise follow your intuition. Re-start breathing in any case. To avoid panic, however, eyes might be kept closed. If nothing helps, send me an e-mail.

Usage:

GenLASSO_C <file_name>, where <file_name> specifies a configuration file.

Syntax of configuration file:

Line 1: <number of samples> <number of features>
-> The problem dimensions.

Line 2: <initial constraint value> <constraint increment> <number of increments>
-> Specifies the grid search for the optimal l_1 constraint value kappa.

Line 3: <number of cross-validation iterations>
-> Number of training/test splits of the dataset used for model selection and assessment.

Line 4: <data filename>
-> Name of the file containing both the class labels (first line, either 0 or 1) and the (pre-processed) expression levels (one line per gene). See also the example data file golub.dat.

Line 5: <annotation filename>
-> Name of the file containing feature (gene) annotations, one per line. See also the example file golub_names.dat.

Line 6: <reduce flag>
-> If set to 1, the model is re-trained on a subset of genes, consisting of those genes which have stability scores > 0.05 in the first run.

See also the included example configuration file golub_config.dat.
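The two input files described above could be generated with a Python sketch like the following. The function names are made up for this example, and separating the values on a line by single blanks is an assumption (compare with golub.dat and golub_config.dat before relying on it). The grid and cross-validation numbers in the usage note are illustrative only and are not taken from golub_config.dat.

```python
def format_config(n_samples, n_features, kappa_init, kappa_step, n_steps,
                  n_cv_iterations, data_file, annotation_file, reduce_flag):
    # One entry per configuration line, in the order listed above.
    lines = [
        f"{n_samples} {n_features}",
        f"{kappa_init} {kappa_step} {n_steps}",
        f"{n_cv_iterations}",
        data_file,
        annotation_file,
        f"{int(bool(reduce_flag))}",
    ]
    return "\n".join(lines) + "\n"

def format_data_file(labels, feature_rows):
    # First line: class labels (0 or 1); then one line per feature (gene),
    # holding that feature's value for every sample.
    if any(label not in (0, 1) for label in labels):
        raise ValueError("class labels must be 0 or 1")
    for row in feature_rows:
        if len(row) != len(labels):
            raise ValueError("each feature line needs one value per sample")
    lines = [" ".join(str(label) for label in labels)]
    lines += [" ".join(str(value) for value in row) for row in feature_rows]
    return "\n".join(lines) + "\n"
```

For example, format_config(72, 1479, 1.0, 0.5, 10, 50, "golub.dat", "golub_names.dat", 1) returns the text of a six-line configuration file for a dataset with 72 samples and 1479 features, which can then be written to disk with an ordinary file write.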

Demo:

Preprocessed data from (Golub et al., 1999):
The original dataset consists of 72 samples, of which 47 are acute lymphoblastic leukemia (ALL) and 25 are acute myeloid leukemia (AML). Expression levels of 7129 genes were measured using Affymetrix high-density oligonucleotide arrays. We preprocessed the data by firstly excluding genes with mostly negative intensity values, and secondly by excluding genes with max/min < 2 and max-min < 1000. This leaves us with a reduced set of 1479 genes. Finally, the data were log-transformed, "squashed" through a tanh function (for outlier reduction), and standardized to zero mean and unit variance across samples. Class labels and expression values are summarized in the file golub.dat.
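For users who want to prepare their own data in a similar way, the steps above can be sketched in Python as follows. This is a rough reconstruction and not the script that produced golub.dat. Several details are assumptions that the description leaves open: "mostly negative" is read as "negative in more than half of the samples", a gene is dropped only if both the max/min and the max-min criterion hold, intensities are floored at a hypothetical value before taking the logarithm, the log values are centered before the tanh, and the variance is computed with divisor N.

```python
import math

def keep_gene(values):
    # Assumption: "mostly negative" means negative in more than half of the samples.
    negatives = sum(1 for v in values if v < 0)
    if negatives > len(values) / 2:
        return False
    highest, lowest = max(values), min(values)
    # Assumption: the ratio test is only applied when the minimum is positive.
    low_ratio = lowest > 0 and highest / lowest < 2
    low_spread = highest - lowest < 1000
    # Literal reading of the text: drop the gene if both criteria hold.
    return not (low_ratio and low_spread)

def standardize(values):
    # Zero mean and unit variance across samples (divisor N is an assumption).
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]

def transform_gene(values, floor=1.0, scale=1.0):
    # floor and scale are hypothetical parameters, not taken from the original.
    logged = [math.log(max(v, floor)) for v in values]
    mean = sum(logged) / len(logged)
    squashed = [math.tanh((x - mean) / scale) for x in logged]
    return standardize(squashed)
```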

To run the demo application, type GenLASSO_C golub_config.dat.
Results are summarized in the file relevant_genes.dat: model parameters, cross-validation error (with standard deviation), average number of "relevance vectors" (RVs), stability scores and gene annotations.

Graphical output is provided by the file rg.pgm. The error rates for thresholding by different confidence levels and the fraction of samples contained in the doubt class are printed in file doubt.dat. Format per line: <threshold> <error rate> <fraction of samples in doubt class>.
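The contents of doubt.dat can be read back for plotting or further analysis with a few lines of Python. The sketch below assumes that the three values on each line are separated by whitespace and silently skips lines that do not match this pattern; the function name is made up for this example.

```python
def parse_doubt_lines(lines):
    # Each valid line: <threshold> <error rate> <fraction of samples in doubt class>
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or unexpected lines
        try:
            threshold, error_rate, doubt_fraction = (float(f) for f in fields)
        except ValueError:
            continue  # skip lines with non-numeric entries
        rows.append((threshold, error_rate, doubt_fraction))
    return rows
```

Typical use would be to open doubt.dat and pass the file object itself, for example rows = parse_doubt_lines(open("doubt.dat")).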