This is an implementation of the Generalized LASSO method
for classification. Details can be found in
Roth, V. (2002), The Generalized LASSO: a wrapper approach to gene selection
for microarray data.
University of Bonn, Dep. Computer Science III,
Tech. Report IAI-TR-2002-8.
Please cite this reference if you use this program.
The user must provide a file containing the feature values for each sample (one feature per line, see below). Standardizing each feature to zero mean and unit variance is usually a very good idea. The optimal l_1 constraint value kappa is found automatically by averaging over randomly chosen 80%/20% training/test splits of the dataset. The prediction performance of the finally selected model is then estimated on new randomly chosen splits.
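The standardization step can be sketched in a few lines of Python. This is only an illustration and not part of the GenLASSO distribution: the function name standardize_rows is hypothetical, each inner list is assumed to hold one feature measured across all samples, and the population variance (division by N) is an assumption.

```python
import math


def standardize_rows(rows):
    """Scale each row (one feature across all samples) to zero mean and unit variance."""
    scaled = []
    for row in rows:
        n = len(row)
        mean = sum(row) / n
        # Population variance (division by n) is an assumption of this sketch.
        variance = sum((x - mean) ** 2 for x in row) / n
        std = math.sqrt(variance)
        if std == 0.0:
            # A constant feature cannot be scaled; it is only centred.
            scaled.append([0.0] * n)
        else:
            scaled.append([(x - mean) / std for x in row])
    return scaled


# Two toy features measured on three samples.
features = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
scaled = standardize_rows(features)
```

Constant features are mapped to zeros here to avoid a division by zero; dropping them before running the classifier may be the better choice.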
A demo application for microarray data (sample classification) is included.
In order to obtain nonlinear variants of the Generalized LASSO, the features may also be preprocessed by a Mercer kernel, resulting in a (N x N) input file. RBF-kernels usually work best.
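As an illustration of the kernel preprocessing, the following Python sketch computes an (N x N) RBF kernel matrix from N sample vectors. It is not part of the distribution: the function name rbf_kernel_matrix and the width parameter gamma are assumptions, and the kernel is taken in the common form k(x, y) = exp(-gamma * ||x - y||^2).

```python
import math


def rbf_kernel_matrix(samples, gamma):
    """Return the N x N matrix K with K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            squared_distance = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            value = math.exp(-gamma * squared_distance)
            kernel[i][j] = value
            kernel[j][i] = value  # the kernel matrix is symmetric
    return kernel


# Three toy samples with two features each; gamma = 0.5 is an arbitrary choice.
K = rbf_kernel_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], gamma=0.5)
```

Each row of K then takes the role of one "feature" in the input file, so the number of features equals the number of samples.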
Download GenLASSO-1.1.tar.gz.
Unzip/untar this file:
gunzip GenLASSO-1.1.tar.gz
tar xfv GenLASSO-1.1.tar
This will install the sources into a directory called GenLASSO-1.1.
The Generalized LASSO classifier uses part of the donlp2 optimization package by P. Spellucci (to be exact, the donlp2_ansi_c.tar.gz version rewritten in ANSI C by Serge Schoeffert). The latter can be downloaded from http://plato.la.asu.edu/donlp2.html. Please unzip/untar donlp2_ansi_c.tar.gz and copy donlp2.c, user_eval.c and all header files *.h into your GenLASSO-1.1 directory before running the make program.
Edit the Makefile and, if necessary, adjust your compiler settings.
Type make, close your eyes, stop breathing and hope for the best. If you are lucky, this should create the executable GenLASSO_C. Otherwise follow your intuition. Re-start breathing in any case. To avoid panic, however, eyes might be kept closed. If nothing helps, send me an e-mail.
Usage: GenLASSO_C <file_name>, where <file_name> specifies a configuration file.
Syntax of configuration file:
Line 1: <number of samples> <number of features>
-> The problem dimensions.
Line 2: <initial constraint value> <constraint increment> <number of increments>
-> Specifies the grid search for the optimal l_1 constraint value kappa.
Line 3: <number of cross-validation iterations>
-> Number of training/test splits of the dataset used for model selection and assessment.
Line 4: <data filename>
-> Name of the file containing both the class labels (first line, either 0 or 1) and the (pre-processed) expression levels (one line per gene). See also the example data file golub.dat.
Line 5: <annotation filename>
-> Name of the file containing feature (gene) annotations, one per line. See also the example file golub_names.dat.
Line 6: <reduce flag>
-> If set to one, the model is re-trained on a subset of genes, consisting of those genes which have stability scores > 0.05 in the first run.
See also the included example configuration file golub_config.dat.
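The six-line layout described above can be generated with a small Python helper. This is a sketch under assumptions: the function name format_config is hypothetical, fields on one line are assumed to be separated by single spaces, and the grid and cross-validation values in the example are purely illustrative (they are not the contents of the shipped golub_config.dat). Only the dimensions 72 x 1479 and the file names are taken from this document.

```python
def format_config(n_samples, n_features, kappa_start, kappa_step, n_steps,
                  n_cv_iterations, data_file, annotation_file, reduce_flag):
    """Return the text of a six-line GenLASSO_C configuration file."""
    lines = [
        f"{n_samples} {n_features}",              # line 1: problem dimensions
        f"{kappa_start} {kappa_step} {n_steps}",  # line 2: grid search for kappa
        f"{n_cv_iterations}",                     # line 3: number of training/test splits
        data_file,                                # line 4: data file
        annotation_file,                          # line 5: annotation file
        f"{1 if reduce_flag else 0}",             # line 6: reduce flag
    ]
    return "\n".join(lines) + "\n"


# Illustrative values only: the kappa grid (1.0, 0.5, 10) and the 50 iterations
# are assumptions, not recommended settings.
config_text = format_config(72, 1479, 1.0, 0.5, 10, 50,
                            "golub.dat", "golub_names.dat", True)
print(config_text, end="")
```

The returned text can be written to a file of your choice and passed to GenLASSO_C as <file_name>.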
Preprocessed data from (Golub et al., 1999):
The original dataset consists of 72 samples, of which 47 are acute
lymphoblastic leukemia (ALL) and 25 are acute myeloid leukemia (AML).
Expression levels of 7129 genes were measured using Affymetrix high-density
oligonucleotide arrays. We preprocessed the data by first excluding genes
with mostly negative intensity values, and then excluding genes with
max/min < 2 and max-min < 1000. This leaves us with a reduced set of 1479
genes. Finally, the data were log-transformed, "squashed" through a
tanh function (for outlier reduction), and standardized to zero mean and
unit variance across samples. Class labels and expression values are
summarized in file golub.dat.
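The variation filter mentioned above can be sketched as follows. This is not the script that produced golub.dat. The function names are hypothetical, and the rule is read literally: a gene is excluded only if both max/min < 2 and max-min < 1000 hold. The ratio test is written as max < 2 * min so that genes with a non-positive minimum are never flagged by it, which is an assumption of this sketch.

```python
def is_low_variation(values, min_ratio=2.0, min_spread=1000.0):
    """True if max/min < min_ratio and max - min < min_spread (literal reading)."""
    lowest, highest = min(values), max(values)
    # highest < min_ratio * lowest equals max/min < min_ratio for positive minima
    # and avoids a division by zero (assumption of this sketch).
    return highest < min_ratio * lowest and highest - lowest < min_spread


def filter_genes(rows):
    """Return the indices of the genes (rows) that pass the variation filter."""
    return [i for i, row in enumerate(rows) if not is_low_variation(row)]


# Toy intensities for three genes across three samples.
genes = [
    [100.0, 150.0, 180.0],   # ratio 1.8, spread 80: excluded
    [100.0, 900.0, 1500.0],  # ratio 15: kept
    [2000.0, 3000.0, 3500.0],  # ratio 1.75 but spread 1500: kept
]
kept = filter_genes(genes)
```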
To run the demo application, type GenLASSO_C golub_config.dat.
Results (including stability scores and annotations) will be summarized in
the file relevant_genes.dat: model parameters, cross-validation error
(standard deviation), average number of "relevance vectors" (RVs), stability
scores and gene annotations.
Graphical output is provided by the file rg.pgm. The error rates for thresholding by different confidence levels and the fraction of samples contained in the doubt class are printed in file doubt.dat. Format per line: <threshold> <error rate> <fraction of samples in doubt class>.
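A file in the three-column format of doubt.dat can be read with a few lines of Python. This is a sketch that assumes whitespace-separated numeric columns; the function name parse_doubt_lines is hypothetical, and the sample values below are made up for illustration, not actual program output.

```python
def parse_doubt_lines(lines):
    """Parse lines of the form '<threshold> <error rate> <fraction in doubt class>'."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        threshold, error_rate, doubt_fraction = (float(f) for f in fields)
        records.append((threshold, error_rate, doubt_fraction))
    return records


# Made-up sample content; with a real run, pass the open file instead:
#     with open("doubt.dat") as handle:
#         records = parse_doubt_lines(handle)
sample = ["0.5 0.04 0.0", "0.7 0.02 0.1", "", "0.9 0.0 0.25"]
records = parse_doubt_lines(sample)
```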