Welcome to AutoClass@IJM
the web server for the AutoClass Bayesian clustering system.

Developed by F. Achcar (1,2) and D. Mestivier (1) in collaboration with J.M. Camadro (2)
We kindly ask users to cite this paper when publishing results derived from the use of AutoClass@IJM.





AutoClass@IJM
Enter your email*
Retype your email
Load example files
Example One: Download the dataset or get an example of the output.
Example Two: Download the dataset or get an example of the output.
Or upload your data files*
Real Scalar: singly bounded real values, typically bounded below at zero (Ex: Length)
Experiment file
Header
Relative error on data (Optional -- default = 0.01 i.e. 1%)
Real Location: other real values (Ex: Elevation, microarray log ratio)
Experiment file
Header
Absolute error on data (Optional -- default = 0.01 e.g. 1.23 ± 0.01, -10.35 ± 0.01, ...)
Discrete data
Experiment file
Header

*Neither of these pieces of information is stored or repurposed.

Help
File Format
The experiment file should be tab-delimited, with or without a header. Missing values are allowed (see YBL001C in the example below). If there is a header, it must have a title for each column. The first column is a row id (it must be shorter than 30 characters and contain no space or blank character).
Example :
  • YORF [tab] Experiment1 [tab] Experiment2
  • YAL001C [tab] 0.01 [tab] 1.2
  • YBL001C [tab][tab] 0.5
Warning: please check that no column has the same value on every line.
AutoClass can handle an “unlimited” number of lines and a maximum of 999 columns (default setting of AutoClass; the maximum number of columns may be increased upon user’s request).
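The rules above can be checked locally before uploading. The following is a minimal sketch, not the validation actually performed by the server: the function name `check_experiment_file` is hypothetical, and it assumes that the 999-column limit counts data columns only (excluding the row id) and that values are compared as text.

```python
MAX_ID_LENGTH = 30   # row ids must be shorter than 30 characters
MAX_COLUMNS = 999    # default limit; assumed here to exclude the id column


def check_experiment_file(text, has_header=True):
    """Return a list of problems found in tab-delimited experiment data."""
    problems = []
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ["file is empty"]

    if has_header:
        header = rows.pop(0)
        if any(not title.strip() for title in header):
            problems.append("header: every column needs a title")
        width = len(header)
    else:
        width = len(rows[0])
    if not rows:
        problems.append("no data lines")
        return problems
    if width - 1 > MAX_COLUMNS:
        problems.append(f"too many columns: {width - 1} > {MAX_COLUMNS}")

    for number, row in enumerate(rows, start=1):
        row_id = row[0]
        if len(row) != width:
            problems.append(
                f"data line {number}: expected {width} fields, found {len(row)}"
            )
        if (not row_id or len(row_id) >= MAX_ID_LENGTH
                or any(char.isspace() for char in row_id)):
            problems.append(f"data line {number}: invalid row id {row_id!r}")

    # A column whose non-missing values are all identical carries no information.
    for col in range(1, width):
        values = {row[col] for row in rows if col < len(row) and row[col] != ""}
        if len(values) <= 1:
            problems.append(f"column {col + 1}: same value on every line")
    return problems
```

Calling `check_experiment_file(open("data.tsv").read())` (with a hypothetical file name) returns an empty list when none of these checks fails. An empty field between two tabs is treated as a missing value, as in the YBL001C line of the example.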
Outputs
  • A tab-delimited file that associates each row id with the number of its class.
  • Two CDT files to view the results in Java TreeView-like software (if there is only numerical data; otherwise use your favorite spreadsheet): one contains the experimental data and the probabilities for each item to belong to the different classes; the second contains only the experimental data.
  • Log files.
As soon as the job is completed, all these files are zipped. Then, a URL to this zip-archive is sent by e-mail for the user to download. The zip-archive will be stored on our web server for 5 days before being deleted.
Note that currently, the CDT files are annotated only for Saccharomyces cerevisiae genes (the row id is used for other types of data).
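Once the archive is downloaded, the class-assignment file can be summarised with a few lines of code. This is only a sketch under stated assumptions: the exact layout of the file is not described here, so the code assumes two tab-separated fields per line (row id, then class number) and no header line. The function name `class_sizes` is hypothetical.

```python
from collections import Counter


def class_sizes(text):
    """Count how many row ids were assigned to each class.

    Assumed layout (not confirmed by the documentation): one line per row id,
    with the id in the first field and the class number in the second.
    """
    sizes = Counter()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[1].strip()] += 1
    return dict(sizes)
```

If the real file carries a header line or extra columns, the field indices need to be adjusted accordingly.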

Tutorial
AutoClass@IJM provides a web interface to the powerful clustering software AutoClass, developed at the NASA Ames Research Center. AutoClass is an unsupervised Bayesian classification system which has many powerful features:
  • it determines the number of classes automatically,
  • it can use mixed discrete and real valued data,
  • it can handle missing values,
  • it computes class membership probabilities.
For a summary of the AutoClass algorithm, see this figure. For more information, please see the articles by Cheeseman et al. (see the References section).
Our web interface aims to simplify the use of AutoClass by:
  • providing an easy-to-use interface (AutoClass is a command line program which requires the user to write several configuration files),
  • providing computational resources,
  • post-processing the AutoClass output files (generating easy-to-use output files, see Help).
The user only needs to upload their data files using the web interface and wait to receive the outputs by e-mail.
Input files
AutoClass can handle three different types of data:
  • real scalars: real numbers singly bounded, for instance length, weight, etc.
  • real location: real numbers distributed on both sides of an origin, such as Cartesian coordinates (in this case, the origin is 0.0), microarray log ratios, elevation (where sea level is the origin), etc.
  • discrete data: any qualitative data, such as chromosome number, phenotype, eye color, etc.
AutoClass can handle an “unlimited” number of lines and a maximum of 999 columns (default setting of AutoClass; the maximum number of columns may be increased upon user’s request).
Error parameter
As quoted in the AutoClass documentation files:

"The fundamental question in all of this is: "To what extent do you believe the numbers that are to be given to AutoClass?" AutoClass will run quite happily with whatever it is given. It is up to the user to decide what is meaningful and what is not. [...]
It turns out that a constant error in the logarithm of a value is equivalent to a relative error in the original value. That is, the error in the value should be proportional to the value, rather than being itself a constant. And REL_ERROR is just the ratio of the error to the value. If your knowledge of the data generating process is sufficient to specify such a ratio, just give it as the value of REL_ERROR. Otherwise give your estimate of the constant error as ERROR, and AutoClass will compute the ratio of this to the average data value and use this as REL_ERROR."

For practical purposes:

  1. when dealing with the 'real scalar' data type, the error parameter is relative: it is expressed as a fraction of the value and must be less than 1 (i.e. less than 100%). E.g. a 0.02 relative error corresponds to: (i) for a datum of 100, a precision of 100 ± 2, (ii) for a datum of 1, a precision of 1 ± 0.02.
  2. when dealing with the 'real location' data type, the error parameter is a constant value. E.g. a 0.02 absolute error corresponds to: (i) for a datum of 100, a precision of 100 ± 0.02, (ii) for a datum of 1, a precision of 1 ± 0.02.
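The two cases can be made concrete with a short sketch that reproduces the arithmetic of the examples above. It is only an illustration: the function name `precision_interval` and the labels "real_scalar" and "real_location" are hypothetical and are not AutoClass parameters.

```python
def precision_interval(value, error, data_type):
    """Return the (low, high) interval implied by an error parameter.

    data_type is "real_scalar" (relative error, a fraction of the value)
    or "real_location" (absolute error, a constant).
    """
    if data_type == "real_scalar":
        if not 0 < error < 1:
            raise ValueError("relative error must be between 0 and 1")
        half_width = abs(value) * error   # error grows with the value
    elif data_type == "real_location":
        if error <= 0:
            raise ValueError("absolute error must be positive")
        half_width = error                # same error for every value
    else:
        raise ValueError(f"unknown data type: {data_type!r}")
    return (value - half_width, value + half_width)
```

For a datum of 100 and an error of 0.02, the 'real scalar' case gives approximately (98, 102), whereas the 'real location' case gives approximately (99.98, 100.02).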

Note that the impact of the error parameter on classification depends strongly on the structure of the dataset and, as mentioned above, the parameter should be set according to the confidence in the data to be classified. For example, in some experiments with real scalar data ranging from 0 to 10 000, the datum 80 may be considered not different from 100 (and an error of 0.2 is acceptable); in other experiments, with real scalar data ranging from 0 to 10, the datum 1 may be known to be very different from 1.1, and an error of 0.01 is required.

If the error parameter entered by the user is too large with respect to the data, the error message generated by AutoClass is interpreted and an e-mail is sent to the user with the AutoClass log file attached.

Submit your files
  1. For each type of data, the user can submit one file (the three "Experiment file" fields in the above form) or leave the field blank. The first column of each file contains ids (see Help/File Format). After uploading, the files will be concatenated according to their ids (for numerical data, if there is more than one line for a given id, the mean is computed).
  2. Indicate whether there is a header line in each data file.
  3. Click the "process" button.
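The concatenation described in step 1 can be pictured with the following sketch. It is not the server's code: the function names `read_numeric` and `merge_by_id` are hypothetical, the input is assumed to be headerless numerical data with the same number of fields on every line, and an id that is absent from one file is assumed to yield missing values for that file's columns.

```python
from collections import defaultdict
from statistics import mean


def column_mean(values):
    """Mean of the non-missing values, or None if all are missing."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


def read_numeric(text):
    """Parse headerless tab-delimited text into {row id: averaged values}.

    Lines sharing a row id are averaged column by column; empty fields are
    treated as missing values (None).
    """
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        row_id, *fields = line.split("\t")
        grouped[row_id].append([float(f) if f.strip() else None for f in fields])
    return {
        row_id: [column_mean(col) for col in zip(*lines)]
        for row_id, lines in grouped.items()
    }


def merge_by_id(*tables):
    """Concatenate tables side by side on row id (absent rows become None)."""
    widths = [len(next(iter(t.values()))) if t else 0 for t in tables]
    merged = {}
    for row_id in sorted(set().union(*tables)):
        row = []
        for table, width in zip(tables, widths):
            row.extend(table.get(row_id, [None] * width))
        merged[row_id] = row
    return merged
```

With one table containing two lines for the same id, `read_numeric` keeps a single averaged row for that id, and `merge_by_id` then places the columns of each file next to each other. Discrete data are left out of this sketch because averaging does not apply to them.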

Example files

We provide two example files for you to test AutoClass@IJM (these examples can be loaded here):
Example One
A file with real values :
The data are from Yoshimoto's paper: Yoshimoto H. et al. (2002) "Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae." J Biol Chem. 277(34):31079-88 (GEO DataSet: GSE3456).
Example Two
Two files: one with real values and the other with discrete values as test examples for clustering of heterogeneous data.
The data come from the French National Institute for Health Watch (I.N.V.S. http://www.invs.sante.fr ): one file reporting incidence and mortality rates for cancer (real values) and one file reporting the location of the primary cancer and the gender of the populations (discrete values) (source: Belot A. et al. (2008) "Cancer incidence and mortality in France over the period 1980-2005." Rev Epidemiol Sante Publique. 56(3):159-75).

References
  • The command line version of AutoClass is available at the NASA website.
  • Cheeseman P, Kelly J, Self M, Stutz J, Taylor W, et al. (1988) AutoClass: A Bayesian classification system. NASA Ames Research Center. NASA-TM-107903.
  • Cheeseman P, Stutz J (1996) Bayesian Classification (AutoClass): Theory and Results. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in Knowledge Discovery and Data Mining. Cambridge, MA: AAAI Press/MIT.

Citing AutoClass@IJM
We kindly ask users to cite the following paper when publishing results derived from the use of AutoClass@IJM:
AutoClass@IJM: a powerful tool for Bayesian classification of heterogeneous data in biology
Fiona Achcar; Jean-Michel Camadro; Denis Mestivier
Nucleic Acids Research 2009; doi: 10.1093/nar/gkp430

Development

To ask a question or report a problem, please contact Achcar F. or Mestivier D.

({achcar,mestivier}[AT]ijm.univ-paris-diderot.fr)
1) "Modelling in Integrative Biology" group
2) "Protein Engineering and Metabolic Control" group
Institut Jacques Monod