Genome BioInformatics Research Lab

 
geneid HomePage
 
   
Contents

  1. What's geneid?
  2. Main features
  3. Examples
  4. Training geneid
  5. Accuracy
  6. Gene predictions on genomes
  7. Speed
  8. Source code distribution
  9. geneid parameter files
  10. geneid web server
  11. If you encounter problems ...
  12. References
  13. Authors and acknowledgements

What's geneid?

 
geneid is a program to predict genes in anonymous genomic sequences, designed with a hierarchical structure. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA. Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. geneid supports the integration of predictions from multiple sources via external gff files, and the general gene structure or model can also be redefined. The accuracy of geneid compares favorably to that of other existing tools, and geneid is likely more efficient in terms of speed and memory usage. Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp / hour) on an Intel(R) Xeon CPU 2.80 GHz processor.
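The last two steps described above (exon scoring and gene assembly) can be illustrated with a small Python sketch. It is not geneid's implementation: it assumes exons are plain (start, end, score) tuples, and it ignores reading-frame compatibility, exon types, strand and the gene model rules that geneid applies. The function names exon_score and assemble are hypothetical.

```python
from bisect import bisect_left

def exon_score(site_scores, coding_llr):
    """Sum of the defining site scores plus the coding log-likelihood ratio."""
    return sum(site_scores) + coding_llr

def assemble(exons):
    """Pick the non-overlapping set of exons with the highest total score.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (best_total_score, chain_of_exons).
    Simplification: only overlap is checked, not frame or exon type.
    """
    exons = sorted(exons, key=lambda e: e[1])      # order by end position
    ends = [e[1] for e in exons]
    best = [(0.0, [])]                             # best[i]: optimum over the first i exons
    for i, (start, end, score) in enumerate(exons):
        j = bisect_left(ends, start)               # exons[0:j] end strictly before this start
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Made-up candidate exons, for illustration only
candidates = [(1, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 300, 6.0)]
best_total, chain = assemble(candidates)
```

The chain is built by dynamic programming over exons sorted by end position, so each exon is considered once and combined with the best assembly that ends before it starts.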

Main features

  • The accuracy of geneid is comparable to that of other existing "ab initio" gene prediction tools.
  • geneid is very efficient in terms of speed and memory usage. In practice, geneid can analyze chromosome-size sequences at a rate of about 1 Gbp per hour on the Intel(R) Xeon CPU 2.80 GHz. For the largest human chromosome (chr1), it requires 1/2 Gbyte of RAM plus the size of the Fasta sequence.
  • geneid offers support to integrate predictions from multiple sources (ESTs, blast HSPs) and to reannotate genomic sequences, via external gff files and together with the redefinition of the "gene model".
  • geneid output can be customized to different levels of detail, including exhaustive listing of potential signals and exons. Furthermore, several output formats, such as gff or XML, are available.
  • Parameter files are available in geneid v 1.2 for Drosophila melanogaster, human (which can also be used for other vertebrate genomes), Dictyostelium discoideum and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many others for species spanning the four "classical" kingdoms. The additional currently available parameter files are listed in the section "geneid parameter files".
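Since gff is used both for geneid output and for external evidence, a minimal Python sketch of reading one gff line may be useful. It assumes the standard tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, group). The record in the example is made up for illustration and is not real geneid output; GffRecord and parse_gff_line are hypothetical names.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str   # kept as text because GFF allows "." for a missing value
    strand: str
    frame: str
    group: str

def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one tab-separated GFF line; return None for comments or short lines."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 8:
        return None
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], fields[7], group)

# Illustrative record only (values are invented)
record = parse_gff_line("seq1\tgeneid\tInternal\t100\t250\t3.2\t+\t0\tgene_1\n")
```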

Examples

 
SAMPLES:


FORMATS:

Training geneid

 
In order to build a parameter file for geneid it is necessary to "train" the program; parameter configurations already exist for a number of eukaryotic species. Training basically consists of computing position weight matrices (PWMs) or Markov models for the splice sites and start codons, and deriving a model for coding DNA (generally a Markov model of order 4 or 5). The basic requirements for a training set are an annotation file (preferably in geneid gff format) and a set of fasta sequences corresponding to the gene models in the annotation file.
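To make the idea of a position weight matrix concrete, here is a minimal Python sketch. It assumes the training sites are already aligned, of equal length, and contain only A, C, G and T. Scoring against a uniform background with a pseudocount is an assumption made for this example, not necessarily what the geneid training scripts do. The names build_pwm and score_site are hypothetical.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # log-odds of the observed frequency against the (assumed uniform) background
        pwm.append({b: math.log((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def score_site(pwm, seq):
    """Score a candidate site as the sum of its per-position log-odds."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy donor-like sites, invented for illustration
donor_sites = ["GTAAGT", "GTGAGT", "GTAAGA"]
donor_pwm = build_pwm(donor_sites)
```

A candidate that resembles the training sites gets a higher score than an unrelated sequence of the same length.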

As few as 100 gene models could be enough to build a reasonably accurate geneid parameter file, but a user would generally want as many sequences as possible (> 500) to build an optimally accurate matrix and to be able to set aside some of the gene models for testing purposes (see training document).

If a user wants to evaluate the accuracy of the newly developed parameter file, she will also require an annotation file and fasta files corresponding to the sequences in the evaluation set. However, if a user has only a limited number of gene models to train geneid with (generally < 500 sequences), she can use a "leave-one-out" strategy for evaluating the accuracy (more information in the training tutorial).
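The "leave-one-out" idea can be sketched in a few lines of Python. This is a generic illustration, not the procedure from the training tutorial: train and evaluate are hypothetical callables supplied by the user, standing in for building a parameter file and measuring accuracy on the held-out gene model.

```python
def leave_one_out(gene_models):
    """Yield (training_set, held_out_model) pairs, one per gene model."""
    for i, held_out in enumerate(gene_models):
        yield gene_models[:i] + gene_models[i + 1:], held_out

def leave_one_out_accuracy(gene_models, train, evaluate):
    """Average the accuracy obtained on each held-out gene model.

    train(training_set) -> model and evaluate(model, held_out) -> float
    are placeholders for the user's own training and evaluation steps.
    """
    scores = [evaluate(train(training), held_out)
              for training, held_out in leave_one_out(gene_models)]
    return sum(scores) / len(scores)
```

Each gene model is used for testing exactly once, so no sequence is ever evaluated with a parameter file that was trained on it.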

The user can go through an example of a typical geneid "training" protocol (Training geneid for the parasite Perkinsus marinus) by following this tutorial.

Gene predictions on genomes

 
This link contains the sets of genes predicted by geneid on recently sequenced genomes (Drosophila melanogaster, Homo sapiens, Mus musculus, Fugu rubripes and Dictyostelium discoideum) for some of their most common releases.

Accuracy

 
Because of the lack of well-annotated large genomic sequences, it is difficult to assess the accuracy of "ab initio" gene finders. We have attempted to analyze the accuracy of geneid on a number of different sets. We believe that in the analysis of large genomic sequences geneid may be superior to other existing tools. A side-by-side comparison with genscan can be found here.

Speed

 
The benchmark sequence is human chromosome 1 (239 Mb), extracted from the goldenPath-UCSC assembly (July 2003 release):

Computer: Intel Pentium Intel(R) Xeon CPU 2.80 GHz, 4 GB RAM
CPU / real time: 1025 / 1045 secs
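As a quick check, the benchmark figures above can be converted to a throughput in Gbp per hour. The short Python calculation below uses only the numbers quoted in this section (239 Mb, 1025 CPU seconds); the function name is hypothetical.

```python
def throughput_gbp_per_hour(length_mb, cpu_seconds):
    """Convert a sequence length in Mb and a CPU time in seconds to Gbp/hour."""
    return length_mb / 1000.0 * 3600.0 / cpu_seconds

rate = throughput_gbp_per_hour(239, 1025)
print(f"{rate:.2f} Gbp/hour")  # prints "0.84 Gbp/hour"
```

The result, roughly 0.84 Gbp per hour, is in line with the rate of about 1 Gbp per hour quoted elsewhere on this page.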

Source code distribution

 
The geneid distribution contains several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other extra information.

All of the files can be obtained from our ftp server:

Cumulative change log: ChangeLog

geneid v 1.4.4 (current development version):

  • geneid v 1.4.4 full distribution: source code and documentation
    (documentation does not yet reflect new features; for help, type geneid -h)
    [DOWNLOAD]

  • Note: Please, verify the check-sum file value
    Type: md5sum geneid_v1.4.4.Jan_13_2011.tar.gz
    -> 05c00f283a8fa996418aff0bc8db1c6d
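On systems where the md5sum command is not available, the same check can be done with a few lines of Python using the standard hashlib module. This is a minimal sketch; the function names are hypothetical, and the file is assumed to be already downloaded to the working directory.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """True if the file's MD5 digest matches the published check-sum."""
    return md5_of_file(path) == expected_hex.lower()
```

For example, verify("geneid_v1.4.4.Jan_13_2011.tar.gz", "05c00f283a8fa996418aff0bc8db1c6d") should return True for an intact download of the v 1.4.4 distribution.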


geneid v 1.3 preview release 3 (version used for NGASP phase II category 4):

  • geneid v 1.3 full distribution: source code and documentation
    (documentation does not yet reflect new features; for help, type geneid -h)
    [DOWNLOAD]

  • Note: Please, verify the check-sum file value
    Type: md5sum geneid_v1.3.Mar_30_2007.tar.gz
    -> 10cad4e6ae25a57fcc6bb062692626ae


geneid v 1.3 preview release 1 (version used for NGASP phase I category 1):

  • geneid v 1.3 full distribution: source code and documentation
    (documentation does not yet reflect new features; for help, type geneid -h)
    [DOWNLOAD]

  • Note: Please, verify the check-sum file value
    Type: md5sum geneid_v1.3.Dec_21_2006.tar.gz
    -> 1ff0f870e5ec5a553e4603102a9d7c62


geneid v 1.2:

  • geneid v 1.2 full distribution: source code and documentation
    [DOWNLOAD]

  • Note: Please, verify the check-sum file value
    Type: md5sum geneid_v1.2.March_1_2005.tar.gz
    -> 6f350210ead7e49ac76be1fd17ef91f9


  • geneid v 1.2 Solaris 64-bits distribution
    (Makefiles optimized by Mithun Sridharan, Sun Microsystems GmbH)
    [FULL VERSION - DOWNLOAD] [BINARY FILE]
  • geneid v 1.2 Linux binary (gcc version 3.3.1)
    [DOWNLOAD]
  • geneid v 1.2 documentation (HTML)
    [READ]

Instructions to install geneid on your computer.


Old releases:

geneid v 1.1:

  • geneid v 1.1 full distribution: source code and documentation
    [DOWNLOAD]
  • geneid v 1.1 Linux binary (gcc version 2.95 19990728)
    [DOWNLOAD]
  • geneid v 1.1 documentation (HTML)
    [DOWNLOAD] [READ]

geneid v 1.0:

  • geneid v 1.0 full distribution: source code and documentation
    [DOWNLOAD]
  • geneid v 1.0 binary files for some architectures
    Linux, SGI and Solaris.
  • geneid v 1.0 documentation (PostScript)
    [DOWNLOAD]

geneid v 1.0 (Parallel version): -- Requires UNIX/LINUX pthreads library --

  • geneid Parallel full distribution: source code and documentation
    [DOWNLOAD]

geneid parameter files

 
geneid has been trained on several species and it is being trained on other genomes as well. See this help for more details about the different parts of parameter files as well as their statistical meaning.

- - The parameter files for geneid v 1.2 are not compatible with previous versions - -
- - The parameter files for geneid v 1.3 and 1.4 are not backward-compatible with previous versions; however, version 1.2 parameter files ARE forward-compatible with versions 1.3 and 1.4 - -

List of available parameter files (geneid v 1.3 and 1.4):

  • Homo sapiens (suitable for vertebrates) (UPDATED - January 2nd, 2007)
  • Drosophila melanogaster (suitable for fly and mosquito) (UPDATED - January 2nd, 2007)
  • Acyrthosiphon pisum (This version of the aphid parameter file detects GC donors and requires geneid v 1.3 and above)


List of available ANIMAL parameter files (geneid v 1.2 and above):



List of available PROTIST parameter files (geneid v 1.2 and above):



List of available PLANT parameter files (geneid v 1.2 and above):



List of available FUNGI parameter files (geneid v 1.2 and above):



List of available parameter files for OLDER VERSION OF GENEID (geneid v 1.1):



Web server

 
A geneid web server is available to submit sequences over the Internet. There is no limit on the length of the submitted sequence, other than that imposed by the Internet (except when plotting is required).

If you encounter problems...

 
If you encounter problems using geneid, or have suggestions on how to improve it, send an e-mail to geneid@crg.es.

References

 

  • E. Blanco, G. Parra and R. Guigó,
    "Using geneid to Identify Genes.",
    In A. Baxevanis, editor:
    Current Protocols in Bioinformatics. Unit 4.3.
    John Wiley & Sons Inc., New York (2002) (in press)

  • E. Blanco, G. Parra, S. Castellano, J.F. Abril, M. Burset, X. Fustero, X. Messeguer and R. Guigó
    "Gene Prediction in the Post-Genomic Era."
    IXth ISMB (Poster), Copenhagen, Denmark (2001)

  • G. Parra, E. Blanco, and R. Guigó,
    "Geneid in Drosophila",
    Genome Research 10(4):511-515 (2000).

  • R. Guigó,
    "Assembling genes from predicted exons in linear time with dynamic programming",
    Journal of Computational Biology, 5:681-702 (1998).

  • R. Guigó, S. Knudsen, N. Drake, and T. F. Smith,
    "Prediction of gene structure",
    Journal of Molecular Biology, 226:141-157 (1992).

Authors and acknowledgements

 
The current version of geneid has been written by Enrique Blanco, Tyler Alioto and Roderic Guigó.
The parameter files have been constructed by Genis Parra, Tyler Alioto and Francisco Camara.
With contributions from Josep F. Abril, Moises Burset and Xavier Messeguer.



This training tutorial document was prepared by: Francisco Camara.
Copyright © 2002

geneid is under GNU General Public License.

 