Genome BioInformatics Research Lab

   
SUPPLEMENTARY MATERIALS FOR

Comparative gene prediction
in human and mouse


G. Parra, P. Agarwal, J.F. Abril,
T. Wiehe, J.W. Fickett and R. Guigó *.


Genome Research 13(1):108-117 (Jan 1, 2003)


* To whom correspondence should be addressed.
Email: rguigo@imim.es. Ph: +034 93-224-0877.

Summary

 
SGP2 is a program that predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx (WU-Blast), a sequence similarity search program, with geneid, an "ab initio" gene prediction program. In "asymmetric" mode, genes are predicted in a sequence from one species (the target sequence) using a set of one or more sequences from the other species (the reference set). Essentially, geneid is used to predict all potential exons along the target sequence. Exon scores are computed as log-likelihood ratios that are a function of the splice sites defining the exon, of the coding bias in the composition of the exon sequence as measured by a Markov model of order five, and of the optimal alignment at the amino acid level between the target exon sequence and its homologous counterpart in the reference set. From the set of predicted exons, the gene structure (possibly multiple genes on both strands) is assembled so as to maximize the sum of the scores of the assembled exons.
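
The assembly step described in the summary can be illustrated with a minimal sketch. It assumes a much simplified model in which each candidate exon is only an interval with a precomputed score, and the only constraint is that chained exons must not overlap. The actual geneid/SGP2 assembly also has to respect reading frame, exon type and strand, none of which is modeled here; the names `Exon` and `best_chain` are hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based inclusive start on the target sequence
    end: int      # 1-based inclusive end
    score: float  # assumed precomputed: sites + coding bias + similarity


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring set of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    # best[i] = (score, chain) using only the first i exons in `ordered`
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, exon in enumerate(ordered):
        # Number of earlier exons that end strictly before this one starts.
        j = bisect_left(ends, exon.start, 0, i)
        take_score = best[j][0] + exon.score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [exon]))
        else:
            best.append(best[i])
    return best[-1]
```

Because an exon is only added when it raises the total, exons with negative scores are never selected in this simplified version.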




SGP2 Test Sets

IMOG dataset

This is a list of 15 pairs of single-gene sequences, with little overlap with the Sanger Center data set [Jareborg et al., Genome Research 9(9):815, 1999]. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).


You can download a tarball containing all of the above from here.



BI dataset

These are three pairs of multigene sequences. Annotation is not available for all the sequences and, where available, is of unknown reliability.


You can download a tarball containing all of the above from here.



SCIMIT dataset

This set contains 129 pairs of single-gene sequences and combines non-overlapping sequences from the IMOG (see above), Sanger Center [Jareborg et al., Genome Research 9(9):815, 1999] and MIT [Batzoglou et al., Genome Research 10(7):950, 2000] data sets. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).


You can download a tarball grouping all of the above from here.

Gene predictions: FINISHED HOMOLOGOUS SEQUENCES

Finished Orthologous

SGP2 predictions on the eight human/mouse homologous sequences retrieved from "http://pipeline.lbl.gov/TESTS/" (including MHC). Unfortunately, that URL is no longer available. We have therefore added a column (see Sequences + Annotations) to the table below, containing the human and mouse fasta files and the corresponding human annotations that we obtained from there.

Each human sequence was compared against the corresponding homologous mouse sequence.

We introduced a few changes in the PostScript maps:

  • We show the "real" length of annotated genes (taking into account first and last UTR coords), but we still display only the annotated CDSs.
  • As we ran our programs on the original masked sequence, we are displaying only the masked regions for each sequence without labeling them (in the central axes of each block).
  • We also included Twinscan results for the human/mouse homologous set.


  DATASET   Sequences +    TBLASTX      SGP2           PostScript
            Annotations    Results      Predictions    Maps
  BTK       FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  CFTR      FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  DFNA5     FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  ELN       FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  HOXa      FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  KvLQT1    FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  MHC       FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  SIL       FA+RS          TBX  HSP     GFF  GTF2      A4  A3
  ALL       --             TBX  HSP     GFF  GTF2      --  --
  TARBALL   FA+RS          TBX  HSP     GFF  GTF2      A4  A3

FA+RS The set of human and mouse fasta sequences (masked and unmasked), plus the human RefSeqs mapped onto the human sequences in GFF format. There are two GFF files for each region: "*.pipeline_refseq.gff", which has the original annotations produced at Berkeley, and "*.Korf_refseqs.gff", which contains the subset of hand-curated annotations for the same regions (except for the MHC region, which was too big). Those annotations were curated by Ian Korf; see further information at: http://sapiens.wustl.edu/~ikorf/annotation/
TBX contains the raw tblastx (WU-Blast) results of each human sequence against the corresponding homologous mouse sequence. tblastx was run with -nogap and the blosum62 matrix, where the penalty for aligning with stop codons was set to -500.
HSP contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
GFF `General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2 `Gene Transfer Format' (GTF) borrows from GFF but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF and is described in detail at this link.
A4/A3 contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on A4 or A3 paper, but we recommend A3 for viewing with ghostview or similar programs.
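
The stop-codon penalty mentioned under TBX can be sketched as a small transformation of a substitution matrix file. The sketch assumes the common NCBI-style text layout (comment lines starting with `#`, a header row of residue symbols, then one row per residue, with `*` standing for a stop codon). The function name `penalize_stops` is hypothetical, and since the text above does not say whether the `*` versus `*` cell was also changed, this version simply rewrites every cell that involves `*`.

```python
def penalize_stops(matrix_text: str, penalty: int = -500) -> str:
    """Set every score involving the stop symbol '*' to `penalty`."""
    out = []
    header = None
    for line in matrix_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)          # keep comments and blank lines as-is
            continue
        fields = line.split()
        if header is None:
            header = fields           # first data line lists column symbols
            out.append(line)
            continue
        row, scores = fields[0], fields[1:]
        new_scores = [
            str(penalty) if row == "*" or col == "*" else s
            for col, s in zip(header, scores)
        ]
        # Rows are rewritten as whitespace-separated, right-justified cells.
        out.append(" ".join([row] + [s.rjust(4) for s in new_scores]))
    return "\n".join(out)
```

All other scores are copied through unchanged, so the result differs from the input matrix only in the stop-codon row and column.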


Finished human vs. mouse reads

SGP2 predictions on the eight human sequences retrieved from "http://pipeline.lbl.gov/TESTS/" (including MHC) against the mouse WGS 3X (a database of about 13 million mouse reads). Unfortunately, that URL is no longer available; see the previous table for the sequences and annotations of the orthologous datasets.

Here we used human fasta sequences that were masked slightly differently from those used in the human/mouse orthologous section.


  SEQUENCES   TBLASTX      SGP2           PostScript
              Results      Predictions    Maps
  BTK         TBX  HSP     GFF  GTF2      A4  A3
  CFTR        TBX  HSP     GFF  GTF2      A4  A3
  DFNA5       TBX  HSP     GFF  GTF2      A4  A3
  ELN         TBX  HSP     GFF  GTF2      A4  A3
  HOXa        TBX  HSP     GFF  GTF2      A4  A3
  KvLQT1      TBX  HSP     GFF  GTF2      A4  A3
  MHC         TBX  HSP     GFF  GTF2      A4  A3
  SIL         TBX  HSP     GFF  GTF2      A4  A3
  ALL         TBX  HSP     GFF  GTF2      --  --
  TARBALL     TBX  HSP     GFF  GTF2      A4  A3

TBX contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 13 million mouse reads. tblastx was run with -nogap and the blosum62 matrix, where the penalty for aligning with stop codons was set to -500.
HSP contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
GFF `General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2 `Gene Transfer Format' (GTF) borrows from GFF but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF and is described in detail at this link.
A4/A3 contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on A4 or A3 paper, but we recommend A3 for viewing with ghostview or similar programs.
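
Since the HSP files use blast-style frames (1, 2, 3) where GFF normally uses 0, 1, 2, a generic GFF reader may reject or misread them. The following is a minimal sketch of a line parser that accepts the blast convention, assuming the standard tab-separated GFF column order (seqname, source, feature, start, end, score, strand, frame, group). The names `GffRecord` and `parse_gff_line` are hypothetical, and the exact content of the group column in these files is not described here, so it is kept as an opaque string.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: Optional[int]
    group: str


def parse_gff_line(line: str, blast_frames: bool = True) -> Optional[GffRecord]:
    """Parse one GFF line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
    frame = None if cols[7] == "." else int(cols[7])
    # Blast-style files use 1..3; plain GFF uses 0..2.
    allowed = (1, 2, 3) if blast_frames else (0, 1, 2)
    if frame is not None and frame not in allowed:
        raise ValueError(f"frame {frame} not in {allowed}")
    return GffRecord(
        seqname=cols[0], source=cols[1], feature=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6], frame=frame,
        group=cols[8] if len(cols) > 8 else "",
    )
```

The parser deliberately does not convert frames between the two conventions, because the mapping depends on how the downstream tool interprets the frame column.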

Gene predictions: HUMAN CHROMOSOME 22

 
This section contains SGP2 predictions on human chromosome 22. Chromosome 22 annotation was compiled by Victoria Haghighi from the Columbia Genome Center. The data were downloaded from http://www.cs.columbia.edu/~vic/sanger2gbd.
There are two sets of SGP2 predictions. The first is the raw predictions along the whole chromosome 22 sequence (Homology Only). The second is a set of predictions confined to regions void of annotated genes or pseudogenes (Homology + Evidences); the goal here is to predict novel genes while minimizing chimeric predictions. In this case, annotations are taken from the Combined Gene + CDS Set (879 genes).
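
One way to picture the "regions void of annotated genes or pseudogenes" is as the complement of the annotated intervals along the chromosome. The sketch below computes that complement; it is only an illustration of the idea, since this page does not say how the restriction was actually implemented in the Homology + Evidences run. The function name `unannotated_regions` and the coordinate convention (1-based, inclusive) are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def unannotated_regions(annotated: List[Interval], seq_length: int) -> List[Interval]:
    """Return the gaps left on [1, seq_length] after removing `annotated`."""
    gaps: List[Interval] = []
    next_free = 1  # first position not yet covered by an annotation
    for start, end in sorted(annotated):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= seq_length:
        gaps.append((next_free, seq_length))
    return gaps
```

Overlapping or nested annotations are handled by tracking the furthest annotated position seen so far, so they never produce spurious gaps.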


  TBLASTX          geneid         SGP2 Predictions                      PostScript
  Results          Predictions    Homology only   Homology + Evidences  Maps
  TBX  HSP  SR     GFF  GTF2      GFF  GTF2       GFF  GTF2             Not Available Yet

TBX contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 19 million mouse reads (WGS). tblastx was run with the following parameters: -nogap, Z=3000000000, E=0.01, W=5, B=10000, V=10000, -hspmax=4, -topcomboN=4, -filter=xnu, and a modified blosum62 matrix where the penalty for aligning with stop codons was set to -500.
HSP contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast).
SR contains the similarity regions in GFF format (but with frames 1,2,3 as in blast), as projected from the HSPs (see how they are obtained and how they influence the exon scores in the SGP2 algorithm description page).
GFF `General Feature Format' (GFF) is described on the Sanger Centre gff definition page.
GTF2 `Gene Transfer Format' (GTF) borrows from GFF but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF and is described in detail at this link.
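
The tblastx parameters listed under TBX can be collected into a single argument list, which makes a run easier to script and reproduce. This is a sketch that assumes the usual WU-BLAST invocation order (program, database, query, then options). The database and query names are hypothetical placeholders, and the option that selects the modified blosum62 matrix is left out because its exact form is not given on this page.

```python
import shlex
from typing import List

# Options exactly as listed for the chromosome 22 run.
TBLASTX_OPTIONS = [
    "-nogap", "Z=3000000000", "E=0.01", "W=5", "B=10000", "V=10000",
    "-hspmax=4", "-topcomboN=4", "-filter=xnu",
]


def tblastx_command(database: str, query: str) -> List[str]:
    """Build an argument list: program, database, query, then options."""
    return ["tblastx", database, query, *TBLASTX_OPTIONS]


# Hypothetical file names, for illustration only.
cmd = tblastx_command("mouse_reads", "chr22.masked.fa")
print(shlex.join(cmd))
```

Keeping the options in one list means the same settings can be reused for every query sequence, and the list can be passed directly to `subprocess.run` without shell quoting issues.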

Whole-Genome Gene-Predictions

 
The results of SGP2 on human and mouse genomes are available from our new Gene-Prediction section. Follow these links to download them:

Homo sapiens   SGP2 results on H.sapiens based on M.musculus MGSC version-3 assembly

Version of the Human genome used:
   golden_path_20011222 (22nd of December 2001).

Version of the Mouse genome used:
   goldenPath assembly (mmFeb2002-MGSCv3-February, 2002). Predictions were obtained on the masked version of the genome. These are the predictions for the v3 of the mouse genome assembly. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Human genome against the Mouse genome.
  SGP2 results on H.sapiens based on M.musculus Sanger Phusion assembly

Version of the Human genome used:
   golden_path_20010806 (6th of August 2001).

Version of the Mouse genome used:
   sanger_phusion_20011109 (9th of November 2001). Predictions were obtained on the masked version of the genome. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Human genome against the Mouse genome.

Mus musculus   SGP2 results on M.musculus based on H.sapiens December Golden Path assembly

Version of the Mouse genome used:
   goldenPath assembly (mmFeb2002-MGSCv3-February, 2002).

Version of the Human genome used:
   golden_path_20011222 (22nd of December 2001). Predictions were obtained on the masked version of the genome. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Mouse genome (v3) against the Human genome.

 