Genome Informatics Research Lab

Tandem chimerism as a means to increase protein complexity in the human genome

G. Parra, A. Reymond, N. Dabbouseh, E.T. Dermitzakis, R. Castelo, T.M. Thomson, S.E. Antonarakis, and R. Guigó*

Genome Research, in press

*To whom correspondence should be addressed.


Contents


This site describes and provides all the programs and data used to predict chimeric transcripts in the human and mouse genomes.


EST whole genome analysis



First, we searched for fusion transcripts in the complete genomes using expression evidence. For each pair of adjacent, non-overlapping coding genes, we looked for ESTs that at least partially overlap coding regions in each of the two genes. This analysis was performed on both the human and the mouse genomes.
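
The search described above can be illustrated with a minimal sketch. It is not the pipeline used for the paper: the `Gene` record, the function names, and the rule that "adjacent" means consecutive genes when sorted by start coordinate are all assumptions made for illustration. Coordinates are treated as 1-based closed intervals, as in GFF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; field names are illustrative only."""
    gene_id: str
    chrom: str
    strand: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    cds: tuple   # coding exons as (start, end) pairs


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def tandem_pairs(genes):
    """Yield consecutive, non-overlapping gene pairs on the same strand.

    Assumption: adjacency is decided after sorting all genes of a
    chromosome by start coordinate, whatever their strand.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: (g.start, g.end))
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            if left.strand == right.strand and left.end < right.start:
                yield left, right


def est_links_pair(est_blocks, left, right):
    """True if the EST alignment blocks overlap coding exons of both genes."""
    hits_left = any(overlaps(b, e) for b in est_blocks for e in left.cds)
    hits_right = any(overlaps(b, e) for b in est_blocks for e in right.cds)
    return hits_left and hits_right


# Toy data, not taken from the real datasets.
gene_a = Gene("A", "chr1", "+", 100, 500, ((150, 200), (400, 450)))
gene_b = Gene("B", "chr1", "+", 800, 1200, ((850, 900), (1100, 1150)))
gene_c = Gene("C", "chr1", "-", 1500, 1800, ((1550, 1600),))

for left, right in tandem_pairs([gene_a, gene_b, gene_c]):
    linked = est_links_pair([(420, 450), (850, 880)], left, right)
    print(left.gene_id, right.gene_id, linked)
```

In this toy example only the pair A and B is reported, because C lies on the opposite strand, and the EST counts as linking evidence because its blocks touch a coding exon of each gene.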

  • Human Apr. 2003 Assembly UCSC Genome Browser

    • Gene pairs in the same orientation (tandem pairs): 7,679
      (The file contains the genomic coordinates in GFF format and the IDs of both RefSeqs)

    • Tandem pairs without intervening genes from other collections ("bona fide" tandem pairs): 6,369
      (The file contains the genomic coordinates in GFF format and the IDs of both RefSeqs)

    • Tandem pairs with ESTs linking the two transcripts: 1,288
      (The file contains the adjacent RefSeq IDs in the first and second fields, followed by the IDs of all ESTs that cross both RefSeqs)

    • Tandem pairs with ESTs overlapping coding exons from each transcript in the pair: 176
      (format: RefSeq1_id RefSeq2_id EST_ids)

    • "Bona fide" TICs after re-alignment of the ESTs to the genome using Spidey: 127
      (format: RefSeq1_id RefSeq2_id EST_ids)

    • Previously described TICs:

      Gene 1 name (RefSeq)    Gene 2 name (RefSeq)    EST or cDNA
      GALT (NM_000155)        IL11Ra (NM_004512)      AL558734
      CYP2C18 (NM_000772)     CYP2C19 (NM_000769)     L07093
      VPS72 (NM_005997)       TMOD4 (NM_013353)       AI763406
      SSF1 (NM_020230)        P2RY11 (NM_002566)      AJ300588

    • Conservation of the ORF across the TIC : 46

    Of the 46 cases in which the ORF was conserved, we tested 32 cases in which the TIC did not include additional "intergenic" exons. Of these, 11 yielded specific amplification products.


    NM1 CODE   Oligo 1                NM2 CODE   Oligo 2                  SEQ CNF  BR HE KI SP LI CO SI MU LU ST TE PL
    NM_007221  TCCCAGAGAAGGATCTGCAC   NM_000711  GTGCAGAGTCCAGCAAAGGT     OK       1  1  0  1  1  0  0  0  0  1  0  0
    NM_000269  GAGACCAACCCTGCAGACTC   NM_002512  CTCGAAGCGCTTGATGATCT     OK       1  0  0  0  1  0  0  0  1  1  1  1
    NM_015996  CTCCTCCATCGCCATGTT     NM_003186  CACTGCACTATGATCCACTCC    OK       1  0  0  0  1  0  0  0  0  0  1  0
    NM_020892  GTTCACTGCCAGAGGGTTTC   NM_030570  CGAATGTCAGCCAGTGTCTC     OK       0  0  0  0  1  0  0  0  0  0  0  0
    NM_002157  GGCAGGACAAGCGTTTAGAA   NM_015387  AAAGGATTCATCAGGCCAAT     OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_031925  TGCTGGAATGGGAGAAGACT   NM_016086  ATGGAGCTGGTGCAATTGTT     OK       1  1  0  0  0  0  0  0  0  0  0  0
    NM_153268  ATGGAGTTGGTGATGTGCAA   NM_145753  GCTGAGGCTCTCCATCATGT     OK       0  0  0  0  1  0  0  0  0  0  0  0
    NM_172341  TCCTGCCTTTTCTCTGGTTG   NM_019104  TCCAGCAGACACTGCAAGAC     OK       0  0  0  0  0  0  0  0  0  0  1  0
    NM_000175  GCACCAAGATGATACCCTGTG  NM_032346  AATACACCTGCACGACCAGA     OK       0  1  0  0  0  0  0  0  0  1  0  0
    NM_003838  GCCCTTCAATGTTTGGAAAA   NM_015978  TCAGTTCTTTTTCCTTGATCTGC  OK       1  1  0  0  0  0  0  1  0  0  1  0
    NM_025247  GCCTTGGATATAGCCATGATT  NM_000690  CCTGACAGATGACCTCTCCA     OK       0  0  0  1  0  0  0  0  0  0  0  0

    TABLE COLUMNS DESCRIPTION

    The columns of the table above are:

    • NM1 CODE: RefSeq NM code for the first gene.
    • Oligo 1: oligo used for the amplification, from the first gene.
    • NM2 CODE: RefSeq NM code for the second gene.
    • Oligo 2: oligo used for the amplification, from the second gene.
    • SEQ CNF: "OK" if RT-PCR succeeded and the sequence of the fusion gene was confirmed.
    • Results on tissues: RT-PCR result in each of the tissues tested in this experiment (1, shown in "green", for success; 0, shown in "red", for failure). The tissues were: brain (BR), heart (HE), kidney (KI), spleen (SP), liver (LI), colon (CO), small intestine (SI), muscle (MU), lung (LU), stomach (ST), testis (TE), and placenta (PL).

    Positive genes are those with "OK" in the "SEQ CNF" column.


  • Mouse mm4 Assembly UCSC Genome Browser

    • Number of contiguous gene pairs on the same strand: 7,893
      (The file contains the genomic coordinates in GFF format and the IDs of both RefSeqs)

    • Pairs of genes with at least one EST crossing the boundaries of the two known RefSeqs: 1,670
      (The file contains the adjacent RefSeq IDs in the first and second fields, followed by the IDs of all ESTs that cross both RefSeqs)

    • Pairs of genes with at least one EST crossing the boundaries and with at least one overlapping exon:

      • Original mapping with BLAT: 208
        (format: chromosome EST_id RefSeq1_id number_supported_exons RefSeq2_id number_supported_exons)
      • ESTs re-mapped with Spidey: 135
        (format: chromosome EST_id RefSeq1_id number_supported_exons RefSeq2_id number_supported_exons)
      • Spidey-aligned pairs after filtering on alignment strand and requiring more than one EST exon crossing both RefSeqs: 73

    • Pairs of genes with at least one EST crossing the boundaries, at least one overlapping exon for each gene, and an ORF crossing the junction between the two genes: 19



    From this set of chimeric transcripts, we performed RT-PCR on cDNAs from different tissues, and 8 cases could be confirmed experimentally. The RT-PCR oligos and the tissues in which amplification was observed are listed below.


    NM1 CODE   Oligo 1                  NM2 CODE   Oligo 2                SEQ CNF  BR HE KI TH LI ST MU LU TE SK EY OV
    NM_008866  GGCCGATCAACAGTGCTAA      NM_011541  TCTTTCAGCAAATCCAATGC   OK       0  0  1  0  1  0  0  1  1  1  0  1
    NM_010724  GGGTGCCCTCTATCCAGAGT     NM_011530  CCTGAGGCCTCCTTCTCTCT   OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_021419  CCCACAGTTTCTGCTCCTTC     NM_028791  GACGATGTTCCCTCCACAAG   OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_026380  CTCTCCACGGAAGAAGCAAC     NM_011267  CAGGTTCTCCTCGCTGAACT   OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_027903  GGTGGTGAATGGAGAGAGGA     NM_007527  GTAGCAAAAAGGCCCCTGTC   no       0  0  0  0  0  0  0  0  0  0  0  0
    NM_028873  AGGCCCAACAGTTCAGTACC     NM_025364  TGGTCTCTAAACCACGAGCA   OK       0  0  0  1  1  0  0  1  1  0  0  1
    NM_029090  GGAGATATTCTAGCCTCCAGCTT  NM_029738  TTCACAAGCCACAGAAGCAC   OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_134247  CCACGCTCTTCCTGCCAC       NM_134246  GCAGGTAGGTCACAGCTTCC   no       0  0  0  0  0  0  0  0  0  0  0  0
    NM_138604  TGGGCCTACCGTCATTTAAG     NM_138606  GCCACGACTTGGGTAAAGAA   OK       1  0  0  1  0  1  0  1  0  0  0  0
    NM_144894  CTTTGAGCCTGCAGCTCTTC     NM_023684  TCCACTGGTACCAAACTGTCC  OK       0  0  0  0  0  0  0  1  0  0  0  0
    NM_178017  CCTAGCCTCCCTCCTGTCTG     NM_011622  CTCCAGTTGCTGGGAGAGTC   no       0  0  0  0  0  0  0  0  0  0  0  0

    TABLE COLUMNS DESCRIPTION

    The columns of the table above are:

    • NM1 CODE: RefSeq NM code for the first gene.
    • Oligo 1: oligo used for the amplification, from the first gene.
    • NM2 CODE: RefSeq NM code for the second gene.
    • Oligo 2: oligo used for the amplification, from the second gene.
    • SEQ CNF: "OK" if RT-PCR succeeded and the sequence of the fusion gene was confirmed.
    • Results on tissues: RT-PCR result in each of the mouse tissues tested in this experiment (1, shown in "green", for success; 0, shown in "red", for failure). The tissues were: brain (BR), heart (HE), kidney (KI), thymus (TH), liver (LI), stomach (ST), muscle (MU), lung (LU), testis (TE), skin (SK), eye (EY), and ovary (OV).

    Positive genes are those with "OK" in the "SEQ CNF" column.
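
For readers who want to process the RT-PCR tables programmatically, the following is a minimal sketch of a row parser. It assumes that rows are whitespace-separated, that the twelve tissue columns follow the order given in the column description for the mouse table, and that a value of 1 marks a tissue with amplification. The names `MOUSE_TISSUES` and `parse_rtpcr_row` are hypothetical and do not belong to any tool distributed on this site.

```python
# Tissue order assumed from the mouse table header.
MOUSE_TISSUES = ["BR", "HE", "KI", "TH", "LI", "ST",
                 "MU", "LU", "TE", "SK", "EY", "OV"]


def parse_rtpcr_row(line, tissues=MOUSE_TISSUES):
    """Split one table row into named fields and list the positive tissues."""
    fields = line.split()
    expected = 5 + len(tissues)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    nm1, oligo1, nm2, oligo2, seq_cnf = fields[:5]
    flags = fields[5:]
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError("tissue flags must be 0 or 1")
    return {
        "nm1": nm1,
        "oligo1": oligo1,
        "nm2": nm2,
        "oligo2": oligo2,
        "confirmed": seq_cnf == "OK",
        # Assumption: 1 means the RT-PCR product was obtained in that tissue.
        "positive_tissues": [t for t, f in zip(tissues, flags) if f == "1"],
    }


row = parse_rtpcr_row(
    "NM_008866 GGCCGATCAACAGTGCTAA NM_011541 TCTTTCAGCAAATCCAATGC "
    "OK 0 0 1 0 1 0 0 1 1 1 0 1"
)
print(row["nm1"], row["nm2"], row["confirmed"], row["positive_tissues"])
```

The same function can be applied to the human table by passing the human tissue order as the `tissues` argument.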

geneid predictions in the ENCODE regions



Observed cases in the ENCODE regions:

ENCODE region  chr     gene 1    transcripts 1  gene 2          transcripts 2  Expressed evidence
ENm009         chr 11  TRIM6     NM_001003818   TRIM34          NM_021616      AB039903
                                 NM_05866                       NM_130389
ENr233         chr 15  serf2     NM_005770      HYpk            NM_016400      AK000438
ENm005         chr 21  CRYZL1    BC033023       DONSON          NM_017613      AL157441
                                                                NM_145794
                                                                NM_145795
ENr223         chr 6   C6orf148  NM_030568      AC019205.8-001  AK090984       BM544101

The last approach was the prediction of putative new fusion genes in the ENCODE regions using the ab initio gene prediction program geneid. More detailed information about the selection and features of the ENCODE regions is available on the main ENCODE webpage and on the GENCODE webpage. All sequences containing pairs of adjacent genes in the ENCODE regions were obtained. geneid was run on those sequences, and only predictions that overlapped coding regions from both genes were considered.


The following subsets of sequences were generated for the experiment:

    • Number of initial transcripts (RefSeq, VEGA annotations and Known Genes from the UCSC Genome Browser): 1,288 transcripts.
      (This file contains the information for every transcript in UCSC format)

    • Non-redundant set of translatable transcripts: 594 transcripts.
      (Identifiers and corresponding ENCODE region for the transcripts retained after filtering the previous set)

    • Clusters of overlapping transcripts: 321 clusters.
      (The file contains the ENCODE id plus the cluster number GCL_XXX, the beginning and the end of the overlapping transcripts, and the ids of the corresponding overlapping transcripts)

    • Pairs of adjacent transcripts (same strand): 165 pairs of clusters.

    • geneid predictions overlapping both genes: 126.
      (The file contains the ids of the clusters where geneid has predicted a protein that overlaps coding regions of both sets of transcripts)
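
The clustering and pairing steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the files: transcripts are assumed to be clustered separately for each strand whenever their genomic spans overlap, and two clusters are assumed to be "adjacent" when they are consecutive by start coordinate, lie on the same strand, and do not overlap. The function names and the cluster dictionary layout are hypothetical.

```python
def cluster_transcripts(transcripts):
    """Merge overlapping transcripts of one region into per-strand clusters.

    `transcripts` is an iterable of (id, strand, start, end) tuples with
    closed coordinates. Clusters are returned ordered by start position.
    """
    clusters = []
    open_cluster = {}  # most recent cluster seen on each strand
    for tid, strand, start, end in sorted(transcripts, key=lambda t: (t[2], t[3])):
        current = open_cluster.get(strand)
        if current is not None and start <= current["end"]:
            # Overlaps the open cluster on this strand: extend it.
            current["end"] = max(current["end"], end)
            current["ids"].append(tid)
        else:
            current = {"strand": strand, "start": start, "end": end, "ids": [tid]}
            open_cluster[strand] = current
            clusters.append(current)
    return clusters


def adjacent_same_strand_pairs(clusters):
    """Return consecutive, non-overlapping cluster pairs on the same strand."""
    ordered = sorted(clusters, key=lambda c: c["start"])
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if a["strand"] == b["strand"] and a["end"] < b["start"]
    ]


# Toy transcripts, not taken from the ENCODE annotation files.
toy = [
    ("t1", "+", 100, 500),
    ("t2", "+", 400, 900),
    ("t3", "+", 1500, 2000),
    ("t4", "-", 2500, 3000),
    ("t5", "+", 1900, 2200),
]
toy_clusters = cluster_transcripts(toy)
for a, b in adjacent_same_strand_pairs(toy_clusters):
    print(a["ids"], b["ids"])
```

With these toy transcripts, three clusters are formed and a single tandem pair of clusters is reported; the cluster on the minus strand is not paired with its neighbour on the plus strand.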


From the previous 126 predictions, 96 were tested experimentally by RT-PCR. Only the exons specific to the chimeric predictions were tested. Three of the chimeric exon pairs predicted by geneid amplified:

ENCODE region  chr     transcript 1  oligo 1               transcript 2  oligo 2               Pred CDS  Ampli Seq
ENm005         chr 21  NM_144659     CCAAGGAGCTGAGAAGAACG  NM_021254     ATACGCTGGCCACAAGAATC  fasta     fasta
ENm013         chr 7   AK124057      CGGAGAACTTTGTCGGAGAG  NM_000629     TGGAAAAACTAGGGGAAGGA  fasta     fasta
ENr331         chr 2   AK023854      TGCCTACTTCACTGTCACCA  BC061909      CCAGAGAGGATTGTGCACCC  fasta     fasta

TABLE COLUMNS DESCRIPTION

The columns of the table above are:

    • ENCODE region: ENCODE region code.
    • chr: human chromosome where the ENCODE region is located.
    • transcript 1: code for the first gene.
    • oligo 1: oligo used for the amplification, from the first gene.
    • transcript 2: code for the second gene.
    • oligo 2: oligo used for the amplification, from the second gene.
    • Pred CDS: the chimeric CDS predicted by geneid.
    • Ampli Seq: the amplified DNA sequence.


 