DATASETS: Identification of chimeric genes

Genome Informatics Research Lab

IMIM

UPF

CRG

GRIB

Chimeric gene prediction

Tandem chimerism as a means to increase protein complexity in the human genome

G. Parra, A. Reymond, N. Dabbouseh, E.T. Dermitzakis, R. Castelo, T.M. Thomson, S.E. Antonarakis, and R. Guigó^*.

Genome Research, in press

^*To whom correspondence should be addressed.

Contents

On this site we describe and provide all the programs and data used to predict chimeric transcripts in the human and mouse genomes.

EST whole genome analysis
geneid prediction in the ENCODE regions

EST whole genome analysis

First, we searched for fusion transcripts in the complete genomes using expression evidence. For each pair of adjacent, non-overlapping coding genes, we looked for ESTs that at least partially overlap coding regions in both genes. This analysis was performed on the human and mouse genomes.
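The overlap criterion described above can be illustrated with a minimal sketch. It assumes that each EST alignment is available as a list of aligned blocks and each gene as a list of coding exons, all given as 1-based inclusive coordinates on the same chromosome; the function names and the toy coordinates are hypothetical and are not taken from the programs distributed here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end), as in GFF


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def hits_coding(est_blocks: List[Interval], cds_exons: List[Interval]) -> bool:
    """Return True if any aligned EST block overlaps any coding exon."""
    return any(overlaps(block, exon) for block in est_blocks for exon in cds_exons)


def links_pair(est_blocks: List[Interval],
               cds_gene1: List[Interval],
               cds_gene2: List[Interval]) -> bool:
    """Return True if the EST overlaps coding sequence in BOTH adjacent genes."""
    return hits_coding(est_blocks, cds_gene1) and hits_coding(est_blocks, cds_gene2)


# Toy coordinates (hypothetical, for illustration only).
gene1_cds = [(100, 200), (300, 400)]
gene2_cds = [(900, 1000), (1100, 1200)]
est_a = [(350, 400), (900, 950)]  # aligned blocks fall in both genes
est_b = [(150, 200), (300, 380)]  # aligned blocks confined to gene 1

print(links_pair(est_a, gene1_cds, gene2_cds))  # True
print(links_pair(est_b, gene1_cds, gene2_cds))  # False
```

Only partial overlap is required, so a single shared base is enough for an EST to count as evidence under this simplified test.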

Human Apr. 2003 Assembly UCSC Genome Browser

Gene pairs in the same orientation (Tandem pairs) : 7,679
(The file contains the genomic coordinates in gff format and the ids of both RefSeqs)

Tandem pairs without intervening genes from other collections ("bona fide" tandem pairs) : 6,369
(The file contains the genomic coordinates in gff format and the ids of both RefSeqs)

Tandem pairs with ESTs linking the two transcripts : 1,288
(The file contains the adjacent RefSeq ids in the first and second fields, followed by the ids of all ESTs that cross both RefSeqs)

Tandem pairs with ESTs overlapping coding exons from each transcript in the pair: 176
(format: RefSeq1_id RefSeq2_id EST_ids)

"Bona fide" TICs after re-alignment of the EST into the genome using Spidey : 127
(format: RefSeq1_id RefSeq2_id EST_ids)
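For readers who want to load the pair files listed above, the following is a minimal sketch of a reader for the "RefSeq1_id RefSeq2_id EST_ids" layout. It assumes whitespace-separated fields, with every field from the third onward holding one or more EST ids (possibly comma-separated); the exact delimiter is an assumption and should be checked against the actual files. The two example lines are built from ids shown in the table of previously described TICs, purely for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def parse_pair_line(line: str) -> Tuple[Pair, List[str]]:
    """Split one record into the RefSeq pair and its list of EST ids."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    refseq1, refseq2 = fields[0], fields[1]
    # Assumption: EST ids may be separated by whitespace and/or commas.
    est_ids = [est for field in fields[2:] for est in field.split(",") if est]
    return (refseq1, refseq2), est_ids


def load_pairs(lines: Iterable[str]) -> Dict[Pair, List[str]]:
    """Collect EST ids per RefSeq pair, skipping blank and comment lines."""
    pairs: Dict[Pair, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, est_ids = parse_pair_line(line)
        pairs.setdefault(key, []).extend(est_ids)
    return pairs


# Illustrative records only; the real files may differ in detail.
example = [
    "NM_000155 NM_004512 AL558734",
    "NM_005997 NM_013353 AI763406",
]
pairs = load_pairs(example)
print(len(pairs))                            # 2
print(pairs[("NM_000155", "NM_004512")])     # ['AL558734']
```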

Previously described TICs:

Gene 1 Name (RefSeq)	Gene 2 Name (RefSeq)	EST or cDNA
GALT (NM_000155)	IL11Ra (NM_004512)	AL558734
CYP2C18 (NM_000772)	CYP2C19 (NM_000769)	L07093
VPS72 (NM_005997)	TMOD4 (NM_013353)	AI763406
SSF1 (NM_020230)	P2RY11 (NM_002566)	AJ300588

Conservation of the ORF across the TIC : 46

Of the 46 cases in which the ORF was conserved, we tested 32 cases in which the TIC did not include additional "intergenic" exons. Of these, 11 yielded specific amplification products.

NM1 CODE	Oligo 1	NM2 CODE	Oligo 2	SEQ CNF	BR	HE	KI	SP	LI	CO	SI	MU	LU	ST	TE	PL
NM_007221	TCCCAGAGAAGGATCTGCAC	NM_000711	GTGCAGAGTCCAGCAAAGGT	OK	1	1	0	1	1	0	0	0	0	1	0	0
NM_000269	GAGACCAACCCTGCAGACTC	NM_002512	CTCGAAGCGCTTGATGATCT	OK	1	0	0	0	1	0	0	0	1	1	1	1
NM_015996	CTCCTCCATCGCCATGTT	NM_003186	CACTGCACTATGATCCACTCC	OK	1	0	0	0	1	0	0	0	0	0	1	0
NM_020892	GTTCACTGCCAGAGGGTTTC	NM_030570	CGAATGTCAGCCAGTGTCTC	OK	0	0	0	0	1	0	0	0	0	0	0	0
NM_002157	GGCAGGACAAGCGTTTAGAA	NM_015387	AAAGGATTCATCAGGCCAAT	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_031925	TGCTGGAATGGGAGAAGACT	NM_016086	ATGGAGCTGGTGCAATTGTT	OK	1	1	0	0	0	0	0	0	0	0	0	0
NM_153268	ATGGAGTTGGTGATGTGCAA	NM_145753	GCTGAGGCTCTCCATCATGT	OK	0	0	0	0	1	0	0	0	0	0	0	0
NM_172341	TCCTGCCTTTTCTCTGGTTG	NM_019104	TCCAGCAGACACTGCAAGAC	OK	0	0	0	0	0	0	0	0	0	0	1	0
NM_000175	GCACCAAGATGATACCCTGTG	NM_032346	AATACACCTGCACGACCAGA	OK	0	1	0	0	0	0	0	0	0	1	0	0
NM_003838	GCCCTTCAATGTTTGGAAAA	NM_015978	TCAGTTCTTTTTCCTTGATCTGC	OK	1	1	0	0	0	0	0	1	0	0	1	0
NM_025247	GCCTTGGATATAGCCATGATT	NM_000690	CCTGACAGATGACCTCTCCA	OK	0	0	0	1	0	0	0	0	0	0	0	0

TABLE COLUMNS DESCRIPTION

Each column in the previous table contains the following:

NM1 CODE: RefSeq NM code for the first gene.

Oligo 1: first oligo used for the amplification.

NM2 CODE: RefSeq NM code for the second gene.

Oligo 2: second oligo used for the amplification.

SEQ CNF: "OK" when RT-PCR succeeded and the sequence of the fusion gene was confirmed.

Results on tissues: RT-PCR result in each of the human tissues tested in this experiment (1 indicates success and 0 failure; the original page colors these cells green and red, respectively). The tissues were: brain (BR), heart (HE), kidney (KI), spleen (SP), liver (LI), colon (CO), small intestine (SI), muscle (MU), lung (LU), stomach (ST), testis (TE), and placenta (PL).

Positive Genes are those having "OK" in the "SEQ CNF" column.
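As an illustration of how the RT-PCR table can be read programmatically, the sketch below parses a tab-separated copy of the table and lists, for each confirmed pair, the tissues in which amplification succeeded. It assumes the table has been saved with the header row shown above and with 1/0 in the tissue columns; the two data rows are copied from the human table, and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

TISSUES = ["BR", "HE", "KI", "SP", "LI", "CO", "SI", "MU", "LU", "ST", "TE", "PL"]
HEADER = ["NM1 CODE", "Oligo 1", "NM2 CODE", "Oligo 2", "SEQ CNF"] + TISSUES

# Two rows copied from the human table above.
ROWS = [
    ["NM_007221", "TCCCAGAGAAGGATCTGCAC", "NM_000711", "GTGCAGAGTCCAGCAAAGGT", "OK",
     "1", "1", "0", "1", "1", "0", "0", "0", "0", "1", "0", "0"],
    ["NM_020892", "GTTCACTGCCAGAGGGTTTC", "NM_030570", "CGAATGTCAGCCAGTGTCTC", "OK",
     "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0"],
]
tsv_text = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def positive_tissues(text: str) -> Dict[Tuple[str, str], List[str]]:
    """Map each sequence-confirmed pair to the tissues scored as 1."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    result: Dict[Tuple[str, str], List[str]] = {}
    for row in reader:
        if row["SEQ CNF"] != "OK":
            continue  # keep only pairs whose fusion sequence was confirmed
        pair = (row["NM1 CODE"], row["NM2 CODE"])
        result[pair] = [tissue for tissue in TISSUES if row[tissue] == "1"]
    return result


for pair, tissues in positive_tissues(tsv_text).items():
    print(pair, tissues)
# ('NM_007221', 'NM_000711') ['BR', 'HE', 'SP', 'LI', 'ST']
# ('NM_020892', 'NM_030570') ['LI']
```

The same function applies to the mouse table if `TISSUES` is replaced by that table's tissue codes.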

Mouse mm4 Assembly UCSC Genome Browser

Number of contiguous gene pairs on the same strand : 7,893
(The file contains the genomic coordinates in gff format and the ids of both RefSeqs)

Pairs of genes with at least one EST crossing the boundaries of the two known RefSeqs : 1,670
(The file contains the adjacent RefSeq ids in the first and second fields, followed by the ids of all ESTs that cross both RefSeqs)

Pairs of genes with at least one EST crossing the boundaries and with at least one overlapping exon :
- Original mapping with BLAT: 208
  (format: chromosome EST_id RefSeq1_id number_suported_exons RefSeq2_id number_suported_exons)
- ESTs re-mapped with Spidey: 135
  (format: chromosome EST_id RefSeq1_id number_suported_exons RefSeq2_id number_suported_exons)
- Spidey-aligned pairs after filtering by alignment strand and requiring more than one EST exon crossing both RefSeqs: 73

Pairs of genes with at least one EST crossing the boundaries, at least one overlapping exon for each gene, and an ORF crossing the junction between the two genes : 19
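The requirement that an ORF cross the junction can be illustrated with a simplified check. The sketch below is not the procedure used to produce the counts above; it assumes that the chimeric transcript is given as two coding fragments (gene 1 from its start codon up to the junction, gene 2 from the junction onward) and that the GFF-style phase of the gene 2 fragment is known. All names and sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_spans_junction(upstream: str, downstream: str, downstream_phase: int) -> bool:
    """Simplified test that the reading frame of gene 1 continues into gene 2.

    upstream         -- coding bases of gene 1, starting at its start codon
    downstream       -- coding bases of gene 2, starting at the junction
    downstream_phase -- bases to skip in `downstream` before its first full
                        codon in gene 2's own frame (0, 1 or 2, as in GFF)
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    # Bases that gene 2 must contribute to complete the last codon of gene 1.
    needed = (3 - len(upstream) % 3) % 3
    if needed != downstream_phase:
        return False  # gene 2 would be translated out of its annotated frame
    chimera = upstream + downstream
    codons = [chimera[i:i + 3] for i in range(0, len(chimera) - 2, 3)]
    # A stop codon is tolerated only as the very last codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


print(orf_spans_junction("ATGGCC", "GGTTAA", 0))     # True
print(orf_spans_junction("ATGGC", "CGGTTAA", 1))     # True
print(orf_spans_junction("ATGTAAGC", "CGGTTAA", 1))  # False: in-frame stop
print(orf_spans_junction("ATGGCC", "GGTTAA", 1))     # False: frame shift
```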

From this set of chimeric transcripts, we performed RT-PCR on cDNAs from different tissues; 8 cases could be confirmed experimentally. The following table lists the RT-PCR oligos and the tissues in which amplification was obtained.

NM1 CODE	Oligo 1	NM2 CODE	Oligo 2	SEQ CNF	BR	HE	KI	TH	LI	ST	MU	LU	TE	SK	EY	OV
NM_008866	GGCCGATCAACAGTGCTAA	NM_011541	TCTTTCAGCAAATCCAATGC	OK	0	0	1	0	1	0	0	1	1	1	0	1
NM_010724	GGGTGCCCTCTATCCAGAGT	NM_011530	CCTGAGGCCTCCTTCTCTCT	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_021419	CCCACAGTTTCTGCTCCTTC	NM_028791	GACGATGTTCCCTCCACAAG	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_026380	CTCTCCACGGAAGAAGCAAC	NM_011267	CAGGTTCTCCTCGCTGAACT	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_027903	GGTGGTGAATGGAGAGAGGA	NM_007527	GTAGCAAAAAGGCCCCTGTC	no	0	0	0	0	0	0	0	0	0	0	0	0
NM_028873	AGGCCCAACAGTTCAGTACC	NM_025364	TGGTCTCTAAACCACGAGCA	OK	0	0	0	1	1	0	0	1	1	0	0	1
NM_029090	GGAGATATTCTAGCCTCCAGCTT	NM_029738	TTCACAAGCCACAGAAGCAC	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_134247	CCACGCTCTTCCTGCCAC	NM_134246	GCAGGTAGGTCACAGCTTCC	no	0	0	0	0	0	0	0	0	0	0	0	0
NM_138604	TGGGCCTACCGTCATTTAAG	NM_138606	GCCACGACTTGGGTAAAGAA	OK	1	0	0	1	0	1	0	1	0	0	0	0
NM_144894	CTTTGAGCCTGCAGCTCTTC	NM_023684	TCCACTGGTACCAAACTGTCC	OK	0	0	0	0	0	0	0	1	0	0	0	0
NM_178017	CCTAGCCTCCCTCCTGTCTG	NM_011622	CTCCAGTTGCTGGGAGAGTC	no	0	0	0	0	0	0	0	0	0	0	0	0

TABLE COLUMNS DESCRIPTION

Each column in the previous table contains the following:

NM1 CODE: RefSeq NM code for the first gene.

Oligo 1: first oligo used for the amplification.

NM2 CODE: RefSeq NM code for the second gene.

Oligo 2: second oligo used for the amplification.

SEQ CNF: "OK" when RT-PCR succeeded and the sequence of the fusion gene was confirmed; "no" otherwise.

Results on tissues: RT-PCR result in each of the mouse tissues tested in this experiment (1 indicates success and 0 failure; the original page colors these cells green and red, respectively). The tissues were: brain (BR), heart (HE), kidney (KI), thymus (TH), liver (LI), stomach (ST), muscle (MU), lung (LU), testis (TE), skin (SK), eye (EY), and ovary (OV).

Positive Genes are those having "OK" in the "SEQ CNF" column.

geneid predictions in the ENCODE regions

Observed cases in the ENCODE regions:

ENCODE region	chr	gene 1	transcripts 1	gene 2	transcripts 2	Expressed evidence
ENm009	chr 11	TRIM6	NM_001003818 NM_05866	TRIM34	NM_021616 NM_130389	AB039903
ENr233	chr 15	SERF2	NM_005770	HYPK	NM_016400	AK000438
ENm005	chr 21	CRYZL1	BC033023	DONSON	NM_017613 NM_145794 NM_145795	AL157441
ENr223	chr 6	C6orf148	NM_030568	AC019205.8-001	AK090984	BM544101

The last approach was the prediction of putative new fusion genes in the ENCODE regions using the ab initio gene prediction program geneid. More detailed information about the selection and features of the ENCODE regions is available on the main ENCODE webpage and on the GENCODE webpage. All the sequences containing pairs of adjacent genes in the ENCODE regions were obtained. geneid was run on those sequences, and only predictions that overlapped coding regions from both genes were considered.

The following subsets of sequences were generated for the experiment:

Number of initial transcripts (RefSeq, VEGA annotations and Known genes from UCSC genome browser): 1,288 transcripts.
(This file contains the information of every transcript in UCSC format )

Non redundant set of translatable transcripts: 594 transcripts.
(Identifiers and corresponding ENCODE regions for the transcripts retained after filtering the previous set)

Clusters of overlapping transcripts: 321 clusters.
(The file contains the ENCODE id plus the cluster number GCL_XXX, the beginning and the end of the overlapping transcripts, and the ids of the corresponding overlapping transcripts)

Pairs of adjacent transcripts (same strand): 165 pairs of clusters.

Genomic coordinates of the region containing the adjacent genes in gff format and fasta sequences.

geneid predictions: proteins in fasta format, coordinates in gff format and postscript plots.

geneid predictions overlapping both genes: 126.
(The file contains the ids of the clusters where geneid has predicted a protein that overlaps coding regions of both sets of transcripts)
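The clustering and pairing steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the files: transcripts are clustered when they overlap on the same strand within the same ENCODE region, and two clusters are treated as an adjacent pair when they are consecutive along the region and share a strand. The `Transcript` and `Cluster` types and the transcript ids are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    tid: str
    region: str  # ENCODE region id, e.g. "ENm005"
    strand: str  # "+" or "-"
    start: int
    end: int


@dataclass
class Cluster:
    region: str
    strand: str
    start: int
    end: int
    members: List[str]


def cluster_transcripts(transcripts: List[Transcript]) -> List[Cluster]:
    """Merge transcripts that overlap on the same strand of the same region."""
    clusters: List[Cluster] = []
    ordered = sorted(transcripts, key=lambda t: (t.region, t.strand, t.start))
    for t in ordered:
        last = clusters[-1] if clusters else None
        same_group = last is not None and (last.region, last.strand) == (t.region, t.strand)
        if same_group and t.start <= last.end:
            last.end = max(last.end, t.end)  # extend the current cluster
            last.members.append(t.tid)
        else:
            clusters.append(Cluster(t.region, t.strand, t.start, t.end, [t.tid]))
    # Return clusters in genomic order, regardless of strand.
    return sorted(clusters, key=lambda c: (c.region, c.start))


def adjacent_same_strand_pairs(clusters: List[Cluster]) -> List[Tuple[Cluster, Cluster]]:
    """Pair consecutive clusters that lie on the same strand of a region."""
    ordered = sorted(clusters, key=lambda c: (c.region, c.start))
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if a.region == b.region and a.strand == b.strand]


# Hypothetical transcripts for illustration.
transcripts = [
    Transcript("T1", "ENm005", "+", 100, 500),
    Transcript("T2", "ENm005", "+", 400, 900),
    Transcript("T3", "ENm005", "+", 1500, 2000),
    Transcript("T4", "ENm005", "-", 2500, 3000),
    Transcript("T5", "ENm005", "+", 3500, 4000),
]
clusters = cluster_transcripts(transcripts)
pairs = adjacent_same_strand_pairs(clusters)
print(len(clusters))                         # 4
print([(a.members, b.members) for a, b in pairs])  # [(['T1', 'T2'], ['T3'])]
```

In this toy example T3 and T5 are not paired because the minus-strand cluster containing T4 lies between them; whether such intervening genes should break adjacency is a design choice of the sketch.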

From the previous 126 predictions, 96 were tested experimentally by RT-PCR. Only the exons specific to the chimeric predictions were tested. Three of the chimeric exon pairs predicted by geneid amplified:

ENCODE region	chr	transcript 1	oligo 1	transcript 2	oligo 2	Pred CDS	Ampli Seq
ENm005	chr 21	NM_144659	CCAAGGAGCTGAGAAGAACG	NM_021254	ATACGCTGGCCACAAGAATC	fasta	fasta
ENm013	chr 7	AK124057	CGGAGAACTTTGTCGGAGAG	NM_000629	TGGAAAAACTAGGGGAAGGA	fasta	fasta
ENr331	chr 2	AK023854	TGCCTACTTCACTGTCACCA	BC061909	CCAGAGAGGATTGTGCACCC	fasta	fasta

TABLE COLUMNS DESCRIPTION

Each column in the previous table contains the following:

ENCODE region: ENCODE region code.
chr: human chromosome where the ENCODE region is located.
transcript 1: code of the transcript for the first gene.
oligo 1: first oligo used for the amplification.
transcript 2: code of the transcript for the second gene.
oligo 2: second oligo used for the amplification.
Pred CDS: the chimeric CDS predicted by geneid.
Ampli Seq: the amplified DNA sequence.
