Gene annotation of genomes

ABSTRACT

We analyze an anonymous human genome contig using a variety of computational tools. We first download the sequence. Then, we analyze it by means of:
  1. Alignment of transcripts (mRNAs/cDNAs/ESTs)
  2. Alignment of proteins
  3. Gene prediction
We are specially interested in deliniating the correct(s) gene structure(s), if any, in this genomic region.

Programs and languages

In this practical we use several well known bioinformatics programs, as well as gawk, perl and UNIX commands and scripts. Being fluent on the command-line is a second aim of this practical.

Downloading the sequence

We download the sequence from the ensembl site. The contig used is AC091491

We save the file locally as AC091491.embl

Obtaining the sequence in fasta format

the following command does the job

grep '^ ' AC091491.embl | sed 's/[ 0-9]//g' > AC091491.fa

However, the fasta format requires each line to be preceeded by the '>seqname' line. We use emacs to introduce this line, and make the file AC091491.fa conform the fasta format.

Of course, there is a number of ways in which we can convert the original file AC091491.embl to the fasta formated one AC091491.fa, without the need of editing the file. Can you think of one such way? check here.

It may be useful to have also the sequence in tabular format. The following script will convert the fasta file into a tabular file.


awk '{printf "%s", $0}'  AC091491.fa > AC091491.tbl

Again, we need to use emacs to introduce a \tab between the sequence name and the sequence itself, to get AC091491.tbl in the prober tabular format. And, again a number of simple unix commands will do the whole conversion automatically. Can you think on such command? check here

The tabular format allows to perform some preliminary analysis on the problem sequence

Length of the sequence


awk '{print length($2)}' AC091491.tbl 
152882 

G+C content

awk '{print $2}' AC091491.tbl | fold -1 | sort | uniq -c | gawk '{print $2, $1/152882}' 
a 0.323073
c 0.200802
g 0.189257
t 0.286868

What is the biological significance of the G+C content?

Alignment of transcripts

cDNAs

The alignment of genomic and transcript sequences is intended to show distribution of exons and introns along the genomic sequence. Thus, we compare the query sequence agains the database of human cDNAs (made of overlaping ESTs) using BLAST. We can use the TIGR BLAST server. Select no query filter, default e-value and 100 descriptions and alignments.

After running the blast search, we store the results in, for instance, the file AC091491.blast.cDNA. Browse it and have a look at best matching sequences.

We now extract the fasta sequence for the best matching cDNAs (first line). Save it in a file and edit it to add an ID line (fasta format): NP105048.fa. Do the same for the next best (and different) match: THC1259113.fa. Browse the BLAST result for these cDNAs and try to get and idea of the potential gene structure defined (make sure that you fully understand the output: Query and subject sequences, HSPs, strands, score and e-value).

The alignment of genomic and mature transcripts is useful to pinpoint gene structures. It elicits the exon-intron regions in the genomic sequence and allow exon-intron junctions (GT donors and AG acceptors) to be checked. We use two diferent RNA-DNA alignment programs that take into account splice sites (note that the BLAST algorithm is based only on sequence similarity; thus, alignment edges are usually blur). Have a look at these web servers:

However, we will run these programs on command line: