Case Study -- STX10

Objective of this exercise

The point of the following exercise is to familiarize you with some tools for finding genes based on protein sequences. Let's say you are interested in a particular gene or gene family, for example, syntaxin. Your idea: Map the Syntaxin 10 protein to the genome so you can make a minigene to study splicing efficiency in vitro and in cell culture. You have the mouse protein sequence, but you want to know the human intron exon structure and CDS sequence.

For the purpose of this exercise, let's assume that we don't know the structure of the gene, even though we can look it up on the UCSC genome browser. We will try a number of methods, evaluate the results, and then decide on the most likely intron-exon structure.

Exercise

BLAT

The easiest and fastest thing to do is to BLAT the Syntaxin 10 protein sequence (gi:48146337 in Genbank, fasta can be obtained here) against the hg17 (May, 2004) version of the human genome at UCSC. This should work since mouse and human are phylogenetically close enough. Look closely -- does the BLAT alignment match the Known gene or Refseq? Why not? What about the other gene predictions?

It looks like the first exon is missing and that the last exon is truncated. Also the second and third exon splice sites are different than the "UCSC Known Genes." Zoom in on them. Do you see anything odd?

Genewise

A more sensitive alignment can be performed by Genewise. Let's run Genewise to align the protein to the genome sequence corresponding to the BLAT hit. The genomic region surrounding the BLAT hit, chr19:13,110,001-13,130,000, can be downloaded from here. Repeats are in lower case characters. These can be masked by converting all lowercase letters to N's using tr, like this:

perl -ne 'if ($_!~m/^>/){tr/a-z/N;}print;' < hg17_chr19_13110001_13130000.fa > hg17_chr19_13110001_13130000.msk.fa

Then run genewise:

genewise stx10protein.fa hg17_chr19_13110001_13130000.msk.fa -both -gff

Why doesn't genewise generate the same structure as the RefSeq?

ab initio:geneid, genscan, etc.

Perhaps you are unsatisfied by BLAT and Genewise or perhaps you want to know if perhaps there may be other alternative splice forms not represented by the protein query sequence, or maybe you're not sure if the query sequence is representative of the gene. Or maybe you don't have a homologous sequence, but you have other evidence leading you to this genomic region (traditional genetic mapping, cosmid rescue, high density SNP arrays... you name it.) In this case you'd want to run a number of ab initio or de novo gene predictors.

In this example, we'll use geneid. [We've chosen geneid mainly because it's the program we actively develop in our group. It's not the most accurate, but it's highly flexible, making it a good research tool.] Let's run the latest version of geneid, geneid v1.3.9, on this sequence. Download the .tar.gz file, unzip and untar it (on linux and mac, tar -xvzf geneid.v1.3.9.Oct19_2007.tar.gz, or on mac, just let stuffit expander do the job). Compile geneid by typing make in the geneid directory. Run it on the genomic sequence using the human3iso.param file in the geneid/param directory.

/path/to/geneid/bin/geneid -GP /path/to/geneid/param/human3iso.param hg17_chr19_13110001_13130000.msk.fa

How does the result compare to the RefSeq coordinates (GTF file)? Now let's run GeneID with the U12 option -U and an appropriate parameter file.

/path/to/geneid/bin/geneid -GUP /path/to/geneid/param/human.070606.u2branch.ppt.param hg17_chr19_13110001_13130000.msk.fa

For GFF3 output (recommended, but produces larger output files) use the -3 option. To print introns use the -n option.

How does U12 intron prediction affect the overall quality of the gene prediction?

homology-based:GenomeScan, dual-genome:SGP

We don't have time, but these two BLAST-based versions of genscan and geneid can give more accurate results. In the first case, tblastn of protein seqs agains the genome is used. In the case of SGP a genome-vs-genome tblastx is performed and then the hsps are given to geneid in gff format, the scores of which are used to modify the predicted exon scores. More complex multi-genome methods exist which rely on whole-genome alignments, the most notable of which are Twinscan and NSCAN.

Visualization

I generally use two methods to compare predictions, gff2ps and UCSC genome browser custom tracks. Both take input in GFF format. In the case of gff2ps, the predictions are plotted using postscript and can be printed easily on your laser printer or viewed in gv/Preview or converted to PDF. It is highly customizable. The advantage in visualization is that it can show differences in frame and the phase of introns. The UCSC custom tracks provide a dynamic view of your predictions and they can be viewed alongside many other annotations on the genome. However, the genome must be served by UCSC and your predictions must be done on genomic coordinates.

gff2ps

Let's try gff2ps first. The GFF (-G) output of geneid is ready as is. The output of genewise should be processed to keep only the gff records. Genscan output needs to be converted to GFF. One solution is implemented in this awk script. think about how might you convert this to perl? You could also use the Bio::Tools::Genescan (for input) and Bio::Tools::GFF (for output) modules. For comparison, the known gene gff records can be found here. However, these records are in genomic coordinates and our prediction coordinates are with reference to the extracted locus, so we must first subtract 13110000 from the start and end coordinates of each record first. We should also change the sequence name to hg17_dna. Can you think of a quick way to do this using awk or perl? Here's one way:

perl -ne '$o=13110000;@F=split("\t",$_); $F[3]-=$o;$F[4]-=$o;$F[0]="hg17_dna";print join ("\t",@F),"\n";' < stx10cds.gtf > stx10cds.relcoords.gff

Now we can run gff2ps. {If you try running it on a mac, you may run into difficulties, as gawk is not installed by default. But it's easy to get it from gnu.org and compile it. Or just use the Pasteur Institute's Pise web service for gff2ps.}

gff2ps stx10cds.relcoords.gff hg17_chr19_13110001_13130000.msk.genscan.gff hg17_chr19_13110001_13130000.msk.geneid.normal.gff hg17_chr19_13110001_13130000.msk.geneid.U12.gff > stx10locus.ps

View the result with gv, or convert to pdf with ps2pdf and view in acrobat reader.

UCSC genome browser custom tracks

First the GFF files need to be converted to chromosomal coordinates and the seq name should be changed to chr19. Again, perl to the rescue. If you find yourself converting between the same coordinate systems all the time you can write a simple script to do so automatically by storing the offsets in a separate file.(for example, ths script and param file can convert between hg17 chromosomal coordinates and ENCODE coordinates.) You can upload the gff files one by one or you can add "track" lines above each group of gff records instructing the browser on how to display the features. The most useful is probably the assignment of different colors. Read the documentaion for instructions on the format.

Now that you have them uploaded, you can compare the splice junctions by zooming right on down to the base level. The ability to see the genomic sequence as well as other genomic features are two of the main advantages of using the genome browser.