DNA sequence for these tutorials: oryza.fasta, a segment of genomic DNA from rice (Oryza sativa).
1. First, we will align ESTs to the genomic DNA with BLASTN. Open the link to the Oryza genomic DNA segment above and copy/paste the sequence into Geneious. Perform a BLASTN search against the est_others database.
2. Examine the locations of the alignments on the query sequence. These alignments should correspond to the parts of the gene that are transcribed (exons and UTRs).
3. Select the top 5 or 6 ESTs with the best alignments. Record the start and end locations of each HSP.
4. In Geneious, create a feature for each HSP from the BLAST report.
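If you also run the search with command-line BLAST+, the HSP coordinates from steps 3 and 4 can be collected with a short script instead of by hand. The sketch below assumes the standard 12-column tabular format (-outfmt 6); the EST names and coordinates in SAMPLE are made up for illustration, and read_hsps is a hypothetical helper, not part of Geneious or BLAST.

```python
import csv
import io

# Hypothetical BLASTN tabular output (-outfmt 6). Columns: qseqid, sseqid,
# pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue,
# bitscore. IDs and coordinates are invented for illustration only.
SAMPLE = (
    "oryza\tEST_A\t99.1\t220\t2\t0\t1501\t1720\t1\t220\t1e-110\t400\n"
    "oryza\tEST_A\t98.5\t200\t3\t0\t2105\t2304\t221\t420\t1e-95\t350\n"
    "oryza\tEST_B\t97.0\t100\t3\t0\t1601\t1700\t100\t1\t1e-40\t170\n"
)


def read_hsps(handle):
    """Return one (subject_id, query_start, query_end, strand) tuple per HSP."""
    hsps = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qstart, qend = int(row[6]), int(row[7])
        sstart, send = int(row[8]), int(row[9])
        # A reversed coordinate pair indicates a minus-strand alignment.
        same_direction = (qstart <= qend) == (sstart <= send)
        strand = "+" if same_direction else "-"
        hsps.append((row[1], min(qstart, qend), max(qstart, qend), strand))
    return hsps


hsps = read_hsps(io.StringIO(SAMPLE))
for subject, start, end, strand in hsps:
    print(f"{subject}\t{start}\t{end}\t{strand}")
```

Each printed line gives the interval on the genomic sequence that you would enter as a feature in Geneious.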
5. Using the Oryza genomic sequence, perform a BLASTX search. You may try either the nr database or the SwissProt database. Select the top 4 or 5 BLAST HSPs and create features on the genomic sequence in Geneious.
6. Compare the EST features and the BLASTX features. Can you answer the following questions?
How many genes are in the Oryza DNA fragment?
How many exons are in the genes?
Can you determine the correct reading frame for each exon?
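One way to approach the exon and reading-frame questions is to merge overlapping HSP intervals into candidate transcribed regions and to compute the frame implied by each BLASTX HSP start. This is a minimal sketch under stated assumptions: the function names are hypothetical, the coordinates are invented, and the frame calculation covers plus-strand HSPs only. Note that merged EST regions can include UTRs, so they are candidates rather than confirmed coding exons.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals.

    Coordinates are 1-based and inclusive, as in a BLAST report.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def plus_strand_frame(qstart):
    """Reading frame (1, 2, or 3) of a plus-strand HSP starting at qstart."""
    return (qstart - 1) % 3 + 1


# Invented query coordinates for illustration only.
est_hsps = [(1501, 1720), (1601, 1700), (2105, 2304), (2300, 2350)]
regions = merge_intervals(est_hsps)
print(regions)
for start, end in regions:
    print(start, end, "frame if coding starts here:", plus_strand_frame(start))
```

Two HSPs in different frames usually point to separate exons, because an intron whose length is not a multiple of 3 shifts the frame on the genomic sequence.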
7. In Geneious, click on the graphs symbol and select “Protein Coding Prediction”. This graph is calculated using the Fickett algorithm. Do the predicted protein coding regions correspond to the EST and BLASTX features that you made?
8. Use the geneid web server (homepage) to find genes in the genomic DNA fragment. Create features in Geneious for the predicted exons from geneid. How closely do the predicted exons match the ones you found with BLAST?
9. Use the Augustus web server to find genes in the genomic DNA fragment and create features in Geneious.
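To compare predicted exons with the BLAST evidence numerically, you can read the coordinates from GFF-formatted prediction output and measure the overlap with each evidence interval. The sketch below assumes 9-column tab-separated GFF lines with the feature type in column 3; the sample lines, the evidence intervals, and the helper names are hypothetical.

```python
# Hypothetical GFF lines in the 9-column layout: seqid, source, type,
# start, end, score, strand, phase, attributes. Values are invented.
SAMPLE_GFF = (
    "# example prediction output\n"
    'oryza\tAUGUSTUS\tgene\t1510\t2290\t0.90\t+\t.\tg1\n'
    'oryza\tAUGUSTUS\tCDS\t1510\t1720\t0.95\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
    'oryza\tAUGUSTUS\tCDS\t2105\t2290\t0.88\t+\t2\ttranscript_id "g1.t1"; gene_id "g1";\n'
)


def read_gff_features(text, feature_type="CDS"):
    """Return (start, end, strand) for every line of the given feature type."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        features.append((int(cols[3]), int(cols[4]), cols[6]))
    return features


def overlap(a, b):
    """Number of bases shared by two 1-based inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


predicted = read_gff_features(SAMPLE_GFF)
evidence = [(1501, 1720), (2105, 2350)]  # invented BLAST-derived intervals
for start, end, strand in predicted:
    best = max(overlap((start, end), ev) for ev in evidence)
    print(f"{start}-{end} ({strand}): {best} of {end - start + 1} bases supported")
```

A predicted exon with little or no supporting overlap deserves a closer look before it goes into the final gene model.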
10. Now you should have enough evidence to annotate your gene. In Geneious, you can create a CDS feature with multiple intervals, where each interval is an exon. In general, protein coding genes should follow these rules:
- the first exon should begin with a start codon (ATG).
- introns should begin with GT and end with AG.
- the last exon should end with a stop codon.
- when all of the exons are spliced together, the result should be a sequence whose length is a multiple of 3, and it should translate into the final protein sequence.
Note that people often use the genes predicted by geneid, Augustus, or some other program because these genes comply with all of the rules listed above. Keep in mind that it is important to compare the results of these programs with the BLASTN and BLASTX results and to manually fix any errors in the gene model.
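The rules in step 10 can be checked mechanically before you finalize the CDS feature. The following is a minimal sketch, assuming a plus-strand gene and 1-based inclusive exon coordinates; check_gene_model is a hypothetical helper and the toy sequence is invented so that the result is easy to verify by eye.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genomic, exons):
    """Return a list of rule violations for a plus-strand gene model.

    genomic: the genomic DNA sequence as a string.
    exons:   (start, end) pairs, 1-based and inclusive.
    """
    seq = genomic.upper()
    exons = sorted(exons)
    problems = []

    # Splice the exons together to get the coding sequence.
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("first exon does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("last exon does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("spliced length is not a multiple of 3")

    # Each intron lies between the end of one exon and the start of the next.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = seq[prev_end:next_start - 1]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(
                f"intron {prev_end + 1}-{next_start - 1} is not GT...AG"
            )
    return problems


# Toy sequence: 2 flanking bases, exon ATGGCC, intron GTAAAAAG,
# exon TTTTAA, 2 flanking bases.
toy = "CC" + "ATGGCC" + "GTAAAAAG" + "TTTTAA" + "GG"
good = check_gene_model(toy, [(3, 8), (17, 22)])
bad = check_gene_model(toy, [(3, 8), (18, 22)])
print(good)
print(bad)
```

Shifting the second exon by one base breaks both the splice-site rule and the length rule, which is the kind of error you would then fix by hand using the BLASTN and BLASTX evidence.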