×

Orphan gene finding – an exon assembly approach. (English) Zbl 1038.68147

Summary: This paper introduces an algorithm for finding eukaryotic genes. It particularly addresses the problem of orphan genes, that is of genes that cannot, based on homology alone, be connected to any known gene family and to which it is therefore not possible to apply traditional gene finding methods. To the best of our knowledge, this is also the first algorithm that attempts to compare in an exact way two DNA sequences that contain both coding (i.e. exonic) and non-coding (i.e. intronic and, possibly, intergenic) parts. The comparison is performed following an algorithmical model of a gene that is as close as possible to the biological one (we consider in this paper the “one ORF, one gene” problem only). A gene is seen as a set of exons that are pieces of an assembly and are not independent. The algorithm is efficient enough: although the constants are higher than for usual sequence comparison, its time complexity is proportional to the product of the sequences lengths while its space complexity scales linearly with the length of the smallest sequence.

MSC:

68W05 Nonnumerical algorithms
Full Text: DOI

References:

[1] Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J., Basic local alignment search tool, J. Mol. Biol., 215, 403-410 (1990)
[2] Arvestad, L., Aligning coding DNA in the presence of frameshift errors, (Farach-Colton, M., Combinatorial Pattern Matching, 1997, Lecture Notes in Computer Science, Vol. 1264 (1997), Springer: Springer Berlin), 180-190
[3] Bafna, V.; Huson, D. H., The conserved exon method for gene finding, (Bourne, P.; Gribskov, M.; Altman, R.; Jensen, N.; T. Lengauer, D. Hope,; Mitchell, J.; Scheeff, E.; Smith, C.; Strande, S.; Weissig, ., 8th Internat. Conf. on Intelligent Systems for Molecular Biology (2000), AAAI Press: AAAI Press San Diego, California, USA)
[4] Batzoglou, S.; Pachter, L.; Mesirov, J. P.; Berger, B.; Lander, E. S., Human and mouse gene structure: comparative analysis and application to exon prediction, Genome Res., 10, 950-958 (2000)
[5] C. Burge, Identification of genes in human genomic DNA, Ph.D. Thesis, Stanford University, Stanford, CA, USA, 1997.; C. Burge, Identification of genes in human genomic DNA, Ph.D. Thesis, Stanford University, Stanford, CA, USA, 1997.
[6] Burge, C.; Karlin, S., Prediction of complete gene structures in human genomic DNA, J. Mol. Biol., 268, 78-94 (1997)
[7] Burset, M.; Guigó, R., Evaluation of gene structure prediction programs, Genomics, 34, 353-367 (1996)
[8] Claverie, J.-M., Computational methods for the identification of genes in vertebrate genomic sequence, Hum. Mol. Genet., 6, 1735-1744 (1997)
[9] Dayhoff, M. O.; Schwartz, R. M.; Orcutt, B. C., A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found.), 5, suppl. 3, 345-352 (1978)
[10] Dujon, B., The yeast genome project: what did we learn?, Trends Genet., 12, 263-270 (1996)
[11] Durbin, R.; Eddy, S. R.; Krogh, A.; Mitchison, G., Biological Sequence Analysis, Probabilistic Models of Proteins and Nucleic Acids (1998), Cambridge University Press: Cambridge University Press Cambridge · Zbl 0929.92010
[12] Fickett, J. W., Recognition of protein coding regions in DNA sequences, Nucleic Acids Res., 10, 5503-5518 (1982)
[13] Ficket, J. W., Finding genes by computerthe state of the art, Trends Genet., 12, 316-320 (1996)
[14] Florea, L.; Hartzell, G.; Zhang, Z.; Rubin, G. M.; Miller, W. M., A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res., 8, 967-974 (1998)
[15] Gelfand, M. S., Computer prediction of the exon-intron structure of mammalian pre-mRNAs, Nucleic Acids Res., 18, 5865-5869 (1990)
[16] Gelfand, M. S.; Mironov, A.; Pevzner, P. A., Spliced alignment problema new approach to gene recognition, (Hirschberg, D. S.; Myers, E. W., Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol. 1075 (1996), Springer: Springer Berlin), 141-158
[17] Gelfand, M. S.; Mironov, A. A.; Pevzner, P. A., Gene recognition via spliced sequence alignment, Proc. Natl. Acad. Sci. (USA), 93, 9061-9066 (1996)
[18] Gotoh, O., An improved algorithm for matching biological sequences, J. Mol. Biol., 162, 705-708 (1982)
[19] Guan, X.; Uberbacher, E. C., Alignments of DNA and protein sequences containing frameshift errors, Comp. Appl. Biol. Sci., 12, 1, 31-40 (1996)
[20] Guigó, R., Assembling genes from predicted exons in linear time with dynamic programming, J. Comput. Biol., 5, 4, 681-702 (1998)
[21] Hein, J., An algorithm combining DNA and protein alignment, J. Theoret. Biol., 167, 169-174 (1994)
[22] Hein, J.; Stovlbaek, J., Combined DNA and protein alignment, Methods Enzymol., 266, 402-418 (1996)
[23] Henikoff, S.; Henikoff, J. G., Amino acid substitution matrices for protein blocks, Proc. Natl. Acad. Sci. USA, 89, 10915-10919 (1992)
[24] Hirschberg, D. S., A linear space algorithm for computing maximal common subsequences, ACM, 18, 341-343 (1975) · Zbl 0301.68042
[25] Hua, Y.; Jiang, T.; Wu, B., Aligning DNA sequences to minimize change in protein, (Farach-Colton, M., Combinatorial Pattern Matching, 1998, Lecture Notes in Computer Science, Vol. 1448 (1998), Springer: Springer Berlin), 221-234
[26] Hutchinson, G. B.; Hayden, M. R., The prediction of exons through an analysis of spliceable open reading frames, Nucleic Acids Res., 20, 3453-3462 (1992)
[27] Myers, E. W.; Miller, W., Optimal alignments in linear space, Comput. Appl. Biosci., 4, 11-17 (1988)
[28] N. Pavy, S. Rombauts, P. Déhais, C. Mathé, D.V.V. Ramana, P. Leroy, P. Rouzé, Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana; N. Pavy, S. Rombauts, P. Déhais, C. Mathé, D.V.V. Ramana, P. Leroy, P. Rouzé, Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana
[29] C.N.S. Pedersen, personal communication.; C.N.S. Pedersen, personal communication.
[30] Pedersen, C. N.S.; Lyngso, R.; Hein, J., Comparison of coding DNA, (Farach-Colton, M., Combinatorial Pattern Matching, 1998, Lecture Notes in Computer Science, Vol. 1448 (1998), Springer: Springer Berlin), 153-173
[31] Shepherd, J. C.W., Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification, Proc. Natl. Acad. Sci., 78, 1576-1600 (1981)
[32] Staden, R.; McLachlan, A. D., Codon preference and its use in identifying protein coding regions in long DNA sequences, Nucleic Acids Res., 10, 5503-5518 (1982)
[33] Terryn, N.; Heijnen, L.; De Keyser, A.; Van Asseldonck, M.; De Clercq, R.; Verbakel, H.; Gielen, J.; Zabeau, M.; Villarroel, R.; Jesse, T.; Neyt, P.; Hogers, R.; Van Den Daele, H.; Ardiles, W.; Schueller, C.; Mayer, K.; Déhais, P.; Rombauts, S.; Van Montagu, M.; Rouz, .; Vos, P., Evidence for an ancient chromosomal duplication in Arabidopsis thaliana by sequencing and analyzing a 400-kb contig at the APETALA2 locus on chromosome 4, FEBS Lett., 445, 237-245 (1999)
[34] Thomas, A.; Skolnick, M. H., A probabilistic model for detecting coding regions in DNA sequences, IMA J. Math. Appl. Med. Biol., 11, 149-160 (1994) · Zbl 0809.92008
[35] Wiehe, T.; Gebauer-Jung, S.; Mitchell-Olds, T.; Guigo, R., SGP-1prediction and validation of homologous genes based on sequence alignments, Genome Res., 11, 1574-1583 (2001)
[36] Y. Xu, J.R. Einstein, R.J. Mural, M. Shah, E.C. Uberbacher, An improved system for exon recognition and gene modeling in human DNA sequences, 2nd Internat. Conf. on Intelligent Systems for Molecular Biology, AAAAI Press, Menlo Park, California, 1994, pp. 376-384.; Y. Xu, J.R. Einstein, R.J. Mural, M. Shah, E.C. Uberbacher, An improved system for exon recognition and gene modeling in human DNA sequences, 2nd Internat. Conf. on Intelligent Systems for Molecular Biology, AAAAI Press, Menlo Park, California, 1994, pp. 376-384.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.