Title: Classifying Coding DNA with Nucleotide Statistics
Authors: Carels, Nicolas
Frías, Diego
Affiliation: Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Genômica Funcional e Bioinformática. Rio de Janeiro, RJ. Brasil.
Universidade do Estado da Bahia. Departamento de Ciências Exatas e da Terra. Salvador, BA, Brasil.
Abstract: In this report, we compared the success rate of classifying coding sequences (CDS) vs. introns by the Codon Structure Factor (CSF) and by a method that we call the Universal Feature Method (UFM). UFM is based on scoring purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification is higher with UFM than with CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of cytosine (C), guanine (G), and adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets, and (v) the distance of their GC3 vs. GC2 levels from the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (250 bp), Drosophila melanogaster (250 bp) and Arabidopsis thaliana (200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method returns coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
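The per-frame statistics named in items (i) to (v) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `codons` and `frame_features` and the returned dictionary keys are hypothetical, the standard stop codons (TAA, TAG, TGA) are assumed, and the sketch deliberately stops short of combining the features into a UFM score or measuring the distance to the universal-correlation regression line, because neither the weighting nor the regression coefficients are given in this record.

```python
from collections import Counter

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq, frame=0):
    """Split a DNA string into complete triplets, starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def frame_features(seq, frame=0):
    """Return raw nucleotide statistics for one reading frame (hypothetical helper)."""
    triplets = codons(seq, frame)
    n = len(triplets)
    if n == 0:
        raise ValueError("sequence too short for this frame")

    # Nucleotide counts at codon positions 1, 2 and 3 (indices 0, 1, 2).
    pos_counts = [Counter(t[p] for t in triplets) for p in range(3)]

    def prob(base, pos):
        return pos_counts[pos][base] / n

    # Purine (A or G) probability at each codon position.
    purine = [prob("A", p) + prob("G", p) for p in range(3)]

    return {
        # (i) number of in-frame stop codons
        "stop_count": sum(t in STOP_CODONS for t in triplets),
        # (ii) product of purine probabilities over the three positions
        "purine_product": purine[0] * purine[1] * purine[2],
        # (iii) P(C at 1st) * P(G at 2nd) * P(A at 3rd)
        "cga_product": prob("C", 0) * prob("G", 1) * prob("A", 2),
        # (iv) P(G) at the 1st and 2nd positions
        "g1": prob("G", 0),
        "g2": prob("G", 1),
        # (v) inputs only: GC levels at the 2nd and 3rd positions
        "gc2": prob("G", 1) + prob("C", 1),
        "gc3": prob("G", 2) + prob("C", 2),
    }


if __name__ == "__main__":
    # Compare the three forward frames of a toy sequence.
    for frame in range(3):
        print(frame, frame_features("ATGGCCGGAGCTTAAGC", frame))
```

Running the same function on the reverse complement would give the three remaining frames, which is how a six-frame comparison of strand and frame could be set up under these assumptions.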
Keywords: Genomics
Universal correlation
Purine bias
Coding features
Open reading frame
Ancestral codon
Issue Date: 2009
Publisher: Libertas Academica
Citation: CARELS, Nicolas; FRÍAS, Diego. Classifying Coding DNA with Nucleotide Statistics. Bioinformatics and Biology Insights, v. 3, p. 141–154, 2009.
ISSN: 1177-9322
Copyright: open access
Appears in Collections: IOC - Artigos de Periódicos

Files in This Item:
File: nicolascarels_diegofrias_IOC_2009.pdf
Size: 1.8 MB
Format: Adobe PDF

