Automatic annotation of protein functional class from sparse and imbalanced data sets

Jung, J, and Thon, MR. 2006. Data Mining and Bioinformatics (Lecture Notes in Computer Science) 4316: 65-77.

Abstract:

In recent years, high-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function; however, proteins are usually annotated with GO terms through a tedious manual curation process carried out by trained professional annotators. With the wealth of genomic data that are now available, there is a need for accurate automated annotation methods. In this paper, we propose a method for automatically predicting GO terms for proteins by applying statistical pattern recognition techniques. We employ protein functional domains as features and train an independent Support Vector Machine (SVM) classifier for each GO term. This approach creates sparse data sets with highly imbalanced class distributions. We show that these problems can be overcome with standard feature and instance selection methods. We also present a meta-learning scheme that utilizes multiple SVMs trained for each GO term, resulting in better overall performance than either SVM can achieve alone. The implementation of the tool is available at http://fcg.tamu.edu/AAPFC.
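
The data layout described in the abstract can be illustrated with a minimal Python sketch. It is not the published implementation: the protein IDs, domain identifiers, GO term assignments, and the function names `build_binary_tasks` and `undersample` are all hypothetical, and random undersampling of negatives stands in for whichever instance selection method the paper actually uses. The sketch only builds one binary task per GO term from sparse domain features and rebalances it; training the SVMs is left out.

```python
import random

# Hypothetical toy data (illustrative only): protein -> set of domain IDs,
# and protein -> set of annotated GO terms.
protein_domains = {
    "P1": {"PF00069", "PF07714"},
    "P2": {"PF00069"},
    "P3": {"PF00001"},
    "P4": {"PF00001", "PF00002"},
    "P5": {"PF00076"},
    "P6": {"PF00076", "PF00271"},
}
protein_go = {
    "P1": {"GO:0004672"},
    "P2": {"GO:0004672"},
    "P3": {"GO:0004930"},
    "P4": {"GO:0004930"},
    "P5": {"GO:0003723"},
    "P6": {"GO:0003723", "GO:0004386"},
}


def build_binary_tasks(protein_domains, protein_go):
    """Return {go_term: [(domain_set, label), ...]}: one binary task per GO term.

    Each protein is represented sparsely by the set of domains it contains;
    the label is 1 if the protein is annotated with the term, else 0.
    """
    go_terms = sorted(set().union(*protein_go.values()))
    tasks = {}
    for term in go_terms:
        tasks[term] = [
            (frozenset(domains), 1 if term in protein_go.get(pid, set()) else 0)
            for pid, domains in sorted(protein_domains.items())
        ]
    return tasks


def undersample(examples, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * n_pos randomly chosen negatives.

    Assumption: random undersampling is used here as a simple stand-in
    for an instance selection method.
    """
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    keep = min(len(neg), max(1, int(ratio * len(pos))))
    return pos + rng.sample(neg, keep)


tasks = build_binary_tasks(protein_domains, protein_go)
balanced = undersample(tasks["GO:0004386"])
```

In this toy set, the task for `GO:0004386` has one positive and five negatives before rebalancing, which is the kind of class imbalance the abstract refers to. Each rebalanced task could then be passed to any SVM library to train the per-term classifier.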

PDF: Jung and Thon 2006
