i. ABSTRACT
Interactions between proteins and DNA play a very important role in many essential biological processes, such as DNA replication, transcription, splicing, and repair [1]. Studying them may also contribute to drug design and discovery, for example by aiding the design of artificial transcription factors [2]. Computer technology can be combined with biological science to support this research. This thesis concerns the prediction of DNA-protein binding sequences using computational methods. These methods can be divided into three basic categories: sequence-based DNA-binding site prediction, structure-based DNA-binding site prediction, and homology modelling and threading. In our study we look at the different ways in which datasets and various forms of information can be used to produce a prediction. We also propose specific implications that are likely to result in novel prediction methods, increased performance, or practical applications. DNA exhibits less diverse sequence patterns than protein; therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. The thesis reviews existing research on machine learning methods for comparison and selection, evaluation methods, performance comparisons of different tools, and future directions in protein DNA-binding site prediction. In particular, we detail the meta-analysis of protein DNA-binding sites. We also propose other practical methods that are likely to lead to better prediction methods and improved performance.
Contents
i. ABSTRACT
ii. DNA PROTEIN BACKGROUND
iii. INTRODUCTION
Backpropagation
K-Nearest neighbor
Support vector machines
iv. IMBALANCED DATA
v. EVALUATION IN IMBALANCED DOMAINS
vi. PSI-BLAST AND PSSM
vii. PSSM
viii. MATERIALS AND METHODS
ix. PREDICTION METHODS
Prediction Based on Sequences and Structures
Homology Modelling and Threading
x. PREDICTION ALGORITHM
Decision tree
Random forest
Bayesian learning
xi. HYBRID LEARNING AND META-PREDICTION METHODS
xii. COMPARISON OF DIFFERENT PREDICTION METHODS
xiii. CONCLUSION
xiv. REFERENCE
ii. DNA PROTEIN BACKGROUND
We begin with the genetic analyses of bacteria carried out in the 1950s. These provided evidence for the existence of gene regulatory proteins that are able to turn specific sets of genes on or off. One such regulator, the lambda repressor, is encoded by bacteriophage lambda, a bacterial virus. The repressor shuts off the viral genes that code for the protein components of new virus particles, and thereby enables the viral genome to remain a silent passenger in the bacterial chromosome, multiplying with the bacterium only when conditions are suitable for bacterial growth. The lambda repressor was one of the first gene regulatory proteins to be characterized, and it remains one of the best understood, as we discuss later. Other bacterial regulators respond to nutritional conditions by shutting off genes encoding specific sets of metabolic enzymes when they are not needed.

An important step towards understanding gene regulation was the isolation of mutant strains that were unable to shut off specific sets of genes. A majority of these mutants were deficient in proteins acting as specific repressors for these gene sets. Because these proteins are present in small quantities, isolating them was difficult and time-consuming. They eventually had to be purified by fractionating cell extracts. Once isolated, the proteins were shown to bind to specific DNA sequences close to the genes that they regulate. The DNA sequences that they recognized were then determined by a combination of classical genetics, DNA sequencing, and DNA-footprinting experiments.