Python dna sequence analysis

1/30/2024


Figure 2. The blue curve shows the computational time for the parameter optimization process when Pse-Analysis uses one CPU core to process the five subsets for nucleosome positioning prediction in Caenorhabditis elegans, while the red curve shows the corresponding time when Pse-Analysis uses ten CPU cores for the same task.

As pointed out in a comprehensive review paper, the general form of PseKNC (pseudo K-tuple nucleotide composition) can cover all the existing feature vectors for DNA/RNA sequences.
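To make the idea of a K-tuple feature vector concrete, here is a minimal sketch of plain K-tuple nucleotide composition, the simplest member of the feature family the text refers to. It is not the Pse-Analysis implementation and omits the pseudo (sequence-order correlation) components of PseKNC; the function name `ktuple_composition` and its parameters are assumptions made for illustration.

```python
from itertools import product


def ktuple_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized frequencies of all k-tuples, in lexicographic order.

    Hypothetical helper for illustration only; not part of Pse-Analysis.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


features = ktuple_composition("ACGTACGT", k=2)
print(len(features))  # one entry per possible dinucleotide
```

For k = 2 the vector has 4^2 = 16 entries, and the entries sum to 1 whenever the sequence contains at least one valid window.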

Training with “train.py” includes four steps: (1) feature extraction, (2) parameter selection, (3) model training, and (4) cross-validation. The “predict.py” script generates the output. Note that the “output” here is not limited to the predicted results for the query biological sequences submitted along with the benchmark dataset; it also includes an optimal predictor. Users can apply that predictor directly to various relevant problems, saving much of the time otherwise spent repeating the tedious procedures for developing an effective predictor.

For instance, effectively predicting nucleosome positioning in genomes is a very important task. Guo et al. developed a predictor called iNuc-PseKNC by going through all five procedures described in the Introduction section. Now, with the Pse-Analysis package, all we need to do is input the benchmark dataset used by Guo et al. into the package, and Pse-Analysis will automatically do all the remaining jobs: optimizing the sample formulation, optimizing the operation engine, conducting cross-validations, and forming a web-server that is fully equivalent to iNuc-PseKNC.

The computational speed in optimizing many different parameters is a bottleneck for the efficiency of the Pse-Analysis platform. In this regard, the multiprocessing technique has been applied to significantly speed up the computation. For the above case, the computing time for the parameter optimization process was reduced about 6-fold when using 10 cores instead of a single core, as shown in Figure 2.
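The parallel parameter search described in this section can be sketched roughly as follows. This is an illustration of the general technique using Python's standard `multiprocessing.Pool`, not the code of Pse-Analysis itself. The function `cv_score` is a hypothetical stand-in for a cross-validated SVM accuracy, and the (C, gamma) grid is an assumed example; a real run would train and evaluate a LIBSVM model inside it.

```python
import math
from itertools import product
from multiprocessing import Pool


def cv_score(params):
    """Hypothetical stand-in for the cross-validated accuracy at (C, gamma)."""
    c, gamma = params
    # Toy surface peaking at C = 2**3, gamma = 2**-5.
    # A real run would train and cross-validate an SVM here instead.
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


def grid_search(c_values, gamma_values, workers=1):
    """Score every (C, gamma) pair, optionally in parallel; return the best."""
    grid = list(product(c_values, gamma_values))
    if workers > 1:
        # Each grid point is independent, so the work splits cleanly
        # across processes.
        with Pool(processes=workers) as pool:
            scores = pool.map(cv_score, grid)
    else:
        scores = [cv_score(p) for p in grid]
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


if __name__ == "__main__":
    c_values = [2.0 ** e for e in range(-5, 16, 2)]
    gamma_values = [2.0 ** e for e in range(-15, 4, 2)]
    best_params, best_score = grid_search(c_values, gamma_values, workers=4)
    print(best_params, best_score)
```

Because the grid points do not depend on one another, the speedup is limited mainly by process start-up, data transfer, and uneven task lengths. A 6-fold speedup on 10 cores, as reported in the text, corresponds to a parallel efficiency of 6/10 = 0.6.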

With the explosive growth of biological sequences in the post-genomic age, we are facing a lot of binary classification problems. For DNA/RNA sequences, these problems are about how to identify recombination spots, nucleosome positioning, promoters, microRNA precursors, enhancers, translation initiation sites, various PTRM (post-replication modification) sites in DNA and PTCM (post-transcription modification) sites in RNA, RNA pseudouridine sites, DNA origins of replication, adenosine-to-inosine editing sites in RNA, and many other topics mentioned in a recent review article. For protein/peptide sequences, they are about how to identify various PTM (post-translational modification) sites, anticancer peptides, interactions between drugs and target proteins, PPI (protein-protein interaction), and PPBS (protein-protein binding sites), along with the many other topics covered by the references cited in a recent comprehensive review.

Dealing with these problems is quite laborious even with computational approaches, since the development of each computational predictor needs to go through the following five steps: (1) prepare the benchmark dataset, (2) optimize the sample formulation, (3) optimize the operation engine, (4) conduct cross-validations, and (5) establish a web-server. Each of the five procedures is time-consuming and tedious, particularly selecting the optimal parameters for the samples concerned and for the operation engine adopted. To speed up such processes, we propose a Python package called Pse-Analysis, which is based on the framework of LIBSVM and can automatically generate the predictor desired by users. Users only need to input their benchmark dataset and the query biological sequences, and then obtain their desired results from the output of the Pse-Analysis system. All the tedious work in the aforementioned steps (2)–(5) can be skipped entirely and left to the computer.

Figure: the flowchart of the Pse-Analysis Python package.

The “train.py” script trains the predictive model, a Support Vector Machine (SVM), on the benchmark dataset submitted by the user. It contains four procedures: feature extraction, parameter selection, model training, and cross-validation. The “predict.py” script uses the trained model to predict the query samples and evaluates their prediction quality with a set of widely used metrics: Acc, MCC, Sn, Sp, and AUC.
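For readers unfamiliar with the evaluation metrics that “predict.py” is described as reporting, the sketch below computes Sn (sensitivity), Sp (specificity), Acc (accuracy), and MCC (Matthews correlation coefficient) from the counts of a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are assumptions for illustration, not taken from the package. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc, and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; returns 0.0 where a ratio
    is undefined because its denominator is zero.
    """
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    sp = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Made-up example counts: 50 positive and 50 negative samples.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is often preferred on imbalanced benchmark datasets because it takes all four counts into account, whereas Acc can look high when one class dominates.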
