Research

We explore the most advanced and captivating intersections of AI and bioinformatics.

Total Citations

0

Times our research has been cited by other scholars

h-index

0

Papers with at least h citations

i10-index

0

Papers with at least 10 citations
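As a rough illustration of how the two indices above are usually computed, here is a minimal sketch. It assumes the conventional definitions (h-index: the largest h such that h papers each have at least h citations; i10-index: the number of papers with at least 10 citations). The function names and sample counts are hypothetical and are not taken from the lab's site.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, for illustration only.
sample = [10, 8, 5, 4, 3]
print(h_index(sample), i10_index(sample))
```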

Featured Research

Discover our latest AI breakthroughs and updates from the lab

PTMGPT2

Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model

Post-translational modification prediction
Published: 02/01/2024

SynergyGTN

Unlocking the therapeutic potential of drug combinations through synergy prediction using graph transformer networks

Drug combination prediction for cancer cell lines
Published: 02/01/2024

PUResNet

PUResNetV2.0: a deep learning model leveraging sparse representation for improved ligand binding site prediction

Ligand Binding Site Prediction
Published: 02/01/2024

Research Publications

Explore our contributions to AI research and development across various domains

2025

Transforming Highway Safety With Autonomous Drones and AI: A Framework for Incident Detection and Emergency Response

03/11/2025
Authors: Muhammad Farhan, Hassan Eesaar, Afaq Ahmed, Kil To Chong, Hilal Tayara
Journal: IEEE Open Journal of Vehicular Technology
Abstract: Highway accidents pose serious challenges and safety risks, often resulting in severe injuries and fatalities due to delayed detection and response. Traditional accident management methods rely heavily on manual reporting, which can be inefficient and error-prone, costing lives. This paper proposes a novel framework that integrates autonomous aerial systems (drones) with advanced deep learning models to enhance real-time accident detection and response capabilities. The system not only dispatches the drones but also provides live accident footage and accident identification, and aids in coordinating emergency response. In this study, we implemented our system in the Gazebo simulation environment, where an autonomous drone navigates to a specified location based on navigation commands that a large language model (LLM) generates by processing the emergency call/transcript. Additionally, we created a dedicated accident dataset to train a YOLOv11m model for precise accident detection. At the accident location, the drone provides live video feeds and our YOLO model detects the incident. The high-resolution images captured after detection are analyzed by Moondream2, a vision-language model (VLM), to generate detailed textual descriptions of the scene, which are further refined by GPT 4-Turbo, an LLM, to produce concise incident reports and actionable suggestions. This end-to-end system combines autonomous navigation, incident detection, and incident response, offering a scalable and efficient approach to incident response management. The initial implementation demonstrates promising results and accuracy, validated through Gazebo simulation. Future work will focus on implementing this framework in hardware for real-world deployment in highway incident systems.
View Publication →

Analysis of Ruddlesden‐Popper and Dion‐Jacobson 2D Lead Halide Perovskites Through Integrated Experimental and Computational Analysis

02/17/2025
Authors: Basir Akbar, Kil To Chong, Hilal Tayara
Journal: Battery Energy
Abstract: Two-dimensional (2D) lead halide perovskites (LHPs) have attracted wide interest for state-of-the-art optoelectronic devices, highly efficient solar cells, and next-generation energy harvesting technologies owing to their hydrophobic nature, layered configuration, and remarkable chemical and environmental stability. These 2D LHPs are categorized into Dion-Jacobson (DJ) and Ruddlesden-Popper (RP) systems based on their layered configuration. To efficiently classify the RP and DJ phases synthetically and reduce reliance on trial-and-error methods, machine learning (ML) techniques need to be developed. This work identifies the RP and DJ phases of 2D LHPs by implementing various ML models. The models were trained on a set of 264 experimental data points using 10-fold stratified cross-validation and hyperparameter optimization with Optuna, and Shapley Additive Explanations (SHAP) were employed for interpretation. The stacking classifier classified the RP and DJ phases with minimal variation between sensitivity and specificity and achieved a high balanced accuracy (BA) of 0.83 on an independent test data set. Predictions of our best model on 17 hybrid 2D LHPs and three experimentally synthesized 2D LHPs align well with experimental outcomes, a significant advance for cutting-edge ML models. Thus, this study opens a new route toward the rational classification of RP and DJ phases of 2D LHPs.
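The stratified cross-validation mentioned in this abstract can be sketched in plain Python. This is only an illustration of the general idea (each fold keeps roughly the same RP/DJ ratio as the full data set), not the authors' pipeline. The function name is hypothetical, the labels are made up, and the assignment is deterministic round-robin rather than shuffled.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold has a similar class ratio.

    Indices of each class are dealt out round-robin across the folds.
    No shuffling is done here; a real experiment would normally shuffle first.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Made-up labels: 6 RP samples and 4 DJ samples, split into 2 folds.
labels = ["RP"] * 6 + ["DJ"] * 4
for fold in stratified_folds(labels, k=2):
    print([labels[i] for i in fold])
```

Each fold is then held out once for validation while the remaining folds are used for training.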
View Publication →

2024

From Detection to Action: A Multimodal AI Framework for Traffic Incident Response

12/09/2024
Authors: Afaq Ahmed, Muhammad Farhan, Hassan Eesaar, Kil To Chong, Hilal Tayara
Journal: Drones
Abstract: With the rising incidence of traffic accidents and growing environmental concerns, the demand for advanced systems to ensure traffic and environmental safety has become increasingly urgent. This paper introduces an automated highway safety management framework that integrates computer vision and natural language processing for real-time monitoring, analysis, and reporting of traffic incidents. The system not only identifies accidents but also aids in coordinating emergency responses, such as dispatching ambulances, fire services, and police, while simultaneously managing traffic flow. The approach begins with the creation of a diverse highway accident dataset, combining public datasets with drone and CCTV footage. YOLOv11s is retrained on this dataset to enable real-time detection of critical traffic elements and anomalies, such as collisions and fires. A vision–language model (VLM), Moondream2, is employed to generate detailed scene descriptions, which are further refined by a large language model (LLM), GPT 4-Turbo, to produce concise incident reports and actionable suggestions. These reports are automatically sent to relevant authorities, ensuring prompt and effective response. The system’s effectiveness is validated through the analysis of diverse accident videos and zero-shot simulation testing within the Webots environment. The results highlight the potential of combining drone and CCTV imagery with AI-driven methodologies to improve traffic management and enhance public safety. Future work will include refining detection models, expanding dataset diversity, and deploying the framework in real-world scenarios using …
View Publication →

Possum: identification and interpretation of potassium ion inhibitors using probabilistic feature vectors

10/22/2024
Authors: Mir Tanveerul Hassan, Hilal Tayara, Kil To Chong
Journal: Archives of Toxicology
Abstract: The flow of potassium ions through cell membranes plays a crucial role in facilitating various cell processes such as hormone secretion, epithelial function, maintenance of electrochemical gradients, and electrical impulse formation. Potassium ion inhibitors are considered promising alternatives in treating cancer, muscle weakness, renal dysfunction, endocrine disorders, impaired cellular function, and cardiac arrhythmia. Thus, it becomes essential to identify and understand potassium ion inhibitors in order to regulate the ion flow across ion channels. In this study, we created a meta-model, POSSUM, for the identification of potassium ion inhibitors. Two distinct datasets were used for training, testing, and evaluation of the meta-model. We employed seven feature descriptors and five distinctive classifiers to construct 35 baseline models. We used the mean Gini index score to select the optimal base models and classifiers. The POSSUM method was trained on the optimal probabilistic feature vectors. The proposed optimal model, POSSUM, outperforms the baseline models and the existing methods on both datasets. We anticipate POSSUM will be a very useful tool and will be essential in the process of finding and screening possible potassium ion inhibitors.
View Publication →

GGAS2SN: Gated Graph and SmilesToSeq Network for Solubility Prediction

10/10/2024
Authors: Waqar Ahmad, Kil To Chong, Hilal Tayara
Journal: Journal Of Chemical Information And Modeling
Abstract: Aqueous solubility is a critical physicochemical property of drug discovery. Solubility is a key issue in pharmaceutical development because it can limit a drug’s absorption capacity. Accurate solubility prediction is crucial for pharmacological, environmental, and drug development studies. This research introduces a novel method for solubility prediction by combining gated graph neural networks (GGNNs) and graph attention neural networks (GATs) with Smiles2Seq encoding. Our methodology involves converting chemical compounds into graph structures with nodes representing atoms and edges indicating chemical bonds. These graphs are then processed by using a specialized graph neural network (GNN) architecture. Incorporating attention mechanisms into GNN allows for capturing subtle structural dependencies, fostering improved solubility predictions. Furthermore, we utilized the Smiles2Seq encoding technique to bridge the semantic gap between molecular structures and their textual representations. Smiles2Seq seamlessly converts chemical notations into numeric sequences, facilitating the efficient transfer of information into our model. We demonstrate the efficacy of our approach through comprehensive experiments on benchmark solubility data sets, showcasing superior predictive performance compared to traditional methods. Our model outperforms existing solubility prediction models and provides interpretable insights into the molecular features driving solubility behavior. This research signifies an important advancement in solubility prediction, offering potent tools for drug discovery, formulation development, and environmental assessments. The fusion of GGNN and Smiles2Seq encoding establishes a robust framework for accurately forecasting solubility across various chemical compounds, fostering innovation in various domains reliant on solubility data.
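To make the idea of converting chemical notation into numeric sequences concrete, here is a minimal sketch. It assumes a simple character-level vocabulary with zero padding. This is an illustrative stand-in and not the Smiles2Seq encoding described in the paper, and the function names are hypothetical. Note that character-level splitting breaks two-letter atoms such as Cl or Br into separate tokens.

```python
def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings to a positive integer.

    Index 0 is reserved for padding (and, in this sketch, unknown characters).
    """
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(smiles, vocab, max_len):
    """Turn one SMILES string into a fixed-length list of integer ids."""
    ids = [vocab.get(ch, 0) for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Ethanol and formaldehyde as toy inputs.
vocab = build_vocab(["CCO", "C=O"])
print(vocab)
print(encode("CCO", vocab, max_len=5))
```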
View Publication →

In Silico Exploration of Novel EGFR Kinase Mutant-Selective Inhibitors Using a Hybrid Computational Approach

08/23/2024
Authors: Md Ali Asif Noor, Md Mazedul Haq, Md Arifur Rahman Chowdhury, Hilal Tayara, HyunJoo Shim, Kil To Chong
Journal: Pharmaceutics
Abstract: Targeting epidermal growth factor receptor (EGFR) mutants is a promising strategy for treating non-small cell lung cancer (NSCLC). This study focused on the computational identification and characterization of potential EGFR mutant-selective inhibitors using pharmacophore design and validation by deep learning, virtual screening, ADMET (Absorption, distribution, metabolism, excretion and toxicity), and molecular docking-dynamics simulations. A pharmacophore model was generated using Pharmit based on the potent inhibitor JBJ-125, which targets the mutant EGFR (PDB 5D41) and is used for the virtual screening of the Zinc database. In total, 16 hits were retrieved from 13,127,550 molecules and 122,276,899 conformers. The pharmacophore model was validated via DeepCoy, generating 100 inactive decoy structures for each active molecule and ADMET tests were conducted using SWISS ADME and PROTOX 3.0. Filtered compounds underwent molecular docking studies using Glide, revealing promising interactions with the EGFR allosteric site along with better docking scores. Molecular dynamics (MD) simulations confirmed the stability of the docked conformations. These results bring out five novel compounds that can be evaluated as single agents or in combination with existing therapies, holding promise for treating the EGFR-mutant NSCLC.
View Publication →

Toxicity Prediction for Immune Thrombocytopenia Caused by Drugs Based on Logistic Regression with Feature Importance

08/01/2024
Authors: Osphanie Mentari, Muhammad Shujaat, Hilal Tayara, Kil T Chong
Journal: Current Bioinformatics
Abstract: One of the problems in drug discovery that can be solved by artificial intelligence is toxicity prediction. In drug-induced immune thrombocytopenia, toxicity can arise in patients after five to ten days as significant bleeding caused by drug-dependent antibodies. In clinical trials, when this condition occurs, all the drugs consumed by patients should be stopped, although sometimes this is not possible, especially for older patients who are dependent on their medication. Therefore, being able to predict toxicity in drug-induced immune thrombocytopenia is very important. Computational technologies, such as machine learning, can help predict toxicity better than empirical techniques owing to the lower cost and faster processing. Objective: Previous studies used the KNN method. However, the performance of these approaches needs to be enhanced. This study proposes a logistic regression model to improve accuracy scores. Methods: In this study, we present a new model for drug-induced immune thrombocytopenia using a machine learning method. Our model extracts several features from the Simplified Molecular Input Line Entry System (SMILES). These features were fused and cleaned, and the important features were selected using the SelectKBest method. The model uses logistic regression that is optimized and tuned by grid search cross-validation. Results: The highest accuracy occurred when using a combination of features from PADEL, CDK, RDKIT, MORDRED, and BLUEDESC, resulting in an accuracy of 80%. Conclusion: Our proposed model outperforms previous studies in accuracy.
View Publication →

AntiCPs-CompML: A Comprehensive Fast Track ML method to predict Anti-Corona Peptides

06/27/2024
Authors: Prem Singh Bist, Sadik Bhattarai, Hilal Tayara, Kil To Chong
Journal: Cold Spring Harbor Laboratory
Abstract: This work introduces AntiCPs-CompML, a novel machine learning framework for the rapid identification of anti-coronavirus peptides (ACPs). ACPs, acting as viral shields, offer immense potential for COVID-19 therapeutics. However, traditional laboratory methods for ACP discovery are slow and expensive. AntiCPs-CompML addresses this challenge by utilizing three primary features for peptide sequence analysis: Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), and Composition-Transition-Distribution (CTD). The framework leverages 26 different machine learning algorithms to effectively predict potential anti-coronavirus peptides. This capability allows for the analysis of vast datasets and the identification of peptides with hallmarks of effective ACPs. AntiCPs-CompML boasts unprecedented speed and cost-effectiveness, significantly accelerating the discovery process while enhancing research efficiency by filtering out less promising options. This method holds promise for developing therapeutic drugs for COVID-19 and potentially other viruses. Our model demonstrates strong performance with an F1 score of 92.12% and a ROC AUC of 76% on the independent test dataset. Despite these promising results, we are continuously working to refine the model and explore its generalizability to unseen datasets. Future enhancements will include feature-based and oversampling augmentation strategies addressing the limited availability of anti-COVID peptide data for comprehensive study, along with concrete feature selection algorithms, to further refine the model’s predictive power. AntiCPs-CompML ushers in a new era of expedited anti-COVID peptide discovery, accelerating the development of novel antiviral therapies.
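Of the three feature types named in this abstract, Amino Acid Composition (AAC) is the simplest: the fraction of each of the 20 standard amino acids in a peptide. A minimal sketch follows. The function name and the example peptide are hypothetical, and skipping non-standard residues is a choice made here for illustration rather than something stated in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies for a peptide.

    Characters outside the 20 standard residues are skipped (an assumption
    of this sketch).
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


# Toy peptide: two alanines, one glycine, one lysine.
vector = amino_acid_composition("AAGK")
print(dict(zip(AMINO_ACIDS, vector)))
```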
View Publication →

AMPred-CNN: Ames mutagenicity prediction model based on convolutional neural networks

06/01/2024
Authors: Thi Tuyet Van Tran, Hilal Tayara, Kil To Chong
Journal: Computers In Biology And Medicine
Abstract: Mutagenicity assessment plays a pivotal role in the safety evaluation of chemicals, pharmaceuticals, and environmental compounds. In recent years, the development of robust computational models for predicting chemical mutagenicity has gained significant attention, driven by the need for efficient and cost-effective toxicity assessments. In this paper, we proposed AMPred-CNN, an innovative Ames mutagenicity prediction model based on Convolutional Neural Networks (CNNs), uniquely employing molecular structures as images to leverage CNNs’ powerful feature extraction capabilities. The study employs the widely used benchmark mutagenicity dataset from Hansen et al. for model development and evaluation. Comparative analyses with traditional ML models on different molecular features reveal substantial performance enhancements. AMPred-CNN outshines these models, demonstrating superior accuracy, AUC, F1 score, MCC, sensitivity, and specificity on the test set. Notably, AMPred-CNN is further benchmarked against seven recent ML and DL models, consistently showcasing superior performance with an impressive AUC of 0.954. Our study highlights the effectiveness of CNNs in advancing mutagenicity prediction, paving the way for broader applications in toxicology and drug development.
View Publication →

Stack-AAgP: Computational prediction and interpretation of anti-angiogenic peptides using a meta-learning framework

05/01/2024
Authors: Saima Gaffar, Hilal Tayara, Kil To Chong
Journal: Computers In Biology And Medicine
Abstract: Angiogenesis plays a vital role in the pathogenesis of several human diseases, particularly in the case of solid tumors. In the realm of cancer treatment, recent investigations into peptides with anti-angiogenic properties have yielded encouraging outcomes, thereby creating a hopeful therapeutic avenue for the treatment of cancer. Therefore, correctly identifying the anti-angiogenic peptides is extremely important in comprehending their biophysical and biochemical traits, laying the groundwork for uncovering novel drugs to combat cancer. In this work, we present a novel ensemble-learning-based model, Stack-AAgP, specifically designed for the accurate identification and interpretation of anti-angiogenic peptides (AAPs). Initially, a feature representation approach is employed, generating 24 baseline models through six machine learning algorithms (random forest [RF], extra tree classifier [ETC], extreme gradient boosting [XGB], light gradient boosting machine [LGBM], CatBoost, and SVM) and four feature encodings (pseudo-amino acid composition [PAAC], amphiphilic pseudo-amino acid composition [APAAC], composition of k-spaced amino acid pairs [CKSAAP], and quasi-sequence-order [QSOrder]). Subsequently, the output (predicted probabilities) from 24 baseline models was inputted into the same six machine-learning classifiers to generate their respective meta-classifiers. Finally, the meta-classifiers were stacked together using the ensemble-learning framework to construct the final predictive model. Findings from the independent test demonstrate that Stack-AAgP outperforms the state-of-the-art methods by a considerable margin. Systematic experiments were conducted to assess the influence of hyperparameters on the proposed model. 
Our model, Stack-AAgP, was evaluated on the independent NT15 dataset, revealing superiority over existing predictors with an accuracy improvement ranging from 5% to 7.5% and an increase in Matthews Correlation Coefficient (MCC) from 7.2% to 12.2%.
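The first stacking stage described above (six algorithms times four encodings giving 24 baseline models, whose predicted probabilities become the input of the meta-classifiers) can be sketched as follows. The algorithm and encoding names come from the abstract. Everything else is a placeholder: the baseline "models" are hypothetical callables that return a fixed probability, standing in for trained classifiers.

```python
from itertools import product

ALGORITHMS = ["RF", "ETC", "XGB", "LGBM", "CatBoost", "SVM"]
ENCODINGS = ["PAAC", "APAAC", "CKSAAP", "QSOrder"]


def baseline_grid():
    """Every (algorithm, encoding) pair: 6 x 4 = 24 baseline models."""
    return list(product(ALGORITHMS, ENCODINGS))


def probabilistic_features(models, peptide):
    """Collect one predicted probability per baseline model for a peptide.

    `models` maps each (algorithm, encoding) pair to a callable returning a
    probability; the resulting 24-value vector is what a meta-classifier
    would be trained on.
    """
    return [models[key](peptide) for key in baseline_grid()]


# Placeholder models that ignore the input and return a constant probability.
toy_models = {
    key: (lambda peptide, p=i / 24: p) for i, key in enumerate(baseline_grid())
}
features = probabilistic_features(toy_models, "AAGK")
print(len(baseline_grid()), len(features))
```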
View Publication →

Harnessing machine learning to predict cytochrome P450 inhibition through molecular properties

04/15/2024
Authors: Hamza Zahid, Hilal Tayara, Kil To Chong
Journal: Archives of Toxicology
Abstract: Cytochrome P450 enzymes are a superfamily of enzymes responsible for the metabolism of a variety of medicines and xenobiotics. Among the Cytochrome P450 family, five isozymes that include 1A2, 2C9, 2C19, 2D6, and 3A4 are most important for the metabolism of xenobiotics. Inhibition of any of these five CYP isozymes causes drug-drug interactions with high pharmacological and toxicological effects. So, the inhibition or non-inhibition prediction of these isozymes is of great importance. Many techniques based on machine learning and deep learning algorithms are currently being used to predict whether these isozymes will be inhibited or not. In this study, three different molecular or substructural properties that include Morgan, MACCS and Morgan (combined) and RDKit of the various molecules are used to train a distinct SVM model against each isozyme (1A2, 2C9, 2C19, 2D6, and 3A4). On the independent dataset, Morgan fingerprints provided the best results, while MACCS and Morgan (combined) achieved comparable results in terms of balanced accuracy (BA), sensitivity (Sn), and Matthews correlation coefficient (MCC). For the Morgan fingerprints, balanced accuracies (BA), Matthews correlation coefficients (MCC), and sensitivities (Sn) against each CYP isozyme, 1A2, 2C9, 2C19, 2D6, and 3A4 on an independent dataset ranged between 0.81 and 0.85, 0.61 and 0.70, 0.72 and 0.83, respectively. Similarly, on the independent dataset, MACCS and Morgan (combined) fingerprints achieved competitive results in terms of balanced accuracies (BA), Matthews correlation coefficients (MCC), and sensitivities (Sn) against each CYP isozyme, 1A2, 2C9, 2C19, 2D6, and 3A4, which ranged between 0.79 and 0.85, 0.59 and 0.69, 0.69 and 0.82, respectively.
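The three metrics reported in this abstract follow directly from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and unrelated to the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, balanced accuracy, and MCC from raw counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    ba = (sn + sp) / 2                         # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "BA": ba, "MCC": mcc}


# Made-up counts: 50 inhibitors (40 found) and 50 non-inhibitors (45 found).
print(binary_metrics(tp=40, fp=5, tn=45, fn=10))
```

Balanced accuracy and MCC are commonly preferred over plain accuracy when the inhibitor and non-inhibitor classes are imbalanced, because both account for performance on each class.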
View Publication →

Unlocking the therapeutic potential of drug combinations through synergy prediction using graph transformer networks

03/01/2024
Authors: Waleed Alam, Hilal Tayara, Kil To Chong
Journal: Computers In Biology And Medicine
Abstract: Drug combinations are frequently used to treat cancer to reduce side effects and increase efficacy. The experimental discovery of drug combination synergy is time-consuming and expensive for large datasets. Therefore, an efficient and reliable computational approach is required to investigate these drug combinations. Advancements in deep learning can handle large datasets with various biological problems. In this study, we developed a SynergyGTN model based on the Graph Transformer Network to predict the synergistic drug combinations against an untreated cancer cell line expression profile. We represent each drug as a graph, with each node and edge of the graph containing nine types of atomic feature vectors and four bond features, respectively. The cell lines are represented by their gene expression profiles. The drug graph was passed through the GTN layers to extract a generalized feature map for each drug pair. The extracted drug-pair features and cell-line gene expression profiles were concatenated and subsequently processed through multiple densely connected layers. SynergyGTN outperformed the state-of-the-art methods, with a receiver operating characteristic area under the curve improvement of 5% on the 5-fold cross-validation. The accuracy of SynergyGTN was further verified through three cross-validation strategies, namely leave-drug-out, leave-combination-out, and leave-tissue-out, resulting in improvements in accuracy of 8%, 1%, and 2%, respectively. The AstraZeneca DREAM dataset was utilized as an independent dataset to validate and assess the generalizability of the proposed method, resulting in an improvement in balanced accuracy of 13%. In conclusion, SynergyGTN is a reliable and efficient computational approach for predicting drug combination synergy in cancer treatment.
View Publication →

An integrative machine learning model for the identification of tumor T-cell antigens

03/01/2024
Authors: Mir Tanveerul Hassan, Hilal Tayara, Kil To Chong
Journal: BioSystems
Abstract: The escalating global incidence of cancer poses significant health challenges, underscoring the need for innovative and more efficacious treatments. Cancer immunotherapy, a promising approach leveraging the body’s immune system against cancer, emerges as a compelling solution. Consequently, the identification and characterization of tumor T-cell antigens (TTCAs) have become pivotal for exploration. In this manuscript, we introduce TTCA-IF, an integrative machine learning-based framework designed for TTCAs identification. TTCA-IF employs ten feature encoding types in conjunction with five conventional machine learning classifiers. To establish a robust foundation, these classifiers are trained, resulting in the creation of 150 baseline models. The outputs from these baseline models are then fed back into the five classifiers, generating their respective meta-models. Through an ensemble approach, the five meta-models are seamlessly integrated to yield the final predictive model, the TTCA-IF model. Our proposed model, TTCA-IF, surpasses both baseline models and existing predictors in performance. In a comparative analysis involving nine novel peptide sequences, TTCA-IF demonstrated exceptional accuracy by correctly identifying 8 out of 9 peptides as TTCAs. As a tool for screening and pinpointing potential TTCAs, we anticipate TTCA-IF to be invaluable in advancing cancer immunotherapy.
View Publication →

iProm-Yeast: Prediction Tool for Yeast Promoters Based on ML Stacking

02/01/2024
Authors: Muhammad Shujaat, Sunggoo Yoo, Hilal Tayara, Kil To Chong
Journal: Current Bioinformatics
Abstract: Background and Objective: Gene promoters play a crucial role in regulating gene transcription by serving as DNA regulatory elements near transcription start sites. Despite numerous approaches, including alignment signal and content-based methods for promoter prediction, accurately identifying promoters remains challenging due to the lack of explicit features in their sequences. Consequently, many machine learning and deep learning models for promoter identification have been presented, but the performance of these tools is not precise. Most recent investigations have concentrated on identifying sigma or plant promoters, while the accurate identification of Saccharomyces cerevisiae promoters remains an underexplored area. In this study, we introduced “iPromyeast”, a method for identifying yeast promoters. Using genome sequences from the eukaryotic yeast Saccharomyces cerevisiae, we investigate vector encoding and promoter classification. Additionally, we developed a more difficult negative set by employing promoter sequences rather than nonpromoter regions of the genome. The newly developed negative reconstruction approach improves classification and minimizes the number of false positive predictions. Methods: To overcome the problems associated with promoter prediction, we investigate alternate vector encoding and feature extraction methodologies. Following that, these strategies are coupled with several machine learning algorithms and a 1-D convolutional neural network model. Our results show that the pseudo-dinucleotide composition is preferable for feature encoding and that the machine-learning stacking approach is excellent for accurate promoter categorization. Furthermore, we provide a negative reconstruction method that uses promoter sequences rather than non-promoter regions, resulting in higher classification performance and fewer false positive predictions. 
Results: Based on the results of 5-fold cross-validation, the proposed predictor, iProm-Yeast, has a good potential for detecting Saccharomyces cerevisiae promoters. The accuracy (Acc) was 86.27%, the sensitivity (Sn) was 82.29%, the specificity (Sp) was 89.47%, the Matthews correlation coefficient (MCC) was 0.72, and the area under the receiver operating characteristic curve (AUROC) was 0.98. We also performed a cross-species analysis to determine the generalizability of iProm-Yeast across other species. Conclusion: iProm-Yeast is a robust method for accurately identifying Saccharomyces cerevisiae promoters. With advanced vector encoding techniques and a negative reconstruction approach, it achieves improved classification accuracy and reduces false positive predictions. In addition, it offers researchers a reliable and precise webserver to study gene regulation in diverse organisms.
View Publication →

DL-SPhos: Prediction of serine phosphorylation sites using transformer language model

02/01/2024
Authors: Palistha Shrestha, Jeevan Kandel, Hilal Tayara, Kil To Chong
Journal: Computers In Biology And Medicine
Abstract: Serine phosphorylation plays a pivotal role in the pathogenesis of various cellular processes and diseases. Roughly 81% of human diseases have links to phosphorylation, and an overwhelming 86.4% of protein phosphorylation takes place at serine residues. In eukaryotes, over a quarter of proteins undergo phosphorylation, with more than half implicated in numerous disorders, notably cancer and reproductive system diseases. This study primarily focuses on serine-phosphorylation-driven pathogenesis and the critical role of conserved motif identification. While numerous techniques exist for predicting serine phosphorylation sites, traditional wet lab experiments are resource-intensive. Our paper introduces a cutting-edge deep learning tool for predicting S phosphorylation sites, integrating explainable AI for motif identification, a transformer language model, and deep neural network components. We trained our model on protein sequences from UniProt, validated it against the dbPTM benchmark dataset, and employed the PTMD dataset to explore motifs related to mammalian disorders. Our results highlight that our model surpasses other deep learning predictors by a significant 3%. Furthermore, we utilized the local interpretable model-agnostic explanations (LIME) approach to shed light on the predictions, emphasizing the amino acid residues crucial for S phosphorylation. Notably, our model also outperformed competitors in kinase-specific serine phosphorylation prediction on benchmark datasets.
View Publication →

In Silico Computational Method for ACP classification and Peptide Class validation Server in Bioinformatics

01/30/2024
Authors: Sadik Bhattarai, Prem Singh Bist, Hilal Tayara, Kil To Chong
Journal: J. Living Sci. Res
Abstract: Cancer is the second-leading cause of death worldwide, and therapeutic peptides that target and destroy cancer cells have received a great deal of interest in recent years. Traditional wet experiments are expensive and inefficient for identifying novel anticancer peptides; therefore, the development of an effective computational approach is essential to recognize ACP candidates before experimental methods are used. In this study, we proposed an Ada-boosting algorithm with the base learner random forest called ACP-ADA, which integrates binary profile feature, amino acid index, and amino acid composition with a 210-dimensional feature space vector to represent the peptides. Training samples in the feature space were augmented to increase the sample size and further improve the performance of the model in the case of insufficient samples. Furthermore, we used five-fold cross-validation to find model parameters, and the cross-validation results showed that ACP-ADA outperforms existing methods for this feature combination with data augmentation in terms of performance metrics. Specifically, ACP-ADA recorded an average accuracy of 86.4% and a Matthews correlation coefficient of 74.01% for dataset ACP740 and 90.83% and 81.65% for dataset ACP240; consequently, it can be a very useful tool in drug development and biomedical research. Several prediction servers are available for identifying anticancer peptides, such as AntiCP, AntiCP2.0, mACPpred, and MLACP, which can serve as validation platforms for researchers working in peptide-based therapy.
View Publication →

SolPredictor: predicting solubility with residual gated graph neural network

01/05/2024
Authors: Waqar Ahmad, Hilal Tayara, HyunJoo Shim, Kil To Chong
Journal: International Journal Of Molecular Sciences
Abstract: Computational methods play a pivotal role in the pursuit of efficient drug discovery, enabling the rapid assessment of compound properties before costly and time-consuming laboratory experiments. With the advent of technology and large data availability, machine and deep learning methods have proven efficient in predicting molecular solubility. High-precision in silico solubility prediction has revolutionized drug development by enhancing formulation design, guiding lead optimization, and predicting pharmacokinetic parameters. These benefits yield considerable cost and time savings and a more efficient, shortened drug development process. The proposed SolPredictor is designed with the aim of developing a computational model for solubility prediction. The model is based on residual graph neural network convolution (RGNN). The RGNNs were designed to capture long-range dependencies in graph-structured data. Residual connections enable information to be utilized over various layers, allowing the model to capture and preserve essential features and patterns scattered throughout the network. The two largest datasets available to date are compiled, and the model uses a simplified molecular-input line-entry system (SMILES) representation. In ten-fold cross-validation, SolPredictor achieved a Pearson correlation coefficient R2 of 0.79±0.02 and a root mean square error (RMSE) of 1.03±0.04. The proposed model was evaluated using five independent datasets. Error analysis, hyperparameter optimization analysis, and model explainability were used to determine the molecular features that were most valuable for prediction.
View Publication →
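The two regression metrics named in the abstract above can be sketched as follows. This is a minimal illustration that assumes "Pearson correlation coefficient R2" means the squared Pearson correlation between measured and predicted values; the function names and the toy data are hypothetical and not from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return (cov * cov) / (var_t * var_p)


# Toy values: predictions shifted by a constant offset of 1.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(measured, predicted))
print(pearson_r2(measured, predicted))
```

The toy data show why both numbers are usually reported together: a constant offset leaves the squared correlation at its maximum while the RMSE still exposes the error.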

Predicting the bandgap and efficiency of perovskite solar cells using machine learning methods

01/04/2024
Authors: H Tayara, A Khan, J Kandel, KT Chong
Journal: Molecular Informatics
Abstract: Rapid and accurate prediction of bandgaps and efficiency of perovskite solar cells is a crucial challenge for various solar cell applications. Existing theoretical and experimental methods often accurately measure these parameters; however, these methods are costly and time-consuming. Machine learning-based approaches offer a promising and computationally efficient method to address this problem. In this study, we trained different machine learning (ML) models using previously reported experimental data. Among the different ML models, the CatBoostRegressor performed better for both bandgap and efficiency approximations. We evaluated the proposed model using k-fold cross-validation and investigated the relative importance of input features using Shapley Additive Explanations (SHAP). SHAP provides valuable insights into how individual features contribute to the predictions of the proposed model. Furthermore, we validated the performance of the proposed model using an independent dataset, demonstrating its robustness and generalizability beyond the training data. Our findings show that machine learning-based approaches, with the aid of SHAP, can provide a promising and computationally efficient method for accurately and rapidly predicting perovskite solar cell properties. The proposed model is expected to facilitate the discovery of new perovskite materials and is freely available on GitHub (https://github.com/AsadKhanJBNU/perovskite_bandgap_and_efficiency.git) for the perovskite community.
View Publication →

IF-AIP: a machine learning method for the identification of anti-inflammatory peptides using multi-feature fusion strategy

01/01/2024
Authors: Saima Gaffar, Mir Tanveerul Hassan, Hilal Tayara, Kil To Chong
Journal: Computers in Biology and Medicine
Abstract: Background: The most commonly used therapy currently for inflammatory and autoimmune diseases is nonspecific anti-inflammatory drugs, which have various hazardous side effects. Recently, some anti-inflammatory peptides (AIPs) have been found to be a substitute therapy for inflammatory diseases like rheumatoid arthritis and Alzheimer’s. Therefore, the identification of these AIPs is an emerging topic that is equally important. Methods: In this work, we have proposed an identification model for AIPs using a voting classifier. We used eight different feature descriptors and five conventional machine-learning classifiers. The eight feature encodings were concatenated to get a hybrid feature set. The five baseline models trained on the hybrid feature set were integrated via a voting classifier. Finally, a feature selection algorithm was used to select the optimal feature set for the construction of our final model, named IF-AIP. Results: We tested the proposed model on two independent datasets. On independent data 1, the IF-AIP model shows an improvement of 3%–5.6% in terms of accuracies and 6.7%–10.8% in terms of MCC compared to the existing methods. On the independent dataset 2, our model IF-AIP shows an overall improvement of 2.9%–5.7% in terms of accuracy and 8.3%–8.6% in terms of MCC score compared to the existing methods. A comparative performance analysis was conducted between the proposed model and existing methods using a set of 24 novel peptide sequences. Notably, the IF-AIP method exhibited exceptional accuracy, correctly identifying all 24 peptides as AIPs.
View Publication →
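The abstract above combines five baseline classifiers through a voting classifier. It does not state whether hard or soft voting was used, so the sketch below assumes hard (majority) voting purely for illustration; the function name and the example predictions are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-model label lists by majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length (one label per sample).
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Five hypothetical baseline models, three samples (1 = AIP, 0 = non-AIP).
model_outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
print(hard_vote(model_outputs))  # [1, 0, 1]
```

With an odd number of models and binary labels, as here, a majority always exists, so no tie-breaking rule is needed.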
2023

Predicting the bandgap and efficiency of perovskite solar cells using machine learning methods

12/05/2023
Authors: Asad Khan, Jeevan Kandel, Hilal Tayara, Kil To Chong
Journal: Molecular Informatics
Abstract: Rapid and accurate prediction of bandgaps and efficiency of perovskite solar cells is a crucial challenge for various solar cell applications. Existing theoretical and experimental methods often accurately measure these parameters; however, these methods are costly and time-consuming. Machine learning-based approaches offer a promising and computationally efficient method to address this problem. In this study, we trained different machine learning (ML) models using previously reported experimental data. Among the different ML models, the CatBoostRegressor performed better for both bandgap and efficiency approximations. We evaluated the proposed model using k-fold cross-validation and investigated the relative importance of input features using Shapley Additive Explanations (SHAP). SHAP provides valuable insights into how individual features contribute to the predictions of the proposed model. Furthermore, we validated the performance of the proposed model using an independent dataset, demonstrating its robustness and generalizability beyond the training data. Our findings show that machine learning-based approaches, with the aid of SHAP, can provide a promising and computationally efficient method for the accurate and rapid prediction of perovskite solar cell properties. The proposed model is expected to facilitate the discovery of new perovskite materials and is freely available on GitHub (https://github.com/AsadKhanJBNU/perovskite_bandgap_and_efficiency.git) for the perovskite community.
View Publication →

ORI-Explorer: a unified cell-specific tool for origin of replication sites prediction by feature fusion

11/01/2023
Authors: Zeeshan Abbas, Mobeen Ur Rehman, Hilal Tayara, Kil To Chong
Journal: Bioinformatics
Abstract: Motivation: The origins of replication sites (ORIs) are precise regions inside the DNA sequence where the replication process begins. These locations are critical for preserving the genome’s integrity during cell division and guaranteeing the faithful transfer of genetic data from generation to generation. The advent of experimental techniques has aided in the discovery of ORIs in many species. Experimentation, on the other hand, is often more time-consuming and pricey than computational approaches, and it necessitates specific equipment and knowledge. Recently, ORI sites have been predicted using computational techniques like motif-based searches and artificial intelligence algorithms based on sequence characteristics and chromatin states. Results: In this article, we developed ORI-Explorer, a unique artificial intelligence-based technique that combines multiple feature engineering techniques to train CatBoost Classifier for recognizing ORIs from four distinct eukaryotic species. ORI-Explorer was created by utilizing a unique combination of three traditional feature-encoding techniques and a feature set obtained from a deep-learning neural network model. The ORI-Explorer has significantly outperformed current predictors on the testing dataset. Furthermore, by employing the sophisticated SHapley Additive exPlanation method, we give crucial insights that aid in comprehending model success, highlighting the most relevant features vital for forecasting cell-specific ORIs. ORI-Explorer is also intended to aid community-wide attempts in discovering potential ORIs and developing innovative verifiable biological hypotheses.
View Publication →

An ensemble of stacking classifiers for improved prediction of miRNA–mRNA interactions

09/01/2023
Authors: Priyash Dhakal, Hilal Tayara, Kil To Chong
Journal: Computers in Biology and Medicine
Abstract: MicroRNAs (miRNAs) are small non-coding RNA molecules that play a crucial role in regulating gene expression at the post-transcriptional level by binding to potential target sites of messenger RNAs (mRNAs), facilitated by the Argonaute family of proteins. Selecting the conservative candidate target sites (CTS) is a challenging step, considering that most of the existing computational algorithms primarily focus on canonical site types, which is a time-consuming and inefficient utilization of miRNA target site interactions. We developed a stacking classifier algorithm that addresses the CTS selection criteria using feature-encoding techniques that generate feature vectors, including k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. This innovative stacking classifier algorithm surpassed previous state-of-the-art algorithms in predicting functional miRNA targets. We evaluated the performance of the proposed model on 10 independent test datasets and obtained an average accuracy of 79.77%, which is a significant improvement of 7.26% over previous models. This improvement shows that the proposed method has great potential for distinguishing highly functional miRNA targets and can serve as a valuable tool in biomedical and drug development research.
View Publication →
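One of the feature encodings listed in the abstract above, k-mer nucleotide composition, can be sketched in a few lines. This is a generic illustration of the encoding under the assumption of an RNA alphabet (A, C, G, U) and frequency normalization; the function name and example sequence are hypothetical, and the paper's exact implementation may differ.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Windows containing characters outside the alphabet are skipped.
    The vector has len(alphabet) ** k entries.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical sequence: dinucleotide composition is the k = 2 case.
features = kmer_composition("ACGU", k=2)
print(len(features))  # 16
```

Setting k to 2 gives the dinucleotide composition also mentioned in the abstract, so the same helper covers both encodings.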

Meta-IL4: An ensemble learning approach for IL-4-inducing peptide prediction

09/01/2023
Authors: Mir Tanveerul Hassan, Hilal Tayara, Kil To Chong
Journal: Methods
Abstract: The cytokine interleukin-4 (IL-4) plays an important role in our immune system. IL-4 leads the way in the differentiation of naïve T-helper 0 cells (Th0) to T-helper 2 cells (Th2). The Th2 responses are characterized by the release of IL-4. CD4+ T cells produce the cytokine IL-4 in response to exogenous parasites. IL-4 has a critical role in the growth of CD8+ cells, inflammation, and responses of T-cells. We propose an ensemble model for the prediction of IL-4 inducing peptides. Four feature encodings were extracted to build an efficient predictor: pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, quasi-sequence-order, and Shannon entropy. We developed an ensemble learning model fusion of random forest, extreme gradient boost, light gradient boosting machine, and extra tree classifier in the first layer, and a Gaussian process classifier as a meta classifier in the second layer. The outcome of the benchmarking testing dataset, with a Matthews correlation coefficient of 0.793, showed that the meta-model (Meta-IL4) outperformed individual classifiers. The highest accuracy achieved by the Meta-IL4 model is 90.70%. These findings suggest that peptides that induce IL-4 can be predicted with reasonable accuracy. These models could aid in the development of peptides that trigger the appropriate Th2 response.
View Publication →
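Of the four feature encodings named in the abstract above, Shannon entropy is the simplest to illustrate. The sketch below computes the entropy of the residue distribution of a whole peptide; the function name and example sequences are hypothetical, and the paper may compute entropy over a different unit (for example, per window).

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the residue distribution of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


# Hypothetical peptides: a homopolymer and a more diverse sequence.
print(shannon_entropy("AAAA"))
print(shannon_entropy("ACDE"))
```

A sequence made of a single residue carries no uncertainty, while four equally frequent residues give two bits, so the value summarizes compositional diversity in one number.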

XGBoost framework with feature selection for the prediction of RNA N5-methylcytosine sites

08/02/2023
Authors: Zeeshan Abbas, Mobeen ur Rehman, Hilal Tayara, Quan Zou, Kil To Chong
Journal: Molecular Therapy
Abstract: 5-methylcytosine (m5C) is a critical post-transcriptional alteration that is widely present in various kinds of RNAs and is crucial to the fundamental biological processes. By correctly identifying the m5C-methylation sites on RNA, clinicians can more clearly comprehend the precise function of these m5C-sites in different biological processes. Due to their effectiveness and affordability, computational methods have received greater attention over the last few years for the identification of methylation sites in various species. To precisely identify RNA m5C locations in five different species including Homo sapiens, Arabidopsis thaliana, Mus musculus, Drosophila melanogaster, and Danio rerio, we proposed a more effective and accurate model named m5C-pred. To create m5C-pred, five distinct feature encoding techniques were combined to extract features from the RNA sequence, and then we used SHapley Additive exPlanations to choose the best features among them, followed by XGBoost as a classifier. We applied the novel optimization method called Optuna to quickly and efficiently determine the best hyperparameters. Finally, the proposed model was evaluated using independent test datasets, and we compared the results with the previous methods. Our approach, m5C-pred, is anticipated to be useful for accurately identifying m5C sites, outperforming the currently available state-of-the-art techniques.
View Publication →

iCpG-Pos: an accurate computational approach for identification of CpG sites using positional features on single-cell whole genome sequence data

08/01/2023
Authors: Sehi Park, Mobeen Ur Rehman, Farman Ullah, Hilal Tayara, Kil To Chong
Journal: Bioinformatics
Abstract: The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Models are required to create an understanding of the underlying biological systems and to project single-cell (methylated) data accurately. Results: In this study, we developed positional features for predicting CpG sites. Positional characteristics of the sequence are derived using data from CpG regions and the separation between nearby CpG sites. Multiple optimized classifiers and different ensemble learning approaches are evaluated. The OPTUNA framework is used to optimize the algorithms. The CatBoost algorithm followed by the stacking algorithm outperformed existing DNA methylation identifiers. Availability and implementation: The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. To achieve superior performance, we employed the CatBoost algorithm followed by the stacking algorithm, which outperformed existing DNA methylation identifiers. The proposed iCpG-Pos approach utilizes only positional features, resulting in a substantial reduction in computational complexity compared to other known approaches for detecting CpG site methylation patterns. In conclusion, our study introduces a novel approach, iCpG-Pos, for predicting CpG site methylation patterns. By focusing on positional features, our model offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.
View Publication →

XGB5hmC: Identifier based on XGB model for RNA 5-hydroxymethylcytosine detection

07/15/2023
Authors: Agung Surya Wibowo, Hilal Tayara, Kil To Chong
Journal: Chemometrics and Intelligent Laboratory Systems
Abstract: One of the problems in bioinformatics that artificial intelligence can solve is RNA 5-hydroxymethylcytosine (5hmC) site detection, which has become increasingly important because of its benefits, such as cost savings in labor, materials, and time consumption. To create a reliable identifier, performance results must be as high as possible. In this study, we developed XGB5hmC, a high-performance identifier of RNA 5hmC. We use extreme gradient boosting (XGB) as the best model. In addition, we investigated other models, such as random forest (RF), ada boosting (AB), and gradient boosting (GB). First, iLearnPlus was used to run 15 different machine-learning models using 35 different descriptors to select the best descriptors. The composition of k-spaced nucleic acid pairs (CKSNAP), pseudo-K-tuple nucleotide composition (PseKNC), and position-specific trinucleotide propensity single strand (PSTNPss) were identified as the best descriptors. Subsequently, the features were combined and reduced in dimension using chi-squared test filtering. Using these filtered features and the XGB model, we obtained better performance than the state-of-the-art methods. The increases in accuracy, sensitivity, specificity, and MCC values were 11.43, 15.82, 8.94, and 24.58%, respectively. This implies that our model is an improved and reliable identifier for detecting 5hmC.
View Publication →
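The CKSNAP descriptor mentioned in the abstract above can be sketched as follows. This follows the commonly used definition (frequencies of the 16 nucleotide pairs separated by 0 up to a maximum number of intervening positions); the function name, the RNA alphabet, the default maximum gap, and the example sequence are assumptions for illustration and are not taken from the paper.

```python
from itertools import product


def cksnap(seq, max_gap=2, alphabet="ACGU"):
    """Composition of k-spaced nucleic acid pairs.

    For each gap g in 0..max_gap, counts pairs (seq[i], seq[i + g + 1])
    and normalizes by the number of valid pairs for that gap.
    Returns a vector of 16 * (max_gap + 1) frequencies.
    """
    seq = seq.upper()
    pairs = [a + b for a, b in product(alphabet, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


# Hypothetical sequence, gaps 0 and 1 only.
vec = cksnap("ACGU", max_gap=1)
print(len(vec))  # 32
```

Because each gap contributes its own block of 16 values, the descriptor captures short-range pair dependencies that plain dinucleotide composition (the gap 0 block) misses.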

Artificial intelligence in drug toxicity prediction: recent advances, challenges, and future perspectives

04/26/2023
Authors: Thi Tuyet Van Tran, Agung Surya Wibowo, Hilal Tayara, Kil To Chong
Journal: Journal of Chemical Information and Modeling
Abstract: Toxicity prediction is a critical step in the drug discovery process that helps identify and prioritize compounds with the greatest potential for safe and effective use in humans, while also reducing the risk of costly late-stage failures. It is estimated that over 30% of drug candidates are discarded owing to toxicity. Recently, artificial intelligence (AI) has been used to improve drug toxicity prediction as it provides more accurate and efficient methods for identifying the potentially toxic effects of new compounds before they are tested in human clinical trials, thus saving time and money. In this review, we present an overview of recent advances in AI-based drug toxicity prediction, including the use of various machine learning algorithms and deep learning architectures, of six major toxicity properties and Tox21 assay end points. Additionally, we provide a list of public data sources and useful toxicity prediction tools for the research community and highlight the challenges that must be addressed to enhance model performance. Finally, we discuss future perspectives for AI-based drug toxicity prediction. This review can aid researchers in understanding toxicity prediction and pave the way for new methods of drug discovery.
View Publication →

Sars-escape network for escape prediction of SARS-COV-2

04/10/2023
Authors: Prem Singh Bist, Hilal Tayara, Kil To Chong
Journal: Briefings in Bioinformatics
Abstract: Viruses have coevolved with their hosts for millions of years and learned to escape the host’s immune system. Although not all genetic changes in viruses are deleterious, some significant mutations lead to the escape of neutralizing antibodies and weaken the immune system, which increases infectivity and transmissibility, thereby impeding the development of antiviral drugs or vaccines. Accurate and reliable identification of viral escape mutational sequences could be a good indicator for therapeutic design. We developed a computational model that recognizes significant mutational sequences based on escape feature identification using natural language processing along with prior knowledge of experimentally validated escape mutants. Results: Our machine learning-based computational approach can recognize the significant spike protein sequences of severe acute respiratory syndrome coronavirus 2 using sequence data alone. This modelling approach can be applied to other viruses, such as influenza, monkeypox and HIV using knowledge of escape mutants and relevant protein sequence datasets. Availability: Complete source code and pre-trained models for escape prediction of severe acute respiratory syndrome coronavirus 2 protein sequences are available on GitHub at https://github.com/PremSinghBist/Sars-CoV-2-Escape-Model.git. The dataset is deposited to Zenodo at: doi: 10.5281/zenodo.7142638. The Python scripts are easy to run and customize as needed.
View Publication →

Recent studies of artificial intelligence on in silico drug distribution prediction

01/17/2023
Authors: Thi Tuyet Van Tran, Hilal Tayara, Kil To Chong
Journal: International Journal of Molecular Sciences
Abstract: Drug distribution is an important process in pharmacokinetics because it has the potential to influence both the amount of medicine reaching the active sites and the effectiveness as well as safety of the drug. The main causes of 90% of drug failures in clinical development are lack of efficacy and uncontrolled toxicity. In recent years, several advances and promising developments in drug distribution property prediction have been achieved, especially in silico, which helped to drastically reduce the time and expense of screening undesired drug candidates. In this study, we provide comprehensive knowledge of drug distribution background, influencing factors, and artificial intelligence-based distribution property prediction models from 2019 to the present. Additionally, we gathered and analyzed public databases and datasets commonly utilized by the scientific community for distribution prediction. The distribution property prediction performance of five large ADMET prediction tools is mentioned as a benchmark for future research. On this basis, we also offer future challenges in drug distribution prediction and research directions. We hope that this review will provide researchers with helpful insight into distribution prediction, thus facilitating the development of innovative approaches for drug discovery.
View Publication →

Attention-Based Graph Neural Network for Molecular Solubility Prediction

01/12/2023
Authors: Waqar Ahmad, Hilal Tayara, Kil To Chong
Journal: ACS Omega
Abstract: Drug discovery (DD) research is aimed at the discovery of new medications. Solubility is an important physicochemical property in drug development. Active pharmaceutical ingredients (APIs) are essential substances for high drug efficacy. During DD research, aqueous solubility (AS) is a key physicochemical attribute required for API characterization. High-precision in silico solubility prediction reduces the experimental cost and time of drug development. Several artificial intelligence tools have been employed for solubility prediction using machine learning and deep learning techniques. This study aims to create different deep learning models that can predict the solubility of a wide range of molecules using the largest currently available solubility data set. Simplified molecular-input line-entry system (SMILES) strings were used as the molecular representation, and models were developed using simple graph convolution, graph isomorphism network, graph attention network, and AttentiveFP network. Based on the performance of the models, the AttentiveFP-based network model was finally selected. The model was trained and tested on 9943 compounds. The model outperformed on 62 anticancer compounds, with Pearson correlation R2 and root-mean-square error values of 0.52 and 0.61, respectively. AS prediction can be improved by refining the graph algorithm or adding more molecular properties.
View Publication →
2022

ACP-ADA: A Boosting Method with Data Augmentation for Improved Prediction of Anticancer Peptides

02/01/2022
Authors: S Bhattarai, KS Kim, H Tayara, KT Chong
Journal: International Journal of Molecular Sciences
Abstract: Cancer is the second-leading cause of death worldwide, and therapeutic peptides that target and destroy cancer cells have received a great deal of interest in recent years. Traditional wet experiments are expensive and inefficient for identifying novel anticancer peptides; therefore, the development of an effective computational approach is essential to recognize ACP candidates before experimental methods are used. In this study, we proposed an Ada-boosting algorithm with a random forest base learner, called ACP-ADA, which integrates binary profile feature, amino acid index, and amino acid composition with a 210-dimensional feature space vector to represent the peptides. Training samples in the feature space were augmented to increase the sample size and further improve the performance of the model in the case of insufficient samples. Furthermore, we used five-fold cross-validation to find model parameters, and the cross-validation results showed that ACP-ADA outperforms existing methods for this feature combination with data augmentation in terms of performance metrics. Specifically, ACP-ADA recorded an average accuracy of 86.4% and a Matthews correlation coefficient of 74.01% for dataset ACP740 and 90.83% and 81.65% for dataset ACP240; consequently, it can be a very useful tool in drug development and biomedical research.
View Publication →