Welcome to our computational pipeline for accelerating Alzheimer's Disease (AD) research. This project integrates bioinformatics and cheminformatics into a single workflow that runs from raw single-cell RNA sequencing (scRNA-seq) data to predictive Quantitative Structure-Activity Relationship (QSAR) modeling.
Our mission is to democratize access to powerful predictive tools, lowering the barrier to entry for researchers in the neurodegenerative disease space. This repository provides a comprehensive toolkit for data integration, cellular analysis, and machine learning-based bioactivity prediction.
You can access and use the live application at: https://QSARify.com
This pipeline is organized into three core modules, each providing a distinct set of functionalities.
- Integrates scRNA-seq datasets from multiple public studies (GSE138852, GSE157827, GSE175814, GSE163577) across various brain regions into a unified Seurat object.
- Normalizes the data with SCTransform and removes artifacts with DoubletFinder.
- Annotates cell types using the SingleR package with established reference datasets.
- Applies PCA for initial dimensionality reduction, Harmony for batch effect correction, and UMAP for visualization and clustering.
- Uses MiloR to identify statistically significant changes in cell population abundance between experimental conditions.
- Uses CellChat to infer and analyze intercellular communication networks, identifying key ligand-receptor interactions and signaling pathways.
- Queries ChEMBL for the target proteins (MAO-B, COX-2, VISFATIN, BACE1, AChE). It retrieves bioactivity data (e.g., IC50 values) and calculates critical ADME properties (MW, LogP, HBD, HBA, Lipinski's Rule).
- Applies the SMOTETomek technique to address class imbalance in the bioactivity data.
- A Flask-based API provides endpoints for health checks (/health), single predictions (/predict), and batch predictions (/predict_batch).
- Batch predictions accept file uploads (.txt, .xls, .xlsx).
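The Lipinski's Rule check listed among the ADME properties can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it assumes the four descriptors (MW, LogP, HBD, HBA) have already been computed elsewhere, and the `Descriptors` container, the function names, and the "at most one violation" convention are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mw: float    # molecular weight in Da
    logp: float  # octanol-water partition coefficient
    hbd: int     # number of hydrogen-bond donors
    hba: int     # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> list[str]:
    """Return the Rule-of-Five criteria that the compound violates."""
    checks = {
        "MW > 500": d.mw > 500,
        "LogP > 5": d.logp > 5,
        "HBD > 5": d.hbd > 5,
        "HBA > 10": d.hba > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: drug-like if at most one criterion is violated."""
    return len(lipinski_violations(d)) <= max_violations


if __name__ == "__main__":
    # Illustrative values only, not taken from the project's data.
    example = Descriptors(mw=350.4, logp=3.2, hbd=2, hba=5)
    print(lipinski_violations(example), passes_lipinski(example))
```

Returning the list of violated criteria, rather than a bare boolean, makes it easier to report why a compound was flagged.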
AlzheimerDisease_FromSingleCell/
├── SingleCell/                 # scRNA-seq preprocessing (R)
│   ├── Merge_Data.R
│   └── SingleCell_Main.R
├── MiloR/                      # Differential-abundance analysis (R)
│   └── MiloR_CellAbundance.R
├── CellChat/                   # Cell-cell communication analysis (R)
│   └── CellChat.R
├── QSAR/                       # QSAR modeling & web app (Python)
│   ├── figures/
│   │   └── roc_curves_comparison.png
│   ├── Data/
│   │   ├── chembl_results_P_27338_MAO-B_IC50_classified.csv
│   │   ├── chembl_results_P_35354_COX2_IC50_classified.csv
│   │   ├── chembl_results_P_43490_VISFATIN_IC50_classified.csv
│   │   ├── chembl_results_P_56817_BACE1_IC50_classified.csv
│   │   └── chembl_results_Q_04844_ACHE_IC50_classified.csv
│   ├── Model/
│   │   └── final_tuned_model.pkl
│   ├── templates/
│   │   └── index.html
│   ├── Target_Collection.ipynb
│   ├── Ligand_Final.ipynb
│   ├── app.py
│   └── requirements.txt
├── Data/                       # A large-scale analysis of over 500,000 cells was performed. A 25,000-cell subset (5,000 from each study) is provided on GitHub for convenience.
│   └── 25K_Sample.rds
└── README.md
| Script | Purpose | Key Libraries | Output |
|---|---|---|---|
| Merge_Data.R | Integrates raw scRNA-seq count matrices from multiple GSE studies. | Seurat, batchelor, SingleCellExperiment | A unified Seurat object containing all datasets. |
| SingleCell_Main.R | Performs QC, normalization, clustering, and cell type annotation. | Seurat, harmony, DoubletFinder, SingleR | A processed Seurat object with UMAPs and cell annotations. |
| MiloR_CellAbundance.R | Conducts differential abundance testing on cell neighborhoods. | miloR, SingleCellExperiment, ggplot2 | Differential abundance statistics and visualizations. |
| CellChat.R | Infers and analyzes cell-cell communication pathways. | CellChat, Seurat, dplyr | Communication network data and plots (bubble plots, heatmaps). |
| Target_Collection.ipynb | Retrieves and preprocesses bioactivity & ADME data from ChEMBL. | pandas, chembl_webresource_client, rdkit | A cleaned DataFrame and exploratory data visualizations. |
| Ligand_Final.ipynb | Trains, tunes, and evaluates the QSAR machine learning model. | scikit-learn, imbalanced-learn, rdkit, pandas | A serialized model (.pkl) and performance plots. |
| app.py | Serves a Flask-based web API for on-demand bioactivity predictions. | flask, flask-cors, joblib, rdkit | JSON responses with predictions and confidence scores. |
| index.html | Provides the interactive front-end UI for the QSAR prediction tool. | HTML, CSS, JavaScript | An interactive web interface rendered in the browser. |
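The bioactivity files produced by `Target_Collection.ipynb` carry an `_IC50_classified` suffix, which suggests that raw IC50 values are converted into activity classes. The sketch below shows one common way such a step could look; it is not the notebook's code. The nanomolar cut-offs, the three class labels, and the function names are assumptions for illustration, while the pIC50 formula itself is the standard definition.

```python
import math

# Assumed cut-offs in nM; the thresholds actually used by the project
# are not stated in this README.
ACTIVE_MAX_NM = 1_000.0
INACTIVE_MIN_NM = 10_000.0


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def classify_ic50(ic50_nm: float) -> str:
    """Assign a hypothetical activity label from an IC50 in nanomolar."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"


if __name__ == "__main__":
    # Illustrative values only, not taken from the ChEMBL results.
    for value in (50.0, 5_000.0, 20_000.0):
        print(value, round(pic50_from_nm(value), 2), classify_ic50(value))
```

Working on the pIC50 scale keeps potency values in a compact range, which is often more convenient for modeling than raw concentrations.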
To set up the project environment, please follow these steps.
(Required for SingleCell, MiloR, and CellChat analysis)
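# Install core packages from CRAN
install.packages(c("Seurat", "dplyr", "ggplot2", "patchwork", "harmony"))
# Install Bioconductor packages (scater, scran, batchelor, and SingleR are distributed via Bioconductor, not CRAN)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("SingleCellExperiment", "miloR", "glmGamPoi", "scater", "scran", "batchelor", "SingleR"))
# DoubletFinder and CellChat are not on CRAN or Bioconductor; install them from their GitHub repositories following each project's README.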
(Required for QSAR modeling and the Flask API)
# Clone the repository
git clone https://github.com/xhammady/AD-scRNA2QSAR.git
cd AD-scRNA2QSAR
# Install Python packages from requirements.txt
pip install -r QSAR/requirements.txt
Note: Key Python libraries include chembl-webresource-client, rdkit, scikit-learn, imbalanced-learn, pandas, flask, and flask-cors.
Follow this sequence to run the full analysis pipeline.
1. Run SingleCell/Merge_Data.R to combine the raw count matrices.
2. Run SingleCell/SingleCell_Main.R to perform QC, normalization, integration, and annotation.
3. Run MiloR/MiloR_CellAbundance.R to compare cell populations.
4. Run CellChat/CellChat.R to analyze signaling pathways.
5. Run the QSAR/Target_Collection.ipynb notebook to query ChEMBL and generate the analysis dataset.
6. Run QSAR/Ligand_Final.ipynb to preprocess features, train the Random Forest model, and save the final .pkl file.
7. Start the Flask API: python QSAR/app.py
8. Use the web interface for single predictions, or upload a file (.txt, .xls, .xlsx) for batch predictions.
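Once the API is running, the /predict and /predict_batch endpoints could also be called from a script. The sketch below uses only the Python standard library and is a guess at the request shape, not the documented interface: the base URL (Flask's default host and port), the JSON field names `smiles` and `smiles_list`, and the assumption that both endpoints accept JSON POST bodies are all hypothetical. Check `QSAR/app.py` for the actual schema.

```python
import json
from urllib import request

# Assumed: the app listens on Flask's default host and port.
BASE_URL = "http://127.0.0.1:5000"


def read_smiles(text: str) -> list[str]:
    """Parse one SMILES string per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_request(endpoint: str, payload: dict) -> request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    smiles = read_smiles("CCO\n\nc1ccccc1\n")
    # "smiles" and "smiles_list" are hypothetical field names.
    single = build_request("/predict", {"smiles": smiles[0]})
    batch = build_request("/predict_batch", {"smiles_list": smiles})
    print(single.full_url, batch.full_url)
    # print(send(single))  # requires `python QSAR/app.py` to be running
```

Separating request construction from sending lets the payload be inspected or logged before any network call is made.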
We welcome contributions to improve this project! Please fork the repository, create a new branch for your feature, and submit a pull request with a detailed description of your changes. Ensure you follow existing coding standards and include tests where applicable.
This project is licensed under the MIT License. See the LICENSE file for more details.
This project builds on Seurat, CellChat, MiloR, RDKit, scikit-learn, imbalanced-learn, Flask, DoubletFinder, SingleR, pandas, numpy, and matplotlib.