On this tutorial, we construct an entire pipeline for single-cell RNA sequencing evaluation utilizing Scanpy. We begin by putting in the required libraries and loading the PBMC 3k dataset, then carry out high quality management, filtering, and normalization to organize the information for downstream evaluation. We then determine extremely variable genes, carry out PCA for dimensionality discount, and assemble a neighborhood graph to generate UMAP embeddings and Leiden clusters. By way of marker gene discovery and visualization, we discover how clusters correspond to organic cell populations and implement a easy rule-based annotation technique to infer cell sorts.
import sys
import subprocess
import importlib
def pip_install(*packages):
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])
required = [
"scanpy",
"anndata",
"leidenalg",
"igraph",
"harmonypy",
"seaborn"
]
pip_install(*required)
import os
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as advert
sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
np.random.seed(42)
print("Scanpy version:", sc.__version__)
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print("nInitial AnnData:")
print(adata)
adata.layers["counts"] = adata.X.copy()
adata.var["mt"] = adata.var_names.str.higher().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
print("nQC summary:")
show(
adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe().T
)
We set up all required dependencies and import the core scientific computing libraries wanted for the evaluation. We configure Scanpy settings, initialize the atmosphere, and cargo the PBMC 3k single-cell RNA-seq dataset. We then compute quality-control metrics, together with mitochondrial gene proportion, complete counts, and the variety of detected genes, for every cell.
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.violin(adata, ["n_genes_by_counts"], jitter=0.4, ax=axs[0], present=False)
sc.pl.violin(adata, ["total_counts"], jitter=0.4, ax=axs[1], present=False)
sc.pl.violin(adata, ["pct_counts_mt"], jitter=0.4, ax=axs[2], present=False)
plt.tight_layout()
plt.present()
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", colour="pct_counts_mt")
adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
adata = adata[adata.obs["n_genes_by_counts"] <= 5000].copy()
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
sc.pp.filter_genes(adata, min_cells=3)
print("nAfter filtering:")
print(adata)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.uncooked = adata.copy()
sc.pp.highly_variable_genes(
adata,
taste="seurat",
min_mean=0.0125,
max_mean=3,
min_disp=0.5
)
print("nHighly variable genes selected:", int(adata.var["highly_variable"].sum()))
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var["highly_variable"]].copy()We visualize high quality management metrics utilizing plots to examine the distribution of gene counts and mitochondrial content material. We apply filtering steps to take away low-quality cells and genes that don’t meet fundamental expression thresholds. We then normalize the information, apply a log transformation, and determine extremely variable genes for downstream evaluation.
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True)
sc.pl.pca(adata, colour=None)
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
sc.tl.umap(adata, min_dist=0.35, unfold=1.0)
sc.tl.leiden(adata, decision=0.6, key_added="leiden")
print("nCluster counts:")
show(adata.obs["leiden"].value_counts().sort_index().rename("cells_per_cluster").to_frame())
sc.pl.umap(adata, colour=["leiden"], legend_loc="on data", title="PBMC 3k - Leiden clusters")
sc.tl.rank_genes_groups(adata, groupby="leiden", methodology="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
marker_table = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop marker rows:")
show(marker_table.head(20))We regress out technical confounders and scale the dataset to organize it for dimensionality discount. We carry out principal element evaluation to seize the dataset’s main variance construction. We then assemble the neighborhood graph, compute UMAP embeddings, carry out Leiden clustering, and determine marker genes for every cluster.
top_markers_per_cluster = (
marker_table.groupby("group")
.head(10)
.loc[:, ["group", "names", "logfoldchanges", "pvals_adj"]]
.reset_index(drop=True)
)
print("nTop 10 markers per cluster:")
show(top_markers_per_cluster)
candidate_markers = [
"IL7R", "LTB", "MALAT1", "CCR7",
"NKG7", "GNLY", "PRF1",
"MS4A1", "CD79A", "CD79B",
"LYZ", "S100A8", "FCER1A", "CST3",
"PPBP", "FCGR3A", "LGALS3", "CTSS",
"CD3D", "TRBC1", "TRAC"
]
candidate_markers = [g for g in candidate_markers if g in adata.var_names]
if candidate_markers:
sc.pl.dotplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
sc.pl.matrixplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
cluster_marker_reference = {
"T_cells": ["IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"],
"NK_cells": ["NKG7", "GNLY", "PRF1"],
"B_cells": ["MS4A1", "CD79A", "CD79B"],
"Monocytes": ["LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"],
"Dendritic_cells": ["FCER1A", "CST3"],
"Platelets": ["PPBP"]
}
We study essentially the most important marker genes detected for every cluster and summarize the highest markers. We visualize gene expression patterns throughout clusters utilizing dot plots and matrix plots for identified immune cell markers. We additionally outline a reference mapping of marker genes related to main immune cell sorts for later annotation.
available_reference = {
celltype: [g for g in genes if g in adata.var_names]
for celltype, genes in cluster_marker_reference.gadgets()
}
available_reference = {okay: v for okay, v in available_reference.gadgets() if len(v) > 0}
for celltype, genes in available_reference.gadgets():
sc.tl.score_genes(adata, gene_list=genes, score_name=f"{celltype}_score", use_raw=False)
score_cols = [f"{ct}_score" for ct in available_reference.keys()]
cluster_scores = adata.obs.groupby("leiden")[score_cols].imply()
show(cluster_scores)
cluster_to_celltype = {}
for cluster in cluster_scores.index:
greatest = cluster_scores.loc[cluster].idxmax().substitute("_score", "")
cluster_to_celltype[cluster] = greatest
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).astype("category")
print("nCluster to cell-type mapping:")
show(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=["assigned_cell_type"]))
sc.pl.umap(
adata,
colour=["leiden", "cell_type"],
legend_loc="on data",
wspace=0.45
)
sc.tl.rank_genes_groups(adata, groupby="cell_type", methodology="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)
celltype_markers = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop markers by annotated cell type:")
show(
celltype_markers.groupby("group").head(8)[["group", "names", "logfoldchanges", "pvals_adj"]]
)
cluster_prop = (
adata.obs["cell_type"]
.value_counts(normalize=True)
.mul(100)
.spherical(2)
.rename("percent")
.to_frame()
)
print("nCell-type proportions (%):")
show(cluster_prop)
plt.determine(figsize=(7, 4))
cluster_prop["percent"].sort_values().plot(variety="barh")
plt.xlabel("Percent of cells")
plt.ylabel("Cell type")
plt.title("Estimated cell-type composition")
plt.tight_layout()
plt.present()
output_dir = "scanpy_pbmc3k_outputs"
os.makedirs(output_dir, exist_ok=True)
adata.write(os.path.be part of(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
marker_table.to_csv(os.path.be part of(output_dir, "cluster_markers.csv"), index=False)
celltype_markers.to_csv(os.path.be part of(output_dir, "celltype_markers.csv"), index=False)
cluster_scores.to_csv(os.path.be part of(output_dir, "cluster_score_matrix.csv"))
print(f"nSaved outputs to: {output_dir}")
print("Files:")
for f in sorted(os.listdir(output_dir)):
print(" -", f)
abstract = {
"n_cells_final": int(adata.n_obs),
"n_genes_final": int(adata.n_vars),
"n_clusters": int(adata.obs["leiden"].nunique()),
"clusters": sorted(adata.obs["leiden"].distinctive().tolist()),
"cell_types": sorted(adata.obs["cell_type"].distinctive().tolist()),
}
print("nAnalysis summary:")
for okay, v in abstract.gadgets():
print(f"{k}: {v}")We rating every cell utilizing identified marker gene units and assign possible cell sorts to clusters primarily based on expression patterns. We visualize the annotated cell sorts on the UMAP embedding and carry out differential gene expression evaluation throughout the anticipated cell populations. Additionally, we compute cell-type proportions, generate abstract visualizations, and save the processed dataset and evaluation outputs for additional analysis.
In conclusion, we developed a full end-to-end workflow for analyzing single-cell transcriptomic information utilizing Scanpy. We carried out preprocessing, clustering, marker-gene evaluation, and cell-type annotation, and visualized the information construction utilizing UMAP and gene expression plots. By saving the processed AnnData object and evaluation outputs, we created a reusable dataset for additional organic interpretation and superior modeling. This workflow demonstrates how Scanpy allows scalable, reproducible single-cell evaluation via a structured, modular Python pipeline.
Try the Full Codes right here. Additionally, be happy to comply with us on Twitter and don’t neglect to affix our 120k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be part of us on telegram as properly.
The publish A Coding Information to Construct a Full Single Cell RNA Sequencing Evaluation Pipeline Utilizing Scanpy for Clustering Visualization and Cell Kind Annotation appeared first on MarkTechPost.



