Overview

🧹 Ashleys-QC Pipeline

This workflow uses Snakemake perform Quality Control analysis on Strand-Seq single-cell sequencing data. The starting point are single-cell FASTQ files from Strand-seq experiments and the final output produced is a folder with clean selected BAM files. The pipeline can identify automatically high-quality libraries through ML-based analysis tool ashleys-qc. Thus, the workflow goes through the following steps:

FASTQ sequencing Quality Control through FastQC
Mapping FASTQ against a reference genome throught BWA
Sorting, Deduplicating and Indexing of BAM files through Samtools & sambaba
Generating features and use ashleys-qc model to identify high-quality cells

🧩 MosaiCatcher

The starting point in MosaiCatcher are single-cell BAM files from Strand-seq experiments and the final output are SV predictions in a tabular format as well as in a graphical representation. To get to this point, the workflow goes through the following steps:

Binning of sequencing reads in genomic windows of 200kb via mosaic
Strand state detection
[Optional] Normalization of coverage with respect to a reference sample
Multi-variate segmentation of cells (mosaic)
Haplotype resolution via StrandPhaseR
Bayesian classification of segmentation to find SVs using MosaiClassifier
Visualization of results using custom R plots

Figure 1: MosaiCatcher v2 schematic representation and visualizations examples.

MosaiCatcher v2 schematic representation and visualizations examples. (A) MosaiCatcher v2 pipeline schematic representation: On the left part in dimmed orange is represented ashleys-qc-pipeline, a switchable preprocessing optional module that allows to perform standard steps of mapping, sorting, and indexing FASTQ libraries, producing quality control plots and reports as well as identifying high-quality libraries. On the right uncolored part, the MosaiCatcher core part of the pipeline is still usable as a standalone by providing Stand-Seq aligned BAM files. Green boxes correspond to data-conditional dependent execution steps (Snakemake checkpoints) that allow more flexibility and reduce issues when executing the workflow. Orange box corresponds to the multi-step normalization module. Blue box corresponds to ArbiGent mode of execution that allows SV genotyping from arbitrary segmentation. Violet box corresponds to scNOVA SV function analysis mode of execution. Dashed boxes correspond to optional modules. *: Strand-seq QC plot is only produced here if ashleys-qc pipeline is not enabled (B) quality control Strand-seq karyotype visualization based on read counting according to a defined window (here 200 kb). A karyotype plot is available for each single-cell sequenced. Additional statistics are also presented in the upper part of the figure. These plots allow the users to perform QC to assess sequencing integrity. (C) SV call clustering heatmap and chromosome-wise visualizations. The clustering heatmap on the left generates a global SV distribution representation of the sample. Each SV call is represented as a column and each library as a row. Top row annotation corresponds to chromosomes and individual heatmap cell color corresponds to a specific type of haplotype-phased SV defined in the legend. The chromosome-wise representation on the right side of the subfigure summarizes, for a given chromosome (here chr10), the different layers of information computed during the pipeline: binned read counts, joint and cell segmentation, and phased SV calls. (D) Differential nucleosome occupancy heatmap representation: this heatmap computed with the scNOVA downstream module exemplifies gene-body-based differential nucleosome occupancy profiles (Jeong et al. 2023).

📆 Roadmap

Small issues to fix

replace input_bam_location by data_location (harmonization with ashleys-qc-pipeline)
List of commands available through list_commands parameter (1.8.6
Move pysam / SM tag comparison script to snakemake rule (2.2.0)
Reference properly reference genome in IGV session script generation (2.3.5

🛑 Troubleshooting & Current limitations

Do not change the structure of your input folder after running the pipeline, first execution will build a config dataframe file (OUTPUT_DIRECTORY/config/config.tsv) that contains the list of cells and the associated paths
Do not change the list of chromosomes after a first execution (i.e: first execution on chr17, second execution on all chromosomes)

💂‍♂️ Authors (alphabetical order)

Ashraf Hufash
Cosenza Marco
Ebert Peter
Ghareghani Maryam
Grimes Karen
Gros Christina
Höps Wolfram
Jeong Hyobin
Kinanen Venla
Korbel Jan
Marschall Tobias
Meiers Sasha
Porubsky David
Rausch Tobias
Sanders Ashley
Van Vliet Alex
Weber Thomas (maintainer and current developer)

Citing MosaiCatcher

When using MosaiCatcher for a publication, please cite the following article in your paper:

MosaiCatcher v2 publication: Weber Thomas, Marco Raffaele Cosenza, and Jan Korbel. 2023. ‘MosaiCatcher v2: A Single-Cell Structural Variations Detection and Analysis Reference Framework Based on Strand-Seq’. Bioinformatics 39 (11): btad633. https://doi.org/10.1093/bioinformatics/btad633.

📕 References

MosaiCatcher v2 publication: Weber Thomas, Marco Raffaele Cosenza, and Jan Korbel. 2023. ‘MosaiCatcher v2: A Single-Cell Structural Variations Detection and Analysis Reference Framework Based on Strand-Seq’. Bioinformatics 39 (11): btad633. https://doi.org/10.1093/bioinformatics/btad633

Strand-seq publication: Falconer, E., Hills, M., Naumann, U. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat Methods 9, 1107–1112 (2012). https://doi.org/10.1038/nmeth.2206

scTRIP/MosaiCatcher original publication: Sanders, A.D., Meiers, S., Ghareghani, M. et al. Single-cell analysis of structural variations and complex rearrangements with tri-channel processing. Nat Biotechnol 38, 343–354 (2020). https://doi.org/10.1038/s41587-019-0366-x

ArbiGent publication: Porubsky, David, Wolfram Höps, Hufsah Ashraf, PingHsun Hsieh, Bernardo Rodriguez-Martin, Feyza Yilmaz, Jana Ebler, et al. 2022. “Recurrent Inversion Polymorphisms in Humans Associate with Genetic Instability and Genomic Disorders.” Cell 185 (11): 1986-2005.e26. https://doi.org/10.1016/j.cell.2022.04.017

scNOVA publication: Jeong, Hyobin, Karen Grimes, Kerstin K. Rauwolf, Peter-Martin Bruch, Tobias Rausch, Patrick Hasenfeld, Eva Benito, et al. 2022. “Functional Analysis of Structural Variants in Single Cells Using Strand-Seq.” Nature Biotechnology, November, 1–13. https://doi.org/10.1038/s41587-022-01551-4