Usage
Note
- Please be careful of your conda/mamba setup, if you applied specific constraints/modifications to your system, this could lead to some versions discrepancies.
- mamba is usually preferred but might not be installed by default on a shared cluster environment
- You will need to verify that this conda environment is activated and provide the right snakemake before each execution (
which snakemake
command)
Get Started
Note for 🇪🇺 EMBL users
Use the following command for singularity-args parameter: --singularity-args "-B /g,/scratch"
Run pipeline on example data
Run on example data on only one small chromosome (<disk>
must be replaced by your disk letter/name, /g
or /scratch
at EMBL for example)
snakemake \
--cores 6 \
--configfile .tests/config/simple_config.yaml \
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/local/conda_singularity/ \
--singularity-args "-B /<disk>:/<disk>"
Then, generate a report on example data.
snakemake \
--cores 1 \
--configfile .tests/config/simple_config.yaml \
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/local/conda_singularity/ \
--singularity-args "-B /disk:/disk" \
--report report.zip \
--report-stylesheet workflow/report/custom-stylesheet.css
Run on your own data locally
- Run your own analysis locally (
<disk>
must be replaced by your disk letter/name,/g
or/scratch
at EMBL for example)
snakemake \
--cores <N> \
--config \
data_location=<PATH> \
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/local/conda_singularity/ \
--singularity-args "-B /<disk>:/<disk>"
Detailed instructions
MosaiCatcher full execution (with ashleys-qc preprocessing module enabled)
Directory structure
The ashleys-qc-pipeline takes as input Strand-Seq fastq file following the structure below:
Parent_folder
|-- Sample_1
| `-- fastq
| |-- Cell_01.1.fastq.gz
| |-- Cell_01.2.fastq.gz
| |-- Cell_02.1.fastq.gz
| |-- Cell_02.2.fastq.gz
| |-- Cell_03.1.fastq.gz
| |-- Cell_03.2.fastq.gz
| |-- Cell_04.1.fastq.gz
| `-- Cell_04.2.fastq.gz
|
`-- Sample_2
`-- fastq
|-- Cell_21.1.fastq.gz
|-- Cell_21.2.fastq.gz
|-- Cell_22.1.fastq.gz
|-- Cell_22.2.fastq.gz
|-- Cell_23.1.fastq.gz
|-- Cell_23.2.fastq.gz
|-- Cell_24.1.fastq.gz
`-- Cell_24.2.fastq.gz
Thus, in a Parent_Folder
, create a subdirectory Parent_Folder/sampleName/
for each sample
. Each Strand-seq FASTQ files of this sample need to go into the fastq
folder and respect the following syntax: <CELL>.<1|2>.fastq.gz
, 1|2
corresponding to the pair identifier.
Samples naming
We advice to limit to the number of 2 the "_" characters in your file names (ashleys-qc tool will react weirdly with 3 or more "_" in the sample name)
MosaiCatcher execution (without preprocessing)
BAM input requirements
It is important to follow these rules for Strand-Seq single-cell BAM data
- BAM file name ending by suffix:
.sort.mdup.bam
- One BAM file per cell
- Sorted and indexed
- If BAM files are not indexed, please use a writable folder in order that the pipeline generate itself the index
.bai
files - Timestamp of index files must be newer than of the BAM files
- Each BAM file must contain a read group (
@RG
) with a common sample name (SM
), which must match the folder name (sampleName
above)
Filtering of low-quality cells impossible to process
From Mosaicatcher version ≥ 1.6.1, a snakemake checkpoint rule was defined (filter_bad_cells) to automatically remove from the analysis cells flagged as low-quality.
The ground state to define this status (usable/not usable) is determined from column pass1 (enough coverage to process the cell) in mosaic count info file (see example below). The value of pass1 is not static and can change according the window size used (the larger the window, the higher the number of reads to retrieve in a given bin).
Cells flagged as low-quality cells are listed in the following TSV file: <OUTPUT>/cell_selection/<SAMPLE>/labels.tsv
.
Directory organisation - Current behavior
In its current flavour, MosaiCatcher requires that input data must be formatted the following way :
Parent_folder
|-- Sample_1
| `-- bam
| |-- Cell_01.sort.mdup.bam
| |-- Cell_02.sort.mdup.bam
| |-- Cell_03.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
|
`-- Sample_2
`-- bam
|-- Cell_21.sort.mdup.bam
|-- Cell_22.sort.mdup.bam
|-- Cell_23.sort.mdup.bam
`-- Cell_24.sort.mdup.bam
In a Parent_Folder
, create a subdirectory Parent_Folder/sampleName/
for each sample
. Your Strand-seq BAM files of this sample go into the following folder:
bam
for the total set of BAM files
Using the classic behavior, cells flagged as low-quality will only be determined based on coverage see Note here.
Directory organisation - Old behavior
Previous version of MosaiCatcher (version ≤ 1.5) needed not only a all
directory as described above, but also a selected
folder (now renamed bam
), presenting only high-quality selected libraries wished to be processed for the rest of the analysis.
You can still use this behavior by enabling the config parameter either by the command line: =True
or by modifying the corresponding entry in the config/config.yaml file.
Thus, in a Parent_Folder
, create a subdirectory Parent_Folder/sampleName/
for each sample
. Your Strand-seq BAM files of this sample go into the following folder:
all
for the total set of BAM filesselected
for the subset of BAM files that were identified as high-quality (either by copy or symlink)
Your <INPUT>
directory should look like this:
Parent_folder
|-- Sample_1
| |-- bam
| | |-- Cell_01.sort.mdup.bam
| | |-- Cell_02.sort.mdup.bam
| | |-- Cell_03.sort.mdup.bam
| | `-- Cell_04.sort.mdup.bam
| `-- selected
| |-- Cell_01.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
|
`-- Sample_2
|-- bam
| |-- Cell_01.sort.mdup.bam
| |-- Cell_02.sort.mdup.bam
| |-- Cell_03.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
`-- selected
|-- Cell_03.sort.mdup.bam
`-- Cell_04.sort.mdup.bam
Using the
input_bam_legacy
parameter, cells flagged as low-quality will be determined both based on their presence in theselected
folder presented above and on coverage see Note here.
Warning
- Using the
input_bam_legacy
parameter, only intersection between cells present in the selected folder and with enough coverage will be kept. Example: if a library is present in the selected folder but present a low coverage see Note here, this will not be processed. ashleys_pipeline=True
andinput_bam_legacy=True
are mutually exclusive- We advice to limit to the number of 2 the "" characters in your file names (Ashleys-qc ML tool will react weidly with more than 3 "")
The --profile
argument will define whether you want to execute the workflow locally on your laptop/desktop (workflow/snakemake_profiles/mosaicatcher-pipeline/v7/local/conda_singularity//
), or if you want to run it on HPC. A generic slurm profile is available (workflow/snakemake_profiles/mosaicatcher-pipeline/v7/HPC/slurm_generic/
).
Local execution (not HPC or cloud):
snakemake \
--cores <N> \
--config \
data_location=<DATA_FOLDER> \
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/local/conda_singularity/ \
--singularity-args "-B /<disk>:/<disk>"
HPC execution (SLURM)
HPC execution (require first that you modify the workflow/snakemake_profiles/mosaicatcher-pipeline/HPC/slurm_generic/config.yaml for your own cluster)
snakemake \
--config \
data_location=<DATA_FOLDER> \
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/HPC/slurm_generic/ \
--singularity-args "-B /<disk>:/<disk>"
Current strategy to solve HPC job Out-of-Memory (OOM)
Workflow HPC execution usually needs to deal with out of memory (OOM) errors, out of disk space, abnormal paths or missing parameters for the scheduler. To deal with OOM, we are currently using snakemake restart feature that allows to automatically double allocated memory to the job at each attempt (limited to 6 for the moment). Then, if a job fails to run with the default 1GB of memory allocated, it will be automatically restarted tith 2GB at the 2nd attempt, 4GB at the 3rd, etc.
EMBL HPC execution
In the EMBL HPC execution profile, disks binding points already set in the snakemake profile
--profile workflow/snakemake_profiles/mosaicatcher-pipeline/v7/HPC/slurm_EMBL/
Generate report (optional)
Optionally, it's also possible to generate a static interactive HTML report to explore the different plots produced by MosaiCatcher, using the same command as before and just adding at the end the following:
--report <report>.zip --report-stylesheet workflow/report/custom-stylesheet.css
, which correspond to the following complete command:
snakemake \
--config \
data_location=<INPUT_FOLDER> \
--singularity-args "-B /<mounting_point>:/<mounting_point>" \
--profile workflow/snakemake_profiles/slurm_generic/ \
--report <report>.zip --report-stylesheet workflow/report/custom-stylesheet.css
Note
The zip file produced can be heavy (~1GB for 24 HGSVC samples ; 2000 cells) if multiple samples are processed in parallel in the same output folder.
Modes of execution
ArbiGent mode of execution
From 1.9.0, it's now possible to run MosaiCatcher in order to genotype a given list of positions specified in a bed file. To do so, arbigent
config parameter need to be set to True
.
Thus, an alternative branch of the pipeline will be executed instead of the classic one, targetting results produced by ArbiGent.
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
arbigent=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
A generic BED file is provided in workflow/data/arbigent/manual_segmentation.bed
. A custom BED file can be specified by modifying arbigent_bed_file
config parameter to the path of your choice.
Note
If you modify the chromosome list to be processed (remove chrX & chrY for instance), a dedicated rule will extract only from the BED file, the rows matching the list of chromosomes to be analysed.
scNOVA mode of execution
Warning
Following Anaconda licensing issues with EMBL, scNOVA can only be executed using profiles based on singularity where the conda environment were already built.
From 1.9.2, it's now possible to run scNOVA directly from MosaiCatcher in order to determine the nucleosome occupancy associated to the SV calls provided during MosaiCatcher execution.
To do so, mosaicatcher must be executed during a first step in order to generate the SV calls and associated plots/tables required to determine subclonality.
Once the subclonality was determined, a table respecting the following template must be provided at this path <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user/input_subclonality.txt
where the different clones are named clone<N>
:
Filename | Subclonality |
---|---|
TALL3x1_DEA5_PE20406 | clone2 |
TALL3x1_DEA5_PE20414 | clone2 |
TALL3x1_DEA5_PE20415 | clone1 |
TALL3x1_DEA5_PE20416 | clone1 |
TALL3x1_DEA5_PE20417 | clone1 |
TALL3x1_DEA5_PE20418 | clone1 |
TALL3x1_DEA5_PE20419 | clone1 |
TALL3x1_DEA5_PE20421 | clone1 |
TALL3x1_DEA5_PE20422 | clone1 |
Once done, you can run the exact same command as previously and set scNOVA
config parameter to True
like the following:
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
scNOVA=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
Multistep normalisation
From 2.1.0, it's now possible to use multistep normalisation (library size normalisation, GC correction, Variance Stabilising Transformation) resulting counts file, not only for visualisation purpose, but also to rely on it during the SV calling framework. To do so
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
scTRIP multiplot
From 2.2.2, scTRIP multiplot (from Marco Cosenza) is now compatible with MosaiCatcher. The single requirement is to clone scTRIP multiplot repository (please reach out Marco if you want to access the repository, currently private) inside workflow/scripts/plotting/scTRIP_multiplot
.
By default, scTRIP multiplot is set to false, to enable it, please update the config/config.yaml
file or use scTRIP_multiplot=True
in the config section of the command line. An example of a scTRIP multiplot is available here
Warning
multistep_normalisation_for_SV_calling
is mutually exclusive with hgsvc_based_normalisation
IGV tracks upload
To upload IGV tracks generated by MosaiCatcher to the genome browser, you can follow the aforementionned steps:
- Start IGV
- Go to
File
>Open Session
- Navigate to the mosaicatcher output folder and then in
plots/IGV
- Select the
_IGV_session.xml file
The reference genome set during Mosaicatcher execution should automatically be loaded in IGV, as well as Watson/Crick and SV tracks for each cell.