⚙️ Parameters

MosaiCatcher arguments

How to pass configuration arguments?

All these arguments can be specified in two ways:

In the config/config.yaml file, by replacing existing values
Using the --config snakemake argument (--config must be called only one time with all the arguments behind it, e.g: --config input_bam_location=<INPUT> output_location=<OUTPUT> email=<EMAIL>)

General parameters

Parameter	Comment	Default	Example
`email`	Email address for completion summary	None	None

Data location & Input/output options

Parameter	Comment	Parameter type	Default
`data_location`	Path to parent folder containing samples	String	.tests/data_CHR17/
`samples_to_process`	If multiple plates in the data_location parent folder, specify one or a comma-sep list of samples	None	"[SampleA,SampleB]"
`publishdir`	Path to backup location where important data is copied (deprecated — will be removed in a future version)	String

Ashleys-QC upstream pipeline

Parameter	Comment	Parameter type	Default
`input_bam_legacy`	Mutualy exclusive with ashleys_pipeline. Will use `selected` folder to identify high-quality libraries to process	Boolean	False
`ashleys_pipeline`	Allow to load and use ashleys-qc-pipeline snakemake preprocessing module and to start from FASTQ inputs	Boolean	False
`ashleys_threshold`	Threshold for Ashleys-qc binary classification	Float	0.5
`bypass_ashleys`	Set all cells as high-quality (labels to 1)	False
`MultiQC`	Enable or disable MultiQC analysis (includes FastQC, samtools flagstats & idxstats)	Boolean	False
`hand_selection`	Enable or disable hand selection through the Jupyter Notebook	Boolean	False
`keep_ashleys_predictions`	Retain ashleys ML prediction files after pipeline completion	Boolean	True
`split_qc_plot`	Enable or disable the split of QC plot into individual pages plots	Boolean	False
`paired_end`	Enable or disable the use of paired-end data	Boolean	False

Reference data & Chromosomes

Parameter	Comment	Default (options)
`reference`	Reference genome	hg38 (hg19, T2T, mm10, mm39)
`reference_base_dir`	Base directory for reference files (allows multi-user sharing on HPC)	/g/korbel/shared/genomes
`chromosomes`	List of chromosomes to be processed	Human: chr[1..22,X,Y], Mouse: chr[1..20,X,Y]
`chromosomes_to_exclude`	List of chromosomes to exclude	[]

Reference genome support

Fully supported: hg38, hg19, T2T, mm10, mm39
Framework-ready (normalization files pending): canfam3, canfam4
See Reference Genomes for assembly-specific notes and caveats.

BWA Index options

Parameter	Comment	Default
`download_prebuilt_indexes`	Download pre-built BWA indexes from AWS iGenomes instead of building them locally	False

Counts configuration

Parameter	Comment	Default
`window`	Window size used for binning by mosaic count (Can be of high importance regarding library coverage)	100000
`blacklist_regions`	Enable/Disable blacklisting	True

Optional modules

Parameter	Comment	Default
`multistep_normalisation`	Allow to perform multistep normalisation including GC correction for visualization (Marco Cosenza).	False
`ploidy`	Enable ploidy estimation module to detect haploid chromosomes / segments	True
`breakpointR`	Enable breakpointR module to compute breakpoints on Strand-Seq data (David Porubsky).	False

Targeted execution

Parameter	Comment	Default
`ashleys_pipeline_only`	Stop the execution after ashleys-qc-pipeline submodule. Requires `ashleys_pipeline` to be True	Boolean
`breakpointR_only`	Stop the execution after breakpointR submodule. Requires `breakpointR` to be True	False
`whatshap_only`	Stop the execution after whatshap submodule (haplotagging of BAM files).	False

SV calling parameters

Parameter	Comment	Default
`multistep_normalisation_for_SV_calling`	Allow to use multistep normalisation count file during SV calling (Marco Cosenza).	False
`hgsvc_based_normalized_counts`	Use HGSVC based normalisation .	True

SV calling algorithm processing options

Parameter	Comment	Default
`min_diff_jointseg`	Minimum difference in error term to include another breakpoint in the joint segmentation (default=0.5)	0.1
`min_diff_singleseg`	Minimum difference in error term to include another breakpoint in the single-cell segmentation (default=1)	0.5
`additional_sce_cutoff`	Minimum gain in mismatch distance needed to add an additional SCE	20000000
`sce_min_distance`	Minimum distance of an SCE to a break in the joint segmentation	500000
`llr`	Likelihood ratio used to detect SV calls	4

Downstream analysis

Parameter	Comment	Default
`arbigent`	Enable ArbiGent mode of execution to genotype SV based on arbitrary segments	False
`arbigent_bed_file`	Allow to specify custom ArbiGent BED file	""
`scNOVA`	Enable scNOVA mode of execution to compute Nucleosome Occupancy (NO) of detected SV	False
`scTRIP_multiplot`	Enable scTRIP multiplot generation for all chromosomes of all cells	False

EMBL specific options

Parameter	Comment	Default
`genecore`	Enable/disable genecore mode to give as input the genecore shared folder in /g/korbel/shared/genecore	False
`genecore_date_folder`	Specify folder to be processed
`genecore_prefix`	Specify genecore prefix folder	/g/korbel/STOCKS/Data/Assay/sequencing/2023
`genecore_regex_elements`	Specify genecore regex element to be used to distinguish sample from well number	PE20

If genecore and genecore_date_folder are correctly specified, each plate will be processed independently by creating a specific folder in the data_location folder.

Execution profile

Location: workflow/snakemake_profiles/

Parameter	Comment	Conda	Singularity	HPC	Local
local/conda	/	X			X
local/conda_singularity	/	X	X		X
HPC/slurm_generic (to modify)	/	X		X
HPC/slurm_EMBL (optimised for EMBL HPC)	/	X		X