🧬 Reference Genomes
MosaiCatcher supports multiple reference genome assemblies. Each assembly requires specific normalization files, blacklist regions, and a matching BSgenome R package for StrandPhaseR.
Supported Assemblies
| Assembly | Organism | Status | Container tag |
|---|---|---|---|
| hg38 (GRCh38) | Human | Fully supported | hg38-v2.4.0 |
| hg19 (GRCh37) | Human | Fully supported | hg19-v2.4.0 |
| T2T (CHM13v2) | Human | Fully supported | T2T-v2.4.0 |
| mm10 (GRCm38) | Mouse | Fully supported | mm10-v2.4.0 |
| mm39 (GRCm39) | Mouse | Fully supported (see caveats) | mm39-v2.4.0 |
| canfam3 | Dog | Framework-ready (normalization files pending) | — |
| canfam4 | Dog | Framework-ready (normalization files pending) | — |
Selecting a Reference
Set the reference parameter in your config or via --config:
snakemake --cores <N> --config data_location=<PATH> reference=mm39 --sdm conda
For containerized runs, the container image is automatically selected to match the reference value:
snakemake --cores <N> \
--config data_location=<PATH> reference=hg19 \
--sdm conda apptainer \
--apptainer-args "-B /g:/g"
# Uses ghcr.io/friendsofstrandseq/mosaicatcher-pipeline:hg19-v2.4.0
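The tag pattern in the table above can be sketched as a small shell helper, for example to pre-pull the image on a login node. This is a minimal illustration, not part of the pipeline: the function name `mosaicatcher_image` is hypothetical, and it assumes the tag is always `<reference>-v2.4.0` with the reference spelled exactly as in the table.

```shell
# Hypothetical helper: print the container image expected for a given reference.
# Assumption: the tag is "<reference>-v2.4.0", as listed in the table above.
mosaicatcher_image() {
    case "$1" in
        hg38|hg19|T2T|mm10|mm39)
            printf 'ghcr.io/friendsofstrandseq/mosaicatcher-pipeline:%s-v2.4.0\n' "$1"
            ;;
        *)
            # canfam3/canfam4 and unknown values have no container tag listed
            printf 'no container tag listed for reference: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

mosaicatcher_image hg19
# prints ghcr.io/friendsofstrandseq/mosaicatcher-pipeline:hg19-v2.4.0
```

The helper returns a non-zero status for canfam3 and canfam4, which have no container tag in the table.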
Assembly-Specific Notes
mm39 (GRCm39) Caveats
mm39 is fully supported, with two known limitations:
Missing 50 kb / 100 kb normalization files
Normalization files for 50 kb and 100 kb bin sizes are not yet available for mm39. Use window: 200000 (200 kb bins) when analyzing mm39 data:
# config/config.yaml
reference: mm39
window: 200000
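A pre-flight check along these lines could catch an unsupported bin size before a run is submitted. This is only a sketch: `check_window` is a hypothetical wrapper-script function, not something the pipeline provides, and it flags only the two bin sizes named above as missing for mm39.

```shell
# Hypothetical pre-flight check: flag bin sizes with no mm39 normalization file.
# $1 = reference, $2 = window (bin size in bp)
check_window() {
    if [ "$1" = "mm39" ]; then
        case "$2" in
            50000|100000)
                echo "mm39: no normalization file for window=$2; use window=200000" >&2
                return 1
                ;;
        esac
    fi
    return 0
}

# Passes silently for the recommended setting, then reports success
check_window mm39 200000 && echo "window ok"
```

Calling `check_window mm39 100000` prints a message to stderr and returns a non-zero status, so it can gate a `snakemake` call in a submission script.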
HGSVC blacklist — chrX validation needed for XX samples
The blacklist regions file for mm39 was generated using an XO (male) control cell line. As a result, chrX coverage normalization may be biased for XX (female) samples. We recommend:
- Excluding chrX from analysis for XX samples:
chromosomes_to_exclude: [chrX]
- Or interpreting chrX SV calls from XX mm39 samples with extra caution
mm39 chrX results
For XX (female) mouse samples on mm39: chrX blacklist regions were derived from a male (XO) control. SV calls on chrX should be validated independently.
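One way to apply both mm39 recommendations together is a small override file passed with Snakemake's `--configfile` option. The sketch below assumes you keep per-run overrides in a separate file; the file name `mm39_xx.yaml` is hypothetical, and the keys are the ones shown in this section.

```shell
# Write a hypothetical override file for XX mm39 samples.
# Assumption: mm39_xx.yaml is an arbitrary name chosen for this example.
cat > mm39_xx.yaml <<'EOF'
reference: mm39
window: 200000
chromosomes_to_exclude: [chrX]
EOF

# Then layer it on top of the default config, for example:
#   snakemake --cores <N> --configfile mm39_xx.yaml \
#       --config data_location=<PATH> --sdm conda
cat mm39_xx.yaml
```

Keeping the overrides in their own file leaves `config/config.yaml` unchanged for runs on other assemblies.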
canfam3 / canfam4 (Dog)
The reference genome infrastructure (genome download, chromosome lists) is in place for canfam3 and canfam4. However, normalization files and blacklist regions are not yet generated. These assemblies are framework-ready and will be fully supported in a future release.
Pre-built BWA Indexes
To skip building BWA indexes locally, enable the download_prebuilt_indexes parameter:
download_prebuilt_indexes: True
This downloads pre-built indexes from AWS iGenomes, which is useful on HPC systems where building indexes is time-consuming or storage is constrained.
Shared Reference Directory (HPC)
On HPC systems with multiple users, configure reference_base_dir to point to a shared location:
reference_base_dir: /g/korbel/shared/genomes
All users sharing this directory avoid redundant downloads and index building.