Reproduce publication results
Full publication example
If you wish to reproduce the results presented in the MosaiCatcher v2 publication, you can follow the steps below:
Download and prepare data
As the complete dataset exceeds 50GB, the data was split across two separate Zenodo repositories: https://zenodo.org/record/7696695 & https://zenodo.org/record/7697329. First create a directory in a location with at least 300GB of free space, then download the three tar.gz parts and extract them:
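Before starting the download, you may want to confirm that the target filesystem offers enough free space; a minimal check could look like the following (the path is a placeholder for the location you will set as YOUR_LOCATION below):
# Check available space on the target filesystem (placeholder path)
df -h /path/to/your/location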
# Specify your location
YOUR_LOCATION=DEFINE_YOUR_PATH_HERE
# Create directory
mkdir -p $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
# Download tar.gz parts
wget -O $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED/MOSAICATCHER_DATA_PAPER.tar.gz.part1 "https://zenodo.org/record/7696695/files/MOSAICATCHER_DATA_PAPER.tar.gz.part1?download=1"
wget -O $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED/MOSAICATCHER_DATA_PAPER.tar.gz.part2 "https://zenodo.org/record/7697329/files/MOSAICATCHER_DATA_PAPER.tar.gz.part2?download=1"
wget -O $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED/MOSAICATCHER_DATA_PAPER.tar.gz.part3 "https://zenodo.org/record/7697329/files/MOSAICATCHER_DATA_PAPER.tar.gz.part3?download=1"
# Extract
for f in $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED/*part*; do tar -xvf "$f" -C $YOUR_LOCATION/mosaicatcher_publication_data; done
You can then verify the checksums to make sure the files were downloaded properly.
# Verify checksum
for f in C7 RPE-BM510 RPE1-WT LCL; do
  (cd $YOUR_LOCATION/mosaicatcher_publication_data/"$f"/fastq && md5sum -c *.md5)
done
If all the checks were successful, you can remove the COMPRESSED directory before running the pipeline.
rm -r $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
Run the pipeline
A generic Slurm profile is provided in the following directory: mosaicatcher-pipeline/workflow/snakemake_profiles/HPC/slurm_generic. Feel free to edit it or to create another profile specific to your cluster or institute.
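As an illustration, a cluster-specific profile could be derived from the generic one as sketched below. This is a minimal sketch only: the profile name, partition, and resource values are placeholders to adapt to your Slurm setup, and a Snakemake 7-style cluster submission command is assumed.
# Hypothetical sketch: derive a cluster-specific profile from the generic one
mkdir -p workflow/snakemake_profiles/HPC/slurm_mycluster
cat > workflow/snakemake_profiles/HPC/slurm_mycluster/config.yaml << 'EOF'
# Slurm submission command built from each rule's threads/resources
cluster: "sbatch --partition={resources.partition} --cpus-per-task={threads} --mem={resources.mem_mb} --time={resources.time}"
default-resources:
  - partition=medium
  - mem_mb=8000
  - time="12:00:00"
jobs: 100
use-conda: true
use-singularity: true
printshellcmds: true
rerun-incomplete: true
EOF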
snakemake \
--config \
data_location=$YOUR_LOCATION/mosaicatcher_publication_data \
reference=T2T \
split_qc_plot=True \
ashleys_pipeline=True \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=False \
hgsvc_based_normalized_counts=True \
--profile workflow/snakemake_profiles/HPC/slurm_generic/
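If you first want to preview which jobs will be scheduled without submitting anything to the cluster, you can run the same command with Snakemake's --dry-run flag appended:
snakemake \
--config \
data_location=$YOUR_LOCATION/mosaicatcher_publication_data \
reference=T2T \
split_qc_plot=True \
ashleys_pipeline=True \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=False \
hgsvc_based_normalized_counts=True \
--profile workflow/snakemake_profiles/HPC/slurm_generic/ \
--dry-run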
Generate the report
snakemake \
--config \
data_location=$YOUR_LOCATION/mosaicatcher_publication_data \
reference=T2T \
split_qc_plot=True \
ashleys_pipeline=True \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=False \
hgsvc_based_normalized_counts=True \
--profile workflow/snakemake_profiles/HPC/slurm_generic/ \
--report $YOUR_LOCATION/MosaiCatcher_V2_publication_data.report.zip \
--report-stylesheet workflow/report/custom-stylesheet.css
You can then extract the zip archive and open the report.html file to access the HTML report.
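A minimal sketch, assuming the unzip utility is available (the exact layout inside the archive may differ):
# Extract the report archive and open the HTML report in a browser
unzip $YOUR_LOCATION/MosaiCatcher_V2_publication_data.report.zip -d $YOUR_LOCATION/MosaiCatcher_V2_publication_report
xdg-open $YOUR_LOCATION/MosaiCatcher_V2_publication_report/report.html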