9. Commands for Hi-C and Micro-C processing

CustardPy internally executes Juicer and juicertools. See the original website for the full description of each command.

9.1. Hi-C analysis

9.1.1. custardpy_juicer

custardpy_juicer is an end-to-end pipeline for Juicer analysis. It executes juicer_map.sh, juicer_pigz.sh, plot_distance_count.sh, juicer_callTAD.sh, call_HiCCUPS.sh, makeMatrix_intra.sh, makeEigen.sh, and makeInslationScore.sh. The results are stored in CustardPyResults_Hi-C/Juicer_$build/$prefix/.

custardpy_juicer [Options] <fastqdir> <label>
  <fastqdir>: directory that contains input fastq files (e.g., "fastq/sample1")
  <label>: label of the input (e.g., "sample1")
  Options:
      -i index : bwa index
      -g genometable : genome table file (describing the chromosome length)
      -e enzyme : enzyme (HindIII|MboI|DpnII|Sau3AI|Arima, default: HindIII)
      -b build : genome build (hg19|hg38|mm10|mm39|rn7|galGal5|galGal6|ce10|ce11|danRer11|dm6|xenLae2|sacCer3)
      -z [_|_R]: if the filename of fastqs is *_[1|2].fastq, supply "_". if *_[R1|R2].fastq, choose "_R". (default: "_")
      -o outputdir : output directory (default: 'CustardPyResults_Hi-C')
      -n [NONE|VC|VC_SQRT|KR|SCALE] : normalization type (default: SCALE)
      -a <refFlat>: gene annotation file
      -r resolutions : resolutions for 1D metrics calculation (default: "25000 50000 100000", should be quoted and separated by spaces)
      -p ncore: number of CPUs (default: 32)
      -m tmpdir: tempdir
      -L: Allocate larger memory ("-Xms1024m -Xmx655360m", default: "-Xms512m -Xmx65536m", for deep-sequenced samples; e.g., Rao 2014)
  Example:
      custardpy_juicer -i bwaindex/hg38 -g genometable.hg38.txt -b hg38 -e HindIII -z _R -a refFlat.hg38.txt fastq/Hap1-A Hap1-A

9.1.2. juicer_map.sh

juicer_map.sh map FASTQ files using BWA and generates .hic file using juicertools. The input FASTQ files can be gzipped (.fastq.gz). The results including the .hic file is outputted in $odir. The BWA index of the reference genome is necessary.

juicer_map.sh [options] <fastqdir> <odir> <build> <gt> <bwaindex> <enzyme> <fastq_post>
  <fastqdir>: directory that contains input fastq files (e.g., "fastq/sample1")
  <odir>: output directory (e.g., "JuicerResults/sample1")
  <build>: genome build (e.g., hg38)
  <gt>: genome table
  <bwaindex>: index file of BWA
  <enzyme>: enzyme (e.g., HindIII, MboI)
  <fastq_post [_|_R]>: if the filename of fastqs is *_[1|2].fastq, supply "_". if *_[R1|R2].fastq, choose "_R".

Options:
    -p ncore: number of CPUs (default: 32)
    -m tmpdir: tempdir
    -L: Allocate larger memory ("-Xms1024m -Xmx655360m", default: "-Xms512m -Xmx65536m", for deep-sequenced samples; e.g., Rao 2014)
Example:
  juicer_map.sh $(pwd)/fastq/Hap1-A/ $(pwd)/JuicerResults/Hap1-A hg38 genometable.hg38.txt bwaindex/hg38 HindIII _R

Note

The input FASTQ files of each sample should be stored in the separated directory. For example, if there are three Hi-C samples (sample1, sample2, and sample3), the fastq files should be in fastq/sample1/, fastq/sample2/, and fastq/sample3/.

  • Output

    • merged_nodups.txt … mapped fragments

    • inter.hic … .hic file (without quality filtering)

    • inter.txt … stats for inter.hic

    • inter_30.hic … .hic file (Q>=30)

    • inter_30.txt … stats for inter_30.hic

We recommend using inter_30.hic for the downstream analysis.

9.1.3. juicer_pigz.sh

Since the output files of Juicer are quite large, CustardPy provide a script juicer_pigz.sh that compresses the intermediate files. This command is optional while custardpy_juicer implements it.

juicer_pigz.sh <odir>
  <odir> output directory of juicer_map.sh (e.g., "JuicerResults/sample1")

Note that some commands provided in Juicer use the intermediate files (e.g, mega.sh). Because these commands do not accept the compressed format, use juicer_unpigz.sh that uncompresses the compressed files.

juicer_unpigz.sh <odir>
  <odir> output directory of juicer_map.sh (e.g., "JuicerResults/sample1")

9.1.4. plot_distance_count.sh

plot_distance_count.sh calcultes the fragment distance and generates a figure (.pdf). The result is outputted in distance/ directory.

plot_distance_count.sh <label> <odir>
  <label>: title of the figure
  <odir> output directory of juicer_map.sh (e.g., "JuicerResults/sample1")
  • Output

    • distance_vs_count.10kb.MAPQ30.pdf … figure of distance plot

    • distance_vs_count.10kb.MAPQ30.txt … values for the plot

    • distance_vs_count.10kb.MAPQ30.log.pdf … figure of distance plot (log scale)

    • distance_vs_count.10kb.MAPQ30.log.txt … values for the plot (log scale)

Alternate

9.1.5. custardpy_cooler_HiC

custardpy_cooler_HiC uses Cooler to generates .cool and .hic files from FASTQ files. The input FASTQ files can be gzipped (.fastq.gz).

BWA and chromap can be used for mapping reads (use -t option). The results are stored in CustardPyResults_Hi-C/Cooler_$build/$prefix/.

The index file of BWA or chromap (-i <index>) and the fasta file of the reference genome (-f <genome>) are required.

custardpy_cooler_HiC [options] -i <index> -g <gt> -f <genome> <fastqdir> <odir>
  fastqdir: Directory that contains input fastq files (e.g., "fastq/")
  odir: Name of output directory

  Options:
    -S stage : steps to be executed [all|pairs|postproc] (default: all)
      all: execute all process (default)
      map: map reads and exit
      pairs: generate .pair file from map file
      postproc: generate .cool and .hic from .pair file
    -i index : bwa index
    -g genometable : genome table file (describing the chromosome length)
    -f genome file : fasta file of the reference genome (original data of the index files)
    -e enzyme : enzyme (HindIII|MboI|DpnII|Sau3AI, default: HindIII)
    -b build : genome build (hg19|hg38|mm10|mm39|rn7|galGal6|ce11|danRer11|dm6|xenLae2|sacCer3|S.pombe|HVAEP)
    -o outputdir : output directory (default: 'CustardPyResults_Hi-C')
    -q qvalue : threshould of mapped fragments (default: 30, for '--min-mapq' of pairtools parse)
    -p ncore : number of CPUs (default: 4)
    -x postfix
      1: '*_1.fastq.gz' and '*_2.fastq.gz' (default)
      2: '*_R1.fastq.gz' and '*_R2.fastq.gz'
    -m max_distance : 8.4 for human, 8.2 for mouse (for pairsqc.py, default: 8.4)
    -n binsize_min : binsize_min (for cooler cload pairix, default: 5000)
    -r binsize_multi : binsize_multi (for multirescool, default: '5000,10000,25000,50000,100000,500000,1000000,2500000,5000000,10000000')
  • Output

    • cool/ … directory for .cool files

    • hic/ … directory for the .hic file

    • log/ … log files of mapping

    • mapfile/ … directory of mapping file

    • pairs/ … .pairs file generated by pairtools

    • qc_report/ … directory for statistics and QC files

9.2. Micro-C analysis

9.2.1. custardpy_cooler_MicroC

custardpy_cooler_MicroC generates .cool and .hic files from FASTQ files using cooltools and JuicerTools. The input FASTQ files can be gzipped (.fastq.gz).

BWA and chromap can be used for mapping reads (use -t option). The results are stored in CustardPyResults_MicroC/Cooler_<bwa|chromap>/$prefix.

The index file of BWA or chromap (-i <index>) and the fasta file of the reference genome (-f <genome>) are required.

custardpy_cooler_MicroC [options] -i <index> -g <gt> -f <genome> <fastqdir> <odir>
  fastqdir: Directory that contains input fastq files (e.g., "fastq/")
  odir: Name of output directory

  Options:
    -S stage : steps to be executed [all|pairs|postproc] (default: all)
      all: execute all process (default)
      map: map reads and exit
      pairs: generate .pair file from map file
      postproc: generate .cool and .hic from .pair file
    -t [bwa|chromap] : tool for mapping (default: bwa)
    -i index : index of bwa or chromap
    -f genome file : fasta file of the reference genome (original data of the index files)
    -g genometable : genome table file (describing the chromosome length)
    -o outputdir : output directory (default: 'CustardPyResults_MicroC')
    -q qvalue : threshould of mapped fragments (default: 30, for '--min-mapq' of pairtools parse)
    -p ncore : number of CPUs (default: 4)
    -x postfix
      1: '*_1.fastq.gz' and '*_2.fastq.gz' (default)
      2: '*_R1.fastq.gz' and '*_R2.fastq.gz'
    -m max_distance : 8.4 for human, 8.2 for mouse (for pairsqc.py, default: 8.4)
    -n binsize_min : binsize_min (for cooler cload pairix, default: 5000)
    -r binsize_multi : binsize_multi (for multirescool, default: '5000,10000,25000,50000,100000,500000,1000000,2500000,5000000,10000000')
  • Output

    • cool/ … directory for .cool files

    • hic/ … directory for the .hic file

    • log/ … log files of mapping

    • mapfile/ … directory of mapping file

    • pairs/ … .pairs file generated by pairtools

    • qc_report/ … directory for statistics and QC files

9.3. Common commands for Hi-C and Micro-C

9.3.1. custardpy_process_hic

custardpy_process_hic takes a .hic file as input and executes juicer_callTAD.sh, call_HiCCUPS.sh, makeMatrix_intra.sh, makeEigen.sh, and makeInslationScore.sh.

custardpy_process_hic [Options] <hicfile> <odir>
  <hicfile>: .hic file genreated by Juicer
  <odir> : output directory
  Options:
      -g genometable : genome table file (describing the chromosome length)
      -n [NONE|VC|VC_SQRT|KR|SCALE] : normalization type (default: SCALE)
      -a <refFlat>: gene annotation file
      -r resolutions : resolutions for 1D metrics calculation (default: "25000 50000 100000", should be quoted and separated by spaces)
      -p ncore: number of CPUs (default: 32)
      -o: Use older version of juicer_tools.jar for old .hic files (juicer_tools.1.9.9_jcuda.0.8.jar, default: juicer_tools.1.22.01.jar)
  Example:
      custardpy_process_hic -g genometable.hg38.txt -a refFlat.hg38.txt Hap1-A/inter_30.hic Hap1-A

Note

custardpy_process_hic fails when .hic files created by the old version of juice_tools.jar are applied because they do not contain the string “chr” in the chromosome name. In such a case, supply -o option in custardpy_process_hic, which uses the old version of juicer_tools.jar (juicer_tools.1.9.9_jcuda.0.8.jar).

9.3.2. makeMatrix_intra.sh

makeMatrix_intra.sh takes a .hic file as input and generates the matrices of intra-chromosomal interactions for all chromsomes. The chormosome Y and M are omited.

makeMatrix_intra.sh <norm> <odir> <hic> <resolution> <gt>
  <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
  <odir>: output directory (e.g., "JuicerResults/sample1")
  <hic>: .hic file
  <resolution>: resolution of the matrix
  <gt>: genome table
  Options:
    -l: output contact matrix as a list (default: dense matrix)
    -o: Use older version of juicer_tools.jar for old .hic files (juicer_tools.1.9.9_jcuda.0.8.jar, default: juicer_tools.1.22.01.jar)

The resulting observed/oe matrices are output in <odir>/Matrix/intrachromosomal/<resolution>/.

9.3.3. makeMatrix_inter.sh

makeMatrix_inter.sh generates the inter-chromosomal interactions matrix for a specified chromsome pair.

makeMatrix_inter.sh [-l] <norm> <odir> <hic> <resolution> <chr1> <chr2>
   <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
   <odir>: output directory (e.g., "JuicerResults/sample1")
   <hic>: .hic file
   <resolution>: resolution of the matrix
   <chr1, chr2>: two input chromosomes
   Options:
     -l: output contact matrix as a list (default: dense matrix)
     -o: Use older version of juicer_tools.jar for old .hic files (juicer_tools.1.9.9_jcuda.0.8.jar, default: juicer_tools.1.22.01.jar)

The resulting observed/oe matrices are output in <odir>/Matrix/interchromosomal/<resolution>/<chr1>-<chr2>.

9.3.4. makeEigen.sh

makeEigen.sh generates eigenvector file (compartment PC1) from a .hic file using HiC1Dmetrics. The sign (+-) of the value indicating A/B compartments is adjusted by the number of genes.

makeEigen.sh [options] <norm> <odir> <hic> <resolution> <genometable> <refFlat>
  <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
  <odir>: output directory (e.g., "JuicerResults/sample1")
  <hic>: .hic file
  <resolution>: resolution of matrix
  <genometable>: genometable file
  <refFlat>: gene annotation file (refFlat format)
  Options:
    -p <int>: the number of CPUs (default: 6)

9.3.5. juicer_callTAD.sh

juicer_callTAD.sh calls TADs from a .hic file using Juicer ArrowHead.

juicer_callTAD.sh [options] <norm> <odir> <hic> <gt>
   <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
   <odir>: output directory (e.g., "JuicerResults/sample1")
   <hic>: .hic file
   <gt>: genome table
   Options:
     -r resolutions: the resolutions for ArrowHead (default: "10000 25000 50000", should be quoted and separated by spaces)
     -p ncore: number of CPUs (default: 24)
     -o: Use older version of juicer_tools.jar for old .hic files (juicer_tools.1.9.9_jcuda.0.8.jar, default: juicer_tools.1.22.01.jar)
  • Output:
    • \*_blocks.bedpe … TAD regions (BEDPE format, default output of Juicer ArrowHead)

    • \*_blocks.bed … TAD regions (BED format file converted from \*_blocks.bedpe)

    • \*_blocks.merged.bed … Non-overlapped TAD list (overlapped TAD are merged by bedtools merge)

    • \*_blocks.boundaries.bed … TAD boundaries (“inside” window of called TADs, including boundaries of nested TADs)

    • \*_blocks.TADcoverage.bed … Number of TADs that cover the genomic positions (for nested TAD analysis)

    • \*_blocks.TADregions.bed … List of intra-TAD regions (inside of TAD boundaries)

    • \*_blocks.nonTADregions.bed … List of regions that are not covered by any TADs

Note

Because Juicer ArrowHead allows “nested TADs” and “non-TAD regions”, not all genomic regions are included in TADs, and some amount of TAD boundaries may be included in a larger TADs. Make sure that the files you are using meet the criteria of your assumption.

9.3.6. makeInslationScore.sh

makeInslationScore.sh takes the observed matrices files generated by makeMatrix_intra.sh as input and calculates the insulation score for all chromsomes. The chormosome Y and M are omited.

The <odir> directory should be the same with that is specified in makeMatrix_intra.sh.

makeInslationScore.sh <norm> <odir> <resolution> <gt>
  <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
  <odir>: output directory (e.g., "JuicerResults/sample1")
  <resolution>: resolution of the matrix
  <gt>: genome table

The results are output in <odir>/InsulationScore/<norm>/<resolution>/.

9.3.7. call_HiCCUPS.sh (GPU required)

call_HiCCUPS.sh calls loops using Juicer HiCCUPS. Supply --gpus all for Docker and --nv option for Singularity to activate GPU as follows:

singularity exec --nv custardpy_juicer.sif call_HiCCUPS.sh
docker run --rm -it --gpus all rnakato/custardpy call_HiCCUPS.sh
call_HiCCUPS.sh <norm> <odir> <hic>
  <norm>: normalization type (NONE|VC|VC_SQRT|KR|SCALE)
  <odir>: output directory (e.g., "JuicerResults/sample1")
  <hic>: .hic file
  Options:
     -r resolutions: the resolutions (default: "5000,10000,25000", should be quoted and separated by comma)
     -o: Use older version of juicer_tools.jar for old .hic files (juicer_tools.1.9.9_jcuda.0.8.jar, default: juicer_tools.1.22.01.jar)
  • Output

    • merged_loops.simple.bedpe … loop file

9.3.8. call_MotifFinder.sh

If you have peak files of cohesin and CTCF, you can use MotifFinder by call_MotifFinder.sh:

call_MotifFinder.sh <build> <motifdir> <loop>
  <build>: genome build
  <motifdir>: the directory that contains the BED files
  <loop>: loop file (.bedpe) obtained by HiCCUPS

If the <build> is (hg19|hg38|mm9|mm10), this command automatically supplies FIMO motifs provided by Juicer.

  • Output

    • merged_loops_with_motifs.bedpe

See MotifFinder manual for more information.

Note

Because an error occurs in the latest version of juicertools, CustardPy uses juicertools version 1.9.9 for MotifFinder.

9.3.9. calculate_compartment_strength

calculate_compartment_strength calculates the compartment strength from Hi-C data using GENOVA.

calculate_compartment_strength <coolfile> <sample name>
  coolfile: Input Hi-C data (.cool format)
  sample name: Name of the sample (also used for the output file name)

<Example>
  calculate_compartment_strength Cooler_results/Control/coolfile/Control.25000.cool Control

The output file is [sample_name].GENOVA_compartment_score.txt containing the compartment strength, which is an average score for the chromosomes.

9.3.10. run_3DChromatin_ReplicateQC.sh

Since it is written in Python2.7, we use a virtual environment in the CustardPy docker image. run_3DChromatin_ReplicateQC.sh`` is a script to run it from the default command line. Replace 3DChromatin_ReplicateQC with run_3DChromatin_ReplicateQC.sh in the command line:

$ run_3DChromatin_ReplicateQC.sh -h
usage: 3DChromatin_ReplicateQC [-h]
                              {run_all,preprocess,qc,concordance,summary,cleanup}
                              ...

positional arguments:
  {run_all,preprocess,qc,concordance,summary,cleanup}
    run_all             Run all steps in the reproducibility/QC analysis with
                        this single command
    preprocess          (step 1) split files by chromosome
    qc                  (step 2.a) compute QC per sample
    concordance         (step 2.b) compute reproducibility of replicate pairs
    summary             (step 3) create html report of the results
    cleanup             (step 4) clean up files

optional arguments:
  -h, --help            show this help message and exit

9.4. Utility tools

9.4.1. distance_vs_count.Juicer

Count the genomic distance of read pairs in the input file (supposing align/merged_nodups.txt.gz in Juicer outputs) The output file can be used for the distance plot with plot_distance_count.R.

distance_vs_count.Juicer <file> <winsize> <MAPQ>
    <file>:    Input file  (merged_nodups.txt.gz)
    <winsize>: window size (default: 10000)
    <MAPQ>:    MAPQ threshold (default: 30)

Example:
    distance_vs_count.Juicer align/merged_nodups.txt.gz 50000 30

9.4.2. distance_vs_count.Juicer.log

Count the genomic distance of read pairs in the input file with the log-scale bins. The output file can be used for the distance plot with plot_distance_count.log.R.

distance_vs_count.Juicer.log                                                                                                     (base) Usage: distance_vs_count.Juicer.log <file> <MAPQ>
    <file>:    Input file  (merged_nodups.txt)
    <MAPQ>:    MAPQ threshold (default: 30)

Example:
    distance_vs_count.Juicer.log align/merged_nodups.txt.gz 30

9.4.3. convert_JuicerDump_to_dense.py

Convert interaction frequency file dumped by Juicertools to dense (two-dimensional) matrix.

convert_JuicerDump_to_dense.py <inputfile> <outputfile> <genometable> <chr> <resolution> [--help]

Example:
convert_JuicerDump_to_dense.py \
    Matrix.observed.VC_SQRT.chrX.txt \
    Matrix.observed.VC_SQRT.chrX.matrix.gz \
    genome_table.txt \
    chrX \
    25000