4.12.1. Plot contact distance

The contact distance plot is often used to see the global difference in contact frequency at the different scales between samples. This notebook shows the script to display different types of contact frequency plots.

[1]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as sp
import seaborn as sns
import matplotlib.pyplot as plt
import os
[2]:
# Specify samples to plot
samplelist = ['Control', 'siCTCF', 'siRad21']

Here we use distance_vs_count.MAPQ30.txt in the distance directory.

(Note: This filename was changed in v1.9.0. If you are using CustardPy v1.8.0 or earlier, use distance_vs_count.10kb.MAPQ30.txt instead).

Create Jdata object containing the distribution data of all samples.

[3]:
dfarray = []
for sample in samplelist:
    path = f'CustardPyResults_Hi-C/Juicer_hg38/{sample}/distance/distance_vs_count.MAPQ30.txt'
    df = pd.read_csv(path, sep="|", header=None, index_col=0)
    df_normalized = df / df.sum()
    df_normalized.columns = [sample]
    dfarray.append(df_normalized)

Jdata = pd.concat(dfarray, axis=1)
Jdata.index = [n * 10000 for n in range(1, len(Jdata) + 1)]

Jdata = Jdata.T
[4]:
logJdata = np.log10(Jdata)
logJdata
/work3/miniconda3_py310/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351: RuntimeWarning: divide by zero encountered in log10
  result = func(self.values, **kwargs)
[4]:
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 ... 49530000 49540000 49550000 49560000 49570000 49580000 49590000 49600000 49610000 49620000
Control -0.421224 -1.149222 -1.330263 -1.451928 -1.548873 -1.631883 -1.706374 -1.774436 -1.836290 -1.894641 ... -8.504383 -8.504383 -8.504383 -8.027261 -7.902323 -8.027261 -7.726231 -inf -inf -8.203353
siCTCF -0.402073 -1.202064 -1.390722 -1.510106 -1.600896 -1.676076 -1.741246 -1.799517 -1.853175 -1.902432 ... -8.199094 -inf -8.500124 -inf -inf -8.023002 -8.199094 -8.023002 -8.023002 -8.199094
siRad21 -0.385472 -1.266376 -1.533371 -1.701334 -1.822281 -1.914071 -1.988940 -2.050098 -2.103532 -2.149091 ... -8.165282 -8.466312 -inf -inf -inf -inf -8.165282 -8.165282 NaN NaN

3 rows × 4962 columns

Plot a log-log graph of contact distance (x-axis) and contact frequency (y-axis).

[5]:
plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist):
    plt.plot(logJdata.T[sample], label=sample)

plt.xscale('log')
plt.xlim(1e+4,1e+8)
plt.ylim(-7,0)

plt.legend(loc="lower left")
[5]:
<matplotlib.legend.Legend at 0x7f2f04982740>
../_images/content_ContactDistancePlot_7_1.png

To compare the difference in perturbation between knockdown samples, it is useful to plot the contact frequency subtracted by the ‘control’ sample.

The obtained plot suggests that in siRad21, contact frequency decreases at the TAD scale and increases at the compartment scale.

[6]:
plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist[1:]):
    plt.plot(logJdata.T[sample] - logJdata.T['Control'], label=sample)

plt.xscale('log')
plt.xlim(1e+4,1e+8)
plt.ylim(-1,1)

plt.legend(loc="upper left")
[6]:
<matplotlib.legend.Legend at 0x7f2f0463a5f0>
../_images/content_ContactDistancePlot_9_1.png

4.12.1.1. log-scale bin

Since the contact frequency is very low at the long distance, it is also beneficial to plot the contact distance distribution with the bin sizes adjusted by the log scale (i.e., larger bin sizes at the longer distance). The distribution data is available in the distance_vs_count.10kb.MAPQ30.log.txt in the distance directory.

[9]:
dfarray = []
for sample in samplelist:
    path = f'CustardPyResults_Hi-C/Juicer_hg38/{sample}/distance/distance_vs_count.MAPQ30.log.txt'
    df = pd.read_csv(path, sep="|", header=None, index_col=0)
    df = df/df.sum()
    df.columns = [sample]
    dfarray.append(df)

Jdata = pd.concat(dfarray, axis=1)

def f_str(x):
    return int(str(x).split('-')[1])

Jdata.index =Jdata.index.map(f_str)
Jdata = Jdata.T
Jdata
[9]:
38 49 62 78 99 124 157 198 250 315 ... 31622775 39810716 50118722 63095733 79432822 99999999 125892540 158489318 199526230 251188642
Control 0.000015 0.000062 0.000204 0.000440 0.000714 0.001015 0.001770 0.003336 0.007306 0.016427 ... 0.013897 0.010962 0.008368 0.005970 0.004328 0.003171 0.002690 0.002140 0.001535 0.000750
siCTCF 0.000012 0.000055 0.000194 0.000408 0.000691 0.001037 0.001942 0.003988 0.009664 0.024375 ... 0.013793 0.011830 0.009795 0.007466 0.005428 0.003751 0.002933 0.002160 0.001504 0.000681
siRad21 0.000022 0.000069 0.000211 0.000436 0.000717 0.001024 0.001771 0.003270 0.006985 0.015747 ... 0.021520 0.019387 0.017040 0.014292 0.011054 0.007734 0.005633 0.003672 0.002135 0.000821

3 rows × 69 columns

The result looks like the difference between the samples at each scale is clearer.

[10]:
plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist):
    plt.plot(Jdata.T[sample], label=sample)

plt.xscale('log')
plt.xlim(1e+3,1e+8)
#plt.ylim(-1,1)

plt.legend(loc="upper left")
[10]:
<matplotlib.legend.Legend at 0x7f2f00dd3010>
../_images/content_ContactDistancePlot_13_1.png

If the number of samples is large (more than six), it would be difficult to distinguish each sample in the line chart. In this case, it is also good to show the distribution with the heatmap, as shown below.

[11]:
g = sns.clustermap(Jdata.iloc[:,14:70], cmap='jet', figsize=(6, 3), col_cluster=False)
plt.setp(g.ax_heatmap.get_yticklabels(), rotation=0)
[11]:
[None, None, None]
../_images/content_ContactDistancePlot_15_1.png