4.12.1. Plot contact distance¶

The contact distance plot is often used to see the global difference in contact frequency at the different scales between samples. This notebook shows the script to display different types of contact frequency plots.

[1]:

%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as sp
import seaborn as sns
import matplotlib.pyplot as plt
import os

[2]:

# Specify samples to plot
samplelist = ['Control', 'siCTCF', 'siRad21']

Here we use distance_vs_count.MAPQ30.txt in the distance directory.

(Note: This filename was changed in v1.9.0. If you are using CustardPy v1.8.0 or earlier, use distance_vs_count.10kb.MAPQ30.txt instead).

Create Jdata object containing the distribution data of all samples.

[3]:

dfarray = []
for sample in samplelist:
    path = f'CustardPyResults_Hi-C/Juicer_hg38/{sample}/distance/distance_vs_count.MAPQ30.txt'
    df = pd.read_csv(path, sep="|", header=None, index_col=0)
    df_normalized = df / df.sum()
    df_normalized.columns = [sample]
    dfarray.append(df_normalized)

Jdata = pd.concat(dfarray, axis=1)
Jdata.index = [n * 10000 for n in range(1, len(Jdata) + 1)]

Jdata = Jdata.T

[4]:

logJdata = np.log10(Jdata)
logJdata

/work3/miniconda3_py310/lib/python3.10/site-packages/pandas/core/internals/blocks.py:351: RuntimeWarning: divide by zero encountered in log10
  result = func(self.values, **kwargs)

[4]:

	10000	20000	30000	40000	50000	60000	70000	80000	90000	100000	...	49530000	49540000	49550000	49560000	49570000	49580000	49590000	49600000	49610000	49620000
Control	-0.421224	-1.149222	-1.330263	-1.451928	-1.548873	-1.631883	-1.706374	-1.774436	-1.836290	-1.894641	...	-8.504383	-8.504383	-8.504383	-8.027261	-7.902323	-8.027261	-7.726231	-inf	-inf	-8.203353
siCTCF	-0.402073	-1.202064	-1.390722	-1.510106	-1.600896	-1.676076	-1.741246	-1.799517	-1.853175	-1.902432	...	-8.199094	-inf	-8.500124	-inf	-inf	-8.023002	-8.199094	-8.023002	-8.023002	-8.199094
siRad21	-0.385472	-1.266376	-1.533371	-1.701334	-1.822281	-1.914071	-1.988940	-2.050098	-2.103532	-2.149091	...	-8.165282	-8.466312	-inf	-inf	-inf	-inf	-8.165282	-8.165282	NaN	NaN

3 rows × 4962 columns

Plot a log-log graph of contact distance (x-axis) and contact frequency (y-axis).

[5]:

plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist):
    plt.plot(logJdata.T[sample], label=sample)

plt.xscale('log')
plt.xlim(1e+4,1e+8)
plt.ylim(-7,0)

plt.legend(loc="lower left")

[5]:

<matplotlib.legend.Legend at 0x7f2f04982740>

../_images/content_ContactDistancePlot_7_1.png

To compare the difference in perturbation between knockdown samples, it is useful to plot the contact frequency subtracted by the ‘control’ sample.

The obtained plot suggests that in siRad21, contact frequency decreases at the TAD scale and increases at the compartment scale.

[6]:

plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist[1:]):
    plt.plot(logJdata.T[sample] - logJdata.T['Control'], label=sample)

plt.xscale('log')
plt.xlim(1e+4,1e+8)
plt.ylim(-1,1)

plt.legend(loc="upper left")

[6]:

<matplotlib.legend.Legend at 0x7f2f0463a5f0>

../_images/content_ContactDistancePlot_9_1.png

4.12.1.1. log-scale bin¶

Since the contact frequency is very low at the long distance, it is also beneficial to plot the contact distance distribution with the bin sizes adjusted by the log scale (i.e., larger bin sizes at the longer distance). The distribution data is available in the distance_vs_count.10kb.MAPQ30.log.txt in the distance directory.

[9]:

dfarray = []
for sample in samplelist:
    path = f'CustardPyResults_Hi-C/Juicer_hg38/{sample}/distance/distance_vs_count.MAPQ30.log.txt'
    df = pd.read_csv(path, sep="|", header=None, index_col=0)
    df = df/df.sum()
    df.columns = [sample]
    dfarray.append(df)

Jdata = pd.concat(dfarray, axis=1)

def f_str(x):
    return int(str(x).split('-')[1])

Jdata.index =Jdata.index.map(f_str)
Jdata = Jdata.T
Jdata

[9]:

	38	49	62	78	99	124	157	198	250	315	...	31622775	39810716	50118722	63095733	79432822	99999999	125892540	158489318	199526230	251188642
Control	0.000015	0.000062	0.000204	0.000440	0.000714	0.001015	0.001770	0.003336	0.007306	0.016427	...	0.013897	0.010962	0.008368	0.005970	0.004328	0.003171	0.002690	0.002140	0.001535	0.000750
siCTCF	0.000012	0.000055	0.000194	0.000408	0.000691	0.001037	0.001942	0.003988	0.009664	0.024375	...	0.013793	0.011830	0.009795	0.007466	0.005428	0.003751	0.002933	0.002160	0.001504	0.000681
siRad21	0.000022	0.000069	0.000211	0.000436	0.000717	0.001024	0.001771	0.003270	0.006985	0.015747	...	0.021520	0.019387	0.017040	0.014292	0.011054	0.007734	0.005633	0.003672	0.002135	0.000821

3 rows × 69 columns

The result looks like the difference between the samples at each scale is clearer.

[10]:

plt.figure(figsize=(5, 5))
for i, sample in enumerate(samplelist):
    plt.plot(Jdata.T[sample], label=sample)

plt.xscale('log')
plt.xlim(1e+3,1e+8)
#plt.ylim(-1,1)

plt.legend(loc="upper left")

[10]:

<matplotlib.legend.Legend at 0x7f2f00dd3010>

../_images/content_ContactDistancePlot_13_1.png

If the number of samples is large (more than six), it would be difficult to distinguish each sample in the line chart. In this case, it is also good to show the distribution with the heatmap, as shown below.

[11]:

g = sns.clustermap(Jdata.iloc[:,14:70], cmap='jet', figsize=(6, 3), col_cluster=False)
plt.setp(g.ax_heatmap.get_yticklabels(), rotation=0)

[11]:

[None, None, None]

../_images/content_ContactDistancePlot_15_1.png

4.12.1. Plot contact distance¶

4.12.1.1. log-scale bin¶

Table of Contents

Previous topic

Next topic

This Page