Serpentine binning¶
This tutorial aims at demonstrating use cases and for improving Hi-C contact maps with distribution-aware binning, helping readers reproduce the steps in our papers and documenting readers with the implementation.
Loading the library¶
You can directly use the library’s functions, if you are already manipulating numpy contact maps in your analysis. First, you need to import the library with the following:
[28]:
%matplotlib notebook
[29]:
import numpy as np
from matplotlib import pyplot as plt
import serpentine as sp
After, you need to load your datasets with numpy, we provide a couple of demo datasets in the form of tables, corresponding to Yeast chromosome 7 in two different mutants, in the repository:
[30]:
# Load Yeast data
A = np.loadtxt('../../demos/A.csv')
B = np.loadtxt('../../demos/B.csv')
At this point you are working with raw Hi-C data, serpentine provides convenient functions to visualize your material:
The first dataset:¶
[31]:
fig = plt.figure()
sp.mshow(A)
[31]:
<matplotlib.image.AxesImage at 0x7fea6aad2320>
The second dataset:¶
[32]:
fig = plt.figure()
sp.mshow(B)
[32]:
<matplotlib.image.AxesImage at 0x7fea6ac87cc0>
Filtering of the data¶
The raw data needs to be filtered in order to clean the unmappable rows and columns, this kind of artifacts shows up in the distribution of reads per bin as outliers:
[33]:
plt.figure()
norm = np.log10(np.sum(A + B, axis=0)[np.sum(A + B, axis=0) > 0])
norm = norm[np.isnan(norm) == False]
norm = norm[np.isinf(np.abs(norm)) == False]
plt.hist(norm, bins=50)
plt.axvline(x=np.median(norm), color='g')
plt.axvline(x=np.median(norm) - 3 * 1.4826 * sp.mad(norm), color='r')
plt.axvline(x=np.median(norm) + 3 * 1.4826 * sp.mad(norm), color='r')
[33]:
<matplotlib.lines.Line2D at 0x7fea6a852c50>
with the serpentine library this is easily done achievable using the two included functions
[34]:
flt = sp.outstanding_filter(A) + sp.outstanding_filter(B)
flt = flt == False
A = sp.fltmatr(A, flt)
B = sp.fltmatr(B, flt)
resulting in:
[35]:
fig = plt.figure()
ax1 = fig.add_subplot(1, 2, 1); sp.mshow(A, subplot=ax1)
ax2 = fig.add_subplot(1, 2, 2); sp.mshow(B, subplot=ax2)
[35]:
<matplotlib.image.AxesImage at 0x7fea6a796780>
at this point, additional processing can be done before proceeding, such as iterative normalizations, removing speckles and other such operations.
Finding the binning threshold and the de-trending constant¶
The coverage of the data will impact how much binning is needed. On top of that, when comparing matrices with different coverages, one needs to find the so-called trending constant that need to be subtracted from the result. In order to do this, our library provides a tool in the form of an mean-difference (MD) plot. This graph suggests that the data has a characteristic noise-to-signal ratio at large coverages that becomes much larger at lower coverages due to sampling noise.
The function MDbefore finds the optimal trending and threshold values, the graph higlights the median and the median absolute deviation (MAD) as red and green lines. Both are plotted as a function of the mean contact number:
[36]:
# Find the de-trending and threshold
plt.figure()
trend, threshold = sp.MDbefore(A, B, ylim=[-4, 4])
print(trend, threshold)
0.5109957290996449 49.999999999999986
Serpenting binning the data¶
Finally you can use the function to bin the data. The function takes two parameters: a threshold that constrains the coverage of the bin in at least one matrix, and the minthreshold that constrain it in both. The function uses multiple processors and can be configured by the optional parameters:
[37]:
sA, sB, sK = sp.serpentin_binning(A, B, threshold, threshold / 5)
Starting 10 binning processes in batches of 4...
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
1 Total serpentines: 28227 (69.1770414665229 %)
1 Total serpentines: 28188 (69.08146260170571 %)
1 Total serpentines: 28216 (69.1500833251642 %)
1 Total serpentines: 28315 (69.39270659739242 %)
2 Total serpentines: 12232 (29.977453190863642 %)
2 Total serpentines: 12248 (30.016665032839917 %)
2 Total serpentines: 12181 (29.85246544456426 %)
2 Total serpentines: 12288 (30.11469463778061 %)
3 Total serpentines: 7527 (18.446720909714735 %)
3 Total serpentines: 7475 (18.319282423291835 %)
3 Total serpentines: 7522 (18.434467209097146 %)
4 Total serpentines: 6530 (16.003333006567985 %)
4 Total serpentines: 6534 (16.013135967062052 %)
4 Total serpentines: 6533 (16.010685226938534 %)
3 Total serpentines: 7527 (18.446720909714735 %)
5 Total serpentines: 6459 (15.829330457798255 %)
5 Over: 2018-10-31 13:57:24.042602
5 Total serpentines: 6454 (15.817076757180669 %)
5 Total serpentines: 6450 (15.8072737966866 %)
6 Total serpentines: 6452 (15.812175276933633 %)
6 Over: 2018-10-31 13:57:24.099372
4 Total serpentines: 6522 (15.983727085579845 %)
6 Total serpentines: 6450 (15.8072737966866 %)
6 Over: 2018-10-31 13:57:24.135747
5 Total serpentines: 6447 (15.799921576316047 %)
6 Total serpentines: 6447 (15.799921576316047 %)
6 Over: 2018-10-31 13:57:24.195925
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
1 Total serpentines: 28225 (69.17213998627585 %)
1 Total serpentines: 28226 (69.17459072639937 %)
1 Total serpentines: 28246 (69.22360552886971 %)
1 Total serpentines: 28194 (69.09616704244682 %)
2 Total serpentines: 12231 (29.975002450740124 %)
3 Total serpentines: 7515 (18.417312028232526 %)
2 Total serpentines: 12236 (29.98725615135771 %)
2 Total serpentines: 12225 (29.96029800999902 %)
4 Total serpentines: 6509 (15.95186746397412 %)
5 Total serpentines: 6426 (15.748456033722183 %)
5 Over: 2018-10-31 13:57:24.859513
3 Total serpentines: 7526 (18.444270169591217 %)
2 Total serpentines: 12245 (30.009312812469364 %)
3 Total serpentines: 7542 (18.483482011567492 %)
4 Total serpentines: 6525 (15.991079305950397 %)
4 Total serpentines: 6522 (15.983727085579845 %)
5 Total serpentines: 6453 (15.81462601705715 %)
5 Total serpentines: 6444 (15.792569355945496 %)
3 Total serpentines: 7550 (18.503087932555633 %)
6 Total serpentines: 6453 (15.81462601705715 %)
6 Over: 2018-10-31 13:57:25.030012
6 Total serpentines: 6444 (15.792569355945496 %)
6 Over: 2018-10-31 13:57:25.035744
4 Total serpentines: 6525 (15.991079305950397 %)
5 Total serpentines: 6448 (15.802372316439564 %)
6 Total serpentines: 6447 (15.799921576316047 %)
6 Over: 2018-10-31 13:57:25.130998
0 Total serpentines: 40804 (100.0 %)
0 Total serpentines: 40804 (100.0 %)
1 Total serpentines: 28313 (69.38780511714538 %)
1 Total serpentines: 28130 (68.93931967454171 %)
2 Total serpentines: 12176 (29.84021174394667 %)
3 Total serpentines: 7532 (18.45897461033232 %)
4 Total serpentines: 6509 (15.95186746397412 %)
2 Total serpentines: 12253 (30.028918733457505 %)
5 Total serpentines: 6452 (15.812175276933633 %)
6 Total serpentines: 6450 (15.8072737966866 %)
6 Over: 2018-10-31 13:57:25.767884
3 Total serpentines: 7567 (18.544750514655426 %)
4 Total serpentines: 6581 (16.128320752867367 %)
5 Total serpentines: 6493 (15.912655621997843 %)
6 Total serpentines: 6490 (15.905303401627291 %)
6 Over: 2018-10-31 13:57:25.921796
Checking the results¶
The quality of the binning can be verified with an MD plot. This time, use the MDafter function: if the process was successful you would expect the effect of sampling to be reduced by binning, and an almost constant signal-to-noise value at all coverage values, similar to the one at large coverage:
[38]:
plt.figure()
sp.MDafter(sA, sB, sK, ylim=[-4, 4])
[38]:
(0.5107324222723643, 49.999999999999986)
Matrices have been rebinned, and the characteristic sampling noise present at small coverages has now been smoothed, while the crisp signal at large coverages that conveys the precious biological variations is preserved:
[39]:
fig = plt.figure();
ax1 = fig.add_subplot(1, 2, 1)
sp.mshow(sA, subplot=ax1)
ax2 = fig.add_subplot(1, 2, 2)
sp.mshow(sB, subplot=ax2)
[39]:
<matplotlib.image.AxesImage at 0x7fea6adc91d0>
Checking the differential analysis¶
Similarly, we improved the differential analysis, before the binning, we could have obtained this kind of results:
Before binning:¶
[40]:
plt.figure()
np.warnings.filterwarnings('ignore')
D = np.log2(B/A)
sp.dshow(D, trend)
Now,
After binning:¶
[41]:
plt.figure()
sp.dshow(sK, trend)
[27]:
from scipy.ndimage import gaussian_filter
sA, sB, sK = sp.serpentin_binning(A, B, threshold, threshold / 5)
[ ]:
[ ]: