What is BandHiC?

../_images/bandhic_illustration.svg

Fig. 1 Data structure of BandHiC. Schematic illustration of converting a dense symmetric matrix \(A\) into a banded representation consisting of a data matrix \(D\), an element-wise mask matrix \(M\), a row/column mask matrix \(X\), and a default value \(d\) for out-of-band entries. Diagonal elements from \(A\) are reorganized into columns of \(D\); \(M\) marks missing or outlier entries; \(X\) indicates masked rows or columns.

Data structure of BandHiC

To address the growing memory demands posed by high-resolution Hi-C data, we introduce band_hic_matrix, the core class implemented in the BandHiC package. For a Hi-C contact matrix \(A \in \mathbb{R}^{n \times n}\) at resolution \(r\), band_hic_matrix retains only the diagonals within a user-defined bandwidth \(k\), yielding a compact representation \(D \in \mathbb{R}^{n \times k}\). Each column of \(D\) corresponds to a fixed diagonal of \(A\), so that the mapping \(A[i, j] = D[i, j - i]\) holds for \(0 \le j - i < k\). Because \(A\) is symmetric, entries below the main diagonal are obtained from \(A[j, i] = A[i, j]\).
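
The packing described above can be illustrated with plain NumPy. This is a minimal sketch, not the BandHiC implementation: the helper name `dense_to_band` is hypothetical, and zero padding of the unused slots at the bottom of \(D\) is an assumption made for the example.

```python
import numpy as np

def dense_to_band(A, k):
    """Pack the main diagonal and the first k-1 upper diagonals of A into D (n x k).

    Hypothetical helper for illustration; unused slots are zero-padded.
    """
    n = A.shape[0]
    D = np.zeros((n, k), dtype=A.dtype)
    for o in range(min(k, n)):
        # Diagonal o holds the n - o entries A[i, i + o]; store them in column o.
        D[: n - o, o] = np.diagonal(A, offset=o)
    return D

# Small symmetric example matrix.
A = np.array([[1, 2, 3, 4],
              [2, 5, 6, 7],
              [3, 6, 8, 9],
              [4, 7, 9, 10]])
D = dense_to_band(A, k=2)
print(D)
```

Column 0 of `D` holds the main diagonal of `A` and column 1 holds the first off-diagonal, so `D[i, j - i]` equals `A[i, j]` whenever \(0 \le j - i < k\).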

The memory efficiency achieved by this strategy is substantial. When \(k \ll n\), the memory footprint of band_hic_matrix is reduced from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(nk)\). For example, at a resolution of 1 kb and a bandwidth of 2 Mb (\(k = 2000\)), the representation of chromosome 1 of the human genome (~249 Mb, \(n \approx 249{,}000\)) requires about 3.7 GB of memory, less than 1% of the memory required by the dense matrix (~461.9 GB). This compression enables the use of high-resolution Hi-C data on commodity hardware without sacrificing random access efficiency.
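
The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes 8-byte (float64) entries and binary gigabytes (\(2^{30}\) bytes), which are assumptions that reproduce the quoted numbers, and it counts only the data matrix \(D\), not the masks. At 1 kb resolution, a 2 Mb bandwidth spans 2,000 bins and ~249 Mb spans about 249,000 bins.

```python
n = 249_000          # bins covering ~249 Mb at 1 kb resolution
k = 2_000            # bins covering a 2 Mb bandwidth at 1 kb resolution
bytes_per_entry = 8  # assumption: float64 storage

band_gib = n * k * bytes_per_entry / 2**30   # O(nk) banded storage
dense_gib = n * n * bytes_per_entry / 2**30  # O(n^2) dense storage
ratio = band_gib / dense_gib                 # equals k / n

print(f"band: {band_gib:.1f} GiB, dense: {dense_gib:.1f} GiB, ratio: {ratio:.2%}")
```

The ratio reduces to \(k / n\), so the saving depends only on how narrow the band is relative to the chromosome length in bins.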

To make the structure more flexible, band_hic_matrix integrates an optional two-tier masking mechanism. The element-wise mask matrix \(M \in \{0,1\}^{n \times k}\) allows users to selectively ignore missing or outlier contacts, enabling robust statistical estimation on the unmasked entries. Additionally, a bin-level mask \(X \in \{0,1\}^n\) supports the exclusion of entire rows or columns, which is particularly useful for removing repetitive genomic regions that lack a valid Hi-C signal. These masking features support downstream tasks such as estimating the average contact intensity at a given genomic distance while preserving statistical validity.
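
One way to picture how the two mask tiers combine is the NumPy sketch below. It is an illustration under stated assumptions, not the BandHiC API: the helper `effective_mask` is hypothetical, `True` is taken to mean "masked", and a masked bin is assumed to hide every contact in its row and its column. The combined mask is then passed to `numpy.ma` to average the unmasked contacts at each distance.

```python
import numpy as np

def effective_mask(M, X):
    """Combine element mask M (n x k) and bin mask X (n,) into one n x k boolean mask.

    Hypothetical helper; True means "masked".
    """
    n, k = M.shape
    rows = np.arange(n)[:, None]
    cols = rows + np.arange(k)[None, :]      # column index j = i + o in the dense matrix
    out_of_range = cols >= n                 # padding slots that fall outside A
    cols_clipped = np.minimum(cols, n - 1)   # keep the index valid for the lookup below
    return M | X[:, None] | X[cols_clipped] | out_of_range

# Banded data for a 5-bin toy matrix with bandwidth k = 2.
D = np.array([[1., 2.], [5., 6.], [8., 9.], [10., 11.], [12., 0.]])
M = np.zeros(D.shape, dtype=bool)
M[0, 1] = True                               # one outlier contact
X = np.array([False, False, True, False, False])  # bin 2 is excluded entirely

masked = np.ma.masked_array(D, mask=effective_mask(M, X))
mean_by_distance = masked.mean(axis=0)       # average over unmasked entries per diagonal
print(mean_by_distance)
```

Only entries that survive both masks contribute to each average, which is the behavior the paragraph above describes for distance-wise contact estimates.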

Lastly, a scalar default value \(d\) is defined to fill in the entries of \(A\) that are not covered by the band matrix \(D\). This default is typically set to 0, consistent with the assumption that long-range interactions are sparse and their counts negligible. It also ensures that the full matrix \(A\) can be reconstructed, if required, by combining \(D\), \(M\), \(X\), and \(d\). Overall, band_hic_matrix provides an efficient, flexible data structure for scalable Hi-C data analysis.
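
The role of the default value in reconstruction can be sketched as follows. This is a simplified NumPy illustration with a hypothetical helper `band_to_dense`; it ignores the masks \(M\) and \(X\) and only shows how \(D\) and \(d\) together determine a dense symmetric matrix.

```python
import numpy as np

def band_to_dense(D, d=0):
    """Rebuild a dense symmetric matrix from banded storage D (n x k).

    Hypothetical helper; out-of-band entries are filled with the default d.
    """
    n, k = D.shape
    A = np.full((n, n), d, dtype=D.dtype)
    for o in range(min(k, n)):
        idx = np.arange(n - o)
        A[idx, idx + o] = D[: n - o, o]   # upper-triangle diagonal o
        A[idx + o, idx] = D[: n - o, o]   # mirrored lower-triangle diagonal
    return A

D = np.array([[1, 2], [5, 6], [8, 9], [10, 0]])
A = band_to_dense(D, d=0)
print(A)
```

Entries farther than \(k - 1\) bins from the main diagonal come out equal to `d`, so information outside the band is not recovered.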

Functions of BandHiC

A key feature of band_hic_matrix is its direct coordinate mapping between the banded representation and the full dense matrix \(A\). For any pair of genomic loci \((i, j)\) within the band, \(0 \le j - i < k\) (with \(i > j\) handled by symmetry), the interaction frequency \(A[i, j]\) can be read directly from \(D[i, j - i]\). This gives random access in \(\mathcal{O}(1)\) time, which is critical for performance-sensitive Hi-C analyses, particularly when memory constraints prohibit the use of fully dense matrices. Data access through a band_hic_matrix object \(B\) is consistent with that of a dense matrix, since \(B[i, j] = D[i, j - i] = A[i, j]\) for in-band entries, so users can work with band_hic_matrix objects as if they were dense matrices without considering the underlying implementation.
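
The constant-time lookup amounts to a single index computation. The function below is a minimal sketch with a hypothetical name (`band_get`); it assumes valid bin indices, ignores masks, and returns the default \(d\) for pairs outside the band.

```python
import numpy as np

def band_get(D, i, j, d=0):
    """Return A[i, j] from banded storage D (n x k); d is the out-of-band default.

    Hypothetical helper for illustration; assumes 0 <= i, j < n.
    """
    if i > j:
        i, j = j, i              # symmetry: A[i, j] == A[j, i]
    o = j - i                    # genomic distance in bins = column of D
    return D[i, o] if o < D.shape[1] else d

D = np.array([[1, 2], [5, 6], [8, 9], [10, 0]])
print(band_get(D, 1, 2), band_get(D, 2, 1), band_get(D, 0, 3))
```

No search or hashing is involved, which is why the access cost does not depend on \(n\) or \(k\).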

Owing to this random-access capability, band_hic_matrix supports NumPy-like indexing semantics, including slicing, boolean indexing, and fancy indexing. This design allows users to easily query local chromatin contacts. For instance, a slice operation such as B[i:j, i:j] retrieves a banded submatrix. Combined with the todense operation, this enables reconstruction of the dense submatrix for downstream analysis or visualization.
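
To see why a diagonal-block slice stays banded, note that the block covering bins `start` to `stop - 1` only needs the matching rows of \(D\), with contacts that reach past the block cleared. The sketch below uses a hypothetical helper `band_slice` on a raw NumPy array; it is not the package's slicing code, and the zero fill value is an assumption.

```python
import numpy as np

def band_slice(D, start, stop, fill=0):
    """Banded storage of the diagonal block A[start:stop, start:stop].

    Hypothetical helper; contacts whose partner bin lies at or past `stop`
    are replaced by `fill`.
    """
    sub = D[start:stop].copy()
    m, k = sub.shape
    rows = np.arange(m)[:, None]
    offs = np.arange(k)[None, :]
    sub[rows + offs >= m] = fill   # partner bin falls outside the block
    return sub

D = np.array([[1, 2], [5, 6], [8, 9], [10, 0]])
sub = band_slice(D, 1, 3)
print(sub)
```

The result has the same layout as \(D\), so it can be expanded to a dense block in the same way as the full matrix.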

In addition to flexible data access, band_hic_matrix also supports a wide range of numerical operations, including element-wise arithmetic operations and reduction operations. These operations are implemented using NumPy for efficiency. Standard reduction operations such as sum, min, and max are supported along conventional axes (rows or columns), as well as along the diagonal axis, which is particularly useful in summarizing interaction intensities by genomic distance. This feature facilitates common Hi-C analyses such as distance-decay profiling.
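
The two kinds of reduction can be sketched on the raw banded array. The helpers `diagonal_sums` and `row_sums` below are hypothetical and ignore masks; they only illustrate that a reduction along the diagonal axis is a column reduction of \(D\), while a row reduction must also collect the mirrored lower-triangle entries.

```python
import numpy as np

def _valid(D):
    """Boolean n x k array marking slots of D that map to real entries of A."""
    n, k = D.shape
    return (np.arange(n)[:, None] + np.arange(k)[None, :]) < n

def diagonal_sums(D):
    """Sum of contacts at each genomic distance o = 0..k-1."""
    return np.where(_valid(D), D, 0).sum(axis=0)

def row_sums(D):
    """Row sums of the dense symmetric matrix, restricted to the band."""
    n, k = D.shape
    Dv = np.where(_valid(D), D, 0)
    total = Dv.sum(axis=1)              # entries A[i, i .. i+k-1]
    for o in range(1, min(k, n)):
        total[o:] += Dv[: n - o, o]     # mirrored entries A[i, i-o]
    return total

D = np.array([[1, 2], [5, 6], [8, 9], [10, 0]])
counts = _valid(D).sum(axis=0)          # number of entries on each diagonal
decay = diagonal_sums(D) / counts       # mean contact per distance
print(diagonal_sums(D), row_sums(D), decay)
```

Dividing each diagonal sum by its entry count gives a simple distance-decay profile of the kind mentioned above.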

Taken together, band_hic_matrix combines the memory efficiency of a banded storage model with the expressiveness of NumPy’s interface. By mimicking both NumPy’s ndarray and MaskedArray behaviors, it provides an intuitive and powerful interface for users, substantially lowering the barrier to adoption and promoting integration into existing Hi-C data analysis workflows.