Machine and Deep Learning Reveal Sequence Determinants Encoding Bivalent Histone Modifications

Genome-wide profiling of histone modifications in mouse embryonic stem cells

To systematically examine the regulatory patterns of univalent and bivalent histone modifications, we compared the signal intensities and chromosomal distribution preferences of different histone marks. Among them, H3K4me3 displayed globally higher enrichment levels in mESCs than the two repressive modifications, H3K27me3 and H3K9me3. These marks were relatively evenly distributed across autosomes, whereas fewer peaks were observed on sex chromosomes, suggesting that the establishment of these modifications is largely stable across the genome (Fig. 1A). To further dissect the combinatorial patterns of H3K4me3 with either H3K27me3 or H3K9me3, we classified the peaks into monovalent, bivalent, and trivalent categories. The majority of peaks were marked by a single histone modification, whereas co-modified regions (those carrying two or more marks) constituted a smaller fraction of the total (Fig. 1B). Notably, regions simultaneously marked by H3K4me3, H3K27me3, and H3K9me3 (trivalent domains) accounted for approximately 1% of H3K4me3 peaks, 8% of H3K27me3 peaks, and 3% of H3K9me3 peaks.
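As a rough illustration of the peak classification described above, the following sketch labels each peak of one mark by how many marks cover it. It assumes peaks are half-open (chrom, start, end) intervals and that any overlap of at least 1 bp counts as co-occupancy; the function names, the overlap rule, and the toy coordinates are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def classify_peaks(reference, others):
    """Label each reference peak as monovalent, bivalent, or trivalent.

    reference: list of (chrom, start, end) peaks for one mark.
    others: dict mapping the names of the two other marks to their peak lists.
    """
    names = {1: "monovalent", 2: "bivalent", 3: "trivalent"}
    labels = []
    for peak in reference:
        # Count the reference mark itself plus every other mark that overlaps it.
        n_marks = 1 + sum(
            any(overlaps(peak, q) for q in peaks) for peaks in others.values()
        )
        labels.append(names[n_marks])
    return labels

# Toy coordinates, for illustration only.
h3k4me3 = [("chr1", 100, 600), ("chr1", 2000, 2500), ("chr2", 50, 400)]
h3k27me3 = [("chr1", 500, 900), ("chr2", 300, 800)]
h3k9me3 = [("chr2", 350, 700)]

labels = classify_peaks(h3k4me3, {"H3K27me3": h3k27me3, "H3K9me3": h3k9me3})
counts = Counter(labels)
```

The quadratic scan is adequate for a toy example; genome-scale peak sets would normally be handled with a sorted sweep or an interval-tree tool.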

Fig. 1: Genome-wide distribution of H3K4me3, H3K27me3, and H3K9me3 in mESCs.

A Circos plot showing the H3K4me3, H3K27me3, and H3K9me3 peak signals across all chromosomes. B Venn diagram illustrating the genomic overlap among H3K4me3, H3K27me3, and H3K9me3 peaks. The observed three-way overlap is significantly higher than expected under a random genomic null model (mean ± SD = 2.7 ± 1.7; empirical p < 0.01), indicating a non-random co-occurrence of histone modification patterns. C Distribution of peak widths for monovalent (H3K4me3-only, H3K27me3-only, H3K9me3-only) and bivalent modifications. D GC content distribution of genomic regions associated with different histone modifications. E PhastCons conservation scores for monovalent and bivalent peaks, reflecting evolutionary constraint. F Radar plot showing the enrichment score (log ratio of observed/random) of H3K4me3, H3K27me3, and H3K9me3 peaks across various genomic elements, including promoters, coding sequences (CDS), introns, intergenic regions, and major transposable element (TE) classes (LINE, LTR, and SINE). G Genomic annotation of H3K4me3-only, H3K27me3-only, and bivalent peaks across functional features (e.g., promoters, exons, introns, downstream regions). H Proportional overlap of histone modification peaks with major TE classes, including LINE, SINE, LTR, DNA, and satellite elements. I TE family composition (e.g., L1, B2, ERVK, and Alu) associated with H3K4me3-only, H3K9me3-only, and bivalent peaks.

We next investigated the underlying DNA sequence characteristics of monovalent versus bivalent peaks, focusing on peak width, GC content, and conservation. Peak length analysis revealed that most regions spanned 200 to 1000 bp, with bivalent peaks being slightly narrower than monovalent ones (Fig. 1C). Furthermore, bivalent peaks exhibited significantly higher GC content and conservation scores, suggesting that these co-modified regions are preferentially located in conserved genomic elements (Fig. 1D, E).
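The GC content comparison can be sketched as follows. This is a minimal example that assumes GC content is the fraction of G and C among unambiguous bases, with N and other ambiguity codes excluded from the denominator; that convention and the function names are assumptions, not details given in the text.

```python
def gc_content(seq):
    """Fraction of G/C among A, C, G, T bases; ambiguous bases are ignored."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous bases to count
    return (seq.count("G") + seq.count("C")) / counted

def in_width_range(start, end, min_bp=200, max_bp=1000):
    """True if a peak width falls in the 200 to 1000 bp range reported above."""
    return min_bp <= end - start <= max_bp

# Toy sequences, for illustration only.
example_gc = gc_content("GGCCAATT")
example_keep = in_width_range(1000, 1600)
```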

Bivalent domains are often found at promoter regions, as exemplified by the overlap of H3K4me3 and H3K27me3 enrichment at gene promoters (Fig. 1F, G). In contrast, H3K9me3 showed predominant enrichment within long terminal repeat (LTR) and long interspersed nuclear element (LINE) repetitive elements (Fig. 1F), consistent with its established role in retrotransposon silencing²⁹. We also evaluated the enrichment of repetitive DNA elements within H3K9me3/H3K4me3 bivalent domains. The results showed that H3K9me3-marked bivalent regions were significantly enriched in repetitive elements in mESCs, with H3K9me3/H3K4me3 domains particularly enriched for LINE/L1 elements (Fig. 1H, I).

Bivalent modifications coordinately regulate gene expression

To elucidate the effects of distinct histone modification (HM) patterns on gene expression, we quantitatively assessed the signal intensities of H3K4me3, H3K27me3, and H3K9me3 at trivalently marked genes and correlated these with the corresponding transcriptional levels. Our analysis revealed that elevated H3K4me3 signals are strongly associated with high gene expression, whereas enrichment of H3K27me3 or H3K9me3 correlates with reduced expression. These findings suggest that within co-marked genomic regions, H3K4me3 likely functions as the principal activation mark, whereas H3K27me3 and H3K9me3 appear to contribute predominantly to repressive regulatory outcomes (Fig. 2A).

Fig. 2: Gene expression patterns associated with bivalent marks.

We further analyzed the expression patterns of genes associated with the two types of bivalent modifications (H3K4me3/H3K27me3 and H3K4me3/H3K9me3) and performed functional enrichment analyses. Genes marked by H3K4me3/H3K27me3 bivalency exhibited relatively low expression levels, falling between those marked by H3K4me3-only and H3K27me3-only regions but skewed closer to the H3K27me3-only profile, indicating a general tendency toward transcriptional repression (Fig. 2B). We further categorized bivalent genes into high, medium, and low expression groups and observed a positive relationship between bivalent mark intensity and gene expression, with highly expressed genes exhibiting the strongest bivalent signals (Fig. 2C). Consistently, a genome-wide correlation analysis revealed a weak but statistically significant positive association between bivalent signal intensity and gene expression (Spearman’s ρ = 0.186, p = 9.33 × 10⁻³⁴; Supplementary Fig. S1), in agreement with previous studies³⁰,³¹. KEGG pathway enrichment analysis of highly expressed bivalent genes revealed a significant association with the “breast cancer” pathway, implying that some bivalent genes may contribute to oncogenesis. For example, the bivalent gene POU4F1 has been shown to be essential for promoting and maintaining basal-like breast cancer (BLBC), making it a potential therapeutic target³². Bivalent genes are also enriched in most of the signaling pathways related to pluripotency maintenance and development, such as the signaling pathways regulating pluripotency of stem cells and the TGF-beta, Wnt, MAPK, FoxO, and Hippo signaling pathways³³ (Fig. 2D).
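For readers unfamiliar with the rank correlation reported above, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, with tied values sharing their mean rank. The function names and the toy vectors are hypothetical, and the p-value calculation is omitted; in practice a statistics library would be used.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy signal and expression values, for illustration only.
signal = [1.0, 2.0, 3.0, 4.0]
expression = [10.0, 20.0, 30.0, 40.0]
rho = spearman(signal, expression)
```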

Next, we attempted to define bivalent genes with different signal intensities by overlapping H3K4me3 and H3K9me3 chromatin immunoprecipitation sequencing (ChIP-seq) peaks. The overlapping regions were classified into four groups based on signal strength: High-High, Mid-High, High-Mid, and Mid-Mid. These groups exhibited a linear distribution pattern, suggesting a certain degree of co-regulation between H3K4me3 and H3K9me3 (Fig. 2E). By integrating RNA-seq data, we found that genes located in regions with both high H3K4me3 and high H3K9me3 signals (High-High) exhibited relatively low expression levels (Fig. 2F). Bivalent domains are known to undergo dynamic transitions during cell fate specification. For instance, changes in bivalent marks are implicated in the differentiation of neural stem cells³⁴. Gene Ontology (GO) analysis of H3K4me3/H3K9me3-associated genes revealed enrichment in neurodevelopmental processes such as synapse assembly, a critical step during ESC differentiation into mature and functional neurons (Fig. 2G).

Interpretable model of bivalent histone modifications driven by sequence features

In embryonic stem cells, bivalent domains marked by H3K4me3 and H3K27me3 are preferentially enriched at promoters containing CpG islands³ and exhibit strong associations with specific DNA sequence features, such as CpG-rich elements and transcription factor binding sites³⁵. A previous study identified an enrichment of the TCCCC sequence motif at bivalent promoters in both mouse and human ESCs³⁶. Nevertheless, the sequence determinants that encode or direct the formation of bivalent histone modifications remain largely elusive. To explore whether bivalent chromatin regions harbor distinctive and predictive sequence features, we employed five supervised machine learning algorithms as well as deep learning models to classify bivalent versus monovalent regions. Specifically, overlapping bivalent peaks (200-1000 bp in length) were selected for sequence feature extraction. For machine learning models, we used the widely adopted k-mer approach, which has proven effective in capturing short sequence patterns associated with chromatin state²¹. Insights from previous research were leveraged to guide the selection of the parameter k, and a range of k values from 3 to 8 was adopted in this study¹⁹,³⁷,³⁸. For deep learning models, sequences were encoded using one-hot encoding, enabling the models to automatically learn latent sequence patterns (Fig. 3).
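The two sequence representations mentioned above can be sketched as follows. This is an illustrative example only: it assumes raw overlapping k-mer counts on the forward strand, an A, C, G, T channel order for one-hot encoding, and all-zero vectors for ambiguous bases. None of these conventions, nor the function names, are specified in the text.

```python
from itertools import product

def kmer_counts(seq, k):
    """Count overlapping k-mers over the full ACGT vocabulary (4**k features).

    Windows containing a non-ACGT character are skipped.
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts

def one_hot(seq):
    """Encode a sequence as a list of 4-tuples in A, C, G, T order."""
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    # Ambiguous bases such as N become all-zero vectors (an assumption).
    return [table.get(base, (0, 0, 0, 0)) for base in seq.upper()]

# Toy sequence, for illustration only.
features = kmer_counts("ACGTAC", 3)
encoded = one_hot("ACGTN")
```

With k = 6 the vocabulary has 4096 features, which helps explain why a feature selection step precedes model training.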

Fig. 3: Schematic representation of the interpretable model for bivalent modifications.

We first built a three-class classification model to distinguish among H3K4me3-only, H3K9me3-only, and H3K4me3/H3K9me3 bivalent regions. Based on a comparative analysis between Lasso regression and Random Forest for feature selection, we adopted Random Forest, which identified 797 informative 6-mer features (Supplementary Fig. S2). Evaluation across different k-mer sizes revealed that all models performed best at k = 6 (Fig. 4A), consistent with prior machine learning studies on genomic sequence prediction¹⁸,²⁰,³⁹,⁴⁰. Among all models tested, the Support Vector Machine (SVM) yielded the highest classification performance, as indicated by ROC curves and overall accuracy (Fig. 4B, C). The confusion matrix of the SVM model showed strong classification ability, particularly for H3K4me3-only regions (Fig. 4D). To exclude the possibility that the observed predictive performance arose from random correlations, we further evaluated a randomized null model in which sequence features were shuffled while class labels were preserved. Under this control, all models exhibited chance-level performance (AUROC ≈ 0.5), in contrast to the clear diagonal enrichment observed when using real genomic sequences (Supplementary Fig. S5), indicating that the predictive signals captured by the models are genuinely encoded in DNA sequence features. Using a similar strategy, we further distinguished H3K4me3/H3K27me3 bivalent regions from monovalent ones. Lasso regression was used to select 616 relevant sequence features. Optimal model parameters were determined via grid search with 5-fold cross-validation. Among all tested classifiers, the XGBoost-6mer model achieved the best predictive performance (Supplementary Fig. S3). Interestingly, H3K4me3/H3K9me3 bivalent regions were classified more accurately than H3K4me3/H3K27me3 regions, indicating that they may harbor more distinct sequence determinants.
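The logic of the randomized null control can be illustrated with a deliberately small sketch. It is not the study's procedure: a nearest-centroid classifier stands in for the SVM and XGBoost models, the "shuffle" is assumed to mean reassigning whole feature rows to samples at random while labels stay in place, and the data are synthetic. All names are hypothetical.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroids(X, y):
    """Compute one centroid per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in groups.items()}

def predict(centroids, row):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def holdout_accuracy(X, y):
    """Train on even-indexed samples and score on odd-indexed samples."""
    train = range(0, len(X), 2)
    test = range(1, len(X), 2)
    model = fit_centroids([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict(model, X[i]) == y[i] for i in test)
    return hits / len(test)

def shuffle_features(X, rng):
    """Reassign feature rows to samples at random; labels are left untouched."""
    shuffled = list(X)
    rng.shuffle(shuffled)
    return shuffled

# Synthetic, well-separated two-class data, for illustration only.
X = [[5.0 + 0.1 * i, 0.0] for i in range(6)] + [[0.0, 5.0 + 0.1 * i] for i in range(6)]
y = ["bivalent"] * 6 + ["monovalent"] * 6

real_accuracy = holdout_accuracy(X, y)
X_null = shuffle_features(X, random.Random(0))
null_accuracy = holdout_accuracy(X_null, y)
```

On real data the null accuracy would be averaged over many shuffles; a single shuffle of twelve samples is too noisy to interpret and only demonstrates the mechanics.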

Fig. 4: Interpretable modeling of bivalent histone modifications based on DNA sequence features.

Given the large number of sequence features identified by machine learning, model interpretation remains challenging due to feature complexity and potential redundancy. To address this, we performed HOMER enrichment analysis to prioritize core sequence motifs with potential biological relevance (Supplementary Fig. S6B). For the H3K4me3/H3K9me3 bivalent modifications, four key motifs were significantly enriched. We retrained the classification models using only these core features and observed that all models maintained robust predictive performance, with AUC values reaching approximately 0.87. Among them, the Random Forest (RF) model consistently outperformed the others, suggesting that these core sequence motifs are sufficient to capture the essential information for chromatin state classification (Fig. 4E). SHAP (SHapley Additive exPlanations) analysis revealed that the motif “TCTGAA” exhibited the highest positive contribution to the prediction of bivalent regions, indicating its strong association with bivalent chromatin states. In contrast, the remaining three motifs predominantly contributed to the classification of H3K9me3-only regions. These findings suggest functional divergence among sequence motifs, where distinct short DNA sequences preferentially mark different histone modification states (Fig. 4F).

Similarly, five enriched motifs were identified for H3K4me3/H3K27me3 bivalent domains. As expected, performance was slightly reduced compared with models trained on the full feature set, yet the simplified model provides improved interpretability and highlights biologically relevant sequence patterns (Supplementary Fig. S4). Based on SHAP values, the motifs TCACAG, TTCAAA, and AGGGCT exhibited the highest feature importance, serving as key indicators for distinguishing H3K4me3/H3K27me3 bivalent regions from their monovalent counterparts (Supplementary Fig. S4C). Together, these results highlight that specific short DNA sequence motifs not only carry predictive power for chromatin state classification but also provide mechanistic insights into the sequence determinants underlying bivalent histone modifications.

To gain deeper insights into the sequence determinants regulating bivalent chromatin regions, we built deep learning models to predict H3K4me3/H3K9me3 bivalency. We systematically evaluated the effect of different input sequence lengths (200 bp, 400 bp, 600 bp, 800 bp, and 1000 bp) on model performance. Overall, most models exhibited improved AUROC values with increasing sequence length. Specifically, the DanQ, CNN+Attention, and CNN+Transformer models achieved optimal performance when trained on sequences of 600 bp or 1000 bp (Fig. 4G), suggesting that these lengths are well suited to this binary classification task. Among all evaluated models, the CNN+Attention architecture consistently demonstrated the best predictive performance, followed by the DanQ and CNN+Transformer models, based on comprehensive evaluation metrics (Fig. 4H, I). These results suggest that deep learning models are capable of capturing higher-order and context-dependent sequence features associated with bivalent chromatin, with interpretable architectures further facilitating motif discovery.

Analysis of sequence regulatory grammar underlying bivalent histone modifications

To elucidate the regulatory grammar of distinct bivalent histone modifications, we performed a comparative analysis of sequence features associated with H3K4me3/H3K9me3 and H3K4me3/H3K27me3 regions. Because histone modification levels have been reported to be closely related to the overall GC content of the corresponding region⁴¹, we categorized the identified motifs into three groups based on their GC content: GC-rich, AT-rich, and neutral motifs. Most motifs were GC-rich, whereas a smaller proportion were AT-rich (Fig. 5A). Previous studies have implied a role for particular DNA sequences (or motifs) in setting the boundaries of histone modification domains or in opening chromatin to allow remodeling enzymes to bind DNA⁴². We next examined the positional distribution of the selected motifs within bivalent sequences. Each bivalent region was divided into ten equal bins, with the outermost bins representing the edges and the central bins representing the center. We observed that motifs associated with H3K4me3/H3K27me3 bivalency were predominantly enriched at the edges of bivalent regions, whereas those linked to H3K4me3/H3K9me3 bivalency were mostly classified as neutral. These results indicate that the two types of bivalency exhibit distinct motif positional preferences within bivalent domains (Fig. 5B; Supplementary Fig. S6A).
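The ten-bin positional analysis can be sketched as below. This minimal example assumes exact string matching on the forward strand only, assigns each occurrence to the bin containing the motif's midpoint, and treats the first and last bins as the edges. These choices and the function names are illustrative assumptions rather than the study's documented procedure.

```python
def motif_bin_counts(seq, motif, n_bins=10):
    """Count motif occurrences per positional bin along a region."""
    seq, motif = seq.upper(), motif.upper()
    counts = [0] * n_bins
    start = seq.find(motif)
    while start != -1:
        # Assign the occurrence to the bin that contains the motif midpoint.
        midpoint = start + len(motif) / 2
        bin_index = min(int(midpoint * n_bins / len(seq)), n_bins - 1)
        counts[bin_index] += 1
        start = seq.find(motif, start + 1)  # allows overlapping matches
    return counts

def edge_fraction(counts):
    """Share of occurrences that fall in the two outermost bins."""
    total = sum(counts)
    return (counts[0] + counts[-1]) / total if total else 0.0

# Synthetic 100 bp region with the motif at both ends, for illustration only.
region = "TCACAG" + "A" * 88 + "TCACAG"
bins = motif_bin_counts(region, "TCACAG")
edges = edge_fraction(bins)
```

Scaling positions by region length puts peaks of different widths on a common axis before their bin counts are pooled.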

Fig. 5: Sequence-based regulatory features associated with bivalent histone modifications.

Next, to further investigate potential transcription factors (TFs) associated with these sequence features, we used the TomTom tool to compare all identified motifs with a mouse motif database. Pluripotency-related transcription factors, including STAT3, ESRRB, KLF4, OCT4, SOX2, and SMAD1, were significantly enriched in both types of bivalent regions (Supplementary Fig. S6), suggesting shared regulatory programs in embryonic stem cells. Notably, the key motif TCTGAA that distinguishes H3K4me3/H3K9me3 bivalent regions was primarily matched to the TF TCFCP2l1 (Fig. 5E). These findings suggest that the DNA sequences co-occupied by H3K4me3/H3K27me3 or H3K4me3/H3K9me3 are potentially regulated by pluripotency factors in embryonic stem cells. We further mapped the identified sequence features of H3K4me3/H3K9me3 bivalent modifications to specific peak regions, annotated the corresponding genes, and identified four bivalent genes containing all four key motifs (Fig. 5C, D). These peaks exhibited intermediate ChIP-seq signal intensities between H3K4me3-only and H3K9me3-only regions (Supplementary Fig. S6E), supporting the notion of a distinct chromatin state.

To interpret the learned sequence patterns, we applied a sequence alignment-based interpretability method (SABM) to H3K4me3/H3K9me3-labeled regions. This method identifies subsequences that activate convolutional neurons and stacks them into position weight matrices (PWMs), analogous to traditional estimation methods⁴³. We used information content (IC) to evaluate each model’s ability to identify functional sequence motifs, with the DanQ model achieving the highest IC values, indicating stronger motif definition in this subtype (Fig. 5F). Using the TomTom tool, we identified the transcription factors associated with the motifs learned by the different models. These transcription factors are capable of binding not only to the promoters of pluripotency-related genes in embryonic stem cells but also to their enhancer regions, thereby maintaining the pluripotent state (Fig. 5G). Subsequently, we applied the FIMO tool to locate the exact positions of each motif (motif sites) and annotated the corresponding bivalent genes. KEGG pathway enrichment analysis of these bivalent genes revealed significant enrichment in the neuroactive ligand-receptor interaction pathway, which is closely associated with nervous system development and stem cell differentiation (Fig. 5H).
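To make the PWM and information content steps concrete, the sketch below stacks equal-length aligned subsequences into a position frequency matrix and sums the per-position information content in bits. It assumes a uniform background of 0.25 per base and no pseudocounts; the exact IC definition used in the study is not stated, so this is one common formulation, with hypothetical function names.

```python
import math

def pwm_from_sequences(seqs):
    """Stack equal-length aligned subsequences into per-position A/C/G/T frequencies."""
    length = len(seqs[0])
    pwm = []
    for i in range(length):
        column = [s[i].upper() for s in seqs]
        pwm.append([column.count(base) / len(seqs) for base in "ACGT"])
    return pwm

def information_content(pwm, background=0.25):
    """Total information content in bits relative to a uniform background.

    Zero-frequency entries contribute nothing (the limit of p * log2(p) is 0).
    """
    total = 0.0
    for column in pwm:
        for p in column:
            if p > 0:
                total += p * math.log2(p / background)
    return total

# Toy aligned subsequences, for illustration only.
pwm = pwm_from_sequences(["AC", "AG"])
ic_bits = information_content(pwm)
```

A fully conserved position contributes 2 bits and a uniform position contributes 0, so higher totals correspond to more sharply defined motifs.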

In summary, while H3K4me3/H3K9me3 and H3K4me3/H3K27me3 bivalent regions share common pluripotency-related regulatory factors, they differ in base composition, motif positioning, TF specificity, and associated biological pathways. These findings suggest that distinct sequence grammars underlie the two classes of bivalent chromatin, reflecting their divergent regulatory roles in embryonic stem cells.
