High-throughput functional analysis of lncRNA core promoters elucidates rules governing tissue-specificity

Kaia Mattioli, Pieter-Jan Volders, Chiara Gerhardinger, James C. Lee, Philipp G. Maass, Marta Mele, John L. Rinn

Preprint posted on December 04, 2018

What core promoter sequence features are important for gene activity? Mattioli et al. uncover an important role of overlapping motifs

Selected by Clarice Hong

Categories: genomics, systems biology


Transcription initiates from mRNA promoters, long non-coding RNA (lncRNA) promoters and enhancers (which produce enhancer RNAs, or eRNAs), yet each of these classes of genomic sequence has a very different expression profile. Specifically, lncRNAs and eRNAs are less active and more tissue-specific than mRNAs. These different expression patterns must be encoded by the genomic sequence itself, but it remains unclear which sequence features determine them. Furthermore, a subclass of transcribed sequences, known as ‘divergent’ promoters, produces two stable transcripts, one in the sense and one in the antisense direction. Whether a ‘divergent’ transcript pair is produced by one promoter with unique sequence features or by two proximal promoters remains unknown. Thus, to understand the sequence features underlying different promoter types, the authors used massively parallel reporter assays (MPRAs) to measure the intrinsic transcriptional activity of hundreds of promoters and enhancers in different cell types.

Key findings

The authors first grouped the genomic sequences that initiate transcription into 5 categories: eRNAs, intergenic lncRNAs (lincRNAs), divergent lncRNAs, mRNAs and divergent mRNAs. They then selected high-confidence transcription start sites (TSSs) for each category from 3 different cell lines (K562, HepG2 and HeLa) and designed sequences covering the core promoter to test for transcriptional activity. In the MPRA, each core promoter is linked to a unique barcode sequence that is transcribed. The activity of each promoter is then calculated as the RNA barcode counts divided by the DNA input barcode counts. Using this method, the authors found that both divergent mRNA and divergent lncRNA promoters tended to be more active than their non-divergent counterparts, suggesting that divergent promoters are intrinsically stronger than non-divergent promoters. Furthermore, at least part of the tissue-specificity of expression appears to be encoded in the core promoter sequence itself, since the MPRA was able to recapitulate tissue-specific expression. Thus, the core promoter sequence alone can explain some of the differences between the different classes of promoters.
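The RNA/DNA calculation described above can be sketched in a few lines. This is only an illustration and not the authors' pipeline: the function name `promoter_activity`, the dictionary inputs and the choice to summarise several barcodes per promoter by their median are all assumptions made here for the example.

```python
from collections import defaultdict
from statistics import median


def promoter_activity(rna_counts, dna_counts, barcode_to_promoter):
    """Return one activity value per promoter.

    Activity per barcode is RNA count / DNA input count. When a promoter
    carries several barcodes, the median ratio is reported (an assumption
    of this sketch, not a detail taken from the preprint).
    """
    ratios = defaultdict(list)
    for barcode, promoter in barcode_to_promoter.items():
        dna = dna_counts.get(barcode, 0)
        if dna == 0:
            # A barcode absent from the DNA input cannot be normalised.
            continue
        rna = rna_counts.get(barcode, 0)
        ratios[promoter].append(rna / dna)
    return {promoter: median(values) for promoter, values in ratios.items()}


# Hypothetical toy counts, for illustration only.
toy_map = {"AAA": "promoter_1", "CCC": "promoter_1", "GGG": "promoter_2"}
toy_rna = {"AAA": 20, "CCC": 40, "GGG": 5}
toy_dna = {"AAA": 10, "CCC": 10, "GGG": 10}
print(promoter_activity(toy_rna, toy_dna, toy_map))
```

Real MPRA analyses typically also normalise for sequencing depth and work on a log scale; those steps are left out here to keep the ratio itself visible.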

To determine the sequence features that discriminate between different promoters, the authors looked at two main features: the transcription factor (TF) motif architecture (the suite of TFs that bind to the sequence) and the cell-type-specificity of the TFs that bind to the core promoter. TF motif architecture was further subdivided into two parts: the number of independent binding sites in the sequence and the number of overlapping motifs. Using these three features, they fit a linear model to the MPRA data to see which feature contributes the most to core promoter activity. They found that while the number of binding sites and the number of overlapping motifs (both under TF motif architecture) could explain some of the variation, the cell-type-specificity of the TFs contributed almost nothing to core promoter activity. This suggests that the strength of a core promoter depends on its TF motif architecture, but that this architecture alone is not sufficient, since each of the two features explains less than 20% of the variation.
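To make "variance explained" concrete, here is a minimal sketch that fits a single-predictor least-squares line for each feature and reports its R². It is a simplification: the preprint's model and data are not reproduced, the function name `variance_explained` is hypothetical, and the toy numbers are randomly generated purely so the code has something to run on.

```python
import random


def variance_explained(feature, activity):
    """R^2 of the least-squares line activity ~ intercept + slope * feature."""
    n = len(activity)
    mean_x = sum(feature) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in feature)
    syy = sum((y - mean_y) ** 2 for y in activity)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(feature, activity))
    return 1.0 - ss_res / syy


# Simulated toy data (NOT from the preprint): activity depends weakly on the
# two architecture features and not at all on TF cell-type-specificity.
rng = random.Random(0)
n = 300
n_sites = [rng.randint(0, 10) for _ in range(n)]
n_overlapping = [rng.randint(0, 6) for _ in range(n)]
tf_specificity = [rng.random() for _ in range(n)]
activity = [0.3 * s + 0.4 * o + rng.gauss(0, 2)
            for s, o in zip(n_sites, n_overlapping)]

for name, column in [("binding sites", n_sites),
                     ("overlapping motifs", n_overlapping),
                     ("TF specificity", tf_specificity)]:
    print(f"{name}: R^2 = {variance_explained(column, activity):.3f}")
```

Fitting each feature separately keeps the example dependency-free; a joint model with all three predictors would need a multiple-regression routine from a numerical library.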

Using the same metrics, the authors then looked at publicly available CAGE data (which measures the activity of each TSS in the genome) and found that a higher number of overlapping TF motifs is correlated with higher core promoter activity and lower tissue-specificity. They thus hypothesised that disruptions in overlapping motifs would have a larger effect size than disruptions in individual motifs, since they are likely to have more severe consequences for promoter activity. To test this, they designed a second library of core promoter sequences from 21 disease-associated genes and 5 nearby lncRNAs and eRNAs, with single nucleotide deletions spanning the core promoter. Indeed, the effect size of each deletion is somewhat correlated with the number of motifs it is predicted to disrupt, suggesting that overlapping TF motifs are indeed predictive of stronger promoter activity. This was also true for disease-associated single nucleotide polymorphisms (SNPs), as SNPs in overlapping motifs led to larger expression changes. From these results, the authors concluded that overlapping binding sites for different TFs allow a core promoter to drive ubiquitous expression across cell types and to maintain high expression (Figure 1).
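The quantity compared against deletion effect size, the number of motifs a deletion is predicted to disrupt, can be illustrated with a simple interval count. This is a sketch under stated assumptions: motif sites are given as hypothetical `(tf_name, start, end)` tuples with 0-based, half-open coordinates, and a deletion is taken to disrupt every motif whose interval contains the deleted position. How the authors actually predicted motifs is not described here.

```python
def motifs_disrupted(position, motif_sites):
    """Count motif sites whose [start, end) interval covers the position."""
    return sum(1 for _tf, start, end in motif_sites if start <= position < end)


def deletion_scan(sequence_length, motif_sites):
    """Predicted number of disrupted motifs for each single-base deletion."""
    return [motifs_disrupted(pos, motif_sites)
            for pos in range(sequence_length)]


# Hypothetical motif predictions on a 30 bp toy sequence.
toy_sites = [("TF_A", 2, 8), ("TF_B", 5, 12), ("TF_C", 20, 26)]
scan = deletion_scan(30, toy_sites)
for pos, count in enumerate(scan):
    if count > 1:
        print(f"position {pos}: deletion hits {count} overlapping motifs")
```

With a per-position count like this in hand, one could then rank-correlate it against the measured effect size of each deletion.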

Figure 1: Summary of gene expression regulation by core promoters (Figure 5 from preprint). High and ubiquitous expression is associated with more overlapping TF motifs, while low and tissue-specific expression is associated with fewer TF motifs.

What I liked

As a student trying to understand the regulation of gene expression, the question of what sequence features of core promoters determine their activity is very interesting to me. This is especially exciting since we found out that so much more of the genome than we expected is transcribed. Since different groups of genes clearly have very different expression patterns, we need to find the rules governing these patterns. In this preprint, the authors took this one step further, and used some of the rules they learnt (overlapping TF motifs) to identify and determine the function of known SNPs in core promoters, which will be very useful for the understanding of non-coding disease variants. Furthermore, the MPRA is a powerful technique used to assay the activity of many DNA sequences, so I like that MPRAs are being used for this purpose. This also provides a great tool for the further study of TF binding sites and how variants affect TF binding and expression.

Future directions and questions

The biggest question that I have is what else is causing the differences in expression level and tissue-specificity, since the features tested left at least half of the variance unexplained. Can we consider other sequence features, for example, the shape of the DNA? The specific combinations of TF motifs might also be important, since low-affinity binding sites that are not usually picked up by motif finders can be used in the genome in combination with the right partners. Furthermore, are there any sequence features that might lead to a divergent vs non-divergent promoter? It also appears that the same rules used to explain the difference between categories of promoters can be applied within each group of promoters, which suggests that features like TF motif architecture may not distinguish between the different promoter categories, but simply discriminate between high and low expression and between ubiquitous and tissue-specific expression. This raises the question of whether lncRNA promoters, mRNA promoters and even eRNA-producing enhancers are categorically different, or whether they are simply transcribed according to the same rules to produce transcripts of different functions.


Posted on: 29th January 2019, updated on: 30th January 2019
