Bioinformatics and systems-level genomics

As we generate increasing amounts of genomic data, we have to expand our repertoire of analytical tools. We are actively developing and applying novel analytical pipelines to deal with increasingly complex study designs that include multiple cell and tissue types, longitudinal sampling and multi-omic data.

Large Multi-Cohort Epigenome-Wide Association Studies

It is likely that most epigenetic differences associated with complex disease are small in magnitude, so adequately powered studies will require large sample sizes. We are implementing novel meta-analysis approaches for EWAS data to maximise our chances of identifying robust associations between the epigenome and disease or development. Our recent EWAS meta-analysis of schizophrenia, for example, has identified many differentially methylated positions significantly associated with disease (see Figure). We are currently involved in EWAS meta-analyses of many other phenotypes including cognitive ability, Alzheimer’s disease, and ALS.
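To illustrate why pooling cohorts helps with small effects, here is a minimal sketch of an inverse-variance weighted fixed-effect meta-analysis for a single methylation site. It is a generic textbook approach, not necessarily the method used in the studies described above; the function name and the example effect sizes are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Combine per-cohort effect sizes for one DNA methylation site.

    Each cohort is weighted by the inverse of its sampling variance, so
    larger, more precise cohorts contribute more to the pooled estimate.
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect size")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return pooled_beta, pooled_se, z, p_value


# Hypothetical effect sizes and standard errors from three cohorts
beta, se, z, p = fixed_effect_meta([0.010, 0.014, 0.008], [0.004, 0.006, 0.005])
print(f"pooled beta={beta:.4f}, SE={se:.4f}, z={z:.2f}, p={p:.2e}")
```

The pooled standard error is smaller than that of any single cohort, which is what gives a meta-analysis its extra power to detect small differences.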

Identifying Molecular Quantitative Trait Loci (QTLs)

DNA sequence variation is widely associated with both epigenetic and transcriptomic variation. We are characterising the relationship between genetic variation and multiple epigenetic markers including DNA methylation, DNA hydroxymethylation and various histone modifications. We have generated mQTL datasets on multiple tissues including fetal/adult brain and blood and are using these to interpret findings from GWAS. Importantly, we have found that mQTLs can be both tissue- and developmental-stage-specific (see Figure). We are involved in coordinating the Genetics of DNA Methylation Consortium (GoDMC), which aims to bring together datasets with both genetic and DNA methylation data in more than 20,000 samples.
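At its simplest, an mQTL test asks whether methylation at one site changes with the number of copies of an allele at one variant. The sketch below fits an ordinary least-squares regression of methylation on allele dosage (0, 1 or 2). It is a simplified illustration: it omits the covariates (for example age, sex, cell composition and ancestry) that a real analysis would include, and the function name and data are hypothetical.

```python
import math


def mqtl_association(dosages, methylation):
    """Regress methylation level on allele dosage for one SNP-CpG pair.

    Returns the slope (methylation change per allele) and its standard error.
    """
    n = len(dosages)
    if n != len(methylation) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(methylation) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, methylation))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum(
        (y - intercept - slope * x) ** 2 for x, y in zip(dosages, methylation)
    )
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope_se


# Hypothetical data: six individuals, methylation as a proportion
slope, slope_se = mqtl_association(
    [0, 1, 2, 0, 1, 2], [0.21, 0.29, 0.41, 0.19, 0.31, 0.39]
)
print(f"effect per allele={slope:.3f}, SE={slope_se:.4f}")
```

Running the same test separately in each tissue or developmental stage is one simple way to see whether an effect is specific to one of them.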

Using Molecular QTLs to Refine Genetic Association Signals

There has been major progress in the identification of genetic variants influencing a diverse range of complex human phenotypes. The challenge is now to improve our understanding of the biological effects of these genetic risk factors, especially because the actual genes involved in mediating phenotypic variation are not necessarily the most proximal to the lead SNPs identified in GWAS. We know that GWAS variants are preferentially located in enhancers and regions of open chromatin, and the majority of common genetic risk factors are predicted to influence gene regulation rather than directly affect the coding sequences of transcribed proteins. We are developing methods to integrate genetic and epigenetic data to interpret findings from genetic studies of disease. For example, we have used methods such as Bayesian colocalisation and Summary data-based Mendelian Randomization (SMR) to refine genetic association signals and prioritise loci for future investigation.
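The core calculation in SMR can be sketched from summary statistics alone. Using one variant as an instrument, the effect of the molecular trait on the phenotype is estimated as the ratio of the GWAS effect to the QTL effect, and the published SMR test statistic is approximated from the two z-scores. The sketch below assumes both sets of effects refer to the same variant and the same effect allele; the function name and input values are hypothetical, and it does not include the follow-up heterogeneity (HEIDI) test used to distinguish pleiotropy from linkage.

```python
import math


def smr_test(b_gwas, se_gwas, b_qtl, se_qtl):
    """Approximate SMR estimate and test for a single instrument variant.

    b_gwas / se_gwas: variant effect on the phenotype (from GWAS)
    b_qtl / se_qtl:   variant effect on the molecular trait (e.g. an mQTL)
    """
    if b_qtl == 0:
        raise ValueError("the instrument has no effect on the molecular trait")
    z_gwas = b_gwas / se_gwas
    z_qtl = b_qtl / se_qtl
    # Effect of the molecular trait on the phenotype (ratio estimate)
    b_xy = b_gwas / b_qtl
    # Test statistic, approximately chi-squared with 1 degree of freedom
    t_smr = (z_gwas ** 2 * z_qtl ** 2) / (z_gwas ** 2 + z_qtl ** 2)
    p_value = math.erfc(math.sqrt(t_smr / 2.0))  # upper tail of chi-squared(1)
    return b_xy, t_smr, p_value


# Hypothetical summary statistics for one variant
b_xy, t_smr, p = smr_test(b_gwas=0.1, se_gwas=0.02, b_qtl=0.5, se_qtl=0.05)
print(f"b_xy={b_xy:.2f}, T_SMR={t_smr:.1f}, p={p:.2e}")
```

Because the statistic combines both z-scores, a locus only scores highly when the variant is strongly associated with both the molecular trait and the phenotype.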

Molecular Variation Associated With High Genetic Burden For Complex Disease

Most complex diseases are polygenic, meaning multiple genes and genetic variants contribute to an individual’s risk. This risk can be estimated with a polygenic risk score (PRS) calculated from genetic variants identified in large GWAS. Because PRS-associated epigenetic variation is potentially less affected by factors associated with the disease itself (e.g., medication exposure, stress, and smoking), which can confound case–control analyses, we are exploring how the PRS can be used as an outcome in epigenetic studies. For example, we have shown how elevated polygenic burden for schizophrenia and autism is associated with variable DNA methylation.
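In its basic form, a PRS is a weighted sum: each risk allele an individual carries is multiplied by its GWAS effect size and the products are added up. The sketch below shows that calculation for one individual. The variant identifiers and weights are made up, and the steps that matter in practice (choosing a p-value threshold, accounting for linkage disequilibrium, standardising the score) are deliberately left out.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: variant ID -> number of effect alleles carried (0, 1 or 2)
    weights: variant ID -> GWAS effect size for that allele (e.g. log odds ratio)
    Variants missing from either input are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[variant] * dosages[variant] for variant in shared)


# Hypothetical genotypes and GWAS weights
genotypes = {"rs1": 2, "rs2": 1, "rs3": 0}
gwas_weights = {"rs1": 0.10, "rs2": -0.05, "rs4": 0.30}
score = polygenic_risk_score(genotypes, gwas_weights)
print(f"PRS = {score:.2f}")
```

The resulting score is a single continuous variable per individual, which is what allows it to be tested against DNA methylation at each site.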

Inferring Causality Using Regulatory Genomic Data

Regulatory genomic variation associated with disease may represent a consequence of pathology rather than part of the causal process. Where epigenetic variation is proposed as a mediator between an exposure and an outcome we are adopting statistical methodologies (such as two-step Mendelian randomisation) to enable us to separate causal relationships from reverse causation or confounding.

Exploring Inter-individual Regulatory Genomic Variation Across Tissues

How informative are epigenetic studies in peripheral tissues such as blood for diseases affecting more inaccessible tissues such as the brain? It is well established that epigenetic marks differ between cell types and tissues, but the extent to which inter-individual variation is correlated across tissues is not known. We are undertaking a series of analyses using multiple tissues from individual donors to explore the extent to which methylomic variation in blood is predictive of inter-individual variation identified in different regions of the brain. Our data suggest that for the majority of the genome, a blood-based EWAS for disorders where brain is presumed to be the primary tissue of interest will give limited information relating to underlying pathological processes. These results do not, however, discount the utility of using a blood-based EWAS to identify biomarkers of disease.
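One simple way to quantify this question for a single site is to correlate methylation in blood with methylation in brain across the same donors. The sketch below computes a Pearson correlation for each site; it is an illustration under that assumption rather than a description of the analyses above, and the site names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # no variation between donors in one tissue
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical methylation values: site -> (blood, brain), same donor order
matched_samples = {
    "site_A": ([0.10, 0.20, 0.30, 0.40], [0.15, 0.22, 0.33, 0.41]),
    "site_B": ([0.80, 0.82, 0.79, 0.81], [0.40, 0.35, 0.42, 0.44]),
}
for site, (blood, brain) in matched_samples.items():
    print(site, round(pearson_r(blood, brain), 2))
```

A site can differ greatly in average methylation between tissues and still be highly correlated across donors, so this measure addresses a different question from tissue-specific methylation levels.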

Systems Biology Approaches

We are utilising systems biology approaches to model the epigenome as a network. By connecting genomic loci where variation co-varies we can explore the high-level organisation of gene regulation, identify coordinated effects on particular biological processes and facilitate the functional interpretation of associations with disease.

Inferring Cellular Composition From Epigenetic Profiles

Profiles generated from bulk tissue are potentially confounded by differences in cellular composition, so analyses require additional covariates to control for any bias between cases and controls. In tandem with our cell sorting work, we are generating reference datasets of isolated cell types which can be used with deconvolution algorithms to estimate the cellular proportions present in epigenetic profiles generated from bulk tissue, including both whole blood and brain.
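The idea behind reference-based deconvolution can be shown with the simplest possible case: a bulk sample modelled as a mixture of just two cell types. The sketch below estimates the proportion of the first cell type by least squares and constrains it to lie between 0 and 1. Real deconvolution algorithms handle many cell types at once and select informative sites carefully, so this is only a toy version; the function name and reference values are hypothetical.

```python
def estimate_proportion(bulk, reference_a, reference_b):
    """Estimate the fraction of cell type A in a two-cell-type mixture.

    Model: bulk[i] = p * reference_a[i] + (1 - p) * reference_b[i]
    for each methylation site i, solved for p by least squares.
    """
    if not (len(bulk) == len(reference_a) == len(reference_b)):
        raise ValueError("all profiles must cover the same sites")
    difference = [a - b for a, b in zip(reference_a, reference_b)]
    denominator = sum(d * d for d in difference)
    if denominator == 0:
        raise ValueError("reference profiles are identical at these sites")
    numerator = sum(
        (m - b) * d for m, b, d in zip(bulk, reference_b, difference)
    )
    proportion = numerator / denominator
    return min(1.0, max(0.0, proportion))  # keep the estimate in [0, 1]


# Hypothetical reference profiles for two sorted cell types at three sites
cell_type_a = [0.9, 0.1, 0.5]
cell_type_b = [0.1, 0.8, 0.5]
bulk_profile = [0.34, 0.59, 0.50]  # constructed as 30% type A, 70% type B
print(round(estimate_proportion(bulk_profile, cell_type_a, cell_type_b), 2))
```

The estimated proportions can then be included as covariates in an association analysis. Note that the third site, where the two references are identical, contributes nothing to the estimate, which is why reference datasets focus on sites that differ between cell types.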

Keeping Up With Biotechnology And Reproducible Workflows

As we introduce new technologies into our experiments (for example, the characterisation of additional modifications, long-read sequencing approaches, and single-cell analyses), we need to continually develop new pipelines to process these new types of data. This includes implementing new quality control checks and data normalisation prior to analysis. We are active users of version control and use online software development platforms (e.g. GitHub) to share our code and facilitate the reproducibility of our research.