2018 research projects

Last Modified on December 31, 2018

Current research

My work spans a range of subjects at the interface of genetics, bioinformatics and statistics. I focus on tackling “heavy-lifting” computational problems, that is, problems intricate enough to require the development and application of new statistical approaches, or involving large-scale data that demands sophisticated computational infrastructure. On the scientific side, I am interested in understanding how the genome works by integrating genetic and phenotypic information at various levels, in particular DNA and RNA data from different cell types, tissues, and disease studies. On the technical side, I enjoy writing statistical software and bioinformatics pipelines and putting them into action to make new discoveries in real-world problems.

Statistical modeling for RNA-seq and GWAS data

Variation in the DNA sequence of human genes can “directly” cause diseases ranging from severe rare genetic disorders such as cystic fibrosis to common complex health problems such as coronary artery disease, diabetes or cancer. Variation outside genes can increase or decrease gene expression levels, an important potential underlying cause of other types of variability, disease included. My work aims to understand both the properties of such variation and its consequences, in particular its role in gene regulation and disease etiology.

I work on a statistical method for variable selection in regression and genetic fine-mapping (with Matthew Stephens, Abhishek Sarkar and Peter Carbonetto). Fine-mapping is analogous to feature selection in high-dimensional data analysis: we jointly consider numerous variables that may “cause” an outcome, and aim to identify which ones do. However, unlike generic variable selection problems, which typically aim at making predictions with minimal error without regard to whether the selected features are directly causal, feature selection in genetics focuses on identifying the “causal” set in order to guide follow-up studies of the underlying biological mechanisms. Even though a number of Bayesian methods have been specifically designed for fine-mapping over the past decade, fine-mapping of genetic associations continues to pose major statistical and computational challenges. We developed a Bayesian feature selection method particularly well suited to settings where variables are highly correlated and true effects are very sparse. When applied to fine-mapping problems the method is orders of magnitude faster than many existing methods, and gives genetic mapping results at finer resolution.
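To make the setting concrete, here is a generic sketch of the regression model behind fine-mapping (the notation is schematic, not the exact specification used in our method):

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \qquad \mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}), $$

where $\mathbf{y}$ contains a phenotype measured on $n$ individuals, $\mathbf{X}$ is an $n \times p$ matrix of genotypes at $p$ tightly linked variants, and $\boldsymbol{\beta}$ is assumed sparse, with only a few non-zero “causal” effects. Fine-mapping then amounts to computing, for each variant $j$, a posterior inclusion probability $P(\beta_j \neq 0 \mid \mathbf{X}, \mathbf{y})$, or small sets of variants that jointly capture each causal signal; the strong correlation (linkage disequilibrium) among the columns of $\mathbf{X}$ is what makes this difficult.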

Combining the fine-mapping work described above with previous work on an empirical Bayes method for jointly analyzing multi-phenotype data (led by Sarah Urbut, with Matthew Stephens and Peter Carbonetto) in the context of expression quantitative trait loci (eQTL) mapping, I am currently working on a multivariate model to efficiently perform fine-mapping across multiple traits, with applications to joint analysis of molecular QTL data from private and public sources.
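Schematically, and glossing over details of the actual model, the empirical Bayes idea is to place a mixture prior on the vector of effects of each variant across the $R$ traits or conditions,

$$ \boldsymbol{b}_j \sim \sum_k \pi_k \, N_R(\mathbf{0}, \mathbf{U}_k), $$

where the covariance matrices $\mathbf{U}_k$ encode patterns of effect sharing among traits and the weights $\pi_k$ are learned from the data. Coupling a prior of this form with the sparse regression sketched above is, roughly, what multivariate fine-mapping requires.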

I also work on paired factor analysis (with Kushal Dey and Matthew Stephens), a dimensionality reduction technique based on a factor analysis model with a graph-like structure. We use this model to learn about patterns of gene expression, particularly in the context of single-cell studies.
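The basic factor analysis idea, loosely stated, is to approximate the expression matrix as a product of two lower-rank matrices,

$$ \mathbf{X} \approx \mathbf{L}\mathbf{F}, $$

where $\mathbf{X}$ is an $N \times G$ matrix of expression values for $N$ cells and $G$ genes, the rows of $\mathbf{F}$ ($K \times G$) are “factors”, or gene expression programs, and $\mathbf{L}$ ($N \times K$) gives the loading of each cell on each factor. The “paired” and graph-like aspects constrain how factors can be combined within a cell; the formula above is only the generic starting point, not the specific model.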

Bioinformatics software development and reproducible research

I am a developer of Script of Scripts (SoS), a scientific workflow system (developed at Dr. Bo Peng’s lab at MD Anderson). This is a software system to bridge the gap between interactive analysis and workflow systems, with strong emphasis on readability, practicality, and reproducibility in daily computational research. For exploratory analysis, SoS has a multi-language scripting format that centralizes otherwise scattered scripts and creates dynamic reports for publication and sharing. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-oriented, outcome-oriented, and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated file systems. In particular, SoS can be easily adopted in research projects that build on existing scripts, while substantially improving their organization, readability, and cross-platform computation management.
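To give a flavor of the process-oriented style, here is a minimal two-step SoS workflow sketch; the file names and shell commands are made up for illustration, and the exact syntax may vary slightly across SoS versions:

    [10]
    # align reads (hypothetical input and output files)
    input: 'sample.fastq'
    output: 'sample.bam'
    sh: expand=True
        bwa mem ref.fa {_input} | samtools sort -o {_output} -

    [20]
    # summarize the alignment produced by the previous step
    input: 'sample.bam'
    output: 'sample.stats.txt'
    sh: expand=True
        samtools flagstat {_input} > {_output}

Saved as, say, analysis.sos, such a workflow would be executed from the command line with sos run analysis.sos, which tracks the input and output files of each step so that up-to-date steps are not needlessly re-run.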

I also work on Dynamic Statistical Comparison (DSC), a statistical benchmarking framework (with Matthew Stephens and others in the Stephens Lab). DSC is an attempt to change the way that researchers perform statistical comparisons of methods. Currently, method comparisons are usually performed by the research group that developed one of the methods, which almost inevitably favors that method. Furthermore, performing these kinds of comparisons is incredibly time-consuming, requiring careful familiarization with the software implementing each method and the creation of pipelines and scripts for running and comparing them. In fast-moving fields such as genomics, new methods or software updates appear so frequently that comparisons quickly become out of date unless they can be easily extended to incorporate new developments. We have developed the DSC system to aid in efficient statistical comparisons. The system, when properly used, ensures reproducibility, and facilitates sharing, adaptation and extension of existing statistical comparisons for new method development projects. DSC is particularly well suited to running methods implemented in R and Python, the two predominant interactive programming languages in scientific research, and also supports experiments that combine R and Python.

Past research

Prior to the research described on this page, I worked on statistical methodology and software development for gene mapping of rare variants in the human genome, using linkage and association study designs. A list of my past research projects can be found here.
