Research Interests

  • Statistical genetics – methodology for relatedness, population structure, and imputation
  • Machine learning / Data mining
  • Bioinformatics, high-dimensional data analysis
  • Parallel / high-performance computing

Representative Projects

Gene Environment Association Studies (GENEVA) project
  • An NIH-wide collaboration that aims to accelerate understanding of genetic and environmental contributions to health and disease using genome-wide association studies (GWAS) with thousands of samples and millions of SNPs
  • http://www.genevastudy.org
  • Performed data cleaning and analysis on large-scale genotypic data, and contributed to manuscript preparation
Human Leukocyte Antigen (HLA) prediction project
  • Collaborated with GlaxoSmithKline (GSK) for a study of statistical prediction of HLA alleles
  • Applied and developed machine learning algorithms (random forest and attribute bagging), and prepared manuscripts
CoreArray high-performance computing project
  • Developed parallel computing algorithms using C/C++ for relatedness and principal component analysis in GWAS, and prepared manuscripts
  • My algorithms achieve up to a 300-fold speedup over the original serial implementations
The electronic Medical Records and Genomics (eMERGE) network project
  • The aim is to identify genetic variants associated with white blood cell count and differential leukocyte types in 13,923 subjects in the eMERGE network
  • Performed data analysis and contributed to manuscript preparation
SNP microarray project
  • Detected mosaicism for large chromosomal anomalies using SNP microarray data from over 50,000 GENEVA subjects
  • Performed data analysis and contributed to manuscript preparation

Publication / Supplementary Information