Exploring the feasibility of data augmentation while using smaller biobank data sets
Academic year 108
Semester 1
Publication date 2019-10-15
Title Exploring the feasibility of data augmentation while using smaller biobank data sets
Authors Chia Jung Lee; Ai Ru Hsieh(謝璦如); Pui Yan Kwok; Cathy SJ Fann
Conference name 2019 Annual Meeting of the American Society of Human Genetics
Conference location Houston, United States
Abstract Empowered by new computing technology and low genotyping costs, large biobank projects such as UK Biobank (UKB) have produced fruitful results in advancing the biomedical sciences. However, several smaller biobanks sample from different ethnic groups, and the statistical power to detect associations in these data sets is lower. Data augmentation by synthesizing unobserved samples has shown promising results in machine learning applications. Here, we hypothesized that augmenting small biobank data can increase statistical power and detect reliable association signals. A two-step strategy was adopted. First, control samples were filtered using the Partitioning Around Medoids (PAM) algorithm, with the entire phenome used to divide controls into clusters according to comorbidity. To reduce heterogeneity, only samples not in the same cluster for the phenotype of interest were used as controls. Second, cases and controls were stratified by age and gender. Artificial cases and controls were then generated by applying the Synthetic Minority Oversampling Technique (SMOTE) to each stratum. In this study, we chose asthma as the phenotype. We used the data set of Caucasians in UKB (UKB-C, NC-total=204,893, NC-case=31,303) and a random sample (UKB-CS, NCS-total=24,000, NCS-case=3,612). Fourteen linkage disequilibrium peaks (p ≤ 10^-8) from the UKB-C GWAS were used as targets for comparison. Only the HLA region was replicated using UKB-CS. Our strategy was then applied to UKB-CS. The real-to-artificial sample ratio (RAR) ranged from 4 (four real samples for each artificial sample) to 1. Compared with the targets from the UKB-C data, 4 peaks were replicated when RAR=4, 5 when RAR=3, 6 when RAR=2, and 11 when RAR=1. The HLA region was prominent for every RAR. When RAR=2, false-positive peaks appeared modest; almost half of the signals could be replicated when roughly 1/9 of the UKB-C samples were used. The same procedure was applied to data from Taiwan Biobank (TWB, NT-total=23,942, NT-case=2,069).
Without augmentation, only the HLA region was significant. When RAR=2, the GWAS results for TWB and UKB-CS showed a similar trend. In addition to the HLA region, only two other regions were replicated for TWB. Population heterogeneity may contribute to this discrepancy. Our results show that data augmentation is promising; however, caution is needed with respect to input data quality, possible stratification, and similar issues. Further testing of augmentation algorithms is needed to evaluate their performance.
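The second step described in the abstract (stratify by age and gender, then oversample within each stratum at a chosen RAR) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the record layout (age_group, sex, feature_vector), the choice of k=5 neighbours, and the rounding rule n_real // RAR are all assumptions made for illustration, and features are treated as plain numeric vectors. The abstract does not say how genotype coding or the PAM filtering step were handled, so neither is modelled here.

```python
import math
import random
from collections import defaultdict


def n_artificial(n_real, rar):
    """Synthetic samples to create for a real-to-artificial ratio (RAR).

    Assumption: RAR = n_real / n_artificial, rounded down.
    """
    return n_real // rar


def smote_stratum(samples, n_new, k=5, rng=None):
    """SMOTE-style interpolation within one stratum of numeric vectors."""
    rng = rng or random.Random(0)
    if len(samples) < 2 or n_new <= 0:
        return []  # interpolation needs at least two real samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest real neighbours of the base sample, excluding itself
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def augment_by_stratum(records, rar, k=5, seed=0):
    """Generate synthetic samples per (age_group, sex) stratum.

    records: list of (age_group, sex, feature_vector) for one class
    (cases or controls). Returns {stratum: [synthetic vectors]}.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for age_group, sex, features in records:
        strata[(age_group, sex)].append(features)
    return {
        key: smote_stratum(vecs, n_artificial(len(vecs), rar), k, rng)
        for key, vecs in strata.items()
    }
```

Under the assumed rounding rule, n_artificial(3612, 4) gives 903, that is, roughly one synthetic case for every four real UKB-CS cases at RAR=4. Because every synthetic vector lies on a segment between two real samples from the same stratum, the age and gender composition of the augmented set matches that of the real data.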
Keywords Computational tools;Bioinformatics;Genetic epidemiology;Genotype-phenotype correlations;Phenome-wide association
Language en_US
Conference type International
Conference dates 20191015~20191019
Country USA

Institutional repository link ( http://tkuir.lib.tku.edu.tw:8080/dspace/handle/987654321/119097 )