Faculty Record Search | Category: Conference paper | Faculty: 謝璦如 HSIEH, AI-RU

Title: Exploring the feasibility of data augmentation while using smaller biobank data sets
Academic year:
Semester:
Publication date: 2019/10/15
Work title: Exploring the feasibility of data augmentation while using smaller biobank data sets
Work title (other language):
Authors: Chia Jung Lee; Ai Ru Hsieh (謝璦如); Pui Yan Kwok; Cathy SJ Fann
Affiliated unit:
Publisher:
Conference name: 2019 Annual Meeting of the American Society of Human Genetics
Conference location: Houston, United States
Abstract: Empowered by new computing technology and low genotyping cost, large biobank projects such as UK
Biobank (UKB) have produced fruitful results in the advancement of biomedical sciences. However,
several smaller biobanks sample from different ethnic groups, and the statistical power to detect
associations in these datasets is lower. Data augmentation by synthesizing unobserved samples
has shown promising results in applications of machine learning algorithms. Here, we hypothesized that
augmenting small biobank data can increase statistical power and detect reliable association
signals.
A two-step strategy was adopted. First, control samples were filtered using the Partitioning Around Medoids
algorithm, which used the entire phenome to divide controls into clusters according to comorbidity. To
reduce heterogeneity, only samples not in the same cluster for the phenotype of interest were
used as controls. Second, cases and controls were stratified by age and gender. By applying the Synthetic
Minority Oversampling Technique to each stratum, artificial cases and controls were generated. In
this study, asthma was chosen as the phenotype. Data from Caucasians in UKB (UKB-C, NC-total=204,893,
NC-case=31,303) and a random sample of them (UKB-CS, NCS-total=24,000, NCS-case=3,612) were used.
Fourteen linkage disequilibrium peaks (p≤10^-8) from the UKB-C GWAS were used as targets
for comparison. Only the HLA region was replicated using UKB-CS. Our strategy was then applied to
UKB-CS. The real-to-artificial sample ratio (RAR) ranged from 4 (four real samples per artificial sample) to 1.
Compared to the targets from UKB-C data, 4 peaks were replicated when RAR=4, 5 when RAR=3, 6 when
RAR=2 and 11 when RAR=1. The HLA region was prominent for every RAR. When RAR=2, false positive
peaks seemed modest; almost half of the signals could be replicated when roughly 1/9 of the UKB-C
samples were used.
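The second step above (SMOTE applied per stratum at a chosen RAR) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of k, the tuple-based numeric feature vectors, and the rule "artificial samples per stratum = real samples // RAR" are all assumptions made for illustration. The abstract does not say how genotype or phenotype features were encoded, so the interpolation here simply treats every feature as a continuous number.

```python
import random
from collections import defaultdict


def smote_stratum(samples, n_new, k=5, rng=None):
    """Create n_new synthetic vectors by interpolating between a randomly
    chosen sample and one of its k nearest neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    if len(samples) < 2 or n_new <= 0:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # All other samples, ordered by squared Euclidean distance to base.
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def augment_by_stratum(records, rar, k=5, seed=0):
    """records: list of (stratum_key, feature_vector) pairs, where the key is
    e.g. (status, age_band, gender). Returns synthetic (key, vector) pairs.

    Assumption: one artificial sample is generated for every `rar` real
    samples in a stratum (RAR = real-to-artificial sample ratio)."""
    if rar <= 0:
        raise ValueError("rar must be a positive integer")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for key, vec in records:
        strata[key].append(tuple(vec))
    out = []
    for key, vecs in strata.items():
        n_new = len(vecs) // rar
        out.extend((key, v) for v in smote_stratum(vecs, n_new, k, rng))
    return out


# Hypothetical toy data: two strata with made-up two-dimensional features.
records = (
    [(("case", "40-49", "F"), (float(i), float(i % 3))) for i in range(8)]
    + [(("control", "50-59", "M"), (float(i), 0.0)) for i in range(4)]
)
synthetic = augment_by_stratum(records, rar=2)
```

Because each synthetic vector lies on a segment between two real samples from the same stratum, the artificial samples never mix cases with controls or cross age and gender strata. Lowering RAR increases the number of artificial samples relative to real ones.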
The above procedure was applied to data from Taiwan Biobank (TWB, NT-total=23,942, NT-case=2,069).
Without augmentation, only the HLA region was significant. When RAR=2 for TWB and UKB-CS, GWAS
results showed a similar trend. In addition to the HLA region, only two other regions were replicated for
TWB. Population heterogeneity may contribute to this discrepancy. Our results showed that data
augmentation is promising; however, caution is needed with respect to input data quality,
possible stratification, and similar issues. Augmentation algorithms should be tested further to
evaluate their performance.
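Keywords: Computational tools; Bioinformatics; Genetic epidemiology; Genotype-phenotype correlations; Phenome-wide association
Language: English (United States)
Indexed in:
Conference type: International
On-campus seminar location:
Conference dates: 20191015~20191019
Corresponding author:
Country: United States
Open call for papers:
Publication format:
Source:
Related links: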
SDGs