## Identifying direct risk factors in UK Biobank with simultaneous Bayesian-frequentist model-averaged hypothesis testing using Doublethink

Arning, N., Fryer, H. R. and D. J. Wilson (2024)

*medRxiv* **doi**: 10.1101/2024.01.01.24300687 (preprint)

Big data approaches to discovering non-genetic risk factors have lagged behind genome-wide association studies that routinely uncover novel genetic risk factors for diverse diseases. Instead, epidemiology typically focuses on candidate risk factors. Since modern biobanks contain thousands of potential risk factors, candidate approaches may introduce bias, inadequately control for multiple testing, and miss important signals. Bayesian model averaging offers a solution, but classical statistics predominates, perhaps because of concern that the prior unduly influences results. Here we show that simultaneous Bayesian and frequentist discovery of direct risk factors is possible via a model-averaged hypothesis testing approach for large samples called ‘Doublethink’. Doublethink produces interchangeable posterior odds and *p*-values that control the false discovery rate (FDR) and familywise error rate (FWER). We implement the Doublethink approach in R and apply it to discover direct risk factors for COVID-19 hospitalization in 2020 among 1,912 variables in UK Biobank. We find nine exposome-wide significant variables at 9% FDR and 0.05% FWER. These include several commonly reported risk factors (e.g. age, sex, obesity) but exclude others (e.g. diabetes, cardiovascular disease, hypertension) whose effects might be mediated through variables measuring general comorbidity (e.g. numbers of medications). We identify significant direct effects among infrequently reported risk factors (psychiatric disorders, infection, dementia and aging), and show that testing groups of correlated variables is a useful alternative to pre-analysis variable selection. We discuss the potential impact and limitations of joint Bayesian-frequentist inference, and the mutual insights it affords into long-standing differences between statistical approaches to scientific discovery.

See also Fryer, Arning and Wilson (2024).