
Applied Math Seminar

Date:
-
Location:
POT 745
Speaker(s) / Presenter(s):
Katherine Thompson, University of Kentucky

Title: Correct Model Selection in Big Data Analysis

Abstract: Although recent attention has focused on improving predictive models, less consideration has been given to the variability introduced into models through incorrect variable selection. Here, the difficulty of choosing a scientifically correct model is explored both theoretically and practically, and the performance of traditional model selection techniques is compared with that of more recent methods. The results in this talk show that the model with the highest R-squared (or adjusted R-squared) or the lowest Akaike Information Criterion (AIC) is often not the scientifically correct model, suggesting that traditional model selection techniques may not be appropriate when data sets contain a large number of covariates. This work starts by deriving the probability of choosing the scientifically correct model as a function of the regression model parameters, and shows that traditional model selection criteria are outperformed by methods that produce multiple candidate models for researchers' consideration. These results are demonstrated both in simulation studies and through an analysis of a National Health and Nutrition Examination Survey (NHANES) data set.
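A rough illustration of the abstract's central claim, not the speaker's derivation or simulation design: the sketch below simulates a linear model in which only two of six candidate covariates matter, fits every non-empty subset by ordinary least squares, and records how often each criterion selects exactly the true subset. The sample size, number of covariates, unit coefficients, Gaussian noise, and the helper names `fit_rss` and `selection_rates` are all assumptions made for illustration.

```python
import itertools
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def selection_rates(n=50, p=6, true_idx=(0, 1), reps=200, seed=0):
    """Fraction of simulated data sets in which each criterion picks the true subset."""
    rng = np.random.default_rng(seed)
    # All non-empty subsets of the p candidate covariates.
    subsets = [s for k in range(1, p + 1)
               for s in itertools.combinations(range(p), k)]
    hits = {"r2": 0, "adj_r2": 0, "aic": 0}
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        # Assumed true model: unit coefficients on the covariates in true_idx.
        y = X[:, list(true_idx)].sum(axis=1) + rng.standard_normal(n)
        tss = float(((y - y.mean()) ** 2).sum())
        scores = {"r2": [], "adj_r2": [], "aic": []}
        for s in subsets:
            k = len(s)
            rss = fit_rss(X[:, list(s)], y)
            r2 = 1 - rss / tss
            scores["r2"].append(r2)
            scores["adj_r2"].append(1 - (1 - r2) * (n - 1) / (n - k - 1))
            # Gaussian AIC up to an additive constant, negated so larger is better.
            scores["aic"].append(-(n * np.log(rss / n) + 2 * (k + 1)))
        for name, vals in scores.items():
            if subsets[int(np.argmax(vals))] == tuple(true_idx):
                hits[name] += 1
    return {name: h / reps for name, h in hits.items()}

rates = selection_rates()
print(rates)
```

Because R-squared never decreases when a covariate is added, the raw R-squared rule always favors the largest model in this setup; adjusted R-squared and AIC penalize size but can still admit noise covariates. Raising `p` while holding `true_idx` fixed gives a feel for how the chance of an exact match changes as the number of covariates grows.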