Finding Data to Enhance Your Teaching

Aug 1, 2025

TSHS Resource Review Post, by Jim Dignam, University of Chicago

Quality datasets are key to teaching statistical concepts effectively. Data that bring a real-life problem to life, illustrating the steps of selecting methods, carrying out analyses, interpreting findings, and communicating results, are motivating for both instructor and student. Too often we fall back on well-worn, convenient examples from older texts or prior course notes, when in fact a wealth of ‘real life’ data is now readily available.

For example, I needed a fresh example illustrating the difference between confounding and interaction/effect modification (recognizing that the two are not mutually exclusive). This is a difficult concept in an introductory course, and finding an example in the linear regression context proved elusive (there are plenty for discrete disease outcomes in epidemiology). After some unstructured searching, I found data from a study of lung capacity, measured as Forced Expiratory Volume (FEV), in over 700 participants ranging from young children to young adults. The relationship of age to FEV was of principal interest, but the data also contained smoking status. Interestingly, when considered alone, smoking had a paradoxically ‘positive’ effect on FEV, but when age was added to the model, the expected negative effect of smoking on FEV emerged, coupled with a positive age effect (as children grow, FEV naturally increases). Smoking is confined to older individuals in the dataset, which produces the strongly confounded effect when smoking is considered alone. However, slopes for age were about the same within smoking status groups (illustrating that confounding is distinct from interaction). So, this was a perfect dataset for this purpose.

I tracked the dataset back to the website Kaggle and located the specific dataset: FEV Dataset on Kaggle. This allows interested readers to access the exact data used in the example. Kaggle is primarily known as an artificial intelligence (AI) / machine learning (ML) learning and testing site, promoting model development and new methods and offering prediction challenges in which users can participate. However, the site contains an enormous store of datasets that can be used for any statistical analysis purpose. Currently, the site boasts about half a million datasets, an astonishing number.
Each dataset has a description and a ‘data card’ showing the data element names and values, as well as a numeric usability rating based on several evaluation domains. Several accompanying metadata features may also be found (completeness varies across datasets), which may include analysis code, discussions among prior users, and suggestions. In perusing a few entries, I did find datasets with little or no descriptive information and thus low usability ratings, although in some cases the data elements were self-explanatory enough to be useful in a course context. Many datasets have a 10.00 (highest) rating. The range of topic areas is truly vast and includes datasets with continuous, discrete, and censored time-to-event outcomes. A search feature allows one to navigate the databank, and searching by either data topic area or statistical analysis technique produces a list of candidate datasets. For full dataset access, Kaggle requires setting up a (free) account with an email and password. When using “found” data from sites like Kaggle or other online repositories, it’s important to review accompanying data dictionaries carefully, perform any necessary data cleaning, and consider privacy or IRB requirements if students will be using these data for projects or presentations.
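
For instructors who want to preview the confounding-versus-interaction pattern described above before downloading anything, here is a minimal simulation sketch in Python. It does not use the Kaggle FEV data: the sample size, age range, smoking probability, and coefficients are all assumptions chosen only to mimic the qualitative story (smoking confined to older participants, a negative smoking effect, and no true interaction), and the helper names `simulate_fev` and `ols` are hypothetical.

```python
import numpy as np


def simulate_fev(n=700, seed=0):
    """Simulate hypothetical (age, smoker, fev) data; NOT the real FEV study."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(3, 19, size=n)
    # Assumption: only participants aged 10+ smoke, each with probability 0.3
    smoker = ((age >= 10) & (rng.random(n) < 0.3)).astype(float)
    # Assumed true model: FEV rises with age, smoking lowers it, no interaction
    fev = 0.5 + 0.2 * age - 0.3 * smoker + rng.normal(0, 0.4, size=n)
    return age, smoker, fev


def ols(y, *columns):
    """Least-squares coefficients: intercept first, then one per column."""
    X = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


age, smoker, fev = simulate_fev()

crude = ols(fev, smoker)                          # FEV ~ smoker
adjusted = ols(fev, age, smoker)                  # FEV ~ age + smoker
interaction = ols(fev, age, smoker, age * smoker) # adds age x smoker term

print(f"Smoking alone:          {crude[1]:+.2f}")
print(f"Smoking, age-adjusted:  {adjusted[2]:+.2f}")
print(f"Age slope, adjusted:    {adjusted[1]:+.2f}")
print(f"Age x smoking term:     {interaction[3]:+.2f}")
```

Under these assumed settings, the smoking coefficient is positive when smoking is considered alone and turns negative once age enters the model, while the age-by-smoking term stays close to zero. With the real data, the same three model fits would be the natural sequence to walk through in class.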

Of course, the TSHS Resources Portal provides a bank of thoroughly vetted and documented datasets ideal for teaching in the health sciences. For example, the TSHS portal includes datasets on clinical trials, survival analysis, infectious disease outbreaks, and cardiovascular outcomes, each accompanied by a detailed data dictionary. This resource has the added advantage of providing datasets formatted for multiple analysis packages, along with references to the data source. It is anticipated that the Committee will continue to grow this data bank into a rich resource for use by ASA members and others. A similar repository is the Vanderbilt Biostatistics Datasets site, containing numerous datasets from Dr. Frank Harrell’s extensive instructional material on statistical modeling methods. These datasets are likewise available in one or more software formats, and each includes an accompanying data dictionary. Representative datasets include studies of birthweight, kidney disease progression, and time-to-event data in heart failure, offering instructors a wide variety of real-world examples to illustrate diverse methods.

The days of recycling the same data examples from textbooks and course notes are over, as these sources and others provide an inexhaustible supply of datasets with which to motivate our statistical methodology instruction. The diverse range of data domains also provides the opportunity to raise the interest level of our students, leading to a more engaging experience and more impactful teaching.
