Overcoming Fears (my own): Teaching Reproducible Research, Big Data and Data Mining
Nursing and public health researchers increasingly work with large, high-dimensional datasets (millions of records with thousands of variables or fields) drawn from public health applications, electronic health records, and microbiome and -omics studies. At the same time, there is a growing need for students and researchers to understand reproducible research workflow practices and principles and to incorporate them into their work.
Understanding version control and literate programming is becoming more important given the increasing demand that all data, code and documentation associated with the research be made available and easily understood so that others can adequately review and potentially reproduce the research.
A new Nursing PhD elective course:
“Big Data Analytics for Healthcare”
Given these trends, Emory University's Nell Hodgson Woodruff School of Nursing recognized the importance of “Big Data” for nursing and related public health research and the need for nurse scientists to become conversant in this area. In 2017, Dr. Vicki Hertzberg and I launched a new Nursing PhD elective course: “Big Data Analytics for Healthcare.”
This course describes the concepts underlying the field of study identified as big data analytics along with its applications in healthcare. The theoretical underpinnings of these concepts are presented along with applications in healthcare, including knowledge discovery, precision medicine/nursing, and the development of targeted interventions to improve health outcomes. Commonly used methods in big data analytics are reviewed, and the challenges related to gathering, analyzing, visualizing, and interpreting big data are discussed.
The only prerequisite of this Spring semester course is that the students have completed their first semester of biostatistics in the previous fall and are currently co-enrolled in the second semester of biostatistics or have completed one year of biostatistics previously.
From an analytical perspective, however, the core technical requirements of the course focus on learning the open source R software through the RStudio interface. Besides learning to program in R, the students are taught how to link “cloud-based” GitHub repositories to their RStudio data analysis projects, using Git for version control. They also learn R Markdown, which lets them link their data, analysis code and final reports seamlessly, following the dynamic documentation and literate programming principles that support reproducible research workflows.
When we designed this course, we had high expectations, not only for the students but also for ourselves. I personally had several key worries that I feared might undermine the excitement for the course. First, would the numerous course topics on data mining overwhelm students with only one previous semester of introductory statistics? Second, would installing and managing multiple open source software packages (R, R packages, RStudio, and Git) be too complicated for students to handle on their own computers? Third, would they embrace the multiple steps involved in supporting dynamic documentation and reproducible research workflows? After two successful semesters of this course (Spring 2017 and Spring 2018), my fears have been allayed.
Not only have the students (who have included nursing PhD students, nursing postdoctoral students, public health master's degree students, and students from other fields such as medicine and sociology) exceeded our expectations, but they have fully embraced the open source software platforms and reproducible workflow environments using version control and dynamic documentation.
It hasn’t all been smooth sailing, but the struggles the students have encountered have fostered the learning process. For example, even though the students struggled somewhat with installing R, RStudio and Git on different operating systems and versions (mostly Macs and PCs), they walked away with valuable skills and the knowledge that they can manage their open source computing environment in the future. They also often worked together to solve problems, which encouraged team science and collaboration. These experiences further highlighted the challenges of reproducible research, given the varied computing environments and approaches taken by different students.
At the end of the course, each student had to present an independent “big data” analysis project. Their final projects have been wide ranging and challenging. Some examples include: analyses of microbiome data from the “American Gut Project”; text mining of clinical notes from electronic health records (in coordination with our Nursing Center for Data Science); web scraping and text mining analysis of mental health support blogs; and classification tree and random forest models exploring linkages between local weather and heat-related illnesses.
I highly recommend teaching open source computing software, reproducible research principles, and data mining to nursing, public health and health-related science majors. Students are eager to embrace these methods and will surprise you with their ability to handle these topics. Often the hesitation and fears come from us as instructors worrying too much about overwhelming our students when they are indeed up for the challenge.
This post is based on the talk of the same name from JSM 2018. Slides can be found in the JSM 2018 archive or the TSHS community page.
The course website (created with R Markdown and hosted on GitHub) can be found here.