Prompt: “Describe your favorite project or presentation that you have completed for your internship”
The beautiful aspect of my summer research program is the “open-ended-ness” of the research projects. We are given a large, messy data set and we get to explore many different ways we can draw some inference out of the data.
My specific research focuses on working with Electronic Health Records, which are very large, complex data sets that were generously donated by the Michigan Genomics Initiative at the University of Michigan. The raw data set had approximately 3 million observations, which is a lot! It is definitely the largest data set I have ever worked with.
Initially, staring at the data set was very daunting. I didn’t really know what I wanted to do with the data or what the data even encompassed. The data set was also new to everybody, including the supervisors. It had never been explored before, and nobody knew what it contained.
My team and I decided to focus on cardiovascular diseases and spent about two weeks just looking through the data set to see what type of information we had. It was actually very fun doing this because I was able to apply a lot of different data exploration methods I’ve learned from my statistics courses at the University of Michigan.
It took another two weeks to clean the original data set into something that my team could use for analysis. There were so many roadblocks during this process, such as dealing with “weird” data inputs, missing values, and incorrect labeling. To clearly explain all of the data manipulation we had to do would probably take an essay, and I simply don’t want to fill this blog post up with that. But I hope the message is conveyed.
Some people say this is the worst part about data science: spending countless hours on cleaning data. However, I really enjoyed it and can see myself doing this in the future. I also learned that about 80% of a statistician’s work is data processing!
Isn’t that wild?