Using R Coding and Then ChatGPT to Figure Out an Outbreak
Ultimately, you’re the human with the lived experience, and GPTs are tools. Use them (your experience and the tools) wisely.
In my last post, I wrote about investigating an outbreak of foodborne disease, and how the p-value from the statistical analysis I did along the way can change. The p-value was initially not significant, but increasing the sample size made it significant without changing the measure of association. Doing that on purpose is a form of p-hacking, and it's tempting if you're a little too obsessed with p-values.
For this post, I will use a sample of 1,000 fake people who all ate at a big fictional event, some of whom became sick. I will then show you how we can determine the most likely food source of the outbreak. And I will finish with a discussion of the statistical analysis I did using ChatGPT’s Advanced Data Analysis tool.
I will link to the data and code at the end of the post.
First, Look At Your Data
One of the first things you should do with your data is to look at it. Open it up in your software of choice and look around. Are there any glaring errors, like weights for humans in the thousands or blood glucose measurements in the decimals? Yes, that’s right… You need to know the ins and outs of what you’re looking at.
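If you would rather not rely on eyeballing alone, a few lines of R can flag the glaring errors for you. What follows is only a minimal sketch on an invented toy table, not the outbreak dataset: the `toy` data frame, the `weight_kg` column, and the plausible-range limits are all assumptions made up for illustration.

```r
# Toy data, invented for illustration only (not the outbreak dataset).
toy <- data.frame(
  personID  = c("p1", "p2", "p3", "p4"),
  weight_kg = c(72, 6800, 55, 90),    # 6800 is an obvious entry error
  salad     = c("Y", "N", "y", "N"),  # "y" is an unexpected answer code
  stringsAsFactors = FALSE
)

# Flag numeric values outside a plausible range (the limits are an assumption).
implausible_weight <- toy$weight_kg < 2 | toy$weight_kg > 400
toy[implausible_weight, ]

# Flag answers that are not one of the expected codes.
bad_code <- !(toy$salad %in% c("Y", "N"))
toy[bad_code, ]
```

Rows that come back from either filter are the ones to go look at by hand before doing any analysis.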
Then do some exploratory data analysis. Do you have missing data? Is there a pattern to the missing data, or something that could explain it? For example, are you missing answers for drinking bottled water based on drinking agua fresca? Or were people who ate the fish less likely to tell you if they ate the salad?
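Here is one way you might check for missing data and for a pattern in it. This is a sketch under assumptions: the `toy` data frame below is invented and only borrows two column names from the survey, and cross-tabulating missingness against another answer is just one simple approach among several.

```r
# Toy data, invented for illustration only (not the outbreak dataset).
toy <- data.frame(
  agua_fresca  = c("Y", "Y", "Y", "N", "N", "N"),
  bottle_water = c(NA, NA, "N", "Y", "Y", NA),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Cross-tabulate missingness in one answer against another answer.
missing_by_agua <- table(
  agua_fresca = toy$agua_fresca,
  bottle_water_missing = is.na(toy$bottle_water)
)
missing_by_agua
```

If the missing answers pile up in one row of that table, the gaps are probably not random, and that is worth understanding before you trust any comparison between foods.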
Then just do a summary of the responses.
> summary(df)
   personID     illness  baked_potato  salad  soda   bottle_water  agua_fresca  dessert  beef   chicken  fish
 Length:1000    N:681    N:650         N:599  N:485  Y:1000        N:516        N:101    N:727  N:683    N:528…