Be Careful With Software Black Boxes
Know what the output should be, and you will recognize when it is “off.” Doing so can have big implications for the people you serve.
I recently encountered a problem while teaching a course on epidemiological applications of geographic information systems. One of the assignments for our students asked them to use a Python-programmed plug-in for QGIS to calculate the index of unmet health needs of Baltimore in 2010. I decided to re-create the assignment in R, and I came up with different values for the index.
The order of the Community Statistical Areas (CSAs) in terms of their unmet needs did not change between the two methods, but their index values did. As any good epidemiologist would, I dove into the data and code to find the source of the discrepancy. It all came down to how Python and R calculate the variance of a vector of values.
The Z score of a number is the number minus the mean of the set of numbers it came from, divided by the standard deviation of that set.
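To make the Z score definition concrete, here is a minimal Python sketch. It is only an illustration, not the plug-in's code or my R script: the `z_scores` helper and the numbers in `values` are made up for the example. It computes Z scores twice, once dividing the squared deviations by n - 1 (sample standard deviation) and once by n (population standard deviation), using the standard library's `statistics.stdev` and `statistics.pstdev`.

```python
import statistics


def z_scores(values, sample=True):
    """Return (value - mean) / standard deviation for each value.

    sample=True uses the sample SD (squared deviations divided by n - 1);
    sample=False uses the population SD (divided by n).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical values for illustration only, not the Baltimore data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_sample = z_scores(values, sample=True)       # denominator n - 1
z_population = z_scores(values, sample=False)  # denominator n

for v, zs, zp in zip(values, z_sample, z_population):
    print(f"{v:4.1f}  sample z = {zs:+.4f}  population z = {zp:+.4f}")
```

Every Z score shrinks or grows by the same factor when the denominator changes, so the ranking of the values stays the same while the scores themselves differ. That matches the pattern I saw with the CSAs.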
The sample standard deviation is the square root of the variance. And the variance is calculated with this formula: