The Data Detective: Rules Three, Four, + Five

As you read through the Introduction and Rules Three, Four, and Five, feel free to use this as a place to share your thoughts, questions, ideas, and reflections!

Remember that we’re all reading asynchronously, so if you aren’t caught up yet it’s not a problem! You can always post in the previous topic for the Introduction, Rules One + Two.

Plus! This book doesn’t necessarily need to be read in chronological order, so hop in on the chapters that interest you!

Avoid Premature Enumeration

The author focuses here on understanding what the data mean – which I interpret as making sure you understand the definitions and what is actually being measured. There are many cases where study results take on a completely different interpretation once you understand what the data consist of.

One personal example is a discussion I had with a firearms executive quoting a paper claiming Europe had more mass killings than the US. After our discussion, I looked up the paper and saw a horrible example of narrowly defining terms (torturing the data) to get the desired result (aka motivated reasoning). There were also analysis methodology issues. Of course, the paper could not get published in a peer-reviewed journal.

Step Back and Enjoy the View

Here the author is focused on putting the data into context. Make sure that you understand the priors. Ask questions about what else was occurring at the time. Understand the effect of scale.

Get the Backstory

Here the author lays out examples of survivorship bias and the need to understand motivations for a claim – things like money. I often see industry-funded studies that make claims that seem self-serving and questionable. This chapter has a message similar to Step Back and Enjoy the View. Understand the context of the data.

I certainly would question the results of a study that claimed eating lots of processed sugar was good for you if the sugar industry funded it.


I really liked the Avoid Premature Enumeration chapter.

Coming from research in social psych, this is something we discussed a lot, often under the heading of the validity of our measures. Psychologists often build measures that are easier to work with in the lab but fail to really capture a concept in the wild. I think one example given in the chapter is aggressiveness (measured with a hot-sauce paradigm), but other topics are affected too.

Another example is implicit bias. In the wild, we have plenty of evidence for group discrimination without discriminatory intent, but in the lab researchers mostly work with differences in reaction times in a categorization task (see here). Which is fine if you use this measure as a proxy. Now, there are plenty of companies trying to provide training to reduce implicit bias, but the problem is that these trainings often target the reaction-time task and not actual everyday discrimination behaviors.

Obviously I’m passionate about this, but to cut a long story short, I really liked this chapter.


I think about the things you’ve mentioned in “Get the Backstory” quite a bit – there’s such a constantly growing barrage of information that taking the time to get the background on something becomes so time-consuming that it’s easier to just “accept” it because it resonates with what we believe to be true.

I’m sure this has gotten conflated (and taken advantage of!) quite a bit in the age of social media, too – for example, I’m still not quite sure if egg yolks are the worst thing I could ever eat, or will actually be good for my cholesterol! Depending on where I look, I can find an argument for just about every extreme, not to mention all the in-between points.

This makes me really excited to dig into the chapter! I’ve always found it interesting how we need (and value!) research, but striking the balance between being specific enough to capture the information we’re looking for while staying generalizable enough to be useful is so incredibly difficult!

I really liked these chapters. Summarising what I’d like to share:

Avoid premature enumeration

When comparing two distributions or measures, it’s crucial that the numbers are comparable and that the context of what those numbers represent is explicitly clear. For example, when comparing two countries or cities, the population must be considered.
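To make that concrete, here’s a minimal sketch (all counts and populations are made up for illustration) of normalizing raw totals to a per-100k rate before comparing:

```python
# Hypothetical counts: comparing raw event totals across cities is misleading
# when populations differ, so normalize to a per-100k rate first.
events = {"City A": 1200, "City B": 900}                  # made-up event counts
population = {"City A": 8_000_000, "City B": 2_000_000}   # made-up populations

for city in events:
    rate_per_100k = events[city] / population[city] * 100_000
    print(f"{city}: {events[city]} events, {rate_per_100k:.1f} per 100k")

# City A has more events in absolute terms, but City B's per-capita rate
# is three times higher -- the comparison flips once context is added.
```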

Step back and enjoy the view

The results can depend on the time window used. Making an inference over a week, a month, a year, or even a decade can give different answers. A one-off spike is not necessarily a change in the overall trend – I think that’s why you should “step back and enjoy the view”.

One thing I also got from this chapter is how difficult it can be when you’re facing a problem and have to wait for more data before drawing a conclusion. When COVID started, there was little data available, and that data was biased towards symptomatic people who got tested.

As one approach to overcoming that lack of data, I remember that researchers were using SIR-type models to predict the future number of infected people, and that prediction was used as the mean of a Poisson regression to model the number of infected people arriving at a hospital.
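Roughly, the setup I remember looked something like this – a minimal sketch with made-up parameter values (transmission and recovery rates, hospitalization fraction), where a discrete-time SIR model supplies the expected new infections that serve as the Poisson mean for daily hospital arrivals:

```python
import numpy as np

# Sketch only: a discrete-time SIR model generates expected new infections,
# and an assumed fraction of those becomes the Poisson mean for daily
# hospital arrivals. All parameter values are made up for illustration.
N = 1_000_000              # population size (assumed)
beta, gamma = 0.30, 0.10   # transmission and recovery rates (assumed)
hosp_frac = 0.05           # assumed share of new infections hospitalized

S, I, R = N - 10.0, 10.0, 0.0   # start with 10 infected people
rng = np.random.default_rng(42)

for day in range(1, 61):
    new_infections = beta * S * I / N   # expected new cases today
    recoveries = gamma * I
    S -= new_infections
    I += new_infections - recoveries
    R += recoveries
    # Hospital arrivals: Poisson-distributed around the model-implied mean
    arrivals = rng.poisson(hosp_frac * new_infections)
    if day % 20 == 0:
        print(f"day {day}: expected new infections {new_infections:.0f}, "
              f"hospital arrivals {arrivals}")
```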

Get the backstory

The importance of reproducibility in science, in order to avoid getting “results obtained by pure chance”. I liked this idea, because in data science all experiments and conclusions drawn from the data have to be reproducible.
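A small illustration of that principle in code – a sketch assuming a NumPy-based analysis, where fixing the random seeds (the value 42 is arbitrary) makes a stochastic computation rerunnable:

```python
import random
import numpy as np

# Fixing seeds makes a stochastic analysis rerunnable: anyone executing this
# script gets the same "random" sample, so results can be checked rather than
# taken on trust.
SEED = 42  # arbitrary choice; any fixed value works
random.seed(SEED)
rng = np.random.default_rng(SEED)

sample = rng.normal(loc=0.0, scale=1.0, size=1000)
print(f"mean = {sample.mean():.4f}")  # identical on every run with this seed
```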


This! It’s such a great example, and while I’m sure there have been thousands of others in my lifetime, this one really stands out to me because I experienced it in real time and was making decisions based on what information was available.
