Saturday, May 17, 2014

Some 'data journalism' out there is just 'datum journalism'

(UPDATE: Don't miss this storified Twitter conversation.)

I tweeted this a few days ago:
Perhaps I was exaggerating, but I'm seeing too much shoddy stuff in websites like and FiveThirtyEight. They do publish interesting stories, but a very visible portion of their output is dubious.

I'm not just talking about the foolish comparisons of health care prices. That's old news. There are also the speculations about kidnappings in Nigeria (at least they corrected this one properly; good for them), and the long-term linear predictions that forget about xkcd (see the cartoon below) and black swans.
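To make the point about long-term linear predictions concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the "true" process is a hypothetical logistic curve that levels off at 100, and the function names and parameters are mine, not taken from any of the stories mentioned above. It fits a least-squares line to the early growth phase only, then extends that line far past the observed range.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def saturating_process(t, ceiling=100.0, rate=0.5, midpoint=10.0):
    """Hypothetical 'true' process: logistic growth that levels off at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# We only get to observe the growth phase (t = 5..12), where the curve
# looks roughly like a straight line.
observed_t = list(range(5, 13))
observed_y = [saturating_process(t) for t in observed_t]

slope, intercept = fit_line(observed_t, observed_y)

# Extrapolate the fitted line far beyond the observed range.
horizon = 40
predicted = slope * horizon + intercept
actual = saturating_process(horizon)

print(f"linear prediction at t={horizon}: {predicted:.1f}")
print(f"value of the saturating process at t={horizon}: {actual:.1f}")
```

Within the observed window the line fits reasonably well, which is exactly what makes it tempting. Far outside that window, the straight line keeps climbing while the made-up process has flattened out, so the prediction ends up several times too large. Real data offers no guarantee that the trend will behave either way, which is the whole problem.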

I should also mention this piece on helmets and bikes, an example of Gladwellism gone crazy: "Hey, here's a counterintuitive idea, and here you have a handful of papers and small datasets. You don't mind if they aren't that significant, if they say the opposite of what I claim, if I jump to conclusions, or if one of the studies has a sample size of one, do you?" See more (and more) details about this case, which is an absolute shame.

Data journalism, at least in some of the stories and blog posts that these organizations are publishing, has become 'datum journalism,'* a term that I've stolen from Census Reporter's Ryan Pitts. It's a pity. Some of us had great hopes. We still do, I guess, just because we want these publications to thrive, but patience isn't an infinite resource. As I said, at least FiveThirtyEight is willing to publish straightforward corrections. It's a start. The next step could be to enforce stricter editing and vetting.

(*As "datum" is the singular of "data", a "datum journalism" story is one in which grand theories are derived from insufficient evidence.)

On a positive note, Andrew Whitby, an economist at Nesta, has proposed four principles for better data journalism. Please read the entire presentation he designed. Here are the main takeaways:
1. Choose the right stories: In cases like this, a well-written review of the scholarly literature is likely to better inform public debate. Otherwise, stick to (a) lightweight but fun topics or (b) fast moving topics yet to attract academic attention. 
2. Embrace complexity: No interesting causal relationship involves only two variables.
3. Use statistics intelligently: A scatterplot of two variables with a least-squares regression line is not ‘doing statistics’. (…) Bad statistics is worse than no statistics. 
4. Finally, be modest: If you have so many caveats as to completely undermine any conclusion, then don’t offer a conclusion.
And to close, some deep xkcd wisdom: