Monday, April 7, 2014

Weekly resources (2): Against Big Data, Analytically Speaking Webcast, visualization, and data journalism

Last week I began a series of posts to collect resources related to visualization, infographics, and data. Here are some more:


• The Parable of Google Flu:  Traps in Big Data Analysis, an article in Science magazine that has unleashed a lively debate. Quoting:
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. (...) There are enormous possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
Comments here, here, here, and here.

• In Defense of Google Flu Trends. Alexis Madrigal says that the backlash is going too far. Highlights are mine:
Google Flu Trends is not a magical tool that replaces the CDC with an algorithm. But, then, the people who built it never imagined that it was. If it failed, it did so, more than anywhere else, in the popular imagination—and in the wishes of superficial Big Data acolytes. (…) I also think that the Google Flu Trends story is a parable, but it goes more like this: New technology comes along. The hype that surrounds it exceeds that which its creators intended. The technology fails to live up to that false hope and is therefore declared a failure in the court of public opinion.

True. We have seen this pattern before. We may well be on the downward slope of the "peak of inflated expectations" of the hype cycle.

• Eight (No, Nine!) Problems With Big Data. More jumping-on-the-bandwagon, this time by NYU's Ernest Davis and Gary Marcus, author of Kluge.
A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors. (...) Champions of big data promote it as a revolutionary advance. But even the examples that people give of the successes of big data, like Google Flu Trends, though useful, are small potatoes in the larger scheme of things. They are far less important than the great innovations of the 19th and 20th centuries, like antibiotics, automobiles and the airplane.
This is all spot-on, but there may be more to 'Big Data' (whatever this term means) than Google Flu Trends.
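
The "five bogus correlations in 100" figure in the quote is what a 5% significance threshold implies: 100 tests × 0.05 = 5 expected false positives. Here is a minimal sketch of that arithmetic as a simulation, assuming independent Gaussian noise for both variables. The function names, the sample size of 50, and the seed are my own illustrative choices, not anything from the article, and the p-value uses the Fisher z approximation rather than an exact t-test.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r under the null hypothesis of no correlation.

    Uses the Fisher z-transform with a normal approximation
    (an assumption of this sketch; it is not an exact test).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_bogus_correlations(num_tests, sample_size, alpha, rng):
    """Count 'significant' correlations between pairs of unrelated noise."""
    hits = 0
    for _ in range(num_tests):
        # Two variables drawn independently: any correlation is pure chance.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if approx_p_value(pearson_r(xs, ys), sample_size) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    rng = random.Random(2014)  # hypothetical seed, only for repeatability
    print(count_bogus_correlations(100, 50, 0.05, rng))
```

Any single run of 100 tests will not land on exactly five, but averaged over many runs the count hovers around 5% of the tests. The more pairs of variables you compare, the more of these chance hits you collect.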

• Big data: are we making a big mistake? This article by Tim Harford is arguably the best of the lot. If you're going to read just one of the recommendations in this post, read this one. Quoting:
(A) theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. (...) Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not. 
Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is. 
(Big) data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.


In the last article linked above, Tim Harford quotes Kaiser Fung. Last Friday, we were both interviewed by JMP's Anne Milley for the Analytically Speaking series of webcasts. It was fun. The video of the conversation is already online.


• Dealing With Data: Be Very, Very Skeptical. A summary of what NYT's Derek Willis said at the Journalism Interactive 2014 conference.

• Columbia's Tow Center for Digital Journalism has begun publishing a series of profiles of data journalists. I've spotted three so far. They are all worth reading: 1, 2, 3. The author, Alexander Howard, has also written a good story about how Oakland Police Beat, a nonprofit website, uses data to hold local law enforcement officers accountable.

• A few weeks ago, Jonathan Corum gave a great talk at the Malofiej infographics conference. He has made his slides and notes available.

• Paul Bradshaw has announced a new e-book, titled Finding Stories in Spreadsheets. I can't wait.

• Finally, while doing some research for a few book chapters I stumbled upon Steve Haroz's website. Haroz is a Psychology postdoc at Northwestern University who studies visual perception and cognition applied to visualization. He hasn't updated his blog for a while, but I added it to my RSS feed anyway. It links to many research papers that look intriguing, like this one, about narrative visualization, or this one, co-written by Haroz himself, on attention.