Friday, July 5, 2019

Visualizing the unfamiliar

BBC World Service's Josh Rayman has published an article about their efforts to visualize the parliamentary elections in Afghanistan. The piece explains how he and his team overcame the challenge of being unfamiliar with the story they wanted to cover.

This is a very common problem, which makes the article instructive: it shows how thorough and careful we need to be when covering a story we know little about. Josh explains:
We started data analysis a long time before approaching our language teams about working on the project, looking for interesting data stories and then reaching out for the context. The initial data set was tens of thousands of lines: one row for each candidate voted on in each ballot box, with a great number of empty rows. Over half the candidates appeared on at least 100 ballot locations, with some candidates in Kabul standing in over 400. It was daunting to know where to start. 
The raw data didn’t indicate winners, so we created a script that calculated the winning candidates, and accounted for the women’s quota. We also had candidate gender stats manually calculated by our producer from BBC Pashto who went through all 2,500 candidates to create region-by-region numbers. 
We aggregated the data using Python (plus pandas) and node.js to make more manageable JSON files for the dashboard. We were able to slot in new data as it arrived over a long period of time, and we could combine the partial 2018 dataset with the existing 2010 data on the fly.
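To make the aggregation step a bit more concrete, here is a minimal sketch of how one-row-per-candidate-per-ballot-box data could be collapsed into a compact JSON summary. It is not the BBC team's code: they used pandas and node.js, while this version sticks to Python's standard library. The column names (`province`, `candidate`, `ballot_box`, `votes`) and the sample rows are invented for illustration. It also leaves out the winner and women's quota calculation, since the article doesn't spell out those rules.

```python
import json
from collections import defaultdict

# Hypothetical sample: one row per candidate per ballot box.
# Column names and values are assumptions, not the real data set.
rows = [
    {"province": "Kabul", "candidate": "A", "ballot_box": "K-001", "votes": "120"},
    {"province": "Kabul", "candidate": "A", "ballot_box": "K-002", "votes": "80"},
    {"province": "Kabul", "candidate": "B", "ballot_box": "K-001", "votes": "50"},
    {"province": "", "candidate": "", "ballot_box": "", "votes": ""},  # empty row
    {"province": "Herat", "candidate": "C", "ballot_box": "H-001", "votes": "70"},
]


def aggregate(rows):
    """Sum votes per candidate within each province, skipping empty rows."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        if not row["candidate"] or not row["votes"]:
            continue  # the raw data contained many empty rows
        totals[row["province"]][row["candidate"]] += int(row["votes"])

    # Within each province, list candidates from most to fewest votes
    # (ties broken alphabetically by candidate name).
    return {
        province: [
            {"candidate": name, "votes": votes}
            for name, votes in sorted(
                candidates.items(), key=lambda item: (-item[1], item[0])
            )
        ]
        for province, candidates in totals.items()
    }


summary = aggregate(rows)

# A compact JSON document that a dashboard could load directly.
dashboard_json = json.dumps(summary, indent=2, sort_keys=True)
print(dashboard_json)
```

Because the summary is rebuilt from the raw rows every time, new batches of results can simply be appended to `rows` and the script rerun, which is roughly the "slot in new data as it arrived" workflow Josh describes.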

I think it was NYT's Amanda Cox who suggested in a talk that the best visualization stories often lurk in data that is hard to obtain, that you need to generate yourself, or that is publicly available but difficult to understand. Josh's article is yet another example of that.