Friday, September 25, 2020

'How Charts Lie': a few clarifications and edits

(Update 01/08/2020: you can now DOWNLOAD most figures from the book in high resolution and in two different color schemes.)


CLARIFICATIONS FOR ALL EDITIONS

A clarification for page 44: where it says “people became richer or poorer” or “people in those countries contaminated more or less,” I think I'd add a qualifier such as “on average” just in case, as these are per capita numbers.

Second clarification, page 73: the Richter scale is based on tenfold increments of wave amplitude. That's what I mean by “stronger”; in other words: it's not the possible energy released, which increases at higher rates.
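
To make the difference between amplitude and energy concrete, here's a minimal sketch. It assumes the commonly cited rule of thumb that released energy grows roughly as 10 raised to 1.5 times the magnitude difference; the function names are made up for this illustration.

```python
def amplitude_ratio(m1: float, m2: float) -> float:
    """How many times larger the wave amplitude is at magnitude m2 than at m1."""
    return 10 ** (m2 - m1)


def energy_ratio(m1: float, m2: float) -> float:
    """Approximate ratio of released energy (assumes energy ~ 10^(1.5 * magnitude))."""
    return 10 ** (1.5 * (m2 - m1))


print(amplitude_ratio(5.0, 6.0))         # tenfold amplitude per whole step
print(round(energy_ratio(5.0, 6.0), 1))  # roughly 31.6 times the energy
```

Under that assumption, a two-step difference means 100 times the amplitude but about 1,000 times the energy.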

Third clarification, page 74: in the fictional example about gerbil population growth, I assumed that the parents die shortly after giving birth; otherwise, by the second generation we wouldn't have double the gerbils (8 children) but triple, a total of 12: 8 children and their 4 parents.

Fourth clarification: whenever you see Kaplan-Meier charts in the book, assume that lines have been smoothed; actual Kaplan-Meier estimators create lines that look like staircases.
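
To see where the staircase shape comes from, here's a minimal sketch of the Kaplan-Meier estimator in plain Python. The data and the function name are invented for illustration; it isn't taken from the book or from any particular statistics library.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival)] at each time where at least one event occurred.

    durations: time until the event, or until the subject left the study
    observed:  True if the event happened, False if the subject was censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = [(0, 1.0)]
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # The estimate only changes at times when an event is observed
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up data: six subjects, two of them censored (observed=False)
times = [2, 3, 3, 5, 8, 8]
seen = [True, True, False, True, False, True]
for t, s in kaplan_meier(times, seen):
    print(t, round(s, 3))
```

The estimate stays flat between event times and drops only when an event is observed, which is why the unsmoothed line looks like a set of steps.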


EDITS FOR THE FIRST PRINT EDITION (hard cover)

If you read the first print edition of How Charts Lie you may notice a few printing and layout errors. These should have been corrected in the e-book already, and also for the paperback version, to be released in October of 2020. If you detect anything that looks strange other than these, please let me know.

I've printed out cards containing the main corrections. If you want to receive one and keep it inside your copy of the print book, contact me.

On page 24 the transparency effects that should emphasize or hide parts of the charts disappeared between the galleys (where the graphic was perfect) and the final printing. Mysteries. Here's how that graphic should look:


On page 45 the line corresponding to the United States didn't show when printed. Here's the graphic:


There's a minor issue with the gradient on the second bar of the chart on page 142: it doesn't fade to white. It should look like this:



A chart on page 116 is slightly misplaced.

On page 49, the second paragraph should read: “Imagine that a district's circle sits on the +20 line above the baseline. This means that Republicans lost 10 percentage points, which went to Democrats, for a total of +20 percentage point change in their favor (there weren't third-party candidates, I guess.)”

On page 92 there's a needless “is” in a sentence that should read “Unless a crime is premeditated...” On page 104 there's an “s” missing at the end of “assess” (this one made me giggle.) At the bottom of page 128 there's an “as” missing before “Assange”. And on page 157 there's a tiny label that should read MS instead of MI. The last label on the Y-scale of the chart on page 172 should read “600” instead of “490”.

Thursday, September 24, 2020

U.S. version of the 'What if all COVID-19 victims were your neighbors?' project

The Washington Post has just published the U.S. version of the project I mentioned in the previous post: what if all victims of COVID-19 lived around you? Working with The Washington Post graphics folks was an honor and a pleasure. Here are all people involved (I copied this from the project page itself):

This is the U.S. version of a project originally created in Brazil, as a partnership by Agência Lupa and Google News Initiative.

Art direction by Alberto Cairo. Data and storytelling by Rodrigo Menegat. Design by Vinicius Sueiro and Vallery Nascimento. Development by Tiago Maranhão and Vinicius Sueiro. Distribution strategy by Gilberto Scofield Jr. Editing by Natália Leal. Google News Initiative: Simon Rogers and Marco Túlio Pires.

Additional editing by Ann Gerhart. Copy editing by Anne Kenderdine. Additional data support by Dan Keating. Additional design and development by Lucio Villa, Matt Callahan, Simon Glenn-Gregg and Armand Emamdjomeh.

Friday, July 24, 2020

New project: What if all COVID-19 victims were your neighbors?

The latest project I've art-directed for the Google News Initiative is titled No Epicentro (“At the Epicenter.”) It asks: what if all confirmed COVID-19 victims in Brazil were your neighbors?

No Epicentro has just been published by our media partner, Agência Lupa, and was developed by data journalists Tiago Maranhão, Rodrigo Menegat, and Vinicius Sueiro, with advice from Marco Túlio Pires.

No Epicentro is available only in Portuguese for now (there'll be an English version soon), but you can run it through an automated translator, and it's very easy to understand anyway. (Update August 3: the project now has an English version.) Learn more about the project in this article and in the making-of; the data and code are free to download.

Our goal was to make data feel a bit more personal: 93,000 people, the number of confirmed COVID-19 deaths in Brazil at the moment of this writing, is a cold figure, something that makes the scope of the tragedy hard to grasp—please read Numbers and Nerves to learn about “statistical numbing”. But what if we force you, the reader, to imagine those 93,000 as people you know, see every day, or interact with?

Begin by typing any address in Brazil. This was mine when I lived in São Paulo:


No Epicentro then reveals more than 93,000 white dots (again, the number of COVID-19 deaths at the moment) inside a circle with your address as the center. The visualization uses census tract data, so every white dot represents a person living around you; therefore, if you are in a densely populated area, the circle will be small, and if you are in the countryside, it'll be larger:
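
One way to picture why the circle shrinks in dense areas is the rough sketch below. It is not the project's actual code: it assumes each census tract is reduced to a centroid with a population, and the function name and toy data are hypothetical.

```python
import math


def radius_for_population(tracts, center, target):
    """Smallest centroid distance at which the enclosed population reaches target.

    tracts: list of (x_km, y_km, population) census-tract centroids (hypothetical)
    center: (x_km, y_km) of the reader's address
    Returns None if all tracts together hold fewer people than target.
    """
    by_distance = sorted(
        (math.dist(center, (x, y)), pop) for x, y, pop in tracts
    )
    enclosed = 0
    for distance, pop in by_distance:
        enclosed += pop
        if enclosed >= target:
            return distance
    return None


# Toy data: a dense neighborhood and a sparse one, same target
dense = [(0, 0, 40_000), (1, 0, 35_000), (0, 1, 30_000), (5, 5, 20_000)]
sparse = [(0, 0, 2_000), (10, 0, 30_000), (0, 20, 40_000), (30, 30, 50_000)]
print(radius_for_population(dense, (0, 0), 93_000))
print(radius_for_population(sparse, (0, 0), 93_000))
```

With these invented numbers the dense neighborhood reaches 93,000 people within a much shorter radius than the sparse one.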


At any point you can generate a customized infographic with the map of the region where you live to share on social media:


Next, the visualization compares the number of COVID-19 deaths to the population of a city that is near you. In my case, it's Rio Grande da Serra. If all deaths had happened there, this city and some of its surrounding areas would have been wiped out:


No Epicentro ends with other comparisons, and it shows where COVID-19 deaths have actually been registered in Brazil:


Wednesday, June 10, 2020

Visualizing police killings in Kenya

Missing Voices is a collaboration between numerous NGOs that tracks police killings, enforced disappearances, and extrajudicial executions in Kenya.

The visualization firm OdipoDev, founded by designers Odanga Madung and Samer Costello, has created a data-driven story that displays their main figures (methodology page here). Missing Voices also contains a gallery of obituaries of every individual who's died at the hands of the police.

This is important work.

Monday, June 8, 2020

U.S. COVID-19 tracker by ProPublica

The big U.S. newspapers get most of the attention when it comes to news visualization, but other players are producing excellent work as well. ProPublica, a nonprofit investigative journalism organization I donate to every year, has a very good tracker of COVID-19 cases in the U.S., designed by Lena Groeger and Ash Ngu. The geographically arranged animated arrows on top are lovely.


Friday, June 5, 2020

Psychopathic charts, lines that should be bars, and picking cherries

I was hoping to withdraw from the world during the summer and devote time to activities that require deep concentration (reading, writing, designing, and also this), but it seems that the run-up to the November election will bring a deluge of bad charts. I should have known better; How Charts Lie may need a sequel soon.

The following are just from today:

This is a truly psychopathic chart (UPDATE: Fox News has apologized):


Next, here's one of the best examples of a grossly misleading chart that, at the same time, isn't technically incorrect (this is the author):


The chart only looks like a V-shape because the data is encoded as a line, when a bar graph would've been more appropriate. The impression it creates is entirely different (source):


Finally, the President, always a reliable source of examples of convenient data cherry-picking, entertains us with this beauty (and it turns out that there is a huge glitch in the data):


Jon Schwabish has some things to say about it:




Wednesday, May 20, 2020

About that weird Georgia chart

Visualization social media has been busy mocking the following chart by the Georgia Department of Public Health. Pay attention to its horizontal axis:


I never attribute malice when sloppiness is a more parsimonious explanation. I guess that whoever designed this chart thought that sorting the bar groups from highest to lowest, instead of chronologically, was a good idea.

This is not wrong per se; it's possible to think of situations when it's useful to arrange your data like this during analysis. As always happens in visualization, design choices depend on purpose.

In this case, though, the purpose is to show “the most impacted counties over the past 15 days and the number of cases over time,” so separating the counties and then sorting their bars chronologically seems to make more sense. Something like this:


Visualization books, including mine, spend many pages discussing how to choose encodings to match the intended purpose of every graphic, but we pay too little attention to the nuances of sorting: should we do it alphabetically, by geographic unit, by time, from highest to lowest, from lowest to highest—or do we need an ad-hoc criterion? Or should we make the graphic interactive and let people choose? As always, the answer will depend on what we want the reader to get from the visualization.
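
As a small illustration of how much the sort key matters, here's a minimal sketch with made-up numbers (the county names are real, the case counts are not). It contrasts sorting by value with sorting by county and then by date.

```python
from datetime import date

# Made-up rows: (county, day, new cases)
rows = [
    ("Cobb", date(2020, 5, 2), 40),
    ("Fulton", date(2020, 5, 1), 55),
    ("Cobb", date(2020, 5, 1), 62),
    ("Fulton", date(2020, 5, 3), 48),
    ("Cobb", date(2020, 5, 3), 35),
    ("Fulton", date(2020, 5, 2), 70),
]

# Sorting by value alone scrambles the dates
by_value = sorted(rows, key=lambda r: r[2], reverse=True)

# Sorting by county first and then by date keeps each series chronological
by_county_then_date = sorted(rows, key=lambda r: (r[0], r[1]))

for county, day, cases in by_county_then_date:
    print(county, day.isoformat(), cases)
```

Both orderings contain the same data; only the second one lets a reader follow each county over time.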

Thursday, May 7, 2020

The problem with inconsistent and unlabeled scales

One of the strategies to come up with novel ways to display data is to combine existing graphic forms. This morning The New York Times published a story titled 'Most States That Are Reopening Fail to Meet White House Guidelines' that contains a series of square equal area cartograms that are, at the same time, trellis charts. The piece is really nice.

There's something about it that worries me a bit, though: the charts don't have scales. Removing scales from graphics seems to be getting more popular lately among data journalists, and it works in some cases here: some of these charts have horizontal reference lines (see animation on the right) that help you get a sense of proportion and variation.

But the following set of line charts lacks any reference and, moreover, it seems that each one is based on a different scale: New York has more daily confirmed cases than Florida—thousands versus hundreds—but the last point on Florida's line is higher than New York's. New Hampshire has a 7-day average of around 100 cases; Maine has a bit more than 20. I understand that the goal of these graphics is to reveal upward and downward trends, not the case count itself, but I fear that this design choice may mislead some readers:



Here are those charts with scales:


What could be an alternative here, I wonder? It's tricky. There might not be an ideal solution, as often happens in visualization; adding detailed labels would clutter these tiny charts. Perhaps we could show not daily new confirmed cases but some sort of index (percentage change based on a common starting point for all states), or the variation in comparison to the previous day or week?
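
To make the index idea concrete, here's a minimal sketch with invented numbers. The function names, the base of 100, and the sample series are all assumptions for illustration, not anything the Times used.

```python
def index_to_start(series, base=100.0):
    """Rescale a series so its first value equals base (first value must be nonzero)."""
    first = series[0]
    return [base * value / first for value in series]


def percent_change(series, lag=7):
    """Percentage change of each value versus the value lag steps earlier."""
    return [
        100.0 * (series[i] - series[i - lag]) / series[i - lag]
        for i in range(lag, len(series))
    ]


# Made-up 7-day averages for a large and a small state
large_state = [4000, 3600, 3000, 2400]
small_state = [20, 22, 25, 30]
print(index_to_start(large_state))  # falls from 100 to 60
print(index_to_start(small_state))  # rises from 100 to 150
```

Once both series start at 100, they can share one scale: the trend is comparable across states even though the raw counts differ by orders of magnitude. The trade-off is that the reader no longer sees the case counts themselves.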

Monday, May 4, 2020

The Dawn of a Philosophy of Visualization

Cover illustration by Prisca Schmarsow of Eyedea Studio
My new article for Nightingale, the online magazine of the Data Visualization Society, has just been released. It's titled The Dawn of a Philosophy of Visualization, and it adapts the foreword that I wrote for a new book, Data Visualization in Society, published by Amsterdam University Press.

The book will be presented in a public e-event on Wednesday. Sign up for it here. To read the book, follow this link (it's open access.)

Monday, April 27, 2020

Latest project: Search waves during a pandemic

The latest project I've art-directed has just been launched. It's titled Searching COVID-19 and it was produced by Schema Design in collaboration with the Google News Initiative and Axios, which has published its own version.

The visualization consists of a dynamic beeswarm plot in which each bubble represents a query reaching the top searches in a U.S. state (states may appear more than once in the plot). Bubbles ungroup little by little to reveal the patterns mentioned in the project's opening copy; for instance, in the early stages of a pandemic people often look for information about the disease itself (“what is” searches), and later we clearly see an increase in searches for how to prepare for the disease or for its consequences (“how to” searches).

You can find many previous projects here, and this is a 2017 article about the collaboration.

Friday, April 24, 2020

Latest project: Visualizing fact-checks

The latest project in the ongoing collaboration between the Google News Initiative and several designers and studios is this large visualization for Poynter's International Fact-Checking Network (IFCN). It shows nearly 4,000 fact-checks from more than 70 countries.

The graphics were designed by Polygraph, mainly by Amelia Wattenberger (don't miss her book about D3.js.) The project has two parts: the interactive visualizations themselves and a searchable database.

My role, as always, was limited to offering suggestions and feedback.

You can see other projects in collaboration with many other designers here.


Tuesday, April 21, 2020

Paul Mijksenaar, information designer

During my formative years, back in the late 1990s, I read widely in the literature of information design, wayfinding, and signage. I still own a small collection of books about those topics; among them, there are two by Dutch designer Paul Mijksenaar that I've loved for more than two decades: Visual Function and Open Here: the Art of Instructional Design.

Roots, a publication that celebrates the history of Dutch graphic design, has just published a nice profile of Mijksenaar. It contains many details I hadn't heard about; for instance, I knew about Mijksenaar's wayfinding signage work for airports, particularly Schiphol, but what I didn't know is that he also created the signs that appear in Steven Spielberg's movie The Terminal.

Tuesday, April 7, 2020

Interview in The Effective Statistician

As a journalist and a designer who writes for other journalists and designers, I'm always puzzled (and also flattered and a bit terrified) when I get invited to conferences or podcasts run by statisticians, data scientists, or information and library science types, but here we are. A couple of weeks ago I had a nice conversation with Alexander Schacht, a mathematician and biostatistician at UCB, who has a podcast called The Effective Statistician.

We talked about charts, the coronavirus, the importance of design for scientists and researchers, and the importance of science for designers. I mentioned tools I love, such as Datawrapper, Flourish, and iNZight; books such as Steve Few's Show Me the Numbers and Cole Nussbaumer Knaflic's Storytelling With Data; the work of Tufte, Tukey, and Cleveland; and Michael Friendly's work on the history of visualization (1, 2, and his upcoming book). I hope you'll enjoy it.

Friday, April 3, 2020

Special issue of the Information Design Journal

The Information Design Journal has a free special issue out collecting the submissions to the 2018 Information+ conference (the 2020 edition has been postponed for obvious reasons).

I have not read this yet, but I took a look at the table of contents and spotted a few papers that sound very promising, such as “Feeling numbers: The emotional impact of proximity techniques in visualization” (PDF) and “Belief at first sight: Data visualization and the rationalization of seeing” (PDF).

Good weekend reading.

Wednesday, April 1, 2020

Interviews in CNN and CBC

In case you're interested, I've just been interviewed by CNN in Spanish about how to be more attentive to the graphics we're seeing these days. I went over some basics, such as verifying the sources of the data, carefully assessing what it is that graphics measure, etc. It's essentially what I explained in How Charts Lie.

I've also appeared in a piece by CBC's Roberto Rocha titled 'The flurry of daily pandemic data can be overwhelming. Here's how to make sense of it.'

I commented on Johns Hopkins University's famous dashboard of confirmed coronavirus cases. I like that visualization, and we should all praise its authors, but I have some doubts about the decision to represent cases by country (Europe and Africa) or by province/state (the United States, China). I think this can be easily misread, as it seems that the U.S. has many more cases than it has, in comparison to Europe, for instance. This comes on top of a pretty common problem in proportional symbol maps like this: overlap.

Saturday, March 28, 2020

Rates of change are tricky

Let me begin by saying that (a) we should all appreciate the effort that so many journalists are making to keep the public informed about the coronavirus pandemic; shifts of 10, 12, 14 hours and more are common (subscribe to your favorite news publications, people, be responsible!), and (b) commenting on graphics is easier than making those graphics. As a designer myself, I know how hard it is to navigate the many challenges and trade-offs visualization poses.

This said, I often ponder how we can make visualizations more approachable and understandable. Take the following graph from this New York Times story:


The vertical position of the points on the line represents the percentage change of confirmed cases over the previous 7 days. There are other ways to show change (think of bars with arrowheads pointing up), but they are clunkier. This graph, if you know how to read it, works fine: the goal is to bring those lines down to the +0% baseline, or close. This point is explained in the body of the story.

However, imagine the following realistic scenario: someone takes a screenshot of this graph and publishes it on social media, adding some personal comments, or wild inferences (1, 2). I wonder whether graphs like this, when isolated from what surrounded them originally, might make some readers reach dubious conclusions or feel too optimistic and confident (“most lines are going down! You're all overreacting! Time to stop worrying and go back to work!”)

Those readers would be missing a crucial point: a 33% increase (line is low) is indeed, in general, better than an 80% one (line is high), but we need to know more. Prior conditions matter. At the beginning of the chart, the curves are pretty high probably because those are the early stages of each outbreak; few cases were detected. If a city begins with 10 confirmed cases, and later detects 8 more, for a total of 18, it has an 80% increase.

But if we already have many confirmed cases, for instance 1,000, and later we end up with 1,330, we've experienced an increase of +33%. It is better to have +33% than +80%, as it might mean* we're stretching the time it takes cumulative confirmed cases to double or triple (we're flattening the curve, as we say these days), but readers shouldn't ignore other facts. Even a “tiny” 10% increase, if experienced when you're already dealing with tens of thousands of infections, may be catastrophic: hospitals could be even more overwhelmed, leading to more deaths. Think of the situation in Lombardy.
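
The arithmetic behind this argument can be sketched in a few lines. The function names and the 50,000-case figure are invented for illustration, and the doubling estimate assumes growth stays constant from one period to the next, which real outbreaks don't do.

```python
import math


def percent_increase(before, after):
    """Percentage growth from before to after."""
    return 100.0 * (after - before) / before


def new_cases(before, pct):
    """Absolute number of additional cases implied by a percentage increase."""
    return before * pct / 100.0


def doubling_periods(pct):
    """Periods needed to double if growth stays at pct percent per period."""
    return math.log(2) / math.log(1 + pct / 100.0)


print(percent_increase(10, 18))  # the small-outbreak example: 10 cases, then 18
print(new_cases(10, 80))         # an 80% jump on 10 cases
print(new_cases(50_000, 10))     # a "tiny" 10% on a hypothetical 50,000 cases
print(round(doubling_periods(80), 2), round(doubling_periods(33), 2))
```

A lower rate lengthens the doubling time, but a low rate applied to a large base still produces far more new cases than a high rate applied to a handful.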

The NYT story contains another graphic comparing rates of change with confirmed cases per thousand people but, as the Times journalists themselves acknowledge, it's “hard to read”:


What to do? It's tricky. Maybe to show more and explain more, as I've suggested before? The New York Times is doing a good job. The body of the story thoroughly explains the pros and cons of these graphics, what they show and what they don't show.

What I fear, though, is that it's too easy to read charts like these while ignoring their footnotes, or to detach the charts from their context. I wonder whether we should produce animated explanations or have presenters explain our visualizations more often, so readers won't be able to separate visuals from their context and annotations. Mediators play an important role.

(* I wrote “might” because confirmed cases aren't total cases. In the U.S. at least, these charts might be showing, at least in part, the increasing availability of testing. Also, in this pet example I'm not considering other factors, such as the number of recoveries.)

Thursday, March 26, 2020

Fourth edition of a classic, plus my favorite visualization books

This morning I replied to a post on LinkedIn asking for favorite books about data visualization. You can see my answer below, in case you're curious.

I googled Colin Ware's Information Visualization: Perception for Design, to add a link to it, and discovered that its fourth edition is being released tomorrow, March 27, at least on Amazon. What a coincidence! I just ordered it.

I have many favorite books, but here's my answer to the LinkedIn post:

I try to read all books about visualization that I find, so I have many favorites. 
Because it was so illuminating to me more than a decade ago, I love the 1st edition of Thematic Cartography and Visualization by Terry Slocum. Used copies are $7-8 these days, which is great. 
Colin Ware's Information Visualization: Perception for Design, is an absolute classic. It's in its 4th edition already. 
Isabel Meirelles' Design for Information brings the perspective of a visual designer.
Tamara Munzner's Visualization Analysis and Design
For business graphics, Stephen Few's Show Me the Numbers
Finally, William Cleveland's pair The Elements of Graphing Data and Visualizing Data, which deserve much more popularity than they have. 
Oh, and any book by Howard Wainer. He has many, compiling his articles. 
I could go on and on. There's a lot of good stuff out there these days. We're lucky.

Friday, March 20, 2020

Why not leave data visualization aside for a few hours to design an explanation graphic?

I began my career in 1997 designing not data visualizations, but visual explanations—we used to call them “infographics”—using illustrations, 3D models, animations, etc. Here's an old example.

I still enjoy that type of work, and in the past few years I've repeatedly lamented its decline in news media—see 1, 2, 3, and my own dissertation. Nowadays, most news graphics desks, at least in the English-speaking world, are focused almost exclusively on data visualizations. I love data visualization, of course, but we shouldn't ignore illustration-based explanation graphics. They are powerful and useful.

Here's a good example: Our World in Data has just partnered up with the German animation studio Kurzgesagt, which has a popular YouTube science video channel, to design an animated infographic about how COVID-19 works. It's really good (I know it's good because my attention-challenged teenager watched it until the end and learned a lot):

Thursday, March 19, 2020

The most-read story ever published by the Washington Post online is a visualization (and other reasons why your organization should invest in a graphics team)

Poynter reports that the most-read piece ever published on the Washington Post's website is a visualization-driven story, the now famous coronavirus simulator, by Harry Stevens.

(Poynter's story is very good; see also this tweet by WaPo's media columnist Paul Farhi, confirming the news.)

Here are a few more factoids for you, without trying to be exhaustive:

In 2013 the most-read piece in The New York Times online was the dialect map, How Y’all, Youse, and You Guys Talk, which still is “one of the most popular in The Times’s digital history.” And remember Snow Fall?

ProPublica's Scott Klein has just told me that “about half of our traffic that goes to journalism on our site is to news apps,” which are databases and visualizations. Back in 2010, the Texas Tribune wrote that their applications account for “a third of the site's overall traffic.”

The Financial Times's graphs and maps about the coronavirus are becoming wildly popular, and for good reason: they are excellent, overall.

I predict that the flatten-the-curve visual explanation—read about it here and here—will become the most iconic image of 2020, and one of the most influential graphics ever made.

I could go on and on.

It's puzzling to me, then, that so many organizations—not just news organizations—are reluctant to invest in a data and graphics team, or to give it the power, resources, and autonomy it needs to thrive. What are you thinking?

(Also, Pulitzer Prize Board, it's about time to create a category for this type of work, don't you think?)

Tuesday, March 17, 2020

Linear or non-linear scales? Why not both?

The coverage of coronavirus has rekindled the debate about whether most readers understand non-linear scales. In How Charts Lie I have a cute fictional example of when this type of scale is necessary: imagine that you own four gerbils, two males and two females.

The four gerbils mate and each couple gives birth to four little ones (eight little gerbils in total.) For the sake of argument, let's imagine that the parents die shortly after giving birth. The gerbils keep reproducing at this constant rate, so each generation is double the size of the previous one.
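
Here's a minimal sketch of that doubling, to show how quickly it runs away. The world population figure is a rough 2020 number assumed for this example, and counting the four founders as generation 0 is also an assumption of the sketch.

```python
WORLD_POPULATION = 7_800_000_000  # rough 2020 figure, an assumption for this sketch


def gerbils(generation, founders=4):
    """Size of a generation when each one doubles the previous (founders = generation 0)."""
    return founders * 2 ** generation


generation = 0
while gerbils(generation) <= WORLD_POPULATION:
    generation += 1
print(generation, gerbils(generation))
```

With those assumptions the loop stops at index 31, which is the 32nd generation if the founders are counted as the first.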

If you plot this exponential growth on an arithmetic Y scale, the line remains very close to the 0 baseline for ~25 generations. Therefore, it'd be impossible to estimate the rate at which you need to increase the amount of food to purchase for your adorable critters:


However, if you use a non-linear scale, the exponential growth of gerbil population becomes clearer. By the 32nd generation there'll be more gerbils in your backyard than people in the world:


When doing graphics about pandemics, we do need non-linear scales because contagion is also non-linear: if you are infected, you likely won't infect just another person, but two, three, or more every n days. That's why community mitigation strategies such as staying at home and washing your hands are so important.

But it's true that most of us have a hard time wrapping our heads around non-linear scales. What to do? Well, we can explain them. As I've said in recent talks, the impulse of too many editors when they think that readers won't understand a visualization is to avoid that visualization. That's self-defeating and wrong. If you never use a type of graphic or scale, how are your readers ever going to learn how to read it?

Another solution is to take advantage of interaction. Showing data on a linear scale is also valuable; it's not just more dramatic than a non-linear scale, it also gives readers an additional view of the data. Why not let people switch between a linear and a non-linear scale? That's exactly what Spain's El País did in this visualization.
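
What such a toggle changes is only the function that maps a value to a vertical position. Here's a rough sketch of that mapping; it isn't El País's code, and the function name, the 100-unit axis height, and the base-10 logarithm are assumptions for illustration.

```python
import math


def y_position(value, max_value, height=100.0, scale="linear", min_value=1.0):
    """Vertical position (0 = baseline, height = top) of value on a chart axis."""
    if scale == "linear":
        return height * value / max_value
    if scale == "log":
        span = math.log10(max_value) - math.log10(min_value)
        return height * (math.log10(value) - math.log10(min_value)) / span
    raise ValueError(f"unknown scale: {scale}")


counts = [10, 100, 1_000, 10_000]
for scale in ("linear", "log"):
    print(scale, [round(y_position(c, 10_000, scale=scale), 1) for c in counts])
```

On the linear axis the first three values are squeezed into the bottom tenth of the chart; on the logarithmic one each tenfold increase takes the same vertical step.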


Our World in Data's coronavirus page has a similar feature, although it's harder to see where to click to switch between scales.