Thursday, May 21, 2020

'How Charts Lie': a few clarifications and edits

(Update 01/08/2020: you can now download most figures from the book in high resolution and in two different color schemes.)


CLARIFICATIONS FOR ALL EDITIONS

A clarification for page 44: where it says “people became richer or poorer” or “people in those countries contaminated more or less,” I think I'd add a qualifier such as “on average” just in case, as these are per capita numbers.

Second clarification, page 73: the Richter scale is based on tenfold increments of wave amplitude. That's what I mean by “stronger”; in other words, it's not the energy released, which increases at a much higher rate.
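For the numerically curious, here's a back-of-the-envelope sketch of that difference. The tenfold amplitude rule is the one mentioned above; the energy relationship (roughly a 32-fold increase per whole magnitude unit, the standard Gutenberg-Richter approximation) is my addition here, not a figure from the book:

```python
# Back-of-the-envelope comparison of wave amplitude vs. energy released
# for a given jump in Richter magnitude. The 10**(1.5 * dM) energy rule
# is the standard Gutenberg-Richter approximation (an assumption added
# here, not a number taken from the book).

def amplitude_ratio(delta_magnitude):
    """Each whole unit of magnitude means a tenfold increase in wave amplitude."""
    return 10 ** delta_magnitude

def energy_ratio(delta_magnitude):
    """Energy released grows roughly 32x per whole unit of magnitude."""
    return 10 ** (1.5 * delta_magnitude)

for dm in (1, 2, 3):
    print(f"+{dm} magnitude: amplitude x{amplitude_ratio(dm):,.0f}, "
          f"energy roughly x{energy_ratio(dm):,.0f}")
```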

Third clarification, page 74: in the fictional example about gerbil population growth, I assumed that the parents die shortly after giving birth; otherwise, by the second generation we wouldn't have double the gerbils (8 children) but triple, a total of 12: 8 children and their 4 parents.

Fourth clarification: whenever you see Kaplan-Meier charts in the book, assume that lines have been smoothed; actual Kaplan-Meier estimators create lines that look like staircases.
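And if you're curious why real Kaplan-Meier curves look like staircases, here's a minimal sketch of the estimator on made-up survival times (a hypothetical example, not data from the book): the estimated survival probability only drops at observed event times and stays flat in between.

```python
# Minimal Kaplan-Meier estimator on made-up data (hypothetical example,
# not from the book). S(t) only changes at event times, which is why a
# faithful plot looks like a staircase rather than a smooth curve.

def kaplan_meier(times, events):
    """times: follow-up time per subject; events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = [(0.0, 1.0)]
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(e for tt, e in data if tt == t)   # events at this time
        tied = sum(1 for tt, _ in data if tt == t)     # subjects leaving the risk set
        survival *= 1 - deaths / n_at_risk
        n_at_risk -= tied
        curve.append((t, survival))
        i += tied
    return curve

times  = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"t = {t:>4}: S(t) = {s:.2f}")
```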


EDITS FOR THE FIRST PRINT EDITION (hard cover)

If you read the first print edition of How Charts Lie you may notice a few printing and layout errors. These should already be corrected in the e-book, and will also be corrected in the paperback edition, to be released in October 2020. If you detect anything that looks strange other than these, please let me know.

I've printed out cards containing the main corrections. If you want to receive one and keep it inside your copy of the print book, contact me.

On page 24 the transparency effects that should emphasize or hide parts of the charts disappeared between the galleys—where the graphic was perfect—and the final printing. Mysteries. Here's what that graphic should look like:


On page 45 the line corresponding to the United States didn't show when printed. Here's the graphic:


There's a minor issue with the gradient on the second bar of the chart on page 142: it doesn't fade to white. It should look like this:



A chart on page 116 is slightly misplaced.

On page 49, the second paragraph should read: “Imagine that a district's circle sits on the +20 line above the baseline. This means that Republicans lost 10 percentage points, which went to Democrats, for a total of +20 percentage point change in their favor (there weren't third-party candidates, I guess.)”

On page 92 there's a needless “is” in a sentence that should read “Unless a crime is premeditated...” On page 104 there's an “s” missing at the end of “assess” (this one made me giggle.) At the bottom of page 128 there's an “as” missing before “Assange”. And on page 157 there's a tiny label that should read MS instead of MI. The last label on the Y-scale of the chart on page 172 should read “600” instead of “490”.

Wednesday, May 20, 2020

About that weird Georgia chart

Visualization social media has been busy mocking the following chart by the Georgia Department of Public Health. Pay attention to its horizontal axis:


I never attribute malice when sloppiness is a more parsimonious explanation. I guess that whoever designed this chart thought that sorting the bar groups from highest to lowest, instead of chronologically, was a good idea.

This is not wrong per se; it's possible to think of situations in which it's useful to arrange your data like this during analysis. As always happens in visualization, design choices depend on purpose.

In this case, though, the purpose is to show “the most impacted counties over the past 15 days and the number of cases over time,” so separating the counties and then sorting their bars chronologically seems to make more sense. Something like this:


Visualization books, including mine, spend many pages discussing how to choose encodings to match the intended purpose of every graphic, but we pay too little attention to the nuances of sorting: should we do it alphabetically, by geographic unit, by time, from highest to lowest, from lowest to highest—or do we need an ad-hoc criterion? Or should we make the graphic interactive and let people choose? As always, the answer will depend on what we want the reader to get from the visualization.
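To make a couple of those options concrete, here's a small hypothetical sketch in pandas, with invented county names and numbers (nothing from the Georgia Department of Public Health), contrasting an ordering by magnitude, roughly what the original chart did, with the chronological ordering suggested above:

```python
import pandas as pd

# Hypothetical daily case counts for three invented counties; none of these
# numbers come from the Georgia Department of Public Health.
df = pd.DataFrame({
    "county": ["Hall", "Cobb", "Fulton"] * 3,
    "date": pd.to_datetime(["2020-04-26"] * 3 + ["2020-04-27"] * 3 + ["2020-04-28"] * 3),
    "cases": [45, 25, 60, 30, 40, 35, 38, 22, 50],
})

# Roughly what the Georgia chart did: order the bars by case count,
# highest to lowest, regardless of county or date.
by_magnitude = df.sort_values("cases", ascending=False)

# The ordering suggested above: keep each county's bars together and
# sort them chronologically, so trends over time are readable.
by_time = df.sort_values(["county", "date"])

print(by_magnitude.to_string(index=False), "\n")
print(by_time.to_string(index=False))
```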

Thursday, May 7, 2020

The problem with inconsistent and unlabeled scales

One of the strategies to come up with novel ways to display data is to combine existing graphic forms. This morning The New York Times published a story titled 'Most States That Are Reopening Fail to Meet White House Guidelines' that contains a series of square equal area cartograms that are, at the same time, trellis charts. The piece is really nice.

There's something about it that worries me a bit, though: charts don't have scales. Removing scales from graphics seems to be getting more popular lately among data journalists, and it works in some cases here: some of these charts have horizontal reference lines—see animation on the right—that help you get a sense of proportion and variation.

But the following set of line charts lacks any reference and, moreover, it seems that each one is based on a different scale: New York has more daily confirmed cases than Florida—thousands versus hundreds—but the last point on Florida's line is higher than New York's. New Hampshire has a 7-day average of around 100 cases; Maine has a bit more than 20. I understand that the goal of these graphics is to reveal upward and downward trends, not the case count itself, but I fear that this design choice may mislead some readers:



Here are those charts with scales:


What could be an alternative here, I wonder? It's tricky. There might not be an ideal solution, as often happens in visualization; adding detailed labels would clutter these tiny charts. Perhaps instead of daily new confirmed cases we could show some sort of index—percentage change from a common starting point for all states—or the variation compared with the previous day or week?
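As a rough illustration of that first idea, here's how such an index could be computed; it's a quick sketch with invented numbers, not the data behind the Times's charts. Dividing each state's series by its value on a common starting date makes every line begin at 100, so the small multiples become comparable even without detailed labels:

```python
import pandas as pd

# Invented 7-day averages of new confirmed cases for two states
# (hypothetical numbers, not the data behind the NYT charts).
daily = pd.DataFrame({
    "New York": [9500, 9100, 8700, 8200, 7800, 7400, 7100],
    "Florida": [600, 640, 610, 650, 700, 720, 760],
}, index=pd.date_range("2020-04-27", periods=7))

# Index each series to its value on the first date: every state starts
# at 100, so the slopes are comparable even though the raw counts differ
# by an order of magnitude.
indexed = daily.div(daily.iloc[0]) * 100

print(indexed.round(1))
```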

Monday, May 4, 2020

The Dawn of a Philosophy of Visualization

Cover illustration by Prisca Schmarsow of Eyedea Studio
My new article for Nightingale, the online magazine of the Data Visualization Society, has just been released. It's titled The Dawn of a Philosophy of Visualization, and it adapts the foreword that I wrote for a new book, Data Visualization in Society, published by Amsterdam University Press.

The book will be presented in a public e-event on Wednesday. Sign up for it here. To read the book, follow this link (it's open access.)

Monday, April 27, 2020

Latest project: Search waves during a pandemic

The latest project I've art-directed has just been launched. It's titled Searching COVID-19 and it was produced by Schema Design in collaboration with the Google News Initiative and Axios, which has published its own version.

The visualization consists of a dynamic beeswarm plot in which each bubble represents a query reaching the top searches in a U.S. state (states may appear more than once in the plot.) Bubbles ungroup little by little to reveal the patterns mentioned in the project's opening copy; for instance, in the early stages of a pandemic people often look for information about the disease itself—“what is” searches,—and later we clearly see an increase in searches for how to prepare for the disease or for its consequences—“how to” searches.

You can find many previous projects here, and this is a 2017 article about the collaboration.

Friday, April 24, 2020

Latest project: Visualizing fact-checks

The latest project in the ongoing collaboration between the Google News Initiative and several designers and studios is this large visualization for Poynter's International Fact-Checking Network (IFCN). It shows nearly 4,000 fact-checks from more than 70 countries.

The graphics were designed by Polygraph, mainly by Amelia Wattenberger (don't miss her book about D3.js.) The project has two parts: the interactive visualizations themselves and a searchable database.

My role, as always, was limited to offering suggestions and feedback.

You can see other projects in collaboration with many other designers here.


Tuesday, April 21, 2020

Paul Mijksenaar, information designer

During my formative years, back in the late 1990s, I read widely into the literature of information design, wayfinding, and signage. I still own a small collection of books about those topics; among them, there are two by Dutch designer Paul Mijksenaar that I've loved for more than two decades: Visual Function and Open Here: the Art of Instructional Design.

Roots, a publication that celebrates the history of Dutch graphic design, has just published a nice profile of Mijksenaar. It contains many details I hadn't heard about; for instance, I knew about Mijksenaar's wayfinding signage work for airports, particularly Schiphol, but what I didn't know is that he also created the signs that appear in Steven Spielberg's movie The Terminal.

Tuesday, April 7, 2020

Interview in The effective statistician

As a journalist and a designer who writes for other journalists and designers, I'm always puzzled—and also flattered and a bit terrified—when I get invited to conferences or podcasts run by statisticians, data scientists, or information and library science types, but here we are. A couple of weeks ago I had a nice conversation with Alexander Schacht, a mathematician and biostatistician at UCB, who has a podcast called The Effective Statistician.

We talked about charts, the coronavirus, the importance of design for scientists and researchers, and the importance of science for designers. I mentioned tools I love, such as Datawrapper, Flourish, and iNZight; books such as Stephen Few's Show Me the Numbers and Cole Nussbaumer Knaflic's Storytelling With Data; Tufte's, Tukey's, and Cleveland's work; and Michael Friendly's work (1, 2, and his upcoming book) on the history of visualization. I hope you'll enjoy it.

Friday, April 3, 2020

Special issue of the Information Design Journal

The Information Design Journal has a free special issue out collecting the submissions to the 2018 Information+ conference (the 2020 edition has been postponed for obvious reasons.)

I have not read this yet, but I took a look at the table of contents and spotted a few papers that sound very promising, such as “Feeling numbers: The emotional impact of proximity techniques in visualization” (PDF) and “Belief at first sight: Data visualization and the rationalization of seeing” (PDF).

Good weekend reading.

Wednesday, April 1, 2020

Interviews in CNN and CBC

In case you're interested, I've just been interviewed by CNN in Spanish about how to be more attentive to the graphics we're seeing these days. I went over some basics, such as verifying the sources of the data, carefully assessing what it is that graphics measure, etc. It's essentially what I explained in How Charts Lie.

I've also appeared in a piece by CBC's Roberto Rocha titled 'The flurry of daily pandemic data can be overwhelming. Here's how to make sense of it.'

I commented on Johns Hopkins University's famous dashboard of confirmed coronavirus cases. I like that visualization, and we should all praise its authors, but I have some doubts about the decision to represent cases by country in some regions (Europe, Africa) and by province or state in others (the United States, China). I think this can be easily misread: it makes it seem that the U.S. has many more cases, in comparison to Europe, than it actually does. That's on top of a pretty common problem in proportional symbol maps like this one: overlap.

Saturday, March 28, 2020

Rates of change are tricky

Let me begin by saying that (a) we should all appreciate the effort that so many journalists are making to keep the public informed about the coronavirus pandemic; shifts of 10, 12, 14 hours and more are common (subscribe to your favorite news publications, people, be responsible!) (b) Commenting on graphics is easier than making those graphics. As a designer myself, I know how hard it is to navigate the many challenges and trade-offs visualization poses.

This said, I often ponder how we can make visualizations more approachable and understandable. Take the following graph from this New York Times story:


The vertical position of the points on the line represents the percentage change of confirmed cases over the previous 7 days. There are other ways to show change—think of bars with arrowheads pointing up,—but they are clunkier. This graph, if you know how to read it, works fine: the goal is to bring those lines down to the +0% baseline, or close. This point is explained in the body of the story.

However, imagine the following realistic scenario: someone takes a screenshot of this graph and publishes it on social media, adding some personal comments or wild inferences (1, 2). I wonder whether graphs like this, when isolated from what surrounded them originally, might make some readers reach dubious conclusions or feel too optimistic and confident (“most lines are going down! You're all overreacting! Time to stop worrying and go back to work!”)

Those readers would be missing a crucial point: a 33% increase (line is low) is indeed, in general, better than an 80% one (line is high), but we need to know more. Prior conditions matter. At the beginning of the chart, the curves are pretty high probably because those are the early stages of each outbreak; few cases were detected. If a city begins with 10 confirmed cases, and later detects 8 more, for a total of 18, it has an 80% increase.

But if we already have many confirmed cases, for instance 1,000, and later we end up with 1,330, we've experienced an increase of +33%. It is better to have +33% than +80%, as it might mean* we're stretching the time it takes cumulative confirmed cases to double or triple—we're flattening the curve, as we say these days—but readers shouldn't ignore other facts. Even a “tiny” 10% increase, if experienced when you're already dealing with tens of thousands of infections, may be catastrophic: hospitals could be even more overwhelmed, leading to more deaths. Think of the situation in Lombardy.
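To make the arithmetic explicit, here's a quick sketch using the made-up figures above, plus an invented 50,000-case scenario for the last point; none of this is the data behind the Times's charts:

```python
import math

def weekly_change(previous, current):
    """Percentage change of cumulative confirmed cases over the past week."""
    return (current - previous) / previous * 100

def doubling_time_weeks(weekly_pct):
    """Weeks it takes cumulative cases to double at a constant weekly rate."""
    return math.log(2) / math.log(1 + weekly_pct / 100)

# The two fictional scenarios from the text:
print(weekly_change(10, 18))      # 80.0 -> small city, early outbreak
print(weekly_change(1000, 1330))  # 33.0 -> larger caseload, slower growth

# A lower rate stretches the doubling time...
print(doubling_time_weeks(80))    # ~1.2 weeks
print(doubling_time_weeks(33))    # ~2.4 weeks

# ...but even a "tiny" 10% weekly increase on top of 50,000 cases
# still means 5,000 new confirmed cases in a week.
print(0.10 * 50_000)              # 5000.0
```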

The NYT story contains another graphic comparing rates of change with confirmed cases per thousand people but, as the Times journalists themselves acknowledge, it's “hard to read”:


What to do? It's tricky. Maybe show more and explain more, as I've suggested before? The New York Times is doing a good job: the body of the story thoroughly explains the pros and cons of these graphics, what they show and what they don't show.

What I fear, though, is that it's too easy to read charts like these while ignoring their footnotes, or to detach the charts from their context. I wonder whether we should produce animated explanations or have presenters explain our visualizations more often, so readers won't be able to separate visuals from their context and annotations. Mediators play an important role.

(* I wrote “might” because confirmed cases aren't total cases. In the U.S. at least, these charts might be showing, at least in part, the increasing availability of testing. Also, in this pet example I'm not considering other factors, such as the number of recoveries.)

Thursday, March 26, 2020

Fourth edition of a classic —plus my favorite visualization books

This morning I replied to a post on LinkedIn asking for favorite books about data visualization. You can see my answer below, in case you're curious.

I googled Colin Ware's Information Visualization: Perception for Design, to add a link to it, and discovered that its fourth edition is being released tomorrow, March 27, at least on Amazon. What a coincidence! I just ordered it.

I have many favorite books, but here's my answer to the LinkedIn post:

I try to read all books about visualization that I find, so I have many favorites. 
Because it was so illuminating to me more than a decade ago, I love the 1st edition of Thematic Cartography and Visualization by Terry Slocum. Used copies are $7-8 these days, which is great. 
Colin Ware's Information Visualization: Perception for Design, is an absolute classic. It's in its 4th edition already. 
Isabel Meirelles' Design for Information brings the perspective of a visual designer.
Tamara Munzner's Visualization Analysis and Design
For business graphics, Stephen Few's Show Me the Numbers
 Finally, William Cleveland's pair The Elements of Graphing Data and Visualizing Data, which deserve much more popularity than they have. 
Oh, and any book by Howard Wainer; he has many that compile his articles. 
I could go on and on. There's a lot of good stuff out there these days. We're lucky.

Friday, March 20, 2020

Why not leave data visualization aside for a few hours to design an explanation graphic?

I began my career in 1997 designing not data visualizations, but visual explanations—we used to call them “infographics”—using illustrations, 3D models, animations, etc. Here's an old example.

I still enjoy that type of work, and in the past few years I've repeatedly lamented its decline in news media—see 1, 2, 3, and my own dissertation. Nowadays, most news graphics desks, at least in the English-speaking world, are focused almost exclusively on data visualizations. I love data visualization, of course, but we shouldn't ignore illustration-based explanation graphics. They are powerful and useful.

Here's a good example: Our World in Data has just partnered up with the German animation studio Kurzgesagt, which has a popular YouTube science video channel, to design an animated infographic about how COVID-19 works. It's really good (I know it's good because my attention-challenged teenager watched it until the end and learned a lot):

Thursday, March 19, 2020

The most-read story ever published by the Washington Post online is a visualization (and other reasons why your organization should invest in a graphics team)

Poynter reports that the most-read piece ever published on the Washington Post's website is a visualization-driven story: the now famous coronavirus simulator, by Harry Stevens.

(Poynter's story is very good; see also this tweet by WaPo's media columnist Paul Farhi, confirming the news.)

Here are a few more factoids for you, without trying to be exhaustive:

In 2013 the most-read piece in The New York Times online was the dialect map, How Y’all, Youse, and You Guys Talk, which still is “one of the most popular in The Times’s digital history.” And remember Snow Fall?

ProPublica's Scott Klein has just told me that “about half of our traffic that goes to journalism on our site is to news apps,” which are databases and visualizations. Back in 2010, the Texas Tribune wrote that their applications account for “a third of the site's overall traffic.”

The Financial Times's graphs and maps about the coronavirus are becoming wildly popular, and for good reason: they are excellent, overall.

I predict that the flatten-the-curve visual explanation—read about it here and here—will become the most iconic image of 2020, and one of the most influential graphics ever made.

I could go on and on.

It's puzzling to me, then, that so many organizations—not just news organizations—are reluctant to invest in a data and graphics team, or to give it the power, resources, and autonomy it needs to thrive. What are you thinking?

(Also, Pulitzer Prize Board, it's about time to create a category for this type of work, don't you think?)

Tuesday, March 17, 2020

Linear or non-linear scales? Why not both?

The coverage of coronavirus has rekindled the debate about whether most readers understand non-linear scales. In How Charts Lie I have a cute fictional example of when this type of scale is necessary: imagine that you own four gerbils, two males and two females.

The four gerbils mate and each couple gives birth to four little ones (eight little gerbils in total.) For the sake of argument, let's imagine that the parents die shortly after giving birth. The gerbils keep reproducing at this constant rate, so each generation is double the size of the previous one.

If you plot this exponential growth on an arithmetic Y scale, the line remains very close to the 0 baseline for ~25 generations. Therefore, it'd be impossible to estimate the rate at which you need to increase the amount of food to purchase for your adorable critters:


However, if you use a non-linear scale, the exponential growth of gerbil population becomes clearer. By the 32nd generation there'll be more gerbils in your backyard than people in the world:


When doing graphics about pandemics, we do need non-linear scales because contagion is also non-linear: if you are infected, you likely won't infect just one other person, but two, three, or more every n days. That's why community mitigation strategies such as staying at home and washing your hands are so important.
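If you'd like to reproduce the two gerbil charts above, here's a minimal matplotlib sketch. It's my own quick approximation, not the code behind the book's figures:

```python
import matplotlib.pyplot as plt

# Gerbil population doubling each generation: 4, 8, 16, 32, ...
generations = list(range(33))
population = [4 * 2 ** g for g in generations]

fig, (linear, log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear (arithmetic) Y scale: the line hugs the baseline for most generations.
linear.plot(generations, population)
linear.set_title("Linear scale")

# Logarithmic Y scale: constant doubling becomes a straight line.
log.plot(generations, population)
log.set_yscale("log")
log.set_title("Log scale")

for ax in (linear, log):
    ax.set_xlabel("Generation")
    ax.set_ylabel("Gerbils")

# World population (roughly 7.8 billion in 2020) for reference.
log.axhline(7.8e9, linestyle="--", color="gray")

plt.tight_layout()
plt.show()
```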

But it's true that most of us have a hard time wrapping our heads around non-linear scales. What to do? Well, we can explain them. As I've said in recent talks, the impulse of too many editors when they think that readers won't understand a visualization is to avoid that visualization. That's self-defeating and wrong. If you never use a type of graphic or scale, how are your readers ever going to learn how to read it?

Another solution is to take advantage of interaction. Showing data on a linear scale is also valuable; it's not just more dramatic than a non-linear scale, it also gives readers an additional view of the data. Why not let people switch between a linear and a non-linear scale? That's exactly what Spain's El País did in this visualization.


Our World in Data's coronavirus page has a similar feature, although it's harder to see where to click to switch between scales.



Monday, March 16, 2020

The ethics of counting

The coronavirus pandemic is being covered widely, deeply—and not always correctly. There've been plenty of instances of innumeracy and dubious visualizations, as Amanda Makulec said in an article about ethics and good practices. The other day I made some suggestions myself: never publish anything without consulting experts, for instance.

In a happy coincidence, over the weekend I read the manuscript of a timely book that will come out in October this year. Its title is Counting: How We Use Numbers to Decide What Matters by Deborah Stone, a professor emerita at Brandeis University. If you follow this blog or liked How Charts Lie and The Truthful Art, you'll enjoy her book as well.

I've looked into Stone's previous work and found an intriguing article of hers, 'The Ethics of Counting', that anticipates some of the themes that appear in Counting. It's a transcript of her acceptance speech for the 2017 James Madison Award, and it's organized into five parts:
What does it mean to count?
How do numbers get their meaning?
How do numbers get their authority?
How can counting change hearts and minds?
Are there some things we shouldn’t count?
Don't miss it.

Sunday, March 15, 2020

Before showing any data, explain how your visualization works

When discussing how to make the public more graphically literate (“graphicate”) in recent talks about How Charts Lie, I've been advocating for explaining how unfamiliar visualizations work before we reveal any data. I use this famous Hans Rosling video as an example. I described it in my article for IEEE, too. I emphasized that the part at the beginning, when Rosling talks about the encodings—horizontal and vertical position, bubble size, color,—is crucial. You can see one of those talks here; jump to minute 15'.

That's why I'm happy to see Lazaro Gamio's most recent visualization about the possible impacts of the coronavirus at The New York Times. It contains a bubble scatter plot, and it applies Rosling's technique. It's a great use of the annotation layer: (a) “Each bubble on this chart represents an occupation. The bigger the bubble, the more people do that job,” (b) “the vertical position of each bubble is a measure of how often workers in a given profession are exposed to disease and infection,” (c) “the horizontal position is a measure of how close people are to others during their workdays.” Well done.
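For anyone curious how a chart like that is put together, here's a small matplotlib sketch with invented occupations and numbers (not the Times's data or code) that uses the same three encodings plus a bare-bones annotation layer:

```python
import matplotlib.pyplot as plt

# Invented occupations and values, just to illustrate the three encodings
# described above; these are not the figures used by The New York Times.
occupations = ["Dental hygienists", "Cashiers", "Software developers"]
proximity = [95, 80, 30]        # x: how close workers are to others
exposure = [90, 40, 5]          # y: how often they're exposed to disease
workers = [220_000, 3_600_000, 1_500_000]  # bubble area: people in the job

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(proximity, exposure,
           s=[w / 10_000 for w in workers],  # scale counts down to point sizes
           alpha=0.5)

# The annotation layer: explain the encodings on the chart itself.
for name, x, y in zip(occupations, proximity, exposure):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(8, 8))

ax.set_xlabel("Physical proximity to other people \u2192")
ax.set_ylabel("Exposure to disease and infection \u2192")
ax.set_title("Each bubble is an occupation; bigger bubble = more workers")

plt.tight_layout()
plt.show()
```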


Saturday, March 14, 2020

Trump sent his followers a signed chart; I'm doing the same with mine

It seems that President Trump is (mis)using the copy of How Charts Lie I sent him months ago to thank him for the free publicity he's given me—the review in The Economist mentioned this fact in its last paragraph.

According to CNN, Trump sent the following chart to his fans and followers. His point is that the markets experienced a rapid recovery after his press conference about the coronavirus yesterday, Friday 13. The vertical black line on the right-hand side of the chart marks the time of that press conference:


It's hard to see, but the chart shows fluctuations of the Dow Jones Industrial Average just on Friday 13, which I don't think is sufficient information. This is a case of convenient cherry-picking. I decided to send my followers my own signed chart, this one showing the variation since January 1, 2020, as Chinese authorities informed the World Health Organization of the first cases of coronavirus at the end of December 2019. I also added some annotations:

I sincerely want to thank President Trump again for all his efforts to promote How Charts Lie. Whenever he tweets a map or a chart, I think that sales increase.

Friday, March 13, 2020

Explaining and simulating the coronavirus

The other day a reporter asked me about my favorite visualizations about the coronavirus. I've been hesitant about the quality of many graphics I've seen (this, this, this, and this), so I chose the “flatten the curve” abstract diagram—many of its versions can't be called data visualizations, as they don't encode actual data—particularly when adding a verbal annotation layer to it, like CNN's Brian Stelter did. Stelter acted like the “mediators” I discussed in my recent article for IEEE. We shouldn't just show information to viewers or readers; we ought to explain it.

Nicholas Kristof and Stuart Thompson have just released another intriguing piece. This one lets you simulate how the curve would change depending on how early or late you intervene to stop the spread of the virus, or how mild or aggressive your actions are:

In any case, here's my take about visualizing anything related to the coronavirus—or anything at all, for that matter: don't mindlessly apply your generic statistical or visualization skills to data downloaded from public sources when covering serious topics; you likely lack domain-specific knowledge, which is essential to getting things right. Always consult with an expert or two. Seek the help of epidemiologists, biostatisticians, or public health specialists. And, when in doubt, err on the side of caution and don't publish anything.

UPDATE: I'd revisit this 2017 article by Steve Wexler and Jeff Shaffer: “Publishing bogus findings undermines our credibility. It suggests we value style over substance, that we don’t know enough to relentlessly question our data sources.”

Wednesday, March 4, 2020

An opinion article for IEEE

IEEE Computer Graphics and Applications has just published my opinion article “If Anything on This Graphic Causes Confusion, Discard the Entire Product.” If you're wondering where that quirky title comes from, see the last part of this article for Nightingale from a few months ago. “If Anything...” also deals with Sharpiegate, but its focus is a bit different: I talk about the role that people who mediate between a visualization and its intended audience—a TV presenter explaining how to read a graphic, for instance—may play.

The article is paywalled, but here's an early and unedited draft.