Friday, June 28, 2019

Lines aren't just for time-series

The other day there was an interesting debate on Twitter. The Star Tribune's C.J. Sinner posted the graph below and wrote: “Repeat after me: Line charts should be a time series! Line charts should be a time series! Found in @cityofsaintpaul (MN) 2040 Comprehensive Plan doc.” This is the source; the graph is on page 13.

The line represents the percentage of people in each district who are people of color (POC), and the bars correspond to the percentage of participants in several community engagement events held in those same districts who were also POC. Sometimes these percentages are close to each other, and sometimes they aren't:

I replied to C.J. for two reasons. First, because even if I don't think that this graph is great (more about that later), I don't find it misleading. Second, because I don't think that “line charts” should always be time series.

Let's begin with the latter: it all depends on what we mean by “line chart”. That term is often used as a synonym for time-series line graphs, which indeed should display time on one axis. That's the strict and traditional definition of “line chart”.

But we shouldn't infer from that definition that all graphs that use lines varying in length or height should also display time on one axis. Think of density curves, connected scatter plots, or parallel coordinate plots. In the new book, How Charts Lie, I explain that I've met people who find these types of visualization confusing because they apply the wrong mental model to them, that of a time-series line chart, and they feel frustrated.

The impulse of some designers is to say “let's not frustrate any reader; let's use lines just for time-series.” That's self-defeating. If you think that a novel or unusual graphic form is the best way to tell a story, but that some readers won't understand it, don't refrain from using it; instead, explain how to read it.

As for the graphic that illustrates C.J.'s tweet, I confess I was playing devil's advocate in the discussion. I agree that going against expectations and conventions when designing a visualization is often risky, and that it's not justified in this case—but it is justified sometimes.

However, my problem with the original chart isn't, as some argued, that the line might make readers infer some sort of spurious continuity between districts. It seems to me that once you read the graph's legend and labels—something we must always do anyway—it's obvious that we should estimate percentages based on the vertical position of the points connecting the segments of the line, not on the slope of those segments.

My problem is that the line doesn't seem to be necessary, and it's not very efficient at letting you estimate percentages and differences. We could redesign the chart as a bullet graph:

However, let's suppose that we sort the districts from highest to lowest percentage of people of color, and then we add a second variable, such as the percentage of people who are 65+ years old (warning: the numbers aren't real; I made them up). Imagine, just for the sake of argument, that one of the purposes of the graph were to show that the lower the percentage of POC is in a district, the higher the proportion of elderly people becomes:

In a case like this I wouldn't oppose adding connecting lines to emphasize this higher-lower/lower-higher pattern, even if it's unorthodox. Not because the lines encode anything (they don't) but because they can be perceptual aids that highlight one of the key messages of the graph:

Going back to the original graph, here's another alternative design: if the purpose is to show whether there's a close correspondence between the percentage of POC in each district and the percentage of participants in the community engagement events who were also POC, a scatter plot may work better. This is a quick makeover with a caption explaining how to read it: