Monday, February 20, 2017

Extrapolation is risky business

Here's an interesting case of potential trumpery™ that I may use in the lecture tour and book. This is a post-in-progress, so feel free to send suggestions.

Infowars, The Washington Times (WT) and other publications are reporting that “Nearly 2 million non-citizen Hispanics are illegally registered to vote.” Infowars adds that “a survey of Hispanics in the U.S. revealed as many as two million non-citizens are illegally registered to vote, reinforcing claims by President Donald Trump that millions of illegal votes were cast in the 2016 election.”

What's the evidence for the claim of 2 million illegal voters? There is none. It's based on a reckless extrapolation from a study that was designed for a completely different purpose.

It all began with this 2013 survey of 800 Hispanic adults conducted by McLaughlin & Associates. The survey itself looks fine to me. I asked the author, John McLaughlin, and he provided detailed explanations of the methodology, how the sample was randomly chosen, etc. My problem, then, is not the survey, but the far-fetched extrapolations that Infowars and WT made, which have unfortunately gone viral.

On page 68 of the summary of results you will see that among the Hispanics who aren't citizens, 13% said that they are registered to vote:

The stories at Infowars and WT quote James D. Agresti, who leads a think-tank called Just Facts:

[Agresti] applied the 13 percent figure to 2013 U.S. Census numbers for non-citizen Hispanic adults. In 2013, the Census reported that 11.8 million non-citizen Hispanic adults lived here, which would amount to 1.5 million illegally registered Latinos.
Accounting for the margin of error based on the sample size of non-citizens, Mr. Agresti calculated that the number of illegally registered Hispanics could range from 1.0 million to 2.1 million.
“Contrary to the claims of many media outlets and so-called fact-checkers, this nationally representative scientific poll confirms that a sizable number of non-citizens in the U.S. are registered to vote,” Mr. Agresti said.

I thought that Agresti's reasoning was a bit off, so I took a look at the data.

First, according to WT and Infowars, 56% of people in the survey (448 out of 800) were non-citizens. But that figure is incorrect. As the documentation of the survey itself explains, they didn't ask all 800 people about their citizenship. They only asked those who said that they were born outside of the United States.

Here is the actual breakdown of approximate percentages and corresponding number of people, based on the documentation of the survey (see page 4):

So it's 263 non-citizens, not 448. Of those, 13% said they are registered to vote anyway. That is around 34 people out of a sample of 800.
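As a quick sanity check on those counts, here is a minimal sketch that uses only the figures quoted above. The variable names are mine, not the survey's.

```python
# Figures quoted in the post; variable names are illustrative only.
SAMPLE_SIZE = 800            # Hispanic adults surveyed
REPORTED_NONCITIZENS = 448   # figure used by WT and Infowars
ACTUAL_NONCITIZENS = 263     # figure from the survey documentation
REGISTERED_SHARE = 0.13      # non-citizens who said they are registered

reported_share = REPORTED_NONCITIZENS / SAMPLE_SIZE
actual_share = ACTUAL_NONCITIZENS / SAMPLE_SIZE
registered_count = round(ACTUAL_NONCITIZENS * REGISTERED_SHARE)

print(f"Reported non-citizen share: {reported_share:.0%}")
print(f"Actual non-citizen share: {actual_share:.0%}")
print(f"Non-citizens saying they are registered: about {registered_count}")
```

The point of the check is only that the 13% applies to a much smaller group than the stories implied.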

I sent an e-mail to Agresti pointing out that his initial calculations were based on an incorrect number of non-citizen Hispanics, 448 instead of 263. He replied very graciously, acknowledged the mistake, and proposed this correction with a larger margin of error:
For 2013, the year of the survey, the Census Bureau reports that 11,779,000 Hispanic non-citizens aged 18 and older resided in the United States. At a 13% registration rate, this is 1,531,270 Hispanic non-citizens registered to vote. Accounting for the sampling margin of error, there were about 264 non-citizens in this survey. In a population of 11.8 million, the margin of error for a sample of 264 is 6.0% with 95% confidence. Applied to the results of the survey, this is 824,530 to 2,238,010 Hispanic non-citizens registered to vote (with 95% confidence).
But this is still wrong. That 34 may look worrying (update, see the comments section: it's actually just 29 people), but it could be due to questions that were not well understood, even though they were clearly worded (they were; I checked); to respondents not being willing to disclose their immigration situation, voter registration status, etc.; or even to those people lying about any of those. These are crucial factors to ponder.
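For reference, the arithmetic in Agresti's corrected estimate reproduces exactly from the figures he quotes. A minimal sketch follows; the names are mine, and the guess that his 6.0% margin comes from the worst-case p = 0.5 formula is my assumption, not something he states.

```python
import math

# Figures taken from Agresti's corrected estimate as quoted above.
POPULATION = 11_779_000   # Hispanic non-citizen adults, 2013 Census figure
RATE_PCT = 13             # share who said they are registered, in percent
MARGIN_PCT = 6            # margin of error he reports, in percentage points
SUBSAMPLE = 264           # non-citizens in the survey, per his message

def share_of_population(percent: int) -> int:
    """Apply a whole-number percentage to the population, in integer arithmetic."""
    return POPULATION * percent // 100

central = share_of_population(RATE_PCT)
low = share_of_population(RATE_PCT - MARGIN_PCT)
high = share_of_population(RATE_PCT + MARGIN_PCT)

# Assumption (mine): the 6.0% margin matches the worst-case p = 0.5 formula.
worst_case_margin = 1.96 * math.sqrt(0.5 * 0.5 / SUBSAMPLE)

print(central, low, high)
print(round(worst_case_margin, 3))
```

Since the multiplication checks out, the objection is to the inputs and to what the survey can support, not to the arithmetic.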

Moreover, the original survey by McLaughlin was designed with a specific purpose —asking Hispanics about politics— and it must be used just for that. If you want to analyze voter registration fraud, you ought to design a completely different study with a questionnaire crafted with that goal, and to help overcome the aforementioned challenges, including, for instance, control or repeated questions to dodge misunderstandings or lies. I'd add that the sample of such a survey should be just of non-citizen Hispanics —not of Hispanics in general— to be truly representative.

Also, confidence intervals and their margins of error aren't particularly precise, and ought to be used with great care, even when the sample is perfectly representative and you do no extrapolation from it to its population. Statistician Heather Krause has this excellent summary about their many limitations, and about why they are often much wider than they look. Andrew Gelman, also a statistician, has written extensively (also this, and this) about why using confidence intervals to make sweeping inferences is risky. This other article is also relevant.

Just for fun, I computed my own extrapolations using a different method: calculating the margin of error of the original percentage, 13%. There are online tools, but I prefer to do it the back-of-the-napkin way, with pencil, paper, and a basic calculator.

First, the formula to calculate a confidence interval of a sample proportion is:
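Written out, and assuming the usual normal-approximation interval for a proportion (which matches the z, p, and 1-p terms discussed next; n, the sample size, is my own label):

```latex
p \pm z \sqrt{\frac{p\,(1 - p)}{n}}
```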

This looks much more complicated than it is. First, that z value in there is 1.96 when we want a confidence level of 95% —don't worry about where that comes from; if you want to learn more about it, read the middle chapters of The Truthful Art.

So, z = 1.96. Let's move on.

What about p? That is the proportion that those 34 non-citizen Hispanics who said they were registered to vote represent out of the 263 non-U.S.-born non-citizens. So: 13%.

In statistics, percentages are often represented as proportions of 1.0. Therefore, 13% becomes 0.13, and the 1-p in the formula becomes 0.87 (that's the remaining 87% of the 263).

Now that we know that z = 1.96 and p = 0.13, let's plug them into the formula, along with the sample size of 263. Here is the result:

That 0.04 means +/-4 percentage points. That's the margin of error that surrounds the 13% figure. 
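The same back-of-the-napkin calculation can be sketched in a few lines of Python. This assumes the normal-approximation formula and the 263-person subsample; the function name is hypothetical.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.13   # share of non-citizens who said they are registered
n = 263    # non-citizens in the sample

moe = margin_of_error(p, n)
lower, upper = p - moe, p + moe

print(f"Margin of error: {moe:.3f}")
print(f"Interval: {lower:.0%} to {upper:.0%}")
```

Rounded to whole percentage points, this gives the 9% to 17% range.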

Therefore, I can claim that if I could run a survey like this —with the exact same sample size and the same design— 100 times, I believe that 95 of them would contain the percentage of non-citizen Hispanics in the population who would say that they are registered to vote, and that it'd be within the 9% to 17% (13% +/- 4) boundaries of the confidence interval. I cannot say the same about the remaining 5 surveys. In those, the results could be completely different.

However, this calculation would only work if the percentage is close to 50% (see the comments section), and if the sample is carefully and randomly chosen specifically in relation to the question at hand. If it isn't, as is the case here, uncertainty may increase astronomically.

This is, by the way, without taking into account other possible uncertainties, like the one surrounding the 11.8 million figure from the Census, which I didn't bother to check.

Again, what I've done here is just an arithmetic game. I think that all these figures and computations are way too uncertain to say anything that isn't absurd. Based solely on the survey data, we cannot suggest that we have an illegal voter problem in the U.S. —or that we don't. The data from the survey is useless for this purpose, and it certainly doesn't support a headline saying that 2 million people are illegally registered to vote.* Besides the problems with casually extrapolating from a sample, the survey wasn't designed to analyze voter fraud anyway.

(My friends, statisticians Heather Krause, Diego Kuonen, Jerzy Wieczorek, and Mark Hansen, read this post and provided very valuable feedback. Thanks a lot!)

This funny XKCD cartoon is worth remembering, by the way:

*Disclaimer: I am not opposed to requiring a photo ID to vote in principle. There are arguments for and against it in the U.S.: we have solid evidence that photo ID laws are used to restrict the vote of minorities (this book is a great starting point), but I also understand the concerns of those who want to keep elections 100% clean. I'm Spanish, and all Spaniards have a DNI (National Identification Document), which you must show to vote. I can't see why this cannot happen in the U.S., too. There are some big and hairy “buts” in this comparison, though: Spain's DNI is extremely easy and inexpensive to get. And we are registered to vote by default.

UPDATE (02/21/2017): The great Mark Hansen, from Columbia University, has just sent me an e-mail with these comments about the confidence intervals:

But what does this interval 13%+/-4 mean? Suppose 100 other organizations ran the same survey —with the exact same sample size and the same design -- but drawing their own sample of the population. OK 100 is large, but there are lots of polling organizations out there taking the public's temperature on various topics. Suppose each of the 100 groups then computes an interval like I've done here. In some samples, they will again find 34 people claiming to have registered to vote. But some groups will have a number that's larger, and some will have a number that's smaller. It depends on the sample the group has drawn.

However, because they are all taking random samples, statisticians assure us that we should expect 95 of the 100 intervals they've constructed will contain the number you're interested in, the true percentage of non-citizen Hispanics in the population who would say that they are registered to vote. Now here's the trick. You don't know if the true percentage you're after is in any particular interval. Like your 13%+/-4. This interval could be one of the 95 that contains the true percentage, or, if you're unlucky, it is one of the 5 that doesn't. You don't know.

This is what is meant when statisticians use the term "confidence." It might not sound particularly confident, but the researchers who pioneered this idea were looking for "rules to govern our behavior... which insure that, in the long run of experience, we shall not be too often wrong." So the 95 out of 100 refers to repeated uses of the survey _procedure_ (conduct random sample, construct interval). Wording it differently, it tells us that our confidence intervals won't actually contain the number we are hoping to discover in 5 out of 100 surveys. Yes, 5 out of 100 organizations will get it wrong. That's 1 in 20. Of course that begs the question, who decided being wrong 1 in 20 times is OK? Save that for another post!
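To see the "95 out of 100" idea in action, here is a minimal simulation sketch of the procedure described in the e-mail above. It is not part of the e-mail. It assumes a hypothetical true percentage of 13% purely for illustration (nobody knows the real value), perfectly random samples of 263 people, and the same normal-approximation interval used earlier; all function names are mine.

```python
import math
import random

def interval_covers(true_p: float, n: int, rng: random.Random, z: float = 1.96) -> bool:
    """Draw one simulated survey and report whether its interval contains true_p."""
    hits = sum(rng.random() < true_p for _ in range(n))
    p_hat = hits / n
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe <= true_p <= p_hat + moe

def coverage(true_p: float, n: int, surveys: int, seed: int = 0) -> float:
    """Share of simulated surveys whose interval contains the true value."""
    rng = random.Random(seed)
    covered = sum(interval_covers(true_p, n, rng) for _ in range(surveys))
    return covered / surveys

# Hypothetical true value of 13%; 2,000 simulated polling organizations.
share = coverage(true_p=0.13, n=263, surveys=2000)
print(f"Share of intervals containing the true value: {share:.1%}")
```

Under these idealized assumptions the share should land in the neighborhood of 95%. Any single interval either contains the true value or it doesn't, and the simulation only works because it was told the true value in advance.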