Lies, Damn Lies, and Statistics

There are only three types of lies: lies, damned lies, and statistics.

The adage is usually attributed to the British Prime Minister Benjamin Disraeli, and Mark Twain popularized it in the United States; it is as relevant today as it was when Twain first quoted it more than a century ago. Statistics, and the numbers they are composed of, permeate the modern world and are used to rationalize decisions. Numbers rule arguments both offline and online. Glance at any advertisement and it will be filled with promising numbers: “double your weight loss,” “cut your bill in half,” or even just “50% more nutritious.”

Can you spot the trick? (Gun deaths rose after 2005) photo courtesy of Heap Analytics

Statistics appeal to us on a fundamental level because numbers appear indisputable — unlike anecdotes or words which we’ve all learned to recognize as possibly deceptive. Numbers have a mystical certitude. They may be confusing, yet they sound impressive. It’s hard to argue with them, because it’s hard to argue with something presented as truth. Most importantly, statistics are comforting because they suggest a sense of control over the future — or at least an anticipation of what’s about to come.

Polls, much like statistics, have a scientific, quantitative, and precise panache. They’re based on data; they must be accurate. Right?

Well, not always. In the Republican race in the Iowa caucus, Donald Trump anticipated a clear-cut victory, bolstered by polls showing him nearly 5 percentage points ahead of Ted Cruz.

But instead of a comfortable margin of victory of 5 percentage points, Trump found himself in second place, about 3 percentage points behind Cruz.

Initially gracious, Trump quickly turned to anger and outrage. The defeat left him reeling. A flurry of Trump-driven tweets accused Cruz of voter fraud. How else could the polls have been wrong? Trump himself tweeted: “Ted Cruz didn’t win Iowa, he stole it. That is why all of the polls were so wrong and why he got far more votes than anticipated. Bad!”

The accusations have yet to be investigated. They center on “voting reports” (which scored Iowans on their past voting record and compared them with their neighbors, in an attempt to shame more people into caucusing) and on Cruz’s misreporting that Ben Carson had conceded the race. Even so, it seems unlikely that there will be a recount or a re-do of the caucus, which Trump has requested. Throughout the allegations one fallacious theme stayed consistent: that the polls couldn’t be wrong. From this false premise it was an easy jump to conclude that the caucus results must have been incorrect.

In fact, there were many ways for the polls to be wrong. Bush, Huckabee, and Kasich also “underperformed” relative to their polls. Some polls going into the caucus even had Cruz ahead, although not the polls directly preceding it. Trump’s lead in the polls could be explained by other factors.

For one thing, his brand recognition may have meant that people being polled chose him over other candidates simply because they recognized his name. Some polls looked at Iowans in general, not just those who had caucused in the past and were likely to caucus in the future. The polls that did look at likely voters assumed a higher turnout in the caucus and therefore a more secular population of caucus-goers, which diluted the impact of the potential religious vote. For this reason pollsters had discounted Evangelicals in their models.

 

Very significant statistical correlation can be found between unrelated statistics images courtesy of tylervigen.com

But Evangelicals were the group most likely to vote for Cruz. Pollsters were correct to predict record-breaking turnout: 180,000 Iowans caucused in the Republican caucus, compared to the previous record of 120,000 in 2012. Yet Evangelicals still made up 64% of the electorate. Cruz won 34% of Evangelicals compared to Trump’s 22%. Trump won 29% of the non-Evangelical caucus attendees; Cruz won only 18%.
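
To see how much the assumed makeup of the electorate matters, here is a rough sketch in Python that blends the group-level figures quoted above into an overall vote share. The 40% Evangelical share is a hypothetical assumption chosen purely for illustration, not a number from any actual poll, and the sketch assumes each candidate's support within a group stays fixed as the mix changes.

```python
# Group-level figures quoted in the text above.
EVANGELICAL_SHARE = 0.64
SUPPORT = {
    "Cruz": {"evangelical": 0.34, "other": 0.18},
    "Trump": {"evangelical": 0.22, "other": 0.29},
}

# HYPOTHETICAL: a turnout model that expects a more secular electorate.
ASSUMED_SHARE = 0.40


def overall_share(candidate, evangelical_share):
    """Weight a candidate's support in each group by the size of the group."""
    support = SUPPORT[candidate]
    return (evangelical_share * support["evangelical"]
            + (1 - evangelical_share) * support["other"])


def leader(evangelical_share):
    """Return whichever of the two candidates has the larger blended share."""
    return max(SUPPORT, key=lambda name: overall_share(name, evangelical_share))


for label, share in [("reported", EVANGELICAL_SHARE), ("assumed", ASSUMED_SHARE)]:
    cruz = overall_share("Cruz", share)
    trump = overall_share("Trump", share)
    print(f"{label} mix ({share:.0%} Evangelical): "
          f"Cruz {cruz:.1%}, Trump {trump:.1%}, leader: {leader(share)}")
```

With the reported 64% mix Cruz comes out ahead; with the hypothetical 40% mix the same group-level numbers put Trump ahead. Under these simplified assumptions the lead flips once the Evangelical share falls below roughly 48%.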

The outcome of the Iowa caucus underscores an important point. Despite the trust we place in them, polls and statistics are only as good as the assumptions and data on which they are based. Statistics can and will lie to you. With careful thought (or just pure carelessness) they can be designed to spin practically any outcome. Even with careful and accurate math underlying a statistician’s work, small choices can still have tremendous impacts.

Even just changing the Y-axis scale can dramatically change the story of the data. photo courtesy of Heap Analytics
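
A quick way to quantify the axis trick is to compare how tall two bars look relative to each other once the Y-axis no longer starts at zero. The values below are hypothetical and are not taken from the chart; the helper `apparent_ratio` is a name made up for this sketch.

```python
def apparent_ratio(small, large, axis_min=0.0):
    """Ratio of the drawn bar heights when the Y-axis starts at axis_min."""
    if axis_min >= small:
        raise ValueError("axis_min must be below the smaller value")
    return (large - axis_min) / (small - axis_min)


# HYPOTHETICAL measurements: a 10% increase from one period to the next.
before, after = 50.0, 55.0

honest = apparent_ratio(before, after)               # axis starts at 0
truncated = apparent_ratio(before, after, 48.0)      # axis starts at 48

print(f"Axis from 0:  second bar looks {honest:.1f}x as tall")
print(f"Axis from 48: second bar looks {truncated:.1f}x as tall")
```

The data is identical in both cases; only the baseline of the axis changes how dramatic the difference appears.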

For example, let’s pretend that the Powerball lottery is back up to $1.6 billion (the record jackpot that was won recently by three participants). Actually, we’ll make it interesting and set it at $2 billion. As you’re going to buy your ticket, I offer you a deal. For $100, I will give you a secret pattern to choose from that will double your odds of winning. Do you take the deal to double your chances of winning $2 billion? It sounds like a great exchange, but I only gave you your relative chance of winning. I didn’t mention that for $100 I was merely increasing your chances from one in 292 million to two in 292 million (for comparison, your chance of being killed by a vending machine in a given year is 1 in 112 million). Your absolute chance of winning was still basically zero, and you could’ve doubled your chances for far less than $100 by buying another ticket.
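
The arithmetic behind the deal can be sketched in a few lines of Python. This is a deliberately simplified calculation using only the figures from the example above; it ignores taxes, split jackpots, and lump-sum discounts.

```python
JACKPOT = 2_000_000_000          # the pretend $2 billion prize
BASE_ODDS = 1 / 292_000_000      # chance of winning with one ticket
TIP_COST = 100                   # price of the "secret pattern"

new_odds = 2 * BASE_ODDS

# Relative change: how the pitch is framed ("double your odds").
relative_increase = (new_odds - BASE_ODDS) / BASE_ODDS

# Absolute change: what you are actually buying.
absolute_increase = new_odds - BASE_ODDS

# Expected dollar value of the extra chance (simplified, see caveats above).
expected_gain = absolute_increase * JACKPOT

print(f"Relative increase: {relative_increase:.0%}")
print(f"Absolute increase: {absolute_increase:.2e}")
print(f"Expected value of the extra chance: ${expected_gain:.2f} "
      f"for a price of ${TIP_COST}")
```

The same change reads as a 100% improvement in relative terms and as a few billionths in absolute terms, and the extra chance is worth only a few dollars in expectation against the $100 price.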

Statisticians can pull other tricks. For example, they can use the average instead of the median so that outliers more heavily influence the results. Or they can choose non-representative samples: if I ask 50 people waiting in line at McDonald’s for their opinion of McDonald’s, surprise, surprise, they’ll probably like it. They can also publish selectively. Tobacco companies would pay for dozens of studies on the correlation between smoking and health risks such as cancer, then publish only the one study that found little or no correlation and discard the results of the others.
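
The average-versus-median trick is easy to demonstrate with Python's standard `statistics` module. The salary figures below are hypothetical and invented for illustration: nine ordinary salaries and a single very large one.

```python
from statistics import mean, median

# HYPOTHETICAL salaries: nine typical employees and one executive.
salaries = [
    40_000, 42_000, 45_000, 47_000, 50_000,
    52_000, 55_000, 58_000, 60_000, 1_000_000,
]

average_salary = mean(salaries)    # pulled upward by the single outlier
median_salary = median(salaries)   # the middle of the sorted list

print(f"Average salary: ${average_salary:,.0f}")
print(f"Median salary:  ${median_salary:,.0f}")
```

Both numbers are computed correctly, yet the average is higher than what nine of the ten people earn, while the median stays close to a typical salary. Which one gets quoted depends on the story the presenter wants to tell.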

This isn’t to decry statistics. The point is to recognize what statistics are: predictions of the future, not guarantees. Statistics can be incredibly powerful tools, and are essential for estimating parameters, making inferences, and predicting the future. But too often we forget that statistics are fundamentally well-educated guesses. Even the best statisticians make mistakes (see Iowa), and it is the responsibility of all of us to look beyond the numbers, no matter how comforting they may be, to understand the context and methodology. It’s harder to present statistics in an unbiased fashion than it is to use the numbers to reflect one’s personal opinions. By no means stop using “facts” or statistics. Just recognize that most statistics are fundamentally biased, and please remember that numbers (and statisticians) will lie to you.