I avoided statistics at school like a rash – filled myself up with super-powered double A-levels in Maths and Mechanics instead. I’m not sure why, as it turns out statistics is vital to the good functioning of our world in 2017.
We have plenty of computers, plenty of data, and plenty of willingness to make evidence-based decisions. Alas, it is easy to fool yourself into thinking you know something you don’t. It is easy for confirmation bias to convince you that you’ve found evidence, so you don’t feel guilty when you report up your management chain.
You don’t need to juke the stats – that is to say, consciously interfere with data collection on a large scale, as described in this scene in a school from The Wire.
No, no such outright deceit. It’s easy instead to “rig” the stats – whether by accident, on purpose, or with different parts of your mind doing different things without full understanding.
Having spent some more time paying attention to statistics this year, I’ve put together a high-level checklist of things to watch for:
1) Make sure the sample is large enough. If you don’t have enough data, you get an essentially random result that changes with each new data point. It’s easy for non-experts to gather data until the apparent result hits what they expect, and then stop. Instead, learn how to calculate confidence intervals. Fancy Bayesian probability distributions can clearly show the limits to your knowledge caused by lack of data. The hard bit is the cost – the sample sizes needed to actually reach valid conclusions can be quite large. Or maybe just be honest and accept that there may simply be no way to get enough data.
2) Choose your measure first and stick to it. It’s super tempting to measure several things or tweak the weightings between different contributors to an index, until the answers meet your preconceptions. Instead, choose a good measure up front and go with what you get. Overfitting models or overtraining neural networks feels like an extension of this error.
3) Don’t confound your variables. The real cause may be one thing, but it can look like it is something else which is correlated with that cause. For example, it’s easy to produce statistics that appear to show black people in the US commit more crime. You just have to ignore the confounding variable: it is poor people who get into circumstances where they commit more crime, and black people in the US are disproportionately poor. This one goes to the very heart of causality and meaning.
4) Classify things instead of ranking them. Often reality doesn’t neatly give things a score, but does have categories or outliers. For example, you might not have enough data to rank hospitals exactly, but you do have enough to know (thanks Anna for reference) which are more than 3 standard deviations from the mean, so should be inspected.
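Points 1 and 4 are the easiest to show in a few lines of code. What follows is only a rough sketch, not a recipe: the function names `wilson_interval` and `flag_outliers` are my own, the numbers are made up for illustration, and I’ve assumed a yes/no measurement (so the Wilson score interval applies) and a plain mean-and-standard-deviation screen. Real data will usually need more care than this.

```python
import math
import statistics


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if n <= 0:
        raise ValueError("need at least one observation")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


def flag_outliers(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard
    deviations away from the mean of `values`."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]


# Hypothetical data: the same observed rate of 60% at three sample sizes.
# The interval narrows as the sample grows.
for n in (10, 100, 1000):
    low, high = wilson_interval(round(0.6 * n), n)
    print(f"n={n}: between {low:.3f} and {high:.3f}")

# Hypothetical scores: thirty ordinary results and one that stands out.
scores = [10] * 30 + [100]
print("inspect indices:", flag_outliers(scores))
```

Note that the outlier screen doesn’t try to rank anything. It only sorts results into “looks ordinary” and “worth inspecting”, which is the whole idea of point 4. It also needs a reasonable number of values before anything can sit 3 standard deviations out.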
I’ve noticed that it is common for highly trained, experienced people to spot one of these four things, and focus on improving that or bringing the doubt caused by it to everyone’s attention. That’s a mistake: to come to a sound conclusion you have to follow all of them, and a lot more besides (for example, I haven’t mentioned accounting for measurement errors). One won’t do.
At first this can all seem discouraging. It often turns out that, despite apparent data, you know little or nothing.
Don’t despair, for two reasons.
i) Despite the above, statistics still finds nuggets. Truths you can act on with confidence.
ii) It’s valuable to know what you don’t know – to know when, despite your intuition, there really is no evidence either way for something, and that management time is best spent either gathering more data, or doing something else entirely.
The peace to know what you don’t know.