I always enjoy Stephen Fry’s writings, podcasts, and shows. His unique perspective and captivating exploration of topics keep curious minds engaged.
It was no different with his latest Substack article ‘The One and the Many’ that revolved around Statistics and AI. I thoroughly enjoyed it and it reminded me of what Statistics means to me, and why I got involved in it in the first place.
Dear fellow-statisticians,...
…How do we define Statistics? A sentence in ‘The One and the Many’ caught my eye:
“For all my ignorance, I am at least aware that statistics is a branch of mathematics and thought that describes much of our world and that the field is concerned with far more than the petty lists of figures that politicians brandish…”
Stephen Fry, “The One and the Many”, The Fry Corner, Feb. 2024
Correct, and thankfully we statisticians do deal with far more than politicians’ machinations, or we would consequently make for a very large therapy group indeed. But to me that’s not the essence of Statistics. It is not so much about describing the world, it is mostly about how uncertain we are about it.
The definition of Statistics that I was taught at university, and still abide by, is this:
“Statistics is the science that attempts to quantify uncertainty.”
Simple, but it contains all the keywords. I won’t delve into the part about Statistics being a distinct discipline and science, and not a mere branch of Mathematics, as there are strong arguments both ways. If you are curious, put a passionate mathematician with an equally passionate statistician in the same room, pose the question, then sit comfortably, grab a large bag of popcorn and enjoy the show.
I will focus instead on our attempts to quantify uncertainty. “Attempts” is the keyword. We attempt; we never claim to actually succeed. Ultimately, what we attempt to do is to minimise the error in our statements about the world.
Have you heard of Type I and Type II errors, standard error, margin of error, confidence intervals, sensitivity and specificity? These all serve to assess whether there is significant evidence that a relationship, or correlation, exists. “Significant evidence” implies uncertainty, albeit an acceptable level of it, and as such differs from the definite proof that mathematics typically offers (take that, my dear Mathematician!). You will never hear us say that “the world IS so”, but rather that “we have significant evidence and are ninety-something percent confident that the world is so”.
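For the curious, here is a minimal sketch of how two of those terms, the standard error and the margin of error, combine into a confidence interval for a mean. The sample values are made up for illustration, the function name `mean_confidence_interval` is hypothetical, and the multiplier 1.96 assumes a normal approximation at the 95% level; for a sample this small a t-distribution multiplier would be more appropriate.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, standard_error, margin_of_error, (low, high)).

    z=1.96 is the normal-approximation multiplier for ~95% confidence
    (an assumption of this sketch, not a universal choice).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    # Margin of error: how far the interval extends on each side of the mean.
    margin = z * se
    return mean, se, margin, (mean - margin, mean + margin)


# Made-up measurements, purely illustrative.
heights_cm = [168.0, 172.5, 181.0, 175.2, 169.8, 177.4, 173.1, 170.6]

mean, se, margin, (low, high) = mean_confidence_interval(heights_cm)
print(f"mean = {mean:.1f} cm, approx. 95% CI = [{low:.1f}, {high:.1f}] cm")
```

Note what the output does not say: it never claims the true mean IS a given number, only that we are roughly ninety-five percent confident it lies inside the interval.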
Half empty or half full?
Which brings me to why I became a statistician: To be able to draw the line between significance and insignificance. The notion of a limit, a dividing line, always excited me, and the limits of confidence even more so.
Growing up I saw these magic limits everywhere: In dusk, that is neither night nor day, in a mug balancing on its edge about to fall one way or the other.
Sometime in my school days, my father asked me the classic question: Is a glass half full or half empty? “It depends”, I replied unfazed. “If you were in the process of filling it and stopped midway, it is half full. If you were emptying it and stopped, it is half empty”. Later on, when I was introduced to limit theory and x approaching x₀ from both sides, I remembered the half-full glass and realised that what I had unknowingly done was to build a simple mental model (alright, yes, a mathematical one) to describe a real-world phenomenon.
That’s what we statisticians and data scientists do. We draw dividing - or perhaps defining - lines and attempt to describe the world by building models. These models are, by definition, imperfect: They are abstract as they focus only on the most important parameters, incomplete as they most likely lack data, and may even be biased - God forbid.
Can our algorithms see right through you?
Despite all our confidence issues, we are indeed getting better at describing the world. That’s not so much because of theoretical breakthroughs (although we have had several of those too), but mostly due to the vast quantities of data currently available and the unprecedented processing power now cheaply accessible to most.
The dizzying processing speeds enable us to train a score of models on huge datasets in seconds. Whereas in the past we could only test a handful of models to find, with considerable toil, the least-bad option, now we can painlessly try dozens at a time and discover a really powerful one.
The abundance of data means that we have both more variables (features) and larger samples to base estimations on. Our data tables got both wider (more columns) and longer (more rows). If you’re wondering how much wider, you may find the following slide by AWS illuminating.
Consequently, our descriptions of the world got both more detailed, incorporating more features, and more accurate, yielding lower prediction errors. The end result, accurate behavioural predictions for finely-grained ‘personas’, may, indeed, feel scary:
“... the algorithms that (should) frighten us out there seem to be getting closer and closer to knowing not just how my demographic and my age group will behave but precisely how I, the individual Stephen Fry, will behave.”
Stephen Fry, “The One and the Many”, The Fry Corner, Feb. 2024
True, but the thing is, we are still describing groups. Only smaller ones. Including more features refines the groups we describe, making them smaller and more specific. Those defining lines of ours may seem to be tightening around the individual, but, as with mathematical limits, they will never quite reach it. It’s inherent in the way Statistics works:
Predictions about you, the individual, rely on your similarity to others, to a group, who are known to have consistently behaved in a certain way. To be reasonably confident about the consistency of that behaviour, we need a sufficiently large sample of people to test it with. As we are all unique, there will never be a sample bigger than one to precisely and fully describe you with all your characteristics and traits combined; and a sample of one is never enough to allow us to be confident about anything. The individual is safe.
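The shrinking-group argument can be sketched with a toy simulation. Everything here is an assumption made for illustration: the population is random, each hypothetical feature takes one of four values, and the features are independent and uniformly distributed. Real-world features are correlated, so real groups shrink more slowly, but the direction is the same.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_PEOPLE = 50_000   # size of the simulated population (assumed)
N_FEATURES = 12     # number of hypothetical features per person (assumed)
LEVELS = 4          # each feature takes one of 4 values (assumed)

# Each person is a tuple of feature values drawn independently at random.
population = [
    tuple(random.randrange(LEVELS) for _ in range(N_FEATURES))
    for _ in range(N_PEOPLE)
]
you = population[0]  # the individual we are trying to describe


def group_size(k):
    """Count the people who match `you` on the first k features."""
    return sum(1 for person in population if person[:k] == you[:k])


# Group size as we describe `you` with more and more features.
sizes = [group_size(k) for k in range(N_FEATURES + 1)]

for k, size in enumerate(sizes):
    print(f"{k:2d} features -> group of {size}")
```

With no features the group is everyone; each added feature can only keep the group the same size or shrink it, and it can never fall below one, because you always match yourself. Once the group is that small there is nobody left to compare you with, which is exactly why a sample of one tells us so little.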
It’s complicated…
But to me the quest is not about describing the world or predicting individuals. What I always saw in numbers was a springboard for dreams.
Complex numbers have a real part and an imaginary part. The real part in many cases specifies a snapshot of a system at a specific moment, such as its current position. The imaginary part is related to how the system is changing, where it may go, or how it might react to a shock, and in that sense it may describe its dynamics.
The above also implies that real numbers (ℝ) are a subset of complex numbers (ℂ).
With my amateur philosopher’s hat on in my school years I drew three conclusions:
First, that the complex nature of our world extends beyond our immediately observable reality. “Reality”, the extended edition, is subjective as it also includes our imagination. We may all observe the same phenomenon or object, such as a conflict or even a piece of furniture, but our perception of it differs as it is biased by our opinions and experiences.
What is the ‘objective truth’ then? You know, the one we are trying to describe and agree on? My second conclusion was that objectivity is the intersection of a sufficiently large number of subjective reality sets. That is where data comes in to find sufficiently large sets (how large depends on the context) and draw conclusions based on unambiguous observations. Therefore, to understand the objective truth I tend to trust clean and unbiased data, not opinions, even if these come from experts.
The third conclusion was and still remains paramount for my subjective truth:
The true nature and beauty of numbers does not lie in the reality that they measure, but in all the possibilities that they allow you to imagine.
It’s not per se about data, numbers, algorithms, or precise predictions. To me it is all about dreams. If that’s a tad too romantic for you, allow me to rephrase in business talk: It’s about the vision. And the data-backed decisions that will take you there.
They say data is the new oil, and I fully agree. Not only because it powers the world economy, but also because data fuels our visions of a world in the making. Whether these visions will be pleasant dreams or nightmares is entirely up to us.