Why I am relearning statistics
20 June, 2021

I have finally started what I had been thinking of doing for at least two years now: relearning statistics. I am a freelance journalist/programmer, and I control how much work I take on at any point. To make time for this academic pursuit, I consciously reduced my workload; it’s a struggle to say no to exciting journalism assignments, but I try.
I am six weeks into this learning endeavour and it’s truly rewarding: I am having a lot of fun diving into mathematical theory and solving problems for hours at a stretch. If things go according to plan, I should be doing this till the end of the year, and I hope to stay on course.
In a series of blog posts, I will document my learning approach, my renewed appreciation for MOOCs, and my thoughts on higher education and pedagogy. I am starting today by documenting why I am doing this.
The goal is pretty clear: to learn probability theory and statistics from the ground up, build a strong foundational understanding, and get my hands dirty with the underlying math, just like a good mathematics undergraduate would. The irony is that I already hold an undergraduate degree in mathematics, yet I never did back then what I am doing now.
My curriculum did include a compulsory course in probability and statistics, but I did not take the subject matter seriously (you can pass the course, albeit without an honourable grade, without really understanding the material). I ignored it despite being quite excited about the prospects of data science, simply because I was not aware of how central probability theory is to data analysis. Instead, I spent a lot of time learning about databases, cloud computing, big data systems and machine learning algorithms, and writing code to solve problems, all to get better at data work. My focus leaned towards computing and away from core statistics.
It is deeply embarrassing to acknowledge how many basic statistical ideas I did not know at the time I proudly graduated with a degree in mathematics and computing from one of the finest universities in India.
But none of this created any problems in my career: I got my break in journalism because of my data skills. To be fair to myself, and not to be completely self-deprecating, I do a decent job with numbers, and that shows up in the body of journalistic work I have produced (more on data journalism in an earlier blog post). I look at the world through a data-driven lens, and an empirical approach is rooted in my thinking. On the practical side, I routinely write code to scrape and clean data, find trends to inform readers, and visualise them for easy interpretation.
Meanwhile, a few things happened:
1. Exposure to the distinction between descriptive and inferential statistics
The descriptive part is easy. It largely deals with summary statistics (like the mean and median) and tells us what the data says about the sample we are looking at. But I realised I was clueless about inference, which, simply put, is the process of drawing conclusions about a population based on a sample or subset of the data.
Yes, you can draw meaningful insights from basic exploratory analysis and computation of summary stats, but inference uses probability theory to substantiate those conclusions and quantify the uncertainty that comes with sample estimates.
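To make the contrast concrete, here is a minimal sketch in Python (with made-up numbers, not real survey data): the descriptive step merely summarises the sample, while the inferential step attaches a confidence interval to the estimate, saying something about the population the sample came from.

```python
import math
import random

# Hypothetical sample: monthly incomes (in thousands) of 50 surveyed
# households, drawn here from a made-up distribution.
random.seed(42)
sample = [random.gauss(mu=30, sigma=8) for _ in range(50)]

# Descriptive statistics: summarise the sample itself.
n = len(sample)
mean = sum(sample) / n
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
std_err = math.sqrt(variance / n)

# Inferential statistics: quantify the uncertainty of the estimate
# with a 95% confidence interval (normal approximation, z = 1.96).
ci_low, ci_high = mean - 1.96 * std_err, mean + 1.96 * std_err

print(f"Sample mean: {mean:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```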
For example:
- I can analyse the state of the economy by putting together datasets of key indicators (like GDP and unemployment) and interpreting what they tell us. That is just simple data interpretation. What I don’t understand are the statistical processes used to estimate those indicators in the first place: the methods behind the GDP or unemployment numbers.
As a result, when the policy debate on India’s revised GDP series was dominating the news cycle, I was unable to think rigorously about it, because the debate was really about methodology. Similarly, when the Indian government changed the way it conducts the labour force survey (the basis of its unemployment figures), I could follow the political debate, but not the statistical one.
If I am told that unemployment went up from 4.2% to 4.9%, how do I assess whether the rise is statistically significant? What if the change is simply a reflection of random fluctuation? How should I think about the threshold? At the moment, I can’t answer these questions with mathematical precision (one standard approach is sketched below).
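For what it’s worth, the textbook tool for this question is a two-proportion z-test. The sketch below is a toy version with entirely hypothetical sample sizes; real labour force surveys have complex designs, so the actual calculation would differ.

```python
import math

# Hypothetical survey sizes; real surveys have weights and stratified
# designs, so this is only the textbook two-proportion z-test.
n1, p1 = 100_000, 0.042  # last round: 4.2% unemployed
n2, p2 = 100_000, 0.049  # this round: 4.9% unemployed

# Pooled proportion under the null hypothesis (no real change).
pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

z = (p2 - p1) / se
# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# With samples this large, a 0.7 percentage-point rise is highly
# significant; with small samples, it might well be noise.
```

The point the test makes plain is that significance depends on the sample size as much as on the 0.7 percentage-point change itself.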
I can scan empirical studies and research papers and ask questions about the results. But I can’t meaningfully engage with the processes used to arrive at those conclusions. Statistical inference lies at the heart of scientific inquiry, and not knowing its fundamentals means I cannot fully participate in understanding what’s really going on. One can argue that you don’t always need to know the specifics of the methods, and I agree. I am merely saying that without understanding the foundations I am not confident in my own arguments, and I want to fix that.
These are just a few simple examples that I could immediately think of. I am not even getting into the possibilities of statistical modelling that remain out of my reach — and I have encountered practical problems where my analysis would have been much richer had I been equipped with those tools.
Simply put, this missing piece in my statistical understanding has been hampering my ability to think about questions I care about.
2. Valuing the importance of statistical theory: Here is a paragraph from the book All of Statistics by Larry Wasserman (which I am using as a reference textbook) that puts it clearly.
Students who analyze data, or who aspire to develop new methods for analyzing data, should be well grounded in basic probability and mathematical statistics. Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid.
On point: if only I had had this perspective during my undergrad, I might have made different choices. It also points to my failure to manage an appropriate mix of theory and application. Both matter, but my choices and interests were heavily skewed towards application, coding and problem solving, at the cost of downplaying the importance of theoretical ideas.
3. A probabilistic worldview: The way I look at the world completely changed once I started appreciating the role of chance in everyday life. The world is not deterministic — it is fundamentally probabilistic. And we underestimate how much it actually matters.
For instance, the (frequentist) probabilistic worldview tells you that if we were to re-run the events of history 100 times, 100 different realities would emerge. The one we are living in right now was not “meant to be”; it is just one of the infinite paths the world could have taken. This raises complicated cause-and-effect questions (“How would things be had the low-probability event X not occurred?”).
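A toy simulation captures the idea (a minimal sketch; the random walk below is an arbitrary stand-in for “history”, not a model of anything real): identical starting conditions and identical rules still produce a different trajectory on every run.

```python
import random

# A toy "history": 1,000 events, each a coin flip that nudges the
# state of the world up or down. Same rules, same starting point.
def run_history(seed: int, steps: int = 1000) -> int:
    rng = random.Random(seed)
    state = 0
    for _ in range(steps):
        state += 1 if rng.random() < 0.5 else -1
    return state

# "Re-run history" 100 times; only the randomness differs.
outcomes = [run_history(seed) for seed in range(100)]
print(f"End states range from {min(outcomes)} to {max(outcomes)}")
print(f"Distinct end states: {len(set(outcomes))} out of {len(outcomes)}")
```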
This thinking emerged primarily from reading. I had read Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives a decade ago (sometime in 12th grade), but it was only after I graduated from university and read the work of Nassim Nicholas Taleb, Philip Tetlock (and a few others) that ideas related to risk and uncertainty got baked into my head.
And then I hit a roadblock: how do I transfer my intuitive ideas about probability to practical problem solving? Learning that is one of my goals.
4. Journalism and epistemology: The primary goal of journalism is truth-seeking (this is what I think about my profession; others may propose different end goals, e.g. social impact, political influence, empowering individuals).
Which is why I think a lot about epistemology—the branch of philosophy that deals with the thorny question of “knowing”: how do we know what we know?
A fundamental problem with many journalists is their confidence that they know things and can explain them well, when it should be news to no one that much of it is just garbage. Just look at the deluge of shallow political theories dumped upon us after every election, only to be replaced at the next; it is so frustrating. As Tetlock writes in his book Superforecasting, the problem is that we move too fast from confusion and uncertainty to a clear and confident conclusion, without spending enough time in between.
As I weigh epistemic methods, science-based statistical inquiry dominates everything else. But my own work has told me that this approach has its limitations and that I need to expand my horizons. I want to stress-test statistics, think more deeply about its limitations, and consider the power of possible alternatives.
The American historian Jill Lepore put this brilliantly in a panel discussion:
Even journalism now, is like, “Oh, we should read the 538 because that’s data” and everything else is just opinion. We have a kind of cultural worship of data. The bigger question is what crap you can get away with now by saying you’re working with data, and what you can impose on other people, and even foot a tax bill for by saying that.
It diminishes all other ways of knowing in realms of knowledge. That is a huge crisis we can’t understand. You might know more about the crime I committed by reading a poem, than working with this algorithm. But we don’t think about that poem as a form of knowledge. The ways in which this, the reign of data, discredits all kinds of realms of knowledge, and among other things, the experience of women and children and families and the intimate and the sexual and demeans the private as something that can purely just exist for commodification.
There’s a whole set of assumptions in that world that we should be talking about. I mean not to say there’s no amazing extraordinary research being done that is data-driven or that falls under the heading of data science, but there were a lot of mistakes made when people decided in the 1890s that social science would solve every problem. It was kind of important for other people to say, “You know what, social science can’t necessarily solve every problem. It’s really useful, but it’s important to think about when we should use it and when not.”
Has there ever been a time when we have said that what we’re doing doesn’t solve every problem?
I am hoping that stronger fundamentals will allow me to place statistics proportionally within the broad scope of epistemic methods, displace it where it does not belong, and make space for the alternatives I continue to explore.
I am excited to see how my learning process evolves. I don’t expect answers to all of my questions right after completing the coursework I am pursuing (more on that in a future blog post), but I find it helpful to sketch out the big picture while messing around with the minute details. If nothing else, it is a good reminder of everything I don’t know, which is a key driver of continuous learning.