Why it’s a mistake to compare calculators to ChatGPT

Courtesy of one of my morning papers, a discussion of how Canadian universities are reacting to ChatGPT, which includes this quote from University of Calgary Associate Professor of Education Sarah Elaine Eaton:

“There’s a complete moral panic and technological panic going on, and I think we need to take a step back and look at other kinds of tech that have been introduced,” Prof. Eaton said.

“We’ve heard people say things like they think this is going to make students stupid, that they’re not going to learn how to write or learn the basics of language. In some ways it’s similar to arguments we heard about the introduction of calculators back when I was a kid.”

The idea that ChatGPT (and technologies like it — this isn’t a one-app issue) and the calculator are similar enough to allow for useful comparisons is rapidly gaining hold. Sam Altman, the CEO of OpenAI, the for-profit company that has billions invested in the mainstreaming of this tech, has embraced the comparison wholeheartedly:

“Generative text is something we all need to adapt to,” he said. “We adapted to calculators and changed what we tested for in math class, I imagine. This is a more extreme version of that, no doubt, but also the benefits of it are more extreme, as well.”

The ChatGPT-calculator comparison is based on the argument that, just as the calculator automated numerical calculation, ChatGPT has automated the writing and research process. The thinking goes, if we can adapt to one type of automation, we can adapt to this.

But are they comparable? Let’s think it through.

For background, I started elementary school in the late 1970s, so pocket calculators happened on my watch. Although my PhD is in political science, I’ve taken undergraduate calculus courses and graduate-level econometrics (for my Economics MA). As a doctoral student, I TAed a quantitative methods course (i.e., I ran the labs where the students honed their stats skills). And I’m currently teaching our graduate methods course. So I’ve been around mathy stuff for a few decades now.

As for the ChatGPT side of things, my recent CIGI piece builds on research I’ve done over the past several years, building toward a co-authored book that should be published later this year: we just handed in our last round of major revisions.

Let’s start on the calculator side of things. If you need a refresher on how calculators function, check out this Wikipedia page or this explainer. Or this one. But for our purposes, what’s important is that it automates the process of calculation in a predictable way, by translating numbered-key inputs into binary language, and then performing operations on them via logic gates (e.g., AND, OR, NOT).

In its way, it’s not that different from using an abacus. Importantly, it produces consistent and verifiable outputs. And by examining the calculator’s processors to make sure they’re put together properly, we can confirm that our inputs will give us an accurate answer.
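To make that concrete, here is a minimal sketch of addition built from nothing but AND, OR and NOT. It is an illustration of the principle, not a description of any particular calculator's circuitry; the function names and the 8-bit width are my own assumptions.

```python
# A toy adder built only from AND, OR and NOT operations on single bits.
# Illustrative sketch: names and the 8-bit width are assumptions,
# not a model of any specific calculator chip.

def NOT(a: int) -> int:
    return 1 - a

def AND(a: int, b: int) -> int:
    return a & b

def OR(a: int, b: int) -> int:
    return a | b

def XOR(a: int, b: int) -> int:
    # "a or b, but not both", expressed with the three basic gates.
    return AND(OR(a, b), NOT(AND(a, b)))

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits; return (sum_bit, carry_out)."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out

def add(x: int, y: int, width: int = 8) -> int:
    """Add two non-negative integers bit by bit, lowest bit first."""
    carry = 0
    result = 0
    for i in range(width):
        bit_x = (x >> i) & 1
        bit_y = (y >> i) & 1
        sum_bit, carry = full_adder(bit_x, bit_y, carry)
        result |= sum_bit << i
    # Any carry beyond `width` bits is dropped, as on a fixed-width machine.
    return result

print(add(2, 2))
```

Every step here can be inspected and checked by hand, and the same inputs always travel the same path through the same gates. That is what makes the output verifiable.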

Now think about machine learning. ChatGPT’s output is not predictable, nor do its creators fully understand how it gets from point A to point B. That’s kind of what generative AI means. Its output, however, is based on a statistical analysis of texts converted to data points. Ars Technica has a fantastic and accessible write-up about how all this works, which I highly recommend.

Validity and reliability

All this raises important methodological questions about ChatGPT’s use as an input into research, that is, its use by scholars, students and others who want to create knowledge. One of the key ideas we teach our undergraduates in political science is that methods and indicators must be both reliable and valid. They must be reliable in terms of consistency: results won’t change no matter how many times you use them. To be valid, they have to accurately represent the thing that you want to measure.

Obviously, a properly functioning calculator is both highly reliable and produces valid results: No matter how many times I input “2” “+” “2”, and no matter how many different calculators I use, the output will always be “4”, which by definition is the correct answer.

If we take ChatGPT, or a similar chatbot, seriously as an academic tool, its problems with both validity and reliability should become clear. By definition, chatbots should not output the same thing every time for any non-trivial question: if we wanted a calculator, we’d have a calculator. But also, answers to the same question will conceivably differ across chatbots. All calculators work the same way, but chatbots, like search engines, reflect the idiosyncrasies of their designs, and their designers.
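The reliability gap can be sketched in a few lines of code. What follows is a toy, not a model of ChatGPT: it assumes only that a chatbot's reply is sampled from a probability distribution over possible outputs, which is how these systems are commonly described. The word list and its probabilities are invented for illustration, and the toy ignores its prompt entirely.

```python
import random

def calculator_add(a: int, b: int) -> int:
    # Deterministic: the same inputs always produce the same output.
    return a + b

# Hypothetical reply probabilities, invented for this illustration.
REPLY_WEIGHTS = {"four": 0.90, "4": 0.07, "five": 0.03}

def toy_chatbot(prompt: str, rng: random.Random) -> str:
    # Stochastic: the reply is drawn at random according to the weights.
    # The prompt is ignored; a real system would condition on it.
    replies = list(REPLY_WEIGHTS)
    weights = list(REPLY_WEIGHTS.values())
    return rng.choices(replies, weights=weights, k=1)[0]

def distinct_outputs(trials: int = 1000) -> tuple[set, set]:
    """Ask each tool the same question many times; collect distinct answers."""
    rng = random.Random()
    calculator_answers = {calculator_add(2, 2) for _ in range(trials)}
    chatbot_answers = {toy_chatbot("What is 2 + 2?", rng) for _ in range(trials)}
    return calculator_answers, chatbot_answers

calculator_answers, chatbot_answers = distinct_outputs()
print("calculator:", calculator_answers)
print("toy chatbot:", chatbot_answers)
```

Run it and the calculator's set contains a single answer, while the toy chatbot's set will almost certainly contain more than one, including an occasional wrong one. A tool that can return different answers to the same question is not reliable in the methodological sense.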

So, how do we know which answers are “correct”? This is a bigger problem for non-calculator questions, because in reality, there may be more than one “correct” answer. (One of the biggest mistakes engineers make is their assumption that because the study of society doesn’t need to be numbers- or statistics-based, it’s easier than the hard sciences. The opposite is true.) The short answer is, we can’t, at least not on the chatbot’s terms. If we are not to take its correlation-based outputs as gospel, we need to evaluate them according to other criteria, whether that’s the scientific method or our own unquestioned assumptions about how the world works. In any case, as a method for generating knowledge (which is what research methods are), chatbots are unreliable in the methodological sense.

The issue of validity gets to the question of what, exactly, chatbots are representing. A good survey, for example, will convincingly link its specific questions about, say, annual household income and voting intentions, to larger concepts, such as the role of income and political affiliation. The problem with chatbots is that their source material isn’t the world, but writing about the world. They’re sampling texts, not measuring things as they exist.

That’s a step further removed from reality than we may want to be as researchers. At the very least, the question of validity as it relates to chatbots needs to be considered more directly than I, at least, have seen in ChatGPT-focused discussions.

Automation and assessment

ChatGPT and calculators are very different in terms of how they produce their output, as well as the reliability and validity of their output. Those should be enough for all academics to be very, very concerned about using ChatGPT in their work without subjecting it to far more critical analysis than it’s received.

But it’s the automation of key parts of the educational process that is the real concern for academics, and should be for students and their parents. Does ChatGPT merely automate the writing process, and can we look to the introduction of pocket calculators as a guide to how to ensure that we educators can continue to assess and guide our students’ educational development? Let’s find out!

Calculators, as I’ve noted, allow the user to find the (consistently) correct answer to mathematical questions. As educators, however, we want to teach the process behind the calculation. This is the beating heart of the scientific (and, I would argue, democratic) worldview: That it is good for people to understand how the world works, and not just take it as a given.

As an undergraduate and later a graduate student in the 1990s and 2000s, I actually experienced the transition to software-focused statistics classes. The early part of my academic career featured much more figuring out how to run statistical tests by hand and using rudimentary command-line-based computer programs to analyze my datasets. By the time I graduated with my PhD in 2011, the statistical packages were so much easier to use, just point and click.

Although I certainly remember thinking how easy the #kidstoday had it, never, from secondary school through to my PhD, did my teachers and instructors ever focus on the final result. Instead, they graded the process. The tool: the partial mark. Every step of the way, we had to show our work, and we lost points if we missed a step in our calculations, even if we got the final answer right.

Then again, and this is a question for the punch card and slide rule generation: did teachers ever grade math only by focusing on the answer? I mean, for anything above a Grade 2 level?

So while there may have been a moral panic over the introduction of calculators, it’s important to note that the solution — show your work and grade the process — was there from the beginning.

(And just to be clear, while calculators have advanced to become more like computers, the calculator-chatbot comparison is about the introduction of each technology, not the technologies in their mature form.)

What about chatbots?

Chatbots automate the writing and research process. Putting aside questions of reliability and accuracy for the moment, for educators this technology poses the same problem as the calculator, in that the answer — the final product — is in many ways the least important part of the equation. Just as math teachers want their students to understand how math works, the point of the essay or any similar written assignment is to teach the student how to research and how to collect and evaluate evidence.

The researching and writing of the essay is the means to the end of teaching the student how to, for lack of a better word, think. Even more importantly, and what separates the calculator from the chatbot, the “correct” answer to an essay is far more indeterminate than with calculator-focused math problems.

That said, this is where a comparison with the calculator is actually useful. It highlights the fundamental pedagogical challenge posed by chatbots: How can we disaggregate written assignments into something amenable to a “part marks” approach? Can this even be done?

For some forms of evaluation, sure. In-class exams function in this way: the student has to write for three hours in front of a witness, without computer aid. Some people have suggested oral exams, but we should recall that teachers are not exempt from the human propensity to be convinced by someone who sounds convincing but knows nothing.

For essays, though, that’s the problem. Many teachers already award part marks for an essay. Students in my class typically have to provide the components of a paper in the run-up to the final essay. So, the research question and argument, the literature review and source list, and so on. These steps effectively disaggregate the essay into its component parts.

The problem, of course, is that chatbots can be used to generate all parts of this disaggregated essay. The value in a literature review or article analysis is in the reading, but a student can just pull that off the shelf from a chatbot. Unlike the initial situation with calculators, any written work that takes place beyond the instructor’s watchful eye will be suspect. Partial work and partial marks won’t work.

The sad reality is that, if these for-profit companies are allowed to continue on their current trajectory, it will get increasingly difficult to trust student work produced anywhere but in the classroom or under constant digital surveillance.

Even sadder, and much less widely acknowledged, is the hard reality that we do not currently possess a technology for teaching students how to think and research, or for creating knowledge, that can hold a candle to the written essay. Exams offer a solid check on a student’s knowledge and understanding, but it is the process of writing and research (of trying to understand a reading rather than just accepting a machine-driven summary output) that creates both understanding and the capacity to create your own knowledge. People may hate the essay for any number of reasons, but there is currently no Plan B.

Thinking through our responses

The mistake made by Prof. Eaton and OpenAI’s CEO is to assume that automating one process is functionally equivalent to automating another. This is not the case. Different technologies have different purposes. Different automation processes produce different results.

A thoughtful, intelligent and useful response to the challenge that ChatGPT poses to the education system must start with an understanding of the essay as a form of technology, including its fundamental purpose, and then consider how ChatGPT’s automation of the writing process either promotes or inhibits that purpose.

A second step would be to recognize that ChatGPT, despite how it is portrayed, including in the Globe article, is not an inevitability. Technologies are human creations, and the direction of their development is shaped by human actions and regulation. One question we might ask ourselves is: should this technology’s development be left to the whims and prejudices of self-interested Silicon Valley billionaires? Or should we as citizens act to defend ourselves from their hubris?

A third step would be to move beyond easy platitudes about how we’ll just replace the essay with something else. The education field is riddled with half-baked ideas, enthusiastically embraced one year only to be forgotten the next (remember MOOCs? I do.). In the face of the reality that universities and instructors are already abandoning the essay as a pedagogical tool, anyone who would downplay the challenge that ChatGPT poses has the responsibility to offer, right here, right now, a workable solution for how we can continue to do the basic job of educating our students.

ChatGPT is a technology with powerful companies and billions of dollars behind it. It is already causing havoc in the education sector, upending a cornerstone of centuries of scientific education and advancement.

We need to take it seriously. Simply dismissing ChatGPT as a souped-up calculator is highly misleading. It does nobody any favours.