Come for a walk with me.
When I am at home, my habit is to walk for about an hour or more before dawn. So, come on: it's a beautiful time to be out, quiet and dark. This is also very good thinking time; the mind still has memories of dreams, and the business of the day has not taken hold. All sorts of things come together, often in startling propinquity.
I have written here before about knowledge and narratives as a form of landscape, through which we navigate. Often, when I look back on my dawn walks to recall my chain of thought, I find it literally implicated in the landscape of my route, rather like a memory palace. Like this ...
Uphill
On Friday, starting the climb to the top of our hill and on to the long straight stretch beyond, I was thinking about a recent research paper with the rather wordy and academic title, Gender, Confidence, and the Mismeasure of Intelligence, Competitiveness and Literacy. It was fascinating and worth reading if you want to dig into the details. Walking, I didn't recall all the technicalities, but the high-level findings really did get me thinking.
The researchers (Harrison, Ross and Swarthout) used the Raven Advanced Progressive Matrices (RAPM) test, a common intelligence test with 36 problems, to conduct several experiments with hundreds of college students. Instead of just marking answers right or wrong, the researchers developed an interface that lets subjects allocate tokens to different answer choices for each problem. The number of tokens allocated to an answer reflects the subject's confidence in it being correct.
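To make the contrast concrete, here is a minimal sketch, in Python and purely illustrative, of forced-choice marking versus token-based, confidence-weighted marking. The particular rule used here (credit equal to the fraction of tokens placed on the correct option) is my assumption for illustration; the paper's actual scoring may differ.

```python
# Illustrative only: a toy comparison of forced-choice scoring versus
# confidence-weighted (token-allocation) scoring for a single test item.
# The rule below -- credit equals the fraction of tokens placed on the
# correct option -- is an assumption, not the paper's exact method.

def forced_choice_score(chosen_option: str, correct_option: str) -> float:
    """Traditional marking: 1 if the single chosen answer is right, else 0."""
    return 1.0 if chosen_option == correct_option else 0.0

def token_score(allocation: dict[str, int], correct_option: str) -> float:
    """Confidence-weighted marking: the share of tokens on the correct answer."""
    total = sum(allocation.values())
    return allocation.get(correct_option, 0) / total if total else 0.0

# A subject torn between options "B" and "D", leaning towards "B".
allocation = {"A": 0, "B": 6, "C": 0, "D": 4}

print(forced_choice_score("B", "D"))  # 0.0 -- forced choice records only failure
print(token_score(allocation, "D"))   # 0.4 -- honest partial knowledge earns credit
```

Under forced choice, the hesitant-but-partly-right subject scores nothing; under the token rule, their calibrated uncertainty earns partial credit, which is the difference the study is probing.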
Women performed much better when they could express their confidence levels rather than being forced to pick just one answer. In fact, women outperformed men when confidence in results was measured. Minority students also showed significant improvement with the confidence-based system. These results held even when the questions were randomized, rather than being presented as progressively more difficult.
The researchers suggest that women and minorities may be more willing to acknowledge uncertainty when they're not completely sure of an answer.
Traditional measures don't simply fail to capture women's intelligence; they actively penalize the forms of reasoning that women more frequently exhibit by forcing a single choice.
The team tested this concept in other areas and found similar patterns. In competition, women aren't afraid to compete but make rational risk assessments. In financial literacy tests, women's “I don't know” responses reflect appropriate uncertainty, rather than ignorance.
What I believe we're seeing is a need to include an understanding of ambiguity and uncertainty in our assessments of intelligence. So, there's a lot to think about in that paper, although the conclusions are strikingly simple.
The farm road
Turning along the lane between the horse farms (Good morning, Indigo! Hi, Diesel!), my subject often changes with the change in surroundings. This day, I found myself making odd conjunctions.
Of all people, the first to come to mind was Hayek (Thatcher's favourite economist), with his central argument about the use of knowledge in society. That was weird, but although I have little time for Hayek's political ideology, he makes many valid observations, especially about the nature of knowledge and how it is distributed.
Hayek understood that knowledge is not centralized, but scattered among countless individuals. He saw prices of goods in a market as a way of coordinating this dispersed knowledge, not by central control but through local signals that allow each person to act on their own information (over-simplified: what you want to buy, why, and how much you are prepared to pay), which may be unavailable to others.
Centralized economic planning assumes a kind of artificial certainty and comprehensive knowledge. And this is what connected in my mind to the intelligence research …
Embracing uncertainty is a more effective mechanism than assumed comprehensive knowledge, because it allows more subtle approaches to knowledge that more accurately reflect the real world.
The way through the woods
Turning into the woods where, at this time of the year, I am a little wary of both aggressive owls and lumbering bears (the owls are scarier), I connected all this to something Dr Francis Young has written about: how people increasingly treat AI-generated content not as probabilistic outputs from pattern-matching systems, but as oracular pronouncements.
Young is a historian of belief, and suggests we have moved from one authority-based system of knowledge (ancient texts, religious doctrine) through a period of empirical scientific discovery, only to arrive at another authority-based system; this time with AI as the ultimate arbiter of truth. AI believers see a digital supermind that has somehow consumed the totality of human knowledge into the pleroma of data. Young does make clear he means pleroma in the Gnostic sense of a divine completeness beyond the material world. (Saint Paul uses the word somewhat differently.) With sufficient data, the pleroma approaches omniscience.
Homeward with hallucinations
No owls today, except a distant hooting, and no bears. Emerging from the woods and turning for home, I find my thinking also turning, now to a recent paper from OpenAI: Why Language Models Hallucinate.
The paper is quite technical, but in essence, the authors show that hallucinations arise not from flaws in implementation but from fundamental statistical pressures in the training process. When faced with questions about arbitrary facts (birthdays of obscure individuals, for instance), even well-calibrated systems must guess, and they do so in ways that appear confident to users.
They also see a deeper problem with evaluation frameworks. Current AI benchmarks reward guessing over acknowledging uncertainty, systematically training models to exhibit precisely the kind of overconfidence that research shows to be a disadvantage in humans.
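A back-of-the-envelope calculation, not from the paper itself, shows the pressure they describe. Under binary grading (1 for a correct answer, 0 for anything else, including "I don't know"), guessing always has the higher expected score; only a rule that penalizes wrong answers (the -1 below is my own illustrative choice) makes abstention rational.

```python
# Illustrative only: expected scores for "guess" versus "abstain" under two
# grading schemes. The -1 penalty for a wrong answer is an assumed example,
# not a value taken from the OpenAI paper.

def expected_score(p_correct: float, right: float, wrong: float, abstain: float,
                   guess: bool) -> float:
    """Expected score on one question for a model that guesses or abstains."""
    if guess:
        return p_correct * right + (1 - p_correct) * wrong
    return abstain

p = 0.2  # the model's (well-calibrated) chance of guessing correctly

# Binary grading: right = 1, wrong = 0, "I don't know" = 0.
print(expected_score(p, 1, 0, 0, guess=True))    #  0.2 -> guessing always wins
print(expected_score(p, 1, 0, 0, guess=False))   #  0.0

# Grading that penalizes confident errors: right = 1, wrong = -1, abstain = 0.
print(expected_score(p, 1, -1, 0, guess=True))   # -0.6 -> abstaining now wins
print(expected_score(p, 1, -1, 0, guess=False))  #  0.0
```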
With both AI and IQ, we've created measurement systems that optimize for the appearance of knowledge rather than its substance.
You see the pattern. AI engineers (and, I suspect, the tech industry in general) approach uncertainty and knowledge by consistently designing systems that penalize acknowledgment of ignorance while encouraging confident assertions, regardless of their accuracy.
Hayek's analysis suggests why this occurs: complex systems require local knowledge that cannot be centrally aggregated or assumed into a pleroma. Yet our technical and institutional responses consistently attempt to create comprehensive, centralized measures. As we have seen, intelligence tests reduce multifaceted, complex cognitive capabilities to single scores. Marxist economic planners (and the leadership of most US companies) attempt to replace endlessly diverse local signals with centralized calculation, performance targets and plans. AI systems try to compress the complexity, ambiguity and tentativeness of human knowledge into simplistic outputs.
Young's observations about the pleroma of data illuminate how this dynamic can intensify. If people increasingly treat AI outputs as authoritative regardless of their grounding in evidence, we risk creating feedback loops where confidence matters more than accuracy, and where the simulation of knowledge displaces genuine understanding.
The technical constraints that the OpenAI paper identifies suggest these problems may prove intractable through purely algorithmic means. If hallucination emerges from statistical necessities rather than engineering failures, then addressing it requires changing how we evaluate and deploy these systems rather than simply improving their training.
The downhill stretch
We've created systems that reward false confidence. But it's important to say that this tellingly reflects choices about what we value, not objective or technological necessity.
When we design intelligence tests that penalize appropriate and useful uncertainty, we're making ethical choices about what kind of intelligence counts and, by implication, whose intelligence counts.
The same applies to AI systems: the fact that they hallucinate confident falsehoods reflects the values embedded in their training, not some inevitable technological outcome.
We can do better.
Excellent and thought-provoking ideas, as usual, Donald. The discussion reminded me of another we've had many times over the years, regarding whether and to what degree we trust data vs. trusting the person who provides the data. For the current post, I see a similar thread woven throughout:
The context of an observation is essential to the interpretation/meaning of that observation.
When we receive observations from a person, we load those observations with tremendous amounts of metadata … about the observer: our history with them, our knowledge of their habits and biases, their accounts of recent and distant experiences, their demonstrated skills and blind spots, their recent travels or work projects, their friends/teachers/collaborators/etc, their worldviews, their value systems, their analytical preferences and tools, etc etc etc
We rarely assume universality of their info, even when we regard them as a reliable and trusted source of information. We rarely need to make that assumption, because their info is so enriched with the trove of metadata.
If I say that “it will touch 30 degrees today”, my friends will know that I mean cold/Fahrenheit because I am hiking around Mont Blanc at 3k meters. 30 as data means very little compared to the trove of contextual metadata.
I think this asymmetry in value between data and metadata has something to do with the myth that objectivity is commonplace. Useful, meaning-rich information about reality is most often wildly subjective: tied to our peculiar experiences, to the uniqueness of those moments and of us, etc. This inherent subjectivity that is so commonplace in the useful, meaningful information we encounter helps explain why tacit metadata is so helpful in disambiguating the info we share with each other. But we have no such metadata context for an AI source.
The final thought that your piece generated in me today (on a walk, no less) is that there is a powerful hack for our tendency to interpret info as deterministically reliable: make the confidence window explicit.
In my analytic and strategic work with clients over the last few years, I have found that asking them specifically, “How certain do you want to be that your decision is correct?” is a very powerful tool for moving them out of the mode of thinking that every piece of information they see is either absolutely reliable or absolutely not. The question reframes their cognitive process to move them into a grayscale universe, which happens to be the sort of universe in which we actually live. I find that the quality of thinking produced by every person after this question is superior to the quality of thinking before it, irrespective of whether the person is quite ordinary in mental capability or quite sophisticated.
Perhaps this could help us with your concerns in the way we interact with AI. I suspect some interesting and productive interactions might be created by asking AI the following questions about its responses to our inquiries (a rough sketch of how they might be scripted follows the list):
1. How sure are you that your assertion of X is correct?
2. What are 2-3 alternative perspectives/answers to my original question?
3. Looking across these possible answers, and considering both the contexts in which each is more likely to be preferable and the context within which I am asking my question, rank order them based upon the probability that each is superior to the others for my context.
4. What aspect of my context, if changed, would cause you to rate a different one of the possible answers as the most reliable for me? Explain why you assert that these changes in context require a change in answer.
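For what it's worth, here is a rough sketch of how those four follow-ups could be bundled around any question we put to a model; the wording of the template and the example scenario are mine, and nothing here depends on any particular vendor's API.

```python
# Illustrative only: wrap a question with the four follow-up prompts above,
# so that uncertainty and context are asked for explicitly.

FOLLOW_UPS = [
    "How sure are you that your assertion is correct?",
    "What are 2-3 alternative perspectives or answers to my original question?",
    "Considering the contexts in which each answer is preferable, and my own "
    "context, rank the answers by the probability that each is superior.",
    "What change in my context would lead you to rate a different answer as "
    "the most reliable for me, and why?",
]

def with_confidence_window(question: str, context: str) -> str:
    """Build a single prompt that makes the confidence window explicit."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(FOLLOW_UPS, start=1))
    return (
        f"My context: {context}\n\n"
        f"My question: {question}\n\n"
        "After answering, please also address the following:\n"
        f"{numbered}\n"
    )

# Hypothetical example, invented for illustration.
print(with_confidence_window(
    question="Should we localize our product for the Nordic market this year?",
    context="A 40-person software firm with no prior international sales.",
))
```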