Mind, Matter, and Meaning

The Word “Hallucination” Was Already Taken
MIND · MATTER · MEANING No. 36 · May 2026

A system with no inside can’t hallucinate — only drift.

An essay · mindmatterandmeaning.com

A chatbot hands you a footnote. It cites a paper that does not exist. The author it names never wrote anything close to the title. The journal volume runs ten issues short, and the page numbers point into empty air. You copy the citation into a search engine, find nothing, and go back to the chat window with the now-standard complaint: it hallucinated again. Everyone in this little drama uses the word as if it carried no philosophical freight at all — as if the engineers had flipped through the dictionary, hunting for something punchy, and simply landed on the right term.

They did not land on the right term. They landed on a word that already had a job, and the job mattered.

In philosophy, hallucination names something quite specific. A person has an experience that seems, from the inside, to present a real object — a pink rat on the kitchen counter, a friend at the foot of the bed — when no such object is there. The experience happens. The object does not. And the whole difficulty lives in that word seems. The hallucinator looks out at what feels like a world and meets no resistance in it; nothing on the inside of the experience whispers that it has failed.¹ Philosophers disagree, sometimes fiercely, about what that shared appearance comes to.² But every account worth having agrees on one thing: a hallucination happens to someone. It needs a subject who seems to see.

Now look at what the engineer means. A model trained on text produces a string that mentions a paper it has never encountered. No seeming is involved. Nothing inside the system scans a row of citations and concludes, falsely, that one of them is real. The model has no vantage point from which the false output looks like a world. It does not have a world. It has a distribution over tokens, shaped by your prompt, and a sampling step that picked one path through that distribution rather than another. The output strays from the truth because nothing in its training rewarded tracking the truth this finely. The phenomenon is real, and it deserves a name. Hallucination simply names the wrong shape.
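
For readers who want the mechanism spelled out, here is a minimal sketch of the sampling step just described. It is not any particular model's code. The token scores, the prompt in the comment, and the names TOY_LOGITS, softmax, and sample_token are all invented for illustration, and a real system works over a vocabulary of many thousands of tokens.

```python
import math
import random

# Hypothetical next-token scores for the prompt "The paper appeared in volume".
# The numbers are invented for illustration; a real model computes them
# from learned weights.
TOY_LOGITS = {"12": 2.1, "14": 1.9, "22": 1.7, "7": 0.4}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def sample_token(logits, rng):
    """Pick one token in proportion to its probability.

    Note what is absent: no lookup against a library catalogue, no check
    that the chosen volume exists. Plausibility is the only input.
    """
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(sample_token(TOY_LOGITS, rng))
```

Whichever volume number comes out, the procedure that chose it is the same one that would have chosen a correct number. Nothing in the loop distinguishes the two cases.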

Why did the word stick? Partly because it sounds clinical and forgiving at the same time. It pathologizes the model gently, as though it had caught a passing fever. The alternative is to say plainly what every text-only system does: it strings together plausible continuations whether or not those continuations track anything. And partly the word stuck because it smuggles in a familiar picture — a mind, turned inward, deceived. The old Cartesian theater reopens for one more show, this time staged inside a server rack. By sheer suggestion, the model becomes a tiny subject, now and then misled. Once that picture takes hold, the question how do we stop the model from hallucinating? sounds answerable, the way medicating a patient sounds answerable. The harder question, the one the picture hides, never even gets asked.

Here is that harder question. To misrepresent anything, a system has to be tied to the world tightly enough that something fixes when it succeeds and when it fails. The teleosemantic tradition locates that tie in a system’s biological or designed function: a state misrepresents when it fires outside the conditions it was built to track.³ Searle gets to a kindred conclusion by another road — genuine meaning shows up only where formal symbol-shuffling connects to real causal, embodied dealings with the world.⁴ Either way, misrepresentation is an achievement. A thing has to first be the kind of thing that can represent. Only then can it, on a given occasion, get something wrong.

A text-only language model has no such footing to lose. Its outputs ride on statistical patterns combed out of a corpus, and the corpus stands in for the world only in the thin sense that the humans who wrote it were writing about the world. Nothing in the model’s loop checks its outputs against any state of affairs out there. The very idea of the model getting it wrong imports a yardstick the model cannot hold. We hold it on the model’s behalf. We are the ones who notice the missing journal issue. The model notices nothing.

Some philosophers push back on this hard line, and they deserve a hearing. Vladimír Havlík argues for what he calls semantic fragmentism: the claim that language models do achieve real meaning, not everywhere, but within bounded patches of language where their training is dense and coherent.⁵ The view is trying to honor something obvious — the gulf between Eliza shuffling canned phrases and a modern model translating, summarizing, and holding a dozen constraints in the air at once. Fair enough. But fragmentism still owes us an account of what fixes meaning inside those patches. If the answer is use within a corpus, it has only moved the form/meaning gap somewhere harder to see. If the answer is grounding in the world, it has conceded the whole point.⁶

None of this makes the engineering problem vanish once we take the philosophical word back. The problem stays. It just gets more honest. What the model does, when it fabricates a citation, is closer to confabulation — a word we already use for fluent narration produced without access to the facts the narration claims to report. Or, more plainly still: drift from a standard the system cannot detect. Neither phrase will ever move a product launch. Both have the modest merit of being true.

The cost of the borrowed word shows up in the questions the field lets itself ask. Ask whether a model can be made to stop hallucinating, and you have quietly assumed it has a grip on the world that slipped — and that the right tweak will tighten it. Ask instead what a system would need before its outputs counted as representations at all, and you walk straight into the harder country: sensors, a body, a causal history, the long apprenticeship through which a creature comes to mean cat by the word “cat.” Better questions tend to make better engineering. They also, as it happens, make better metaphysics.

No one is giving the word back. The AI industry does not borrow vocabulary and then return it, and there is something almost endearing about the theft — a field moving so fast it will cheerfully lift a clinical term from the discipline next door and call the lifting naming. But look at what the borrowing does. It plants, at the dead center of the most consequential technology story of the decade, a word that describes an inner theater inside a system that has no inside. The pretense earns its keep. It quietly props up the very confusion this whole project has been working, patiently, to take apart. A model that hallucinates sounds like a mind on the mend. A model that drifts from facts it cannot detect sounds like exactly what it is. And once we hear it as what it is, we can finally ask the real question: what would have to be added before a system could be capable of getting anything wrong at all?

Notes
1. Tim Crane, “Is There a Perceptual Relation?”, in Perceptual Experience, ed. Tamar Szabó Gendler and John Hawthorne (Oxford: Oxford University Press, 2006), 126–146; “Introspection, Intentionality, and the Transparency of Experience,” Philosophical Topics 28 (2000): 49–67; and “The Problem of Perception,” Stanford Encyclopedia of Philosophy (rev. 2021, with Craig French). On Crane’s intentionalist treatment, hallucination is a representational state whose content fails to match the world; the phenomenal character of the state arises from how it represents, not from any inner object the subject is alleged to inspect. This sits within strong representationalism and explains why a hallucination seems to present a worldly object — it represents one, just inaccurately. The treatment is congenial to the present essay’s claim that hallucination is a content-failure of a world-directed state, and it is the account the main text of Chapter 4 develops and Ch. 4.2 (“The Pain in the Toe That Isn’t There”) applies to phantom limb. The point of citing it here is to fix the philosophical meaning of hallucination before the engineering metaphor co-opts the word. ↩
2. M.G.F. Martin, “The Transparency of Experience,” Mind and Language 17 (2002): 376–425, and “On Being Alienated,” in Perceptual Experience, ed. Tamar Szabó Gendler and John Hawthorne (Oxford: Oxford University Press, 2006), 354–410. Martin’s disjunctivism denies the common factor assumption — that veridical perception and a subjectively indistinguishable hallucination share a metaphysically substantive mental state. On his view a hallucination is characterized only negatively, as a state indistinguishable through reflection from a veridical perception of a particular kind. The book sits closer to Crane’s intentionalist account than to Martin’s negative epistemic disjunctivism (see Ch. 3 on the argument from illusion), which is exactly the disagreement the main text gestures at with “sometimes fiercely.” Both camps nonetheless converge on the point that does the work here: the philosophical use of hallucination requires a subject who seems to encounter a world. ↩
3. Ruth Garrett Millikan, Language, Thought, and Other Biological Categories (Cambridge, MA: MIT Press, 1984), chs. 1–2; and Karen Neander, A Mark of the Mental: In Defense of Informational Teleosemantics (Cambridge, MA: MIT Press, 2017). Neander’s “informational teleosemantics” extends Millikan’s framework by tying representational content to the conditions a system is functionally adapted to detect — its informational functions, which carry what she calls “normative aboutness” — rather than to the conditions it merely happens to correlate with. The misrepresentation case then comes out clean: a state misrepresents when it occurs outside the conditions its function was selected to track. The frog’s bug-detector firing at a passing pellet (Chapter 6’s example) is the canonical illustration. Crucially for the present argument, neither Millikan nor Neander offers any route by which a text-only LLM could misrepresent, since on neither account can a representational achievement be inherited without the selection history that grounds it (see Ch. 6.1 for the developed argument). ↩
4. John Searle, “Minds, Brains, and Programs,” Behavioral and Brain Sciences 3 (1980): 417–424; and “Is the Brain a Digital Computer?”, Proceedings and Addresses of the American Philosophical Association 64 (1990): 21–37. Searle’s two-step argument — first that syntax does not yield semantics (the Chinese Room), then that computation is itself observer-relative — is the spine of Chapter 9’s case against treating LLMs as understanders. (A guarded note for the careful reader: the Chinese Room alone does not establish the strong, fully general claim that no syntactic process could ever yield semantics; the book leans on the observer-relativity argument, not the thought experiment in isolation, to carry that weight.) The point relevant here is narrower. Searle’s distinction between systems with genuine, world-grounded semantic content and systems that merely emit semantic-shaped tokens makes the engineering term hallucination a category mistake: a system without content to begin with has no content to misrepresent. What it has is output drift relative to an external standard. ↩
5. Vladimír Havlík, “Meaning and Understanding in Large Language Models,” Synthese 205:9 (2025). Havlík’s “semantic fragmentism” (developed in his §3.7) is the more sympathetic edge of the contemporary LLM-meaning debate: rather than denying LLMs any relation to content, he argues that they achieve bounded semantic competence within domains where their training distribution is dense and coherent. The book grants the empirical observation — modern models really do show competence gradations across domains — but resists the inference that domain-bounded statistical coherence amounts to genuine semantic content. The fragmentist position trades on the same conflation Bender and Koller diagnose (next note): the slide from “the form is right” to “the meaning is there.” A useful contrast piece is Jumbly Grindrod, “Large Language Models and Linguistic Intentionality,” Synthese 204:71 (2024), discussed at length in Ch. 6.1. ↩
6. Emily M. Bender and Alexander Koller, “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), 5185–5198. Their “octopus test” is the cleanest engineering-side statement of the syntax/semantics gap for large language models: a hyperintelligent octopus that taps an undersea cable carrying two islanders’ chatter could learn every statistical regularity in their messages without ever encountering a coconut or a predator, and would still produce fluent replies it does not understand — until the day real help is needed and the fluency fails. The argument recasts Searle’s Chinese Room in distributional-semantics terms and has been much discussed in the NLP literature without being much heeded. The form/meaning gap they diagnose is the same gap this essay turns on: textual coherence is not world-grounded representation, and a system that lacks the latter cannot, strictly, misrepresent in the way hallucinate implies. ↩
May 25, 2026
What a Machine Would Have to Earn
MIND · MATTER · MEANING No. 29 · May 2026

Understanding is earned in a world, not performed on a screen.

A friend sent me a transcript last spring. He had asked a chatbot what a sunburn feels like the morning after — that specific tight, hot, can’t-find-a-way-to-lie-down misery — and the machine answered better than he could have. It named the flinch when a shirt seam drags across the shoulders. It knew the small betrayal of forgetting for a second and leaning back into a hot car seat. He found it uncanny, a little moving, and he wanted to know: does it understand what a sunburn is?

Good question, asked at the right moment. The honest answer takes a while to earn, so let me start with the answer most of us reach for first — because it’s reasonable, and because it’s wrong.

The reasonable view goes like this. Understanding shows up in what you can do. A student who can answer any question about the French Revolution, field the follow-ups, catch the trick ones, and explain the whole thing to a ten-year-old — that student understands the French Revolution, and we would be cranks to deny it on the grounds that we can’t peer inside her skull. Understanding is as understanding does. So if a machine handles every question about sunburns as well as a sunburned person could, the difference between the machine and the person starts to look like a difference we invented to feel special about ourselves. The picture has a respectable pedigree: it descends from behaviorism, and it has a famous instrument in Alan Turing’s imitation game, where the test for thinking just is indistinguishable performance.

Notice the quiet assumption, though. The picture takes understanding a word to be a matter of using it correctly, and takes “correctly” to be settled by looking only at the outputs. Pull on that thread and the whole thing comes apart in your hands.

Stevan Harnad, a cognitive scientist with a gift for naming traps, named this one in 1990: the symbol grounding problem.¹ Imagine trying to learn Chinese from a Chinese-only dictionary. Every definition sends you to other entries, which send you to others, and you ride that merry-go-round forever without once touching the ground. A system whose symbols are defined only by more symbols never means anything by them. Meaning gets in only when some of the symbols connect to the things they are about by some route other than further symbols — when “red” hooks to red, not merely to “crimson,” “scarlet,” and “the color of a stop sign.”
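
The regress can be sketched as a walk through a toy dictionary. This is only an illustration under loud assumptions: the entries, the GROUNDED set, and the function reaches_ground are hypothetical, and marking a word as "grounded" in code merely stands in for a connection to the world that no program text can itself supply.

```python
# A toy symbols-only dictionary: every word is defined by other words.
# The entries are invented for illustration.
DICTIONARY = {
    "red": ["crimson", "scarlet"],
    "crimson": ["red", "deep"],
    "scarlet": ["red", "bright"],
    "deep": ["crimson"],
    "bright": ["scarlet"],
}

# Hypothetical set of symbols tied to something other than more symbols.
# In a symbols-only system this set is empty.
GROUNDED = set()


def reaches_ground(word, dictionary, grounded):
    """Follow definitions from `word` and report whether any path leaves
    the circle of symbols by hitting a grounded term."""
    seen = set()
    frontier = [word]
    while frontier:
        current = frontier.pop()
        if current in grounded:
            return True
        if current in seen:
            continue  # already visited: the merry-go-round has come around
        seen.add(current)
        frontier.extend(dictionary.get(current, []))
    return False


print(reaches_ground("red", DICTIONARY, GROUNDED))    # False
print(reaches_ground("red", DICTIONARY, {"bright"}))  # True
```

Adding more entries to DICTIONARY never changes the first result. Only a non-empty GROUNDED set does, which is the shape of Harnad's point.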

What supplies the hook is not anything inside the system. Hilary Putnam made the case unforgettable with a thought experiment about Twin Earth — a planet just like ours except that the stuff they call “water” there is some other compound with all of water’s surface features.² A person here and their molecular duplicate there can be internally identical, down to the atom, and still mean different things by “water,” because the word answers to the stuff in the world, not to the state of the head. “Meanings,” Putnam wrote, “just ain’t in the head.” Tyler Burge pushed the same point from the social side: what your word “arthritis” picks out depends on the practice of the community you defer to, not on a private definition you carry around.³ Content lives in a relation — between a system, a world, and the company it keeps.

There is even a natural story about how the relation gets built. On teleosemantic accounts — Ruth Millikan’s and Fred Dretske’s, chiefly — a state comes to be about something by acquiring the function of tracking it, the way a frog’s strike comes to be about flies through a long history in which catching flies is what kept frogs going.⁴ The clinching detail is misrepresentation: to get something wrong, a system has to have been in the business of getting it right. A state can mean fly and fire at a passing pellet only because its job, fixed by history, was flies. No history, no job; no job, nothing to be mistaken about; nothing to be mistaken about, no content.

So understanding a word turns out to be an achievement, not a knack: it consists in having states that are genuinely about the world — not states that merely accompany the right answers, but states directed at the very things the words name — and aboutness is something a system earns over time. Your “red” means red because red things have been pushing on you, through eyes and skin and the small stakes of an actual life, since before you could pronounce the word. This is what people are gesturing at, usually too vaguely, when they say minds are embodied. The word invites mysticism, so let me drain it of any. Embodiment names three sober requirements: the system takes in the world through senses and acts back on it; its inner states have been shaped by real traffic with the features they represent; and those states are there to track a world the system inhabits, not merely to emit the right strings. Michael Tye — who spent three decades building the most careful theory we have of how experience could be nothing more than representational content, and then argued that even his own theory needs history — makes the sharpest version of the point. Two creatures could be intrinsically identical at an instant, he argues, and still differ in what they experience, because one has a past of tracking the world and the other was assembled, atom for atom, five minutes ago.⁵ History is not decoration on content. It is part of what fixes it.

Which lets me say, at last, what a machine would actually need. Not the right stuff — I don’t think the barrier is silicon, and here I part company with John Searle, who ties understanding to the specific causal powers of biological brains.⁶ The barrier isn’t carbon; it’s a world. A system understands when its inner states have been shaped by, and stay answerable to, the things they represent — when it senses and acts, lives under stakes, and can get things wrong and pay for it. Build that, and the door to genuine artificial understanding stands open. I mean open, not slyly closed. The claim here is not the tired one that machines could never understand. It is that understanding is earned through engagement, and there is no coupon for skipping the engagement.

Skipping the engagement is precisely what today’s text-only language models do. A large model learns the statistics of how we talk — the staggeringly intricate shape of which words follow which — from a corpus of descriptions of the world, never from the world.⁷ It has read everything ever written about sunburns and has never once had skin. Its “red” is a position in an immense map of words, anchored to other words, anchored to nothing outside the map. The fluency is real and the achievement is genuine; it is simply not the achievement of understanding.

Here the strongest objection arrives, and it deserves a real hearing rather than a brush-off. If the machine’s answers became indistinguishable — in principle, not merely in today’s practice — from an understander’s, then insisting it still lacks understanding looks like clinging to a ghost. A difference that makes no detectable difference, the objection runs, is no difference at all. That is the whole moral of the imitation game, and it is not a silly one.

But “makes no difference you can detect in the output” is the definition of a good simulation, not the absence of a difference. Simulate a hurricane to any precision you please: the equations are flawless and your desk stays bone dry. Modeling a process is not running it.⁸ Two systems can produce the very same words while one means them and the other reports the statistics of how the word gets used — because meaning was never a property of the output. It lives in the history behind the output, and that history is exactly what an output test cannot see. The objection mistakes the instrument for the quarry. It notices that the meter reads the same and concludes there is nothing the meter is missing.

So: does the machine understand what a sunburn is? It has never had skin. It has never flinched, never dreaded an evening because of how the sheets would feel. It holds the words and not the world the words are about. Ask the question again in some later decade, of some later system that has spent years bumping into things and paying for its errors, and the answer could come back different — that is the part the doom-mongers and the hype-merchants both manage to miss. Understanding is not a performance a system delivers. It is a debt a system pays, to the world, in the one currency the world accepts: contact. Until the bill comes due, fluency is only fluency. It was always going to be the easy part.

References

Burge, Tyler. 1979. “Individualism and the Mental.” Midwest Studies in Philosophy 4: 73–121.

Dretske, Fred. 1988. Explaining Behavior: Reasons in a World of Causes. Cambridge, MA: MIT Press.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D 42: 335–346.

Harnad, Stevan. 2002. “Symbol Grounding and the Origin of Language.” In Computationalism: New Directions, edited by Matthias Scheutz. Cambridge, MA: MIT Press.

Havlík, Vladimír. 2025. “Meaning and Understanding in Large Language Models.” Synthese 205: 9.

Millikan, Ruth Garrett. 1989. “Biosemantics.” Journal of Philosophy 86 (6): 281–297.

Putnam, Hilary. 1975. “The Meaning of ‘Meaning.’” In Mind, Language and Reality: Philosophical Papers, Volume 2, 215–271. Cambridge: Cambridge University Press.

Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–457.

Searle, John R. 1990. “Is the Brain a Digital Computer?” Proceedings and Addresses of the American Philosophical Association 64 (3): 21–37.

Tye, Michael. 2019. “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology.” In Blockheads! Essays on Ned Block’s Philosophy of Mind and Consciousness, edited by Adam Pautz and Daniel Stoljar. Cambridge, MA: MIT Press.

Notes
1. Harnad (1990) coined “the symbol grounding problem” and framed it with the Chinese-dictionary regress; he later tied it to the origin of language (Harnad 2002). The problem is older than the label — it is the computational heir of the externalist worry about how any representation latches onto its object — but Harnad’s formulation is the one the AI literature inherited, and it is sharper than the Chinese Room for present purposes because it isolates grounding from Searle’s further claims about consciousness. ↩
2. Putnam (1975). The conclusion is specifically about reference and extension: the content that fixes what “water” is true of does not supervene on the speaker’s intrinsic states. Note that Putnam later qualified his own semantic externalism in several directions; nothing here turns on the most contested versions of the thesis, only on the minimal claim that reference depends on causal-environmental relations the head alone does not settle. ↩
3. Burge (1979) extends externalism from natural-kind reference (Putnam) to social content: holding a thinker’s physical history fixed while varying the surrounding linguistic community varies which concept the thinker exercises. The two cases are independent routes to the same structural conclusion — internal organization underdetermines content — which is why the essay leans on both rather than treating Burge as a footnote to Putnam. ↩
4. The teleosemantic tradition, principally Millikan (1989) and Dretske (1988), grounds content in proper function: a state represents what it has the function of tracking, where functions are fixed by selection or learning history. Misrepresentation is the standard adequacy test for any naturalistic theory of content, since a theory on which states cannot be false has not yet described representation. Rival tracking theories handle reliable misrepresentation differently, but the historical structure — content fixed by what a state was for — is common ground and is what the embodiment argument borrows. ↩
5. Tye (2019). The thesis is that two beings intrinsically alike at a time can differ in phenomenal character because they differ in history — a representationalist’s concession that current intrinsic structure does not suffice. Ned Block replies in the same volume (“Fading Qualia: A Response to Michael Tye”) that a subject could be radically wrong about their own phenomenology; the disagreement is real and unresolved, and the essay sides with Tye while granting that Block has located the genuine pressure point. That Tye, of all people, reaches for history is the relevant fact: the most developed representationalism on offer does not think structure alone fixes content. ↩
6. Searle (1990) argues that computation is observer-relative — a physical system “computes” only under an interpretation we assign — so computational description cannot, by itself, explain intrinsic intentionality. The essay takes this negative point and leaves Searle’s positive doctrine behind. Searle’s biological naturalism holds that only the specific causal powers of brains can produce understanding; the view defended here replaces “the right biology” with “the right causal-environmental engagement,” which a non-biological system could in principle possess. The negative argument survives the amputation of the positive one. ↩
7. Not everyone takes the contact gap to be fatal, and the most direct contrary voice deserves naming. Vladimír Havlík (2025) argues the reverse of this essay’s conclusion — that large language models do ground the meanings of their expressions, by way of what he calls semantic fragmentism, so that grounding in worldly reference is not a precondition of understanding. I think this mislocates the gap rather than closing it. Semantic fragmentism can explain how a model’s tokens come to bear stable relations to one another; the externalist and teleosemantic considerations above concern what fixes the relation between a token and the world, which is precisely what a text-only training signal never touches. The architectural premise is not what divides us — a text-only model is trained to predict the next token over a corpus of text, full stop — what divides us is whether that suffices for content, and Havlík’s affirmative answer is the live position this essay rejects. ↩
8. The simulation/realization distinction is Searle’s reply to the Brain Simulator objection in “Minds, Brains, and Programs” (1980), generalized: a model of a process is not an instance of it, and whatever a process owes to its physical realization is not delivered by a description of that realization, however exact. The hurricane example makes the point without the contested premises about consciousness — no one is tempted to say the simulated storm is wet — which is why it does cleaner work here than the Chinese Room. ↩
May 25, 2026
Multimodality and the Symbol-Grounding Problem
MIND · MATTER · MEANING No. 31 · May 2026

Adding eyes to a language model gives it more pictures, not a world.

Hold a bruised avocado up to the newest chatbot and it will tell you, with a confidence you have never once earned at a produce counter, that the fruit has about a day left and you should make the guacamole tonight. It can see the avocado. That is the pitch, anyway, and it lands. After years of watching these systems shuffle words around — predicting the next token the way a very well-read parrot predicts the next syllable — here at last is one that looks at your kitchen and answers.

The demos impress, and the feeling they produce is specific: the machine has finally made contact. The symbols have touched down. Whatever was missing in the text-only models — the thing that made us suspect the parrot didn’t know what it was saying¹ — surely closes the moment you give the thing eyes.

Here is the story almost everyone now tells, and I told a version of it myself for longer than I’d like to admit. The old language models lived sealed in a room of words. “Apple” meant nothing to them beyond its statistical company — the other words it tends to travel with. No wonder they made things up; they had never met an apple. But bolt on a camera and a microphone, and “apple” stops being a token rubbing shoulders with other tokens and becomes the round red thing on the counter. Multimodality, on this telling, just is grounding. It is the rope that finally ties the words to the world.

It is a natural thought, and something in its neighborhood is even correct. But the conclusion doesn’t follow, and seeing why it doesn’t pays better than any demo.

Start with what a multimodal model actually eats. It does not eat avocados. It eats images of avocados — arrays of numbers, paired during training with text that humans wrote about them. A photograph has not smuggled a piece of the world into the machine. A photograph is a representation: a flat, frozen, human-made encoding, every bit as much a symbol as the word “avocado,” only written in a richer alphabet. Feed a model a billion captioned pictures and you have fed it a billion more descriptions of the world. You have handed it more symbols, in a new code. You have not handed it more world.

This is the trap Stevan Harnad named in 1990, and Harnad — a cognitive scientist who has spent the better part of his career worrying about how a symbol ever comes to be about anything — gave it a form worth keeping.² Imagine trying to learn Chinese from a Chinese-Chinese dictionary. Every word gets defined in terms of other words, which lead to still other words, around and around, and you never once step outside the circle of symbols to the things they name. No amount of definition conjures meaning out of more symbols; the chain has to touch ground somewhere. Somewhere a symbol has to connect to the thing — not to another symbol — through the system’s own capacity to pick that thing out, sort it, act on it.

Harnad had a sharp way of pricing this. Language, he wrote, lets us “steal” categories quickly and cheaply, through hearsay — I can tell you what a zebra is and spare you the safari. But theft works only because somebody, somewhere, earned the category the hard way, through what he called sensorimotor “toil”: the trial and error of dealing with actual zebras, guided by the cost of getting it wrong. It cannot be theft all the way down.³

And theft all the way down is exactly what multimodality quietly proposes. It tries to buy grounding with a bigger pile of borrowed representations. But a photograph of a zebra is more hearsay, not the safari. The richer alphabet is still an alphabet, and an alphabet, however many characters you add to it, is the kind of thing that needs grounding — never the kind of thing that supplies it.

There’s a deeper reason the input’s richness can’t do the job, and it arrives from the least mystical corner of philosophy. Hilary Putnam, who revised his own positions so often, and so cheerfully, that the restlessness became part of his reputation, argued in 1975 that meanings “just ain’t in the head.”⁴ What a thought is about depends on how the thinker stands to the world, not only on what is happening inside. Two systems can be internally alike down to the last detail and still mean different things, because they have different histories of contact with different surroundings. Michael Tye, who built one of the most careful versions of the view that an experience just is a way of representing the world, pressed the same point about minds: what a state represents depends partly on the causal history through which the system came to have it.⁵ A system that has tracked ripeness (reached for fruit, been right, been wrong, paid the difference in a bad lunch) has states that are about ripeness. A system assembled from a frozen archive of ripe-labeled photographs has states that are about how humans tended to label photographs. Which is not nothing. It is just not ripeness.

So here is the distinction the grounding story walks straight past. Multimodality adds modalities of representation — more kinds of symbol the system can take in. It does not add modalities of engagement — sensors wired to actuators in a world the system inhabits, a history of tracking real features, and some stake in getting it right.⁶ The first is a matter of feeding the model new file formats. The second is a matter of putting the model on the line. They are not the same project, and no quantity of the first sums to the second. The avocado demo feels like seeing. But seeing is something a creature does in a world it can be wrong about and suffer for being wrong about. What the model does is map an array of numbers onto a likely sentence.⁷ It has never been hungry. It has never been fooled. It has never cut into one and found mush.

The strongest reply grants most of this and turns it around. Fine, the objector says — you’ve already admitted an embodied system could mean things. And multimodal models are precisely the perception stack going into embodied systems: the same vision encoders that caption your avocado get bolted onto robots that pick things up. So you’re knocking down a strawman. Nobody serious claims a static image model is grounded; the claim is that multimodality is step one toward a system that is. The trajectory is the point.

This objection is right about nearly everything, and I want to be careful, because where it’s right is exactly what matters. Yes — a robot that acts in a world, tracks what it touches, and pays for its mistakes could come to mean something by “avocado.” I have no objection in principle; the door stands open. But notice what does the work in that story. The grounding gets accomplished by the acting-in-a-world — the closed loop, the tracking, the stakes — and not by the number of input channels feeding the network. A simple creature with one sense and a body on the line stands nearer to meaning than a thousand-modality oracle trained on a frozen scrape of the internet. So the honest version of the trajectory claim is not “multimodality grounds language.” It is “embodiment might, and multimodality is some of the plumbing.” Those two sentences advertise very different products. The first hands you grounding you have not paid for. The second admits the bill is still outstanding.

The avocado on your counter is ripe or it isn’t, and you settle the question the only way anyone ever has: you cut it open — a small risky act in a world that pushes back and now and then embarrasses you. The model has never once been embarrassed, because it has never been anywhere it could be wrong. Giving it a camera changed what it can be shown. It did not change what it can be answerable to — and answerability to the world, not access to more pictures of it, was the whole of what we were missing. We did not open the model’s eyes. We widened the window of the room it was always in, and hung a sharper picture in the glass.

References

Burge, Tyler. 1979. “Individualism and the Mental.” Midwest Studies in Philosophy 4: 73–121.

Dretske, Fred. 1988. Explaining Behavior: Reasons in a World of Causes. Cambridge, MA: MIT Press.

Dretske, Fred. 1995. Naturalizing the Mind. Cambridge, MA: MIT Press.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D 42: 335–346.

Harnad, Stevan. 2002. “Symbol Grounding and the Origin of Language.” In Computationalism: New Directions, edited by Matthias Scheutz, 143–158. Cambridge, MA: MIT Press.

Havlík, Vladimír. 2024. “Meaning and Understanding in Large Language Models.” Synthese 204: 71.

Putnam, Hilary. 1975. “The Meaning of ‘Meaning.’” Minnesota Studies in the Philosophy of Science 7: 131–193.

Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–457.

Tye, Michael. 2019. “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology.” In Blockheads! Essays on Ned Block’s Philosophy of Mind and Consciousness, edited by Adam Pautz and Daniel Stoljar. Cambridge, MA: MIT Press.

Notes
1. The suspicion is not universal, and honesty requires flagging the dissent. Vladimír Havlík argues that Searle’s assumption of an unbridgeable gap between syntax and semantics is unjustified, and that meaning of a kind can emerge from the distributional and inferential structure a large model internalizes (Havlík 2024). I take the disagreement seriously but read it as a quarrel over what “meaning” must answer to. If content is individuated by world-involving causal relations (see notes 4–6), then distributional structure recovers how a linguistic community uses a term without recovering what anchors the term to the world. On that reading the parrot worry is relocated, not dissolved, which is why this essay presses on grounding rather than on usage.
2. Harnad, “The Symbol Grounding Problem” (1990), poses the problem through the image of trying to learn a first language from a Chinese-Chinese dictionary: an endless circuit of symbol-to-symbol definition that never reaches the world. The claim is not that symbols can never refer, but that reference cannot be conferred by further symbols alone; the regress must terminate in a non-symbolic capacity to identify a category’s members. Note that Harnad’s diagnosis is considerably friendlier to connectionism than Searle’s: the grounding he demands is sensorimotor categorization, a task he takes neural networks to be well suited to learn, given the right embodiment. The argument here is therefore not anti-connectionist; it is anti–disembodied-connectionist.
3. Harnad, “Symbol Grounding and the Origin of Language” (2002): “What language allows us to do is to ‘steal’ categories quickly and effortlessly through hearsay instead of having to earn them the hard way, through risky and time-consuming sensorimotor ‘toil.’” The theft/toil contrast is his. The application is mine: a model trained exclusively on representations attempts the theft with no underwriting toil anywhere in its causal history, neither its own nor, in any content-fixing way, the photographers’. The captioned-image corpus is a vast ledger of other people’s earnings, none of them the model’s own.
4. Putnam, “The Meaning of ‘Meaning’” (1975). Twin Earth fixes the individuation of content by external relations: my molecular twin and I, internally identical, mean different substances by “water” because our environments differ (H₂O here, the look-alike “XYZ” there). Burge (“Individualism and the Mental,” 1979) extends the externalism to the social environment. I lean only on the modest thesis, that internal richness underdetermines content, and not on any stronger claim about whether phenomenal character itself is wide. The modest thesis is enough to sink “more pixels equals more meaning.”
5. Tye, “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology” (2019). Tye accepts Block’s verdict that a “China-body system” duplicating our functional organization at a moment would have no experiences, but argues the reason is historical rather than organizational: the system lacks the causal history through which its states would come to track, and therefore represent, worldly features. Because Tye holds that phenomenal character just is representational content of the right kind, a historical condition on content becomes a condition on experience. (The library’s copy carries a “2011” preprint stamp; the published version appears in the Pautz and Stoljar Blockheads! volume, MIT Press 2019.) For the record, Tye announced a move toward panpsychism in 2024. Nothing here depends on that later turn; the historical thesis stands on its own.
6. This is the teleosemantic ingredient, and it is doing quiet but essential work. On Dretske’s account (Explaining Behavior, 1988; Naturalizing the Mind, 1995), a state represents what it has the function of indicating, and functions are acquired through a learning or selectional history in which getting it right and getting it wrong carried consequences. “Stakes” is shorthand for that history: a system for which misrepresentation costs nothing is, on this view, not yet in the business of representation at all. A frozen training corpus supplies correlations in abundance but no such history, which is why scaling the corpus, in any modality, changes the quantity of correlation without manufacturing the one thing teleosemantics says content requires.
7. I bring in Searle’s syntax/semantics argument (“Minds, Brains, and Programs,” 1980) only here, and deliberately not at the front: the educated reader has largely filed the Chinese Room under “answered,” by way of the Systems and Robot replies. But notice that the Robot Reply (the proposal that grounding the symbols in sensors and effectors would supply understanding) concedes precisely this essay’s point. It locates the missing ingredient in embodiment, not in more or richer symbols. Searle himself resists even that, on the ground that bolting transducers onto the room changes nothing happening inside it; whether he is right about that further step is a dispute this essay can leave open, because its target, the claim that multimodal input alone grounds meaning, is one the Robot Reply and Searle both reject.
May 25, 2026