Mind, Matter, and Meaning

Tag: symbol grounding

What a Machine Would Have to Earn
MIND · MATTER · MEANING No. 29 · May 2026

Understanding is earned in a world, not performed on a screen.

An essay · mindmatterandmeaning.com

A friend sent me a transcript last spring. He had asked a chatbot what a sunburn feels like the morning after — that specific tight, hot, can’t-find-a-way-to-lie-down misery — and the machine answered better than he could have. It named the flinch when a shirt seam drags across the shoulders. It knew the small betrayal of forgetting for a second and leaning back into a hot car seat. He found it uncanny, a little moving, and he wanted to know: does it understand what a sunburn is?

Good question, asked at the right moment. The honest answer takes a while to earn, so let me start with the answer most of us reach for first — because it’s reasonable, and because it’s wrong.

The reasonable view goes like this. Understanding shows up in what you can do. A student who can answer any question about the French Revolution, field the follow-ups, catch the trick ones, and explain the whole thing to a ten-year-old — that student understands the French Revolution, and we would be cranks to deny it on the grounds that we can’t peer inside her skull. Understanding is as understanding does. So if a machine handles every question about sunburns as well as a sunburned person could, the difference between the machine and the person starts to look like a difference we invented to feel special about ourselves. The picture has a respectable pedigree: it descends from behaviorism, and it has a famous instrument in Alan Turing’s imitation game, where the test for thinking just is indistinguishable performance.

Notice the quiet assumption, though. The picture takes understanding a word to be a matter of using it correctly, and takes “correctly” to be settled by looking only at the outputs. Pull on that thread and the whole thing comes apart in your hands.

Stevan Harnad, a cognitive scientist with a gift for naming traps, named this one in 1990: the symbol grounding problem.¹ Imagine trying to learn Chinese from a Chinese-only dictionary. Every definition sends you to other entries, which send you to others, and you ride that merry-go-round forever without once touching the ground. A system whose symbols are defined only by more symbols never means anything by them. Meaning gets in only when some of the symbols connect to the things they are about by some route other than further symbols — when “red” hooks to red, not merely to “crimson,” “scarlet,” and “the color of a stop sign.”

What supplies the hook is not anything inside the system. Hilary Putnam made the case unforgettable with a thought experiment about Twin Earth — a planet just like ours except that the stuff they call “water” there is some other compound with all of water’s surface features.² A person here and their molecular duplicate there can be internally identical, down to the atom, and still mean different things by “water,” because the word answers to the stuff in the world, not to the state of the head. “Meanings,” Putnam wrote, “just ain’t in the head.” Tyler Burge pushed the same point from the social side: what your word “arthritis” picks out depends on the practice of the community you defer to, not on a private definition you carry around.³ Content lives in a relation — between a system, a world, and the company it keeps.

There is even a natural story about how the relation gets built. On teleosemantic accounts — Ruth Millikan’s and Fred Dretske’s, chiefly — a state comes to be about something by acquiring the function of tracking it, the way a frog’s strike comes to be about flies through a long history in which catching flies is what kept frogs going.⁴ The clinching detail is misrepresentation: to get something wrong, a system has to have been in the business of getting it right. A state can mean fly and fire at a passing pellet only because its job, fixed by history, was flies. No history, no job; no job, nothing to be mistaken about; nothing to be mistaken about, no content.

So understanding a word turns out to be an achievement, not a knack: it consists in having states that are genuinely about the world — not states that merely accompany the right answers, but states directed at the very things the words name — and aboutness is something a system earns over time. Your “red” means red because red things have been pushing on you, through eyes and skin and the small stakes of an actual life, since before you could pronounce the word. This is what people are gesturing at, usually too vaguely, when they say minds are embodied. The word invites mysticism, so let me drain it of any. Embodiment names three sober requirements: the system takes in the world through senses and acts back on it; its inner states have been shaped by real traffic with the features they represent; and those states are there to track a world the system inhabits, not merely to emit the right strings. Michael Tye — who spent three decades building the most careful theory we have of how experience could be nothing more than representational content, and then argued that even his own theory needs history — makes the sharpest version of the point. Two creatures could be intrinsically identical at an instant, he argues, and still differ in what they experience, because one has a past of tracking the world and the other was assembled, atom for atom, five minutes ago.⁵ History is not decoration on content. It is part of what fixes it.

Which lets me say, at last, what a machine would actually need. Not the right stuff — I don’t think the barrier is silicon, and here I part company with John Searle, who ties understanding to the specific causal powers of biological brains.⁶ The barrier isn’t carbon; it’s a world. A system understands when its inner states have been shaped by, and stay answerable to, the things they represent — when it senses and acts, lives under stakes, and can get things wrong and pay for it. Build that, and the door to genuine artificial understanding stands open. I mean open, not slyly closed. The claim here is not the tired one that machines could never understand. It is that understanding is earned through engagement, and there is no coupon for skipping the engagement.

Skipping the engagement is precisely what today’s text-only language models do. A large model learns the statistics of how we talk — the staggeringly intricate shape of which words follow which — from a corpus of descriptions of the world, never from the world.⁷ It has read everything ever written about sunburns and has never once had skin. Its “red” is a position in an immense map of words, anchored to other words, anchored to nothing outside the map. The fluency is real and the achievement is genuine; it is simply not the achievement of understanding.

Here the strongest objection arrives, and it deserves a real hearing rather than a brush-off. If the machine’s answers became indistinguishable — in principle, not merely in today’s practice — from an understander’s, then insisting it still lacks understanding looks like clinging to a ghost. A difference that makes no detectable difference, the objection runs, is no difference at all. That is the whole moral of the imitation game, and it is not a silly one.

But “makes no difference you can detect in the output” is the definition of a good simulation, not the absence of a difference. Simulate a hurricane to any precision you please: the equations are flawless and your desk stays bone dry. Modeling a process is not running it.⁸ Two systems can produce the very same words while one means them and the other reports the statistics of how the word gets used — because meaning was never a property of the output. It lives in the history behind the output, and that history is exactly what an output test cannot see. The objection mistakes the instrument for the quarry. It notices that the meter reads the same and concludes there is nothing the meter is missing.

So: does the machine understand what a sunburn is? It has never had skin. It has never flinched, never dreaded an evening because of how the sheets would feel. It holds the words and not the world the words are about. Ask the question again in some later decade, of some later system that has spent years bumping into things and paying for its errors, and the answer could come back different — that is the part the doom-mongers and the hype-merchants both manage to miss. Understanding is not a performance a system delivers. It is a debt a system pays, to the world, in the one currency the world accepts: contact. Until the bill comes due, fluency is only fluency. It was always going to be the easy part.

References

Burge, Tyler. 1979. “Individualism and the Mental.” Midwest Studies in Philosophy 4: 73–121.

Dretske, Fred. 1988. Explaining Behavior: Reasons in a World of Causes. Cambridge, MA: MIT Press.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D 42: 335–346.

Harnad, Stevan. 2002. “Symbol Grounding and the Origin of Language.” In Computationalism: New Directions, edited by Matthias Scheutz, 143–158. Cambridge, MA: MIT Press.

Havlík, Vladimír. 2025. “Meaning and Understanding in Large Language Models.” Synthese 205: 9.

Millikan, Ruth Garrett. 1989. “Biosemantics.” Journal of Philosophy 86 (6): 281–297.

Putnam, Hilary. 1975. “The Meaning of ‘Meaning.’” In Mind, Language and Reality: Philosophical Papers, Volume 2, 215–271. Cambridge: Cambridge University Press.

Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–457.

Searle, John R. 1990. “Is the Brain a Digital Computer?” Proceedings and Addresses of the American Philosophical Association 64 (3): 21–37.

Tye, Michael. 2019. “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology.” In Blockheads! Essays on Ned Block’s Philosophy of Mind and Consciousness, edited by Adam Pautz and Daniel Stoljar. Cambridge, MA: MIT Press.

Notes
1. Harnad (1990) coined “the symbol grounding problem” and framed it with the Chinese-dictionary regress; he later tied it to the origin of language (Harnad 2002). The problem is older than the label — it is the computational heir of the externalist worry about how any representation latches onto its object — but Harnad’s formulation is the one the AI literature inherited, and it is sharper than the Chinese Room for present purposes because it isolates grounding from Searle’s further claims about consciousness. ↩
2. Putnam (1975). The conclusion is specifically about reference and extension: the content that fixes what “water” is true of does not supervene on the speaker’s intrinsic states. Note that Putnam later qualified his own semantic externalism in several directions; nothing here turns on the most contested versions of the thesis, only on the minimal claim that reference depends on causal-environmental relations the head alone does not settle. ↩
3. Burge (1979) extends externalism from natural-kind reference (Putnam) to social content: holding a thinker’s physical history fixed while varying the surrounding linguistic community varies which concept the thinker exercises. The two cases are independent routes to the same structural conclusion — internal organization underdetermines content — which is why the essay leans on both rather than treating Burge as a footnote to Putnam. ↩
4. The teleosemantic tradition, principally Millikan (1989) and Dretske (1988), grounds content in proper function: a state represents what it has the function of tracking, where functions are fixed by selection or learning history. Misrepresentation is the standard adequacy test for any naturalistic theory of content, since a theory on which states cannot be false has not yet described representation. Rival tracking theories handle reliable misrepresentation differently, but the historical structure — content fixed by what a state was for — is common ground and is what the embodiment argument borrows. ↩
5. Tye (2019). The thesis is that two beings intrinsically alike at a time can differ in phenomenal character because they differ in history — a representationalist’s concession that current intrinsic structure does not suffice. Ned Block replies in the same volume (“Fading Qualia: A Response to Michael Tye”) that a subject could be radically wrong about their own phenomenology; the disagreement is real and unresolved, and the essay sides with Tye while granting that Block has located the genuine pressure point. That Tye, of all people, reaches for history is the relevant fact: the most developed representationalism on offer does not think structure alone fixes content. ↩
6. Searle (1990) argues that computation is observer-relative — a physical system “computes” only under an interpretation we assign — so computational description cannot, by itself, explain intrinsic intentionality. The essay takes this negative point and leaves Searle’s positive doctrine behind. Searle’s biological naturalism holds that only the specific causal powers of brains can produce understanding; the view defended here replaces “the right biology” with “the right causal-environmental engagement,” which a non-biological system could in principle possess. The negative argument survives the amputation of the positive one. ↩
7. Not everyone takes the contact gap to be fatal, and the most direct contrary voice deserves naming. Vladimír Havlík (2025) argues the reverse of this essay’s conclusion — that large language models do ground the meanings of their expressions, by way of what he calls semantic fragmentism, so that grounding in worldly reference is not a precondition of understanding. I think this mislocates the gap rather than closing it. Semantic fragmentism can explain how a model’s tokens come to bear stable relations to one another; the externalist and teleosemantic considerations above concern what fixes the relation between a token and the world, which is precisely what a text-only training signal never touches. The architectural premise is not what divides us — a text-only model is trained to predict the next token over a corpus of text, full stop — what divides us is whether that suffices for content, and Havlík’s affirmative answer is the live position this essay rejects. ↩
8. The simulation/realization distinction is Searle’s answer to the Brain Simulator Reply in “Minds, Brains, and Programs” (1980), generalized: a model of a process is not an instance of it, and whatever a process owes to its physical realization is not delivered by a description of that realization, however exact. The hurricane example makes the point without the contested premises about consciousness (no one is tempted to say the simulated storm is wet), which is why it does cleaner work here than the Chinese Room. ↩
May 25, 2026
Multimodality and the Symbol-Grounding Problem
MIND · MATTER · MEANING No. 31 · May 2026

Adding eyes to a language model gives it more pictures, not a world.

An essay · mindmatterandmeaning.com

Hold a bruised avocado up to the newest chatbot and it will tell you, with a confidence you have never once earned at a produce counter, that the fruit has about a day left and you should make the guacamole tonight. It can see the avocado. That is the pitch, anyway, and it lands. After years of watching these systems shuffle words around — predicting the next token the way a very well-read parrot predicts the next syllable — here at last is one that looks at your kitchen and answers.

The demos impress, and the feeling they produce is specific: the machine has finally made contact. The symbols have touched down. Whatever gap the text-only models left open, the one that made us suspect the parrot didn’t know what it was saying,¹ surely closes the moment you give the thing eyes.

Here is the story almost everyone now tells, and I told a version of it myself for longer than I’d like to admit. The old language models lived sealed in a room of words. “Apple” meant nothing to them beyond its statistical company — the other words it tends to travel with. No wonder they made things up; they had never met an apple. But bolt on a camera and a microphone, and “apple” stops being a token rubbing shoulders with other tokens and becomes the round red thing on the counter. Multimodality, on this telling, just is grounding. It is the rope that finally ties the words to the world.

It is a natural thought, and something in its neighborhood is even correct. But the conclusion doesn’t follow, and seeing why it doesn’t pays better than any demo.

Start with what a multimodal model actually eats. It does not eat avocados. It eats images of avocados — arrays of numbers, paired during training with text that humans wrote about them. A photograph has not smuggled a piece of the world into the machine. A photograph is a representation: a flat, frozen, human-made encoding, every bit as much a symbol as the word “avocado,” only written in a richer alphabet. Feed a model a billion captioned pictures and you have fed it a billion more descriptions of the world. You have handed it more symbols, in a new code. You have not handed it more world.

This is the trap Stevan Harnad named in 1990, and Harnad — a cognitive scientist who has spent the better part of his career worrying about how a symbol ever comes to be about anything — gave it a form worth keeping.² Imagine trying to learn Chinese from a Chinese-Chinese dictionary. Every word gets defined in terms of other words, which lead to still other words, around and around, and you never once step outside the circle of symbols to the things they name. No amount of definition conjures meaning out of more symbols; the chain has to touch ground somewhere. Somewhere a symbol has to connect to the thing — not to another symbol — through the system’s own capacity to pick that thing out, sort it, act on it.

Harnad had a sharp way of pricing this. Language, he wrote, lets us “steal” categories quickly and cheaply, through hearsay — I can tell you what a zebra is and spare you the safari. But theft works only because somebody, somewhere, earned the category the hard way, through what he called sensorimotor “toil”: the trial and error of dealing with actual zebras, guided by the cost of getting it wrong. It cannot be theft all the way down.³

And theft all the way down is exactly what multimodality quietly proposes. It tries to buy grounding with a bigger pile of borrowed representations. But a photograph of a zebra is more hearsay, not the safari. The richer alphabet is still an alphabet, and an alphabet, however many characters you add to it, is the kind of thing that needs grounding — never the kind of thing that supplies it.

There’s a deeper reason the input’s richness can’t do the job, and it arrives from the least mystical corner of philosophy. Hilary Putnam — who revised his own positions so often, and so cheerfully, that the restlessness became part of his reputation — argued in 1975 that meanings “just ain’t in the head.”⁴ What a thought is about depends on how the thinker stands to the world, not only on what is happening inside. Two systems can be alike down to the last detail and still mean different things, because they have different histories of contact with different surroundings. Michael Tye, who built one of the most careful versions of the view that an experience just is a way of representing the world, pressed the same point about minds: what a state represents depends partly on the causal history through which the system came to have it.⁵ A system that has tracked ripeness — reached for fruit, been right, been wrong, paid the difference in a bad lunch — has states that are about ripeness. A system assembled from a frozen archive of ripe-labeled photographs has states that are about how humans tended to label photographs. Which is not nothing. It is just not ripeness.

So here is the distinction the grounding story walks straight past. Multimodality adds modalities of representation — more kinds of symbol the system can take in. It does not add modalities of engagement — sensors wired to actuators in a world the system inhabits, a history of tracking real features, and some stake in getting it right.⁶ The first is a matter of feeding the model new file formats. The second is a matter of putting the model on the line. They are not the same project, and no quantity of the first sums to the second. The avocado demo feels like seeing. But seeing is something a creature does in a world it can be wrong about and suffer for being wrong about. What the model does is map an array of numbers onto a likely sentence.⁷ It has never been hungry. It has never been fooled. It has never cut into one and found mush.

The strongest reply grants most of this and turns it around. Fine, the objector says — you’ve already admitted an embodied system could mean things. And multimodal models are precisely the perception stack going into embodied systems: the same vision encoders that caption your avocado get bolted onto robots that pick things up. So you’re knocking down a strawman. Nobody serious claims a static image model is grounded; the claim is that multimodality is step one toward a system that is. The trajectory is the point.

This objection is right about nearly everything, and I want to be careful, because where it’s right is exactly what matters. Yes — a robot that acts in a world, tracks what it touches, and pays for its mistakes could come to mean something by “avocado.” I have no objection in principle; the door stands open. But notice what does the work in that story. The grounding gets accomplished by the acting-in-a-world — the closed loop, the tracking, the stakes — and not by the number of input channels feeding the network. A simple creature with one sense and a body on the line stands nearer to meaning than a thousand-modality oracle trained on a frozen scrape of the internet. So the honest version of the trajectory claim is not “multimodality grounds language.” It is “embodiment might, and multimodality is some of the plumbing.” Those two sentences advertise very different products. The first hands you grounding you have not paid for. The second admits the bill is still outstanding.

The avocado on your counter is ripe or it isn’t, and you settle the question the only way anyone ever has: you cut it open — a small risky act in a world that pushes back and now and then embarrasses you. The model has never once been embarrassed, because it has never been anywhere it could be wrong. Giving it a camera changed what it can be shown. It did not change what it can be answerable to — and answerability to the world, not access to more pictures of it, was the whole of what we were missing. We did not open the model’s eyes. We widened the window of the room it was always in, and hung a sharper picture in the glass.

References

Burge, Tyler. 1979. “Individualism and the Mental.” Midwest Studies in Philosophy 4: 73–121.

Dretske, Fred. 1988. Explaining Behavior: Reasons in a World of Causes. Cambridge, MA: MIT Press.

Dretske, Fred. 1995. Naturalizing the Mind. Cambridge, MA: MIT Press.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D 42: 335–346.

Harnad, Stevan. 2002. “Symbol Grounding and the Origin of Language.” In Computationalism: New Directions, edited by Matthias Scheutz, 143–158. Cambridge, MA: MIT Press.

Havlík, Vladimír. 2024. “Meaning and Understanding in Large Language Models.” Synthese 204: 71.

Putnam, Hilary. 1975. “The Meaning of ‘Meaning.’” Minnesota Studies in the Philosophy of Science 7: 131–193.

Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–457.

Tye, Michael. 2019. “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology.” In Blockheads! Essays on Ned Block’s Philosophy of Mind and Consciousness, edited by Adam Pautz and Daniel Stoljar. Cambridge, MA: MIT Press.

Notes
1. The suspicion is not universal, and honesty requires flagging the dissent. Vladimír Havlík argues that Searle’s assumption of an unbridgeable gap between syntax and semantics is unjustified, and that meaning of a kind can emerge from the distributional and inferential structure a large model internalizes (Havlík 2024). I take the disagreement seriously but read it as a quarrel over what “meaning” must answer to. If content is individuated by world-involving causal relations (see notes 4–6), then distributional structure recovers how a linguistic community uses a term without recovering what anchors the term to the world. On that reading the parrot worry is relocated, not dissolved — which is why this essay presses on grounding rather than on usage. ↩
2. Harnad, “The Symbol Grounding Problem” (1990), poses the problem through the image of trying to learn a first language from a Chinese-Chinese dictionary: an endless circuit of symbol-to-symbol definition that never reaches the world. The claim is not that symbols can never refer, but that reference cannot be conferred by further symbols alone — the regress must terminate in a non-symbolic capacity to identify a category’s members. Note that Harnad’s diagnosis is considerably friendlier to connectionism than Searle’s: the grounding he demands is sensorimotor categorization, a task he takes neural networks to be well suited to learn, given the right embodiment. The argument here is therefore not anti-connectionist; it is anti–disembodied-connectionist. ↩
3. Harnad, “Symbol Grounding and the Origin of Language” (2002): “What language allows us to do is to ‘steal’ categories quickly and effortlessly through hearsay instead of having to earn them the hard way, through risky and time-consuming sensorimotor ‘toil.’” The theft/toil contrast is his. The application is mine: a model trained exclusively on representations attempts the theft with no underwriting toil anywhere in its causal history — not its own, and not, in any content-fixing way, the photographers’. The captioned-image corpus is a vast ledger of other people’s earnings that the model never made. ↩
4. Putnam, “The Meaning of ‘Meaning’” (1975). Twin Earth fixes the individuation of content by external relations: my molecular twin and I, internally identical, mean different substances by “water” because our environments differ (H₂O here, the look-alike “XYZ” there). Burge (“Individualism and the Mental,” 1979) extends the externalism to the social environment. I lean only on the modest thesis — that internal richness underdetermines content — and not on any stronger claim about whether phenomenal character itself is wide. The modest thesis is enough to sink “more pixels equals more meaning.” ↩
5. Tye, “Homunculi Heads and Silicon Chips: The Importance of History to Phenomenology” (2019). Tye accepts Block’s verdict that a “China-body system” duplicating our functional organization at a moment would have no experiences, but argues the reason is historical rather than organizational: the system lacks the causal history through which its states would come to track — and therefore represent — worldly features. Because Tye holds that phenomenal character just is representational content of the right kind, a historical condition on content becomes a condition on experience. (The library’s copy carries a “2011” preprint stamp; the published version appears in the Pautz and Stoljar Blockheads! volume, MIT Press 2019.) For the record, Tye announced a move toward panpsychism in 2024; nothing here depends on that later turn — the historical thesis stands on its own. ↩
6. This is the teleosemantic ingredient, and it is doing quiet but essential work. On Dretske’s account (Explaining Behavior, 1988; Naturalizing the Mind, 1995), a state represents what it has the function of indicating, and functions are acquired through a learning or selectional history in which getting it right and getting it wrong carried consequences. “Stakes” is shorthand for that history: a system for which misrepresentation costs nothing is, on this view, not yet in the business of representation at all. A frozen training corpus supplies correlations in abundance but no such history — which is why scaling the corpus, in any modality, changes the quantity of correlation without manufacturing the one thing teleosemantics says content requires. ↩
7. I bring in Searle’s syntax/semantics argument (“Minds, Brains, and Programs,” 1980) only here, and deliberately not at the front: the educated reader has largely filed the Chinese Room under “answered,” by way of the Systems and Robot replies. But notice that the Robot Reply — the proposal that grounding the symbols in sensors and effectors would supply understanding — concedes precisely this essay’s point. It locates the missing ingredient in embodiment, not in more or richer symbols. Searle himself resists even that, on the ground that bolting transducers onto the room changes nothing happening inside it; whether he is right about that further step is a dispute this essay can leave open, because its target — the claim that multimodal input alone grounds meaning — is one the Robot Reply and Searle both reject. ↩
May 25, 2026
