Tech · 45d ago

Fei-Fei Li warns AI neglects physical and embodied world

Fei-Fei Li cautioned that AI development overemphasizes language models while neglecting physical, visual, spatial, and dynamic aspects of the real world. She noted that most economic activity depends on embodied intelligence through seeing, moving, and interacting with the environment. The comments were delivered during an onstage panel discussion, and video clips of the session spread on X.

Original post

Rohan Paul @rohanpaul_ai · #1257 in Tech

Fei-Fei Li warns that AI may be staring too hard at language models. The world is not just text on a screen. It is physical, visual, spatial, and always changing. Most of the economy runs on seeing, moving, interacting, and embodied intelligence.

12:15 AM · May 16, 2026 · 51.1K Views

Sentiment

Positive users endorse Fei-Fei Li's warning that AI overfocuses on language models and call for embodied interaction plus other modalities, while negative users dismiss her as a talking head or sarcastically defend language's role.

Positive: 66.7%

Negative: 33.3%

6 comments with sentiment.

Posts from X

Most Activity

Views 14.9K · Bookmarks 36 · Likes 172 · Replies 22

Gary Marcus @GaryMarcus

💯. Way too much focus on language models.

45d ago

Retweets 63

Rohan Paul @rohanpaul_ai

45d ago · 51.1K views

William Hastings @WillyPete300

@rohanpaul_ai Seriously? Robotic components speak a "language" too. LLMs can speak every human language, C, Java, Python, and... and... and... You think it can't get input from cameras and take action by "talking" to robotic components?

44d ago

MadDog @FearAndMadnUS

@GaryMarcus Google is walking past the big AI companies because they are not just an AI company. Because they are quiet and don't try to get silly headlines. They're the only company close to what China is doing right now. Data center investors beware. https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive

44d ago

on godot @on_godot

lol, in college I worked in the foreign language department as a secretary… and the Russian professor I didn’t have a crush on, but I would full body blush around him. Like I wasn’t attracted to him, but legit full body bright red, fumble my words. His wife also worked in the department. She was always sweet to me and I think she could tell it wasn’t like that… I didn’t ogle or follow.. so I’m going to guess Fei Fei …?

44d ago

Ole Tillmann @oletillmann

@rohanpaul_ai @grok what's the original source for the video?

45d ago

Rohan Paul @rohanpaul_ai

@GaryMarcus so true.

45d ago

on godot @on_godot

@rohanpaul_ai Love her so much!

44d ago

Nick @Nicky_Bonez

@axiomwave_xbt @GaryMarcus LLMs are easy to show off and seem impressive because text modeling (not language!) is relatively easy and humans have powerful innate text modelers. Too bad for AI companies, they are not very useful in the grand scheme of things.

44d ago

Active Kinetic 1 @AK1_tweet

Artificial kinetic intelligence (AKI) moves beyond even physical AI, which is limited to the boundary of on-board, vision and sensors. So Fei-Fei is correct that on screen AI is narrowing the future perception of true AGI. The initial AKI framework Already establishes for deeper capability. Artificial Kinetic Intelligence (AKI) — https://doi.org/10.5281/zenodo.19496506

44d ago

Rohan Paul @rohanpaul_ai

@rtheoryxyz 💯

45d ago

Steadtler @SteadtlerA58435

@rohanpaul_ai Language is the means we use to convey information.

But this is a false premise, all the serious AIs are already multimodal and understand at least images.

44d ago

Matthew White @MatthewWhite000

@rohanpaul_ai Often the results are like the AI STARED into the sun too long… And don’t do that!!

44d ago

Amor Avhad @glass_it

@rohanpaul_ai Check out LeWorldModel which runs on a single GPU. Even Elon Musk inquired about it..

44d ago

Bala Sankar @onthegoAI

@rohanpaul_ai but the fact is, the fundamental architecture is the same: transformer; self attention; multidimensional vector space

datasets would be images, audio and videos

44d ago

Carina Nicolosi @CarinaN818

The missing layer is not just multimodal AI.

It is physical observability.

Humans, machines, and environments do not only exchange information through interfaces. They are already physically coupled through motion, vibration, pressure, heat, electromagnetic activity, sensor contact, latency, and feedback.

That shared layer is not language in the symbolic sense.

It is coupled dynamics.

In that layer, information appears as timing relationships: phase, frequency, amplitude, synchronization, resonance, drift, phase-locking, coherence, and recovery after perturbation.

AI usually operates after this physical layer has already been converted into data: sensed, conditioned, digitized, encoded, and represented.

But the human, the machine, and the environment are already interacting before representation.

The deeper question is not only how to build better models of the world.

It is how to measure the stability of the coupled physical system itself.

44d ago

Volsurface @Volsureface

@rohanpaul_ai overfitting on text feels too common

embodied ai is still the bottleneck

45d ago

Selwyn @SelwynLio

@rohanpaul_ai always been the gap in llm demos

bodied tasks are still solved by humans

45d ago

Article 3 BILL OF RIGHTS @BorgoniaBorgy

@GaryMarcus LLMs are easiest to market and exploit, particularly to those who are ignorant of how the technology works (i.e. CEOs, executives, decision makers, investors, stockholders, etc.). #generativeAI #aigenerated #artificialintelligence #LLMs

44d ago

Mel @0x_Melisso

@rohanpaul_ai embodied ai could unlock the next wave of practical applications beyond chatbots

44d ago