RFM: Action-Oriented Intelligence for Physical AI
Source: Medium, Jun 2026
LG AI Research
From the “Thinking Brain” to the “Action-Oriented Brain”
Recent advancements in AI technology have moved beyond simply generating text and images on a screen, evolving into the era of “physical AI,” where AI interacts with the real world through physical embodiment. If traditional Large Language Model (LLM)-based AI was a “thinking brain” that communicated through language in a virtual world, physical AI is an “acting brain” that perceives the physical environment and performs tasks directly using actual hardware as a medium. At the heart of this transformation lies the Robot Foundation Model (RFM), a universal robotic intelligence that can be applied across various robots and tasks.
Previous robot AI research has primarily focused on training specialized policies optimized for specific tasks. Today, however, the paradigm is shifting toward building Robot Foundation Models (RFMs) that can be applied across diverse environments and robot embodiments. Just as foundation models in natural language processing and computer vision have demonstrated their versatility across various tasks through large-scale pre-training, the core challenge in robotics lies in extending this capability into actions within the physical world.
The growing attention on RFMs stems from the clear limitations of conventional robot learning methods. Previous robot policies were often overfitted to a specific robot, environment, and task. Under this paradigm, performance is highly sensitive to change: even a minor variation in the robot, the environment, or the task can cause a sharp decline in performance.
However, an RFM is far more than a linear extension of an LLM or VLM. This is because robots must go beyond simply understanding text and images; they need to actively interact with the physical world through their actions, observe the outcomes of those behaviors, and dynamically modify their actions when faced with failures. Currently, RFM research is primarily centered around two directions: generating robot actions based on visual and linguistic information, and predicting physical world changes to utilize them for action generation.
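The act, observe, and adjust cycle described above can be sketched as a small closed loop. This is a toy illustration only, not code from any RFM: `ToyEnvironment`, `choose_action`, and the action names `fast_grasp` and `careful_grasp` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Outcome:
    success: bool
    observation: str


class ToyEnvironment:
    """Hypothetical stand-in for a robot and its surroundings."""

    def step(self, action: str) -> Outcome:
        # Assumption for the toy: only the careful grasp ever succeeds.
        success = action == "careful_grasp"
        observation = "object_in_gripper" if success else "object_on_table"
        return Outcome(success, observation)


def choose_action(observation: str, failures: int) -> str:
    """Hypothetical policy: change strategy once an attempt has failed.

    The observation is unused in this toy; a real policy would condition on it.
    """
    return "fast_grasp" if failures == 0 else "careful_grasp"


def run_episode(env: ToyEnvironment, max_attempts: int = 5) -> List[Tuple[str, bool]]:
    """Act, observe the outcome, and modify the action after a failure."""
    observation = "object_on_table"
    failures = 0
    history: List[Tuple[str, bool]] = []
    for _ in range(max_attempts):
        action = choose_action(observation, failures)
        outcome = env.step(action)
        history.append((action, outcome.success))
        observation = outcome.observation
        if outcome.success:
            break
        failures += 1
    return history


if __name__ == "__main__":
    print(run_episode(ToyEnvironment()))
```

The point of the sketch is the feedback path: the result of each action feeds the next decision, which a single text or image generation pass does not need.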
In this context, this article surveys the RFM research landscape along two technical directions. The first is the Vision-Language-Action (VLA) model, which directly maps visual and textual inputs to robot actions. The second is the World Model (or World Action Model) approach, which focuses on predicting future environmental changes alongside action generation.
1. Primary Landscape of RFM Research
RFM research is currently unfolding across two primary technical approaches. The first is the Vision-Language-Action (VLA) framework, which directly generates robot actions based on visual and textual inputs. The second is the World Model (or World Action Model) approach, which focuses on predicting future environmental states alongside action generation.
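One way to see the difference between the two approaches is to compare their input and output signatures. The sketch below is an assumption-laden illustration rather than any published model's API: `Observation`, `VLAPolicy`, `WorldActionModel`, and the toy classes are hypothetical names, and the "models" are trivial rules that only show what goes in and what comes out.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Action = List[float]


@dataclass
class Observation:
    image: List[List[float]]   # placeholder for camera pixels
    robot_state: List[float]   # e.g. joint positions


class VLAPolicy(Protocol):
    """Vision + language in, action out."""

    def act(self, obs: Observation, instruction: str) -> Action: ...


class WorldActionModel(Protocol):
    """Vision + language in, predicted future observation and action out."""

    def predict(self, obs: Observation, instruction: str) -> Tuple[Observation, Action]: ...


class ToyVLA:
    def act(self, obs: Observation, instruction: str) -> Action:
        # Toy rule: the instruction picks the direction of a fixed joint step.
        step = 0.5 if "raise" in instruction else -0.5
        return [step for _ in obs.robot_state]


class ToyWorldActionModel:
    def predict(self, obs: Observation, instruction: str) -> Tuple[Observation, Action]:
        step = 0.5 if "raise" in instruction else -0.5
        action = [step for _ in obs.robot_state]
        # Toy "future prediction": assume each joint moves exactly by the action.
        future_state = [s + a for s, a in zip(obs.robot_state, action)]
        return Observation(image=obs.image, robot_state=future_state), action


if __name__ == "__main__":
    obs = Observation(image=[[0.0]], robot_state=[0.0, 1.0])
    print(ToyVLA().act(obs, "raise the arm"))
    print(ToyWorldActionModel().predict(obs, "raise the arm"))
```

The only structural difference in this sketch is the extra predicted observation returned by the world action model, which is the part that lets a system reason about what its action is expected to change.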
Early VLA models followed a straightforward pipeline of “current observation + textual instruction → action.” Recent work increasingly incorporates reasoning, task context, failure cases, and tactile/force signals to refine control policies. At the same time, joint training on datasets from disparate robot platforms has become a critical focus, because cross-embodiment generalization is the core objective of RFMs. The World Action Model paradigm, in turn, couples video-based future prediction with robot action generation, and is evolving to compensate for the limitations of VLA models in forecasting physical and dynamic changes in the environment.
Image 1. VLA Model and Cosmos World Foundation Model (Source: https://pub.towardsai.net/what-are-world-models-41ff394ed871)
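Joint training across robot platforms runs into a practical mismatch: different robots have action vectors of different lengths. One simple way to handle this, shown here as an assumption and not as the method of any model cited in this article, is to zero-pad every action to a shared dimension and carry a mask so that padded entries do not contribute to the loss. The robot names and the helpers `pad_action`, `build_joint_batch`, and `masked_squared_error` are hypothetical.

```python
from typing import Dict, List, Tuple


def pad_action(action: List[float], target_dim: int) -> Tuple[List[float], List[int]]:
    """Zero-pad an action to target_dim and return a mask marking the real entries."""
    if len(action) > target_dim:
        raise ValueError("action is longer than the shared action dimension")
    pad = target_dim - len(action)
    return action + [0.0] * pad, [1] * len(action) + [0] * pad


def build_joint_batch(samples: Dict[str, List[List[float]]]) -> List[dict]:
    """Mix actions from several robots into one batch with a common dimension."""
    target_dim = max(len(a) for actions in samples.values() for a in actions)
    batch = []
    for robot, actions in samples.items():
        for action in actions:
            padded, mask = pad_action(action, target_dim)
            batch.append({"robot": robot, "action": padded, "mask": mask})
    return batch


def masked_squared_error(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over the unmasked (real) action dimensions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / sum(mask)


if __name__ == "__main__":
    samples = {
        "arm_7dof": [[0.1] * 7],          # hypothetical 7-joint arm
        "gripper_2dof": [[1.0, -1.0]],    # hypothetical 2-joint gripper
    }
    for item in build_joint_batch(samples):
        print(item)
```

With the mask in place, a prediction for the two-joint robot is scored only on its two real dimensions, so whatever the model outputs in the padded slots is ignored.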
1–1. VLA Models: Mapping Vision and Language Directly to Real-World Action
VLA models predict optimal robot actions by taking images or videos, natural language instructions, and the current state of the robot as inputs. Unlike conventional imitation learning, which merely trains a robot to replicate human-demonstrated motions, VLA synthesizes these demonstrations with language instructions and visual grounding. In essence, it extends the semantic understanding inherent in VLMs into the robot’s physical action space.
Research within the VLA paradigm has recently been diversifying through several sophisticated approaches:
- Gemini Robotics [1]: Exemplifies the trend of reinforcing embodied reasoning, including scene understanding, spatial reasoning, and task planning.
- NVIDIA GR00T N1 [2] and Figure AI Helix [3]: Serve as flagship examples of VLA-based RFMs tailored for humanoid robotics. GR00T N1 aims to be a generalist humanoid robot foundation model, while Helix was introduced as a generalist humanoid VLA capable of performing diverse household object manipulations and multi-robot collaborations based on natural language instructions.
- Physical Intelligence π0 Series [4,5,6,7]: Pursues a generalist robot policy, showing a distinct trend toward minimizing recurring failure modes in real-world deployment. It achieves this by utilizing rich context to condition policies and incorporating autonomous rollouts, failure cases, and human interventions directly into the training loop.
- RLWRLD's RLDX-1 [8]: Represents a unique pivot within the VLA domain that emphasizes dexterous manipulation. While conventional VLAs heavily prioritized vision-and-language-based universality,...