"Intelligence emerges from observing, interacting, and continually learning from other intelligent systems."

Most natural intelligence systems, from insects to dolphins, learn from other agents in their environments from birth. They acquire new skills and behaviors by observing and interacting with these agents. The observations, whether active (interacting and watching the reaction) or passive (watching without interacting), range from body movements and facial expressions to language. These observations, I argue, are a projection of an intelligent system. By learning to replicate certain behaviors, or to associate them with certain observations, the new agent begins to exhibit intelligent behavior.

Conversely, most artificial intelligence systems we build are trained on a projection of human intelligence. From ImageNet labels to segmentation annotations to web text, all are projections of human intelligence. For instance, an ImageNet-trained ResNet performs very close to human level on the plane spanned by its 1,000 object labels; we could say we "align" the model to this specific human behavior. Similarly, Large Language Models (LLMs) observe human intelligence as a projection onto the language space, and they come to behave like human intelligence in this space, i.e., they are "aligned". While this is not full human intelligence, LLMs excel at mimicking human language, which is all we need on this plane of intelligence. Recent advances demonstrate that LLMs are indeed very close to human behavior on this plane.
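To make the "alignment to a projection" point concrete, here is a minimal, hypothetical PyTorch sketch: the model never sees the annotator's intelligence, only the labels it emitted, and training simply pulls the model toward that projection. The shapes and the linear "head" are stand-ins for illustration, not an actual ResNet pipeline.

```python
import torch
import torch.nn as nn

# Toy sketch: "aligning" a model to a projection of human intelligence.
# The human-provided labels ARE the projection; the model never observes the
# annotator's full intelligence, only its shadow on the label space.
# (Hypothetical shapes: 512-dim features, 1000 ImageNet-style classes.)
model = nn.Linear(512, 1000)                   # stand-in for a ResNet classification head
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 512)                # placeholder image features
human_labels = torch.randint(0, 1000, (32,))   # the human projection onto 1,000 labels

for _ in range(10):                            # a few toy gradient steps
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, human_labels)       # match the projection, nothing more
    loss.backward()
    optimizer.step()
```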

However, to achieve full human-like intelligent behavior, we need to expand beyond our current capabilities. This requires:

1. Intelligent behavior across multiple planes of projection.
2. Continual interaction with other agents and modification of behavior over time.

To understand what I mean by "observation as a projection of intelligence", consider a dog. From birth, it observes other dogs, its owners, their instructions, and the positive or negative reinforcements it receives. All of these observations are projections of an intelligent system. In nature, we do not clone brain cells to replicate an intelligent system; the system itself is sealed away, and only through observations does another agent acquire certain skills and behaviors. We can only observe certain intelligent behaviors, and those observations form a projection plane of intelligence. For example, humans have been writing for millennia, compressing part of their intelligence into words; this is a projection of human intelligence onto the plane spanned by written text. To verify whether a system is behaving intelligently, we observe it and compare it to an already established intelligent system on a given task. For instance, if a self-driving car behaves very close to, or better than, a human driver, we could say it exhibits intelligent behavior in driving. This is very similar to the "Turing test", except the observations can be anything: talking to the system, driving, walking, or cleaning a room. We can observe intelligence on various planes and measure how close it is to human intelligence.
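As a toy illustration of this Turing-test-like comparison, one could score a candidate system by its agreement with an established intelligent system on the same observations. The function and the example behaviors below are purely illustrative placeholders, not a proposed benchmark.

```python
# Minimal sketch: compare a candidate agent to an established intelligent system
# on a given plane by scoring how often their observable behaviors match.
def agreement(reference_behavior, candidate_behavior):
    matches = sum(r == c for r, c in zip(reference_behavior, candidate_behavior))
    return matches / len(reference_behavior)

# e.g. actions taken in the same driving scenarios (hypothetical data)
human_driver     = ["stop", "turn_left", "yield", "accelerate"]
self_driving_car = ["stop", "turn_left", "yield", "brake"]
print(agreement(human_driver, self_driving_car))  # 0.75
```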

If we aim to build human-like intelligence systems, we should, in theory, learn from a space in which human intelligence is well projected, or from a set of spaces that together cover nearly all of human-like intelligence. As humans, we observe the scene, the agent's movements, body language, expressions, and tone of voice. If we had all of these observations, we might be able to train a system that behaves very close to humans, from talking and walking to manipulating objects and interacting with other people.

Currently, we have a large yet partial set of observations of human intelligence (LLMs and some vision-language models). The observations these models are trained on are partial, encompassing only language, sometimes vision, and perhaps audio and depth. We are still missing many observations essential to a full system, from interactions with objects and people to navigation through the world. Each observation space has a different scale of data and a different angle of projection from full intelligence.
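A rough sketch of what combining partial observation spaces could look like, assuming one encoder per modality projected into a shared space. The modality names and dimensions are made up for illustration; a real system would use pretrained encoders and far richer fusion.

```python
import torch
import torch.nn as nn

# Toy sketch: each observation space (plane) gets its own encoder; the model
# fuses whichever planes are available, since every plane is only a partial view.
class MultiProjectionModel(nn.Module):
    def __init__(self, modality_dims, hidden=256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, hidden) for name, dim in modality_dims.items()}
        )
        self.head = nn.Linear(hidden, hidden)

    def forward(self, observations):
        # missing modalities are simply absent from the dict
        encoded = [self.encoders[name](x) for name, x in observations.items()]
        return self.head(torch.stack(encoded).mean(dim=0))

# Hypothetical dimensions for three observation spaces
model = MultiProjectionModel({"text": 768, "vision": 512, "audio": 128})
out = model({"text": torch.randn(1, 768), "vision": torch.randn(1, 512)})
print(out.shape)  # torch.Size([1, 256])
```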

The second missing ingredient is interaction and continual learning. When the system moves beyond learning from observations alone and starts to interact with other agents, the learning process gains new constraints. From physical to social to survival pressures, the agent now has to respect the constraints generated by its environment and by other agents. From gravity to emotions to social norms, the learning agent needs to update its world model according to the new observations its interactions generate. For example, to keep an apple from falling on the floor, we need to hold it. We may have watched other agents do it, but to find the correct grip and know how much force to apply, we need to hold it ourselves. The interaction generates new observations: too little force and the grip slips and the apple falls; too much force and the apple is bruised. We learn from this, and next time we know how much force to apply.
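A toy sketch of the grip example, with invented "physics": the only feedback the agent gets is whether the apple slipped or was bruised, and that feedback exists only because the agent acted.

```python
# Toy sketch of learning from interaction rather than observation alone.
# The force thresholds below are arbitrary, illustrative numbers.
MIN_GRIP, MAX_GRIP = 3.0, 5.0   # hypothetical force band that holds the apple safely

def environment_feedback(force):
    if force < MIN_GRIP:
        return "slipped"        # too little force: the apple falls
    if force > MAX_GRIP:
        return "bruised"        # too much force: the apple is damaged
    return "held"

force, step = 1.0, 0.5
for attempt in range(20):
    outcome = environment_feedback(force)
    if outcome == "held":
        break
    # this observation only exists because the agent interacted with the apple
    force += step if outcome == "slipped" else -step

print(f"settled on force {force:.1f} after {attempt + 1} attempts")
```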

In conclusion, I firmly believe that to reach near-human intelligence, we need a scalable, multi-modal model that can replicate human behavior across multiple projections, interact with other agents, and continually learn from them.