C. Hinsley

28 July 2020


There are three primary techniques for skill acquisition:

The third of these, I argue, is what will finally give rise to human-level artificial intelligence.


Young children obtain knowledge through experience; they must either learn directly by bootstrapping reward signals from the environment or receive guidance from caretakers. This guidance may be in the form of proxy reward signals, such as praise or scolding, or it may be in the form of demonstration, so that the child can mimic the behavior in order to establish traction in an initial learning trajectory.

The computational analogue of this strategy is reinforcement learning (RL), in which an agent learns by exploration and experimentation. Both direct and indirect learning are available via bootstrapped reward signals, which may either be embedded in the agent as natural responses to particular events observed in the environment (the direct case) or be provided ad hoc by an external observer (the proxy case).
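The direct case can be sketched concretely. Below is a minimal tabular Q-learning agent on a toy five-state corridor; the environment, reward scheme, and hyperparameters are illustrative assumptions of mine, not anything from this essay. The agent receives a reward signal embedded in the environment (reaching the goal state) and bootstraps value estimates from it through experimentation alone:

```python
import random

# Toy corridor: states 0..4, reward only upon reaching state 4.
# All details here (environment, hyperparameters) are illustrative.
N_STATES = 5
ACTIONS = (-1, +1)  # step left or step right

def step(state, action):
    """Environment dynamics: clamp movement to the corridor, reward at the goal."""
    next_state = max(0, min(N_STATES - 1, state + action))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice: the "exploration and experimentation"
            # described above, with random tie-breaking among equal values.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
            next_state, reward, done = step(state, action)
            # Bootstrapped update: the value estimate of the next state stands
            # in for future reward, propagating the goal signal backward.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
# Greedy policy extracted from the learned values: in every non-goal state,
# the agent should have learned to step right, toward the reward.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

The key point for the argument above is that no one ever tells the agent which action is correct; the reward event alone, filtered backward through the value estimates, shapes the policy.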

Imitation learning is the computational analogue of a child's mimicry. Children naturally imitate the behaviors of their parents, friends, and other role models. This strategy is useful because behavioral nuances like pen-holding or ball-throwing cannot typically be comprehensively expressed in language (barring qualitative cues like "don't bear down so hard!" in handwriting, or "follow through!" in pitching), and they involve such refined control that experimentation and practice alone cannot match the rate of improvement achieved by copying the behaviors of an expert.

For machine learning agents, imitation learning requires sample behavior that is more successful (able to achieve greater reward) than the mimicking agent's current policy; more precisely, it requires sample behavior that the mimicking agent can approximate through incremental policy updates in order to achieve greater reward. This allows complex policies to be "taught" to a learning agent that would not otherwise be attainable: often, a learning trajectory must be maintained for some time before it becomes apparent that the trajectory will be fruitful at all. In the context of human-level performance (HLP), imitation learning can often be useful for reaching a satisfactory result; however, once an agent breaches the HLP baseline, learning can be quicker for an agent that reached HLP purely via reinforcement learning (without imitation) than for one that accelerated its progress to HLP by imitation learning. The result in the AlphaGo Zero paper is a clear example of this phenomenon.
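The simplest form of approximating sample behavior is behavioral cloning: fit the policy directly to an expert's state-action pairs as a supervised problem, with no reward signal at all. The sketch below uses a hypothetical expert on a toy corridor (states 0 through 4, goal at 4); the expert, environment, and majority-vote fit are all illustrative assumptions, not a method from this essay.

```python
from collections import Counter, defaultdict

# Toy corridor with goal at state 4; all details here are illustrative.
GOAL = 4

def expert_policy(state):
    """Hypothetical expert: always steps right, toward the goal."""
    return +1

def collect_demonstrations():
    """Roll the expert out once from each non-goal start state,
    recording (state, action) pairs as demonstration data."""
    demos = []
    for start in range(GOAL):
        state = start
        while state != GOAL:
            action = expert_policy(state)
            demos.append((state, action))
            state += action
    return demos

def behavioral_cloning(demos):
    """Fit a policy to the demonstrations by majority vote per state,
    a trivial supervised approximation of the expert's behavior."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

cloned = behavioral_cloning(collect_demonstrations())
```

The cloned policy inherits the expert's competence immediately, without ever experiencing a reward, which is exactly the shortcut the paragraph above describes; its weakness is that it can be no better than the demonstrations it was fit to.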

Indeed, tasks that may be infeasible through reinforcement learning alone may become viable with imitation learning. In the interest of achieving general HLP, however, it is not clear that imitation learning will bridge the gap between traditional reinforcement learning capabilities and HLP across a large number of tasks. Many tasks for which it is difficult to define an objective reward function are simple for humans to evaluate subjectively, and there have been promising results in recent years in guiding RL agents with subjective feedback. It may be that imitation learning finds its greatest utility in helping RL agents quickly mimic, and thereby approach the performance of, deterministic hard-coded agents and algorithms, or in some other similarly narrow use case I haven't considered. Outside of intuition-heavy games with large state-action spaces, it does not seem as though imitation learning will become the primary driver of improvement in RL agents.

The problem with imitation learning is pedagogical: many tasks are just damned hard to teach. Consider how much time and effort goes into both initially and continually training humans to perform jobs, for instance. Firstly, for most non-trivial responsibilities, a decade-and-a-half of formal, standardized education is a prerequisite to establish the conceptual and literary foundations for communicating and resolving the problems that find their way to individuals in an organization. For more demanding careers, next comes field-specific training, which may be an additional two to four years of even more intense study and guided critical refinement. Upon actually beginning the job, the individual must remain closely supervised for an extended period until uncertainty about productivity and the maintenance of liabilities (and, of course, about whether the individual fits into the company culture!) can be reduced to an acceptable level. During this period, guidance is provided constantly by all other individuals in the organization who share liabilities with the new hire. Finally, once the individual has settled into the responsibilities of the job, continual reduction of this performance uncertainty must be sustained through circumstantial study and relatively minor supervision.

Any general human-level artificial intelligence must be able to fulfill the organizational demands a human employee is subject to. An obvious benefit of a hypothetical learning agent able to match the post-educational performance of a human is that, once one such agent has done so, it can be duplicated ad infinitum, skipping the first twenty years of cognitive development it would be bound to if it were human. Unfortunately, this is a very lofty hypothetical; many of the dynamics of childhood and adolescent development are shaped by very specific environmental norms which, when removed or altered even slightly beyond what is typical, can cause major defects in a person's performance and assimilation into working culture.


Taking inspiration from the brain

Firstly, to determine what is required to achieve general HLP in a mechanical system, we must consider that the physiological and neurological configuration of the human brain, particularly during development, plays a key part in a person's behavior. At the macro level, the brain comprises multiple lobes that develop somewhat independently through childhood. These lobes develop in response to related stimuli: the linguistic portions of the brain remain underdeveloped into adulthood if a child does not practice speech and literacy, and likewise, the areas responsible for musical interpretation and genesis can remain relatively unformed if they are not exercised during childhood. At a systems level, the neurons in other lobes might technically be capable of behaving similarly to dedicated cerebral regions and approximating their application-specific behavior (hypothetically, the prefrontal cortex could learn to perform an analogue of some computation normally done in the temporal lobe, or perhaps some ancillary low-resolution visual processing in place of the occipital lobe), but the application-specific brain region will ostensibly perform best at the tasks for which it developed. Once neurogenesis has slowed in adulthood, a person is not as likely to develop core abilities that were neglected in adolescence.

A diagram of the functional regions of the human brain. From Secrets of the Brain: An Introduction to the Brain Anatomical Structure and Biological Function by Jiawei Zhang (https://arxiv.org/pdf/1906.03314.pdf)