Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans without any further training. However, the plans produced naively by LLMs often cannot map precisely to admissible actions. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. The conducted human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models.
Given an arbitrary human activity expressed in natural language, we desire a complete action plan. The action plan must use only a chosen set of actions supported by a robotic agent. Furthermore, it must include all the common-sense steps that humans often omit.
Large language models (LLMs) such as GPT-3 and Codex, when prompted with an example, can produce very plausible action plans for complex human activities. Unfortunately, these plans often cannot be executed in the environment, or they contain various linguistic ambiguities, because they are expressed in free-form language.
In this paper, we propose several tools to improve the executability of the produced action plans. These tools require no further training and thus are not tailored to a particular environment.
For every generated action step, we first enumerate all actions achievable by the agent. The model output is then translated to the most semantically similar achievable action. In this work, we use a similarity measure between sentence embeddings produced by a Sentence RoBERTa model, but other choices are shown to work similarly well. The translated action is appended to the existing prompt, which is then used to generate the next action step.
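The translation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for the Sentence RoBERTa encoder, cosine similarity is assumed as the similarity measure, and the names `translate_step`, `plan`, `generate_step`, and the "Task:/Step N:" prompt layout are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # ASSUMPTION: toy bag-of-words vector standing in for a real sentence encoder.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def translate_step(generated_step, admissible_actions):
    # Map a free-form step to the most similar admissible action.
    query = embed(generated_step)
    return max(admissible_actions, key=lambda a: cosine(query, embed(a)))


def plan(task, generate_step, admissible_actions, max_steps=5):
    # generate_step is a hypothetical callable wrapping the language model:
    # it takes the current prompt and returns the next free-form step,
    # or None when the plan is finished.
    prompt = f"Task: {task}\n"
    steps = []
    for i in range(max_steps):
        raw = generate_step(prompt)
        if raw is None:
            break
        action = translate_step(raw, admissible_actions)
        steps.append(action)
        # The translated action, not the raw output, conditions the next step.
        prompt += f"Step {i + 1}: {action}\n"
    return steps, prompt
```

Appending the translated action rather than the raw model output keeps the prompt within the agent's action vocabulary, so later steps are generated in the context of what the agent would actually do.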
When a dataset, or even a handful of examples, is available, we can provide weak supervision by prompting the model with an example that is similar to the query task. We re-use the same Sentence RoBERTa model for this purpose, this time measuring similarity between the query task and every available example task.
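Example selection can be sketched in the same spirit. Again, this is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for the sentence encoder, cosine similarity is assumed, and `select_example`, `build_prompt`, and the prompt layout are hypothetical names and formatting choices.

```python
import math
from collections import Counter


def embed(text):
    # ASSUMPTION: toy bag-of-words vector standing in for a real sentence encoder.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def select_example(query_task, examples):
    # examples maps a task name to its list of demonstrated steps.
    # Return the name of the example task most similar to the query.
    query = embed(query_task)
    return max(examples, key=lambda name: cosine(query, embed(name)))


def build_prompt(query_task, examples):
    # Place the selected demonstration before the query task (hypothetical layout).
    name = select_example(query_task, examples)
    lines = [f"Task: {name}"]
    lines += [f"Step {i}: {s}" for i, s in enumerate(examples[name], 1)]
    lines += ["", f"Task: {query_task}"]
    return "\n".join(lines) + "\n"


examples = {
    "make coffee": ["walk to kitchen", "switch on coffee maker"],
    "wash clothes": ["walk to bathroom", "open washing machine"],
}
print(build_prompt("make breakfast", examples))
```

Because the same encoder serves both action translation and example selection, no additional model is needed to pick the demonstration.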
We consider two axes for evaluation: executability and correctness. Executability measures whether an action plan can be correctly parsed and satisfies the common-sense constraints of the environment. Correctness is evaluated by humans, where 10 human annotators determine the semantic correctness of the generated plans.
Large language models, such as GPT-3 and Codex, can generate action plans with semantic correctness comparable to that of plans written by humans. However, the generated plans are rarely executable in an embodied environment like VirtualHome, unlike plans written by human experts, which use only admissible actions. Using the proposed tools to bias model generation, executability can be significantly improved.
Action plans generated by our approach can be executed and visualized in the VirtualHome Simulator: