Let’s dig deeper into this new capability and how it’s helping Bard improve its responses.
Improved logic and reasoning skills
Large language models (LLMs) are like prediction engines — when given a prompt, they generate a response by predicting what words are likely to come next. As a result, they’ve been extremely capable on language and creative tasks, but weaker in areas like reasoning and math. In order to help solve more complex problems with advanced reasoning and logic capabilities, relying solely on LLM output isn’t enough.
Our new method allows Bard to generate and execute code to boost its reasoning and math abilities. This approach takes inspiration from a well-studied dichotomy in human intelligence, notably covered in Daniel Kahneman’s book “Thinking, Fast and Slow” — the separation of “System 1” and “System 2” thinking.
- System 1 thinking is fast, intuitive and effortless. When a jazz musician improvises on the spot or a touch-typer thinks about a word and watches it appear on the screen, they’re using System 1 thinking.
- System 2 thinking, by contrast, is slow, deliberate and effortful. When you’re carrying out long division or learning how to play an instrument, you’re using System 2.
In this analogy, LLMs can be thought of as operating purely under System 1 — producing text quickly but without deep thought. This leads to some incredible capabilities, but can fall short in some surprising ways. (Imagine trying to solve a math problem using System 1 alone: You can’t stop and do the arithmetic, you just have to spit out the first answer that comes to mind.) Traditional computation closely aligns with System 2 thinking: It’s formulaic and inflexible, but the right sequence of steps can produce impressive results, such as solutions to long division.
With this latest update, we’ve combined the capabilities of both LLMs (System 1) and traditional code (System 2) to help improve accuracy in Bard’s responses. Through implicit code execution, Bard identifies prompts that might benefit from logical code, writes it “under the hood,” executes it and uses the result to generate a more accurate response. So far, we’ve seen this method improve the accuracy of Bard’s responses to computation-based word and math problems in our internal challenge datasets by approximately 30%.
Even with these improvements, Bard won’t always get it right — for example, Bard might not generate code to help the prompt response, the code it generates might be wrong or Bard may not include the executed code in its response. With all that said, this improved ability to respond with structured, logic-driven capabilities is an important step toward making Bard even more helpful. Stay tuned for more.