AI chatbots don’t actually “know” anything. Can we fix that?

AI models are great at generating human-like text — but that’s not the same thing as knowing facts.

The success of OpenAI’s chatbot ChatGPT has been so overwhelming that you’d have to live under a rock to not have heard about it. In the few months since it was introduced, over 100 million people have signed up to use it and asked it far-reaching questions, and it has spawned an AI arms race as firms across industries rush to add ChatGPT-like capabilities to their services.

Microsoft and Google, for example, are adding AI tools that will soon allow you to let a robot write your meeting notes and emails, and even design your presentations. Their search engines are trying to eliminate the need to scroll through the list of blue links and instead summarize the answer to your query for you. Even Snapchat wants users to consult its new AI chatbot for their conversations, while DoNotPay has released one that can negotiate your bills. Heck, one business is even deploying ChatGPT to manage radio stations.

At the same time, as the tech behind ChatGPT, known as large language models (LLMs), expands its presence, it’s also becoming increasingly hard to ignore the drawbacks. Because these models work by following patterns in text to guess which word should come next in a sentence, they can’t reason or tell whether what they’re generating is accurate. Ask ChatGPT or Google’s Bard chatbot to count down the number of days to your birthday, and it will spew out an incorrect answer most of the time.

An arms race is brewing to build AI models that are logical and factual.

Their tendency to ramble, make up information, and even go rogue has researchers worried that slapdash, widespread adoption could flood the internet with misinformation, provided by an artificial intelligence that’s inherently no smarter, as sci-fi writer Ted Chiang puts it, than a flawed Xerox copy.

Behind the scenes, therefore, another race has been brewing: to build language models that are far more logical and factual. 

Over the last year or two, research groups, including ones at Meta and Google, have accelerated work on what they collectively call “augmented language models,” or ALMs. These supplement existing ChatGPT-like models with additional layers that help them better understand what a user is asking and respond with common sense, instead of just following the statistical patterns in their training data.

One form of ALM enables a chatbot to perform complex reasoning by building a chain of thought. It breaks a query down into a series of intermediate reasoning steps and keeps feeding itself each result until it arrives at a conclusion.
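To make the idea concrete, here is a minimal sketch of what a chain-of-thought prompt can look like in code. The `ask_model` helper is a hypothetical stand-in for whatever LLM API a developer might use, and the worked example inside the prompt is invented for illustration, not taken from Google’s research:

```python
# A minimal, illustrative sketch of chain-of-thought prompting.
# `ask_model` is a hypothetical stand-in for whatever LLM API is used;
# the worked example in the prompt is invented for illustration.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError("Wire this up to the LLM API of your choice.")

# A standard prompt asks for the answer directly:
direct_prompt = (
    "Q: Does a pear sink in water?\n"
    "A:"
)

# A chain-of-thought prompt instead shows the model a worked example that
# spells out intermediate reasoning steps, nudging it to reason the same way:
cot_prompt = (
    "Q: A brick has a density of about 2.0 g/cm^3. Does it sink in water?\n"
    "A: Water has a density of about 1.0 g/cm^3. The brick's density (2.0) is "
    "higher than water's (1.0), so the brick sinks. The answer is yes.\n\n"
    "Q: Does a pear sink in water?\n"
    "A: Let's think step by step."
)

# answer = ask_model(cot_prompt)  # the reply now walks through its reasoning
```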

In research earlier this year, Google found that a chain-of-thought approach allowed its chatbot to solve complex math problems far better than rivals such as OpenAI’s GPT-3, the model that the basic version of ChatGPT is based on; it scored nearly double GPT-3’s mark on a benchmark test. It also made big jumps on strategic and common-sense queries, and even outperformed a human on some of them.

The approach tackles a key shortcoming of existing LLMs, which many researchers call the “compositionality gap”: the model can correctly tell you the dates of birth and death of a celebrity, for instance, but it may not be able to use that information to correctly calculate their age.
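One rough way to picture closing that gap, sketched below using the same hypothetical `ask_model` helper as above, is to have a system ask the model the easy sub-questions and then do the final composition step, the arithmetic, in ordinary code:

```python
# A rough sketch of bridging the "compositionality gap": instead of asking
# for the final answer in one shot, decompose the question into sub-questions
# the model answers reliably, then do the composition (the arithmetic) in code.
# `ask_model` is the same hypothetical LLM helper as in the sketch above.
from datetime import date

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Placeholder for an LLM API call.")

def age_at_death(person: str) -> int:
    # Facts the model can usually recall on its own:
    born = date.fromisoformat(ask_model(f"When was {person} born? Answer as YYYY-MM-DD."))
    died = date.fromisoformat(ask_model(f"When did {person} die? Answer as YYYY-MM-DD."))
    # The composition step the model often fumbles, done deterministically instead:
    years = died.year - born.year
    if (died.month, died.day) < (born.month, born.day):
        years -= 1
    return years
```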

Google’s chain-of-thought AI outperformed a human on some strategic and common sense questions.

If you ask Google’s chain-of-thought ALM whether a pear sinks in water, it will first tell you the density of the fruit and then establish whether it’s more or less than water. It pairs those two pieces of information to determine that a pear, having a lower density than water, will float. When I prompted ChatGPT with the same question, it got the densities correct, but kept contradicting itself since it had no way to reason sentence-by-sentence, and ultimately produced the wrong answer. 

Dr. Noah Smith, a researcher at the Paul G. Allen School of Computer Science and Engineering, agrees that chain-of-thought can improve an LLM’s reasoning behavior. More importantly, he adds, the real value of these techniques is that they’ll make the model more transparent.

Current chatbots work like black boxes from the user’s perspective and don’t explain how they arrive at an answer, making it hard for their developers to trace at which step they failed. A chain-of-thought ALM’s responses, on the other hand, are “detailed, explicit, and explanatory and will potentially help human users catch more errors or inconsistencies,” Dr. Smith told Freethink. 

Another kind of augmented language model retrieves relevant snippets of information and documents from its training dataset, or from a live source like a search engine, to ensure the response is factually accurate. Feeding the chatbot a limited set of relevant, verifiable material helps prevent it from going off track or “hallucinating.”
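Here is a bare-bones sketch of that retrieval idea, again using the hypothetical `ask_model` helper; the toy keyword scorer below stands in for the far more sophisticated search engines and vector indexes real systems rely on:

```python
# A bare-bones sketch of retrieval augmentation: fetch a few relevant
# snippets first, then ask the model to answer using only those snippets.
# `ask_model` is a hypothetical LLM helper, and the toy keyword scorer
# below stands in for a real search engine or dense-vector retriever.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Placeholder for an LLM API call.")

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer_with_sources(query: str, documents: list[str]) -> str:
    snippets = retrieve(query, documents)
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using only the numbered sources below, "
        "and cite the ones you rely on.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return ask_model(prompt)
```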

Dr. Wei Xu, an assistant professor of computing at the Georgia Institute of Technology, finds the retrieval-based method very promising, as it combines “the strengths of LLMs and search engines to have the best of both worlds.” The LLM parses a large body of data to stitch together coherent text, while the retrieval component offers “ground truths” and references, she adds, which can help keep the model in check and help human users fact-check its answers.

Retrieval-based methods combine the best of LLMs and search engines.

When Meta built a retrieval ALM called Atlas, it answered trivia and natural-language questions with even greater accuracy than Google’s PaLM model, in spite of PaLM being 50x bigger.

In addition, Dr. Xu believes retrieval ALMs can partially solve the intellectual property issues generative AI poses for creators. AI chatbots like ChatGPT are largely trained on the internet’s (mostly copyrighted) data, without crediting or compensating the websites they build on. A retrieval ALM, by comparison, can cite the sources its responses are based on by including a URL.

The AI functions Google and Microsoft have added to their search engines are forms of ALMs. However, the companies have done little to counteract the vastness of the web itself, and their chatbots still struggle to avoid incorporating the misinformation, hate speech, and nonsense out there. Even so, these systems hold huge promise for enhancing productivity and revolutionizing a number of industries, and previous research has found that chatbots capable of retrieving live information produce far fewer false statements than ones without access to the internet.

Still, many AI researchers, though optimistic about the advances in ALMs, don’t expect the glaring issues of LLMs to be entirely solved anytime soon. 

“There is still a long way to go to understand how they work, and interpret questions,” Dr. Xu told Freethink.
