408. What LLMs Suck At

▮ LLMs

Generative AI has been trending for quite a while, and I’ve been curious about how credible these models actually are. The outputs they produce are so persuasive that they seem too good to be true.

So for this post, I’d like to share what I’ve learned about what they suck at. (I know, I sound like a terrible person)

Before going straight into their shortcomings, there are some fundamental facts we should know about LLMs in order to understand their limitations. So I’ll first discuss what these models are and what their inference process looks like, and then write about what they currently suck at.

▮ Auto-Regressive LLMs

All of the famous LLMs, including ChatGPT, are based on an architecture known as “auto-regressive” models.

When you pass a query to an LLM, it does not generate the whole response at once. Instead, it repeats the following two steps (sketched in code below):

  1. Generate the next most probable word based on the input so far.
  2. Feed that newly generated word back in, along with all of the previous input, and generate the next word.
Auto-Regressive Models
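Here is a minimal sketch of that loop in Python. The `predict_next_token` function is a hypothetical stand-in for a real model call, not something from an actual library; the point is only to show the generate-append-repeat structure.

```python
def predict_next_token(tokens: list[str]) -> str:
    """Hypothetical model call: a real LLM computes a probability
    distribution over its vocabulary and picks/samples the next token."""
    raise NotImplementedError


def generate(prompt_tokens: list[str], max_new_tokens: int = 50) -> list[str]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Step 1: predict the next token from everything generated so far.
        next_token = predict_next_token(tokens)
        if next_token == "<eos>":  # stop once the model emits an end-of-sequence marker
            break
        # Step 2: append it and feed the whole sequence back in on the next iteration.
        tokens.append(next_token)
    return tokens
```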

Our next step is to understand the overall inference procedure.

▮ Inference Process

Now that we know what “auto-regressive” models are, let’s take a look at how these models run inference.

The steps are the following (a concrete sketch follows after the list).
1. A human inputs a query
2. The LLM splits the query into tokens and converts them to embeddings
3. The model runs those embeddings through its layers and computes a probability distribution over its vocabulary for the next word
4. It picks (or samples) the next word from that distribution
5. It appends that word to the input and repeats, until it has a complete response for the human

Inference Process
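To make the steps concrete, here is a hedged sketch of a single decoding step using the Hugging Face `transformers` library. The choice of GPT-2 and the prompt are purely illustrative, not from the original post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # steps 1-2: query -> tokens

with torch.no_grad():
    logits = model(input_ids).logits                  # step 3: forward pass through the model

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the vocabulary
next_token_id = torch.argmax(next_token_probs)             # step 4: pick the most probable token (greedy)

input_ids = torch.cat([input_ids, next_token_id.view(1, 1)], dim=-1)  # step 5: append and repeat
print(tokenizer.decode(input_ids[0]))
```

In a real run, this single step is repeated in a loop (or handled by `model.generate`) until an end-of-sequence token appears.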

▮ What They Suck At

Now that we know the fundamentals of LLMs, let’s finally dive into what they suck at.

1. Taking into account recent information

For example, the current free ChatGPT (GPT-3.5) only has knowledge of information from before September 2021.

But on this point, I think it is only a matter of time before it gets resolved. I’ve heard that Google has built a quantum computer that can, for certain tasks, calculate much faster than the fastest current supercomputers. So I think it won’t be long until big companies can retrain LLMs in a couple of minutes and resolve this issue.

2. Reasoning, Planning

When humans do complex planning, we first look at the overall picture and then break it down into chunks, so that we can work out the steps required to reach the final objective.

LLMs, on the other hand, can only predict one word at a time, which makes sophisticated planning hard. They can’t look at the overall picture first before generating the required steps. So, since they don’t think the way humans do, we can sort of say that they cannot produce human-level reasoning.

Some research, such as “Chain-of-Thought” prompting, shows that LLMs can reason to a certain extent.

Chain-Of-Thought
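To show what Chain-of-Thought prompting looks like in practice, here is a small sketch. The `call_llm` function is hypothetical (standing in for whatever chat/completions API you use), and the arithmetic question is the classic cafeteria example from the Chain-of-Thought literature.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call, e.g. an HTTP request to a hosted model."""
    raise NotImplementedError


question = (
    "A cafeteria had 23 apples. It used 20 to make lunch "
    "and bought 6 more. How many apples does it have now?"
)

# Direct prompt: asks for the answer immediately.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-Thought prompt: asks the model to spell out intermediate steps first.
cot_prompt = f"{question}\nLet's think step by step."

# The CoT version nudges the model to generate the reasoning
# (23 - 20 = 3, then 3 + 6 = 9) before the final answer, which tends to
# improve accuracy on multi-step problems.
print(call_llm(cot_prompt))
```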

However, a research paper published after that stated the following:

LLMs still can’t reason

3. Exponential Accuracy Decrease

As I said in the earlier section, the model generates its output by repeatedly picking the most probable upcoming word, one word at a time.

If we define the probability of a single generated word being wrong as e, then the probability of the whole output being correct for an output sequence of length n is (1 - e)^n.

This means the probability that the output is entirely correct decreases exponentially as the output grows longer.
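A quick back-of-the-envelope calculation makes this concrete. The 2% per-word error rate below is an assumption chosen only for illustration.

```python
error_rate = 0.02  # assumed probability that any single generated word is wrong

for n in (10, 100, 1000):  # output lengths
    p_correct = (1 - error_rate) ** n
    print(f"n = {n:4d} -> P(entire output correct) = {p_correct:.3f}")

# n =   10 -> P(entire output correct) = 0.817
# n =  100 -> P(entire output correct) = 0.133
# n = 1000 -> P(entire output correct) = 0.000
```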

This is a fundamental flaw of these “auto-regressive” models, and overcoming it will require a major change in design.

4. Controllability

Again, due to the model architecture, it is extremely difficult to control the output. This means it suffers from the following problems (see the sketch after this list).

  • It cannot reliably produce factual and consistent answers
  • It can output non-factual or toxic information
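The levers we do have at inference time are mostly decoding parameters. The sketch below uses Hugging Face `transformers` (GPT-2 is an illustrative choice) to show that these parameters control how random the output is, but not whether it is factual.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The first person to walk on the Moon was",
                      return_tensors="pt").input_ids

# Greedy decoding: deterministic, but not guaranteed to be factual.
greedy = model.generate(input_ids, max_new_tokens=20, do_sample=False)

# Sampling with temperature / top-p: more varied, and even less predictable.
sampled = model.generate(input_ids, max_new_tokens=20, do_sample=True,
                         temperature=0.9, top_p=0.95)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```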