406. Evaluating Generative Models

▮ Generative Model Trends

I’ve had requests from multiple clients who want to integrate generative models such as ChatGPT and Stable Diffusion into their current workflows. Despite the trend, one of the difficulties of adopting such models is evaluation: their output is non-deterministic, and the same prediction can be judged correct or incorrect depending on who evaluates it. In this post, I’d like to share how we can evaluate such models.

▮ Why does it matter?

It’s easy to attract people and blow them away with new technology, but at the end of the day, if the model does not perform as expected, you will quickly lose their trust. For this reason, it is important for us, and for our clients, to acknowledge the following points and evaluate the model properly.

  • LLMs produce plenty of “hallucinations” (confident but incorrect outputs), and these are harder for people to spot than the visual artifacts of generative computer-vision models.
  • Improving the model in one aspect can make it worse in others.

▮ Evaluation Procedure

Here is one way to evaluate a generative model.

  1. Prepare Data
    • For language models, we can have “Question”/”Expected Answer” as a pair.
    • For text-to-image models, we can have “Description”/”Image” as a pair.
  2. Split Data
    • Split the data pairs we prepared into training/validation/test sets (a minimal sketch follows Fig. 1 below).
  3. Train Model
    • Use the training set to fine-tune the generative model
    • Be careful about which algorithm you use if the training dataset includes private information.
  4. Run Inference
    • Predict using the test set
  5. Calculate metrics
    • The decision tree below (Fig. 1) can help you decide on a metric for your specific situation.
    • If you have the expected answer, you can use “reference metrics”, such as quantifying the semantic similarity between the generated and expected text or images (see the sketch after Fig. 1)
    • If you have an answer that was generated by a previous model, you can ask another LLM to decide which answer is better: the previous one or the newly generated one (a judge sketch follows Fig. 1)
    • If you are able to get human feedback, ask another LLM whether the new answer has incorporated the feedback given on the old answer
    • If none of the above are available, you can use static metrics, such as checking the returned data structure (sketched after Fig. 1)
    • If the project has just started, it might be good to calculate multiple metrics so that you can compare each metric and the corresponding outputs.
    Fig.1 – Metric Decision Tree
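
To make steps 1 and 2 of the procedure concrete, here is a minimal sketch in Python. The question/expected-answer pairs and the 80/10/10 split ratio are only illustrative assumptions.

import random

# Step 1: question / expected-answer pairs (illustrative examples).
pairs = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'Hamlet'?", "expected": "William Shakespeare"},
    # ... more pairs
]

# Step 2: shuffle and split into training/validation/test sets (80/10/10 assumed).
random.seed(42)
random.shuffle(pairs)
n = len(pairs)
train_set = pairs[: int(n * 0.8)]
val_set = pairs[int(n * 0.8) : int(n * 0.9)]
test_set = pairs[int(n * 0.9) :]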
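
For the “reference metrics” branch, a common option for text is to embed both the expected and the generated answers and compare them. The sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; any sentence-embedding model can be substituted.

from sentence_transformers import SentenceTransformer, util

# A small sentence-embedding model (an assumption; substitute your own).
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(expected: str, generated: str) -> float:
    """Cosine similarity between the embeddings of the expected and generated answers."""
    embeddings = model.encode([expected, generated])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

# `results` stands in for the predictions produced in step 4 (hypothetical data).
results = [{"expected": "Paris", "generated": "The capital of France is Paris."}]
scores = [semantic_similarity(r["expected"], r["generated"]) for r in results]
print(sum(scores) / len(scores))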
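
For the “LLM as judge” branch, here is a provider-agnostic sketch. call_llm is a hypothetical callable standing in for whatever chat-completion API you actually use, and the prompt wording is only a starting point.

JUDGE_PROMPT = (
    "You are comparing two answers to the same question.\n"
    "Question: {question}\n"
    "Answer A (previous model): {answer_a}\n"
    "Answer B (new model): {answer_b}\n"
    'Reply with exactly "A" or "B", followed by one sentence of justification.'
)

def compare_answers(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask a judge LLM which answer is better.

    call_llm is a hypothetical function that takes a prompt string and
    returns the judge model's reply as a string.
    """
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    return call_llm(prompt).strip()

In practice it is worth running the comparison twice with the answers swapped, since judge models tend to favor whichever answer appears first.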
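
For the “static metrics” branch, even a simple check on the returned data structure catches many regressions. The sketch below assumes the model is asked to return JSON with a fixed set of keys; the exact contract is made up for illustration.

import json

REQUIRED_KEYS = {"title", "summary", "tags"}  # assumed output contract

def is_valid_output(raw_output: str) -> bool:
    """Return True if the output parses as JSON and contains all required keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)

outputs = ['{"title": "t", "summary": "s", "tags": []}', "not json at all"]
print(sum(is_valid_output(o) for o in outputs) / len(outputs))  # 0.5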

▮ Prompt Engineering

When evaluating a generative model, you will likely go through the procedure discussed in the previous section multiple times. If we keep using the same generative model architecture, there are mainly two ways to improve the results across these iterations.

  1. Add more data
  2. Prompt Engineering

For this section, I’d like to mainly discuss the latter. The text input we give to a generative model is often called a “prompt”. In text-to-image generation, for example, how we phrase the description hugely affects the images that are generated. To achieve the desired results, we need to be explicit in the prompt. Here are several tips on writing a “good” prompt.

  • Be clear and specific
    • If you want to generate an “elegant” image, think about what makes an image visually “elegant” and use those concrete attributes as keywords in the prompt
  • Avoid imprecise or ambiguous wording
  • Phrase your questions carefully
  • Encourage multiple alternatives
  • Add weights to the keywords if several are combined (see the sketch after this list)
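
As a concrete example of the last tip, here is a small helper that assembles a weighted text-to-image prompt. The (keyword:weight) syntax is the convention used by some Stable Diffusion front ends such as the AUTOMATIC1111 web UI; other tools use different syntaxes, so treat this as an illustration rather than a universal format.

def build_prompt(keywords):
    """Join (keyword, weight) pairs into a single prompt string.

    Weights other than 1.0 are written in the (keyword:weight) form used by
    some Stable Diffusion front ends; plain keywords are left as-is.
    """
    parts = []
    for keyword, weight in keywords:
        parts.append(keyword if weight == 1.0 else f"({keyword}:{weight})")
    return ", ".join(parts)

# Instead of the vague prompt "an elegant room", spell out what "elegant" means
# to you and emphasize the most important terms.
print(build_prompt([
    ("minimalist living room", 1.0),
    ("marble floor", 1.2),
    ("soft warm lighting", 1.1),
    ("high detail", 1.0),
]))
# minimalist living room, (marble floor:1.2), (soft warm lighting:1.1), high detail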

After settling on the right prompts, it is also important to align how the prompts are perceived by the AI and by people, because in most cases the two differ. For example, when you see a picture, you might find it “elegant”, while the AI might associate it with something else entirely. There is no way to align the perception 100%, but tools such as Lexica can help close the gap: Lexica lets you see which prompt was used to generate a given image.

Fig.2 – Lexica
