A note on workflow#

You’ve done the hard part of translating a practical problem into a “make a choice” problem that some LLM may be able to solve. Can you deploy this prompt + LLM system now? Probably not. Here’s what, and how, you need to do.

Gather data#

Without previous experiments, don’t assume that an LLM and your first prompt-completion format are going to work well. Instead, gather data like so:

A bunch of input-output pairs (200 examples)#
id	raw input	correct output index
1	“input 1”	2
2	“input 2”	0
3	“input 3”	0
…	…	…
200	“input 200”	1

The prompt is a transformation of the raw input. The correct output index corresponds to the correct output/choice for that input.

In general, you should gather as many of these input-output pairs/examples as is feasible. If there are only 2 possible choices (and say accuracy is 90%), then gather at least 200 examples total.[1] As the number of possible choices increases, or as accuracy gets closer to random guessing, more examples are needed to evaluate the system.

If you don’t have many input-output examples immediately within reach, then do the hard but important work of making them up.[2] Think carefully about the types of inputs you expect to see in production, and their relative frequencies. Make sure every choice is included in the dataset. Consider adding a few tricky inputs to understand the limits of your system. But don’t evaluate anything just yet!

Split data into train and test#

Now that you have a nice dataset, before you do anything else, randomly partition the dataset into a “training” dataset and a “test” dataset.[3] The importance of this step cannot be overstated.

training dataset (50 examples)#
id	raw input	correct output index
105	“input 105”	1
…	…	…
27	“input 27”	0

test dataset (150 examples)#
id	raw input	correct output index
174	“input 174”	2
26	“input 26”	2
91	“input 91”	1
…	…	…
136	“input 136”	1

Iterate on the training dataset#

Evaluate your first prompt-completion format on the training dataset. Examine and understand failure cases. Is your prompt specific enough? Does it include enough context? Iterate the format, language model, or prior, and evaluate on the training dataset again.

Be disciplined about not seeing or evaluating on the test dataset until you’ve finalized your selections for a format, langauge model, and prior.

If necessary, bring out the big guns#

Sometimes, you’ll find that your task is too difficult for a smaller model and a static prompt-completion format. In that case, consider the most OP solution: get a chain-of-thought completion from GPT-4 or Claude 2, and then have a cheap model classify the answer from this completion using CAPPr. See this section of the documentation for an example. Just keep in mind that the big guns cost quite a bit of latency and money.

Evaluate on the test dataset once#

After fully specifying everything about how your system is going to work, run that system on the test dataset once. When you’re asked for performance metrics, report the ones from this dataset.