Select a language model#

CAPPr typically works better with larger, instruction-trained models. But it may be able to squeeze more out of smaller models than other methods. So don’t sleep on the little Llamas or Mistrals out there, especially if they’ve been trained for your application.

Besides that, selecting a language model is almost entirely a process of trial and error, balancing statistical performance with computational constraints. It should be easy to plug and play though.

For CAPPr, GPTQ models are the most computationally performant. Mistral trained on OpenOrca is statistically performant. These models are compatible with cappr.huggingface.classify.

Hugging Face#

To work with models which implement the transformers CausalLM interface, including AutoGPTQ and AutoAWQ models, CAPPr depends on the transformers package. Search the Hugging Face model hub for these models.

Note

For transformers>=4.32.0, GPTQ models can be loaded using transformers.AutoModelForCausalLM.from_pretrained.

Here’s a quick example (which will download a small GPT-2 model to your computer):

from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict

# Load a model and its tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Which planet is closer to the Sun: Mercury or Earth?"
completions = ("Mercury", "Earth")

pred = predict(prompt, completions, model_and_tokenizer=(model, tokenizer))
print(pred)
# Mercury

So far, CAPPr has been tested for code correctness on the following architectures:

  • Llama, Llama 2, and (since CAPPr v0.9.6) Llama 3 and 3.1

  • Mistral

  • Gemma 2

  • Phi

  • GPT-2

  • GPT-J

  • GPT-NeoX (including StableLM)

  • (Q)LoRA models whose base model is one of the above.

You’ll need access to beefier hardware to run models from the Hugging Face hub, as cappr.huggingface currently assumes you’ve locally loaded the model. Hugging Face Inference Endpoints are not yet supported by this package.

ctransformers model objects are not yet supported. (I think I’m just waiting on this issue.)

vllm model objects are not yet supported.

Which CAPPr Hugging Face module should I use?#

There are two CAPPr Hugging Face modules. In general, stick to cappr.huggingface.classify.

cappr.huggingface.classify has the greatest throughput. By default, prompts are processed two at a time and completions are processed in parallel. These settings are controlled by the batch_size and batch_size_completions keyword arguments, respectively. Decreasing their values lowers peak memory usage at the cost of runtime. Increasing their values lowers runtime at the cost of memory.
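As a rough sketch of the memory-saving end of that trade-off, the helper below sets both keyword arguments to 1. The name predict_low_memory is hypothetical (it is not part of CAPPr). The sketch assumes cappr is installed and that model_and_tokenizer is a (model, tokenizer) pair loaded as in the quick example above.

```python
from typing import Sequence


def predict_low_memory(
    prompts: Sequence[str],
    completions: Sequence[str],
    model_and_tokenizer,
):
    """Hypothetical helper: trade runtime for lower peak memory."""
    # Imported lazily so that defining this helper doesn't require cappr.
    from cappr.huggingface.classify import predict

    return predict(
        prompts,
        completions,
        model_and_tokenizer=model_and_tokenizer,
        batch_size=1,  # process one prompt at a time
        batch_size_completions=1,  # process one completion at a time
    )
```

If memory isn't a problem, raising batch_size above its default is the knob to try for speed.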

cappr.huggingface.classify can also cache shared instructions for prompts, resulting in a modest speedup. See cappr.huggingface.classify.cache_model().
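Here's a minimal sketch of that caching pattern. The wrapper name predict_with_cached_prefix is hypothetical, and the sketch assumes cache_model takes the (model, tokenizer) pair followed by the shared prefix and returns an object that predict accepts in place of the original pair. Check the cache_model documentation for the exact signature and for how the prefix and prompts are joined.

```python
from typing import Sequence


def predict_with_cached_prefix(
    prompt_prefix: str,
    prompts: Sequence[str],
    completions: Sequence[str],
    model_and_tokenizer,
):
    """Hypothetical wrapper: process shared instructions once, then classify."""
    # Imported lazily so that defining this wrapper doesn't require cappr.
    from cappr.huggingface.classify import cache_model, predict

    # Run the model over the shared instructions a single time.
    cached_model_and_tokenizer = cache_model(model_and_tokenizer, prompt_prefix)

    # Each prompt is now only the text that follows the shared instructions.
    return predict(prompts, completions, cached_model_and_tokenizer)
```

The speedup grows with the length of the shared instructions relative to the rest of each prompt.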

cappr.huggingface.classify_no_cache may be compatible with a slightly broader class of architectures and model interfaces. Here, the model is only assumed to take token IDs (input IDs) and an attention mask as input, and to output logits for each input ID.

Note

For transformers>=4.35.0, AWQ models can be loaded using transformers.AutoModelForCausalLM.from_pretrained. AWQ models loaded this way are compatible with cappr.huggingface.classify.

In particular, cappr.huggingface.classify_no_cache is compatible with AutoAWQ models loaded directly via the awq package:

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
   model_id,
   ...,
   batch_size=batch_size_completions,
)
model.device = "cuda"

Examples#

For an example of running Llama 2, see this notebook.

For an example of running an AutoGPTQ Mistral model, where we cache shared prompt instructions to save time and batch completions to save memory, see this notebook.

For a minimal example of running an AutoAWQ Mistral model, see this notebook.

For minimal examples you can quickly run, see the Example section for each of these functions:

cappr.huggingface.classify.predict()

cappr.huggingface.classify.predict_examples()

Llama CPP#

To work with models stored in the GGUF format, CAPPr depends on the llama-cpp-python package. Search the Hugging Face model hub for these models.

Here’s a quick example (which assumes you’ve downloaded this 6 MB model):

from llama_cpp import Llama
from cappr.llama_cpp.classify import predict

# Load model
model = Llama("./TinyLLama-v0.Q8_0.gguf", verbose=False)

prompt = """Gary told Spongebob a story:
There once was a man from Peru; who dreamed he was eating his shoe. He
woke with a fright, in the middle of the night, to find that his dream
had come true.

The moral of the story is to"""

completions = (
   "look at the bright side",
   "use your imagination",
   "eat shoes",
)

pred = predict(prompt, completions, model)
print(pred)
# use your imagination

Examples#

For an example of running Llama 2 on the COPA challenge, see this notebook.

For an example of running Llama 2 on the AG News challenge, where we cache shared prompt instructions to save time, see this notebook.

For minimal examples you can quickly run, see the Example section for each of these functions:

cappr.llama_cpp.classify.predict()

cappr.llama_cpp.classify.predict_examples()

OpenAI#

Here’s a quick example:

from cappr.openai.classify import predict

prompt = """
Tweet about a movie: "Oppenheimer was pretty good. But 3 hrs...cmon Nolan."
This tweet contains the following criticism:
""".strip("\n")

completions = ("bad message", "too long", "unfunny")

pred = predict(prompt, completions, model="text-ada-001")
print(pred)
# too long

CAPPr is currently only compatible with /v1/completions models where log-probabilities of input tokens can be requested via echo=True, logprobs=1. On January 4, 2024, OpenAI will deprecate all of these models except davinci-002 and babbage-002, which are weak, non-instruction-trained models. While gpt-3.5-turbo-instruct is compatible with /v1/completions, it stopped supporting echo=True, logprobs=1 on October 5, 2023, so CAPPr can’t support it.

Warning

Currently, cappr.openai.classify must repeat the prompt for however many completions there are. So if your prompt is long and you have many completions, you may end up spending much more with CAPPr. (cappr.huggingface.classify and cappr.llama_cpp.classify do not repeat the prompt because they cache its representation.)
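To get a feel for how much the repetition matters, here's a back-of-the-envelope token count. The function names and the numbers are made up for illustration. Real costs depend on the tokenizer and on OpenAI's pricing, neither of which is modeled here.

```python
from typing import Sequence


def tokens_sent_with_repetition(
    num_prompt_tokens: int, completion_token_counts: Sequence[int]
) -> int:
    """Each completion is sent together with its own copy of the prompt."""
    return sum(num_prompt_tokens + n for n in completion_token_counts)


def tokens_processed_with_caching(
    num_prompt_tokens: int, completion_token_counts: Sequence[int]
) -> int:
    """The prompt is processed once; each completion only adds its own tokens."""
    return num_prompt_tokens + sum(completion_token_counts)


# Illustrative numbers only: a 500-token prompt and 10 completions of 3 tokens each.
prompt_tokens = 500
completion_tokens = [3] * 10

repeated = tokens_sent_with_repetition(prompt_tokens, completion_tokens)
cached = tokens_processed_with_caching(prompt_tokens, completion_tokens)
print(repeated)  # 5030
print(cached)  # 530
```

In this made-up case, repeating the prompt sends roughly ten times as many tokens, which is why long prompts with many completions get expensive.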

Examples#

COPA

WSC

Decent performance on RAFT training sets is demonstrated in these notebooks.

For minimal examples you can quickly run, see the Example section for each of these functions:

cappr.openai.classify.predict()

cappr.openai.classify.predict_examples()