Select a language model#
CAPPr typically works better with larger, instruction-trained models. But it may be able to squeeze more out of smaller models than other methods. So don’t sleep on the little Llamas or Mistrals out there, especially if they’ve been trained for your application.
Besides that, selecting a language model is almost entirely a process of trial and error, balancing statistical performance with computational constraints. It should be easy to plug and play though.
For CAPPr, GPTQ models are the most computationally efficient, and Mistral trained
on OpenOrca performs well statistically. Both are compatible with
cappr.huggingface.classify.
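The trial-and-error loop described above can be sketched as a small harness that scores each candidate on a labeled development set and times it. Everything here is hypothetical scaffolding: `evaluate`, the two stub predictors, and the toy data are made up for illustration. In practice, each `predict_fn` would wrap a call to a CAPPr `predict` function with a particular model.

```python
import time
from typing import Callable, Dict, List, Sequence

PredictFn = Callable[[Sequence[str], Sequence[str]], List[str]]


def evaluate(
    predict_fn: PredictFn,
    prompts: Sequence[str],
    completions: Sequence[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Return accuracy and wall-clock seconds for one candidate."""
    start = time.perf_counter()
    preds = predict_fn(prompts, completions)
    seconds = time.perf_counter() - start
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return {"accuracy": correct / len(labels), "seconds": seconds}


# Hypothetical stand-ins for real models
def always_first(prompts: Sequence[str], completions: Sequence[str]) -> List[str]:
    return [completions[0] for _ in prompts]


def keyword_stub(prompts: Sequence[str], completions: Sequence[str]) -> List[str]:
    return ["negative" if "awful" in prompt else "positive" for prompt in prompts]


prompts = ["The movie was great", "The movie was awful", "I loved it"]
completions = ("positive", "negative")
labels = ["positive", "negative", "positive"]

candidates = {"always_first": always_first, "keyword_stub": keyword_stub}
results = {
    name: evaluate(fn, prompts, completions, labels)
    for name, fn in candidates.items()
}
for name, result in results.items():
    print(name, result)
```

Comparing accuracy against seconds for each candidate makes the statistical versus computational trade-off explicit before committing to a model.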
Hugging Face#
To work with models which implement the transformers CausalLM interface, including
AutoGPTQ and AutoAWQ models, CAPPr depends on the transformers package. Search
the Hugging Face model hub for these models.
Note
For transformers>=4.32.0, GPTQ models can be loaded
using transformers.AutoModelForCausalLM.from_pretrained.
Here’s a quick example (which will download a small GPT-2 model to your computer):
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict
# Load a model and its tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Which planet is closer to the Sun: Mercury or Earth?"
completions = ("Mercury", "Earth")
pred = predict(prompt, completions, model_and_tokenizer=(model, tokenizer))
print(pred)
# Mercury
So far, CAPPr has been tested for code correctness on the following architectures:
Llama, Llama 2, and (since CAPPr v0.9.6) Llama 3 and 3.1
Mistral
Gemma 2
Phi
GPT-2
GPT-J
GPT-NeoX (including StableLM)
(Q)LoRA models whose base model is one of the above.
You’ll need access to beefier hardware to run models from the Hugging Face hub, as
cappr.huggingface currently assumes you’ve locally loaded the model. Hugging Face
Inference Endpoints are not yet supported by this package.
ctransformers model objects are not yet supported. (I think I’m just waiting on
this issue.)
vllm model objects are not yet supported.
Which CAPPr Hugging Face module should I use?#
There are two CAPPr Hugging Face modules. In general, stick to
cappr.huggingface.classify.
cappr.huggingface.classify has the greatest throughput. By default,
prompts are processed two at a time and completions are processed in parallel. These
settings are controlled by the batch_size and batch_size_completions keyword
arguments, respectively. Decreasing their values decreases peak memory usage but costs
runtime. Increasing their values decreases runtime but costs memory.
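To make the trade-off concrete, here is a rough sketch of how batch size relates to the number of model calls. It is not CAPPr’s actual implementation: `batched` and `num_batches` are hypothetical helpers, and the only detail taken from the text above is that prompts are grouped into batches of a configurable size.

```python
import math
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def num_batches(num_items: int, batch_size: int) -> int:
    """Number of model calls needed to cover num_items."""
    return math.ceil(num_items / batch_size)


prompts = [f"prompt {i}" for i in range(7)]

for batch_size in (1, 2, 4):
    batches = list(batched(prompts, batch_size))
    largest = max(len(batch) for batch in batches)
    print(f"batch_size={batch_size}: {len(batches)} calls, up to {largest} prompts each")
```

Fewer, larger batches mean fewer model calls but more inputs held in memory at once, which is the trade-off the two keyword arguments expose.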
cappr.huggingface.classify can also cache shared instructions for prompts,
resulting in a modest speedup. See cappr.huggingface.classify.cache_model().
cappr.huggingface.classify_no_cache may be compatible with a slightly
broader class of architectures and model interfaces. Here, the model is only assumed to
take input IDs and an attention mask as input, and to output logits for each input ID.
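As a rough illustration of why logits for each input ID are enough to score a completion, the sketch below does so with the standard library alone. It is a toy under stated assumptions, not CAPPr’s code: `toy_logits` and its five-token vocabulary are invented, the attention mask is omitted because nothing is padded, and averaging token log-probabilities is just one common way to score a completion.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def completion_log_prob(
    logits: Sequence[Sequence[float]],
    input_ids: Sequence[int],
    num_prompt_tokens: int,
) -> float:
    """Average log-probability of the completion tokens given the prompt.

    logits[t] scores the token at position t + 1, so the first completion
    token (at position num_prompt_tokens) is scored by the previous row.
    """
    token_log_probs = []
    for position in range(num_prompt_tokens, len(input_ids)):
        log_probs = log_softmax(logits[position - 1])
        token_log_probs.append(log_probs[input_ids[position]])
    return sum(token_log_probs) / len(token_log_probs)


def toy_logits(input_ids: Sequence[int], vocab_size: int = 5) -> List[List[float]]:
    """Hypothetical model: strongly predicts that the next token is current + 1."""
    return [
        [4.0 if token == (current + 1) % vocab_size else 0.0
         for token in range(vocab_size)]
        for current in input_ids
    ]


prompt_ids = [0, 1]
candidates = {"continues the count": [2, 3], "breaks the count": [4, 0]}

scores = {}
for name, completion_ids in candidates.items():
    input_ids = prompt_ids + completion_ids
    scores[name] = completion_log_prob(
        toy_logits(input_ids), input_ids, len(prompt_ids)
    )

best = max(scores, key=scores.get)
print(best)
```

Any model object that can produce such a grid of logits from input IDs could, in principle, be scored this way, which is why the interface requirement is so light.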
Note
For transformers>=4.35.0, AWQ models can be loaded
using transformers.AutoModelForCausalLM.from_pretrained. AWQ models
loaded this way are compatible with cappr.huggingface.classify.
In particular, cappr.huggingface.classify_no_cache is compatible with models
loaded via:
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    ...,
    batch_size=batch_size_completions,
)
model.device = "cuda"
Examples#
For an example of running Llama 2, see this notebook.
For an example of running an AutoGPTQ Mistral model, where we cache shared prompt instructions to save time and batch completions to save memory, see this notebook.
For a minimal example of running an AutoAWQ Mistral model, see this notebook.
For minimal examples you can quickly run, see the Example section for each of these functions:
Llama CPP#
To work with models stored in the GGUF format, CAPPr depends on the llama-cpp-python package. Search the Hugging Face model hub for these models.
Here’s a quick example (which assumes you’ve downloaded this 6 MB model):
from llama_cpp import Llama
from cappr.llama_cpp.classify import predict
# Load model
model = Llama("./TinyLLama-v0.Q8_0.gguf", verbose=False)
prompt = """Gary told Spongebob a story:
There once was a man from Peru; who dreamed he was eating his shoe. He
woke with a fright, in the middle of the night, to find that his dream
had come true.
The moral of the story is to"""
completions = (
    "look at the bright side",
    "use your imagination",
    "eat shoes",
)
pred = predict(prompt, completions, model)
print(pred)
# use your imagination
Examples#
For an example of running Llama 2 on the COPA challenge, see this notebook.
For an example of running Llama 2 on the AG News challenge, where we cache shared prompt instructions to save time, see this notebook.
For minimal examples you can quickly run, see the Example section for each of these functions:
OpenAI#
Here’s a quick example:
from cappr.openai.classify import predict
prompt = """
Tweet about a movie: "Oppenheimer was pretty good. But 3 hrs...cmon Nolan."
This tweet contains the following criticism:
""".strip("\n")
completions = ("bad message", "too long", "unfunny")
pred = predict(prompt, completions, model="text-ada-001")
print(pred)
# too long
CAPPr is currently only compatible with /v1/completions models where
log-probabilities of input tokens can be requested via echo=True, logprobs=1. On
January 4, 2024, OpenAI deprecated all of these models except davinci-002 and
babbage-002, which are weak, non-instruction-trained models. While gpt-3.5-turbo-instruct
is compatible with /v1/completions, this model stopped supporting echo=True,
logprobs=1 on October 5, 2023. So CAPPr can’t support this model.
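For reference, a request of the kind described above might look like the sketch below. It only builds the JSON body and sends nothing. `build_echo_request` is a hypothetical helper, not a CAPPr function, and setting `max_tokens` to 0 so that nothing new is generated is an assumption about how such a call would typically be made.

```python
import json


def build_echo_request(prompt: str, completion: str, model: str = "babbage-002") -> dict:
    """Build a /v1/completions body that asks for log-probs of the input tokens."""
    return {
        "model": model,
        "prompt": f"{prompt} {completion}",
        "max_tokens": 0,  # assumption: score the input only, generate nothing
        "echo": True,  # return the input tokens in the response
        "logprobs": 1,  # include a log-probability for each returned token
    }


body = build_echo_request(
    "This tweet contains the following criticism:", "too long"
)
print(json.dumps(body, indent=2))
```

A model that rejects `echo=True` together with `logprobs` cannot return log-probabilities for the completion text, which is the incompatibility described above.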
Warning
Currently, cappr.openai.classify must repeat the prompt once for
each completion. So if your prompt is long and you have
many completions, you may end up spending much more with CAPPr.
(cappr.huggingface.classify and cappr.llama_cpp.classify do
not repeat the prompt because they cache its representation.)
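A back-of-the-envelope estimate shows how quickly the repeated prompt adds up. The token counts below are made up, and `tokens_with_repeated_prompt` and `tokens_with_cached_prompt` are hypothetical helpers that count processed tokens only. They do not reflect any particular pricing.

```python
from typing import Sequence


def tokens_with_repeated_prompt(
    prompt_tokens: int, completion_tokens: Sequence[int]
) -> int:
    """The prompt is sent once per completion."""
    return sum(prompt_tokens + n for n in completion_tokens)


def tokens_with_cached_prompt(
    prompt_tokens: int, completion_tokens: Sequence[int]
) -> int:
    """The prompt is processed once and reused for every completion."""
    return prompt_tokens + sum(completion_tokens)


prompt_tokens = 500  # a long prompt (made-up count)
completion_tokens = [2, 3, 1, 4, 2]  # five short completions (made-up counts)

repeated = tokens_with_repeated_prompt(prompt_tokens, completion_tokens)
cached = tokens_with_cached_prompt(prompt_tokens, completion_tokens)
print(f"repeated prompt: {repeated} tokens, cached prompt: {cached} tokens")
```

With a long prompt, the total grows roughly in proportion to the number of completions, so it is worth keeping OpenAI prompts short or the completion set small.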
Examples#
Decent performance on RAFT training sets is demonstrated in these notebooks.
For minimal examples you can quickly run, see the Example section for each of these functions: