Computational performance
=========================

One concern is that CAPPr requires as many model calls as there are choices/classes. But
in the CAPPr scheme, we can cache each attention block's keys and values for the
prompts. This feature is already supported by ``AutoModelForCausalLM``\ s. See `this
module `_ for the implementation.

Here's a somewhat cherry-picked experiment:

.. figure:: _static/scaling_classes/batch_size_16.png
   :align: center

   `COPA`_ dataset, repeating the choices to simulate multi-class classification tasks.
   `GPT-2 (small)`_ was run on a T4 GPU. 48 classification inputs were processed in
   batches of size 16. For a controlled runtime comparison, GPU RAM was held (roughly)
   constant for each method and each number of classes. Each point in the graph is a
   median of 5 runs. For text generation, exactly 4 tokens were generated for each
   prompt, which is the number of tokens in ``'\n\nAnswer A'``. 1-token times are also
   shown. But for COPA (and other multiple-choice style prompts), generating only 1
   token may result in lower zero-shot accuracy, as most of the sampled choices come
   after the first token.

.. _COPA: https://people.ict.usc.edu/~gordon/copa.html
.. _GPT-2 (small): https://huggingface.co/gpt2

See `this notebook `_ for the code which produced the figure above.

.. note:: For :mod:`cappr.llama_cpp.classify`, batch inference currently isn't
   possible. As a result, text generation is typically faster than CAPPr when there are
   many completions.


Weaknesses
----------

CAPPr does not perform well computationally when there are tens of classes and the
prompt is so long that only one or two prompts fit in memory during processing. For a
demonstration of this weakness, see the `Banking 77 demo `_. This weakness isn't
apparent in the COPA demo above because the prompt can be short (and the completions
long) without sacrificing accuracy.


Are there more classification tasks where the classes don't need to be provided in
context, and can instead be provided only as completions, for CAPPr to perform well
statistically? If so, CAPPr's computational issues could be worked around through prompt
engineering, and the model's context window could be reduced. Based on a few
experiments, it seems like the answer to this question is no; mentioning choices in the
prompt improves accuracy.

From an engineering standpoint, another weakness of CAPPr is that computational
optimizations require the ability to control the cache and batch inputs. The model must
implement the Hugging Face transformers interface, including accepting and returning
``past_key_values``. With sampling, caching is simpler, and batching isn't required.
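
The cache-in, cache-out pattern described above can be illustrated with a toy sketch. It
is not CAPPr's implementation and does not use the transformers library. It assumes a
single attention layer with one head and identity projections (each token vector serves
as its own query, key, and value), and the names ``attend`` and ``causal_attention`` are
hypothetical. The point is only that, under causal attention, the prompt's keys and
values don't depend on the completion, so they can be computed once and reused for
every completion.

```python
import math
import random


def attend(query: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - max_score) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def causal_attention(tokens, cache=None):
    """
    Toy causal attention (assumption: identity projections, one head, one layer).

    Returns the output for each token in `tokens`, plus the keys/values for every
    position seen so far. The input `cache` is copied, never mutated.
    """
    if cache is not None:
        keys, values = list(cache[0]), list(cache[1])
    else:
        keys, values = [], []
    outputs = []
    for token in tokens:
        # Each token attends to all earlier positions and to itself.
        keys.append(token)
        values.append(token)
        outputs.append(attend(token, keys, values))
    return outputs, (keys, values)


random.seed(0)
DIM = 4


def random_tokens(n: int) -> list[list[float]]:
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


prompt = random_tokens(6)
completions = [random_tokens(3), random_tokens(2)]

# Process the prompt once and keep its keys/values.
_, prompt_cache = causal_attention(prompt)

matches = []
for completion in completions:
    # With the cache: only the completion tokens are processed.
    cached_out, _ = causal_attention(completion, cache=prompt_cache)
    # Without the cache: the prompt is re-processed for every completion.
    full_out, _ = causal_attention(prompt + completion)
    completion_out = full_out[len(prompt):]
    matches.append(
        all(
            math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
            for cached_vec, full_vec in zip(cached_out, completion_out)
            for a, b in zip(cached_vec, full_vec)
        )
    )

all_match = all(matches)
print("cached and uncached outputs agree:", all_match)
```

In this sketch the uncached path processes the 6 prompt tokens once per completion,
while the cached path processes them once in total. A real model would carry one such
cache per attention block.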