Motivation
==========

Why does this package exist? The short answer is to create a more usable text
classification interface.

Now for the long answer, which expands on the meaning of *usable*.


Problem
-------

There are many ways to do text classification. The one that this package is competing
against is text generation.

To make text generation more concrete, let's work through an example. In text
generation, you write up your classification task in a ``prompt`` string. For example,
to classify a product review:

.. code:: python

   from cappr import openai

   class_names = (
       "The product is too expensive",
       "The product uses low quality materials",
       "The product is difficult to use",
       "The product doesn't look good",
       "The product is great",
   )
   class_names_str = "\n".join(class_names)

   product_review = "I can't figure out how to integrate it into my setup."
   prompt = f"""
   A customer left this product review: {product_review}

   Every product review belongs to exactly one of these categories:
   {class_names_str}

   Pick exactly one category which the product review belongs to.
   """

   api_resp = openai.api.gpt_complete(
       prompt,
       model="gpt-3.5-turbo-instruct",
       max_tokens=10,
       temperature=0,
   )
   completion = api_resp[0]["text"]
   print(repr(completion))
   # '\nThe product is difficult to use'

This usually works well. But if you've ever run text generation on a slightly larger
scale, then you know that there may be a considerable fraction of cases where the
``completion`` is not actually in ``class_names``. For your LLM application to work
well, these cases need to be handled. So you add:

.. code:: python

   if completion not in class_names:
       completion = post_process(completion)

   assert completion in class_names

Properly implementing ``post_process`` can be challenging, as the ``completion`` is
sampled from all possible sequences of tokens. This means you'll likely have to deal
with the cases where:

- The ``completion`` includes a bit of fluff

- The ``completion`` includes multiple plausible classes from ``class_names``

- The ``completion``\ 's word casing is different than the one used in ``class_names``,
  or it's spelled or phrased slightly differently

- The LM says ``"I'm not sure"`` in three different ways.

When faced with this problem, one solution is to iterate the prompt based on observed
completions. Another solution is to refer to each choice using a single token, as in a
multiple choice question. But single-token references can sacrifice performance when you
have quite a few classes, as it's not a typical instruction format. Other modifications
include mapping the ``completion`` to one of the ``class_names`` using a similarity
model. Common to all of these solutions is the need to spend developer time and
sacrifice simplicity.

The fact is: text generation can be endlessley accomodated, but you'll still have to
work around its arbitrary outputs. Fundamentally, unconstrained sampling is not a clean
solution to a classification problem.


Solution
--------

With CAPPr's ``predict`` interface, your job starts and stops at writing up your
classification task as a ``{prompt} {completion}`` string.

Let's now run CAPPr on that product review classification task. Also, let's:

- supply a prior (optional)

- predict a probability distribution over classes (optional)

- use a smaller, "worse" model—``text-curie-001``

  - Text generation with ``text-curie-001`` does not work well for slightly complicated
    tasks, e.g., run the text generation code above with ``model="text-curie-001"``\ .

.. code:: python

   from cappr.openai.classify import predict_proba

   class_names = (
       "The product is too expensive",
       "The product uses low quality materials",
       "The product is difficult to use",
       "The product doesn't look good",
       "The product is great",
   )
   prior = (
       2 / 6,
       1 / 6,
       1 / 6,
       1 / 6,
       1 / 6,
   )  # set to None if you don't have a prior
   # 2/6 reflects that perhaps we already expect customers to say it's expensive

   product_review = "I can't figure out how to integrate it into my setup."
   prompt = f"""
   This product review: {product_review}

   is best summarized as:"""

   completions = [class_name.lower() for class_name in class_names]

   pred_probs = predict_proba(
       prompt, completions, model="text-curie-001", prior=prior
   )

   print(repr(pred_probs.round(1)))
   # array([0.1, 0. , 0.8, 0. , 0.1])

   pred_class_idx = pred_probs.argmax(axis=-1)
   print(class_names[pred_class_idx])
   # The product is difficult to use

CAPPr is guaranteed to output exactly one choice from a given set of choices. As a
result, your work is reduced to designing a prompt-completion string format.

In the age of large language models, text classification should be boring and easy.
CAPPr aims to be just that.