ch02: update llm retriever scorer with cohere v2
ayulockin committed Aug 22, 2024
1 parent 08d57c8 commit 8a79766
Showing 2 changed files with 18 additions and 24 deletions.
12 changes: 6 additions & 6 deletions rag-advanced/notebooks/prompts/retrieval_eval.json
@@ -1,14 +1,14 @@
[
{
"role": "SYSTEM",
"message": "You are a powerful auditor. Your goal is to score documents based on their relevance to a given question.\n\nThe agent model you are auditing is the following:\n- Agent description: A customer support chatbot for Weights & Biases to answer questions about the Weights & Biases platform, wandb SDK, its integrations, and the weave library.\n\nThe user will provide the context, consisting of multiple documents wrapped in document id tags, for example - <doc_1>, <doc_2>, etc\n\nFirst, score each document on an integer scale of 0 to 2 with the following meanings:\n 0 = represents that the document is irrelevant to the question and cannot be used to answer the question.\n 1 = represents that the document is somewhat relevant to the question and contains some information that could be used to answer the question\n 2 = represents that the document is highly relevant to the question and must be used to answer the question.\n\nScoring Instructions: \nAssign category 2 if the document is entirely related to the question and contains significant facts that can be used to answer the question.\nIf neither of these criteria satisfies the question, give it category 0.\n\nSplit this problem into steps:\nConsider the underlying intent of the question. Measure how well the content matches the likely intent of the question(M).\nMeasure the document's trustworthiness(T) concerning the facts to answer the question.\nConsider the aspects above and their relative importance, then decide on a final score (O).\nFinal scores must be an integer value only.\nDo not provide any code in the result. Provide each score in the following JSON format: \n{\"final_scores\":[{\"id\": <doc_#>, \"relevance\":<integer_score >}, ...}]}"
"role": "system",
"content": "You are a powerful auditor. Your goal is to score documents based on their relevance to a given question.\n\nThe agent model you are auditing is the following:\n- Agent description: A customer support chatbot for Weights & Biases to answer questions about the Weights & Biases platform, wandb SDK, its integrations, and the weave library.\n\nThe user will provide the context, consisting of multiple documents wrapped in document id tags, for example - <doc_1>, <doc_2>, etc\n\nFirst, score each document on an integer scale of 0 to 2 with the following meanings:\n 0 = represents that the document is irrelevant to the question and cannot be used to answer the question.\n 1 = represents that the document is somewhat relevant to the question and contains some information that could be used to answer the question\n 2 = represents that the document is highly relevant to the question and must be used to answer the question.\n\nScoring Instructions: \nAssign category 2 if the document is entirely related to the question and contains significant facts that can be used to answer the question.\nIf neither of these criteria satisfies the question, give it category 0.\n\nSplit this problem into steps:\nConsider the underlying intent of the question. Measure how well the content matches the likely intent of the question(M).\nMeasure the document's trustworthiness(T) concerning the facts to answer the question.\nConsider the aspects above and their relative importance, then decide on a final score (O).\nFinal scores must be an integer value only.\nDo not provide any code in the result. Provide each score in the following JSON format: \n{\"final_scores\":[{\"id\": <doc_#>, \"relevance\":<integer_score >}, ...}]}"
},
{
"role": "USER",
"message": "<question>\nHow do I programmatically access the human-readable run name?\n</question>\n<doc_0>\nIf you do not explicitly name your run, a random run name will be assigned to the run to help identify the run in the UI. For instance, random run names will look like \"pleasant-flower-4\" or \"misunderstood-glade-2\".\n\nIf you'd like to overwrite the run name (like snowy-owl-10) with the run ID (like qvlp96vk) you can use this snippet:\n\nimport wandb\n\nwandb.init()\nwandb.run.name = wandb.run.id\nwandb.run.save()\n</doc_0>\n</doc_1>\nA single unit of computation logged by W&B is called a run. You can think of a W&B run as an atomic element of your whole project. You should initiate a new run when you:\n - Train a model\n - Change a hyperparameter\n - Use a different model\n - Log data or a model as a W&B Artifact\n - Download a W&B Artifact\n\nFor example, during a sweep, W&B explores a hyperparameter search space that you specify. Each new hyperparameter combination created by the sweep is implemented and recorded as a unique run. \n</doc_1>\n<doc_2>\nThe run name is available in the `.name` attribute of a `wandb.Run`.\nimport wandb\nwandb.init()\nrun_name = wandb.run.name\n</doc_2>\n<doc_3>\nAfter calling `wandb.init()` you can access the random run ID or the human readable run name from your script like this:\n\nUnique run ID (8 character hash): `wandb.run.id`\nRandom run name (human readable): `wandb.run.name`\nIf you're thinking about ways to set useful identifiers for your runs, here's what we recommend:\n\nRun ID: leave it as the generated hash. This needs to be unique across runs in your project.\nRun name: This should be something short, readable, and preferably unique so that you can tell the difference between different lines on your charts.\n</doc_3>\n<doc_4>\nrun-name\nReturns the name of the run\nArgument\n`run`\nA run\nReturn Value\nThe name of the run\nrun-runtime\nReturns the runtime in seconds of the run\nArgument\n`run`\nA run\nReturn Value\nThe runtime in seconds of the run\nrun-summary\nReturns the summary typedDict of the run\n</doc_4>"
"role": "user",
"content": "<question>\nHow do I programmatically access the human-readable run name?\n</question>\n<doc_0>\nIf you do not explicitly name your run, a random run name will be assigned to the run to help identify the run in the UI. For instance, random run names will look like \"pleasant-flower-4\" or \"misunderstood-glade-2\".\n\nIf you'd like to overwrite the run name (like snowy-owl-10) with the run ID (like qvlp96vk) you can use this snippet:\n\nimport wandb\n\nwandb.init()\nwandb.run.name = wandb.run.id\nwandb.run.save()\n</doc_0>\n</doc_1>\nA single unit of computation logged by W&B is called a run. You can think of a W&B run as an atomic element of your whole project. You should initiate a new run when you:\n - Train a model\n - Change a hyperparameter\n - Use a different model\n - Log data or a model as a W&B Artifact\n - Download a W&B Artifact\n\nFor example, during a sweep, W&B explores a hyperparameter search space that you specify. Each new hyperparameter combination created by the sweep is implemented and recorded as a unique run. \n</doc_1>\n<doc_2>\nThe run name is available in the `.name` attribute of a `wandb.Run`.\nimport wandb\nwandb.init()\nrun_name = wandb.run.name\n</doc_2>\n<doc_3>\nAfter calling `wandb.init()` you can access the random run ID or the human readable run name from your script like this:\n\nUnique run ID (8 character hash): `wandb.run.id`\nRandom run name (human readable): `wandb.run.name`\nIf you're thinking about ways to set useful identifiers for your runs, here's what we recommend:\n\nRun ID: leave it as the generated hash. This needs to be unique across runs in your project.\nRun name: This should be something short, readable, and preferably unique so that you can tell the difference between different lines on your charts.\n</doc_3>\n<doc_4>\nrun-name\nReturns the name of the run\nArgument\n`run`\nA run\nReturn Value\nThe name of the run\nrun-runtime\nReturns the runtime in seconds of the run\nArgument\n`run`\nA run\nReturn Value\nThe runtime in seconds of the run\nrun-summary\nReturns the summary typedDict of the run\n</doc_4>"
},
{
"role": "CHATBOT",
"message": "{\"final_scores\":[{\"id\": 0, \"relevance\":2}, {\"id\": 1, \"relevance\":0}, {\"id\": 2, \"relevance\":2}, {\"id\": 3, \"relevance\":2}, {\"id: 4, \"relevance\":1}]}"
"role": "assistant",
"content": "{\"final_scores\":[{\"id\": 0, \"relevance\":2}, {\"id\": 1, \"relevance\":0}, {\"id\": 2, \"relevance\":2}, {\"id\": 3, \"relevance\":2}, {\"id: 4, \"relevance\":1}]}"
}
]
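
The prompt file now follows Cohere's v2 chat message schema: lowercase roles (`system`, `user`, `assistant` in place of `SYSTEM`, `USER`, `CHATBOT`) and a `content` field instead of `message`, so the loaded file can be sent to the v2 client as-is. Below is a minimal sketch of that flow; the synchronous `ClientV2` client and the response-parsing attribute are assumptions based on the Cohere Python SDK's v2 interface, not something this diff shows.

```python
# Minimal sketch, assuming the Cohere Python SDK's v2 client and a
# response shaped like response.message.content[0].text.
import json
import os

import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# The JSON above is already in v2 message format, so it can be used
# directly as the `messages` argument after appending the new user turn.
messages = json.load(open("prompts/retrieval_eval.json"))
messages.append(
    {"role": "user", "content": "<question>...</question>\n<doc_0>...</doc_0>"}
)

response = co.chat(model="command-r-plus", messages=messages, temperature=0.0)
print(response.message.content[0].text)  # the JSON relevance scores
```
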
30 changes: 12 additions & 18 deletions rag-advanced/notebooks/scripts/retrieval_metrics.py
@@ -350,9 +350,7 @@ def check_unique_ids(cls, v, values, **kwargs):
@weave.op()
async def call_cohere_with_retry(
co_client: cohere.AsyncClient,
preamble: str,
chat_history: List[Dict[str, str]],
message: str,
messages: List[Dict[str, any]],
num_contexts: int,
max_retries: int = 5,
) -> Dict[str, Any]:
@@ -361,30 +359,25 @@ async def call_cohere_with_retry(
try:
response_text = await make_cohere_api_call(
co_client,
preamble,
chat_history,
message,
messages,
model="command-r-plus",
force_single_step=True,
temperature=0.0,
prompt_truncation="AUTO",
max_tokens=250,
)

return await parse_and_validate_response(response_text, num_contexts)
except Exception as e:
error_message = f"Your previous response resulted in an error: {str(e)}"
error_message = f"{error_message}\nPlease provide a valid JSON response based on the previous context and error message. Ensure that:\n1. The number of scores matches the number of contexts ({num_contexts}).\n2. The IDs are unique.\n3. The relevance scores are 0, 1, or 2.\n4. The response is a valid JSON object, not wrapped in markdown code blocks."

if attempt == max_retries - 1:
raise

chat_history.extend(
messages.extend(
[
{"role": "USER", "message": message},
{"role": "CHATBOT", "message": response_text},
{"role": "assistant", "content": response_text},
{"role": "user", "content": error_message},
]
)
message = f"{error_message}\nPlease provide a valid JSON response based on the previous context and error message. Ensure that:\n1. The number of scores matches the number of contexts ({num_contexts}).\n2. The IDs are unique.\n3. The relevance scores are 0, 1, or 2.\n4. The response is a valid JSON object, not wrapped in markdown code blocks."

raise Exception("Max retries reached without successful validation")

@@ -395,12 +388,10 @@ async def evaluate_retrieval_with_llm(
contexts: List[Dict[str, Any]],
prompt_file: str = "prompts/retrieval_eval.json",
) -> Dict[str, Any]:
co_client = cohere.AsyncClient(api_key=os.environ["COHERE_API_KEY"])
co_client = cohere.AsyncClientV2(api_key=os.environ["COHERE_API_KEY"])

# Load the prompt
messages = json.load(open(prompt_file))
preamble = messages[0]["message"]
chat_history = messages[1:]

# Prepare the message
message_template = """<question>
@@ -411,11 +402,14 @@
context = ""
for idx, doc in enumerate(contexts):
context += f"<doc_{idx}>\n{doc['text']}\n</doc_{idx}>\n"
message = message_template.format(question=question, context=context)

messages.append(
{"role": "user", "content": message_template.format(question=question, context=context)}
)

# Make the API call with retry logic
return await call_cohere_with_retry(
co_client, preamble, chat_history, message, len(contexts)
co_client, messages, len(contexts)
)
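
For reference, `make_cohere_api_call` itself is not shown in this diff, so here is a hypothetical v2-style implementation that would be consistent with the updated call site above. The function body, the parameter list, and the `response.message.content[0].text` access are assumptions, not code from the repository.

```python
# Hypothetical sketch only; the real make_cohere_api_call may differ.
from typing import Any, Dict, List

import cohere
import weave


@weave.op()
async def make_cohere_api_call(
    co_client: cohere.AsyncClientV2,
    messages: List[Dict[str, Any]],
    model: str = "command-r-plus",
    temperature: float = 0.0,
    max_tokens: int = 250,
) -> str:
    # ClientV2.chat takes the whole conversation as `messages`
    # (system/user/assistant roles with a `content` field) instead of the
    # v1 preamble / chat_history / message split.
    response = await co_client.chat(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # The v2 response exposes the generated text under message.content.
    return response.message.content[0].text
```

With that in place, `evaluate_retrieval_with_llm(question, contexts)` can be awaited directly; each context is expected to be a dict with a `"text"` key, matching the loop that builds the `<doc_i>` tags.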


