Search Google Drive documents and retrieve contents #265

byrro · 2025-02-25T00:13:47Z

This tool will be useful in scenarios akin to RAG, where someone wants to ask questions about, or request a summary of, a set of documents related to a particular topic. Currently, to fulfill such requests, the LLM must first call list_documents and then get_document_by_id for each document.

We also implement utility functions to return documents in Markdown and HTML, since the Drive API JSON is verbose and would waste tokens.

Limitations: the Markdown/HTML utilities do not handle tables of contents (which I don't think are really useful here), headers, footers, or footnotes.
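For readers who want a feel for the conversion, here is a minimal sketch. It is not the code from this PR: the function names are hypothetical, and it assumes the input has the shape of a Google Docs API document resource (`body.content` holding paragraphs whose `elements` carry `textRun.content`). It only handles headings and plain text, and it skips non-textual elements, tables of contents, and anything outside the body.

```python
from typing import Any

# Assumed mapping from Docs named paragraph styles to Markdown prefixes.
HEADING_PREFIXES = {
    "TITLE": "# ",
    "HEADING_1": "# ",
    "HEADING_2": "## ",
    "HEADING_3": "### ",
}


def paragraph_to_markdown(paragraph: dict[str, Any]) -> str:
    """Convert one paragraph to a Markdown line (hypothetical helper)."""
    style = paragraph.get("paragraphStyle", {}).get("namedStyleType", "NORMAL_TEXT")
    parts = []
    for element in paragraph.get("elements", []):
        text_run = element.get("textRun")
        if text_run is None:
            # Non-textual element (for example an inline image): skip it.
            continue
        parts.append(text_run.get("content", ""))
    text = "".join(parts).strip()
    if not text:
        return ""
    return HEADING_PREFIXES.get(style, "") + text


def document_to_markdown(document: dict[str, Any]) -> str:
    """Join the body paragraphs of a document into Markdown."""
    blocks = []
    for item in document.get("body", {}).get("content", []):
        paragraph = item.get("paragraph")
        if paragraph is None:
            # Section breaks, tables, and tables of contents are ignored here.
            continue
        markdown = paragraph_to_markdown(paragraph)
        if markdown:
            blocks.append(markdown)
    return "\n\n".join(blocks)
```

Because the sketch reads only `body`, headers, footers, and footnotes never reach the output, which matches the limitations listed above.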

This PR deprecates list_documents and implements search_documents (in addition to search_and_retrieve_documents). This split makes it easier for LLMs to understand when to call each tool.

Both tools had their interfaces refactored to remove Google API-specific arguments that sometimes confused LLMs, such as "corpora" and "support_all_drives". They now accept arguments that map more directly to what users are likely to ask for.

codecov · 2025-02-25T00:17:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

EricGustin · 2025-02-26T00:44:35Z

@byrro I'm unable to run this locally. Looks like an issue with all of the double and single quotes in arcade_google/utils.py.

byrro · 2025-02-27T14:55:46Z

@EricGustin pushed a new implementation of the tool, refactored list_documents and worked around the worker's 'unmatched (' issue

EricGustin · 2025-03-03T20:07:10Z

toolkits/google/arcade_google/utils.py

+    document_contains: Optional[list[str]] = None,
+    document_not_contains: Optional[list[str]] = None,
+) -> str:
+    query = ["mimeType = 'application/vnd.google-apps.document' and trashed = false"]


Looking ahead, we will need to search for more file types beyond documents, for example searching for a spreadsheet by name. Perhaps the MIME type can be a parameter so that we don't have to worry about that debt in the future.

makes sense, just pushed a new version with mime_type as an argument to build files_list query / params.
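A rough sketch of what a MIME-type-parameterized query builder could look like follows. This is an illustration, not the code that was pushed: `build_files_list_query` and the constant names are hypothetical, and only the `mimeType` and `trashed` clauses from the diff above are reproduced.

```python
# Hypothetical constants; the values are the standard Google Workspace MIME types.
DOCUMENT_MIME_TYPE = "application/vnd.google-apps.document"
SPREADSHEET_MIME_TYPE = "application/vnd.google-apps.spreadsheet"


def build_files_list_query(mime_type: str, include_trashed: bool = False) -> str:
    """Build the `q` string for a files.list call, filtered by MIME type."""
    clauses = [f"mimeType = '{mime_type}'"]
    if not include_trashed:
        clauses.append("trashed = false")
    # Joining a list avoids trailing or doubled 'and' operators.
    return " and ".join(clauses)
```

With this shape, searching spreadsheets later would only require passing a different `mime_type` rather than adding a second query builder.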

EricGustin · 2025-03-03T21:52:20Z

toolkits/google/arcade_google/utils.py

+            name_contains = keyword.replace("'", "\\'")
+            full_text_contains = keyword.replace("'", "\\'")
+            keyword_query = (
+                f"name contains '{name_contains}' or fullText contains '{full_text_contains}'"


We need to group the left and right sides of the `or` inside parentheses; otherwise Google interprets the query in a way that we don't intend. See Slack DM for more details.

good catch! just pushed a fix
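To illustrate the grouping issue, here is a small sketch, assuming the keyword clause is later joined to other filters with `and`. The helper names are hypothetical and this is not the fix that was pushed; the quote escaping mirrors the diff above.

```python
def keyword_clause(keyword: str) -> str:
    """Match a keyword in the file name or body, grouped in parentheses."""
    # Same single-quote escaping as in the diff under review.
    escaped = keyword.replace("'", "\\'")
    return f"(name contains '{escaped}' or fullText contains '{escaped}')"


def build_query(base_clauses: list[str], keywords: list[str]) -> str:
    """Join base filters and one grouped clause per keyword with 'and'."""
    clauses = list(base_clauses)
    clauses.extend(keyword_clause(keyword) for keyword in keywords)
    return " and ".join(clauses)
```

Wrapping each `or` pair in parentheses keeps the base filters (such as `trashed = false`) applied to both sides of the `or`, instead of leaving the grouping up to the query parser.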

EricGustin

Looks great!

byrro added 20 commits February 20, 2025 15:10

move utils and models to top-level dir, out of tools dir

b9cb400

support ordering by multiple fields (without breaking previous implementations)

ecc893a

add support to pagination token on list_documents

6132e02

change title_keywords to name_keywords (a title / heading is usually part of the document content, not its name)

d61d8ab

update page_token to pagination_token

f56b075

improve how query elements are joined to avoid a trailing or multiple 'and' operators together

c2338db

arguments to negate keywords in doc name and contents

b8525c9

remove reference to ToolContext function not available in previous versions of arcade-ai

4ceeb32

highlight negation arg annotation; remove obvious comments

811ec24

basic implementation for RAG tool that searches and retrieves docs

4a7e1f5

convert google document json to markdown (save tokens in llm context)

ccede21

unit test for doc-to-markdown

d811706

add document metadata at the top of the markdown generated

5d7e9cc

reference enum values, instead of hard-coded strings

3233785

unit test for search-and-retrieve-docs tool

b17cb45

fix return type and improve function name

99dc2f5

update / improve evals

7d72377

evals for search-and-retrieve tool

455d0d2

merge title and body query arguments in list_documents to simplify for the LLM

bbe371c

improve argument annotations; update evals

1fac35f

byrro requested a review from EricGustin February 25, 2025 00:13

byrro self-assigned this Feb 25, 2025

byrro added 2 commits February 25, 2025 17:12

Merge branch 'main' into renato/drive-search-files

f62b48b

fix bug in doc-to-markdown/html when document had a non-textual element

5813cf7

EricGustin reviewed Feb 26, 2025

byrro added 3 commits February 26, 2025 17:28

Merge branch 'main' into renato/drive-search-files

12c6f52

make return format an argument in search-and-retrieve-docs; offer Markdown, HTML, and JSON format options

fbd3d8c

deprecate 'list_functions' in favor of 'search_documents'; improve interface abstracting away args that are specific to the Google API

b5022d3

change document query str build logic to avoid 'unmatched (' issue in the worker

b57be9b

byrro added 4 commits March 3, 2025 15:01

organize files.list param building in a util function

5d611bf

improve arg annotations and document format conversion

b3319bf

merge main into renato/drive-search-files

8839ff4

merge main into renato/drive-search-files*

c12fc5a

byrro added the "toolkit: major" label (Toolkit changes that are not backward-compatible and will result in a major version increment) Mar 3, 2025

EricGustin reviewed Mar 3, 2025

byrro added 2 commits March 3, 2025 17:49

do not include shared drives by default when searching documents

13a53ec

make mimetype an argument to build files_list query/paramns

ac93f60

byrro mentioned this pull request Mar 3, 2025

Update Google Drive docs with SearchDocuments and SearchAndRetrieveDocuments tools ArcadeAI/docs#180

Open

byrro changed the title ~~Drive tool to search & retrieve document contents in markdown~~ Search & retrieve Google Drive documents with contents Mar 3, 2025

byrro changed the title ~~Search & retrieve Google Drive documents with contents~~ Search Google Drive documents and retrieve contents Mar 3, 2025

EricGustin reviewed Mar 3, 2025

byrro and others added 2 commits March 4, 2025 17:00

fix bug in files.list query building

d6c61a8

Fix test

3574603

EricGustin approved these changes Mar 5, 2025
