Search Google Drive documents and retrieve contents #265

Open: wants to merge 34 commits into base: main

byrro
Member

@byrro byrro commented Feb 25, 2025

This tool will be useful in RAG-like scenarios, where someone wants to ask questions about, or request a summary of, a set of documents related to a particular topic. Currently, to fulfill such requests, the LLM needs to first call list_documents, then get_document_by_id for each document.

We also implement utility functions to return documents in Markdown and HTML, since the Drive API JSON is verbose and would waste too many tokens.

Limitations: the Markdown/HTML utilities do not handle tables of contents (which I think aren't really useful here), headers, footers, or footnotes.
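
To illustrate the kind of conversion described above, here is a minimal sketch, not the code in this PR. It assumes the document JSON has the shape of a Google Docs API `documents.get` response (`body.content`, `paragraph.elements`, `textRun`, `paragraphStyle.namedStyleType`). The function names `text_run_to_markdown` and `document_to_markdown` and the heading mapping are hypothetical. Like the utilities described here, it ignores tables of contents, headers, footers, and footnotes; it also skips tables and lists for brevity.

```python
from typing import Any

# Assumed mapping from Docs named paragraph styles to Markdown heading prefixes.
HEADING_PREFIXES = {
    "TITLE": "# ",
    "HEADING_1": "# ",
    "HEADING_2": "## ",
    "HEADING_3": "### ",
}


def text_run_to_markdown(text_run: dict[str, Any]) -> str:
    """Render one text run, applying bold/italic markers when present."""
    text = text_run.get("content", "").rstrip("\n")
    if not text.strip():
        return text  # whitespace-only runs are passed through unstyled
    style = text_run.get("textStyle", {})
    if style.get("bold"):
        text = f"**{text}**"
    if style.get("italic"):
        text = f"_{text}_"
    return text


def document_to_markdown(document: dict[str, Any]) -> str:
    """Convert paragraphs of a Docs-style JSON document to Markdown."""
    blocks = []
    for element in document.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if paragraph is None:
            continue  # section breaks, tables, TOC, etc. are skipped here
        text = "".join(
            text_run_to_markdown(el["textRun"])
            for el in paragraph.get("elements", [])
            if "textRun" in el
        )
        if not text.strip():
            continue  # drop empty paragraphs
        named_style = paragraph.get("paragraphStyle", {}).get(
            "namedStyleType", "NORMAL_TEXT"
        )
        blocks.append(HEADING_PREFIXES.get(named_style, "") + text)
    return "\n\n".join(blocks)
```

The point of the sketch is the token saving: a heading plus one sentence becomes two short Markdown lines instead of several nested JSON objects.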


This PR deprecates list_documents and implements search_documents, in addition to search_and_retrieve_documents. This configuration makes it easier for LLMs to understand when to call each tool.

Both tools had their interfaces refactored to remove Google API-specific arguments that sometimes confused LLMs, such as "corpora" and "support_all_drives". They now accept arguments that better relate to expected user requests.

@byrro byrro requested a review from EricGustin February 25, 2025 00:13
@byrro byrro self-assigned this Feb 25, 2025

codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

@EricGustin
Member

@byrro I'm unable to run this locally. Looks like an issue with all of the double and single quotes in arcade_google/utils.py

image

@byrro
Member Author

byrro commented Feb 27, 2025

@EricGustin pushed a new implementation of the tool, refactored list_documents and worked around the worker's 'unmatched (' issue

@byrro byrro added the toolkit: major Toolkit changes that are not backward-compatible and will result in a major version increment label Mar 3, 2025
    document_contains: Optional[list[str]] = None,
    document_not_contains: Optional[list[str]] = None,
) -> str:
    query = ["mimeType = 'application/vnd.google-apps.document' and trashed = false"]
Member

Looking ahead, we will need to search for more file types beyond documents, for example, searching for a spreadsheet by name. Perhaps the MIME type can be a parameter so that we don't have to worry about that debt in the future.

Member Author

makes sense, just pushed a new version with mime_type as an argument to build files_list query / params.
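
For readers following along, a rough sketch of what a query builder with the MIME type as a parameter could look like. This is an illustration under assumptions, not the code that was pushed: the function name `build_files_list_query` and the constants are hypothetical, and the clause strings mirror the hard-coded query shown in the excerpt above.

```python
from typing import Optional

# Google Workspace MIME types for Docs and Sheets files.
GOOGLE_DOCUMENT = "application/vnd.google-apps.document"
GOOGLE_SPREADSHEET = "application/vnd.google-apps.spreadsheet"


def build_files_list_query(mime_type: Optional[str] = GOOGLE_DOCUMENT) -> str:
    """Build the `q` string for a Drive files.list call.

    Passing mime_type=None drops the file-type filter entirely.
    """
    clauses = []
    if mime_type:
        # Escape single quotes so the value cannot break out of the literal.
        escaped = mime_type.replace("'", "\\'")
        clauses.append(f"mimeType = '{escaped}'")
    clauses.append("trashed = false")
    return " and ".join(clauses)
```

With the default argument this reproduces the original document-only filter, while `build_files_list_query(GOOGLE_SPREADSHEET)` would target spreadsheets without touching the rest of the query logic.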

Member

thank you!

@byrro byrro changed the title Drive tool to search & retrieve document contents in markdown Search & retrieve Google Drive documents with contents Mar 3, 2025
@byrro byrro changed the title Search & retrieve Google Drive documents with contents Search Google Drive documents and retrieve contents Mar 3, 2025
name_contains = keyword.replace("'", "\\'")
full_text_contains = keyword.replace("'", "\\'")
keyword_query = (
    f"name contains '{name_contains}' or fullText contains '{full_text_contains}'"
Member

We need to group the left and right sides of the `or` inside parentheses; otherwise Google interprets the query in a way that we don't intend. See Slack DM for more details.

Member Author

good catch! just pushed a fix
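
A sketch of the grouping discussed in this thread, for reference. It is an assumption about the shape of the fix rather than the pushed code: the helper names `build_keyword_clause` and `combine_query` are hypothetical, and the escaping mirrors the `replace` calls in the excerpt above.

```python
def build_keyword_clause(keyword: str) -> str:
    """Return a parenthesized name/fullText clause for one keyword."""
    escaped = keyword.replace("'", "\\'")
    # The parentheses keep the `or` from combining with neighboring
    # `and` clauses once this is joined into a larger query.
    return f"(name contains '{escaped}' or fullText contains '{escaped}')"


def combine_query(base_clauses: list[str], keywords: list[str]) -> str:
    """Join base filters and one grouped clause per keyword with `and`."""
    clauses = list(base_clauses)
    clauses.extend(build_keyword_clause(keyword) for keyword in keywords)
    return " and ".join(clauses)
```

For example, `combine_query(["trashed = false"], ["budget"])` yields `trashed = false and (name contains 'budget' or fullText contains 'budget')`, so the `trashed = false` filter applies to both sides of the `or`.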

Member

@EricGustin EricGustin left a comment

Looks great!
