diff --git a/examples/py/enron/embedding-cache b/examples/py/enron/embedding-cache new file mode 100644 index 0000000000..e4602b2ee4 Binary files /dev/null and b/examples/py/enron/embedding-cache differ diff --git a/examples/py/enron/enron-vectors.ipynb b/examples/py/enron/enron-vectors.ipynb new file mode 100644 index 0000000000..2d2c6f04af --- /dev/null +++ b/examples/py/enron/enron-vectors.ipynb @@ -0,0 +1,647 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6785c319", + "metadata": {}, + "source": [ + "# Using Raphtory similarity search to uncover Enron criminal network" + ] + }, + { + "cell_type": "markdown", + "id": "cd195489", + "metadata": {}, + "source": [ + "The Enron scandal was one of the largest corporate fraud cases in history, leading to the downfall of the company and the conviction of several executives. The graph below illustrates the significant decline in Enron's stock price between August 2000 and December 2001, providing valuable insights into the company's downfall." + ] + }, + { + "cell_type": "markdown", + "id": "c3ce5767", + "metadata": {}, + "source": [ + "![enron stock price](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/EnronStockPriceAugust2000toJanuary2001.svg/567px-EnronStockPriceAugust2000toJanuary2001.svg.png)" + ] + }, + { + "cell_type": "markdown", + "id": "4b5521bb", + "metadata": {}, + "source": [ + "Now, put yourself in the judge's seat, confronted with a vast dataset comprising hundreds of thousands of emails. Your responsibility? Identifying every culprit and related elements within. How long would it take to uncover all of them and their connections?\n", + "\n", + "Prepare for a revelation: Raphtory now boasts seamless similarity search functionality. This enables swift exploration across the entire network of email messages, swiftly pinpointing diverse criminal activities. All it takes is submitting a semantic query pertaining to the specific crimes under investigation." + ] + }, + { + "cell_type": "markdown", + "id": "d7667c42", + "metadata": {}, + "source": [ + "## Wait, how does this work? And what is similarity search in the first place?\n", + "\n", + "Similarity search isn't a novel concept. It's a powerful technique that sifts through a collection of documents to identify those bearing semantic resemblance to a given query.\n", + "\n", + "Consider a query like `hiding information`. Imagine applying this query across a corpus of email messages; the result would likely yield documents discussing various aspects of concealing information.\n", + "\n", + "And which role does that play in Raphtory land? To traverse the bridge, we must represent entities within graphs as documents or sets of documents. These documents are then transformed into embeddings or vectors using an embedding function for effective searchability. In Raphtory, this process is referred to as 'vectorising' a graph, and it is as easy as:\n", + "\n", + "```\n", + "vg = g.vectorise(embeddding_function)\n", + "```\n", + "\n", + "Raphtory has a default way to translate graph entities into documents. However, if we have a deep understanding of our graph's semantics, we can always create those documents ourselves, insert them as properties, and let Raphtory know which property name use to pick them up:\n", + "\n", + "```\n", + "g.add_vertex(0, 'Kenneth Lay', {'document': 'Kenneth Lay is the former CEO of Enron'})\n", + "vg = g.vectorise(embeddding_function, nodes=\"document\")\n", + "```\n", + "\n", + "Voila! Executing a similarity search query on the graph is now straightforward. Using methods within `VectorisedGraph`, we can select and retrieve documents based on a query:\n", + "\n", + "```\n", + "vg.append_by_similarity('hiding information', limit=10).get_documents()\n", + "```\n", + "\n", + "This example is a basic query, capturing the top 10 highest-scoring documents. However, Raphtory offers an array of advanced methods, enabling the implementation of complex similarity search algorithms. You can combine different queries into a single selection or even leverage the graph's space between documents to add more context to one selection using an similarity based expansion.\n", + "\n", + "Now, armed with these fundamentals, let's embark on the quest to unearth some potential criminals!" + ] + }, + { + "cell_type": "markdown", + "id": "60fcf479", + "metadata": {}, + "source": [ + "## Preparing the investigation" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "e1ceb7e8", + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "import pandas as pd\n", + "import altair as alt\n", + "from raphtory import *\n", + "from raphtory import algorithms\n", + "from raphtory import export\n", + "from raphtory.vectors import *\n", + "from langchain.embeddings import HuggingFaceEmbeddings\n", + "from email.utils import parsedate_to_datetime, parsedate\n", + "from datetime import timezone, datetime\n", + "from time import mktime\n" + ] + }, + { + "cell_type": "markdown", + "id": "aa41dc7f", + "metadata": {}, + "source": [ + "First, we define some auxiliary functions for parsing of the Enron dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "e97aba2c", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_sender(text):\n", + " sender_cut = text.split(\"\\nFrom: \")\n", + " if len(sender_cut) > 1:\n", + " email_cut = sender_cut[1].split(\"\\n\")[0].split(\"@\")\n", + " if len(email_cut) > 1:\n", + " return email_cut[0]\n", + " else: \n", + " return\n", + " else:\n", + " return\n", + " \n", + "def extract_sender_domain(text):\n", + " sender_cut = text.split(\"\\nFrom: \")\n", + " if len(sender_cut) > 1:\n", + " email_cut = sender_cut[1].split(\"\\n\")[0].split(\"@\")\n", + " if len(email_cut) > 1:\n", + " return email_cut[1]\n", + " else: \n", + " return\n", + " else:\n", + " return\n", + " \n", + "def extract_recipient(text):\n", + " recipient_cut = text.split(\"\\nTo: \")\n", + " if len(recipient_cut) > 1:\n", + " email_cut = recipient_cut[1].split(\"\\n\")[0].split(\"@\")\n", + " if len(email_cut) > 1:\n", + " return email_cut[0]\n", + " else:\n", + " return\n", + " else:\n", + " return\n", + " \n", + "def extract_actual_message(text):\n", + " try:\n", + " body = re.split(\"X-FileName: .*\\n\\n\", text)[1]\n", + " return re.split('-{3,}\\s*Original Message\\s*-{3,}', body)[0][:1000]\n", + " except:\n", + " return\n", + "\n", + "extract_date = lambda text: text.split(\"Date: \")[1].split(\"\\n\")[0]" + ] + }, + { + "cell_type": "markdown", + "id": "9e943817", + "metadata": {}, + "source": [ + "Then, we ingest the email dataset and carry out some cleaning using pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "2de7a3d2", + "metadata": {}, + "outputs": [], + "source": [ + "enron = pd.DataFrame()\n", + "enron['email'] = pd.read_csv('emails.csv', usecols=['message'])['message']\n", + "enron['src'] = enron['email'].apply(extract_sender)\n", + "enron['dst'] = enron['email'].apply(extract_recipient)\n", + "enron['time'] = enron['email'].apply(extract_date)\n", + "enron['message'] = enron['email'].apply(extract_actual_message)\n", + "enron['message'] = enron['message'].str.strip()\n", + "\n", + "enron = enron.dropna(subset=[\"src\", \"dst\", \"time\", \"message\"])\n", + "enron = enron.drop_duplicates(['src', 'dst', 'time', 'message'])\n", + "enron = enron[enron['message'].str.len() > 5]\n", + "enron = enron[enron['dst'] != 'undisclosed.recipients']\n", + "enron = enron[enron['email'].apply(extract_sender_domain) == 'enron.com']\n", + "\n", + "enron['document'] = enron['src'] + \" sent a message to \" + enron['dst'] + \" at \" + enron['time'] + \" with the following content:\\n\" + enron['message']" + ] + }, + { + "cell_type": "markdown", + "id": "15433650", + "metadata": {}, + "source": [ + "Next, we ingest those emails into a Raphtory graph. Here individuals serve as nodes,\n", + "most of them belonging to Enron, and the edges repesent email exchanges between them. Our criminal investigation targets the last four months of 2001, coinciding with the Enron bankruptcy, so we will create a window over the graph for that period." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "6496d0a3", + "metadata": {}, + "outputs": [], + "source": [ + "raw_graph = Graph()\n", + "def ingest_edge(record):\n", + " e = raw_graph.add_edge(record['time'], record['src'], record['dst'], {'document': record['document']})\n", + " raw_graph.add_vertex(record['time'], record['src']).add_constant_properties({'document': ''})\n", + " raw_graph.add_vertex(record['time'], record['dst']).add_constant_properties({'document': ''})\n", + "enron.apply(ingest_edge, axis=1)\n", + "g = raw_graph.window('2001-09-01 00:00:00', '2002-01-01 00:00:00')" + ] + }, + { + "cell_type": "markdown", + "id": "5e3b7549", + "metadata": {}, + "source": [ + "And our `vectors` module comes into play at this stage. We are going to vectorise the graph we just built. As previously outlined, this involves employing an embedding function that translates documents into vectors. For this purpose, we've selected a local model from Langchain named `gte-small`. It's important to note that this operation is computationally very expensive. When initiated from scratch, the process can span several hours. However, to streamline this, the vectorising process enables the setup of a cache file. By utilizing this cache, embeddings for previously processed documents are readily available, avoiding the need to invoke the resource-intensive model repeatedly. Fortunately, we've already taken this step for you, and there already exists a file named `embedding-cache` in the current directory, containing all the necessary embeddings for today's task, so the execution will be instant" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "79cd8203", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "computing embeddings for nodes\n", + "computing embeddings for edges\n" + ] + } + ], + "source": [ + "embeddings = HuggingFaceEmbeddings(model_name=\"thenlper/gte-small\")\n", + "embedding_function = lambda texts: embeddings.embed_documents(texts)\n", + "\n", + "vg = g.vectorise(\n", + " embedding_function,\n", + " \"./embedding-cache\",\n", + " node_document=\"document\",\n", + " edge_document=\"document\",\n", + " verbose=True)\n" + ] + }, + { + "cell_type": "markdown", + "id": "cecacfe0", + "metadata": {}, + "source": [ + "## Finding the criminal network" + ] + }, + { + "cell_type": "markdown", + "id": "3a9a6e34", + "metadata": {}, + "source": [ + "Congratulations! You've successfully loaded a vectorised graph containing four critical months of the Enron email dataset. Using our similarity search engine, we can now submit queries to uncover potential criminal behavior or, at the very least, steer the investigation in the right direction. Our aim is to identify individuals who might hold pertinent information regarding Enron's internal practices. You're encouraged to experiment with your own queries and methodologies. However, for starters, we'll concentrate on three pivotal topics that were crucial in past investigations:\n", + "1. Hiding company debt through special purpose entities (SPEs)\n", + "2. Manipulation of the energy market\n", + "3. Withholding crucial information from investors\n", + "\n", + "We will make use of the following auxiliary functions to help us get the job done. Feel free to explore these functions for insights on handling common tasks." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "228a51dd", + "metadata": {}, + "outputs": [], + "source": [ + "def print_emails(query, limit):\n", + " for doc in vg.append_edges_by_similarity(query, limit).get_documents():\n", + " print(doc.content)\n", + " print('===========================================================================================')\n", + " \n", + "def show_network_for_query(query, limit):\n", + " edges = vg.append_edges_by_similarity(query, limit).edges()\n", + " network = Graph()\n", + " for edge in edges:\n", + " network.add_edge(0, edge.src.name, edge.dst.name)\n", + "\n", + " network = export.to_pyvis(graph=network, edge_color=\"#FF0000\")\n", + " return network.show('nx.html')\n" + ] + }, + { + "cell_type": "markdown", + "id": "27856c61", + "metadata": {}, + "source": [ + "### Hiding company debt" + ] + }, + { + "cell_type": "markdown", + "id": "e361f6b9", + "metadata": {}, + "source": [ + "To uncover pertinent communications on this subject, we will use the query:\n", + "- `hide company debt`" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "5c74b3c0", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "rod.hayslett sent a message to james.saunders at Sun, 18 Nov 2001 08:05:22 -0800 (PST) with the following content:\n", + "That is one of the things we are working on. At this point in time Dynegy will be responsible for this debt, if theyexercise their rights under the preferred stock agreements, which would leave them with common pledged to the lenders. The price they paid recognized the debt was there, if it is not there, the price will be higher. Suffice it to say all of these things will be tken care of before it funds.\n", + "--------------------------\n", + "Sent from my BlackBerry Wireless Handheld (www.BlackBerry.net)\n", + "===========================================================================================\n", + "mariella.mahan sent a message to stanley.horton at Tue, 13 Nov 2001 12:53:54 -0800 (PST) with the following content:\n", + "Something for us to talk about during our next staff meeting.\n", + "\n", + "There are three projects which have significant cash flow problems and thus=\n", + " difficulties in meeting debt obligations: these are: SECLP, Panama and Gaz=\n", + "a. In the past, as I suppose we have done in Dabhol, we have taken the pos=\n", + "ition that we would not inject cash into these companies and would be prepa=\n", + "red to face a default and possible acceleration of the loans. SECLP has be=\n", + "en the biggest issue/problem. Panama is much less (a few million of floati=\n", + "ng of our receivables from the company) would be sufficient to meet the cas=\n", + "h crunch in April of this year. Note that, in Panama, the debt is fully gu=\n", + "aranteed by the government and is non-recoursed to the operating company, B=\n", + "LM. In the past, we have discussed letting the debt default, which would c=\n", + "ause the bank to potentially seek complete payment and acceleration from th=\n", + "e GoPanama. The reason: the vast majority of BLM's problems stem from acti=\n", + "ons taken by\n", + "===========================================================================================\n" + ] + } + ], + "source": [ + "print_emails(\"hide company debt\", 2)" + ] + }, + { + "cell_type": "markdown", + "id": "0dc644e0", + "metadata": {}, + "source": [ + "Here, we can find an interesting email in the second position. This message highlights significant cash flow problems in three projects (SECLP, Panama, and Gaza) that face difficulties in meeting debt obligations. It discusses the position of not injecting cash into these companies and being prepared to face default and possible loan acceleration. This might be a good starting point for some investigations." + ] + }, + { + "cell_type": "markdown", + "id": "b716ab6f", + "metadata": {}, + "source": [ + "### Manipulation of the energy market" + ] + }, + { + "cell_type": "markdown", + "id": "63e22285", + "metadata": {}, + "source": [ + "Among the charges leveled against Enron were allegations of manipulating the energy market leveraging their influential position. To explore conversations pertaining to this matter and potentially uncover concrete evidence, we'll employ the query:\n", + "- `manipulating the energy market`" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "97244400", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "paulo.issler sent a message to zimin.lu at Mon, 22 Oct 2001 09:49:02 -0700 (PDT) with the following content:\n", + "Ed has written the draft for the first assignment. I will be checking it today and making my comments by tomorrow morning. Feel free to make yours. I beleive the book Mananging Energy Price Risk is a great reference for the second assignment. \n", + "\n", + "Thanks.\n", + "Paulo Issler\n", + "===========================================================================================\n", + "bill.williams sent a message to alan.comnes at Thu, 25 Oct 2001 12:59:06 -0700 (PDT) with the following content:\n", + "Alan,\n", + "\n", + "I have a few questions regarding the emminent implementation of the new target pricing mechanism.\n", + "1. Does uninstructed energy still get paid (if not, we cannot hedge financials)\n", + "2. The CISO Table 1. lists an unintended consequence as \" Target price may be manipulated due to no obligation to deliver\"\n", + "\tWhy is there no obligation to deliver?\n", + "3. Is there still a load deviation penalty? Or would that be considered seperately?\n", + "\n", + "Thanks for the help.\n", + "\n", + "Bill\n", + "===========================================================================================\n" + ] + } + ], + "source": [ + "print_emails(\"manipulating the energy market\", 2)" + ] + }, + { + "cell_type": "markdown", + "id": "19f7757c", + "metadata": {}, + "source": [ + "Here, in the second returned email, discussions revolve around legal boundaries concerning the new target pricing mechanism. They highlight the potential for manipulation within the new mechanism due to the absence of an obligation to deliver. While this doesn't explicitly confirm illegal conduct, individuals engaged in this conversation might possess valuable insights to elucidate Enron's practices concerning this topic." + ] + }, + { + "cell_type": "markdown", + "id": "7a2c5598", + "metadata": {}, + "source": [ + "### Withholding crucial information from investors" + ] + }, + { + "cell_type": "markdown", + "id": "ea03a72c", + "metadata": {}, + "source": [ + "Finally, to address the last point of our investigation, we'll employ the query:\n", + "\n", + "- `lie to investors`\n", + "\n", + "This time we'll just show the subgraph comprising these communications, aiming to uncover the network of individuals that might be involved in this." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e17aa6f8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Warning: When cdn_resources is 'local' jupyter notebook has issues displaying graphics on chrome/safari. Use cdn_resources='in_line' or cdn_resources='remote' if you have issues viewing graphics in a notebook.\n", + "nx.html\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "show_network_for_query(\"lie to investors\", 8)" + ] + }, + { + "cell_type": "markdown", + "id": "00a54a1b", + "metadata": {}, + "source": [ + "And at the very center of this network, we find Kenneth Lay, the former Enron CEO. He was indeed pleaded not guilty to eleven criminal charges. He was convicted of six counts of securities and wire fraud and was subject to a maximum of 45 years in prison. However, Lay died on July 5, 2006, before sentencing was to occur." + ] + }, + { + "cell_type": "markdown", + "id": "3ffc59bc", + "metadata": {}, + "source": [ + "## Bonus: Integrating Raphtory with an LLM" + ] + }, + { + "cell_type": "markdown", + "id": "ee5eb00f", + "metadata": {}, + "source": [ + "As a bonus for this tutorial, we are going to look at how we can easily integrate Raphtory with a Large Language Model (LLM). There are many ways we can accomplish this, but leveraging the Langchain ecosystem seems like an excellent starting point. One of the options is defining a langchain Retriever using Raphtory. This allows the creation of various chains for diverse purposes such as Question/Answer setups, or agent-driven pipelines.\n", + "\n", + "In this example, we'll build the most basic QA setup, a Retrieval-augmented generation (RAG) pipeline. This kind of pipeline icombines of a document retriever and an LLM. When a question is submitted to the pipeline, that question initially goes through the document retriever, which extract relevants documents from a set. These documents are then fed into the LLM alongside the question to provide context for generating the final answer.\n", + "\n", + "The first step involves creating a Langchain retriever interface for a Raphtroy vectorised graph. To do this, we extend the `BaseRetriever` class and implement the `_get_relevant_documents` method, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "ac7e3496", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.schema.retriever import BaseRetriever, Document\n", + "from langchain.callbacks.manager import CallbackManagerForRetrieverRun\n", + "from typing import Optional, Dict, Any, List\n", + "\n", + "def adapt_document(document):\n", + " return Document(\n", + " page_content=document.content,\n", + " metadata={\n", + " 'src': document.entity.src.name,\n", + " 'dst': document.entity.dst.name\n", + " }\n", + " )\n", + "\n", + "class RaphtoryRetriever(BaseRetriever):\n", + " graph: VectorisedGraph\n", + " \"\"\"Source graph.\"\"\"\n", + " top_k: Optional[int]\n", + " \"\"\"Number of items to return.\"\"\"\n", + " \n", + " def _get_relevant_documents(\n", + " self,\n", + " query: str,\n", + " *,\n", + " run_manager: CallbackManagerForRetrieverRun,\n", + " metadata: Optional[Dict[str, Any]] = None,\n", + " ) -> List[Document]:\n", + " docs = self.graph.append_edges_by_similarity(query, self.top_k).get_documents()\n", + " return [adapt_document(doc) for doc in docs]" + ] + }, + { + "cell_type": "markdown", + "id": "50ab1b81", + "metadata": {}, + "source": [ + "Next, we can define the RAG chain using our retriever as the context. In this instance, we'll create a placeholder LLM model that answres with the statement, `\"I'm a dummy LLM model that got the input:\"` and returns the input it receives. While this dummy model might not serve as a practical investigative tool, it allows us to observe the output from our retriever in action. If you do want to build a proper pipeline, you can replace the placeholder LLM with a real one using the code snippet beolw:\n", + "\n", + "```python\n", + "from langchain.chat_models import ChatOpenAI\n", + "llm = ChatOpenAI()\n", + "```\n", + "\n", + "This, however, requires an OpenAI access token. For now, let's proceed with our dummy model:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b1f297e9", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "from langchain.schema.output_parser import StrOutputParser\n", + "from langchain.schema.runnable import RunnablePassthrough\n", + "\n", + "retriever = RaphtoryRetriever(graph=vg, top_k=3)\n", + "\n", + "template = \"\"\"Answer the question based only on the following context:\n", + "{context}\n", + "\n", + "Question: {question}\n", + "\"\"\"\n", + "prompt = PromptTemplate.from_template(template)\n", + "\n", + "llm = lambda input: f\"I'm a dummy LLM model that got the input:\\n{input.text}\"\n", + "\n", + "rag_chain = (\n", + " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "28876334", + "metadata": {}, + "source": [ + "And finally we can invoke the chain providing a question:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "dd478c3c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm a dummy LLM model that got the input:\n", + "Answer the question based only on the following context:\n", + "[Document(page_content='marie.heard sent a message to kate.cole at Wed, 31 Oct 2001 07:54:40 -0800 (PST) with the following content:\\nKate and Teresa:\\n\\nWe are working on a time sensitive project and ask that you assist us in determining the names of various Enron entities by doing an historical analysis or whatever means you use.\\n\\nThe entities I need info on are:\\n\\nEnron Gas Processing Company\\nEnron Gas Marketing Canada\\nEnron Oil Canada Ltd.\\nEnron Oil Corporation\\n\\nAre these prior names of current Enron entities, are they existing entities, etc.?\\n\\nPlease call me if you need any additional info or if my request is confusing.\\n\\nThanks for your help!\\n\\nMarie\\nx33907', metadata={'src': 'marie.heard', 'dst': 'kate.cole'}), Document(page_content=\"linda.robertson sent a message to richard.shapiro at Mon, 29 Oct 2001 14:55:14 -0800 (PST) with the following content:\\nQuestion Two:\\nNo. Enron's lobbying expenditures is very consistent with other entities in its field. In addition, because of Enron's diverse portfolio of business investments it has a broad and diverse set of public policy interests. Enron's DC expenditures also reflect the presence of our regulatory efforts.\\n \\nQuestion Three:\\nAt the Federal level Enron's public policy advocacy costs will be approximately the same in CY200l. \\n\\nQuestion Four:\\nSimilar to other companies, Enron constantly alters its use of outside consultants depending on our needs at that time. All firms we have retained should have registrations on file. \\n\\nQuestion Five:\\nFar from unusual, Linda Robertson was hired following a lengthy search in November 2000 to replace the retiring head of the DC office.\\n\\nQuestion Six:\\nDon't believe so.\\n\\nQuestion Eight:\\nWe anticipate that the SEC inquiry will be handled by Enron's legal department.\", metadata={'src': 'linda.robertson', 'dst': 'richard.shapiro'}), Document(page_content='rakhi.israni sent a message to gerald.nemec at Tue, 25 Sep 2001 12:49:00 -0700 (PDT) with the following content:\\nGerald,\\n\\nCan you tell me which Enron entity is involved in the Sorrento R/W Agmt? This is for wiring purposes.\\n\\nRakhi\\nx3-7871', metadata={'src': 'rakhi.israni', 'dst': 'gerald.nemec'})]\n", + "\n", + "Question: which person should I investigate to know more about Enron usage of special purpose entities\n", + "\n" + ] + } + ], + "source": [ + "answer = rag_chain.invoke('which person should I investigate to know more about Enron usage of special purpose entities')\n", + "print(answer)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "raphtory", + "language": "python", + "name": "raphtory" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/py/enron/nx.html b/examples/py/enron/nx.html new file mode 100644 index 0000000000..d4d35188ac --- /dev/null +++ b/examples/py/enron/nx.html @@ -0,0 +1,155 @@ + + + + + + + + + +
+

+
+ + + + + + +
+

+
+ + + + + +
+ + +
+
+ + + + + + + \ No newline at end of file diff --git a/raphtory-graphql/src/server.rs b/raphtory-graphql/src/server.rs index 28b5363255..591598f31a 100644 --- a/raphtory-graphql/src/server.rs +++ b/raphtory-graphql/src/server.rs @@ -94,6 +94,7 @@ impl RaphtoryServer { .vectorise_with_template( Box::new(embedding.clone()), Some(graph_cache), + true, template.clone(), true, ) diff --git a/raphtory/src/python/graph/views/graph_view.rs b/raphtory/src/python/graph/views/graph_view.rs index 5c9eb1de9b..6259786a12 100644 --- a/raphtory/src/python/graph/views/graph_view.rs +++ b/raphtory/src/python/graph/views/graph_view.rs @@ -308,11 +308,24 @@ impl PyGraphView { GraphIndex::new(self.graph.clone()) } - #[pyo3(signature = (embedding, cache = None, node_document = None, edge_document = None, verbose = false))] + /// Create a VectorisedGraph from the current graph + /// + /// Args: + /// embedding (Callable[[list], list]): the embedding function to translate documents to embeddings + /// cache (str): the file to be used as a cache to avoid calling the embedding function (optional) + /// overwrite_cache (bool): whether or not to overwrite the cache if there are new embeddings (optional) + /// node_document (str): the property name to be used as document for nodes (optional) + /// edge_document (str): the property name to be used as document for edges (optional) + /// verbose (bool): whether or not to print logs reporting the progress + /// + /// Returns: + /// A VectorisedGraph with all the documents/embeddings computed and with an initial empty selection + #[pyo3(signature = (embedding, cache = None, overwrite_cache = false, node_document = None, edge_document = None, verbose = false))] fn vectorise( &self, embedding: &PyFunction, cache: Option, + overwrite_cache: bool, node_document: Option, edge_document: Option, verbose: bool, @@ -323,7 +336,13 @@ impl PyGraphView { let template = PyDocumentTemplate::new(node_document, edge_document); execute_async_task(move || async move { graph - .vectorise_with_template(Box::new(embedding.clone()), cache, template, verbose) + .vectorise_with_template( + Box::new(embedding.clone()), + cache, + overwrite_cache, + template, + verbose, + ) .await }) } diff --git a/raphtory/src/python/packages/vectors.rs b/raphtory/src/python/packages/vectors.rs index 746aa76930..99b76377c9 100644 --- a/raphtory/src/python/packages/vectors.rs +++ b/raphtory/src/python/packages/vectors.rs @@ -156,8 +156,16 @@ fn translate(window: Window) -> Option<(i64, i64)> { window.map(|(start, end)| (start.into_time(), end.into_time())) } +/// A vectorised graph, containing a set of documents positioned in the graph space and a selection +/// over those documents #[pymethods] impl PyVectorisedGraph { + /// Save the embeddings present in this graph to `file` so they can be further used in a call to `vectorise` + fn save_embeddings(&self, file: String) { + self.0.save_embeddings(file.into()); + } + + /// Return the nodes present in the current selection fn nodes(&self) -> Vec { self.0 .nodes() @@ -166,6 +174,7 @@ impl PyVectorisedGraph { .collect_vec() } + /// Return the edges present in the current selection fn edges(&self) -> Vec { self.0 .edges() @@ -174,6 +183,7 @@ impl PyVectorisedGraph { .collect_vec() } + /// Return the documents present in the current selection fn get_documents(&self, py: Python) -> Vec { self.get_documents_with_scores(py) .into_iter() @@ -181,6 +191,7 @@ impl PyVectorisedGraph { .collect_vec() } + /// Return the documents alongside their scores present in the current selection fn get_documents_with_scores(&self, py: Python) -> Vec<(PyGraphDocument, f32)> { let docs = self.0.get_documents_with_scores(); @@ -210,13 +221,61 @@ impl PyVectorisedGraph { .collect_vec() } - #[pyo3(signature = (hops, window=None))] - fn expand(&self, hops: usize, window: Window) -> DynamicVectorisedGraph { - self.0.expand(hops, translate(window)) + /// Add all the documents from `nodes` and `edges` to the current selection + /// + /// Documents added by this call are assumed to have a score of 0. + /// + /// Args: + /// nodes (list): a list of the vertex ids or vertices to add + /// edges (list): a list of the edge ids or edges to add + /// + /// Returns: + /// A new vectorised graph containing the updated selection + fn append( + &self, + nodes: Vec, + edges: Vec<(VertexRef, VertexRef)>, + ) -> DynamicVectorisedGraph { + self.0.append(nodes, edges) + } + + /// Add all the documents from `nodes` to the current selection + /// + /// Documents added by this call are assumed to have a score of 0. + /// + /// Args: + /// nodes (list): a list of the vertex ids or vertices to add + /// + /// Returns: + /// A new vectorised graph containing the updated selection + fn append_nodes(&self, nodes: Vec) -> DynamicVectorisedGraph { + self.append(nodes, vec![]) } + /// Add all the documents from `edges` to the current selection + /// + /// Documents added by this call are assumed to have a score of 0. + /// + /// Args: + /// edges (list): a list of the edge ids or edges to add + /// + /// Returns: + /// A new vectorised graph containing the updated selection + fn append_edges(&self, edges: Vec<(VertexRef, VertexRef)>) -> DynamicVectorisedGraph { + self.append(vec![], edges) + } + + /// Add the top `limit` documents to the current selection using `query` + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// limit (int): the maximum number of new documents to add + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn expand_by_similarity( + fn append_by_similarity( &self, query: PyQuery, limit: usize, @@ -224,11 +283,20 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .expand_by_similarity(&embedding, limit, translate(window)) + .append_by_similarity(&embedding, limit, translate(window)) } + /// Add the top `limit` node documents to the current selection using `query` + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// limit (int): the maximum number of new documents to add + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn expand_nodes_by_similarity( + fn append_nodes_by_similarity( &self, query: PyQuery, limit: usize, @@ -236,11 +304,20 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .expand_nodes_by_similarity(&embedding, limit, translate(window)) + .append_nodes_by_similarity(&embedding, limit, translate(window)) } + /// Add the top `limit` edge documents to the current selection using `query` + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// limit (int): the maximum number of new documents to add + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn expand_edges_by_similarity( + fn append_edges_by_similarity( &self, query: PyQuery, limit: usize, @@ -248,27 +325,46 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .expand_edges_by_similarity(&embedding, limit, translate(window)) - } - - fn append( - &self, - nodes: Vec, - edges: Vec<(VertexRef, VertexRef)>, - ) -> DynamicVectorisedGraph { - self.0.append(nodes, edges) - } - - fn append_nodes(&self, nodes: Vec) -> DynamicVectorisedGraph { - self.append(nodes, vec![]) + .append_edges_by_similarity(&embedding, limit, translate(window)) } - fn append_edges(&self, edges: Vec<(VertexRef, VertexRef)>) -> DynamicVectorisedGraph { - self.append(vec![], edges) + /// Add all the documents `hops` hops away to the selection + /// + /// Two documents A and B are considered to be 1 hop away of each other if they are on the same + /// entity or if they are on the same node/edge pair. Provided that, two nodes A and C are n + /// hops away of each other if there is a document B such that A is n - 1 hops away of B and B + /// is 1 hop away of C. + /// + /// Args: + /// hops (int): the number of hops to carry out the expansion + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection + #[pyo3(signature = (hops, window=None))] + fn expand(&self, hops: usize, window: Window) -> DynamicVectorisedGraph { + self.0.expand(hops, translate(window)) } + /// Add the top `limit` adjacent documents with higher score for `query` to the selection + /// + /// The expansion algorithm is a loop with two steps on each iteration: + /// 1. All the documents 1 hop away of some of the documents included on the selection (and + /// not already selected) are marked as candidates. + /// 2. Those candidates are added to the selection in descending order according to the + /// similarity score obtained against the `query`. + /// + /// This loops goes on until the current selection reaches a total of `limit` documents or + /// until no more documents are available + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn append_by_similarity( + fn expand_by_similarity( &self, query: PyQuery, limit: usize, @@ -276,11 +372,22 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .append_by_similarity(&embedding, limit, translate(window)) + .expand_by_similarity(&embedding, limit, translate(window)) } + /// Add the top `limit` adjacent node documents with higher score for `query` to the selection + /// + /// This function has the same behavior as expand_by_similarity but it only considers nodes. + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// limit (int): the maximum number of new documents to add + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn append_nodes_by_similarity( + fn expand_nodes_by_similarity( &self, query: PyQuery, limit: usize, @@ -288,11 +395,22 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .append_nodes_by_similarity(&embedding, limit, translate(window)) + .expand_nodes_by_similarity(&embedding, limit, translate(window)) } + /// Add the top `limit` adjacent edge documents with higher score for `query` to the selection + /// + /// This function has the same behavior as expand_by_similarity but it only considers edges. + /// + /// Args: + /// query (str or list): the text or the embedding to score against + /// limit (int): the maximum number of new documents to add + /// window ((int | str, int | str)): the window where documents need to belong to in order to be considered + /// + /// Returns: + /// A new vectorised graph containing the updated selection #[pyo3(signature = (query, limit, window=None))] - fn append_edges_by_similarity( + fn expand_edges_by_similarity( &self, query: PyQuery, limit: usize, @@ -300,7 +418,7 @@ impl PyVectorisedGraph { ) -> DynamicVectorisedGraph { let embedding = compute_embedding(&self.0, query); self.0 - .append_edges_by_similarity(&embedding, limit, translate(window)) + .expand_edges_by_similarity(&embedding, limit, translate(window)) } } diff --git a/raphtory/src/vectors/document_template.rs b/raphtory/src/vectors/document_template.rs index a60a56f854..e45e6d2304 100644 --- a/raphtory/src/vectors/document_template.rs +++ b/raphtory/src/vectors/document_template.rs @@ -9,8 +9,12 @@ use crate::{ use itertools::Itertools; use std::{convert::identity, sync::Arc}; +/// Trait to be implemented for custom document templates pub trait DocumentTemplate: Send + Sync { + /// A function that translate a node into an iterator of documents fn node(&self, vertex: &VertexView) -> Box>; + + /// A function that translate an edge into an iterator of documents fn edge(&self, edge: &EdgeView) -> Box>; } diff --git a/raphtory/src/vectors/embedding_cache.rs b/raphtory/src/vectors/embedding_cache.rs index ce982f4d2d..193e73dfc1 100644 --- a/raphtory/src/vectors/embedding_cache.rs +++ b/raphtory/src/vectors/embedding_cache.rs @@ -16,6 +16,11 @@ pub(crate) struct EmbeddingCache { } impl EmbeddingCache { + pub(crate) fn new(path: PathBuf) -> Self { + let cache = RwLock::new(CacheStore::new()); + Self { cache, path } + } + pub(crate) fn from_path(path: PathBuf) -> Self { let inner_cache = Self::try_reading_from_disk(&path).unwrap_or(HashMap::new()); let cache = RwLock::new(inner_cache); diff --git a/raphtory/src/vectors/mod.rs b/raphtory/src/vectors/mod.rs index aca1af0eae..b1e187b2a2 100644 --- a/raphtory/src/vectors/mod.rs +++ b/raphtory/src/vectors/mod.rs @@ -156,20 +156,27 @@ mod vector_tests { g.add_vertex(0, "test", NO_PROPS).unwrap(); // the following succeeds with no cache set up - g.vectorise(Box::new(fake_embedding), None, false).await; + g.vectorise(Box::new(fake_embedding), None, true, false) + .await; let path = "/tmp/raphtory/very/deep/path/embedding-cache-test"; let _ = remove_file(path); // the following creates the embeddings, and store them on the cache - g.vectorise(Box::new(fake_embedding), Some(PathBuf::from(path)), false) - .await; + g.vectorise( + Box::new(fake_embedding), + Some(PathBuf::from(path)), + true, + false, + ) + .await; // the following uses the embeddings from the cache, so it doesn't call the panicking // embedding, which would make the test fail g.vectorise( Box::new(panicking_embedding), Some(PathBuf::from(path)), + true, false, ) .await; @@ -182,7 +189,7 @@ mod vector_tests { let g = Graph::new(); let cache = PathBuf::from("/tmp/raphtory/vector-cache-lotr-test"); let vectors = g - .vectorise(Box::new(fake_embedding), Some(cache), false) + .vectorise(Box::new(fake_embedding), Some(cache), true, false) .await; let embedding: Embedding = fake_embedding(vec!["whatever".to_owned()]).await.remove(0); let docs = vectors @@ -263,6 +270,7 @@ age: 30"###; .vectorise_with_template( Box::new(fake_embedding), Some(PathBuf::from("/tmp/raphtory/vector-cache-multi-test")), + true, FakeMultiDocumentTemplate, false, ) @@ -316,6 +324,7 @@ age: 30"###; .vectorise_with_template( Box::new(fake_embedding), Some(PathBuf::from("/tmp/raphtory/vector-cache-window-test")), + true, FakeTemplateWithIntervals, false, ) @@ -393,6 +402,7 @@ age: 30"###; .vectorise_with_template( Box::new(openai_embedding), Some(PathBuf::from("/tmp/raphtory/vector-cache-lotr-test")), + true, CustomTemplate, false, ) diff --git a/raphtory/src/vectors/vectorisable.rs b/raphtory/src/vectors/vectorisable.rs index f413b030ae..2315ba2cf7 100644 --- a/raphtory/src/vectors/vectorisable.rs +++ b/raphtory/src/vectors/vectorisable.rs @@ -25,17 +25,40 @@ struct IndexedDocumentInput { #[async_trait(?Send)] pub trait Vectorisable { + /// Create a VectorisedGraph from the current graph + /// + /// # Arguments: + /// * embedding - the embedding function to translate documents to embeddings + /// * cache - the file to be used as a cache to avoid calling the embedding function + /// * overwrite_cache - whether or not to overwrite the cache if there are new embeddings + /// * verbose - whether or not to print logs reporting the progress + /// + /// # Returns: + /// A VectorisedGraph with all the documents/embeddings computed and with an initial empty selection async fn vectorise( &self, embedding: Box, cache_file: Option, + override_cache: bool, verbose: bool, ) -> VectorisedGraph; + /// Create a VectorisedGraph from the current graph + /// + /// # Arguments: + /// * embedding - the embedding function to translate documents to embeddings + /// * cache - the file to be used as a cache to avoid calling the embedding function + /// * overwrite_cache - whether or not to overwrite the cache if there are new embeddings + /// * template - the template to use to translate entities into documents + /// * verbose - whether or not to print logs reporting the progress + /// + /// # Returns: + /// A VectorisedGraph with all the documents/embeddings computed and with an initial empty selection async fn vectorise_with_template>( &self, embedding: Box, - cache_file: Option, + cache: Option, + override_cache: bool, template: T, verbose: bool, ) -> VectorisedGraph; @@ -46,17 +69,19 @@ impl Vectorisable for G { async fn vectorise( &self, embedding: Box, - cache_file: Option, + cache: Option, + overwrite_cache: bool, verbose: bool, ) -> VectorisedGraph { - self.vectorise_with_template(embedding, cache_file, DefaultTemplate, verbose) + self.vectorise_with_template(embedding, cache, overwrite_cache, DefaultTemplate, verbose) .await } async fn vectorise_with_template>( &self, embedding: Box, - cache_file: Option, + cache: Option, + overwrite_cache: bool, template: T, verbose: bool, ) -> VectorisedGraph { @@ -83,19 +108,21 @@ impl Vectorisable for G { }) }); - let cache = cache_file.map(EmbeddingCache::from_path); + let cache_storage = cache.map(EmbeddingCache::from_path); if verbose { - println!("compute embeddings for nodes"); + println!("computing embeddings for nodes"); } - let node_refs = compute_embedding_groups(nodes, embedding.as_ref(), &cache).await; + let node_refs = compute_embedding_groups(nodes, embedding.as_ref(), &cache_storage).await; if verbose { - println!("compute embeddings for edges"); + println!("computing embeddings for edges"); } - let edge_refs = compute_embedding_groups(edges, embedding.as_ref(), &cache).await; // FIXME: re-enable + let edge_refs = compute_embedding_groups(edges, embedding.as_ref(), &cache_storage).await; // FIXME: re-enable - cache.iter().for_each(|cache| cache.dump_to_disk()); + if overwrite_cache { + cache_storage.iter().for_each(|cache| cache.dump_to_disk()); + } VectorisedGraph::new( self.clone(), diff --git a/raphtory/src/vectors/vectorised_graph.rs b/raphtory/src/vectors/vectorised_graph.rs index 7a24e452ce..9182239a16 100644 --- a/raphtory/src/vectors/vectorised_graph.rs +++ b/raphtory/src/vectors/vectorised_graph.rs @@ -6,13 +6,15 @@ use crate::{ }, prelude::*, vectors::{ - document_ref::DocumentRef, document_template::DocumentTemplate, entity_id::EntityId, - Document, Embedding, EmbeddingFunction, + document_ref::DocumentRef, document_template::DocumentTemplate, + embedding_cache::EmbeddingCache, entity_id::EntityId, Document, DocumentOps, Embedding, + EmbeddingFunction, }, }; use itertools::{chain, Itertools}; use std::{ collections::{HashMap, HashSet}, + path::PathBuf, sync::Arc, }; @@ -68,25 +70,19 @@ impl> VectorisedGraph { } } - /// This assumes forced documents to have a score of 0 - pub fn append>(&self, nodes: Vec, edges: Vec<(V, V)>) -> Self { - let node_docs = nodes.into_iter().flat_map(|id| { - let vertex = self.source_graph.vertex(id); - let opt = vertex.map(|vertex| self.node_documents.get(&EntityId::from_node(&vertex))); - opt.flatten().unwrap_or(&self.empty_vec) - }); - let edge_docs = edges.into_iter().flat_map(|(src, dst)| { - let edge = self.source_graph.edge(src, dst); - let opt = edge.map(|edge| self.edge_documents.get(&EntityId::from_edge(&edge))); - opt.flatten().unwrap_or(&self.empty_vec) + /// Save the embeddings present in this graph to `file` so they can be further used in a call to `vectorise` + pub fn save_embeddings(&self, file: PathBuf) { + let cache = EmbeddingCache::new(file); + chain!(self.node_documents.iter(), self.edge_documents.iter()).for_each(|(_, group)| { + group.iter().for_each(|doc| { + let original = doc.regenerate(&self.source_graph, self.template.as_ref()); + cache.upsert_embedding(original.content(), doc.embedding.clone()); + }) }); - let new_selected = chain!(node_docs, edge_docs).map(|doc| (doc.clone(), 0.0)); - Self { - selected_docs: extend_selection(self.selected_docs.clone(), new_selected, usize::MAX), - ..self.clone() - } + cache.dump_to_disk(); } + /// Return the nodes present in the current selection pub fn nodes(&self) -> Vec> { self.selected_docs .iter() @@ -97,6 +93,7 @@ impl> VectorisedGraph { .collect_vec() } + /// Return the edges present in the current selection pub fn edges(&self) -> Vec> { self.selected_docs .iter() @@ -107,6 +104,7 @@ impl> VectorisedGraph { .collect_vec() } + /// Return the documents present in the current selection pub fn get_documents(&self) -> Vec { self.get_documents_with_scores() .into_iter() @@ -114,6 +112,7 @@ impl> VectorisedGraph { .collect_vec() } + /// Return the documents alongside their scores present in the current selection pub fn get_documents_with_scores(&self) -> Vec<(Document, f32)> { self.selected_docs .iter() @@ -126,6 +125,43 @@ impl> VectorisedGraph { .collect_vec() } + /// Add all the documents from `nodes` and `edges` to the current selection + /// + /// Documents added by this call are assumed to have a score of 0. + /// + /// # Arguments + /// * nodes - a list of the vertex ids or vertices to add + /// * edges - a list of the edge ids or edges to add + /// + /// # Returns + /// A new vectorised graph containing the updated selection + pub fn append>(&self, nodes: Vec, edges: Vec<(V, V)>) -> Self { + let node_docs = nodes.into_iter().flat_map(|id| { + let vertex = self.source_graph.vertex(id); + let opt = vertex.map(|vertex| self.node_documents.get(&EntityId::from_node(&vertex))); + opt.flatten().unwrap_or(&self.empty_vec) + }); + let edge_docs = edges.into_iter().flat_map(|(src, dst)| { + let edge = self.source_graph.edge(src, dst); + let opt = edge.map(|edge| self.edge_documents.get(&EntityId::from_edge(&edge))); + opt.flatten().unwrap_or(&self.empty_vec) + }); + let new_selected = chain!(node_docs, edge_docs).map(|doc| (doc.clone(), 0.0)); + Self { + selected_docs: extend_selection(self.selected_docs.clone(), new_selected, usize::MAX), + ..self.clone() + } + } + + /// Add the top `limit` documents to the current selection using `query` + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * limit - the maximum number of new documents to add + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn append_by_similarity( &self, query: &Embedding, @@ -136,6 +172,15 @@ impl> VectorisedGraph { self.add_top_documents(joined, query, limit, window) } + /// Add the top `limit` node documents to the current selection using `query` + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * limit - the maximum number of new documents to add + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn append_nodes_by_similarity( &self, query: &Embedding, @@ -144,6 +189,16 @@ impl> VectorisedGraph { ) -> Self { self.add_top_documents(self.node_documents.as_ref(), query, limit, window) } + + /// Add the top `limit` edge documents to the current selection using `query` + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * limit - the maximum number of new documents to add + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn append_edges_by_similarity( &self, query: &Embedding, @@ -153,7 +208,19 @@ impl> VectorisedGraph { self.add_top_documents(self.edge_documents.as_ref(), query, limit, window) } - /// This assumes forced documents to have a score of 0 + /// Add all the documents `hops` hops away to the selection + /// + /// Two documents A and B are considered to be 1 hop away of each other if they are on the same + /// entity or if they are on the same node/edge pair. Provided that, two nodes A and C are n + /// hops away of each other if there is a document B such that A is n - 1 hops away of B and B + /// is 1 hop away of C. + /// + /// # Arguments + /// * hops - the number of hops to carry out the expansion + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn expand(&self, hops: usize, window: Option<(i64, i64)>) -> Self { match window { None => self.expand_with_window(hops, window, &self.source_graph), @@ -185,6 +252,23 @@ impl> VectorisedGraph { } } + /// Add the top `limit` adjacent documents with higher score for `query` to the selection + /// + /// The expansion algorithm is a loop with two steps on each iteration: + /// 1. All the documents 1 hop away of some of the documents included on the selection (and + /// not already selected) are marked as candidates. + /// 2. Those candidates are added to the selection in descending order according to the + /// similarity score obtained against the `query`. + /// + /// This loops goes on until the current selection reaches a total of `limit` documents or + /// until no more documents are available + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn expand_by_similarity( &self, query: &Embedding, @@ -194,6 +278,17 @@ impl> VectorisedGraph { self.expand_by_similarity_with_path(query, limit, window, ExpansionPath::Both) } + /// Add the top `limit` adjacent node documents with higher score for `query` to the selection + /// + /// This function has the same behavior as expand_by_similarity but it only considers nodes. + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * limit - the maximum number of new documents to add + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn expand_nodes_by_similarity( &self, query: &Embedding, @@ -203,6 +298,17 @@ impl> VectorisedGraph { self.expand_by_similarity_with_path(query, limit, window, ExpansionPath::Nodes) } + /// Add the top `limit` adjacent edge documents with higher score for `query` to the selection + /// + /// This function has the same behavior as expand_by_similarity but it only considers edges. + /// + /// # Arguments + /// * query - the text or the embedding to score against + /// * limit - the maximum number of new documents to add + /// * window - the window where documents need to belong to in order to be considered + /// + /// # Returns + /// A new vectorised graph containing the updated selection pub fn expand_edges_by_similarity( &self, query: &Embedding,