# Databricks notebook source
# MAGIC %md
# MAGIC # Exploring Vector DBs
# MAGIC In this notebook we will explore the process of converting text to numbers and what that means for our sentences.
# MAGIC We will use the FAISS library, which provides a large variety of different algorithms that you can try out.
# MAGIC The difference between FAISS and a full Vector Database solution lies in things like governance,
# MAGIC convenience features like updates, and production-grade features like failover and backups.
# COMMAND ----------
# MAGIC %pip install faiss-cpu wikipedia

# COMMAND ----------
dbutils.library.restartPython()
# COMMAND ----------
# DBTITLE 1,Setup
# MAGIC %run ./utils
# COMMAND ----------
import faiss
import wikipedia
import os
# COMMAND ----------
# MAGIC %md
# MAGIC # Load some sample data
# MAGIC We will use Wikipedia for our initial sample data
# COMMAND ----------
# Load Sample Data
result = wikipedia.search("Neural networks")
print(result)
# COMMAND ----------
# get the first article
page = wikipedia.page(result[0])
# rough word count of the article
len(page.content.split())
# COMMAND ----------
# MAGIC %md
# MAGIC # Load Embedding Model
# MAGIC In this example, we will use the tokenizer from Mistral-7B-Instruct to start
# MAGIC
# MAGIC *NOTE* When we build out our full architecture there will be two functions that turn text into tokens:
# MAGIC - Model Tokenizer - the component we are experimenting with here
# MAGIC - Embedding Tokenizer - explored later; it is used to populate the VectorDB
# MAGIC
# MAGIC Whilst the _Model Tokenizer_ is fixed (you have to use the one intended for your model), the _Embedding Tokenizer_ is something
# MAGIC that we can select to suit our use case
# COMMAND ----------
from transformers import AutoTokenizer
# We will use the tokenizer from 'mistralai/Mistral-7B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(f'{downloads_dir}/mistral_7b_instruct',
                                          local_files_only=True)
# COMMAND ----------
# MAGIC %md
# MAGIC # Explore tokenization
# MAGIC Lets explore the way that words are encoded for our LLM
# COMMAND ----------
tokenizer.encode('test')
# COMMAND ----------
# print each encoding; in a notebook only the last expression is echoed
print(tokenizer.encode('party'))
print(tokenizer.encode('partying'))
# this word exists!!!!
print(tokenizer.encode('Pneumonoultramicroscopicsilicovolcanoconiosis'))
# COMMAND ----------
# lets decode a little and see what the codes mean
medical_encode = tokenizer.encode('Pneumonoultramicroscopicsilicovolcanoconiosis')
# decode each token id individually to see the subword pieces
for token_id in medical_encode:
    print(f'{token_id} -> {tokenizer.decode(token_id)}')
# COMMAND ----------
tokenizer.encode('I am happily eating pizza all day long')
# COMMAND ----------
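# MAGIC %md
# MAGIC As a quick aside (a small sketch added for illustration): the _Model Tokenizer_ output is variable length, so longer text yields more token ids.

# COMMAND ----------
# the number of token ids grows with the length of the input text
for text in ['test', 'I am happily eating pizza all day long']:
    print(f'{len(tokenizer.encode(text))} tokens <- {text!r}')

# COMMAND ----------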
# MAGIC %md
# MAGIC # Sentence Transformers for Embedding tokenization
# MAGIC The Sentence Transformers library provides a series of embedding algorithms that can be used to populate our VectorDB.
# MAGIC Unlike the _Model Tokenizer_, which produces a variable-length output depending on the input,
# MAGIC an embedding algorithm produces a fixed-length vector so that we can run approximate nearest neighbour algorithms.
# COMMAND ----------
from sentence_transformers import SentenceTransformer
# initialize sentence transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# COMMAND ----------
# Split the document into paragraphs
paragraph_form = page.content.split('\n\n')
len(paragraph_form)
# COMMAND ----------
# MAGIC %md
# MAGIC Encode the paragraphs into dense vectors.
# MAGIC Different models will produce a different length vector.
# MAGIC In theory, a model that produces a longer vector can represent the input data better,
# MAGIC but in practice it depends on the type of data the model was trained on.
# MAGIC
# MAGIC i.e. a Sentence Transformer that produces 512-length vectors BUT is trained on medical data
# MAGIC will provide a better representation for medical documents than a Sentence Transformer that produces 1024-length vectors
# MAGIC but is only trained on social media.
# COMMAND ----------
sentence_encode = model.encode(paragraph_form)
sentence_encode.shape
# COMMAND ----------
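# MAGIC %md
# MAGIC The fixed length is what lets us compare any two chunks of text directly.
# MAGIC As a small sketch (an addition for illustration), we can compute the cosine similarity between the first two paragraph vectors with plain numpy:

# COMMAND ----------
import numpy as np

# cosine similarity between the first two paragraph embeddings;
# values closer to 1 indicate more semantically similar text
vec_a, vec_b = sentence_encode[0], sentence_encode[1]
cos_sim = float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f'cosine similarity between paragraphs 0 and 1: {cos_sim:.4f}')

# COMMAND ----------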
# MAGIC %md
# MAGIC # Let's build out a FAISS index
# MAGIC FAISS lets us experiment with a wide variety of different search algorithms,
# MAGIC whereas most VectorDBs will offer just one option.
# COMMAND ----------
# IndexFlatL2 does exact (brute force) L2 distance search over all vectors
index = faiss.IndexFlatL2(sentence_encode.shape[1])
# COMMAND ----------
index.add(sentence_encode)
# COMMAND ----------
# now we can search!
num_results = 3
question = 'Were animals used in neural network development'
query_vector = model.encode([question])
score, index_id = index.search(query_vector, num_results)
# COMMAND ----------
# Retrieve Index id
print(f'index ids retrieved are: {index_id}\n')
for x in index_id[0]:
    print(f'Entry: {x}')
    print(f'{paragraph_form[x]}\n')
# COMMAND ----------
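# MAGIC %md
# MAGIC The search also returned L2 distances. As a small optional sketch, we can print each snippet alongside its distance (lower means a closer match):

# COMMAND ----------
# pair each retrieved index with its L2 distance to the query vector
for dist, idx in zip(score[0], index_id[0]):
    print(f'Entry {idx} (L2 distance {dist:.2f}):')
    print(f'{paragraph_form[idx]}\n')

# COMMAND ----------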
# MAGIC %md
# MAGIC # Discussion
# MAGIC The main goal in this exercise is to find the best snippets.\
# MAGIC Specifically for vector embeddings there are many algorithms.\
# MAGIC You can look at some of the varieties here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes\
# MAGIC Generally you are trading off between speed of indexing / retrieval and accuracy.
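
# COMMAND ----------
# MAGIC %md
# MAGIC To illustrate that trade-off, below is a minimal sketch (an illustrative addition; the parameter values are assumptions chosen for this tiny dataset)
# MAGIC of an IVF index, which partitions the vectors into cells and only visits a few cells per query.

# COMMAND ----------
# an IVF index trades a little accuracy for much faster search on large collections
nlist = 4  # number of partitions (Voronoi cells); kept small since we only have a few paragraphs
quantizer = faiss.IndexFlatL2(sentence_encode.shape[1])
ivf_index = faiss.IndexIVFFlat(quantizer, sentence_encode.shape[1], nlist)
ivf_index.train(sentence_encode)  # IVF indexes must be trained before vectors are added
ivf_index.add(sentence_encode)
ivf_index.nprobe = 2  # cells visited per query: higher is more accurate but slower
ivf_score, ivf_ids = ivf_index.search(query_vector, num_results)
print(ivf_score, ivf_ids)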