test.http
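
# Sends the video script below to a local Ollama server's embeddings endpoint,
# producing an embedding of the whole transcript with the jmorgan/nomic-embed-text
# model. This assumes Ollama is already running locally on its default port
# (11434) and that the model has been pulled.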
POST http://localhost:11434/api/embeddings
Content-Type: application/json

{"model":"jmorgan/nomic-embed-text", "prompt":"Maybe you don't have a GPU. Or you would rather not hear your fans all the time. You want your local machine to run a coding model, but need access to a different model for other tasks as well. You are streaming with OBS and don't want things to lag. You are running different agents and want them to respond, together, faster. There is probably an infinite number of reasons to need more. \n\nSo let's search for a solution. We need a cloud provider that has access to virtual machines with GPUs. Not just 'shared' GPUs that have generic names where you don't know what's really there, like on Azure. But to know that it's an Nvidia card for which CUDA drivers actually exist. \n\nI have friends who love Paperspace for this, which is now owned by DigitalOcean. Paperspace is pretty awesome, apart from the fact that I have to make quota requests that get approved 20 minutes later, every freaking time I try to start a machine. But they are pretty much the only reliable source of Windows-based instances with GPU that I have been able to find. I don't use a lot of Windows, but I did for this channel recently. I know others who adore FLY.io, but I felt it was a bit awkward to use. I spent a while with the docs and just couldn't get it. \n\nI really love brev.dev for this particular problem. Before, I tried using services like lambdalabs or GCP, but I always ran into issues getting access to a GPU. And I wouldn't know there was a problem for sometimes as long as 5 minutes. Eventually, you see a red exclamation point in the console saying no more GPUs in that region, try again. And then you try a different region and hope for the best. Why can't they just tell you where GPU's exist? I have no idea. \n\nWell, that's what I find to be really amazing about Brev. They make it super easy to find a GPU somewhere on the planet with a few different providers. Usually, a fraction of a second of latency doesn't really make a lot of difference for this use case, so if the machine is in Singapore or anywhere else where it is late at night, it doesn't really matter. And you pay them rather than signing up for AWS and GCP or others. \n\nNow, you might be thinking that this will be very expensive. Well, how long do you use models on any given day? 2 hours? 3 hours? 5 days a week? So you could do that for about $6 a month. I think that's pretty reasonable. So let's check out how this works.\n\nI'll login to my brev.dev account and click the new instance button. So which gpu do I want? I tend to go for a t4. Its cheap and pretty fast often 40 something tokens per second a lot of the time. Pricing changes depending on what's available, but I'll often choose spot pricing to drop that lower and here is what it costs when recording this. The price was different a few hours ago. \n\nI can give it more or less disk and then set a name. let's call this remoteollama and click Deploy. So far in my experience the machine is up and running in about a minute, maybe a bit less. ollama takes another 4 seconds to install because all the gpu drivers are already there. Often on GCP when I do get an instance, I am waiting for five minutes or more just to install CUDA drivers. And then pulling and running llama2 is another 20 seconds. So that's pretty quick. \n\nDid you notice how I logged in? Normally with most cloud providers you would have to give it an ssh key at the beginning, or download one to connect, or download some other file to connect. 
With Brev you don't deal with any of those things. You install one command when you setup your account called brev. I can type `brev shell remoteollama` and I am ssh'd into the host. Perhaps even cooler is that I can type `brev open remoteollama` and vscode opens with everything setup to work against that remote machine. I think that is pretty cool. \n\nBut I would like to be able to just run ollama and have my ollama client access the remote machine. So there are a few steps to getting that working. We need to tell the client machine where the ollama service is running. We need to tell the ollama server that we should accept requests from other machines. And we need to enable remote machines to access the brev server. That last option can be the easiest and it can also be the hardest. On some platforms you might just grant all access to all visitors. That is super dangerous and probably just stupid. There are search engines out there that make it easy to find open ports all over the world. I tried it once 4 months ago and found dozens and dozens of ollama servers wide open. don't do that. Its amazing how much free compute you can get if you 1. try and 2. don't have morals. \n\nBrev offers the ability to open up a service in their ui that you can share with folks, and then you use brev to authenticate and get access to the service. If you would like to see that, I can cover it in a future video. \n\nBut the approach I love to take is to use Tailscale. Tailscale is like really secure VPN done really simply. It is amazing how quickly you can be up and running. And for 3 users even with a custom domain, its free....with 100 devices. I don't know about you but I don't have 100 devices. Beyond that its 6 bucks per ACTIVE user per month. I'm not going to show setting up Tailscale from the beginning but I can if you want it. Just ask for that below. \n\nI do want to add remoteollama to this tailscale network. So I will choose add device in tailscale and choose linux. here is a shell script to run. Copy that. now `brev shell remoteollama` and paste the command. then run `sudo tailscale up`. It gives me a url to open and that will log the machine in to my network. I'll just come into the UI again and rename this host to remoteollama. We are almost there. \n\nNow on remoteollama we need to add an environment variable to tell the ollama service to take requests from remote machines. So we need to set ollama_host to 0.0.0.0. The right way to do this is to run `sudo systemctl edit ollama.service`. The first time we do this, we get a blank file. Add `[Service]` at the top then `Environment='OLLAMA_HOST=0.0.0.0'`. That's a little strange. The second equals sign is inside the double quotes. Save that out and then `sudo systemctl daemon-reload` and then `sudo systemctl restart ollama` to restart ollama. \n\nOK, final stretch. When you setup Tailscale, you get this cool icon in your menu bar. You should see remoteollama listed there. So in the terminal on your machine, run `OLLAMA_HOST=remoteollama ollama run llama2`. And boom, you are in and running llama2 on ollama on a machine somewhere in the world. And it just works. When its time to stop the instance, visit brev.dev and click the delete button. \n\nYou might wonder why I put the environment variable on the line when running the ollama client. Well, maybe I am running ollama on this machine for help with coding. If I set the environment variable the right way for the service, it will screw up the service. 
I just want this to take effect for the cli client and let vscode and the local service to continue to work fine.\n\nSo what else can you do now that you have ollama running with tailscale. Well, maybe you want a web ui that you can run from your phone. That will just work. And maybe you do that with your regular machine instead of a hosted server. Maybe a friend has a super powerful machine that you share access to and that becomes your ollama server you both use? Usually if setting up remote networking was easy, you probably did it wrong. But Tailscale makes this and so many other situations super easy and you did it right. And that's why I like brev. It makes something hard super easy. I guess that's why I like Ollama so much too. It makes something that is pretty hard super easy. \n\nAnd that's what I have for you this time. let me know if you have any more questions. I am watching the comments all the time and love to hear what you have to say. thanks so much for watching. goodbye."
}
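
###
# A sketch of the same embeddings request pointed at the remote Brev machine
# instead of localhost. It assumes the instance has joined the tailnet and been
# renamed remoteollama (so Tailscale resolves that name), and that the server's
# OLLAMA_HOST=0.0.0.0 override described in the prompt above is in place. The
# short prompt here is only a placeholder.
POST http://remoteollama:11434/api/embeddings
Content-Type: application/json

{"model":"jmorgan/nomic-embed-text", "prompt":"A quick test prompt to embed on the remote GPU machine."}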
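
###
# Another sketch: roughly the HTTP equivalent of the `OLLAMA_HOST=remoteollama ollama run llama2`
# step, using Ollama's /api/generate endpoint against the remote host. Setting
# "stream" to false returns a single JSON response instead of streamed chunks.
# The question is just an example prompt.
POST http://remoteollama:11434/api/generate
Content-Type: application/json

{"model":"llama2", "prompt":"Why is the sky blue?", "stream": false}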
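
###
# A quick check, not shown in the transcript: if the Tailscale and OLLAMA_HOST
# steps worked, this should return the list of models on the remote machine.
GET http://remoteollama:11434/api/tags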