  1. Install Ollama

    brew install ollama
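
    Homebrew covers macOS. On Linux, the official install script from ollama.com does the same job (worth checking the docs for the current command):

    curl -fsSL https://ollama.com/install.sh | sh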
  2. Start Ollama

    We'll start Ollama in server mode, so we get an HTTP interface that any client can consume.

    # OLLAMA_MAX_LOADED_MODELS: max models kept loaded in memory at once
    # OLLAMA_NUM_PARALLEL: parallel requests each model can serve
    # OLLAMA_ORIGINS: CORS origins allowed to call the HTTP API
    # OLLAMA_FLASH_ATTENTION: enable flash attention on supported hardware
    OLLAMA_MAX_LOADED_MODELS=4 \
    OLLAMA_NUM_PARALLEL=10 \
    OLLAMA_ORIGINS="*" \
    OLLAMA_FLASH_ATTENTION=1 \
    ollama serve
    # output
    2024/06/28 15:48:32 routes.go:1064: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/iguatemi/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[\\* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
    time=2024-06-28T15:48:32.326+01:00 level=INFO source=images.go:730 msg="total blobs: 29"
    time=2024-06-28T15:48:32.328+01:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
    [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
    
    [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
    - using env:	export GIN_MODE=release
    - using code:	gin.SetMode(gin.ReleaseMode)
    
    [GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
    [GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
    [GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
    [GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
    [GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
    [GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
    [GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
    [GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
    [GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
    [GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
    [GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
    [GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
    [GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
    [GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
    [GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
    [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
    [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
    [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
    [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
    time=2024-06-28T15:48:32.330+01:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11434 (version 0.1.46)"
    time=2024-06-28T15:48:32.331+01:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/rw/0syygy1528v0515md9pnvt9r0000gp/T/ollama1714798501/runners
    time=2024-06-28T15:48:32.357+01:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
    time=2024-06-28T15:48:32.403+01:00 level=INFO source=types.go:98 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="21.3 GiB" available="21.3 GiB"
    ...

    You can list all available configuration options by running ollama serve --help.
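
    A quick way to confirm the server is up is to hit one of the read-only endpoints from the route listing above; these need no model loaded:

    curl http://127.0.0.1:11434/api/version
    curl http://127.0.0.1:11434/api/tags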

  3. Test it

    Ollama pulls model weights automatically the first time you run a model. You can browse the available models at ollama.com/library

    ollama run llama3 "Hello, world!"
    # output
    pulling manifest
    pulling 6a0746a1ec1a... 100% ▕██████████████▏ 4.7 GB
    pulling 4fa551d4f938... 100% ▕██████████████▏  12 KB
    pulling 8ab4849b038c... 100% ▕██████████████▏  254 B
    pulling 577073ffcc6c... 100% ▕██████████████▏  110 B
    pulling 3f8eb4da87fa... 100% ▕██████████████▏  485 B
    verifying sha256 digest
    writing manifest
    removing any unused layers
    success
    Hello there! It's great to meet you!
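
    Since the server is HTTP-first, the same prompt can go through the native API. The /api/generate route appears in the server log above; setting stream to false returns a single JSON response instead of a token stream:

    curl http://127.0.0.1:11434/api/generate -d '{
      "model": "llama3",
      "prompt": "Hello, world!",
      "stream": false
    }'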
  4. In VS Code or JetBrains IDEs, you can use the Continue extension to wire your editor up to the local server (see the API example below).
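
    The route listing also shows an OpenAI-compatible /v1/chat/completions endpoint, which is the kind of API many editor integrations speak. A minimal check, assuming llama3 is already pulled:

    curl http://127.0.0.1:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'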
