If you want a capable AI coding assistant that respects your Mac Mini’s limits, you can run a small, quantized open‑source model locally and wire it up to VS Code. The flow below keeps things beginner‑friendly, fast, and on‑demand.
- VRAM (Video RAM): The memory available to the graphics hardware. It is the "workspace" for the AI. On Apple Silicon there is no separate graphics card; the GPU shares unified memory with the CPU, so "VRAM" here effectively means a slice of your system RAM.
- The rule: If the model's size is larger than the memory you can spare for it, it will either run extremely slowly or not at all.
Start by picking a 7–8B parameter code model in a quantized format (e.g., Q4_K_M or Q5_K_M). These fit in memory and remain responsive on Apple Silicon.
| Model | Size class | Best use | License | Notes |
|---|---|---|---|---|
| DeepSeek‑Coder:6.7B | 6.7B | Very strong at structured code generation, unit tests, multi‑file context; good with TypeScript | DeepSeek License & MIT License | Less tuned for natural language explanations. Best for project‑level code generation and refactoring |
| DeepSeek‑R1:7B | 7B | Good at debugging, algorithm design, math‑heavy code | MIT License | Not trained primarily on code corpora. Best for problem solving and debugging logic |
| Qwen2.5‑Coder | 7B | Python/JS/TypeScript | Apache 2.0 | Solid structured code & doc reasoning |
| Llama 3.2 | 3B | Rephrasing, light chat, and fast reasoning | Llama 3.2 License | Excellent for query rephrasing and fast context processing |
| StarCoder2 | 7B | Code completion | BigCode OpenRAIL | Good for autocomplete and snippets |
| Code Llama | 7B | C/C++/Python | Llama 2 Community | Older but stable for classic tasks |
Sources: these are widely supported in local runtimes and available in quantized GGUF or via Ollama model registries.
- Memory fit: Q4/Q5 quantized 7B models typically use ~4–6GB VRAM-equivalent in RAM, leaving room for OS and tools.
- Speed: Apple Silicon accelerates integer math well; quantized models run fast enough for interactive coding.
- Quality: These models are tuned for code, not general chat, so they autocomplete and refactor more reliably.
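The ~4–6GB figure above can be sanity-checked with back-of-the-envelope arithmetic: Q4/Q5 quantizations store roughly 4.5–5.5 bits per weight, plus some overhead for the KV cache and runtime buffers. A rough sketch (the bits-per-weight and overhead numbers are approximations, not measured values):

```python
def estimate_model_ram_gb(params_billion: float, bits_per_weight: float,
                          overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a quantized model: weights plus a fixed
    allowance for KV cache and runtime buffers. Approximation only."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# A 7B model at ~4.5 bits/weight (roughly Q4_K_M):
q4 = estimate_model_ram_gb(7, 4.5)
# At ~5.5 bits/weight (roughly Q5_K_M):
q5 = estimate_model_ram_gb(7, 5.5)
print(f"Q4 ~ {q4:.1f} GB, Q5 ~ {q5:.1f} GB")  # Q4 ~ 4.9 GB, Q5 ~ 5.8 GB
```

Both estimates land inside the 4–6GB window quoted above, leaving roughly 10GB free on a 16GB Mac Mini for the OS and tools.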
Use Ollama for the simplest local install and model management on Apple Silicon. It bundles Metal acceleration and an HTTP API VS Code can call.
- Homebrew:
  - Install or update Homebrew:
    ```bash
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    ```
  - Verify:
    ```bash
    brew doctor
    ```
- Ollama:
  - Install:
    ```bash
    brew install ollama
    ```
  - Confirm:
    ```bash
    ollama --version
    ```
- VS Code:
- Download from code.visualstudio.com and install
- Verify (optional, if you installed the command-line launcher):
  ```bash
  code --version
  ```
- Server running: Ensure `ollama serve` is active and listening at `http://localhost:11434`.
- Choose one primary model plus supporting models (example: Qwen2.5‑Coder with embedding, reranker, and rephrasing helpers):
  ```bash
  ollama pull qwen2.5-coder:7b-instruct-q5_K_M
  ollama pull nomic-embed-text:latest
  ollama pull qwen3-embedding:0.6b
  ollama pull dengcao/qwen3-reranker-0.6b:q8_0
  ollama pull llama3.2:3b-instruct-q5_K_M
  ```
- Alternative options:
  ```bash
  ollama pull deepseek-r1:8b-0528-qwen3-q4_K_M
  ollama pull deepseek-coder:6.7b
  ollama pull starcoder2:7b
  ollama pull codellama:7b
  ```
- To remove a model from your system:
  ```bash
  ollama rm deepseek-r1:8b
  ```
- Quick run test:
  - Check the model's maximum supported context:
    ```bash
    ollama show <model_name>
    ```
  - Run it interactively:
    ```bash
    ollama run qwen2.5-coder:7b-instruct-q5_K_M
    ```
    Ask: "Write a Python function to validate an email address." Press Ctrl+C to stop a response; exit with Ctrl+D or `/bye`.
  - Test your Ollama server directly via HTTP requests to `http://localhost:11434/`. A simple example using curl in the terminal:
    ```bash
    curl http://localhost:11434/api/generate \
      -d '{
        "model": "qwen2.5-coder:7b-instruct-q5_K_M",
        "prompt": "Write a TypeScript function that adds two numbers."
      }'
    ```
Tip: Append a quantization tag if offered (e.g., `:q4_K_M`). Many Ollama registries default to efficient quantizations.
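The `/api/generate` endpoint streams its reply as newline-delimited JSON objects, each carrying a `response` fragment and a final `done` flag. A sketch of reassembling the full text; the sample lines below are illustrative, not real server output:

```python
import json

def collect_stream(ndjson_text: str) -> str:
    """Reassemble Ollama's streaming /api/generate output.
    Each line is a JSON object with a 'response' fragment; the last has done=true."""
    parts = []
    for line in ndjson_text.strip().splitlines():
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative sample of what the stream looks like:
sample = (
    '{"model":"qwen2.5-coder:7b-instruct-q5_K_M","response":"function add","done":false}\n'
    '{"model":"qwen2.5-coder:7b-instruct-q5_K_M","response":"(a, b) { return a + b; }","done":true}\n'
)
print(collect_stream(sample))  # function add(a, b) { return a + b; }
```

Passing `"stream": false` in the request body returns a single JSON object instead, which is simpler for one-off scripts.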
Use the Continue extension (open‑source) to connect VS Code to your local Ollama server. It provides inline completions, chat, and a “fix/modify” workflow.
- Get the extension:
- In VS Code, go to Extensions and search “Continue”
- Install “Continue - open-source AI code agent”
- Open the Continue sidebar: Click the Continue icon on the Activity Bar
- Create or edit config:
- In the Continue sidebar, open Settings
- Set Provider to “Ollama”
- Set Model to the one you pulled (e.g., `qwen2.5-coder:7b-instruct-q5_K_M`)
- Ensure Endpoint is `http://localhost:11434`
- config.yaml:

```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen2.5 7B (Chat) # Current
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q5_K_M
    roles:
      - chat
      - edit
      - apply
  - name: DeepSeek-R1 8B (Reasoning)
    # DeepSeek-R1 8B can only be used for plain chat/reasoning in Continue; it does not support tools like @Codebase.
    provider: ollama
    model: deepseek-r1:8b-0528-qwen3-q4_K_M
    roles:
      - chat
      - edit
      - apply
  - name: Qwen2.5 1.5B (Fast)
    provider: ollama
    model: qwen2.5-coder:1.5b-instruct-q5_K_M
    roles:
      - autocomplete
  - name: Nomic Embed Text # Current
    provider: ollama
    model: nomic-embed-text:latest
    roles:
      - embed
  - name: Qwen3 Embedding
    provider: ollama
    model: qwen3-embedding:0.6b
    roles:
      - embed
  - name: Llama 3.2 3B (Rephrasing)
    provider: ollama
    model: llama3.2:3b-instruct-q5_K_M
    roles:
      - chat
      - edit
      - apply
  - name: Qwen3 Reranker # Current
    provider: ollama
    model: dengcao/qwen3-reranker-0.6b:q8_0
    roles:
      - rerank
context:
  - provider: codebase   # Deep search across all files
  - provider: code       # High-level code intelligence
  - provider: tree       # Visualizes file structure
  - provider: problems   # Sees VS Code "Problems" tab errors
  - provider: open       # References open/pinned tabs
    params:
      onlyPinned: true
  - provider: terminal   # References terminal output
  - provider: repo-map   # The "skeleton" of your project
    params:
      includeSignatures: false
  - provider: debugger   # Local variable state
    params:
      stackDepth: 3
```
- Enable Codebase Indexing:
- In the Continue sidebar, click the gear/settings icon.
- Find the section labeled Indexing.
- Toggle the switch to Enable.
- Rebuild the Index
- Under @codebase index, look for the option Click to re-index.
- Troubleshooting:
  - The Continue VS Code extension relies on LanceDB to store and search code embeddings. On macOS ARM64, it needs a native library (`@lancedb/vectordb-darwin-arm64`). Codebase indexing and deep search don't work without this library; typical errors:
    - Embedding: `lance.connect is not a function`
    - Full-text search: `SQLITE_ERROR: no such table: chunks`
  - Installing the missing library manually might resolve these issues:
    - Open Terminal on your Mac.
    - Navigate to the Continue extension folder:
      ```bash
      cd ~/.vscode/extensions/continue.continue-1.2.11-darwin-arm64
      ```
    - Install the missing LanceDB native library:
      ```bash
      npm install @lancedb/vectordb-darwin-arm64
      ```
    - Verify the installation (optional); you should see a version number:
      ```bash
      npm list @lancedb/vectordb-darwin-arm64
      ```
    - Restart VS Code so the extension reloads with the new dependency.
- Project prompts (optional):
- Add repo-specific instructions (framework, language, style) in Continue’s “System Prompt” to improve consistency
- Completions settings:
- Turn on “Inline Autocomplete”
- Set “Autocomplete provider” to the same Ollama model
- Test in an editor file:
- Type a function signature; pause a moment for suggestions
- Use Tab to accept or keep typing to refine
- To generate a Git commit message, use the extension's @Git Diff command to send the git diff to your local LLM:
@Git Diff Write a simple commit message summarizing the modifications
- CodeGPT or AICode: Extensions that support local endpoints via OpenAI-compatible APIs.
- LM Studio: If you prefer a GUI runner; it exposes a local API similar to Ollama. Configure the extension to point at its endpoint.
By default, Ollama can run a background server. We’ll disable autostart and launch it only when you need it.
- Stop and disable the service:
brew services stop ollama- This prevents it from starting at login. You’ll run it manually instead.
- Manual launch:
ollama serve- Keep this terminal open while using VS Code. Close it to stop the server and free memory.
- Quick start alias (optional):
  - Add to your shell config (`~/.zshrc`):
    ```bash
    alias aiup='nohup ollama serve >/tmp/ollama.out 2>&1 &'
    alias aidown="pkill -f 'ollama serve'"
    ```
  - Now run `aiup` to start in the background; `aidown` to stop.
- Create `.vscode/tasks.json`:

  ```json
  {
    "version": "2.0.0",
    "tasks": [
      {
        "label": "Start Ollama (on-demand)",
        "type": "shell",
        "command": "ollama serve",
        "isBackground": true,
        "problemMatcher": []
      },
      {
        "label": "Stop Ollama",
        "type": "shell",
        "command": "pkill -f 'ollama serve'"
      }
    ]
  }
  ```

- Usage:
- Run the “Start Ollama (on-demand)” task when you begin coding
- Run “Stop Ollama” when you finish to release memory
- Pause inline completions:
- In Continue settings, toggle autocomplete off when you don’t need it
- Model selection per workspace:
- Use workspace settings to avoid loading heavy models in large projects
- Rate limits (optional):
- Configure lower token limits and disable streaming when idle
- Model size: Prefer 7B with Q4_K_M or Q5_K_M quantization for balance of speed and quality.
- Single model loaded: Avoid switching models mid-session; each swap reloads weights.
- No Docker for the server: Native Ollama uses Metal acceleration and lower overhead than containers.
- Context discipline: Select only relevant files or snippets in Continue chat for accurate edits.
- System prompt: Add language/framework preferences, code style rules, linting conventions.
- Templates: Save recurring tasks (e.g., “write unit test using pytest with fixtures”) as prompts.
- Model cleanup: `ollama list` to view; `ollama rm <model>` to remove unused models.
- Cache control: Clear extension caches occasionally; archive old projects.
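The "System prompt" advice above can live directly in Continue's `config.yaml`. A hedged sketch using Continue's `rules` section; the rule text itself is a made-up example to adapt to your own stack:

```yaml
rules:
  - Use TypeScript strict mode; prefer async/await over raw promise chains.
  - Follow the repo ESLint config; never commit console.log statements.
  - New Python modules get pytest tests with fixtures, not unittest classes.
```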
- When using Ollama, the LLM consumes your RAM as soon as it is first prompted and stays resident for a 5-minute "Keep-Alive" window before unloading.
- When it runs: It starts loading when you send a prompt; it stays in RAM even when idle to ensure the next response is instant.
- The 16GB Strategy: To prevent the iOS Simulator from lagging, rely on the default timeout or manually stop the model to reclaim RAM for your build processes.
- To remove a model from memory, use the unload commands:
  - `ollama ps`: shows the exact name of the model currently sitting in RAM.
  - `ollama stop <model_name>`: immediately ejects that model from your 16GB memory pool.
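Besides `ollama stop`, the keep-alive window is also controllable per request via the API's `keep_alive` field (`0` unloads immediately after the response; `-1` keeps the model resident). A sketch of building such a request body; the payload shape follows Ollama's `/api/generate` API, but double-check against your installed version:

```python
import json

def generate_body(model: str, prompt: str, keep_alive="5m") -> str:
    """JSON body for Ollama's /api/generate.
    keep_alive: a duration string like '5m', 0 to unload immediately, -1 to pin."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
    })

# Unload right after answering, freeing RAM for the iOS Simulator:
body = generate_body("qwen2.5-coder:7b-instruct-q5_K_M",
                     "Write a haiku about memory.", keep_alive=0)
# Send it with: curl http://localhost:11434/api/generate -d "$body"
print(body)
```

Setting `keep_alive=0` on your last request of a session saves you from waiting out the five-minute default timeout.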
- Install Homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install Ollama:
brew install ollama
- Install VS Code and Continue extension:
- VS Code Extensions → search “Continue” → Install
- Pull a 7B code model:
  - Launch the server, then pull:
    ```bash
    ollama serve
    ollama pull qwen2.5-coder:7b-instruct-q5_K_M
    ```
  - Or `deepseek-r1:8b-0528-qwen3-q4_K_M`, `qwen2.5-coder:7b`, `starcoder2:7b`, `codellama:7b`
- Disable autostart and use on-demand:
brew services stop ollama
- Configure Continue:
  - Provider: Ollama
  - Endpoint: `http://localhost:11434`
  - Model: `qwen2.5-coder:7b-instruct-q5_K_M`
  - Reranker: `dengcao/qwen3-reranker-0.6b:q8_0`
  - Rephraser: `llama3.2:3b-instruct-q5_K_M`
- Enable inline completions:
- Continue settings → Autocomplete on
- Start/stop from VS Code tasks (optional):
  - Add the `.vscode/tasks.json` from above
- Tune prompts and context:
  - Add project style and testing conventions
- Cleanup as needed:
  - `ollama rm <model>` to free disk space
- Model loads but completions are slow:
  - Quantization: Use a Q4_K_M variant; reduce max tokens in Continue settings.
  - Concurrency: Keep one VS Code window open; too many requests can bog down the server.
- VS Code can't connect:
  - Server running: Ensure `ollama serve` is active and listening at `http://localhost:11434`.
  - Endpoint mismatch: Verify Continue's endpoint and model name match what you pulled.
- High memory usage:
  - Stop server: Close the `ollama serve` terminal or run `pkill -f 'ollama serve'`.
  - Smaller model: Try a 3–4B model if you need extreme lightness (quality trade-off).
- Install Python 3.11 via Homebrew:
  ```bash
  brew install python@3.11
  ```
- Then confirm the version:
  ```bash
  python3.11 --version
  ```
- pipx installs Python applications in isolated environments and makes them available globally:
  ```bash
  brew install pipx
  pipx ensurepath
  source ~/.zshrc
  pipx install --python python3.11 open-webui
  open-webui serve  # Starts the Open WebUI server at http://localhost:8080
  ```
By default, when you run `open-webui serve`, it binds to localhost, which means only your own machine can access it. To expose it to other devices on the same network (LAN), you need to make the server listen on your machine's IP address instead of just 127.0.0.1.
Here’s how you can do it on macOS 15 (ARM):
- Find your local IP address:
  ```bash
  ifconfig | grep inet
  ```
  Look for something like `192.168.x.x` under your active network interface (usually `en0` for Wi-Fi):
  ```
  inet6 ::1 prefixlen 128
  inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
  # Wi-Fi
  inet6 fe80::1c96:7dfa:4fe7:4d81%en0 prefixlen 64 secured scopeid 0x7
  inet 192.168.1.102 netmask 0xffffff00 broadcast 192.168.1.255
  inet6 fe80::4aaf:7981:a868:f74a%utun0 prefixlen 64 scopeid 0x11
  inet6 fe80::144d:e5ec:836:46bf%en1 prefixlen 64 secured scopeid 0xf
  inet 192.168.1.104 netmask 0xffffff00 broadcast 192.168.1.255
  ```
- Run Open WebUI bound to your LAN address. Instead of:
  ```bash
  open-webui serve
  ```
  use:
  ```bash
  open-webui serve --host 192.168.1.102 --port 8080
  ```
  `--host 192.168.1.102` binds the server to your LAN address so other devices can reach it (use `--host 0.0.0.0` to listen on all interfaces); `--port 8080` keeps the same port.
- Access from another device: on another device connected to the same Wi-Fi/LAN, open `http://192.168.1.102:8080`.
- (Optional) Allow firewall access:
- macOS may prompt you to allow incoming connections for Python/Open WebUI.
- If blocked, go to System Settings → Network → Firewall → Options and allow incoming connections for the app.
Since LangChain is a library (not a CLI app), pipx normally isn't the right tool. But if you still want to install it with pipx, you can include dependencies so the package is available in its environment:

```bash
pipx install langchain --include-deps
```

This will place LangChain and all its dependencies into a pipx-managed virtual environment, even though you won't get a command-line executable out of it. Install the Ollama and community integrations into that environment:

```bash
pipx runpip langchain install langchain-ollama langchain-community
# To remove later: pipx uninstall langchain
```

After installing, you can check with:

```bash
pipx list
# Then confirm it's there:
pipx runpip langchain show langchain-ollama langchain-community
```

Alternatively, create a new virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate  # On macOS/Linux
# or .venv\Scripts\activate on Windows
```

Install LangChain + the Ollama integration:

```bash
pip install langchain langchain-ollama langchain-community
```

After installing, you can check with:
```bash
pip list
# Then confirm it's there:
pip show langchain langchain-ollama langchain-community
```

Quick run test:

```python
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="qwen2.5-coder:7b-instruct-q5_K_M")
response = llm.invoke("Explain quantum computing in simple terms.")
print(response)
```

GitHub Copilot can also use local models for Chat and Code Editing without relying on third-party "bridge" extensions like AI Toolkit or Continue. This feature is currently available in VS Code (including Insiders) using the "Bring Your Own Model" (BYOM) functionality built into the GitHub Copilot extension.
- VS Code (Version 1.94 or later recommended).
- GitHub Copilot & GitHub Copilot Chat extensions installed.
- Ollama installed and running on your machine.
- An active GitHub Copilot subscription (required to initialize the Copilot interface).
Ensure your desired model is downloaded and the Ollama server is running.
- Open your terminal.
- Pull a coding model (e.g., Qwen2.5-Coder):
  ```bash
  ollama pull qwen2.5-coder:7b-instruct-q5_K_M
  ```
- Ollama typically runs on `http://localhost:11434` by default.
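Because Copilot talks to Ollama through the OpenAI-compatible `/v1/chat/completions` route, the request body uses the OpenAI `messages` format rather than Ollama's native `prompt` field. A minimal sketch of that shape (the system/user text is illustrative):

```python
import json

def chat_completion_body(model: str, user_text: str) -> str:
    """OpenAI-style chat body, as accepted by Ollama's /v1/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": user_text},
        ],
    })

body = chat_completion_body("qwen2.5-coder:7b-instruct-q5_K_M",
                            "Write a TypeScript function that adds two numbers.")
# POST this body to http://localhost:11434/v1/chat/completions to test the route.
print(body)
```

If a manual POST of this shape works, Copilot's custom-model configuration below should connect too.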
You need to tell GitHub Copilot where to find your local Ollama endpoint.
- Open VS Code Settings (`Ctrl + ,` or `Cmd + ,`).
- Search for `github.copilot.chat.customOAIModels`.
- Click Edit in settings.json and add your Ollama configuration. Since Ollama is OpenAI-compatible, you can add it like this:
"telemetry.telemetryLevel": "off",
"github.copilot.chat.codesearch.enabled": true,
"github.copilot.chat.customOAIModels": {
"qwen2.5-coder:7b-instruct-q5_K_M": {
"name": "Ollama Qwen 7B",
"url": "http://localhost:11434/v1/chat/completions",
"toolCalling": true,
"vision": false,
"maxInputTokens": 32768,
"maxOutputTokens": 4096,
"requiresAPIKey": false
},
"deepseek-r1:8b-0528-qwen3-q4_K_M": {
"name": "Ollama DeepSeek R1",
"url": "http://localhost:11434/v1/chat/completions",
"toolCalling": false,
"vision": false,
"maxInputTokens": 16384,
"maxOutputTokens": 8192,
"requiresAPIKey": false
},
"llama3.2:3b-instruct-q5_K_M": {
"name": "Ollama Llama 3.2 (Rephrasing)",
"url": "http://localhost:11434/v1/chat/completions",
"toolCalling": true,
"vision": false,
"maxInputTokens": 131072,
"maxOutputTokens": 4096,
"requiresAPIKey": false
}
}

Note: The `/v1` suffix is crucial, as it tells Copilot to use the OpenAI-compatible API layer provided by Ollama.
Once configured, you can switch to your local model directly within the Copilot interface.
- Open the GitHub Copilot Chat panel (`Ctrl + Alt + I` or `Cmd + Shift + L`).
- Look for the Model Picker (the dropdown menu at the top of the chat input field, usually showing "GPT-4o").
- Select Manage Models.
- In the list, you should now see your local Ollama model (e.g., `qwen2.5-coder:7b-instruct-q5_K_M`). Check the box to enable it.
- Go back to the chat window and select your model from the dropdown.
You can also use your local model for "Inline Chat" (refactoring and generating code in the editor).
- Open the Chat menu in the VS Code title bar.
- Select Configure Inline Suggestions.
- Choose Change Completions Model and select your Ollama model.
- CORS Issues: If VS Code cannot connect, you may need to set an environment variable so Ollama allows requests from the VS Code extension. Set `OLLAMA_ORIGINS="vscode-webview://*"` and restart Ollama.
- Missing Model: If the model doesn't appear, ensure the model name in your `settings.json` matches exactly what you see when you run `ollama list`.
- Internet Connection: Even when using a local model, GitHub Copilot requires an active internet connection to verify your subscription license.
- Prompt presets: Create a small library of task-specific prompts for refactors, tests, docs, and debugging.
- Workspace configs: Maintain separate Continue configs per repo (backend vs frontend).
- Evaluate models: Try two models and keep the one that best fits your language stack and coding style.
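The "prompt presets" idea can be captured directly in Continue's config rather than kept in a scratch file. A hedged sketch using Continue's `prompts` section (names and wording are examples to adapt):

```yaml
prompts:
  - name: unit-test
    description: Write pytest tests for the selected code
    prompt: |
      Write unit tests for the selected code using pytest with fixtures.
      Cover the happy path, one edge case, and one failure case.
  - name: docstring
    description: Add docstrings
    prompt: |
      Add Google-style docstrings to the selected functions. Do not change logic.
```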