@tshego3
Last active January 15, 2026 14:13

Local AI coding model on macOS with VS Code

If you want a capable AI coding assistant that respects your Mac Mini’s limits, you can run a small, quantized open‑source model locally and wire it up to VS Code. The flow below keeps things beginner‑friendly, fast, and on‑demand.

  • VRAM (Video RAM): On a discrete GPU this is memory built into the graphics card — the "workspace" for the AI. Apple Silicon instead uses unified memory, so the GPU shares the Mac Mini's 16GB of RAM with everything else.
  • The Rule: If the model's size is larger than the memory available to the GPU, it will either run extremely slowly or not at all.

Model selection for a Mac Mini M4 (16GB RAM)

Start by picking a 7–8B parameter code model in a quantized format (e.g., Q4_K_M or Q5_K_M). These fit in memory and remain responsive on Apple Silicon.

Recommended models (local-friendly, quantized)

  • DeepSeek‑Coder (6.7B): very strong at structured code generation, unit tests, and multi‑file context; good with TypeScript. License: DeepSeek License & MIT. Less tuned for natural‑language explanations; best for project‑level code generation and refactoring.
  • DeepSeek‑R1 (7B): good at debugging, algorithm design, and math‑heavy code. License: MIT. Not trained primarily on code corpora; best for problem solving and debugging logic.
  • Qwen2.5‑Coder (7B): Python/JS/TypeScript. License: Apache 2.0. Solid structured code and doc reasoning.
  • Llama 3.2 (3B): rephrasing, light chat, and fast reasoning. License: Llama 3.2 License. Excellent for query rephrasing and fast context processing.
  • StarCoder2 (7B): code completion. License: BigCode OpenRAIL. Good for autocomplete and snippets.
  • Code Llama (7B): C/C++/Python. License: Llama 2 Community License. Older but stable for classic tasks.

Sources: these are widely supported in local runtimes and available in quantized GGUF or via Ollama model registries.

Why 7–8B and quantized

  • Memory fit: Q4/Q5 quantized 7B models typically occupy ~4–6GB of unified memory, leaving room for the OS and tools.
  • Speed: Apple Silicon accelerates integer math well; quantized models run fast enough for interactive coding.
  • Quality: These models are tuned for code, not general chat, so they autocomplete and refactor more reliably.
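The memory-fit claim above can be sanity-checked with simple arithmetic. A rough sketch, assuming Q4_K_M averages about 4.5 bits per weight and Q5_K_M about 5.5 (approximate averages, not exact GGUF sizes), plus ~20% overhead for the KV cache and runtime buffers:

```python
# Back-of-envelope memory estimate for a quantized model.
# Assumptions: ~4.5 bits/weight for Q4_K_M, ~5.5 for Q5_K_M
# (approximate, not exact GGUF sizes); overhead covers the KV
# cache and runtime buffers at modest context lengths.

def estimated_gb(params_billion: float, bits_per_weight: float,
                 overhead: float = 1.2) -> float:
    """Approximate resident size in GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)

print(estimated_gb(7, 4.5))   # 7B at Q4_K_M, roughly 4.7 GB
print(estimated_gb(7, 5.5))   # 7B at Q5_K_M, roughly 5.8 GB
```

Actual sizes vary with context length and runtime, so treat these as ballpark figures rather than guarantees.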

macOS setup and dependencies

Use Ollama for the simplest local install and model management on Apple Silicon. It bundles Metal acceleration and an HTTP API VS Code can call.

Install core tools

  • Homebrew:
    • Install or update Homebrew:
      • /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    • Verify:
      • brew doctor
  • Ollama:
    • Install:
      • brew install ollama
    • Confirm:
      • ollama --version
  • VS Code:
    • Download from code.visualstudio.com and install
    • Verify:
      • code --version (optional, if you installed the command-line launcher)
  • Server running:
    • Ensure ollama serve is active and listening at http://localhost:11434.

Pull a model (quantized)

  • Choose one (the example set below uses Qwen2.5‑Coder plus embedding and reranking helpers):
    • ollama pull qwen2.5-coder:7b-instruct-q5_K_M
    • ollama pull nomic-embed-text:latest
    • ollama pull qwen3-embedding:0.6b
    • ollama pull dengcao/qwen3-reranker-0.6b:q8_0
    • ollama pull llama3.2:3b-instruct-q5_K_M
  • Alternative options:
    • ollama pull deepseek-r1:8b-0528-qwen3-q4_K_M
    • ollama pull deepseek-coder:6.7b
    • ollama pull starcoder2:7b
    • ollama pull codellama:7b
  • To remove the model from your system using Ollama:
    • ollama rm deepseek-r1:8b

Verify the model

  • Quick run test:
    • Inspect the model's details (context length, parameters, quantization): ollama show <model_name>
    • ollama run qwen2.5-coder:7b-instruct-q5_K_M
    • Ask: “Write a Python function to validate an email address.” Exit with /bye, Ctrl+D, or Ctrl+C.
    • Test your Ollama server directly via HTTP requests to http://localhost:11434/. Here’s a simple example using curl in the terminal:
      curl http://localhost:11434/api/generate \
      -d '{
        "model": "qwen2.5-coder:7b-instruct-q5_K_M",
        "prompt": "Write a TypeScript function that adds two numbers."
      }'

Tip: Append a quantization tag if offered (e.g., :q4_K_M). Many Ollama registries default to efficient quantizations.
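The curl test above can also be scripted. A minimal sketch using only the Python standard library; build_request is a helper defined here (not part of any Ollama SDK), and "stream": False asks Ollama to return a single JSON object instead of a stream of chunks:

```python
# Minimal Python equivalent of the curl test above, standard
# library only. build_request is a local helper, not an SDK call.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str,
                  model: str = "qwen2.5-coder:7b-instruct-q5_K_M"):
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"})

RUN_LIVE = False  # set True while `ollama serve` is running
if RUN_LIVE:
    req = build_request("Write a TypeScript function that adds two numbers.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```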


Integrate with VS Code

Use the Continue extension (open‑source) to connect VS Code to your local Ollama server. It provides inline completions, chat, and a “fix/modify” workflow.

Install Continue

  • Get the extension: VS Code Extensions → search “Continue” → Install
  • Open the Continue sidebar: Click the Continue icon on the Activity Bar

Configure Continue to use Ollama

  • Create or edit config:
    • In the Continue sidebar, open Settings
    • Set Provider to “Ollama”
    • Set Model to the one you pulled (e.g., qwen2.5-coder:7b-instruct-q5_K_M)
    • Ensure Endpoint is http://localhost:11434
    • config:
      name: Local Config
      version: 1.0.0
      schema: v1
      models:
        - name: Qwen2.5 7B (Chat)                       # Current
          provider: ollama
          model: qwen2.5-coder:7b-instruct-q5_K_M
          roles:
            - chat
            - edit
            - apply
        - name: DeepSeek-R1 8B (Reasoning)              # DeepSeek-R1 8B can only be used for plain chat/reasoning in Continue — it does not support tools like @Codebase.
          provider: ollama
          model: deepseek-r1:8b-0528-qwen3-q4_K_M
          roles:
            - chat
            - edit
            - apply
        - name: Qwen2.5 1.5B (Fast)
          provider: ollama
          model: qwen2.5-coder:1.5b-instruct-q5_K_M
          roles:
            - autocomplete
        - name: Nomic Embed Text                        # Current
          provider: ollama
          model: nomic-embed-text:latest
          roles:
            - embed
        - name: Qwen3 Embedding                         
          provider: ollama
          model: qwen3-embedding:0.6b
          roles:
            - embed
        - name: Llama 3.2 3B (Rephrasing)
          provider: ollama
          model: llama3.2:3b-instruct-q5_K_M
          roles:
            - chat
            - edit
            - apply
        - name: Qwen3 Reranker                          # Current
          provider: ollama
          model: dengcao/qwen3-reranker-0.6b:q8_0
          roles:
            - rerank
      context:
        - provider: codebase                            # Deep search across all files
        - provider: code                                # High-level code intelligence
        - provider: tree                                # Visualizes file structure
        - provider: problems                            # Sees VS Code "Problems" tab errors
        - provider: open                                # References open/pinned tabs
          params:
            onlyPinned: true
        - provider: terminal                            # References terminal output
        - provider: repo-map                            # The "skeleton" of your project
          params:
            includeSignatures: false 
        - provider: debugger                            # Local variable state
          params:
            stackDepth: 3
  • Enable Codebase Indexing:
    • In the Continue sidebar, click the gear/settings icon.
    • Find the section labeled Indexing.
    • Toggle the switch to Enable.
    • Rebuild the Index
      • Under @codebase index, look for the option Click to re-index.
    • Troubleshooting:
      • The Continue VS Code extension relies on LanceDB to store and search code embeddings. On macOS ARM64, it needs a native library (@lancedb/vectordb-darwin-arm64). Codebase indexing and deep search features don’t work without this library:
        • Embedding: lance.connect is not a function
        • Full text search: SQLITE_ERROR: no such table: chunks
      • Installing the extension might resolve these issues:
        • Open Terminal on your Mac.
        • Navigate to the Continue extension folder (adjust the version number to match your installed extension):
          cd ~/.vscode/extensions/continue.continue-1.2.11-darwin-arm64
        • Install the missing LanceDB native library:
          npm install @lancedb/vectordb-darwin-arm64
        • Verify installation (optional):
          npm list @lancedb/vectordb-darwin-arm64
          You should see a version number.
        • Restart VS Code so the extension reloads with the new dependency.
  • Project prompts (optional):
    • Add repo-specific instructions (framework, language, style) in Continue’s “System Prompt” to improve consistency
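Before relying on a config like the one above, it helps to confirm that every model it names is actually pulled. A sketch that compares a hand-maintained list against Ollama's /api/tags endpoint (which lists locally installed models); CONFIGURED and missing_models are illustrative names, not part of any API:

```python
# Sanity check: are all models named in your Continue config pulled?
# GET /api/tags lists local models. CONFIGURED and missing_models
# are illustrative names defined here, not part of any API.
import json
import urllib.request

CONFIGURED = [
    "qwen2.5-coder:7b-instruct-q5_K_M",
    "nomic-embed-text:latest",
    "llama3.2:3b-instruct-q5_K_M",
]

def missing_models(configured, installed):
    """Return configured names that have no matching local model."""
    have = set(installed)
    return [m for m in configured if m not in have]

RUN_LIVE = False  # set True while `ollama serve` is running
if RUN_LIVE:
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        installed = [m["name"] for m in json.loads(resp.read())["models"]]
    print(missing_models(CONFIGURED, installed))  # [] means all present
```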

Enable inline completions

  • Completions settings:
    • Turn on “Inline Autocomplete”
    • Set “Autocomplete provider” to the same Ollama model
  • Test in an editor file:
    • Type a function signature; pause a moment for suggestions
    • Use Tab to accept or keep typing to refine
    • To generate a Git commit message, use the extension’s @Git Diff command to send the git diff to your local LLM:
      • @Git Diff Write a simple commit message summarizing the modifications

Alternative integrations

  • CodeGPT or AICode: Extensions that support local endpoints via OpenAI-compatible APIs.
  • LM Studio: If you prefer a GUI runner; it exposes a local API similar to Ollama. Configure the extension to point at its endpoint.

On-demand usage (no continuous resource drain)

By default, Ollama can run a background server. We’ll disable autostart and launch it only when you need it.

Prevent background autostart

  • Stop and disable the service:
    • brew services stop ollama
    • This prevents it from starting at login. You’ll run it manually instead.

Start the server only when needed

  • Manual launch:
    • ollama serve
    • Keep this terminal open while using VS Code. Close it to stop the server and free memory.
  • Quick start alias (optional):
    • Add to your shell config (~/.zshrc):
      • alias aiup='nohup ollama serve >/tmp/ollama.out 2>&1 &'
      • alias aidown="pkill -f 'ollama serve'"
    • Now run aiup to start in the background; aidown to stop.
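When scripting the server lifecycle like this, a quick pre-flight check avoids confusing extension errors. A small sketch using a raw TCP connect, so it needs no extra packages; server_up is a helper defined here, not an Ollama command:

```python
# Pre-flight check before opening VS Code: is `ollama serve`
# actually listening? server_up is a local helper, not a CLI tool.
import socket

def server_up(host: str = "127.0.0.1", port: int = 11434,
              timeout: float = 0.5) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("Ollama is up" if server_up()
      else "Run `ollama serve` (or aiup) first")
```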

VS Code tasks to control the server

  • Create .vscode/tasks.json:
    {
      "version": "2.0.0",
      "tasks": [
        {
          "label": "Start Ollama (on-demand)",
          "type": "shell",
          "command": "ollama serve",
          "isBackground": true,
          "problemMatcher": []
        },
        {
          "label": "Stop Ollama",
          "type": "shell",
          "command": "pkill -f 'ollama serve'"
        }
      ]
    }
  • Usage:
    • Run the “Start Ollama (on-demand)” task when you begin coding
    • Run “Stop Ollama” when you finish to release memory

Extension-side toggles

  • Pause inline completions:
    • In Continue settings, toggle autocomplete off when you don’t need it
  • Model selection per workspace:
    • Use workspace settings to avoid loading heavy models in large projects
  • Rate limits (optional):
    • Configure lower token limits and disable streaming when idle

Workflow best practices on 16GB RAM

Keep it lean

  • Model size: Prefer 7B with Q4_K_M or Q5_K_M quantization for balance of speed and quality.
  • Single model loaded: Avoid switching models mid-session; each swap reloads weights.
  • No Docker for the server: Native Ollama uses Metal acceleration and lower overhead than containers.

Improve responses without heavier models

  • Context discipline: Select only relevant files or snippets in Continue chat for accurate edits.
  • System prompt: Add language/framework preferences, code style rules, linting conventions.
  • Templates: Save recurring tasks (e.g., “write unit test using pytest with fixtures”) as prompts.

Storage hygiene (256GB)

  • Model cleanup: ollama list to view; ollama rm <model> to remove unused models.
  • Cache control: Clear extension caches occasionally; archive old projects.

Memory Management: Balancing Speed and Development Performance

  • When using Ollama, the LLM consumes your RAM as soon as it is first prompted and stays resident for a 5-minute "Keep-Alive" window before unloading.
    • When it runs: It starts loading when you send a prompt; it stays in RAM even when idle to ensure the next response is instant.
    • The 16GB Strategy: To prevent the iOS Simulator from lagging, rely on the default timeout or manually stop the model to reclaim RAM for your build processes.
      • To remove a model from memory in Ollama, use the stop command:
        • ollama ps: shows the exact name of whatever model is currently sitting in your RAM.
        • ollama stop <model_name>: immediately ejects that model from your 16GB memory pool.
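The keep-alive window can also be controlled per request: the Ollama generate API accepts a "keep_alive" field, and 0 tells the server to unload the model as soon as the response finishes. A sketch; unload_after_reply is an illustrative helper name, not part of any SDK:

```python
# Per-request keep-alive: "keep_alive": 0 evicts the model from RAM
# immediately after the reply, instead of the default 5-minute window.
# unload_after_reply is a local helper defined for illustration.
import json
import urllib.request

def unload_after_reply(model: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,   # unload right after responding
    }).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body, headers={"Content-Type": "application/json"})

RUN_LIVE = False  # set True while `ollama serve` is running
if RUN_LIVE:
    req = unload_after_reply("qwen2.5-coder:7b-instruct-q5_K_M", "Say hi")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```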

Full setup recap (step-by-step)

  1. Install Homebrew:
    • /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install Ollama:
    • brew install ollama
  3. Install VS Code and Continue extension:
    • VS Code Extensions → search “Continue” → Install
  4. Pull a 7B code model:
    • Launch ollama serve
    • ollama pull qwen2.5-coder:7b-instruct-q5_K_M
    • Or deepseek-r1:8b-0528-qwen3-q4_K_M, qwen2.5-coder:7b, starcoder2:7b, codellama:7b
  5. Disable autostart and use on-demand:
    • brew services stop ollama
  6. Configure Continue:
    • Provider: Ollama
    • Endpoint: http://localhost:11434
    • Model: qwen2.5-coder:7b-instruct-q5_K_M
    • Reranker: dengcao/qwen3-reranker-0.6b:q8_0
    • Rephraser: llama3.2:3b-instruct-q5_K_M
  7. Enable inline completions:
    • Continue settings → Autocomplete on
  8. Start/stop from VS Code tasks (optional):
    • Add .vscode/tasks.json from above
  9. Tune prompts and context:
    • Add project style and testing conventions
  10. Cleanup as needed:
    • ollama rm <model> to free disk space

Troubleshooting tips

  • Model loads but completions are slow:
    • Quantization: Use Q4_K_M variant; reduce max tokens in Continue settings.
    • Concurrency: Keep one VS Code window open; too many requests can bog down the server.
  • VS Code can’t connect:
    • Server running: Ensure ollama serve is active and listening at http://localhost:11434.
    • Endpoint mismatch: Verify Continue’s endpoint and model name match what you pulled.
  • High memory usage:
    • Stop server: Close the ollama serve terminal or run pkill -f 'ollama serve'.
    • Smaller model: Try a 3–4B model if you need extreme lightness (quality trade‑off).

Open WebUI Installation via Python pip

  • Install Python 3.11
    • brew install python@3.11
  • Then confirm
    • python3.11 --version
  • pipx installs Python applications in isolated environments and makes them available globally
    • brew install pipx
    • pipx ensurepath
    • source ~/.zshrc
    • pipx install --python python3.11 open-webui
    • open-webui serve # This will start the Open WebUI server, which you can access at http://localhost:8080

By default, when you run open-webui serve, it binds to localhost, which means only your own machine can access it. To expose it to other devices on the same network (LAN), you need to make the server listen on your machine’s IP address instead of just 127.0.0.1.

Here’s how you can do it on macOS 15 (ARM):


Expose Open WebUI on LAN

  1. Find your local IP address

    ifconfig | grep inet

    Look for something like 192.168.x.x under your active network interface (usually en0 for Wi-Fi).

    inet6 ::1 prefixlen 128 
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
    # Wi-Fi
    inet6 fe80::1c96:7dfa:4fe7:4d81%en0 prefixlen 64 secured scopeid 0x7
    inet 192.168.1.102 netmask 0xffffff00 broadcast 192.168.1.255
    #
    inet6 fe80::4aaf:7981:a868:f74a%utun0 prefixlen 64 scopeid 0x11
    inet6 fe80::144d:e5ec:836:46bf%en1 prefixlen 64 secured scopeid 0xf
    inet 192.168.1.104 netmask 0xffffff00 broadcast 192.168.1.255
  2. Run Open WebUI bound to all interfaces Instead of:

    open-webui serve

    Use:

    open-webui serve --host 192.168.1.102 --port 8080
    • --host 192.168.1.102 binds the server to your LAN address so other devices on the network can reach it (use --host 0.0.0.0 to listen on all interfaces).
    • --port 8080 keeps the same port.
  3. Access from another device On another device connected to the same Wi-Fi/LAN, open:

    http://192.168.1.102:8080
    
  4. (Optional) Allow firewall access

    • macOS may prompt you to allow incoming connections for Python/Open WebUI.
    • If blocked, go to System Settings → Network → Firewall → Options and allow incoming connections for the app.
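Step 1 (finding the address to pass to --host) can also be scripted. A sketch using the standard-library trick of "connecting" a UDP socket toward a public address to see which local interface the OS would route through; no packets are actually sent, and lan_ip is a helper defined here for illustration:

```python
# Find the LAN address to pass to --host. A UDP connect() sends no
# traffic; it only selects the outgoing interface. lan_ip is a local
# helper, not part of Open WebUI.
import socket

def lan_ip() -> str:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"          # offline fallback
    finally:
        s.close()

print(f"open-webui serve --host {lan_ip()} --port 8080")
```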

Introduction to LangChain-Ollama

Since LangChain is a library (not a CLI app), pipx normally isn’t the right tool. But if you still want to install it with pipx, you can include dependencies so the package is available in your environment:

pipx install langchain --include-deps

This will place LangChain and all its dependencies into a pipx-managed virtual environment, even though you won’t get a command‑line executable out of it. You can then add the Ollama and community integration packages to that environment with:

pipx runpip langchain install langchain-ollama langchain-community
# pipx uninstall langchain

After installing, you can check with:

pipx list
# Then confirm it’s there:
pipx runpip langchain show langchain-ollama langchain-community

OR

Create a new virtual environment

python3 -m venv .venv
source .venv/bin/activate   # On macOS/Linux
# or .venv\Scripts\activate on Windows

Install LangChain + Ollama integration

pip install langchain langchain-ollama langchain-community

After installing, you can check with:

pip list
# Then confirm it’s there:
pip show langchain langchain-ollama langchain-community

Quick run test:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="qwen2.5-coder:7b-instruct-q5_K_M")
response = llm.invoke("Explain quantum computing in simple terms.")
print(response)

Integrating Ollama directly with GitHub Copilot

This lets you use local models for Chat and Code Editing without relying on third-party "bridge" extensions like AI Toolkit or Continue.

This feature is currently available in VS Code (including Insiders) using the "Bring Your Own Model" (BYOM) functionality built into the GitHub Copilot extension.

Prerequisites

  • VS Code (Version 1.94 or later recommended).
  • GitHub Copilot & GitHub Copilot Chat extensions installed.
  • Ollama installed and running on your machine.
  • An active GitHub Copilot subscription (required to initialize the Copilot interface).

Step 1: Prepare Ollama

Ensure your desired model is downloaded and the Ollama server is running.

  1. Open your terminal.
  2. Pull a coding model (e.g., Qwen2.5-Coder):
ollama pull qwen2.5-coder:7b-instruct-q5_K_M
  3. Ollama typically runs on http://localhost:11434 by default.

Step 2: Configure VS Code Settings

You need to tell GitHub Copilot where to find your local Ollama endpoint.

  1. Open VS Code Settings (Ctrl + , or Cmd + ,).
  2. Search for github.copilot.chat.customOAIModels.
  3. Click Edit in settings.json and add your Ollama configuration. Since Ollama is OpenAI-compatible, you can add it like this:
    "telemetry.telemetryLevel": "off",
    "github.copilot.chat.codesearch.enabled": true,
    "github.copilot.chat.customOAIModels": {
        "qwen2.5-coder:7b-instruct-q5_K_M": {
            "name": "Ollama Qwen 7B",
            "url": "http://localhost:11434/v1/chat/completions",
            "toolCalling": true,
            "vision": false,
            "maxInputTokens": 32768,
            "maxOutputTokens": 4096,
            "requiresAPIKey": false
        },
        "deepseek-r1:8b-0528-qwen3-q4_K_M": {
            "name": "Ollama DeepSeek R1",
            "url": "http://localhost:11434/v1/chat/completions",
            "toolCalling": false,
            "vision": false,
            "maxInputTokens": 16384,
            "maxOutputTokens": 8192,
            "requiresAPIKey": false
        },
        "llama3.2:3b-instruct-q5_K_M": {
            "name": "Ollama Llama 3.2 (Rephrasing)",
            "url": "http://localhost:11434/v1/chat/completions",
            "toolCalling": true,
            "vision": false,
            "maxInputTokens": 131072,
            "maxOutputTokens": 4096,
            "requiresAPIKey": false
        }
    }

Note: The /v1 suffix is crucial as it tells Copilot to use the OpenAI-compatible API layer provided by Ollama.
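You can exercise that same /v1 layer directly before pointing Copilot at it. A sketch using only the Python standard library; chat_request is an illustrative helper, and the body follows the OpenAI chat-completions shape ("messages" rather than Ollama's native "prompt" field):

```python
# Direct test of the OpenAI-compatible /v1 layer Copilot will use.
# chat_request is a local helper, not part of any SDK.
import json
import urllib.request

def chat_request(model: str, user_text: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }).encode()
    return urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=body, headers={"Content-Type": "application/json"})

RUN_LIVE = False  # set True while `ollama serve` is running
if RUN_LIVE:
    req = chat_request("qwen2.5-coder:7b-instruct-q5_K_M",
                       "Write a TypeScript function that adds two numbers.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```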

Step 3: Select the Local Model in Chat

Once configured, you can switch to your local model directly within the Copilot interface.

  1. Open the GitHub Copilot Chat panel (Ctrl + Alt + I or Cmd + Shift + L).
  2. Look for the Model Picker (the dropdown menu at the top of the chat input field, usually showing "GPT-4o").
  3. Select Manage Models.
  4. In the list, you should now see your local Ollama model (e.g., qwen2.5-coder:7b-instruct-q5_K_M). Check the box to enable it.
  5. Go back to the chat window and select your model from the dropdown.

Step 4: Using Local Models for Inline Edits

You can also use your local model for "Inline Chat" (refactoring and generating code in the editor).

  1. Open the Chat menu in the VS Code title bar.
  2. Select Configure Inline Suggestions.
  3. Choose Change Completions Model and select your Ollama model.

Troubleshooting Tips

  • CORS Issues: If VS Code cannot connect, you may need to set an environment variable for Ollama to allow requests from the VS Code extension. Set OLLAMA_ORIGINS="vscode-webview://*" and restart Ollama.
  • Missing Model: If the model doesn't appear, ensure the model name in your settings.json matches exactly what you see when you run ollama list.
  • Internet Connection: Even when using a local model, GitHub Copilot requires an active internet connection to verify your subscription license.

Where to go next

  • Prompt presets: Create a small library of task-specific prompts for refactors, tests, docs, and debugging.
  • Workspace configs: Maintain separate Continue configs per repo (backend vs frontend).
  • Evaluate models: Try two models and keep the one that best fits your language stack and coding style.