
How to set up free, local coding AI assistant for VS Code

Offline, private coding with Qwen2.5-Coder (or DeepSeek) and Continue


Introduction

I like coding with AI assistants. They let me focus on what matters: shipping quickly while tuning only the most important details like performance, security, or design.

You can’t use AI code completion and chat at all times, though. You may be on a plane or otherwise offline. Some companies also don’t want their code sent to third parties, so they impose strict policies on AI tool usage. That’s where local, open LLMs come in.

How good can local LLMs be for coding? Good enough to be useful, but definitely not on par with popular offerings like OpenAI or Anthropic models through GitHub Copilot or Cursor. (Even after the DeepSeek release, most people swear by Claude Sonnet.)

I’d compare the completion quality to early versions of Copilot: sometimes wrong, but useful enough overall to keep turned on. Chat quality is OK, but agentic capabilities are only functional on more performant machines.

In this short tutorial, we will:

  1. Install LM Studio and set up our models

  2. Set up Continue, a VS Code and JetBrains IDE extension:

    1. Code completions (inline suggestions)

    2. Chat

  3. Bonus: set up agentic coding with Cline

I’ve tested performance on two machines: an M1 MacBook Air (16GB RAM) and an M1 Max MacBook Pro (64GB RAM).

The tutorial works on macOS, Windows, and Linux, but model performance will vary with your hardware. You can also replicate this setup in JetBrains IDEs.

Setting up LM Studio - local LLMs

In my opinion, LM Studio is the best and easiest to use local LLM UI. It doesn’t have many options other than model inference (like loading documents), but it serves its basic purpose well.

Go ahead and download LM Studio from https://lmstudio.ai, then install it.

Open the app. You will be greeted with a quick-start tutorial. Complete it if you like, or skip it by clicking the button in the upper-right corner.

In the sidebar on the left, click the 🔍 magnifying glass icon. This opens the model discovery view, which lets you find and download new LLMs. Make sure to come back here in the future to test some models on your own.

Let’s find and download two models for different purposes.

  • Qwen2.5 Coder 3B, Q4_K variant - a small model (~2GB) that easily fits into memory. It works great for code completion and won’t slow down your workflow even with only 8GB of RAM. If you don’t have at least 32GB of memory and a fast GPU (e.g., you’re on a non-Pro Apple M chip), stick to this model alone.

  • Qwen/Qwen2.5-Coder-14B-Instruct-GGUF, Q4_K_M variant - a bigger model, better for chat sessions and agentic use. On machines with 16GB of (V)RAM or less, use the 7B version instead, and only for chat - agentic capabilities at these sizes are quite poor.
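Curious why the 3B Q4 download is about 2GB? A back-of-the-envelope estimate (an approximation, not LM Studio’s exact numbers): Q4_K variants average roughly 4.5 bits per weight once quantization scales and metadata are included.

```python
def approx_q4_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough file/memory size of a Q4-quantized model.

    Q4_K quants store ~4.5 bits per weight on average (4-bit weights
    plus per-block scales). This is a ballpark figure, not exact.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 3B model lands near the ~2GB figure above; 14B needs ~8GB.
print(f"3B  = ~{approx_q4_size_gb(3):.1f} GB")
print(f"14B = ~{approx_q4_size_gb(14):.1f} GB")
```

Add a couple of gigabytes on top for the context window (KV cache) when sizing your RAM budget.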

If you have more RAM and more compute, play around with some thinking models. I didn’t choose a thinking one for this tutorial, because they are usually not fast enough for use while coding. They are useful for tasks that are not time-sensitive though!

Generally, the recommended models shown at the top of the list in LM Studio are good. Frustrated with the number of options? Experiment! Things change every week, so there’s no single, easy recommendation.

Close the “Discover” window by clicking on the X.

Now let’s enable our local LLM server so that our coding assistant can use our models. Go to the “Developer” section (terminal icon) and flip the “Status: stopped” switch so it changes to “running”.

Then click “Select model to load” at the top of the window and load our 3B model with default settings. For the 7B/14B models, I recommend setting a context length of 20,000 tokens if you want to use coding agents. This costs RAM, though; adjust to your needs.
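You can sanity-check that the server is up from outside your editor. LM Studio exposes an OpenAI-compatible HTTP API; a minimal sketch (assuming the default port 1234) that lists the currently available models:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

def models_endpoint(base_url: str = BASE_URL) -> str:
    """Build the OpenAI-compatible model-listing URL."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str = BASE_URL) -> list[str]:
    """Return the ids of models LM Studio currently serves."""
    with urlopen(models_endpoint(base_url), timeout=5) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    try:
        print("Available models:", list_models())
    except URLError:
        print("Server not reachable - is the Developer switch set to 'running'?")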

Setting up Continue - open source Copilot alternative

Start by installing the extension: search for “Continue” in the VS Code extensions view or go to continue.dev. In VS Code, click install. A new icon appears in your left sidebar; click it.

You will see a text box and a quickstart prompt - ignore it for now.

Before using it, let’s turn off GitHub Copilot completions to avoid conflicts. At the top of your screen, near the search bar, find the Copilot icon, click it, and choose “Configure Completions”, then “Disable Completions” from the menu. You can re-enable them the same way afterwards.

Configuring chat

Let’s go back to the Continue sidebar. Click the model selector (showing Claude 3.5 Sonnet by default at the time of writing), then click “+ Add Chat model”.

In the popup window select:

  • Provider: LM Studio

  • Model: you can leave autodetect and select it later

  • Click “Connect”

Now your model selector will show our models. Let’s select one of the installed models and say hi!

You can do all the usual stuff you can do with other coding chats.
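Under the hood, Continue talks to LM Studio’s OpenAI-compatible endpoint, so you can also script against the same server directly. A minimal sketch, assuming the default port and a model named as shown in LM Studio’s “Developer” tab:

```python
import json
from urllib.request import Request, urlopen
from urllib.error import URLError

BASE_URL = "http://localhost:1234/v1"  # LM Studio default; adjust if changed

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(model: str, prompt: str) -> str:
    """Send one chat turn to the local server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("qwen2.5-coder-14b-instruct", "Say hi in one sentence."))
    except URLError:
        print("Server not reachable - check LM Studio's Developer tab.")
```

The same payload shape works for any model LM Studio has loaded; only the `"model"` string changes.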

Configuring completions

To configure code completion, click “Continue” on the right side of the status bar and select “Configure autocomplete options” from the menu. You will see a JSON configuration file. Add this as a property of the main object:

  "tabAutocompleteModel": {
    "apiBase": "http://localhost:1234/v1/",
    "title": "Qwen2.5 Coder 3B",
    "provider": "lmstudio",
    "model": "qwen2.5-coder-3b-instruct"
  },

If you want to use another model, just copy its name from the “Developer” section of LM Studio.

Hit save, open one of your code files, start typing, and enjoy completions!

(Optional) Agentic coding with Cline

If you’re into agentic coding (and you should be) and have a strong enough machine, let’s explore how to get these capabilities using local models. I’d recommend doing this only if you can run at least a 14B model. Cursor and Copilot Edits with bigger models perform much better, but hey, we’re running local models here! (A trade-off I hope disappears in the near future.)

  1. Install Cline.

  2. After installing, find its icon in the sidebar.

  3. You’ll be greeted with a choice: “Get Started for Free” or “Use your own API key”. Select the API key.

  4. Select API Provider - LM Studio.

  5. Pick the desired model from the list (here, qwen2.5-coder-14b-instruct).

  6. Click “Let’s go!”

  7. Give the agent a task!

In my experience, generations are a bit slow (one reason being Cline’s massive system prompt). That said, it is capable of producing useful code and addressing the given tasks well.

Summary

As you can see, local coding assistants have come a long way, making it possible to work with AI even when offline or restricted by company policies. While they may not match the capabilities of cloud-based solutions like GitHub Copilot or Cursor, they offer a practical alternative for those of us who need strict privacy or offline functionality.

The setup works well even on computers with limited resources, though having more RAM and a beefy GPU gives you the flexibility to use larger, more capable models. It might not replace your primary AI coding assistant, but it's definitely worth having as a backup or alternative when privacy and offline access are priorities.

Let me know if you run into any issues or if you have some cool experiences to share!

Check out my other posts related to LLMs.

If you liked this content, please support me by sharing this post and subscribing to my Newsletter.


Large Language Models

Part 1 of 7

In this blog series, we explore the transformative potential of artificial intelligence tools and models in software development and other fields via tutorials and case studies.
