Using Intel Arc for Devtools

Category: Devtools

Background

I wanted to share some of the ways I’ve been personally using the Intel Arc A770 for AI, now that the first Battlemage cards are being released. I have not been doing much machine learning training, but I have been using the card's inference capacity for a coding assistant.

The software stack I use to make this happen consists of:

  1. the Continue plugin
  2. the Ollama wrapper around Llama.cpp
  3. the ipex-llm Intel fork/patches on top of Ollama

Usage

I haven't used Copilot, Cursor, Cody, Supermaven, GitLab Duo, JetBrains AI, Tabnine, or Codium, so perhaps some of those tools are much better. But I like using a chat agent from time to time to “spar” about a new feature or to get a first implementation, the way Stack Overflow did in the past. Often there are errors, and after programming for over 15 years I have a certain taste for what I like and don't like, but in general it's quite useful. I use it mostly in Java and sometimes in Python or Golang; a typed, widely used programming language seems to help [1]. I don't think it makes me a better programmer, and the cases where it really sped up my development were the ones where it knew how to use APIs with missing documentation. But this is only usable in places where the stakes are low, and as a developer you have to expect a higher rate of mistakes (like copy-pasting something from Stack Overflow that you cannot double-check). Still, for me there is a small but noticeable advantage to using it.

Using Continue

Continue is an open source plugin I use for asking questions about my codebase while developing inside IntelliJ or VSCode. With no effort I can add context for chatting, or ask it to generate new code within a file. Sadly, it often crashes, and functionality breaks from time to time. The plugin regularly makes the editor hang, and it forgets a chat session only to (sometimes) remember its state again later. All in all, the plugin is still quite unstable. But when it works, it works quite nicely. Plus, I like that it isn't a completely new editor built just for this feature. Being compatible with multiple editors while keeping the same config is also very nice. It has some basic RAG functionality, and just asking questions with @codebase and seeing which files it uses to answer feels like magic. I think it's simply hard to program against these IDE/editor plugin APIs, especially for this new type of functionality that did not exist a few years ago. Sadly, right now there is still a lot of room for improvement.

On general Ollama use

Ollama wraps Llama.cpp as an inference engine and makes downloading new models very easy. It has a built-in list of anointed models with configuration attached to each (context size and such), which makes using them painless. The Llama.cpp main branch is updated constantly and builds break often, but the Ollama team updates its dependency only once every few months and keeps a few patches of its own. Together this makes using Llama.cpp through Ollama more stable for me than building it directly.
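
To give a sense of how little glue this needs, here is a minimal sketch of talking to a locally served model from Python (the client package is called ollama on PyPI, as far as I know). It assumes Ollama is running on its default localhost:11434, and the model name is just a placeholder for whatever you have pulled:

    # Minimal sketch: chat with a locally served model through Ollama.
    # Assumes `ollama serve` is running on the default localhost:11434 and
    # that the model below (a placeholder) has already been pulled.
    import ollama

    reply = ollama.chat(
        model="qwen2.5-coder:7b",  # placeholder: any pulled model works
        messages=[{"role": "user", "content": "Write a Java method that reverses a string."}],
    )
    print(reply["message"]["content"])

This is essentially what Continue does under the hood when you point it at an Ollama provider, just over the local HTTP API instead of the Python client.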

Intel Arc-specific Ollama

There are two ways to use Ollama in combination with Intel Arc:

First there is Ollama's own SYCL support, which is not really finished. It writes logs saying that it is using the SYCL backend and that the model fits in GPU memory, but metrics show the model running on the CPU rather than the GPU [2].

Then there is the ipex-llm version of Ollama. At the moment it seems to only officially support Ubuntu and Red Hat Linux, and only older versions of OneAPI and Ollama. This means that the latest models do not work, and I have to look carefully at which versions of OneAPI I keep installed in Ubuntu. But when it runs, it runs fast, and I can use not-so-small 14B models if I push it. Sadly that means an older version of Ubuntu with older packages, so I'm not a fan of the rest of this setup. I've also tried installing OneAPI on my Arch Linux machine using the offline installer, together with the Level Zero compute runtime from the AUR packages. This combination sometimes breaks when used with ipex-llm, but has the advantage that the rest of the OS's packages are more up to date. My hope is that things like this stabilize and that more of ipex-llm is upstreamed where possible (if the stack is truly open).
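
Whichever route I take, I like to sanity-check that inference actually landed on the Arc card and did not silently fall back to the CPU (see footnote 2). Watching a GPU usage monitor is one way; another is just looking at raw generation throughput. Here is a rough sketch against Ollama's HTTP API, assuming the default port and the timing fields it normally reports (the model name is again only an example):

    # Rough throughput check against a local Ollama server (default port assumed).
    # eval_count / eval_duration are the generation stats Ollama reports back;
    # durations are in nanoseconds.
    import json
    import urllib.request

    payload = json.dumps({
        "model": "qwen2.5-coder:7b",  # placeholder: any model you have pulled
        "prompt": "Explain the difference between an ArrayList and a LinkedList.",
        "stream": False,
    }).encode()

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)

    tokens = stats.get("eval_count", 0)
    seconds = stats.get("eval_duration", 1) / 1e9  # nanoseconds -> seconds
    print(f"{tokens} generated tokens at {tokens / seconds:.1f} tokens/s")

If a 7B or 14B model only manages a couple of tokens per second here, that is usually a strong hint that the SYCL/Level Zero setup is not actually being used.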


  1. Asking the model to produce some Ponylang or Prolog also seems to yield usable code, but not as reliably.

  2. It could be using the integrated GPU, but I only ever see activity in the CPU usage monitor and never in the GPU usage monitor. The inference speed also points strongly to CPU-only inference at the moment.