• 15 Posts
  • 74 Comments
Joined 4 years ago
cake
Cake day: November 25th, 2022

help-circle





  • robber@lemmy.mltoSelfhosted@lemmy.worldHardware for local inference?
    link
    fedilink
    English
    arrow-up
    7
    ·
    edit-2
    14 days ago

    To add some practical advice:

    It depends on what you mean by more advanced models. I run Qwen3.6-27b on 48GB VRAM across 3 cards (RTX 2000e Ada), and with the recent software optimizations merged into llama.cpp (tensor parallelism & MTP) I get around 30 tokens per second in generation. I use the model through openwebui for (agentic) web research and simple Q&A mostly and I’m quite happy with what it can do.

    If you want something similar, maybe look at one or two second hand V100 PCIE 32GB. Or something from the Intel Arc Pro series, if you don’t mind the software support lacking behind a bit (as in less optimized).

    Also it might be worth reading into the difference of dense vs MoE models, if you’re new to that. For MoE models, if your system RAM is fast enough, it’s often viable to offload the “experts” (largest parts of such models) to RAM, reducing VRAM capacity needs. Note that server motherboards with e.g. octa-channel RAM have a huge advantage over consumer boards (making DDR4 interesting despite slower speed per module).

    And to adress your last question, while I have no direct experience, I’ve seen posts online about people connecting Strix Halo or DGX Spark devices, but usually via a 10+Gbit/s switch as interconnect is crucial (except if you just want to load balance).

    Self-hosting LLMs is a very fun thing to do, but also a time- and money-consuming rabbit hole. You might wanna check out the LocalLlama community over at shitjustworks.

    Edit: typos




  • A lot has been said, but to add to the list I’d say it gives them access to quite a large pool of free testers.

    LLM architectures and optimization techniques change rapidly and by releasing open-weight models a lot of enthusiasts will evaluate new models for free, help implement support in inference engines, catch bugs etc. (and in turn, ofc, get a new model to run for free, so it’s at least somewhat symbiotic).

    We have at least seen this quite obviously when Alibaba released Qwen3-Next, which was a somewhat undertrained but still useful model which introduced the architecture that their latest models now use “in production” (also their paid “Max” models).




  • I don’t follow the discussions on this topic very closely, but as I understood, there are different ways to achieve the goal, but all impact quality to some extent. Heretic is discussed as one one of the SOTA methods. The README posted above states the following, so it seems that heretic is some sort of next gen abliteration.

    It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.















  • It really depends on how much you enjoy to set things up for yourself and how much it hurts you to give up control over your data with managed solutions.

    If you want to do it yourself, I recommend taking a look at ZFS and its RAIDZ configurations, snapshots and replication capabilities. It’s probably the most solid setup you will achieve, but possibly also a bit complicated to wrap your head around at first.

    But there are a ton of options as beautifully represented by all the comments.