robber@lemmy.ml to LocalLLaMA@sh.itjust.works • Your best local LLM for low-VRAM (6GB)? • English
2 · 13 days ago
Late to the party, but this was just released: LiquidAI/LFM2.5-8B-A1B-GGUF
I guess you could fit it fully in VRAM at Q4 with a little context if you need all the speed you can get, or offload the experts to RAM if you prefer higher quality and/or more context.
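For a rough sanity check on the "fits at Q4" guess, here is a back-of-the-envelope sketch. It assumes about 8B total parameters (going by the model name) and roughly 4.5 bits per weight for a Q4-style GGUF quant. Both numbers are assumptions, not measured file sizes, and the function names are made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def headroom_gib(vram_gib: float, n_params: float, bits_per_weight: float) -> float:
    """VRAM left for KV cache, compute buffers and everything else."""
    return vram_gib - weight_size_gib(n_params, bits_per_weight)


if __name__ == "__main__":
    # Assumed values: 8e9 parameters, 4.5 bits per weight, 6 GiB card.
    weights = weight_size_gib(8e9, 4.5)
    spare = headroom_gib(6.0, 8e9, 4.5)
    print(f"weights: {weights:.2f} GiB, headroom: {spare:.2f} GiB")
```

Whatever is left as headroom has to cover the KV cache and runtime buffers, which is why only "a little context" fits. The real quant file may well be larger than this estimate.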
Your biggest issue with 2010 cards will be software (inference engine) support, I assume.
To add some practical advice:
It depends on what you mean by more advanced models. I run Qwen3.6-27b on 48GB VRAM across 3 cards (RTX 2000e Ada), and with the recent software optimizations merged into llama.cpp (tensor parallelism & MTP) I get around 30 tokens per second in generation. I use the model through openwebui for (agentic) web research and simple Q&A mostly and I’m quite happy with what it can do.
If you want something similar, maybe look at one or two second-hand V100 PCIe 32GB cards. Or something from the Intel Arc Pro series, if you don’t mind the software support lagging behind a bit (as in less optimized).
Also it might be worth reading up on the difference between dense and MoE models, if you’re new to that. For MoE models, if your system RAM is fast enough, it’s often viable to offload the “experts” (the largest parts of such models) to RAM, reducing how much VRAM you need. Note that server motherboards with e.g. octa-channel RAM have a huge advantage over consumer boards (making DDR4 interesting despite the slower speed per module).
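To put rough numbers on the channel-count point, here is a small sketch. It assumes each memory channel is 64 bits (8 bytes) wide, uses theoretical peak bandwidth, and treats token generation as purely memory-bound with all active weights read from system RAM once per token. Real throughput will be lower, and the model figures (3e9 active parameters, 4.5 bits per weight) are hypothetical.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak RAM bandwidth in GB/s, assuming 8 bytes per transfer per channel."""
    return channels * mega_transfers_per_s * 1e6 * 8 / 1e9


def decode_ceiling_tps(bandwidth_gbs: float, active_params: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


if __name__ == "__main__":
    server = peak_bandwidth_gbs(8, 3200)    # octa-channel DDR4-3200: 204.8 GB/s
    desktop = peak_bandwidth_gbs(2, 6000)   # dual-channel DDR5-6000: 96.0 GB/s
    for name, bw in (("server", server), ("desktop", desktop)):
        tps = decode_ceiling_tps(bw, 3e9, 4.5)  # hypothetical MoE model
        print(f"{name}: {bw:.1f} GB/s, ceiling ~{tps:.0f} tokens/s")
```

Even with slower modules, the eight-channel board has roughly twice the peak bandwidth of the dual-channel one in this example, which is what matters when the experts live in RAM.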
And to address your last question: while I have no direct experience, I’ve seen posts online about people connecting multiple Strix Halo or DGX Spark devices, usually via a 10+ Gbit/s switch, since the interconnect is crucial (unless you just want to load balance).
Self-hosting LLMs is a very fun thing to do, but also a time- and money-consuming rabbit hole. You might wanna check out the LocalLLaMA community over at sh.itjust.works.
Edit: typos
robber@lemmy.ml to LocalLLaMA@sh.itjust.works • Llama.cpp MTP Support merged - up to 2.5x speed increase • English
3 · 20 days ago
Using MTP combined with tensor parallelism, I was able to go from running Qwen3.6 27b at ~7 t/s to ~30 t/s, which I think is an insane boost (3x RTX 2000e Ada).
robber@lemmy.ml to LocalLLaMA@sh.itjust.works • Noob here: Why is Google making Gemma open-source? • English
3 · 1 month ago
A lot has been said, but to add to the list I’d say it gives them access to quite a large pool of free testers.
LLM architectures and optimization techniques change rapidly, and when open-weight models are released, a lot of enthusiasts will evaluate them for free, help implement support in inference engines, catch bugs etc. (and in turn, ofc, get a new model to run for free, so it’s at least somewhat symbiotic).
We saw this quite clearly when Alibaba released Qwen3-Next, a somewhat undertrained but still useful model that introduced the architecture their latest models now use “in production” (including their paid “Max” models).
robber@lemmy.ml to Technology@lemmy.world • HP's ink-blocking firmware may violate new global sustainability rules • English
5 · 3 months ago
Global sustainability rules???
I don’t follow the discussions on this topic very closely, but as I understand it, there are different ways to achieve the goal, and all of them impact quality to some extent. Heretic is discussed as one of the SOTA methods. The README posted above states the following, so it seems that heretic is some sort of next-gen abliteration.
It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
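As far as I understand the basic idea behind directional ablation, you take a "refusal direction" r in the model's hidden space and remove that component from what a weight matrix writes, i.e. W' = W - r rᵀ W with r normalized. The sketch below shows only that projection step in plain Python on toy matrices. It is not heretic's code, the function names are made up, and finding the direction and the Optuna-based parameter search are not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ablate_direction(weight, direction):
    """Return W - r r^T W for a weight matrix of shape (d_out, d_in).

    weight: list of d_out rows, each of length d_in.
    direction: vector of length d_out; normalized internally.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract that component, so outputs no longer have a part along r.
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


if __name__ == "__main__":
    toy_weight = [[1.0, 2.0], [3.0, 4.0]]
    print(ablate_direction(toy_weight, [1.0, 0.0]))  # first output row is zeroed
```

After the projection, no input can produce an output with a component along r through this matrix, which is also why some quality loss is expected: anything useful that happened to lie along that direction goes with it.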
Yeah I enjoy it as well. Just in case you missed it - a fix was merged into llama.cpp two days ago which is said to improve quality.
Edit: I stand corrected - the fix for the issue you’re experiencing has not yet been merged.
Depends on the version you’re running.
Wikipedia states the UI layer is proprietary, is that true?
The country’s official app for COVID immunity certificates or whatever they were called was available on F-Droid at the time.
A review from earlier this year didn’t sound too bad.
Edit: as pointed out, the review seems to be about the previous version of the phone.
robber@lemmy.ml to Selfhosted@lemmy.world • Been seeing a lot of posts about replacing Spotify and such, so I wrote up a guide on how I did just that • English
3 · 9 months ago
One reason could be that the audience on lemmy has a left-ish bias and there’s a political component to the Spotify exodus.
Edit: don’t get me wrong, I love seeing content and engagement on here.
SFTPGo is such an awesome project, never had any problems with it.
robber@lemmy.ml to Selfhosted@lemmy.world • Best way to get IPv4 connectivity to my self-hosted services • English
5 · 1 year ago
I’ll add Pangolin to the list, it’s a self-hosted Cloudflare tunnel alternative.
It really depends on how much you enjoy setting things up yourself and how much it hurts you to give up control over your data with managed solutions.
If you want to do it yourself, I recommend taking a look at ZFS and its RAIDZ configurations, snapshots and replication capabilities. It’s probably the most solid setup you will achieve, but possibly also a bit complicated to wrap your head around at first.
But there are a ton of options as beautifully represented by all the comments.
Thanks for the hint about pocketID, hadn’t heard of it before. That makes me think it’s time to upgrade my auth stack as well.
That sounds awesome! No issues at all so far?
You might want to check out heretic or similar tools. I haven’t tried it, but there are a lot of heretic finetunes available on HF.