<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://blog.matyasprokop.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.matyasprokop.com/" rel="alternate" type="text/html" /><updated>2026-01-22T22:29:20+00:00</updated><id>https://blog.matyasprokop.com/feed.xml</id><title type="html">Matyas’ Notes</title><subtitle>Random notes about AI, Network Automation, DevOps, Linux and SDN</subtitle><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/rig/-/week/1/observations/2026/01/22/it-is-a-week-1.html" rel="alternate" type="text/html" title="" /><published>2026-01-22T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/rig/-/week/1/observations/2026/01/22/it-is-a-week-1</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/rig/-/week/1/observations/2026/01/22/it-is-a-week-1.html"><![CDATA[<p>It is week 1 of me running my AI rig, after the GPU arrived last week. A few first observations:</p>

<ul>
  <li>GPUs are big and heavy! And expensive…</li>
  <li>Having a server that can generate tokens “for free” is, I would say, very liberating. You are just testing and running things against your own server without worrying about your bills.</li>
  <li>You can run some fantastic LLMs on 5-year-old hardware (my GPU is an RTX 3090). This is probably one of the most fascinating things I have found out this week. Today’s models that fit into 24GB of VRAM are seriously good, on the level of the LLMs we had to pay OpenAI or Google for last year. The LLM I’m currently testing is Qwen3-Coder-30B, and the things it can build are crazy good.</li>
  <li>I have switched from using an LLM as an assistant in VS Code to running agents (semi-)autonomously, such as Goose. Goose is currently my favourite agent. It is open source and you can run it either with a GUI or in the console. It supports tools, sub-tools and MCP, so it can do lots of interesting things.</li>
  <li>Speaking of Goose, I’m learning how to use agents more efficiently. Workflows where you plan and then implement with the agent are something I’m starting to learn, and I would say I’m getting better at it. It is fascinating to see agents doing their own thing. I’m learning how to build Skills and use recipes.</li>
  <li>I chose vLLM as my platform for running LLMs. It took a little bit of testing, but I think I have nailed it down so that everything works as it should. One weakness is probably the limited GGUF support.</li>
  <li>The pace is high. vLLM has a new release every other week with new features, and a new, better model drops on Hugging Face every other month.</li>
</ul>]]></content><author><name></name></author><category term="AI" /><category term="rig" /><category term="-" /><category term="week" /><category term="1" /><category term="observations" /><summary type="html"><![CDATA[It is a week 1 me running my AI rig after GPU has arrived last week. Few first observations: GPUs are big and heavy! And expensive…. The ability of having server which can generate tokens “for free” I would say is very libereting. You are just testing and running things against your own server and you are not worrying about your bills. You can run some fantastic LLMs on 5 years old hardware (my GPU is RTX 3090). This is probably one of the most fascinating things I have found out this week. Today’s models which you can fit into 24GB VRAM are seriously good and on the level of LLMs we had to pay OpenAI or Google last year. My current LLMs which I’m testing is Qwen3-Coder-30B and the things it can build are crazy good. I have switched from using LLM as assistant in your VScode to running agents (semi)autonomously like Goose AI. Goose is currently my favourite agent. It is open source and you can run it either in GUI or in console. It supports tools, sub-tools and MCP so it can do lots of interesting stuff. Speaking of Goose I’m learning how to use agents more efficiently - workflows of to plan and implement with agent is something I’m starting to learn and I would say getting better in it. It is fascinating seeing agents doing their own thing. I’m learning how to build Skills and use receipts. I chose vLLM as my platform to run LLMs. It took little bit of time of testing but I think nailed it down so everything works as it should be. One weakness is probably limited GGUF support. The pace is high. vLLM has new release every other week with new features. 
New better model is being dropped every other month on Huggingface.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/linux/homelab/llm/2025/11/24/ai-rig-update.html" rel="alternate" type="text/html" title="" /><published>2025-11-24T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/linux/homelab/llm/2025/11/24/ai-rig-update</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/linux/homelab/llm/2025/11/24/ai-rig-update.html"><![CDATA[<p>Quick update on my AI Rig. It has been a pretty steep learning curve, I must say, over the last month, but I feel I have squeezed as much out of the hardware as I could. My 64GB of memory is pretty much filled up.</p>

<p><strong>Kubernetes</strong> 
I have installed pretty much the standard stack - Cilium, CSI NFS, Grafana, Prometheus. Everything is managed with GitOps using Flux. I decided on day 1 to deploy Vault for secrets management, which could come in handy in the future. The plan is to deploy vLLM on this stack and probably migrate my web server to the cluster. I have Jupyter notebooks running, so it is ready for AI/ML sandboxing.</p>

<p><strong>vLLM VM</strong>
Speaking of vLLM, this was completely new to me; I had never tried to run it before. I initially tried to run it on Kubernetes, but with my limited memory and no GPU (yet) I managed to kill my cluster a few times. I decided to take a step back and isolate vLLM from everything else, which turned out to be the better approach. I’m now running the small Qwen/Qwen3-1.7B model. It is more for testing purposes, but it is pretty cool that I can run a small LLM on CPU only.</p>
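<p>For reference, this is roughly how I query the model once vLLM’s OpenAI-compatible server is up. A minimal sketch using only the Python standard library, assuming vLLM’s default endpoint on port 8000 and the model name above - adjust host, port and model to your setup:</p>

```python
# Minimal sketch: query a local vLLM server through its OpenAI-compatible
# chat completions API using only the standard library.
# Assumptions: vLLM's default port 8000 and the Qwen/Qwen3-1.7B model name.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_request("Qwen/Qwen3-1.7B", "Say hello in one sentence.")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

<p>The same request shape works against any OpenAI-compatible endpoint, which is what makes swapping models (or serving stacks) painless later.</p>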

<p><strong>Goose</strong>
I have started playing with open-source AI agents. This is a completely new space for me, so I started testing with the Qwen model running on vLLM. Does Goose support it? Yes. Is it useful with the small Qwen model? Not really. I will have to wait to use it with larger local models. I tested it very briefly with Gemini 3.0 Pro and it worked great.</p>
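<p>For the curious, this is roughly how I point Goose at a local OpenAI-compatible endpoint such as vLLM. Treat the key names below as assumptions based on my reading of the Goose docs rather than gospel - running <code>goose configure</code> will generate the real thing for you:</p>

```yaml
# ~/.config/goose/config.yaml - a sketch; key names are assumptions
GOOSE_PROVIDER: openai              # vLLM exposes an OpenAI-compatible API
GOOSE_MODEL: Qwen/Qwen3-1.7B        # whatever model vLLM is serving
OPENAI_HOST: http://localhost:8000  # point the provider at vLLM, not api.openai.com
```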

<p>I have done lots of backend work outside of the AI stack: Portworx backups, web server, mail server, Obsidian, important Grafana dashboards, etc.</p>

<p>I think I now have a solid platform I can start building on. In the following weeks I want to spend more time on tuning vLLM and probably migrate it to Kubernetes. The plan is to buy a GPU in December so I can finally make real progress on AI and focus on it.</p>]]></content><author><name></name></author><category term="AI" /><category term="Linux" /><category term="Homelab" /><category term="LLM" /><summary type="html"><![CDATA[Quick update on my AI Rig. It has been pretty steep learning curve I must stay in the last month but I feel I was able to squeeze out of that HW as much as I could. My 64GB memory is pretty much filled up. Kubernetes I have installed pretty much the standard stack - Cillium, CSI NFS, Grafana, Prometheus. Everything is managed with GitOps using Flux. I have decided day 1 to deploy Vault for secrets management which can get handy in the future. The plan is to deploy vLLM on this stack and probably migrate my web server on the cluster. I have Jupyter playbooks running so it is ready for AI/ML sandboxing. vLLM VM Speaking to vLLM this was completely new. Never tried to run vLLM. I have initially tried to run it on Kubernetes but with my limited memory and no GPU (yet) I have managed to kill my cluster few times. I have decided to take a step back and isolate vLLM from anything else which showed to be better approach. I’m now running small Qwen/Qwen3-1.7B model. It is more for testing purposes but pretty cool I can run small LLM on CPU only. Goose I was starting to play with opensource AI agents. This is completely new space for me so was starting to testing it with Qwen model running vLLM. Does Goose support it? Yes. Is it useful with small Qwen model? Not really. I will have to wait to use with larger local models. I have tested it very briefly with Gemini 3.0 Pro and it worked great. I have done lots of backend work outside of AI stack like Portworx backups, web server, mail server, Obsidian, important Grafana dashboards etc. 
I think I have now solid platform I can start building on. In the following weeks I want to spend more time on tuning vLLM and probably migrate it to Kubernetes. Plan is to buy GPU in December to start more progressing on AI and finally start focusing on that.]]></summary></entry><entry><title type="html">Cisco AI Summit 2025</title><link href="https://blog.matyasprokop.com/cisco/ai/conferences/2025/10/08/cisco-ai-summit-2025-2.html" rel="alternate" type="text/html" title="Cisco AI Summit 2025" /><published>2025-10-08T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/cisco/ai/conferences/2025/10/08/cisco-ai-summit-2025-2</id><content type="html" xml:base="https://blog.matyasprokop.com/cisco/ai/conferences/2025/10/08/cisco-ai-summit-2025-2.html"><![CDATA[<p>I’m at the Cisco AI Summit in Paris this week. It is an opportunity for Cisco to present the latest advancements in its AI portfolio. Under Jeetu’s leadership we’re seeing much more streamlined product ranges and simplified messaging (including for the AI portfolio), and this isn’t just limited to AI – it also reflects shifts beyond it. Some of my thoughts below.</p>

<!--more-->

<p>Regarding AI specifically, the move from isolated chatbots towards agentic AI seems undeniable. If you consider that the market was already fragmented during the “cloud native” era a few years ago, it has become even more so in today’s AI age. New tools are being created almost weekly.</p>

<p>This presents a challenge for vendors like Cisco. They might attempt to differentiate on performance – but that isn’t likely the factor on which they can compete against giants like Nvidia. Instead, Cisco’s leadership appears to believe differentiation lies in AI security, which could be a valid proposition. However, Cisco should also consider adding more vertical integration capabilities to its offering and focusing more on AI inferencing.</p>

<p>Let me explain why this might be an opportunity for Cisco. Traditionally, AI builders have relied on public clouds (like AWS or Azure) which provide easy deployment and operation, reducing initial hardware investment needs. These platforms also allow for rapid iteration during the development phase before potentially moving workloads to production. However, when an AI project succeeds and moves towards production deployment, cloud costs frequently get out of control. This forces infrastructure teams to explore “on-premises” alternatives to manage costs.</p>

<p>Moving AI workloads “on premises” isn’t simply about deploying a few server GPUs. It requires the full stack – data, models, inference environment – running on top of robust hardware layers. While public clouds excel here, building a comparable “on-prem” solution is still very challenging.</p>

<p>A focus on inferencing, however, will be crucial. It is great that Cisco is the only networking vendor providing compatibility with Nvidia’s Spectrum-X, because while training large models is where the significant investment often happens, companies – especially Cisco’s enterprise clients – primarily run these models rather than train them, and this will only keep growing.</p>

<p>There were some interesting announcements around Cisco’s own models, e.g. the Deep Network Model. I’m planning to do a little more testing with this model to see in exactly which tasks it stands out against models like GPT-5. A new Time Series Foundational Model will be released in November 2025. I like the idea that vendors will build their own models in the future, which should then allow them to build their own, more specialised agents - I think.</p>

<p>Moving to the longer-term vision, Cisco seems determined to play a more active role in agentic AI, potentially through its AGNTCY project. The same way they were central to defining infrastructure for the Internet in the 90s, perhaps their goal now is to shape “agentic infrastructure.” This involves standardizing the way agents interact and communicate with each other, thereby addressing the complexities associated with tools like LangChain, LangGraph, and MCP.</p>

<p>I like how Cisco takes a wider approach and experiments with new concepts more than some of its competitors. The key now is to address the underlying challenges around its compute portfolio and to focus more on the AI software stack and inferencing, not just training.</p>]]></content><author><name></name></author><category term="Cisco" /><category term="AI" /><category term="Conferences" /><summary type="html"><![CDATA[I’m at the Cisco AI Summit in Paris this week. This is an opportunity for Cisco, presenting their latest advancements in their AI portfolio. Under Jeetu’s leadership we’re seeing much more streamlined product ranges, simplified messaging (including the AI portfolio), and this isn’t just limited to AI – it also reflects shifts beyond. Some of my thoughts below.]]></summary></entry><entry><title type="html">My new home AI server - Part 3 - The Eagle has landed</title><link href="https://blog.matyasprokop.com/ai/linux/homelab/2025/09/29/ai-lab-part-3.html" rel="alternate" type="text/html" title="My new home AI server - Part 3 - The Eagle has landed" /><published>2025-09-29T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/linux/homelab/2025/09/29/ai-lab-part-3</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/linux/homelab/2025/09/29/ai-lab-part-3.html"><![CDATA[<p>The eagle has landed. I was finally able to finish the build of my AI rig - without a GPU for now. I will explain why later.</p>

<p><img src="/assets/img/2025-09-29-ai-lab-part-3/ai-rig.jpg" alt="" width="500" align="center" /></p>

<!--more-->
<h2 id="final-specs">Final specs</h2>

<p>Without further ado, this is the final build:</p>

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Part Name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Motherboard</strong></td>
      <td>Supermicro H13SSL-N</td>
    </tr>
    <tr>
      <td><strong>CPU</strong></td>
      <td>AMD EPYC 9124</td>
    </tr>
    <tr>
      <td><strong>Memory</strong></td>
      <td>64GB DDR5 ECC (2 x 32GB)</td>
    </tr>
    <tr>
      <td><strong>Chassis</strong></td>
      <td>Sliger CX4200a</td>
    </tr>
    <tr>
      <td><strong>Chassis Bracket</strong></td>
      <td>4U Rear Exhaust Fan Mounting Bracket 120mm</td>
    </tr>
    <tr>
      <td><strong>Chassis Front Fan</strong></td>
      <td>3x Noctua NF-A12x25 PWM</td>
    </tr>
    <tr>
      <td><strong>Chassis Rear Fan</strong></td>
      <td>Noctua NF-A12x25 PWM</td>
    </tr>
    <tr>
      <td><strong>CPU Fan</strong></td>
      <td>ARCTIC Freezer 4U-SP5</td>
    </tr>
    <tr>
      <td><strong>PSU</strong></td>
      <td>Seasonic PRIME TX-1600</td>
    </tr>
    <tr>
      <td><strong>Boot/OS storage</strong></td>
      <td>2x Samsung 990 PRO NVMe M.2 SSD, 2 TB, PCIe 4.0</td>
    </tr>
    <tr>
      <td><strong>Data Storage</strong></td>
      <td>2x 3TB SATA HDD</td>
    </tr>
  </tbody>
</table>

<h2 id="build">Build</h2>
<p>It was fairly smooth sailing. I had put so much effort into planning, reading and researching that most of the build was done in one afternoon. Open the chassis, install all 4 fans, put in the motherboard risers, attach the M.2s to the motherboard, put in the motherboard, plug in the power supply, put in the hard drives and connect everything up. Done. It would have been nice to just power it on and see what happens, but I was missing the memory due to long lead times. That didn’t stop me, though - I wanted to test whether I could at least connect to the BMC (Supermicro’s out-of-band management)… and I could. So if you ever wonder whether you can connect to the BMC without memory installed, the answer is yes.</p>

<p>It looks like lead times for server-grade DDR5 memory are currently measured in weeks, so after 2+ weeks of waiting I had my memory and was able to power the server up. And it came up without any issues! Massive success.</p>

<h2 id="hypervisor">Hypervisor</h2>
<p>While waiting for my memory I did a lot of thinking about which hypervisor to go for. I was considering Ubuntu KVM and Proxmox, and in the end went for Proxmox. I felt that hypervisor configuration and operation is something I want to spend minimal time on, and Proxmox delivers that. It is based on Debian, and after playing with it a little, it offers the flexibility of a standard Linux distribution with some extra features like a nice GUI. Definitely recommend it.</p>

<h2 id="running-it">Running it</h2>
<p>I started on Friday night, and after a few more hours on Saturday and Sunday I was able to configure the hypervisor, install a web server and basic monitoring, and migrate a couple of my AWS servers onto my new home server, including this blog. I even managed to set up backups to my NAS.</p>

<h2 id="what-next">What next?</h2>
<p>I still have to migrate my personal mail server from AWS, which means I can then shut down my whole AWS environment. I will work on the design for deploying a Kubernetes cluster, where I’m planning to eventually migrate my web server and make it ready for future AI experiments. After that I will start looking at things like a rack and, obviously, a GPU. I think the server (but mainly my wallet) will be ready for the GPU towards the end of October.</p>]]></content><author><name></name></author><category term="AI" /><category term="Linux" /><category term="Homelab" /><summary type="html"><![CDATA[The eagle has landed. I was finally able to finish the build of my AI rig - without GPU for now. I will explain later.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/llm/gaia/oss/huggingface/2025/09/25/agents-new-gaia.html" rel="alternate" type="text/html" title="" /><published>2025-09-25T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/llm/gaia/oss/huggingface/2025/09/25/agents-new-gaia</id><content type="html" xml:base="https://blog.matyasprokop.com/llm/gaia/oss/huggingface/2025/09/25/agents-new-gaia.html"><![CDATA[<p>As Simon has recently <a href="https://simonwillison.net/2025/Sep/18/agents/">pointed out</a>:</p>

<blockquote>
  <p>I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now</p>
</blockquote>

<p>A sign that the industry feels the same is Hugging Face <a href="https://huggingface.co/blog/gaia2">introducing</a> a new GAIA benchmark called (surprise) Gaia2.</p>

<blockquote>
  <p>Where GAIA was read-only, Gaia2 is now a read-and-write benchmark, focusing on interactive behavior and complexity management.</p>
</blockquote>

<p>In their new framework they approach the whole benchmark as a test of how a human would use agents to achieve their goals, i.e. sending emails, creating calendar events or simply chatting with other agents. I think this reflects the way we are starting to use LLMs: not just in “read-only” but in “read-write” mode, where agents are becoming more and more interactive. It is an interesting document in which the Hugging Face team describes how the benchmark works, what they are testing and how you can run the test yourself. They present their first results using typical models like Llama 3.3, GPT-4o and Gemini.</p>]]></content><author><name></name></author><category term="LLM" /><category term="Gaia" /><category term="OSS" /><category term="Huggingface" /><summary type="html"><![CDATA[As Simon has recently pointed out: I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now The example the industry feels the same is Huggingface introducing new GAIA tests called (surprise) Gaia2. Where GAIA was read-only, Gaia2 is now a read-and-write benchmark, focusing on interactive behavior and complexity management. In their new framework they are approaching the whole benchmark as a test how human would be using agents to achieve their goals i.e. sending emails, create calendar events or simply chatting to other agents. I think this reflects the way we are starting to see using LLM: not just in “read-only” but in “read-write mode” where agents are becoming more and more interactive. Interesting document where Huggingface team describe how the benchamrk works, what they are testing and how you can run the test yourself. 
They are presenting their first results using typical models like Llama 3.3, GPT-4o or Gemini.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/inferencing/positron/2025/09/23/inferencing-positron.html" rel="alternate" type="text/html" title="" /><published>2025-09-23T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/inferencing/positron/2025/09/23/inferencing-positron</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/inferencing/positron/2025/09/23/inferencing-positron.html"><![CDATA[<p>Without a doubt there is strong momentum behind AI inferencing, and therefore behind new AI inferencing hardware. We are past the point where AI inferencing is dominated only by Nvidia, AMD, Intel and maybe Groq. We are seeing new vendors like MiTac, Nebius and Positron who focus purely on AI inferencing, which lets them concentrate on inferencing-specific features. They are not trying to beat Nvidia at its core business - AI training - but rather target the AI inferencing market, which has big potential.</p>

<p>I was only able to meet with <a href="https://www.positron.ai/">Positron</a> briefly; a few interesting facts about them:</p>
<ul>
  <li>Co-founded by ex-Groq and ex-LambdaLabs, 30 employees</li>
  <li>They just went through Series A funding</li>
  <li>Delivering their own AI inferencing product to traditional datacenters</li>
  <li>They are not focusing on speed - they focus on performance/W and performance/$</li>
  <li>They see the future of inferencing in an agentic MoE architecture - a collection of small LLMs</li>
  <li>At this stage they are focusing on larger deployments
    <ul>
      <li>Offering AI server with 8x Positron Archer Transformer Accelerators, each with 32GB HBM</li>
    </ul>
  </li>
</ul>

<p>I will have a follow-up with Positron focusing on their technical architecture in the next few weeks, so I will write a more detailed post then. It is very exciting to see new hardware vendors - this is a field open to new companies.</p>]]></content><author><name></name></author><category term="AI" /><category term="Inferencing" /><category term="Positron" /><summary type="html"><![CDATA[Without a doubt there is a strong momentum for AI inferencing and therefore for new AI inferencing hardware. We are beyond the point where AI inferencing has been dominated by Nvidia, AMD, Intel and maybe Groq. We see new vendors like MiTac, Nebius and Positron who are focusing purely on AI inferencing which allows them to focus on AI inferencing features. They are not trying to beat Nvidia in their core business - AI Training but they rather focus on AI inferencing market which has big potential. I was able to meet with Positron only briefly and few interesting facts about them: Co-founded by ex-Groq and ex-LambdaLabs, 30 employees They just went through series A funding Delivering their own AI inferencing product to traditional datacenters They are not focusing on speed - they focus on performance/W and performance/$ They see future of inferencing on agentic MoE architecture - collection of small LLMs At this stage they are focusing on larger deployments Offering AI server with 8x Positron Archer Transformer Accelerators, each with 32GB HBM I will have a follow up with Positron focusing on their technical architecture in following few weeks so I will follow up with more detailed post. 
Very exciting to see new hardware vendors - this is exciting field opened to new companies.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/robotics/2025/09/17/run-jarmil-run.html" rel="alternate" type="text/html" title="" /><published>2025-09-17T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/robotics/2025/09/17/run-jarmil-run</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/robotics/2025/09/17/run-jarmil-run.html"><![CDATA[<p><strong><a href="https://www.youtube.com/watch?v=1XWP4e_67M0">Run, Jarmil, run!</a></strong> I watched this video on holiday last week and couldn’t stop thinking about it. It is a live demo of the <a href="https://www.unitree.com/g1">Unitree G1</a> robot. The video is in Czech, but turn on the English subtitles and watch how Jarmil can walk, run and pick up (fake) eggs. In the video they mention how they had to hack the G1 by installing WiFi and cameras. They also mention a plan to train the robot with <a href="https://developer.nvidia.com/isaac/sim">NVIDIA Isaac Sim</a>.</p>

<p>It is fascinating to see where robotics has managed to get in the last few years, with things like LLMs, LVMs (Large Vision Models) and hardware engineering merging together. These are very exciting times for robotics.</p>]]></content><author><name></name></author><category term="AI" /><category term="Robotics" /><summary type="html"><![CDATA[Run, Jarmil, run!. I have watched this video last week on holidays and couldn’t stop thinking about it. Live demo of Unitree G1 robot. The video is in Czech language but turn on English subtitles and watch how Jarmil can walk, run and pick (fake) eggs. In the video they mention how they had to hack G1 with installing WiFi and cameras. They also mention plan to train the robot with NVIDIA Isaac Sim. It is fascinating to see where robotics managed to get in the last few years where things like LLM, LVM (Large Vision Model) and hardware engineering merging together. We are at the very exciting times when it comes to robotics.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/linux/llm/2025/09/04/llm-d-large-scale.html" rel="alternate" type="text/html" title="" /><published>2025-09-04T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/linux/llm/2025/09/04/llm-d-large-scale</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/linux/llm/2025/09/04/llm-d-large-scale.html"><![CDATA[<p>The llm-d team released a new <a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d">post</a> focusing on intelligent inference serving and how LLM traffic differs from stateless web requests. Worth reading.</p>

<p>Looking a little deeper into what the right architecture for AI workloads in my homelab should be, I came across <a href="https://llm-d.ai/">llm-d</a>. It was launched by CoreWeave, Google, IBM Research, NVIDIA, and Red Hat. Their statement really resonates:</p>

<blockquote>
  <p>The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations within their existing deployment framework - Kubernetes.</p>
</blockquote>

<p>Llm-d’s building blocks are vLLM as the inferencing engine, Kubernetes as the core platform, and an Inference Gateway providing intelligent scheduling built for LLM-style workloads. I would highly recommend spending a bit more time reading through their <a href="https://llm-d.ai/blog/llm-d-announce">announcement</a>, which explains very nicely the differences between typical workloads and LLM workloads.</p>

<p>Even though the focus is on deploying large-scale inference on Kubernetes using large models (e.g. Llama-70B+, not Llama-8B) with longer input sequence lengths (e.g. 10k ISL | 1k OSL, not 200 ISL | 200 OSL), mostly tested on 8 or 16 Nvidia H200 GPUs, there are parts, like Intelligent Inference Scheduling, that have been tested and run on a single GPU.</p>]]></content><author><name></name></author><category term="AI" /><category term="Linux" /><category term="LLM" /><summary type="html"><![CDATA[Llm-d team released new post which is focusing on intelligent inference serving and how LLM is different from stateless web requests. Worth to read. Looking little bit more deeper on what should be the right architecture for AI workloads in my homelab I came across llm-d. It has been launched by CoreWeave, Google, IBM Research, NVIDIA, and Red Hat. Their statement really resonates: The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations within their existing deployment framework - Kubernetes. Llm-d building blocks are vLLM as inferencing engine, K8s as a core platform and Inference gateway to provide intelligent scheduling which is build for LLM type of workloads. I would highly recommend spend a bit more time to read through their announcement which explains very nicely the differences between typical workloads and LLM workloads. Even though the focus is on deploying large scale inference on Kubernetes using large models (e.g. 
Llama-70B+, not Llama-8B) with longer input sequence lengths (e.g 10k ISL | 1k OSL, not 200 ISL | 200 OSL) and mostly tested on 16 or 8 Nvidia H200 GPUs but there are parts like Intelligent Inference Scheduling which has been tested and run on single GPU.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/linux/homelab/2025/09/02/gpu-passthrough-copy.html" rel="alternate" type="text/html" title="" /><published>2025-09-02T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/linux/homelab/2025/09/02/gpu-passthrough%20copy</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/linux/homelab/2025/09/02/gpu-passthrough-copy.html"><![CDATA[<p><a href="https://github.com/sbnb-io/sbnb/blob/main/README-GPU-PASSTHROUGH-BENCHMARK.md">Measuring GPU Passthrough Overhead Using vfio-pci on AI Linux</a>. I’m currently at the stage of assessing the impact of the hypervisor on GPU performance. Does it make a difference whether I run AI workloads on Kubernetes or on a hypervisor? It looks like the performance impact of using GPU passthrough is negligible.</p>

<p>Abylay Ospan, one of the kernel maintainers:</p>

<blockquote>
  <p>The performance impact of GPU passthrough via vfio-pci in AI Linux (Sbnb Linux) is impressively low-averaging around 1-2% across a range of LLM models. This makes it a highly viable option for running accelerated inference inside virtual machines, enabling isolation and flexibility without compromising performance.</p>
</blockquote>]]></content><author><name></name></author><category term="AI" /><category term="Linux" /><category term="Homelab" /><summary type="html"><![CDATA[Measuring GPU Passthrough Overhead Using vfio-pci on AI Linux. I’m currently in the stage I’m assessing the impact of hypervisor on GPU performance. Does it make difference if I’m running AI workloads on Kubernetes or on hypervisor? It looks like the impact on performance of using GPU Passthrough seems negligable. Abylay Ospan who is one of Kernel maintainers: The performance impact of GPU passthrough via vfio-pci in AI Linux (Sbnb Linux) is impressively low-averaging around 1-2% across a range of LLM models. This makes it a highly viable option for running accelerated inference inside virtual machines, enabling isolation and flexibility without compromising performance.]]></summary></entry><entry><title type="html"></title><link href="https://blog.matyasprokop.com/ai/podcast/2025/09/01/ai-security-crisis-2.html" rel="alternate" type="text/html" title="" /><published>2025-09-01T00:00:00+00:00</published><updated>2026-01-22T22:27:01+00:00</updated><id>https://blog.matyasprokop.com/ai/podcast/2025/09/01/ai-security-crisis-2</id><content type="html" xml:base="https://blog.matyasprokop.com/ai/podcast/2025/09/01/ai-security-crisis-2.html"><![CDATA[<p><strong><a href="https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/ai-s-security-crisis-why-your-assistant-might-betray-you/">AI’s Security Crisis: Why Your Assistant Might Betray You</a>.</strong> I signed up to Corey’s mailing list probably 7 years ago and have read it almost every time since. Especially in the beginning, it was a great way to learn more about AWS. Purely by accident I ran into his podcast episode with Simon Willison. I have followed <a href="https://bsky.app/profile/simonwillison.net">Simon</a> on Bluesky for a while, and he is in my opinion currently the most inspiring person in the AI space. 
He talks about AI inferencing prices and their environmental impact, open source, and blogging. I like the bit where he explains how he talks to his AI assistant when he goes for a walk. I thought I was the only one doing that and feeling uncomfortable about it. Highly recommend it.</p>]]></content><author><name></name></author><category term="AI" /><category term="Podcast" /><summary type="html"><![CDATA[AI’s Security Crisis: Why Your Assistant Might Betray You. I have signed up to Corey’s mailing list probably 7 years ago and reading it almost everytime since then. It was especially at the beginning great way how to learn more about AWS. Just by accident and ran into his podcast with Simon Willison. I have followed Simon on Bsky for a while and he is in my opinion currently the most inspiring persona in AI space. He is talking about AI inferencing price and impact on environment, open source, blogging. I like the big where he is explaining how he is talking to his AI assistant when he goes for a walk. I thought I’m the only doing it and feel uncomfortable about it. Highly recommend it.]]></summary></entry></feed>