<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US">
  <title>Mohamed Elashri</title>
  <link href="https://blog.melashri.net/atom.xml" rel="self" type="application/atom+xml"></link>
  <link href="https://blog.melashri.net/" rel="alternate" type="text/html"></link>
  <updated>2026-05-08T00:00:00Z</updated>
  <id>https://blog.melashri.net/atom.xml</id>
  <author>
    <name>Mohamed Elashri</name>
  </author>
  <entry>
    <title>The Security Model of Not Being Single</title>
    <link href="https://blog.melashri.net/posts/shared-security/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/shared-security/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-05-08T00:00:00Z</published>
    <updated>2026-05-08T00:00:00Z</updated>
    <summary>A semi-personal note on how security, privacy, and self-hosting change when your life is no longer designed for one user.</summary>
    <content type="html">&lt;p&gt;There is a certain kind of security model that only works when you are single.&lt;/p&gt;&#xA;&lt;p&gt;I don&#39;t mean single in the romantic drama sense. I mean single as in being the only real user of your systems. The only person who needs to understand your network, your devices, your password manager, your DNS setup, your backups, your home server, your VPN, your weird SSH keys, and the fact that sometimes the internet is down because you decided to update PiHole/Adguard Home at 1 AM.&lt;/p&gt;&#xA;&lt;p&gt;When you live alone as a tech nerd, you can optimize your life around a very strange set of priorities. You can make decisions that are technically elegant, privacy preserving, educational, and completely unreasonable for normal people. You can self-host as much as possible. You can run your own password manager, your own DNS resolver, your own media server, your own monitoring stack, your own reverse proxy, your own file sync system, your own photo backup, and maybe even your own email if you are brave enough or foolish enough.&lt;/p&gt;&#xA;&lt;p&gt;This can be great. It gives you control. It reduces dependency on third parties. It teaches you a lot. It makes your digital life feel like something you own rather than something you rent.&lt;/p&gt;&#xA;&lt;p&gt;But it also works because the user interface is you.&lt;/p&gt;&#xA;&lt;p&gt;You know that the service might be down after an update. You know that the password manager is behind a private tunnel. You know that some apps only work on the home network. You know why the DNS blocks some tracking domains. You know why a website breaks when it depends on an ad server or analytics script. You know how to bypass the problem temporarily. You know where the recovery codes are. You know which machine is the reverse proxy. You know which container needs to be restarted.&lt;/p&gt;&#xA;&lt;p&gt;Now add another person and suddenly the security model changes. 
Your partner does not want to debug DNS at dinner. Your partner does not care that the home network blocks ads at the router level using a custom resolver. They care that the shopping website does not load, the airline payment page is broken, or the app they need for work refuses to open because some tracking domain is blocked. From your point of view, this is a good privacy setup. From their point of view, the &lt;strong&gt;internet is broken&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;p&gt;And they are not wrong. Security is not only about reducing risk. It is also about keeping life usable. A system that is secure only because one person is willing to tolerate pain is not a family system. It is a personal lab. Take media streaming. If you live alone, replacing Netflix with Jellyfin can feel like a win. You control the library. You avoid another subscription. You know where the files are. You know why transcoding is slow on one device but fine on another. You know that the server sometimes needs a restart. You know that remote access requires a VPN, a tunnel, or a carefully configured reverse proxy.&lt;/p&gt;&#xA;&lt;p&gt;For you, this is freedom. But for someone else, it may be worse than Netflix. Netflix works at home, outside, on hotel Wi-Fi, on a smart TV, on a tablet, and on a phone without asking anyone to understand split DNS, WireGuard, Tailscale, reverse proxy headers, certificate renewal, or why the app says the server is unreachable when they are outside the house. Your partner does not want to know that inside the house the server is available at one address, while outside the house it requires VPN access. They want to press play.&lt;/p&gt;&#xA;&lt;p&gt;This is where self-hosting becomes socially expensive. The technical achievement is real, but so is the UX debt. The same applies to photos. A self-hosted photo backup setup can be wonderful when it works. No big cloud provider. No silent scanning. No subscription pressure. 
But if uploads fail silently, if the mobile app is clunky, if face search is worse, if sharing an album with family is harder, or if your partner has to ask you whether the baby photos are actually backed up, then your privacy improvement has created a trust problem.&lt;/p&gt;&#xA;&lt;p&gt;Backups are another example. For a single person, a complicated 3-2-1 backup setup with encrypted drives, restic repositories, off-site storage, and manual recovery commands can be acceptable. You know the passphrases. You know the restore procedure. You know which machine has the latest snapshot. In a shared life, a backup system that only you can restore is not fully resilient. It protects against disk failure, but not against your absence. If your partner cannot recover important documents, photos, tax records, or family files without decoding your personal infrastructure, then the system has a hidden failure mode.&lt;/p&gt;&#xA;&lt;p&gt;The same thing happens with passwords. A strict personal password manager setup is excellent. Long random passwords, hardware keys, no SMS fallback, separate vaults, strong two-factor authentication. For one person, this is sensible. For a household, you need shared vaults, emergency access, account ownership rules, and a plan for what happens when someone loses a phone. Who has access to the electricity account? Who can log in to the insurance portal? Who can renew the domain name that keeps the home services online? Who can access the child&#39;s school account? Who knows where the recovery codes are?&lt;/p&gt;&#xA;&lt;p&gt;A lone wolf can keep everything in their head. A household cannot. This becomes even more obvious when we consider devices. For example, the iPhone has a security option that erases the device after too many failed passcode attempts. For a single adult who controls the device carefully, this can make sense. If the phone is stolen, repeated attempts to unlock it could wipe sensitive data. That is a reasonable threat model. 
Now imagine having a kid around. A child does not understand your threat model. A child sees a phone, presses numbers, laughs, tries again, and suddenly your very secure feature becomes a recipe for disaster. The attacker in this case is not a state actor. It is a toddler with sticky fingers and unlimited curiosity.&lt;/p&gt;&#xA;&lt;p&gt;Even screen locks change meaning. A short auto-lock timer is good security. But if you are following a recipe in the kitchen, helping someone with directions, using a baby monitor app, or letting your partner quickly check a message, constant locking becomes friction. Security features that make sense in isolation can become annoying when life becomes collaborative.&lt;/p&gt;&#xA;&lt;p&gt;Hardware security keys are another good example. They are one of the best things you can use for account protection. But if every important login depends on a small physical key that only you understand, then your personal security has also become a single point of failure for the household. What happens if you are traveling? What happens if the key is lost? What happens if your partner needs access to something urgent? What happens if an emergency requires someone else to recover an account?&lt;/p&gt;&#xA;&lt;p&gt;Again, the technical solution is not bad. It is just incomplete. A single person&#39;s security model often assumes full personal control. A family security model needs delegation, recovery, and tolerance for mistakes. Even home networking becomes different. When you live alone, you can have VLANs, a guest network, firewall rules, blocked ports, private DNS zones, local-only services, and a VPN requirement for remote access. You can decide that some services should never be exposed to the public internet. You can accept that this means a bit more friction.&lt;/p&gt;&#xA;&lt;p&gt;But other people experience the network through failure. The printer disappears. The smart TV cannot see the media server. 
The work laptop cannot connect to a corporate VPN because your DNS setup is too aggressive. A guest cannot cast to the TV. A family member joins the wrong Wi-Fi network. The baby monitor works on one SSID but not another. The security camera app works outside the house but not inside because of NAT loopback or split-horizon DNS.&lt;/p&gt;&#xA;&lt;p&gt;The network may be beautifully segmented. It may also be socially incomprehensible. Smart home devices make this even more complicated. A smart lock, camera, thermostat, or voice assistant may be convenient, but it introduces shared control. Who can unlock the door? Who can see camera feeds? Who gets notifications? What happens after a breakup? What happens if someone forgets to remove an old device? What happens if an account is compromised? What happens when the internet is down and the &amp;quot;smart&amp;quot; thing becomes very stupid?&lt;/p&gt;&#xA;&lt;p&gt;A single person can accept experimental home automation. A household needs predictable behavior. Lights should turn on. Doors should unlock. Heating should work. A clever automation that fails 5% of the time is not clever when someone else is standing in the dark.&lt;/p&gt;&#xA;&lt;p&gt;Travel is another case where the lone wolf model breaks down. If you are alone, you can use a travel router, force all traffic through a VPN, avoid public Wi-Fi logins, use privacy.com credit cards, keep strict device separation, and refuse to install random local apps. This is all manageable because you are the only one paying the cost.&lt;/p&gt;&#xA;&lt;p&gt;With a partner, the question becomes different. Can both of you access boarding passes? Can either of you log in to the hotel booking? Can someone else find the car rental details if your phone dies? Can your partner access some data without going through your complex VPN setup? 
Can you share location without turning your whole privacy model into a lecture?&lt;/p&gt;&#xA;&lt;p&gt;A private life is easier to secure than a shared life because there are fewer legitimate users. And legitimate users are always the hardest part of security.&lt;/p&gt;&#xA;&lt;p&gt;There is also the social attack surface.&lt;/p&gt;&#xA;&lt;p&gt;When you are single, many attacks target you directly. Phishing emails, malicious links, password reuse, stolen devices, weak accounts, exposed services. You can train yourself to be careful. You can make your habits stricter. You can reduce your own mistakes.&lt;/p&gt;&#xA;&lt;p&gt;But when you are not single, your attack surface includes other people. Your partner&#39;s phone. Their laptop. Their passwords. Their cloud accounts. Their old devices. Their family group chats. Their email habits. Their understanding of scams. Their tolerance for security prompts. Their willingness to use a password manager. Their patience with two-factor authentication. This does not mean the other person is careless. It means security is now a shared system.&lt;/p&gt;&#xA;&lt;p&gt;And shared systems are harder.&lt;/p&gt;&#xA;&lt;p&gt;Even guests change the model. Someone comes over and asks for Wi-Fi. Do you give them the main network password? Do you have a guest network? Is your printer exposed? Are your smart home devices isolated? Can their infected laptop see your NAS? Can their phone cast to your TV? Can they access local services by accident?&lt;/p&gt;&#xA;&lt;p&gt;When you live alone, you can ignore some of these questions. When you share a home, they become practical. This is why I think &amp;quot;not being single&amp;quot; is a serious security model. It is not only a life status. It changes the assumptions.&lt;/p&gt;&#xA;&lt;p&gt;A lone wolf can run a very strict setup. Block everything. Self-host everything. Use custom ROMs (viva &lt;em&gt;GrapheneOS&lt;/em&gt;). Avoid cloud services. Disable convenience features. 
Require hardware keys. Encrypt aggressively. Keep recovery procedures in their head. Rebuild systems from scratch. Accept broken UX as the cost of control. A household cannot work like that forever.&lt;/p&gt;&#xA;&lt;p&gt;A household needs secure defaults, but also humane defaults. It needs privacy, but also convenience. It needs access control, but also recovery. It needs backups that someone else can understand. It needs DNS that does not randomly break normal life. It needs a password manager with shared vaults. It needs emergency access. It needs documentation. It needs boring reliability. It needs a way to separate personal systems from shared systems.&lt;/p&gt;&#xA;&lt;p&gt;This does not mean giving up on security. It means maturing the model. For a single person, the question is often: How do I make this as private and secure as possible? For a shared life, the question becomes: How do I make this secure enough, usable enough, recoverable enough, and understandable enough for the people who depend on it? That is a harder question. It is also probably the better one.&lt;/p&gt;&#xA;&lt;p&gt;Because at some point, the best security system is not the one with the most impressive setup. It is the one that survives real life. It survives tired people, children, travel, emergencies, broken phones, forgotten passwords, bad UX, updates, family visits, and the fact that not everyone wants to become a system administrator just to watch a movie.&lt;/p&gt;&#xA;&lt;p&gt;Being a lone wolf gives you freedom. You can build sharp tools and live with sharp edges. Not being single means those edges can cut other people too. And that is the real lesson. Security is not only about defending against attackers. It is also about designing systems that remain safe when life becomes shared.&lt;/p&gt;&#xA;
  </entry>
  <entry>
    <title>My new blog stack: Nida</title>
    <link href="https://blog.melashri.net/posts/nida/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/nida/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-04-25T00:00:00Z</published>
    <updated>2026-04-25T00:00:00Z</updated>
    <summary>I recently switched to a new blogging stack called Nida, a new SSG built using Go and developed by me.</summary>
    <content type="html">&lt;p&gt;So a few months ago, I &lt;a href=&#34;http://blog.melashri.net/micro/zola-blog/&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;switched&lt;/a&gt; my blog from Hugo to zola, which was a nice choice at the time. Actually zola is too powerful and too simple at the same time. I switched to zola for my &lt;a href=&#34;https://melashri.net&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;Academic Website&lt;/a&gt;, this blog and also a new &lt;a href=&#34;https://ar.melashri.net&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;Arabic blog&lt;/a&gt;. I was happy with zola, but I wanted to have more control over the stack and also to learn how to build a static site generator. So I decided to build my own SSG called &lt;a href=&#34;https://github.com/MohamedElashri/nida&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;Nida&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;First, I wanted to build a simple SSG that can generate a static website from markdown files. I also wanted to have a simple and clean codebase that I can easily maintain and extend in the future. I chose Go as the programming language for Nida because it&#39;s fast, easy to learn and has a great standard library.&lt;/p&gt;&#xA;&lt;p&gt;Nida means &amp;quot;call&amp;quot; in Arabic, and I chose this name because I want Nida to be a call for simplicity and control in the world of static site generators. Nida is still in its early stages, but I&#39;m excited about the possibilities it offers. I don&#39;t plan for it to be replacement for zola or Hugo, but rather a simple and lightweight alternative built and used mainly by me. Actually this blog is now powered by Nida. 
And in the future when I get more time, I will switch both my academic website and my Arabic blog to Nida as well.&lt;/p&gt;&#xA;&lt;p&gt;Nida&#39;s philosophy is to be simple, and actually I copied the concept of having simple, easy commands from zola, but I implemented them in a way that is more suitable for my needs. Nida has a simple command line interface that allows you to build, serve and deploy your website with just a few commands. Nida also has a simple configuration file that allows you to customize your website without having to write any code. Nida also has a simple templating system that allows you to create custom templates for your website without having to learn a new templating language. Unlike zola, Nida doesn&#39;t depend on third party libraries for templating; instead it uses Go&#39;s standard library for templating, which is simple and powerful enough for my needs. This makes Nida both faster and easier to maintain and extend in the future.&lt;/p&gt;&#xA;&lt;p&gt;There are two commands in the whole Nida binary, &lt;code&gt;build&lt;/code&gt; and &lt;code&gt;serve&lt;/code&gt;. The &lt;code&gt;build&lt;/code&gt; command generates the static website from the markdown files and the templates, while the &lt;code&gt;serve&lt;/code&gt; command starts a local development server (port 1307 by default) that allows you to preview your website. Nida also has a simple deployment system that allows you to deploy your website to GitHub Pages with just a few commands. 
Actually to be honest, there is a third &lt;code&gt;nida version&lt;/code&gt; command that prints the version of Nida, but it&#39;s not really important and doesn&#39;t count, does it?&lt;/p&gt;&#xA;&lt;p&gt;So the general format for the &lt;code&gt;build&lt;/code&gt; command is:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;nida build &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-s PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--site PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-c PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--config PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-d&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--drafts&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;And for &lt;code&gt;serve&lt;/code&gt;:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;nida serve &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-s PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--site PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-c PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--config PATH&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-d&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--drafts&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;-p PORT&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; 
&lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;--port PORT&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Where &lt;code&gt;-s&lt;/code&gt; or &lt;code&gt;--site&lt;/code&gt; is the path to the site directory, &lt;code&gt;-c&lt;/code&gt; or &lt;code&gt;--config&lt;/code&gt; is the path to the configuration file, &lt;code&gt;-d&lt;/code&gt; or &lt;code&gt;--drafts&lt;/code&gt; is a flag that tells Nida to include draft posts in the generated website, and &lt;code&gt;-p&lt;/code&gt; or &lt;code&gt;--port&lt;/code&gt; is the port number for the development server. I always like having both short and long versions of the command line arguments, because it allows me to use the short version when I&#39;m in a hurry and the long version when I want to be more explicit.&lt;/p&gt;&#xA;&lt;p&gt;Most of the time, you can just run &lt;code&gt;nida build&lt;/code&gt; or &lt;code&gt;nida serve&lt;/code&gt; without any arguments, and it will work just fine, because Nida has sensible defaults for the site directory and the configuration file. But if you want to customize your website, you can use the command line arguments to specify the paths to your site directory and configuration file. The CLI arguments will always override the default values, so you can have multiple sites and configurations on the same machine without any conflicts (assuming that you work on a couple of sites at the same time, which is not really the case for me, but you never know).&lt;/p&gt;&#xA;&lt;p&gt;Nida also supports RTL natively, which is a great feature for me as an Arabic speaker. Nida uses the &lt;code&gt;dir&lt;/code&gt; attribute in the HTML to specify the direction of the text, and it also has a simple way to specify the direction of the text in the configuration file. 
This allows me to easily switch between LTR and RTL layouts without having to write any custom CSS or JavaScript.&lt;/p&gt;&#xA;&lt;p&gt;There are two examples of Nida in the code base, one for an English website and another for an Arabic website. Of course, you can count my current blog as a third example, but I don&#39;t want to brag about it. The English website example is a simple blog that has a few posts and a simple layout; the same goes for the Arabic website example. Both examples are fully functional and can be used as a starting point for your own blog.&lt;/p&gt;&#xA;&lt;p&gt;Let me compare Nida and zola a bit to give a better idea of what each allows, and to make clear that Nida is not trying to be a replacement for zola, because it is not a general-purpose static site generator. I intend to add some of the features that zola has in the future, but this is the current status.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;[!NOTE]&lt;br&gt;&#xA;The struck-through text means that I have implemented the feature in Nida and that it is working in a similar way to zola.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;table&gt;&#xA;&lt;thead&gt;&#xA;&lt;tr&gt;&#xA;&lt;th style=&#34;text-align:left&#34;&gt;Area&lt;/th&gt;&#xA;&lt;th style=&#34;text-align:left&#34;&gt;Zola&lt;/th&gt;&#xA;&lt;th style=&#34;text-align:left&#34;&gt;Nida&lt;/th&gt;&#xA;&lt;/tr&gt;&#xA;&lt;/thead&gt;&#xA;&lt;tbody&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Content types&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Arbitrary sections&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Hardcoded post/page/section only&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Taxonomies&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;User-defined (any name, any structure)&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Only tags 
and categories, hardcoded&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Permalink patterns&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;{year}/{month}/{day}/{slug}/{categories}&lt;/code&gt; etc.&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Only &lt;code&gt;{slug}&lt;/code&gt; and &lt;code&gt;{section}&lt;/code&gt;&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Templates&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Full Tera engine with ~50+ functions, macros, inheritance&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Go &lt;code&gt;html/template&lt;/code&gt; with ~10 helper functions, flat define blocks&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Shortcodes&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;User-definable, parameterized&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Only 2 hardcoded: &lt;code&gt;details&lt;/code&gt; and &lt;code&gt;rawhtml&lt;/code&gt;&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Multilingual&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Full i18n: translations, language-switching, per-language content&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Explicitly a non-goal; language field only sets &lt;code&gt;&amp;lt;html lang&amp;gt;&lt;/code&gt; + text direction&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Asset pipeline&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Image resizing, SCSS compilation, fingerprinting, lazy-loading&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Static files copied 
as-is; no processing&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Page resources&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Co-located assets in page bundles (&lt;code&gt;index.md&lt;/code&gt; + images in same dir)&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;No concept of page resources&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Search&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Built-in elasticlunr index generation&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;None&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Internal linking&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;@/posts/hello.md&lt;/code&gt; with automatic path resolution&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Manual URLs only&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Table of contents&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Auto-generated from headings&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;None&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Config cascade&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;config.toml&lt;/code&gt;  + theme config&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;&lt;code&gt;config.toml&lt;/code&gt;  + theme config&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;CLI&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;code&gt;init&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt;, &lt;code&gt;serve&lt;/code&gt;, 
&lt;code&gt;check&lt;/code&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Only &lt;code&gt;build&lt;/code&gt;, &lt;code&gt;serve&lt;/code&gt;, &lt;code&gt;version&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Data files&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;TOML/JSON/YAML data loaded into templates&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;None&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Build caching&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Renders only changed content on rebuild&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;Renders everything, then diffs (slower)&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;strong&gt;Theme system&lt;/strong&gt;&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;Loadable themes with override chains&lt;/td&gt;&#xA;&lt;td style=&#34;text-align:left&#34;&gt;&lt;del&gt;No theme system (inline CSS + &lt;code&gt;[extra]&lt;/code&gt; values)&lt;/del&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;p&gt;Overall, Nida is a simple and lightweight static site generator that is designed to be easy to use and maintain. It doesn&#39;t have all the features of zola, but it has enough features for my needs, and it&#39;s much easier to maintain and extend in the future. If you&#39;re looking for a simple and lightweight static site generator that is easy to use, and your needs are close to mine, then Nida might be a good choice for you. If you&#39;re looking for a more powerful and feature-rich static site generator, then zola might be a better choice for you.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>Fixing TorchAO when Downgrading PyTorch for AITune</title>
    <link href="https://blog.melashri.net/posts/torchao-pytorch/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/torchao-pytorch/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-04-13T00:00:00Z</published>
    <updated>2026-04-13T00:00:00Z</updated>
    <summary>How I fixed TorchAO compatibility when downgrading PyTorch for AITune</summary>
    <content type="html">&lt;p&gt;Recently, I was working on accelerating inference for a custom &lt;code&gt;PyTorch&lt;/code&gt; model I have been working on writing its custom inference engine for sometime. I wanted to benchmark its performance using NVIDIA&#39;s newly released &lt;a href=&#34;https://github.com/ai-dynamo/aitune&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;AITune&lt;/a&gt; library. &lt;code&gt;AITune&lt;/code&gt; simplifies inference optimization by sweeping through different compilation strategies (like &lt;code&gt;TensorRTBackend&lt;/code&gt; and &lt;code&gt;TorchInductorBackend&lt;/code&gt;) to automatically find the highest throughput configuration. I said why not, lets try it and see what it would yield.&lt;/p&gt;&#xA;&lt;p&gt;Armed with my little &lt;em&gt;RTX 3090&lt;/em&gt;, I set up my virtual environment, installed the latest &lt;code&gt;PyTorch&lt;/code&gt; (the default uv installs for &lt;code&gt;aitune&lt;/code&gt; package) (&lt;code&gt;v2.11.0+cu130&lt;/code&gt;), fired up a quickly written benchmark script based on the quick start guide, and I immediately hit my first roadblock.&lt;/p&gt;&#xA;&lt;p&gt;When running the script, &lt;code&gt;PyTorch&lt;/code&gt; stubbornly fell back to compiling on the CPU with a familiar warning:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;UserWarning: CUDA initialization: The NVIDIA driver on your system is too old &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;found version 12080&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;But my environment was perfectly modern, but the host machine&#39;s system-level NVIDIA display driver only supported up to CUDA 12.8. Because &lt;code&gt;PyTorch 2.11&lt;/code&gt; installs with &lt;code&gt;cu130&lt;/code&gt; (CUDA 13.0) binaries by default, it refused to initialize the GPU. 
This is not the first time I have had this problem. It&#39;s part of why I hate working with NVIDIA GPUs. Anyway, instead of bothering the sysadmin to globally update the host drivers (because it&#39;s a lost cause if I need this done quickly), I took the standard path of least resistance: downgrading &lt;code&gt;PyTorch&lt;/code&gt;. I dropped &lt;code&gt;PyTorch&lt;/code&gt; back to &lt;code&gt;2.6.0+cu124&lt;/code&gt; to match my host&#39;s driver limit. The GPU was successfully detected, and I thought I was in the clear. That seems fine, right? I would like to put the famous &amp;quot;you did it like this, right&amp;quot; meme here but I won&#39;t.&lt;/p&gt;&#xA;&lt;p&gt;I ran the &lt;code&gt;AITune&lt;/code&gt; benchmark script again, only to have it violently crash the moment &lt;code&gt;AITune&lt;/code&gt; tried to initialize the &lt;code&gt;TorchAO&lt;/code&gt; backend.&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Traceback &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;most recent call last&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;:&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ...&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  File &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;/.venv/lib/python3.12/site-packages/torchao/utils.py&amp;#34;&lt;/span&gt;, line 45, in register_as_pytree_constant&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    torch.utils._pytree.register_constant&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;cls&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;AttributeError: module &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;torch.utils._pytree&amp;#39;&lt;/span&gt; has no attribute &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;register_constant&amp;#39;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;But wait, what is &lt;code&gt;torchao&lt;/code&gt;? It is obviously a dependency of &lt;code&gt;AITune&lt;/code&gt;, but why is it complaining that the torch pytree module doesn&#39;t have this attribute? I didn&#39;t change anything; I just downgraded &lt;code&gt;PyTorch&lt;/code&gt;. So let&#39;s see if there is a version mismatch. I tried different versions, and it seems older versions all hit this error. So when I downgraded &lt;code&gt;PyTorch&lt;/code&gt;, I didn&#39;t neatly downgrade all the secondary ecosystem dependencies. &lt;code&gt;AITune&lt;/code&gt; explicitly relies on the &lt;code&gt;torchao&lt;/code&gt; library to leverage quantization and inductor optimizations. The modern version of &lt;code&gt;torchao&lt;/code&gt; uses a decorator to register classes as PyTree constants for Dynamo&#39;s non-strict trace mode. However, the &lt;code&gt;register_constant&lt;/code&gt; API it was attempting to call in &lt;code&gt;torch.utils._pytree&lt;/code&gt; does not exist in &lt;code&gt;PyTorch&lt;/code&gt; 2.6.0 or earlier. It&#39;s a newer addition. Because &lt;code&gt;torchao&lt;/code&gt; blindly assumes the API is present, running it on an older &lt;code&gt;PyTorch&lt;/code&gt; backend triggers a fatal &lt;code&gt;AttributeError&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;So I was between the devil and the deep blue sea. I couldn&#39;t use the latest &lt;code&gt;PyTorch&lt;/code&gt; because of driver limitations, but I couldn&#39;t use the older &lt;code&gt;PyTorch&lt;/code&gt; because of &lt;code&gt;torchao&lt;/code&gt; compatibility issues. So, I did what every self-respecting programmer shouldn&#39;t do: a surgical local patch to make &lt;code&gt;torchao&lt;/code&gt; gracefully handle the missing API on older &lt;code&gt;PyTorch&lt;/code&gt; installations.
I opened &lt;code&gt;.venv/lib/python3.12/site-packages/torchao/utils.py&lt;/code&gt; in my virtual environment&#39;s site-packages, where the offending code was located, and navigated to the culprit decorator.&lt;/p&gt;&#xA;&lt;p&gt;So I changed the &lt;code&gt;register_as_pytree_constant&lt;/code&gt; function definition from this:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;register_as_pytree_constant&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;:&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-sa&#34;&gt;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;Decorator to register a class as a pytree constant for dynamo non-strict trace mode.&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;utils&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;_pytree&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;register_constant&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;return&lt;/span&gt; &lt;span
class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;to something safer that doesn&#39;t break older &lt;code&gt;PyTorch&lt;/code&gt; versions by assuming too much:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;register_as_pytree_constant&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;:&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-sa&#34;&gt;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;Decorator to register a class as a pytree constant for dynamo non-strict trace mode.&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;# Add a conditional check to prevent crashes on PyTorch &amp;lt; 2.6 versions&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;# where register_constant does not exist.&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;z-nb&#34;&gt;hasattr&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;utils&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span
class=&#34;z-n&#34;&gt;_pytree&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-sa&#34;&gt;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;register_constant&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;:&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;utils&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;_pytree&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;register_constant&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;z-bp&#34;&gt;cls&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;After saving this small fix, &lt;code&gt;AITune&lt;/code&gt; ran flawlessly, compiling the model down to the &lt;code&gt;TorchInductorBackend&lt;/code&gt; and drastically reducing my model&#39;s inference latency. If you are working in environments where strict system driver limits force you to run older versions of &lt;code&gt;PyTorch&lt;/code&gt;, keep an eye out for edge-case crashes in secondary libraries like &lt;code&gt;torchao&lt;/code&gt;, &lt;code&gt;torchvision&lt;/code&gt;, or &lt;code&gt;torchaudio&lt;/code&gt;.
Because the &lt;code&gt;torch.utils&lt;/code&gt; and Dynamo APIs in &lt;code&gt;PyTorch&lt;/code&gt; are rapidly evolving, downstream libraries sometimes (or often, very often) forget to handle backward compatibility gracefully!&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>JSON formatting in browser is useful</title>
    <link href="https://blog.melashri.net/posts/json-formatter/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/json-formatter/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-04-11T00:00:00Z</published>
    <updated>2026-04-11T00:00:00Z</updated>
    <summary>I wrote my own JSON formatter Firefox extension for my personal use. The motivation is safety. I don&#39;t want to use an extension from an unknown source to format JSON in the browser. I also want to have a simple and clean interface for formatting JSON.</summary>
    <content type="html">&lt;p&gt;Today I came across the news that a &lt;a href=&#34;https://github.com/callumlocke/json-formatter&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;famous Chrome extension&lt;/a&gt; for JSON formatting changed its business model and instead of being useful free extension, it became adware where it injects ads into many pages. This is actually not uncommon for free extensions specially those on Chrome extension store. Many extensions are free, but they make money by injecting ads into pages or by selling user data.&lt;/p&gt;&#xA;&lt;p&gt;On Firefox, the situation is a bit better, but still there are many extensions that are not trustworthy. I actually have the default JSON viewer in Firefox and I didn&#39;t think before about installing a JSON formatter extension. It is actually tempted, I deal with JSON data a fair amount of time and having to copy or download it to open it with VSCode or some other tool is a bit annoying. But I was hesitant to install an extension from an unknown source or even from a known source that might change its business model or just sell out its users in the future.&lt;/p&gt;&#xA;&lt;p&gt;Then I thought, how difficult would it be actually to write my own JSON formatter extension for Firefox? I have some experience with web development and Firefox extensions, and so I know that writing a browser extension is not that hard. So I decided to give it a try. I started by defining the scope and the user target which is basically myself. I want a simple and clean interface for formatting JSON, and I will not even distribute it to others via the AMO store (will use personal distribution channel to sign it though). I just want to have it for my personal use, so I would just focus on what features I would need and how to implement them without worrying about the user experience or the security implications of distributing it to others. 
Although I will still try to make it secure and not do anything that could harm my browser or introduce an attack vector.&lt;/p&gt;&#xA;&lt;p&gt;So I sat down with a clear mental checklist of what I actually wanted:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Syntax highlighting for &lt;code&gt;keys&lt;/code&gt;, &lt;code&gt;strings&lt;/code&gt;, &lt;code&gt;numbers&lt;/code&gt;, &lt;code&gt;booleans&lt;/code&gt;, and &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;&#xA;&lt;li&gt;Collapsible tree view with level controls&lt;/li&gt;&#xA;&lt;li&gt;Hover any property to see its full JSON path, click to pin it, then copy&lt;/li&gt;&#xA;&lt;li&gt;Expand or collapse all children of an object with one click&lt;/li&gt;&#xA;&lt;li&gt;Three view modes: Tree, Formatted, and Raw (because sometimes I just want to see the raw JSON without any formatting)&lt;/li&gt;&#xA;&lt;li&gt;Copy JSON to clipboard (although I hesitated because I thought I would need to grant it clipboard permissions)&lt;/li&gt;&#xA;&lt;li&gt;Light, dark, and auto themes (just to help my eyes; I don&#39;t care about themes that much, but a dark mode is nice at nighttime because looking at a bright screen at night hurts my eyes)&lt;/li&gt;&#xA;&lt;li&gt;Indent guidelines with hover highlighting&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;As you can see, nothing fancy here, nothing I don&#39;t use. The entire project ended up being just four source files: a manifest, a content script, a stylesheet, and a build script. No frameworks, no bundlers, no dependencies. Just vanilla JavaScript (which I still hate so much) and some CSS.&lt;/p&gt;&#xA;&lt;p&gt;The first thing I ran into was something I didn&#39;t expect: Firefox already has a built-in JSON viewer, and it intercepts JSON pages before any extension content script can touch them. I loaded my first attempt, navigated to a &lt;code&gt;.json&lt;/code&gt; URL, and… Firefox&#39;s own viewer showed up instead. 
Actually, to be honest, I knew about the built-in JSON viewer, but I thought it would be possible to override it with an extension. It turned out that Firefox&#39;s JSON viewer is not implemented as an extension, and it takes precedence over any extension content script. So I had to find a way to disable the built-in viewer in order to test my extension.&lt;/p&gt;&#xA;&lt;p&gt;It turns out this is a known issue. Firefox&#39;s native JSON viewer runs at a lower level than the WebExtensions content script pipeline. Even if your extension matches &lt;code&gt;&amp;lt;all_urls&amp;gt;&lt;/code&gt; and runs at &lt;code&gt;document_start&lt;/code&gt;, the browser catches the response first and renders it. Looking at how other popular Firefox JSON viewer extensions handle this, the answer was unanimous: they all (at least the couple I checked) instruct users to manually disable Firefox&#39;s built-in viewer in &lt;code&gt;about:config&lt;/code&gt; by toggling &lt;code&gt;devtools.jsonview.enabled&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt;. It&#39;s a one-time setting change, and it&#39;s the standard approach across the entire Firefox extension ecosystem for this category. Once that&#39;s done, the page returns raw JSON text and the content script can take over.&lt;/p&gt;&#xA;&lt;p&gt;With that solved, the detection logic became straightforward. The content script checks whether the page&#39;s content type is JSON, or whether the page is a single &lt;code&gt;&amp;lt;pre&amp;gt;&lt;/code&gt; element containing parsable JSON. If either is true, it clears the page and builds the viewer from scratch.&lt;/p&gt;&#xA;&lt;p&gt;Then comes the tree rendering question. The tree renderer is recursive. It walks the &lt;code&gt;JSON&lt;/code&gt; structure and creates a &lt;code&gt;DOM&lt;/code&gt; node for each value. Objects and arrays get a toggle arrow and a hidden children container. Primitives get styled spans with syntax highlighting classes. 
The indent guidelines are simply the border-left on nested &lt;code&gt;.jf-children&lt;/code&gt; containers, so they scale naturally with depth and never overlap with the actual content.&lt;/p&gt;&#xA;&lt;p&gt;Path tracking is handled by storing the full JSON path (like &lt;code&gt;data.users[0].name&lt;/code&gt;) as a &lt;code&gt;data-path&lt;/code&gt; attribute on each line. On hover, the path appears in a bar at the bottom. Click the key to pin it. There&#39;s a copy button right next to it.&lt;/p&gt;&#xA;&lt;p&gt;It all seems very straightforward, until you think about the security implications. Even though this is a personal extension, I figured it was worth doing a proper security review. After all, the content script is injected into every page and processes arbitrary JSON from any server. Here&#39;s what I found and fixed:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;XSS via &lt;code&gt;innerHTML&lt;/code&gt;: The initial version used template literals with &lt;code&gt;innerHTML&lt;/code&gt; to render object headers like Array(5). That&#39;s a problem if the bracket characters or count value ever came from untrusted data. I replaced it entirely with DOM methods, &lt;code&gt;textContent&lt;/code&gt; and &lt;code&gt;appendChild&lt;/code&gt;, so no string ever passes through the HTML parser.&lt;/li&gt;&#xA;&lt;li&gt;Stack overflow on deeply nested JSON. The recursive tree renderer would crash the tab on a 50,000-level-deep object. I added a depth limit of 5,000, which is more than enough for any real-world JSON and prevents the browser from hanging.&lt;/li&gt;&#xA;&lt;li&gt;No size limit on parsing. A malicious server could serve a 500 MB JSON blob. I added a 10 MB cap before &lt;code&gt;JSON.parse&lt;/code&gt; is even called.&lt;/li&gt;&#xA;&lt;li&gt;&lt;code&gt;escapeHTML&lt;/code&gt; was actually corrupting data. I had written an HTML escaping function and was applying it to string values rendered via &lt;code&gt;textContent&lt;/code&gt;. 
But &lt;code&gt;textContent&lt;/code&gt; is inherently safe from XSS; it never parses HTML. The escaping was actually causing &lt;code&gt;&amp;amp;&lt;/code&gt; to display as &lt;code&gt;&amp;amp;amp;&lt;/code&gt; and &lt;code&gt;&amp;lt;&lt;/code&gt; as &lt;code&gt;&amp;amp;lt;&lt;/code&gt;, which was a data corruption bug. I removed the function entirely.&lt;/li&gt;&#xA;&lt;li&gt;External links in JSON values. String values that look like URLs get rendered as clickable links. I initially used a regex to check for &lt;code&gt;http://&lt;/code&gt; or &lt;code&gt;https://&lt;/code&gt;, but regex-based URL validation is fragile. I switched to using the native &lt;code&gt;URL()&lt;/code&gt; constructor with a protocol check, and added &lt;code&gt;rel=&amp;quot;noopener noreferrer&amp;quot;&lt;/code&gt; on every link to prevent &lt;a href=&#34;https://en.wikipedia.org/wiki/Tabnabbing&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;tabnabbing&lt;/a&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;And as I said before, no &lt;code&gt;clipboardWrite&lt;/code&gt; permission is needed, because Firefox allows clipboard writes from user click handlers by default. So the extension runs with zero elevated permissions. No background scripts, no storage permissions. The content script does everything and only runs on JSON pages. So that&#39;s fine for me.&lt;/p&gt;&#xA;&lt;p&gt;I also wanted a clean build process. I set up a &lt;code&gt;Makefile&lt;/code&gt; with shell scripts that verify and lint the &lt;code&gt;manifest.json&lt;/code&gt; structure, check that all icons are valid PNGs, and package everything into a &lt;code&gt;.xpi&lt;/code&gt; file. A GitHub Actions workflow triggers on version tags, runs the same checks, and creates a release with both the &lt;code&gt;.xpi&lt;/code&gt; and an &lt;code&gt;updates.json&lt;/code&gt; file that Firefox uses to auto-deploy updates to installed users (basically me). 
For the AMO signing submission, I added &lt;code&gt;data_collection_permissions&lt;/code&gt; with required: &lt;code&gt;[&amp;quot;none&amp;quot;]&lt;/code&gt; to declare that the extension collects no data, and bumped the minimum Firefox version to 140 since that&#39;s when this manifest key was introduced.&lt;/p&gt;&#xA;&lt;p&gt;This was about a few hours of work, and I now have a JSON viewer that does exactly what I need, with zero ads, zero telemetry, and zero trust required in a third-party developer. If I ever want to add a feature, I change it myself. If I ever want to remove one, I remove it. The source code is right there, and it&#39;s small enough that I can understand every line in minutes. The full source is on GitHub at &lt;a href=&#34;https://github.com/MohamedElashri/json-formatter&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;MohamedElashri/JSON-formatter&lt;/a&gt;. If anyone wants to use it, contributions are welcome, but I won&#39;t be publishing it on AMO or any other store. You will need to build it yourself and load it as a temporary extension in Firefox. Instructions are in the README.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>make: the automation layer I put on everything</title>
    <link href="https://blog.melashri.net/posts/makefile/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/makefile/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-04-08T00:00:00Z</published>
    <updated>2026-04-08T00:00:00Z</updated>
    <summary>Why I reach for a Makefile before almost anything else, and why it has nothing to do with compiling C++.</summary>
    <content type="html">&lt;p&gt;There is a peculiar assumption that follows &lt;code&gt;make&lt;/code&gt; around: that it belongs to the C/C++ world, to Autotools nightmares and &lt;code&gt;./configure &amp;amp;&amp;amp; make &amp;amp;&amp;amp; make install&lt;/code&gt; rituals. That it is ancient infrastructure, tolerated rather than chosen. I want to push back on that, not from a theoretical standpoint but from the very practical one of someone who has spent years writing C++ for particle physics and somehow ended up also wrangling Node.js build pipelines, Python packaging, static site generators, Docker stacks, and LaTeX documents, sometimes all in the same afternoon.&lt;/p&gt;&#xA;&lt;p&gt;The honest origin of my &lt;code&gt;Makefile&lt;/code&gt; habit is memory failure. I have a genuinely bad memory for commands. Not concepts, I can hold the mental model of a system well enough, but the specific invocation, the flags, the ordering of arguments, the environment variable that needs to be set before the script will run without silently doing the wrong thing. This is fine in a world where you only ever touch one kind of project. It becomes a problem the moment you have to context-switch. When I am deep in LHCb analysis work, running &lt;code&gt;DaVinci&lt;/code&gt;, managing CERN EOS paths, submitting grid jobs, I am in a certain mental register. Then I need to push a post to my blog, which runs on &lt;code&gt;Zola&lt;/code&gt; and lives in a Git repo and has its own little deployment dance. Or I need to update a Python package and remember whether I&#39;m supposed to use &lt;code&gt;uv sync&lt;/code&gt; or &lt;code&gt;python -m build&lt;/code&gt; and which virtual environment is active and whether I need to bump the version manually first. My brain does not want to load that context. It wants to type &lt;code&gt;make&lt;/code&gt; and move on with my day.&lt;/p&gt;&#xA;&lt;p&gt;This is exactly what a &lt;code&gt;Makefile&lt;/code&gt; provides. 
It is not a build system in the way most people mean that phrase. It is a memory externalization device. It is a place where I write down, once, the precise incantation for every non-trivial operation in a project, and then label each one with a short English word. &lt;code&gt;make test&lt;/code&gt;. &lt;code&gt;make docs&lt;/code&gt;. &lt;code&gt;make clean&lt;/code&gt;. &lt;code&gt;make publish&lt;/code&gt;. The fact that &lt;code&gt;make&lt;/code&gt; has been on every Unix system since the 70s and requires zero installation is almost beside the point, though it is a genuinely pleasant property when you&#39;re SSH&#39;d into a new machine at 2am.&lt;/p&gt;&#xA;&lt;p&gt;The phony target pattern is particularly underappreciated here. Once you accept that &lt;code&gt;.PHONY&lt;/code&gt; targets are just named shell procedures with dependency resolution, the whole tool reframes itself. You are not building anything. You are writing a tiny, self-documenting task runner that comes pre-installed everywhere and has a forty-year track record of not changing its interface. Compare that to the current state of JavaScript build tooling, which I say with great affection for the ecosystem but also with the weariness of someone who came from a world where &lt;code&gt;g++&lt;/code&gt; has been spelled &lt;code&gt;g++&lt;/code&gt; since before I was born. I have encountered projects that moved from &lt;code&gt;Grunt&lt;/code&gt; to &lt;code&gt;Gulp&lt;/code&gt; to &lt;code&gt;Webpack&lt;/code&gt; to &lt;code&gt;Vite&lt;/code&gt; to &lt;code&gt;Turbopack&lt;/code&gt; across their lifespan. My &lt;code&gt;Makefile&lt;/code&gt; from five years ago still runs. I find this deeply comforting.&lt;/p&gt;&#xA;&lt;p&gt;The dependency mechanism is where things get genuinely interesting even beyond the simple task-runner use case. 
The fact that &lt;code&gt;make&lt;/code&gt; checks timestamps means I can write rules that only regenerate expensive outputs when their inputs have changed, without writing any of that logic myself. For analysis work this matters: if I have a rule that runs a fitting script over data and produces plots, I do not want it to re-run every time I type &lt;code&gt;make plots&lt;/code&gt;. I want it to re-run when the script changes, or when the input data changes, and not otherwise. &lt;code&gt;make&lt;/code&gt; handles this natively, elegantly, in a syntax that is admittedly &lt;em&gt;arcane&lt;/em&gt; but learnable in an afternoon.&lt;/p&gt;&#xA;&lt;p&gt;I keep a &lt;code&gt;Makefile&lt;/code&gt; in almost every project now: my blog repository, my analysis code, my MCP server projects, my LaTeX documents, the little utility scripts I maintain. The structure is almost always the same: a &lt;code&gt;help&lt;/code&gt; target at the top that prints the available targets and their descriptions (a simple &lt;code&gt;grep&lt;/code&gt; on double-hash comments works perfectly), then the targets grouped loosely by concern. I have a template I copy in when starting something new, which takes about thirty seconds. The cost of entry is extremely low. The payoff, being able to return to a project after three months and immediately remember how to do anything with it, is consistently high.&lt;/p&gt;&#xA;&lt;p&gt;There is one more thing I appreciate that is harder to articulate. A &lt;code&gt;Makefile&lt;/code&gt; is readable in the same way a good configuration file is readable: as documentation. When I open a project I have not touched in a while, the &lt;code&gt;Makefile&lt;/code&gt; tells me what the meaningful operations on that project are. Not what is &lt;em&gt;possible&lt;/em&gt;, that is what the source code is for, but what I actually &lt;em&gt;do&lt;/em&gt; with it. It is an interface description, an operator&#39;s manual, a reminder written by past-me to current-me. 
For someone who moves between enough different systems and contexts, that small act of writing it down is not optional. It is how the work gets done at all.&lt;/p&gt;&#xA;&lt;p&gt;There might be better solutions and alternatives, but I have not found one that fits my workflow as well as a &lt;code&gt;Makefile&lt;/code&gt; does. It does one thing well, and it has been doing it for decades.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>uv is a miracle for scientific computing</title>
    <link href="https://blog.melashri.net/posts/uv-computing/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/uv-computing/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-04-03T00:00:00Z</published>
    <updated>2026-04-03T00:00:00Z</updated>
    <summary>uv is a fast Python package and environment manager that makes it easier to manage dependencies and avoid conflicts. It is especially useful for scientific computing, where different projects may require different versions of libraries and tools.</summary>
    <content type="html">&lt;p&gt;The promise of Python in scientific computing has always been tempered by a quiet frustration. We write elegant algorithms, train sophisticated models, and orchestrate petabytes of detector data, yet the moment another researcher attempts to reproduce our workflow, the environment fractures. Dependency graphs collapse under conflicting version pins. Platform specific binaries problems become apparent. The analysis that ran flawlessly on a university cluster refuses to initialize on a collaborator workstation. This is not merely an inconvenience. It is a reproducibility crisis that quietly erodes trust in computational results. In high energy physics, where analyses span millions of simulated events and require precise alignment between legacy frameworks and modern inference/analysis libraries, the cost of environmental drift is measured in months of debugging, duplicated effort, and occasionally retracted preliminary results.&lt;/p&gt;&#xA;&lt;p&gt;The Python ecosystem evolved through accretion rather than coordinated design. Package managers multiplied. Virtual environment tools diverged. Lockfile standards emerged piecemeal. Researchers learned to navigate a landscape where &lt;code&gt;pip&lt;/code&gt;, &lt;code&gt;conda&lt;/code&gt;, &lt;code&gt;poetry&lt;/code&gt;, and &lt;code&gt;virtualenv&lt;/code&gt; each claimed to solve a different slice of the problem. The reality in large collaborations is that none of them scale gracefully to the full dependency tree of a modern analysis. &lt;code&gt;Conda&lt;/code&gt; handles compiled binaries well but suffers from very slow (measured in hours) resolution and channel inconsistencies. &lt;code&gt;Poetry&lt;/code&gt; delivers elegant project configuration but struggles with scientific packages that rely on complex &lt;code&gt;C extensions&lt;/code&gt; or platform specific &lt;code&gt;wheels&lt;/code&gt;. 
&lt;code&gt;Pip&lt;/code&gt; combined with &lt;code&gt;requirements.txt&lt;/code&gt; files offers flexibility but abandons reproducibility the moment a transitive dependency updates. The result is a fragile workflow culture. We document environments in fragile shell snippets. We containerize to paper over the cracks. We tell graduate students to run exactly the same commands and hope for the best.&lt;/p&gt;&#xA;&lt;p&gt;Into this landscape arrived &lt;code&gt;uv&lt;/code&gt;, a package manager and project orchestrator written in &lt;code&gt;Rust&lt;/code&gt;. It does not attempt to reinvent Python packaging. Instead it consolidates the fragmented workflow into a single fast resolver, a unified virtual environment manager, and a strict lockfile format (&lt;code&gt;uv.lock&lt;/code&gt;). The design philosophy is refreshingly pragmatic. Installation happens in milliseconds. Dependency resolution happens in seconds. The tool respects PEP 621 project metadata and embraces the newer PEP 751 lockfile specification, which guarantees that every package, every cryptographic hash, and every platform constraint is captured deterministically. What makes &lt;code&gt;uv&lt;/code&gt; particularly compelling for scientific computing is not merely its speed. It is the way it treats reproducibility as a first class constraint rather than an afterthought.&lt;/p&gt;&#xA;&lt;p&gt;Consider a typical &lt;code&gt;ATLAS&lt;/code&gt; or &lt;code&gt;LHCb&lt;/code&gt; analysis workflow. The stack begins with &lt;code&gt;uproot&lt;/code&gt; and &lt;code&gt;awkward&lt;/code&gt; for columnar data access (for those not using &lt;code&gt;ROOT/PyROOT&lt;/code&gt;), moves through &lt;code&gt;vector&lt;/code&gt; for four momentum calculations, incorporates &lt;code&gt;scikit-hep&lt;/code&gt; components like &lt;code&gt;pyhf&lt;/code&gt; for statistical inference, and often integrates &lt;code&gt;torch&lt;/code&gt; or &lt;code&gt;jax&lt;/code&gt; for machine learning classifiers. 
Each of these packages carries binary extensions, CUDA bindings, or architecture specific optimizations. Traditional environment managers routinely stall when resolving overlapping constraints between &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;scipy&lt;/code&gt;, and &lt;code&gt;cuda&lt;/code&gt; enabled &lt;code&gt;pytorch&lt;/code&gt;. &lt;code&gt;uv&lt;/code&gt; approaches this differently. It leverages a globally cached wheel store, downloads prebuilt binaries when available, and falls back to source builds only when necessary. The resolver operates on a strict version graph that eliminates the silent upgrades that plague long running analyses. I have watched a pyhf based likelihood fit that previously required three hours of conda environment reconstruction complete in under two minutes. More importantly, the lockfile produced by &lt;code&gt;uv&lt;/code&gt; travels with the analysis repository. A colleague at a different institute can fetch the same commit, run a single command, and obtain an identical environment. The reproducibility guarantee is mathematical rather than aspirational.&lt;/p&gt;&#xA;&lt;p&gt;Evaluating &lt;code&gt;uv&lt;/code&gt; against established tools requires acknowledging the trade offs inherent in each design. &lt;code&gt;Conda&lt;/code&gt; remains indispensable for packages that lack Python wheel distributions, particularly certain legacy Fortran wrapped libraries or specialized detector simulation tools. &lt;code&gt;Mamba&lt;/code&gt; improved resolution speed but inherited the same channel fragmentation and solver complexity. &lt;code&gt;Poetry&lt;/code&gt; offers beautiful configuration files but frequently stumbles on scientific packages that distribute wheels outside standard indexes or require non standard build backends. &lt;code&gt;Pipenv&lt;/code&gt; attempted to unify pip and virtualenv but matured slowly and introduced its own resolution ambiguities. 
&lt;code&gt;uv&lt;/code&gt; distinguishes itself by accepting the reality of modern Python distribution. It integrates seamlessly with existing PyPI indexes, supports private repositories, and respects environment markers without demanding custom configuration. Where &lt;code&gt;conda&lt;/code&gt; isolates itself in a separate universe, &lt;code&gt;uv&lt;/code&gt; operates as a drop in replacement for pip while providing the reproducibility guarantees that large collaborations actually need. The learning curve is negligible because the commands mirror familiar patterns. The performance difference is transformative because the resolver avoids backtracking through legacy version constraints.&lt;/p&gt;&#xA;&lt;p&gt;No tool is universally optimal. &lt;code&gt;uv&lt;/code&gt; currently assumes that most dependencies are available as wheels or standard source distributions. Projects that rely heavily on &lt;code&gt;conda&lt;/code&gt; only channels or proprietary binary blobs still require hybrid approaches. The scientific computing community also needs time to adapt to strict lockfile workflows, particularly when experimental dependencies change frequently or when researchers need to test bleeding edge commits. Yet these are transitional constraints rather than fundamental flaws. The toolchain is maturing rapidly, and the adoption curve in research groups is accelerating precisely because the alternative costs so much in lost time and irreproducible results.&lt;/p&gt;&#xA;&lt;p&gt;I have spent years watching analysis working groups struggle with environment drift. We have patched containers, documented fragile shell scripts, and accepted that computational reproducibility is often a compromise. &lt;code&gt;uv&lt;/code&gt; does not eliminate the complexity of scientific software. It does, however, remove the arbitrary friction that has long obscured it. 
By treating dependency resolution as a deterministic engineering problem rather than an interpretive exercise, it restores the foundational promise of open scientific computing. The next generation of high energy physics analyses will likely generate petabytes of data and rely on increasingly sophisticated inference pipelines. We cannot afford to let environment management remain the weakest link in that chain. &lt;code&gt;uv&lt;/code&gt; offers a path forward where the code we share is the code that actually runs, and where reproducibility is no longer a hope we document but a guarantee we ship.&lt;/p&gt;&#xA;&lt;p&gt;However, the recent OpenAI acquisition of Astral, the company behind &lt;code&gt;uv&lt;/code&gt;, raises questions about the long term sustainability of the project. Open source tools thrive on community involvement and transparent governance. The scientific computing community should actively engage with the developers, contribute to the codebase, and advocate for an open development model that ensures &lt;code&gt;uv&lt;/code&gt; remains a reliable tool for researchers worldwide. I don&#39;t trust OpenAI to maintain a tool that is critical for scientific reproducibility, which casts doubt on the long term viability of &lt;code&gt;uv&lt;/code&gt;. But let&#39;s hope that OpenAI will continue to support the project and that the scientific community will rally around it to ensure its longevity.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>My Sad Thoughts Appear to be Greyware</title>
    <link href="https://blog.melashri.net/micro/sad-throughts-graywall/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/sad-throughts-graywall/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-24T00:00:00Z</published>
    <updated>2026-03-24T00:00:00Z</updated>
    <summary>A story about how a firewall blocked my Arabic static blog as Greyware.</summary>
    <content type="html">&lt;p&gt;I maintain a static blog in Arabic where I don&#39;t publish its link anywhere, and it doesn&#39;t belong to sitemap.xml and is not available on the search engines. I usually write about my thoughts and feelings in this blog in my own native language, it served another purpose in the past but no longer. I just write there from time to time, every couple of months, and I don&#39;t share it with anyone. Although I read it myself from time to time, just to remember how I felt in the past and how I was thinking about things. It is not super personal, and it doesn&#39;t include personal details or anything.&lt;/p&gt;&#xA;&lt;p&gt;I usually access from my phone during commute or on my computer at home (usually after 12 AM because this is when I have level of Loneliness and sadness that makes me want to write). I have been doing this for years and I never had any problem accessing it, until recently. I was staying late at my office at CERN and overstayed until it was very late, I just wanted to take a break and the thought of reading my blog came to my mind, I opened the link, and I was greeted with this message:&lt;/p&gt;&#xA;&lt;img src=&#34;/images/micro/greyware/greyware_b.png&#34; alt=&#34;Greyware Block Message&#34; style=&#34;width: 100%;&#34;&gt;  &#xA;&lt;p&gt;My first reaction was what the hell is the greyware? I have never heard of it before so I had to search about it, and I found out that it is a type of software that is not necessarily malicious but can be used for harmful purposes, and it is often blocked by firewalls. There are not much that people would say about it. It boils down to that it is not a virus or malware, but it is something that can be used for bad purposes and might cause harm to the users, so it is blocked by firewalls as a precautionary measure. 
So this basically would apply to social media, news outlets, and any website that serves ads and collects data about its users.&lt;/p&gt;&#xA;&lt;p&gt;Maybe also websites of propaganda and fake news. But my sad and lonely blog, which I don&#39;t share with anyone and which doesn&#39;t serve any ads or collect any data about its visitors -and it doesn&#39;t have any visitors except me- is blocked (Yes, I know it is a false positive, but it is still sad). I don&#39;t even know how the firewall here gets its list of greyware. It is funny, but at least I learned something new about the absurdity of the current internet situation. Of course, I could just contact the CERN Security Team and open a ServiceNow ticket to ask them to unblock it, but I don&#39;t think I will do that. In part because I will not waste someone&#39;s time on something that is not important, and in part because I would feel embarrassed to ask them to read it just to confirm it is not harmful. I will just live with the fact that my sad thoughts are considered greyware at my place of work and blocked by the firewall, and I will just access it from my phone when I am outside the office.&lt;/p&gt;&#xA;&lt;p&gt;Disclaimer: I am not blaming CERN for this. I understand that they have to protect their network and block anything that might be harmful, and I am sure that they have a good reason for blocking it. It is just a funny observation. But on the other hand, I have very &lt;em&gt;sad/dark&lt;/em&gt; writing in some of my posts there, so yes, I can see how it could be harmful :).&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>Mattermost is my new comment system</title>
    <link href="https://blog.melashri.net/micro/comments/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/comments/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-20T00:00:00Z</published>
    <updated>2026-03-20T00:00:00Z</updated>
    <summary>I don&#39;t use comments on my blog. People use email and CERN Mattermost to send their comments.</summary>
    <content type="html">&lt;p&gt;For a brief amount of time, I had a proper comment section on my blog. It was based on selfhosting &lt;a href=&#34;https://posativ.org/isso/&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;isso&lt;/a&gt;. But it was a pain to handle the spam and it was rarely used by anyone. So I decided to remove it. Some of that is because although it was there, people used to contact me through email or through mattermost with CERN folks who are the majority of people who will find my blog by chance searching for something similar.&lt;/p&gt;&#xA;&lt;p&gt;So that I got some valuable suggestions, information and discussions through these two channels, It was not my expectation but I am happy with it. Of course email is used by people who don&#39;t know about &lt;code&gt;melashri&lt;/code&gt; in CERN mattermost which is technically used by CERN folks (including the majority of CERN users which I&#39;m one of them).&lt;/p&gt;&#xA;&lt;p&gt;I will not have any comment system and I like it this way. My blog is niche and personal and the traffic to it is two digits on average per day. I understand this and I post here for my own sake and for the sake of those who might find it useful.&lt;/p&gt;&#xA;&lt;p&gt;The fact also that I don&#39;t use any social media makes this more interesting, No one would reach me through Twitter, Facebook, Mastodon, Bluesky, Threads, etc. I don&#39;t maintain any presence on these platforms or any other platform for that matter. onI would like to keep it this way. So the only way to reach me is through email or CERN mattermost. Or maybe by chance you can find me at CERN R1 (The center of universe) eating my Margaritha pizza in a way that will anger most of my italian friends.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: LocalCommand</title>
    <link href="https://blog.melashri.net/micro/local-command/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/local-command/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-17T00:00:00Z</published>
    <updated>2026-03-17T00:00:00Z</updated>
    <summary>ssh_config can run a command on your machine after every successful connection, and it has access to all the connection metadata</summary>
    <content type="html">&lt;p&gt;This is the last post in this series, at least for now. I started last week with the idea of reading through &lt;code&gt;man ssh&lt;/code&gt; and &lt;code&gt;man ssh_config&lt;/code&gt; daily and writing about what I found.&lt;br&gt;&#xA;Eight posts later, I&#39;ve covered &lt;a href=&#34;/micro/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt;, &lt;a href=&#34;/micro/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;, &lt;a href=&#34;/micro/match-version/&#34;&gt;Match version&lt;/a&gt;,&lt;br&gt;&#xA;&lt;a href=&#34;/micro/match-sessiontype/&#34;&gt;Match sessiontype&lt;/a&gt;, &lt;a href=&#34;/micro/escape-sequences/&#34;&gt;Escape Sequences&lt;/a&gt;, &lt;a href=&#34;/micro/add-keys-to-agent/&#34;&gt;AddKeysToAgent&lt;/a&gt;, and &lt;a href=&#34;/micro/control-master/&#34;&gt;ControlMaster&lt;/a&gt;.&lt;br&gt;&#xA;I&#39;m pausing the daily posting, between my physics analysis work and GPU work, daily posts aren&#39;t realistic right now. I&#39;ll probably pick this back up later, there&#39;s no shortage of material in these man pages.&lt;/p&gt;&#xA;&lt;p&gt;So let&#39;s move to today&#39;s final entry to the list, &lt;code&gt;LocalCommand&lt;/code&gt; and I will make it quicker. So lets begin with what the man page say:&lt;/p&gt;&#xA;&lt;p&gt;From the man page:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;LocalCommand&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    Specifies a &lt;span class=&#34;z-nb&#34;&gt;command&lt;/span&gt; to execute on the &lt;span class=&#34;z-nb&#34;&gt;local&lt;/span&gt; machine after&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    successfully connecting to the server.  
The &lt;span class=&#34;z-nb&#34;&gt;command&lt;/span&gt; string&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    extends to the end of the line, and is executed with the&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    user&lt;span class=&#34;z-err&#34;&gt;&amp;#39;&lt;/span&gt;s shell.  Arguments to LocalCommand accept the tokens&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    described in the TOKENS section.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    The &lt;span class=&#34;z-nb&#34;&gt;command&lt;/span&gt; is run synchronously and does not have access&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    to the session of the ssh&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt; that spawned it.  It should&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    not be used &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; interactive commands.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    This directive is ignored unless PermitLocalCommand has&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    been enabled.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Two things make this interesting. 
First, it&#39;s a post-connection hook, it fires on our local machine after ssh has fully authenticated with the remote server.&lt;br&gt;&#xA;Second, it has access to &lt;em&gt;all&lt;/em&gt; the tokens: &lt;code&gt;%h&lt;/code&gt; (remote hostname), &lt;code&gt;%r&lt;/code&gt; (remote user), &lt;code&gt;%p&lt;/code&gt; (remote port), &lt;code&gt;%n&lt;/code&gt; (hostname as typed on the command line), &lt;code&gt;%d&lt;/code&gt; (local home directory), &lt;code&gt;%l&lt;/code&gt; (local hostname),&lt;br&gt;&#xA;&lt;code&gt;%u&lt;/code&gt; (local username), and more. This gives us a programmable callback that knows the full context of the connection that just happened.&lt;/p&gt;&#xA;&lt;p&gt;To use it, we need to explicitly enable it:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Without &lt;code&gt;PermitLocalCommand yes&lt;/code&gt;, all &lt;code&gt;LocalCommand&lt;/code&gt; directives are silently ignored. This is off by default for good reason, we don&#39;t want arbitrary commands running as a side effect of ssh-ing somewhere,&lt;br&gt;&#xA;especially if system-wide ssh_config could be modified by an admin.&lt;/p&gt;&#xA;&lt;p&gt;The simplest use is logging. 
Say we want to keep a record of every SSH connection we make:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    LocalCommand &lt;span class=&#34;z-nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-k&#34;&gt;$(&lt;/span&gt;date &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;+%Y-%m-%d %H:%M:%S&amp;#39;&lt;/span&gt;&lt;span class=&#34;z-k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt; %u@%l -&amp;gt; %r@%h:%p&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt; &amp;gt;&amp;gt; ~/.ssh/connection.log&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Every time we connect, a line gets appended to &lt;code&gt;~/.ssh/connection.log&lt;/code&gt; with a timestamp, who we are, and where we went.&lt;br&gt;&#xA;Useful for auditing our own activity or debugging &amp;quot;when did I last connect to that machine.&amp;quot;&lt;/p&gt;&#xA;&lt;p&gt;We can use it for notifications. 
For example, on macOS:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host production-*&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    LocalCommand osascript -e &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;display notification &amp;#34;Connected to %h as %r&amp;#34; with title &amp;#34;SSH&amp;#34;&amp;#39;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Every time we connect to a production host, we get a system notification. A small thing, but it adds a moment of awareness when we&#39;re about to do something on a live system.&lt;/p&gt;&#xA;&lt;p&gt;A more practical pattern: syncing something after connecting. If we keep dotfiles or a specific config on remote hosts:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host dev-*&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    LocalCommand rsync -q ~/.vimrc %r@%h:.vimrc&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;This pushes our local &lt;code&gt;.vimrc&lt;/code&gt; to the remote host every time we connect. The command runs synchronously before we get our shell prompt, so by the time we&#39;re typing on the remote machine, the file is already there.&lt;br&gt;&#xA;Keep in mind this adds latency to our connection: a quick rsync is fine, anything heavy is not.&lt;/p&gt;&#xA;&lt;p&gt;There are important limitations. The command runs synchronously and blocks the connection until it completes. If it hangs, our ssh session hangs. 
If it fails, we still get connected,&lt;br&gt;&#xA;&lt;code&gt;LocalCommand&lt;/code&gt; failures don&#39;t abort the SSH connection. The command doesn&#39;t have access to the SSH session itself: it can&#39;t read from or write to the remote shell.&lt;br&gt;&#xA;It&#39;s a fire-and-forget local action that happens to know about the connection.&lt;/p&gt;&#xA;&lt;p&gt;Also, &lt;code&gt;LocalCommand&lt;/code&gt; only fires for interactive sessions by default. If we run &lt;code&gt;ssh host ls /tmp&lt;/code&gt;, or use &lt;code&gt;scp&lt;/code&gt; or &lt;code&gt;sftp&lt;/code&gt;, the command won&#39;t execute.&lt;br&gt;&#xA;We can combine this with &lt;code&gt;Match sessiontype&lt;/code&gt; from earlier in this series if we want it to fire only for specific session types, or avoid firing for others.&lt;/p&gt;&#xA;&lt;p&gt;We can scope it per host as we&#39;d expect:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *.cern.ch&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    LocalCommand &lt;span class=&#34;z-nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;z-k&#34;&gt;$(&lt;/span&gt;date &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;+%Y-%m-%d %H:%M:%S&amp;#39;&lt;/span&gt;&lt;span class=&#34;z-k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt; -&amp;gt; %r@%h&lt;/span&gt;&lt;span class=&#34;z-s2&#34;&gt;&amp;#34;&lt;/span&gt; &amp;gt;&amp;gt; ~/.ssh/cern.log&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host bastion&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span 
class=&#34;z-cl&#34;&gt;    PermitLocalCommand yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    LocalCommand &lt;span class=&#34;z-nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;Jumped through bastion&amp;#34;&lt;/span&gt; &amp;gt;&amp;gt; ~/.ssh/connection.log&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The &lt;code&gt;PermitLocalCommand&lt;/code&gt; + &lt;code&gt;LocalCommand&lt;/code&gt; pair is one of those features that&#39;s been in ssh_config for a long time but barely anyone uses. It&#39;s not revolutionary on its own, but it&#39;s a building block.&lt;br&gt;&#xA;The fact that it has access to all the connection tokens means we can wire it up to whatever local tooling makes sense for our workflow, logging, notifications, syncing, triggering scripts, updating a status file.&lt;br&gt;&#xA;It&#39;s the closest thing ssh_config has to a plugin system.&lt;/p&gt;&#xA;&lt;p&gt;That&#39;s it for this series. There are topics I didn&#39;t get to, &lt;code&gt;CanonicalizeHostname&lt;/code&gt;, &lt;code&gt;UpdateHostKeys&lt;/code&gt;, &lt;code&gt;KnownHostsCommand&lt;/code&gt;, &lt;code&gt;RemoteCommand&lt;/code&gt;, &lt;code&gt;VisualHostKey&lt;/code&gt;, the full &lt;code&gt;TOKENS&lt;/code&gt; section, SSH certificates, and plenty more.&lt;br&gt;&#xA;The man pages are dense with things worth knowing if we can push past just looking up flags.&lt;br&gt;&#xA;The whole point of writing these was to force myself to actually read the pages, and it worked,&lt;br&gt;&#xA;I&#39;ve already changed my own ssh_config based on what I found. Maybe that&#39;s enough of a reason to pick it back up when things quiet down.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: ControlMaster</title>
    <link href="https://blog.melashri.net/micro/control-master/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/control-master/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-16T00:00:00Z</published>
    <updated>2026-03-16T00:00:00Z</updated>
    <summary>One TCP connection, many SSH sessions, multiplexing makes everything faster after the first login</summary>
    <content type="html">&lt;p&gt;Seventh post in a series where I read through SSH man pages and write about things worth knowing. Previous posts: &lt;a href=&#34;/micro/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt;, &lt;a href=&#34;/micro/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;, &lt;a href=&#34;/micro/match-version/&#34;&gt;Match version&lt;/a&gt;, &lt;a href=&#34;/micro/match-sessiontype/&#34;&gt;Match sessiontype&lt;/a&gt;, &lt;a href=&#34;/micro/escape-sequences/&#34;&gt;Escape Sequences&lt;/a&gt;, &lt;a href=&#34;/micro/add-keys-to-agent/&#34;&gt;AddKeysToAgent&lt;/a&gt;. Today I want to do back to the classics, I will talk about specific directives: &lt;code&gt;ControlMaster&lt;/code&gt;, &lt;code&gt;ControlPath&lt;/code&gt;, and &lt;code&gt;ControlPersist&lt;/code&gt;, They are basically what we call SSH connection multiplexing directives.&lt;/p&gt;&#xA;&lt;p&gt;Every time we run &lt;code&gt;ssh host&lt;/code&gt;, the client does a lot of work: TCP handshake, key exchange, authentication, possibly agent forwarding negotiation. If we connect to the same host ten times in a row, maybe because we&#39;re running &lt;code&gt;scp&lt;/code&gt;, then we open a shell, then doing &lt;code&gt;rsync&lt;/code&gt;, then another &lt;code&gt;scp&lt;/code&gt;, that entire dance happens ten times. On a high-latency link, this adds up fast. On hosts where authentication involves MFA or Kerberos ticket negotiation (hello, &lt;em&gt;lxplus&lt;/em&gt;), it&#39;s actively painful.&lt;/p&gt;&#xA;&lt;p&gt;SSH multiplexing lets us reuse a single connection for multiple sessions. The first connection does the full handshake and authentication. Every subsequent connection to the same host piggybacks on it, new sessions open nearly instantly because they skip the entire TCP and crypto setup. Three directives control this. 
&lt;code&gt;ControlMaster&lt;/code&gt; enables it, &lt;code&gt;ControlPath&lt;/code&gt; tells ssh where to put the Unix domain socket used for multiplexing, and &lt;code&gt;ControlPersist&lt;/code&gt; keeps the master connection alive after the first session closes.&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ControlMaster auto&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ControlPath ~/.ssh/sockets/%r@%h-%p&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ControlPersist 10m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;&lt;code&gt;ControlMaster auto&lt;/code&gt; means: if no master connection exists, become one; if one already exists, reuse it. The &lt;code&gt;auto&lt;/code&gt; value is what we want for everyday use. There&#39;s also &lt;code&gt;yes&lt;/code&gt; (always be master, never reuse) and &lt;code&gt;no&lt;/code&gt; (never be master), but &lt;code&gt;auto&lt;/code&gt; handles both cases.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;ControlPath&lt;/code&gt; is the location of the Unix socket file. The &lt;code&gt;%r&lt;/code&gt;, &lt;code&gt;%h&lt;/code&gt;, and &lt;code&gt;%p&lt;/code&gt; tokens expand to the remote user, hostname, and port, so we get one socket per unique connection target. We&#39;ll want to create the directory first:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;mkdir -p ~/.ssh/sockets&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;If we skip this, ssh will fail with a somewhat cryptic error about not being able to create the socket. 
The path needs to be short enough to fit within the OS limit for Unix socket paths (typically 104 or 108 bytes), so don&#39;t put it in a deeply nested directory.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;ControlPersist 10m&lt;/code&gt; means the master connection stays alive for 10 minutes after the last session using it disconnects. Without &lt;code&gt;ControlPersist&lt;/code&gt;, the master connection dies when the first session that created it exits, which defeats the purpose if we&#39;re opening and closing sessions frequently. Setting it to &lt;code&gt;yes&lt;/code&gt; makes the master persist indefinitely (until we manually kill it or the server disconnects). A time value like &lt;code&gt;10m&lt;/code&gt; or &lt;code&gt;1h&lt;/code&gt; is usually more sensible.&lt;/p&gt;&#xA;&lt;p&gt;The speed difference is noticeable. On a connection to a host behind a VPN with ~100ms latency, the first &lt;code&gt;ssh host&lt;/code&gt; takes a few seconds for key exchange and authentication. Subsequent connections while the master is alive take under 200ms. For scripting workflows that SSH into the same host repeatedly, deploy scripts, batch file copies, Ansible runs, this can dramatically reduce total runtime.&lt;/p&gt;&#xA;&lt;p&gt;There are a few gotchas. If the master connection dies uncleanly (network drops, laptop sleeps), the socket file can be left behind, and new connections will hang trying to connect to a dead socket before eventually falling back to a fresh connection. We can clean up stale sockets manually (&lt;code&gt;rm ~/.ssh/sockets/*&lt;/code&gt;) or use &lt;code&gt;ssh -O check host&lt;/code&gt; to test if a master is alive and &lt;code&gt;ssh -O exit host&lt;/code&gt; to shut one down cleanly.&lt;/p&gt;&#xA;&lt;p&gt;The &lt;code&gt;-O&lt;/code&gt; flag gives us manual control over the master. 
&lt;code&gt;ssh -O forward -L 8080:localhost:80 host&lt;/code&gt; adds a port forward to an existing master without opening a new session (similar to the &lt;code&gt;~C&lt;/code&gt; escape sequence from a few posts ago, but from the command line). &lt;code&gt;ssh -O cancel -L 8080:localhost:80 host&lt;/code&gt; removes it.&lt;/p&gt;&#xA;&lt;p&gt;One important security note: anyone who can access the socket file can piggyback on our multiplexed connection without authenticating. The socket is protected by filesystem permissions (it&#39;s in our &lt;code&gt;~/.ssh/&lt;/code&gt; directory, mode &lt;code&gt;0600&lt;/code&gt;), but on shared machines we should be aware this is how it works. If that&#39;s a concern, don&#39;t use multiplexing on hosts where others have root access.&lt;/p&gt;&#xA;&lt;p&gt;OpenSSH 10.0 made a relevant change here: &lt;code&gt;scp&lt;/code&gt; and &lt;code&gt;sftp&lt;/code&gt; now pass &lt;code&gt;ControlMaster no&lt;/code&gt; to ssh by default. Previously, if we had &lt;code&gt;ControlMaster auto&lt;/code&gt; in our config, running &lt;code&gt;scp&lt;/code&gt; or &lt;code&gt;sftp&lt;/code&gt; could accidentally create a master connection that then stuck around. Now they&#39;ll reuse an existing master but won&#39;t create one, which is generally the behavior we&#39;d expect from file transfer tools.&lt;/p&gt;&#xA;&lt;p&gt;I&#39;ve used multiplexing for years, and it&#39;s one of those things where once we set it up, we forget about it, everything just feels faster. The three lines in &lt;code&gt;ssh_config&lt;/code&gt; are all it takes. But I need to be careful about some security implications and edge cases, especially on shared machines or when I&#39;m using &lt;em&gt;lxplus&lt;/em&gt; as a hub for connecting to other machines.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: AddKeysToAgent</title>
    <link href="https://blog.melashri.net/micro/add-keys-to-agent/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/add-keys-to-agent/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-15T00:00:00Z</published>
    <updated>2026-03-15T00:00:00Z</updated>
    <summary>We should stop running ssh-add manually; ssh_config can do it for us, with a time limit and a confirmation prompt</summary>
    <content type="html">&lt;p&gt;Sixth post in a series where I read through SSH man pages and write about things that catch my eye. Previous posts: &lt;a href=&#34;/posts/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt;, &lt;a href=&#34;/posts/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;, &lt;a href=&#34;/posts/match-version/&#34;&gt;Match version&lt;/a&gt;, &lt;a href=&#34;/posts/match-sessiontype/&#34;&gt;Match sessiontype&lt;/a&gt;, &lt;a href=&#34;/posts/escape-sequences/&#34;&gt;Escape Sequences&lt;/a&gt;. Today lets move to something a little bit different: &lt;code&gt;AddKeysToAgent&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;The typical SSH key workflow looks like this: we generate a key with a passphrase, then we either type the passphrase every time we connect, or we run &lt;code&gt;ssh-add&lt;/code&gt; once to load the key into our agent so it stays decrypted in memory. Most people do the &lt;code&gt;ssh-add&lt;/code&gt; dance at the start of their session and don&#39;t think about it again. Some people skip the passphrase entirely because they find the workflow annoying, which is worse.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;AddKeysToAgent&lt;/code&gt; eliminates the manual step. It&#39;s a &lt;code&gt;ssh_config&lt;/code&gt; directive that tells the SSH client to automatically add keys to a running agent after successful authentication. You type our passphrase once on first use, and the key gets loaded into the agent. 
Subsequent connections use the agent and don&#39;t prompt.&lt;/p&gt;&#xA;&lt;p&gt;From the man page:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;AddKeysToAgent&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    Specifies whether keys should be automatically added to a&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    running ssh-agent&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;.  If this option is &lt;span class=&#34;z-nb&#34;&gt;set&lt;/span&gt; to yes and a&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    key is loaded from a file, the key and its passphrase are&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    added to the agent with the default lifetime, as &lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; by&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ssh-add&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;.  If this option is &lt;span class=&#34;z-nb&#34;&gt;set&lt;/span&gt; to ask, ssh&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt; will&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    require confirmation using the SSH_ASKPASS program before&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    adding a key &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;see ssh-add&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; details&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;.  
If this option&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    is &lt;span class=&#34;z-nb&#34;&gt;set&lt;/span&gt; to confirm, each use of the key must be confirmed,&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    as &lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; the -c option was specified to ssh-add&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;.  If this&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    option is &lt;span class=&#34;z-nb&#34;&gt;set&lt;/span&gt; to no, no keys are added to the agent.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    Alternately, this option may be specified as a &lt;span class=&#34;z-nb&#34;&gt;time&lt;/span&gt; interval&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;...&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt; to specify the key&lt;span class=&#34;z-err&#34;&gt;&amp;#39;&lt;/span&gt;s lifetime in ssh-agent&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;, after&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    which it will automatically be removed.  
The argument must&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    be no &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;the default&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;, yes, confirm &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;optionally followed by a&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-nb&#34;&gt;time&lt;/span&gt; interval&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;, ask or a &lt;span class=&#34;z-nb&#34;&gt;time&lt;/span&gt; interval.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The simplest version, we can just add it to our config for all hosts:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    AddKeysToAgent yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Now every key we use gets loaded into the agent on first use. No more &lt;code&gt;ssh-add&lt;/code&gt; at login. But &lt;code&gt;yes&lt;/code&gt; keeps the key in the agent forever (until the agent dies, or you manually remove it). That&#39;s fine on a personal laptop, but on shared machines or if we&#39;re forwarding our agent, we probably want more control.&lt;/p&gt;&#xA;&lt;p&gt;The time-limited variant is where it gets interesting, if we set it to a duration instead of &lt;code&gt;yes&lt;/code&gt;:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    AddKeysToAgent 1h&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The key gets added to the agent with a 1-hour lifetime. 
After that, it&#39;s automatically removed, and the next connection will prompt for the passphrase again. This is a good balance between convenience and security: our key isn&#39;t sitting decrypted in memory indefinitely. The time format supports &lt;code&gt;s&lt;/code&gt; (seconds), &lt;code&gt;m&lt;/code&gt; (minutes), &lt;code&gt;h&lt;/code&gt; (hours), &lt;code&gt;d&lt;/code&gt; (days), &lt;code&gt;w&lt;/code&gt; (weeks), and combinations like &lt;code&gt;1h30m&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Then there&#39;s &lt;code&gt;confirm&lt;/code&gt;:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    AddKeysToAgent confirm&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;This adds the key to the agent, but every time the key is used for authentication, the agent will pop up a confirmation dialog via &lt;code&gt;ssh-askpass&lt;/code&gt; before signing. We get a visual prompt each time something tries to use our key. This is particularly valuable if we forward our agent to remote hosts: it means a compromised remote machine can&#39;t silently use our key to hop further without us seeing the confirmation. We can combine &lt;code&gt;confirm&lt;/code&gt; with a time limit for even more control:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    AddKeysToAgent confirm 4h&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Here the key is loaded with both a confirmation requirement and a 4-hour expiry.&lt;/p&gt;&#xA;&lt;p&gt;The &lt;code&gt;ask&lt;/code&gt; option is different from &lt;code&gt;confirm&lt;/code&gt;. 
With &lt;code&gt;ask&lt;/code&gt;, we get prompted &lt;em&gt;before the key is added to the agent&lt;/em&gt;, it&#39;s a one-time gate on whether the key should be loaded at all. With &lt;code&gt;confirm&lt;/code&gt;, the key is loaded immediately, but we&#39;re prompted on every subsequent &lt;em&gt;use&lt;/em&gt;. These serve different purposes: &lt;code&gt;ask&lt;/code&gt; protects against accidentally loading a key, &lt;code&gt;confirm&lt;/code&gt; protects against unauthorized use of an already-loaded key.&lt;/p&gt;&#xA;&lt;p&gt;There are a few practical notes. &lt;code&gt;AddKeysToAgent&lt;/code&gt; requires a running &lt;code&gt;ssh-agent&lt;/code&gt; with &lt;code&gt;SSH_AUTH_SOCK&lt;/code&gt; set in our environment. If no agent is available, the option is silently ignored, and we just get the normal passphrase prompt each time. On macOS, the system agent is usually running already. On Linux, it depends on our desktop environment or how we start our session — most modern setups (&lt;code&gt;systemd&lt;/code&gt; user sessions, &lt;code&gt;GNOME&lt;/code&gt;, etc.) launch an agent automatically.&lt;/p&gt;&#xA;&lt;p&gt;For the &lt;code&gt;confirm&lt;/code&gt; option to work, we need an &lt;code&gt;ssh-askpass&lt;/code&gt; program installed. On macOS, &lt;code&gt;ssh-askpass&lt;/code&gt; isn&#39;t included by default, but there are several available (like &lt;code&gt;ssh-askpass-mac&lt;/code&gt; from Homebrew). On Linux with a desktop environment, it&#39;s usually available through the system package manager.&lt;/p&gt;&#xA;&lt;p&gt;The combination I&#39;ve settled on for my own config is &lt;code&gt;confirm 8h&lt;/code&gt; on machines where I forward my agent, and a plain &lt;code&gt;4h&lt;/code&gt; everywhere else. The timeout means I re-authenticate at least once a workday, and the confirmation on forwarded sessions means nothing uses my key without me seeing it.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: Escape Sequences</title>
    <link href="https://blog.melashri.net/micro/escape-sequences/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/escape-sequences/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-14T00:00:00Z</published>
    <updated>2026-03-14T00:00:00Z</updated>
    <summary>Your SSH session has a hidden command interface and the most useful part was quietly disabled</summary>
    <content type="html">&lt;p&gt;Fifth post in a series where I read through SSH man pages and write about things worth knowing. Previous posts: &lt;a href=&#34;/posts/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt;, &lt;a href=&#34;/posts/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;, &lt;a href=&#34;/posts/match-version/&#34;&gt;Match version&lt;/a&gt;, &lt;a href=&#34;/posts/match-sessiontype/&#34;&gt;Match sessiontype&lt;/a&gt;. Today, I&#39;m stepping out of &lt;code&gt;ssh_config&lt;/code&gt; and into &lt;code&gt;man ssh&lt;/code&gt; itself: &lt;code&gt;escape sequences&lt;/code&gt;. Every interactive SSH session has a hidden control interface. Press Enter, then &lt;code&gt;~&lt;/code&gt;, then &lt;code&gt;?&lt;/code&gt;, and you&#39;ll see it:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Supported escape sequences:&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~.   
- terminate connection &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;and any multiplexed sessions&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~B   - send a BREAK to the remote system&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~C   - open a &lt;span class=&#34;z-nb&#34;&gt;command&lt;/span&gt; line&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~R   - request rekey&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~V/v - decrease/increase verbosity &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;LogLevel&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~^Z  - &lt;span class=&#34;z-nb&#34;&gt;suspend&lt;/span&gt; ssh&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~#   - list forwarded connections&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~&lt;span class=&#34;z-p&#34;&gt;&amp;amp;&lt;/span&gt;   - background ssh &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;when waiting &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; connections to terminate&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~?   
- this message&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;  ~~   - send the escape character by typing it twice&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;Note that escapes are only recognized immediately after newline.&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;That last line is the critical detail. Escape sequences only work at the beginning of a line, right after you press Enter (or at the very start of the session). If you type &lt;code&gt;hello~.&lt;/code&gt;, nothing happens. You need &lt;code&gt;Enter&lt;/code&gt;, then &lt;code&gt;~.&lt;/code&gt;. This trips me up every time.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;~.&lt;/code&gt; is the one everyone should know. When your SSH session freezes, the network dropped, the remote host crashed, a firewall somewhere timed out the connection, &lt;code&gt;Ctrl+C&lt;/code&gt; does nothing because the client is waiting on a dead TCP socket. &lt;code&gt;~.&lt;/code&gt; kills the client side immediately. No stuck terminal, no hunting for PIDs. I use this multiple times a week. If you take nothing else from this post, take &lt;code&gt;~.&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;~^Z&lt;/code&gt; (tilde then &lt;code&gt;Ctrl+Z&lt;/code&gt;) suspends the SSH session and drops you back to your local shell, same as suspending any process. &lt;code&gt;fg&lt;/code&gt; brings it back. Useful when you need to quickly check something locally without opening a new terminal.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;~#&lt;/code&gt; lists all active forwarded connections. 
If you set up port forwards with &lt;code&gt;-L&lt;/code&gt; or &lt;code&gt;-R&lt;/code&gt; or &lt;code&gt;-D&lt;/code&gt; and want to see what&#39;s actually connected, this shows you.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;~V&lt;/code&gt; and &lt;code&gt;~v&lt;/code&gt; adjust the SSH client&#39;s log verbosity on the fly. Each &lt;code&gt;~v&lt;/code&gt; bumps it up one level (INFO → VERBOSE → DEBUG → DEBUG2 → DEBUG3), and &lt;code&gt;~V&lt;/code&gt; takes it back down. This is great for debugging a connection problem mid-session without having to disconnect and reconnect with &lt;code&gt;-vvv&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Now the interesting one: &lt;code&gt;~C&lt;/code&gt;. This opens a &lt;code&gt;ssh&amp;gt;&lt;/code&gt; command prompt where you can add or remove port forwards on a live session:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ssh&amp;gt; -L 8080:localhost:80&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Forwarding port.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ssh&amp;gt; -D &lt;span class=&#34;z-m&#34;&gt;1080&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Dynamic forwarding port.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ssh&amp;gt; -KL &lt;span class=&#34;z-m&#34;&gt;8080&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Canceled forwarding.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;You can add local forwards (&lt;code&gt;-L&lt;/code&gt;), remote forwards (&lt;code&gt;-R&lt;/code&gt;), dynamic SOCKS proxies (&lt;code&gt;-D&lt;/code&gt;), and cancel any of them with the &lt;code&gt;-K&lt;/code&gt; prefix. 
No need to drop your session and reconnect with different flags. This is genuinely powerful for the kind of ad-hoc tunneling you do when you&#39;re deep into debugging something on a remote machine and realize you need access to another port.&lt;/p&gt;&#xA;&lt;p&gt;Here&#39;s the catch. Since OpenSSH 9.2 (February 2023), the &lt;code&gt;~C&lt;/code&gt; command line is &lt;strong&gt;disabled by default&lt;/strong&gt;. If you try it, you&#39;ll see &lt;code&gt;commandline disabled&lt;/code&gt;. The OpenSSH team added a new &lt;code&gt;EnableEscapeCommandline&lt;/code&gt; option that defaults to &lt;code&gt;no&lt;/code&gt;. The rationale is that disabling the command line allows tighter sandboxing of the SSH client process on platforms that support it. And to get it back:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    EnableEscapeCommandline yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Or per-connection: &lt;code&gt;ssh -o EnableEscapeCommandline=yes host&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;All the other escape sequences (&lt;code&gt;~.&lt;/code&gt;, &lt;code&gt;~^Z&lt;/code&gt;, &lt;code&gt;~#&lt;/code&gt;, &lt;code&gt;~V/v&lt;/code&gt;, etc.) still work regardless of this setting. Only &lt;code&gt;~C&lt;/code&gt; is gated behind &lt;code&gt;EnableEscapeCommandline&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;One more thing about nested sessions. If you&#39;re SSH&#39;d into host A, and from there SSH&#39;d into host B, &lt;code&gt;~.&lt;/code&gt; will kill the connection to host A (taking host B down with it). To send the escape to the inner session instead, you double the tilde: &lt;code&gt;~~.&lt;/code&gt; kills only the connection to host B. Triple for three levels deep, and so on. 
The same applies to all escape sequences, each additional &lt;code&gt;~&lt;/code&gt; pushes the escape one hop further in.&lt;/p&gt;&#xA;&lt;p&gt;The escape character itself is configurable via the &lt;code&gt;EscapeChar&lt;/code&gt; directive in ssh_config, or &lt;code&gt;-e&lt;/code&gt; on the command line. Setting it to &lt;code&gt;none&lt;/code&gt; disables escape sequences entirely, which you&#39;d want for binary-transparent connections where tilde characters in the data stream could be misinterpreted.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: Match sessiontype</title>
    <link href="https://blog.melashri.net/micro/match-sessiontype/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/match-sessiontype/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-13T00:00:00Z</published>
    <updated>2026-03-13T00:00:00Z</updated>
    <summary>Apply different ssh_config settings depending on whether you&#39;re running a shell, a command, sftp, or just forwarding</summary>
    <content type="html">&lt;p&gt;This is the fourth post in a series where I read through &lt;code&gt;man ssh_config&lt;/code&gt; and pick out things worth knowing. Previous posts: &lt;a href=&#34;/micro/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt;, &lt;a href=&#34;/micro/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;, &lt;a href=&#34;/micro/match-version/&#34;&gt;Match version&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Today, we will be looking at the &lt;code&gt;Match sessiontype&lt;/code&gt; option. This one was also introduced in OpenSSH 10.0 (April 2025), alongside &lt;code&gt;Match version&lt;/code&gt; that I wrote about yesterday. Where &lt;code&gt;Match version&lt;/code&gt; lets you branch config based on which OpenSSH you&#39;re running, &lt;code&gt;Match sessiontype&lt;/code&gt; lets you branch based on &lt;em&gt;what kind of session&lt;/em&gt; you&#39;re about to start.&lt;/p&gt;&#xA;&lt;p&gt;From the man page:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;The sessiontype keyword matches the requested session type,&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;which may be one of shell &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; interactive sessions, &lt;span class=&#34;z-nb&#34;&gt;exec&lt;/span&gt; &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-nb&#34;&gt;command&lt;/span&gt; execution sessions, subsystem &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; subsystem invocations&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;such as sftp&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;, or none &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; transport-only sessions, such 
as&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;when ssh&lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;1&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt; is started with the -N flag.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;There are four possible values: &lt;code&gt;shell&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;, &lt;code&gt;subsystem&lt;/code&gt;, and &lt;code&gt;none&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;This matters because not all SSH sessions are the same, and you&#39;ve probably wanted different behavior for different kinds of sessions without having a clean way to express it. Before this, your &lt;code&gt;ssh_config&lt;/code&gt; options applied uniformly, the same timeouts, the same forwarding settings, the same keystroke obfuscation, regardless of whether you were opening an interactive shell or just running &lt;code&gt;ssh host rsync ...&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Here&#39;s a practical example. 
Say you want &lt;code&gt;ObscureKeystrokeTiming&lt;/code&gt; on for interactive sessions (where it&#39;s protecting your typing patterns) but off for command execution and file transfers (where it just adds overhead and can interfere with throughput):&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match sessiontype shell&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match sessiontype &lt;span class=&#34;z-nb&#34;&gt;exec&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming no&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match sessiontype subsystem&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming no&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Or consider forwarding-only sessions. When you run &lt;code&gt;ssh -N -L 8080:localhost:80 host&lt;/code&gt;, you&#39;re not opening a shell at all, you&#39;re just setting up a tunnel (useful for CERN folks using &lt;code&gt;lxtunnel&lt;/code&gt;). The session type here is &lt;code&gt;none&lt;/code&gt;. 
You might want longer timeouts for these since there&#39;s no interactive typing to generate traffic, just the forwarded connection:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match sessiontype none&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ServerAliveInterval &lt;span class=&#34;z-m&#34;&gt;60&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ServerAliveCountMax &lt;span class=&#34;z-m&#34;&gt;10&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;And you can combine &lt;code&gt;sessiontype&lt;/code&gt; with other Match predicates. If you want specific settings only for &lt;code&gt;sftp&lt;/code&gt; to a particular set of hosts:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match sessiontype subsystem host &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;storage*.example.com&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    Compression yes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The &lt;code&gt;subsystem&lt;/code&gt; type covers &lt;code&gt;sftp&lt;/code&gt; because &lt;code&gt;sftp&lt;/code&gt; is invoked as an SSH subsystem. If you use &lt;code&gt;scp&lt;/code&gt; in its newer SFTP mode (which has been the default since OpenSSH 9.0), that also counts as a subsystem session. 
If you use &lt;code&gt;scp&lt;/code&gt; with the legacy SCP/RCP protocol via &lt;code&gt;scp -O&lt;/code&gt;, that&#39;s an &lt;code&gt;exec&lt;/code&gt; session instead, because it runs a remote command.&lt;/p&gt;&#xA;&lt;p&gt;One thing to note: there&#39;s also a &lt;code&gt;SessionType&lt;/code&gt; &lt;em&gt;directive&lt;/em&gt; (not a Match predicate) that&#39;s been in &lt;code&gt;ssh_config&lt;/code&gt; since OpenSSH 8.7. That one is a setting you apply: it tells ssh what kind of session to request, equivalent to the &lt;code&gt;-N&lt;/code&gt; and &lt;code&gt;-s&lt;/code&gt; flags. The &lt;code&gt;Match sessiontype&lt;/code&gt; predicate is different: it&#39;s a condition you test against. The naming overlap is a bit confusing, but they do different things. &lt;code&gt;SessionType none&lt;/code&gt; in a Host block means &amp;quot;don&#39;t request a shell on this host.&amp;quot; &lt;code&gt;Match sessiontype none&lt;/code&gt; means &amp;quot;if we&#39;re about to start a transport-only session, apply these settings.&amp;quot;&lt;/p&gt;&#xA;&lt;p&gt;The combination of &lt;code&gt;Match version&lt;/code&gt;, &lt;code&gt;Match sessiontype&lt;/code&gt;, and the existing &lt;code&gt;Match host&lt;/code&gt;/&lt;code&gt;Match exec&lt;/code&gt;/&lt;code&gt;Match localnetwork&lt;/code&gt; predicates makes ssh_config surprisingly expressive now. You can build a single config file that adapts its behavior based on where you&#39;re connecting from, what version of OpenSSH you&#39;re running, and what you&#39;re about to do — all without any external scripting or config generation.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>CUDA Memory Safety Problem</title>
    <link href="https://blog.melashri.net/posts/cuda-rust-memory-safety/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/cuda-rust-memory-safety/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-12T00:00:00Z</published>
    <updated>2026-03-12T00:00:00Z</updated>
    <summary>GPU programming is stuck in the 1990s when it comes to memory safety. What would it look like if we could write CUDA kernels in safe Rust?</summary>
    <content type="html">&lt;p&gt;Every CUDA programmer has a story about a silent corruption bug that cost them days. A kernel that writes past the end of a shared memory buffer. An off-by-one in a thread index calculation that stomps on another warp&#39;s data. A race condition in a reduction that only manifests at specific block sizes. These bugs don&#39;t &lt;em&gt;segfault&lt;/em&gt;. They don&#39;t throw exceptions. They just quietly produce wrong results, and you don&#39;t notice until your GPU-resident trigger is silently dropping interesting events at 30 MHz because a vertex reconstruction kernel read garbage from a misaligned buffer, or your neural network&#39;s loss suddenly goes to &lt;code&gt;NaN&lt;/code&gt; on the 400th epoch.&lt;/p&gt;&#xA;&lt;p&gt;The dirty secret of GPU programming is that CUDA&#39;s memory model is essentially C with extra dimensions. You get raw pointers, manual memory management, and a threading model so complex that even experienced developers routinely introduce data races. The CUDA toolkit gives you &lt;code&gt;cuda-memcheck&lt;/code&gt; and &lt;code&gt;compute-sanitizer&lt;/code&gt;, but these are runtime tools that catch only what they observe during execution. They miss the bugs that hide behind specific occupancy levels or input sizes. And they&#39;re slow enough that nobody runs them in production workloads.&lt;/p&gt;&#xA;&lt;p&gt;This isn&#39;t a theoretical concern. In my own work integrating a deep neural network into a GPU-resident trigger framework for particle physics, the hardest bugs to track down were never algorithmic. They were memory layout mismatches: &lt;code&gt;SoA&lt;/code&gt; data coming in one stride pattern, a &lt;code&gt;cuDNN&lt;/code&gt; call expecting another, and an intermediate buffer silently reading garbage because nothing in the type system prevented it. The compiler was perfectly happy. The kernel launched fine. 
The output was just wrong in ways that took careful numerical debugging to isolate.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;What would &amp;quot;safe CUDA&amp;quot; even mean?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;If you think about what Rust&#39;s ownership model buys you on the CPU side, the core value proposition is straightforward: the compiler proves, at compile time, that your program is free of data races and use-after-free bugs. The trade-off is a steeper learning curve and occasional fights with the borrow checker. The payoff is that entire categories of bugs become impossible (we&#39;re not going to talk about unsafe Rust code here, because that&#39;s a different topic).&lt;/p&gt;&#xA;&lt;p&gt;Now imagine applying that to GPU programming. A hypothetical safe Rust CUDA dialect would need to solve several problems that don&#39;t exist on the CPU side.&lt;/p&gt;&#xA;&lt;p&gt;First, there&#39;s the host-device boundary. Today, &lt;code&gt;cudaMemcpy&lt;/code&gt; is just a raw pointer copy with a direction enum. There&#39;s nothing preventing you from copying &lt;code&gt;4MB&lt;/code&gt; into a &lt;code&gt;2MB&lt;/code&gt; buffer. A safe abstraction would encode buffer sizes in the type system, making overflow a compile-time error rather than a silent corruption. Projects like &lt;code&gt;cudarc&lt;/code&gt; in the Rust ecosystem already do a version of this, wrapping device allocations in typed containers. But they stop at the kernel boundary, because the kernels themselves are still written in CUDA C++.&lt;/p&gt;&#xA;&lt;p&gt;Second, there&#39;s shared memory. CUDA shared memory is declared with &lt;code&gt;__shared__&lt;/code&gt; and accessed by all threads in a block. It&#39;s a fixed-size scratchpad with zero access control. Two warps writing to the same shared memory location is a data race, and CUDA gives you nothing to prevent it except &lt;code&gt;__syncthreads()&lt;/code&gt; calls that you have to place manually. 
A Rust-flavored model could express this differently. Imagine shared memory as a type that can only be accessed through a lending pattern where a warp group borrows a slice, operates on it, hits a barrier, and then the borrow expires. The compiler would reject code that tries to read shared memory across a barrier boundary without proper synchronization. This is conceptually similar to how Rust&#39;s &lt;code&gt;RwLock&lt;/code&gt; works, but enforced statically through the type system rather than at runtime.&lt;/p&gt;&#xA;&lt;p&gt;Third, there&#39;s the thread hierarchy itself. CUDA&#39;s grid/block/thread model means that index calculations are everywhere, and they&#39;re a constant source of out-of-bounds bugs. Safe Rust already solved this for CPU arrays with bounds checking (which you can opt out of with &lt;code&gt;get_unchecked&lt;/code&gt; in unsafe blocks). A GPU analog would bounds-check thread-indexed accesses by default, with an explicit unsafe escape hatch for the performance-critical inner loops where you&#39;ve proven correctness by other means.&lt;/p&gt;&#xA;&lt;p&gt;But why hasn&#39;t this been done yet? The practical barriers are significant. NVIDIA controls the CUDA compiler toolchain, and their incentive is ecosystem lock-in, not memory safety. The PTX ISA that sits underneath CUDA is a reasonable compilation target, and projects like &lt;code&gt;rust-gpu&lt;/code&gt;  have demonstrated that you can compile Rust to GPU shader languages. But targeting PTX from safe Rust while preserving the full CUDA programming model, including shared memory, warp-level primitives, cooperative groups, and the memory hierarchy, is a much harder problem than compiling pixel shaders.&lt;/p&gt;&#xA;&lt;p&gt;There&#39;s also a performance question. Rust&#39;s bounds checks on array access are cheap on a CPU, but in a GPU kernel running across thousands of threads, any per-access overhead multiplies fast. 
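The lending pattern for shared memory can also be sketched in ordinary Rust, again as a hypothetical API: the closure is the "phase", the end of the call is where the barrier would sit, and the borrow checker guarantees no reference to the tile survives past it.

```rust
// Hypothetical sketch of scoped shared-memory access (plain Rust, no GPU).
// A warp group "borrows" the tile for one phase; the borrow ends at the
// implicit barrier, so no reference can leak across synchronization points.
struct SharedTile {
    buf: Vec<f32>, // stands in for a __shared__ array
}

impl SharedTile {
    fn new(len: usize) -> Self {
        Self { buf: vec![0.0; len] }
    }

    // Lend the tile out for one phase of the kernel.
    fn with_borrow<R>(&mut self, phase: impl FnOnce(&mut [f32]) -> R) -> R {
        let result = phase(&mut self.buf);
        // __syncthreads() would sit here in a real kernel: the phase is
        // over before anyone else can observe the tile again.
        result
    }
}

fn main() {
    let mut tile = SharedTile::new(4);
    tile.with_borrow(|s| {
        for (i, v) in s.iter_mut().enumerate() {
            *v = 2.0 * i as f32; // phase 1: fill
        }
    });
    let sum: f32 = tile.with_borrow(|s| s.iter().sum()); // phase 2: reduce
    assert_eq!(sum, 12.0);
}
```

On a single thread this is just what `&mut` already enforces; the open question is whether the same discipline can be compiled down to block-wide barriers across thousands of threads without runtime cost.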
A practical safe CUDA dialect would need to guarantee zero overhead for provably-safe access patterns, only inserting runtime checks where the compiler can&#39;t prove safety. This is an active research area even in CPU Rust, and the GPU adds dimensions of complexity.&lt;/p&gt;&#xA;&lt;p&gt;The closest things we have today are half-measures. &lt;code&gt;cudarc&lt;/code&gt; and &lt;code&gt;rustacuda&lt;/code&gt; provide safe host-side management of device memory. &lt;code&gt;rust-gpu&lt;/code&gt; compiles Rust to &lt;code&gt;SPIR-V&lt;/code&gt; for graphics shaders. The &lt;code&gt;krnl&lt;/code&gt; crate attempts safe kernel authoring but covers only a subset of what CUDA offers. None of these give you the full experience of writing a complex kernel, with shared memory, warp shuffles, and cooperative groups, in a memory-safe language.&lt;/p&gt;&#xA;&lt;p&gt;I don&#39;t think NVIDIA will build this. It would have to come from the community, probably building on LLVM&#39;s existing NVPTX backend. The realistic path is a Rust &lt;code&gt;proc-macro&lt;/code&gt; or embedded DSL that generates PTX, with a type system layer on top that enforces memory safety at the kernel level. It wouldn&#39;t need to cover every CUDA feature on day one. Even handling just global memory access and shared memory synchronization safely would eliminate the most common class of GPU bugs.&lt;/p&gt;&#xA;&lt;p&gt;For real-time scientific applications especially, where correctness matters as much as throughput, this would be transformative. When your GPU trigger is the first filter deciding which collision events survive for downstream analysis, and it&#39;s running at tens of MHz with no human in the loop, &amp;quot;it ran without crashing&amp;quot; is not a sufficient correctness criterion. 
Having a compiler that can prove your memory accesses are well-defined would let you focus on the reconstruction algorithms instead of chasing silent corruption through hex dumps of device memory.&lt;/p&gt;&#xA;&lt;p&gt;The tools aren&#39;t there yet. But the need is clear, and the Rust ecosystem has a habit of eventually building the thing that everyone said was too hard.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: Match version</title>
    <link href="https://blog.melashri.net/micro/match-version/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/match-version/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-12T00:00:00Z</published>
    <updated>2026-03-12T00:00:00Z</updated>
    <summary>ssh_config can now conditionally apply settings based on your OpenSSH version</summary>
    <content type="html">&lt;p&gt;This is the third post in a series where I read through &lt;code&gt;man ssh_config&lt;/code&gt; and write about things I find interesting. Previous posts covered &lt;a href=&#34;/micro/obscure-keystroke-timing/&#34;&gt;ObscureKeystrokeTiming&lt;/a&gt; and &lt;a href=&#34;/micro/channel-timeout/&#34;&gt;ChannelTimeout&lt;/a&gt;. Today, let&#39;s talk about &lt;code&gt;Match version&lt;/code&gt;, a new directive that lets you conditionally apply configuration blocks based on the OpenSSH version. This is a game-changer for anyone who manages multiple machines with varying OpenSSH versions.&lt;/p&gt;&#xA;&lt;p&gt;So, the idea is that if you keep a single &lt;code&gt;~/.ssh/config&lt;/code&gt; that you sync across multiple machines (through a &lt;code&gt;dotfiles&lt;/code&gt; repo, or in my case, machines connected via &lt;em&gt;Tailscale&lt;/em&gt;), you&#39;ve probably run into this: you add a directive that&#39;s only supported in newer OpenSSH versions, and now your config is broken on every machine that hasn&#39;t been updated yet. You either maintain per-machine configs, comment things out when you switch contexts, or just avoid using new features until every machine catches up. None of these are great.&lt;/p&gt;&#xA;&lt;p&gt;OpenSSH 10.0 (April 2025) introduced &lt;code&gt;Match version&lt;/code&gt; in both &lt;code&gt;ssh_config&lt;/code&gt; and &lt;code&gt;sshd_config&lt;/code&gt;. 
It lets you conditionally apply configuration blocks based on the local OpenSSH version string.&lt;/p&gt;&#xA;&lt;p&gt;From the &lt;code&gt;sshd_config&lt;/code&gt; man page:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;The Version keyword matches against the version string&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;of sshd(8), for example &amp;#34;OpenSSH_10.0&amp;#34;.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;On the client side, it works the same way. The version string is what &lt;code&gt;ssh -V&lt;/code&gt; reports, like &lt;code&gt;OpenSSH_10.2p1&lt;/code&gt;. You can match against it with wildcards. So if you want to use a feature that only exists in OpenSSH 10.x and later, you wrap it:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match version &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;OpenSSH_10.*&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;global&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;30m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Or if you want to set the new default post-quantum key exchange but only where it&#39;s available:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match version &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;OpenSSH_10.*&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    KexAlgorithms mlkem768x25519-sha256,curve25519-sha256&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Machines running 9.x will silently skip the block. 
No errors, no broken configs.&lt;/p&gt;&#xA;&lt;p&gt;This also works nicely with the version jump from 9.x to 10.0. OpenSSH 10.0 announces itself as &lt;code&gt;SSH-2.0-OpenSSH_10.0&lt;/code&gt;, which broke some tools that matched version strings with patterns like &lt;code&gt;OpenSSH_1*&lt;/code&gt;, expecting that the version would always start with a single digit. The &lt;code&gt;Match version&lt;/code&gt; directive itself handles this gracefully since it uses standard wildcard matching, but it&#39;s a good reminder that version parsing in SSH is trickier than it looks now that we&#39;ve crossed into double digits.&lt;/p&gt;&#xA;&lt;p&gt;You can combine version matching with other Match predicates. For example, to apply settings only on newer versions connecting to a specific host:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match version &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;OpenSSH_10.*&amp;#34;&lt;/span&gt; host &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;lxplus*.cern.ch&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming no&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;session&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;1h &lt;span class=&#34;z-nv&#34;&gt;global&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;2h&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;One thing to be aware of: on older OpenSSH versions that don&#39;t understand &lt;code&gt;Match version&lt;/code&gt; at all (anything before 10.0), the directive itself will cause a config parse error. So you can&#39;t use &lt;code&gt;Match version&lt;/code&gt; to gate features for pre-10.0 machines; it only works for differentiating between 10.0 and later versions. 
For backward compatibility with genuinely old versions, you still need separate config files or a generation step.&lt;/p&gt;&#xA;&lt;p&gt;For the &lt;code&gt;sshd_config&lt;/code&gt; side, this is useful for fleet management. If you&#39;re rolling out OpenSSH updates incrementally across servers, you can ship one config that works everywhere:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Match version &lt;span class=&#34;z-s2&#34;&gt;&amp;#34;OpenSSH_10.*&amp;#34;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;session&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;30m &lt;span class=&#34;z-nv&#34;&gt;global&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;1h&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    UnusedConnectionTimeout 5m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Servers still on 9.x will ignore the block (well, they&#39;ll error, so really you&#39;d want all your servers on 10.x before using this in &lt;code&gt;sshd_config&lt;/code&gt;). But once your fleet is on 10.0+, you can start using &lt;code&gt;Match version&lt;/code&gt; to gate 10.1 or 10.2 features without worrying about breaking the 10.0 machines.&lt;/p&gt;&#xA;&lt;p&gt;The practical upshot: if all your machines are on OpenSSH 10.0 or later, &lt;code&gt;Match version&lt;/code&gt; makes a single portable &lt;code&gt;ssh_config&lt;/code&gt; meaningfully more viable. It&#39;s one of those small features that doesn&#39;t sound exciting but removes a real friction point from day-to-day SSH config management.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>SSH: Channel Timeout</title>
    <link href="https://blog.melashri.net/micro/channel-timeout/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/channel-timeout/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-11T00:00:00Z</published>
    <updated>2026-03-11T00:00:00Z</updated>
    <summary>SSH can now kill idle channels individually, or all at once</summary>
    <content type="html">&lt;p&gt;This is the second post in a series where I read through &lt;code&gt;man ssh_config&lt;/code&gt; and write about things I find interesting. &lt;a href=&#34;/micro/obscure-keystroke-timing/&#34;&gt;First post was about ObscureKeystrokeTiming&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Today&#39;s find: &lt;code&gt;ChannelTimeout&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Before this option existed, SSH timeout handling was blunt. You had &lt;code&gt;ClientAliveInterval&lt;/code&gt; and &lt;code&gt;ClientAliveCountMax&lt;/code&gt; on the server side, which together formed a dead-peer detection mechanism. If the client stops responding to &lt;code&gt;keepalive&lt;/code&gt; probes, the server drops the connection. But that&#39;s about the whole connection, not individual channels. A single SSH connection can multiplex many channels: your shell session, a port forward, an agent socket, an X11 tunnel. They&#39;re all riding the same connection, and there was no way to say &amp;quot;close this idle port forward after 10 minutes but keep my shell alive.&amp;quot;&lt;/p&gt;&#xA;&lt;p&gt;And &lt;code&gt;ChannelTimeout&lt;/code&gt; fixes this. It was added to &lt;code&gt;sshd&lt;/code&gt; in OpenSSH 9.2 (February 2023) and to the ssh client in 9.6 (December 2023). The syntax is a list of &lt;code&gt;type=interval&lt;/code&gt; pairs:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;session&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;30m direct-tcpip&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;10m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;This tells ssh to close interactive sessions after 30 minutes of inactivity, and local port forwards after 10 minutes. 
The channel types you can target are:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;agent-connection          &lt;span class=&#34;z-c1&#34;&gt;# ssh-agent connections&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;direct-tcpip              &lt;span class=&#34;z-c1&#34;&gt;# local forwards (LocalForward, DynamicForward)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;direct-streamlocal@openssh.com   &lt;span class=&#34;z-c1&#34;&gt;# local Unix socket forwards&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;forwarded-tcpip           &lt;span class=&#34;z-c1&#34;&gt;# remote forwards (RemoteForward)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;forwarded-streamlocal@openssh.com &lt;span class=&#34;z-c1&#34;&gt;# remote Unix socket forwards&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;session                   &lt;span class=&#34;z-c1&#34;&gt;# shell, command execution, scp, sftp&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;tun                       &lt;span class=&#34;z-c1&#34;&gt;# TunnelForward connections&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;x11                       &lt;span class=&#34;z-c1&#34;&gt;# X11 forwarding&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;You can use wildcards too, so &lt;code&gt;*=15m&lt;/code&gt; sets a 15-minute timeout on every channel type. 
But there&#39;s a subtlety here that&#39;s worth understanding.&lt;/p&gt;&#xA;&lt;p&gt;OpenSSH 9.7 (March 2024) added a &lt;code&gt;global&lt;/code&gt; timeout type, and it behaves differently from wildcards. The man page is precise about the distinction: the &lt;code&gt;global&lt;/code&gt; timeout watches all active channels taken together. Traffic on &lt;em&gt;any&lt;/em&gt; active channel resets the timer. When the timer expires, &lt;em&gt;all&lt;/em&gt; channels close. And the global timeout is explicitly not matched by wildcards; you have to specify it by name.&lt;/p&gt;&#xA;&lt;p&gt;This matters for a common setup. Say you have a shell session open and an X11 forward. With per-channel timeouts (&lt;code&gt;session=30m x11=10m&lt;/code&gt;), your X11 channel could be killed after 10 minutes of no X11 traffic even though you&#39;re actively typing in the shell. With a global timeout (&lt;code&gt;global=30m&lt;/code&gt;), any activity on any channel (typing in the shell, X11 events, port forward traffic) resets the single shared timer. 
Everything stays alive as long as something is happening somewhere.&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;# Per-channel: idle X11 gets killed even if shell is active&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;session&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;30m &lt;span class=&#34;z-nv&#34;&gt;x11&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;10m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;# Global: everything stays alive as long as anything is active&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;global&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;30m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;# Combined: global baseline plus aggressive cleanup of port forwards&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ChannelTimeout &lt;span class=&#34;z-nv&#34;&gt;global&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;1h direct-tcpip&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;10m&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;On the server side (&lt;code&gt;sshd_config&lt;/code&gt;), there&#39;s a companion directive &lt;code&gt;UnusedConnectionTimeout&lt;/code&gt; that closes connections with zero open channels. 
This pairs with &lt;code&gt;ChannelTimeout&lt;/code&gt; nicely: channels time out individually, and once the last one dies, the connection itself is cleaned up. The man page specifically notes that this timeout starts after authentication completes but before the client opens any channels, so don&#39;t set it too short or legitimate clients won&#39;t have time to establish their session.&lt;/p&gt;&#xA;&lt;p&gt;This is one of those options that&#39;s useful if you manage machines where people leave SSH sessions open indefinitely or set up port forwards and forget about them, or where you want to reclaim resources from abandoned connections without disrupting active work. For personal use, the global timeout is probably the most practical; it&#39;s the closest thing to &amp;quot;close everything if I walk away and forget.&amp;quot;&lt;/p&gt;&#xA;
  </entry>
  <entry>
    <title>SSH: Obscure Keystroke Timing</title>
    <link href="https://blog.melashri.net/micro/obscure-keystroke-timing/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/obscure-keystroke-timing/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-10T00:00:00Z</published>
    <updated>2026-03-10T00:00:00Z</updated>
    <summary>Your SSH client is sending fake keystrokes, and you probably didn&#39;t notice</summary>
    <content type="html">&lt;p&gt;This is the first post in a series where I read through &lt;code&gt;man ssh_config&lt;/code&gt; and write about things I find interesting. Most of us only open man pages to look up a specific flag and close them immediately. I want to break that habit by actually reading through the pages and writing about what I find, so the knowledge sticks.&lt;/p&gt;&#xA;&lt;p&gt;Today&#39;s find: &lt;code&gt;ObscureKeystrokeTiming&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Since &lt;code&gt;OpenSSH&lt;/code&gt; 9.5 (released late 2023), the SSH client has been doing something interesting by default. Every time you type in an interactive session, ssh quantizes your keystrokes to fixed intervals (20ms by default) and sends fake &amp;quot;chaff&amp;quot; packets after you stop typing. The goal is to make it harder for a passive network observer to perform keystroke timing analysis on your session.&lt;/p&gt;&#xA;&lt;p&gt;Keystroke timing attacks are real. The time between your keypresses leaks information about what you&#39;re typing. Different letter pairs have characteristic timing patterns. Researchers have shown that this metadata alone can be used to infer passwords and commands. &lt;code&gt;OpenSSH&lt;/code&gt;&#39;s response was to add this option, enabled by default, that pads your keystrokes into a regular rhythm and throws in decoy packets to muddy the signal.&lt;/p&gt;&#xA;&lt;p&gt;From the man page:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ObscureKeystrokeTiming&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    Specifies whether ssh(1) should try to obscure inter-keystroke&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    timings from passive observers of network traffic.  
If enabled,&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    then for interactive sessions, ssh(1) will send keystrokes at&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    fixed intervals of a few tens of milliseconds and will send&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    fake keystroke packets for some time after typing ceases.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    The argument to this keyword must be yes, no or an interval&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    specifier of the form interval:milliseconds (e.g. interval:80&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    for 80 milliseconds).  The default is to obscure keystrokes&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    using a 20ms packet interval.  Note that smaller intervals will&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    result in higher fake keystroke packet rates.&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;There are two things worth knowing here. First, you can tune the interval. The default 20ms works fine for most people, but &lt;code&gt;interval:80&lt;/code&gt; would reduce the fake packet rate at the cost of coarser timing granularity. Second, and more practically relevant: if you use X11 forwarding, this feature can cause noticeable lag. 
If you&#39;ve upgraded &lt;code&gt;OpenSSH&lt;/code&gt; recently and your remote GUI apps feel slow, this might be why.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;OpenSSH&lt;/code&gt; 10.0 (April 2025) improved this: the client now avoids starting the keystroke obfuscation if there has been recent traffic on an X11 forwarding channel. But if you&#39;re on an older version, you can disable it per-host:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host myserver&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming no&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Or globally if you prefer:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Host *&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    ObscureKeystrokeTiming no&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Worth noting: the feature had a bug in OpenSSH 9.5 through 9.7 where it actually worked in reverse of what was intended, making the real keystrokes distinguishable from the chaff. A researcher demonstrated that the real keystroke packets were slightly larger than the fake ones, making them trivially identifiable with packet capture. This was fixed in 9.8, but it&#39;s a good reminder that security features can have subtle implementation issues.&lt;/p&gt;&#xA;&lt;p&gt;The broader takeaway is that your SSH client is doing more than just encrypting your traffic. It&#39;s actively trying to hide your typing patterns from anyone watching the wire. Whether you keep it on or off depends on your threat model, but either way it&#39;s good to know it&#39;s there.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>You Can&#39;t Prompt Your Way to Live Data</title>
    <link href="https://blog.melashri.net/posts/llm-skills-vs-mcp-tools/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/llm-skills-vs-mcp-tools/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-08T00:00:00Z</published>
    <updated>2026-03-08T00:00:00Z</updated>
    <summary>The framing of &#39;Skills vs MCP&#39; misrepresents how both mechanisms work. One encodes known context. The other builds and discovers it. Understanding the difference matters more than picking a winner.</summary>
    <content type="html">&lt;p&gt;There is a recurring debate in agent-design circles that goes roughly like this: &lt;em&gt;why build all these MCP servers when you can just write a skill, a markdown instruction file that tells the agent everything it needs to know, saving 90% of your context tokens?&lt;/em&gt; It&#39;s a seductive argument. It sounds like engineering pragmatism. It is, in fact, a category error dressed up as optimization advice.&lt;/p&gt;&#xA;&lt;p&gt;Let&#39;s be precise about what each mechanism actually does, where each breaks down, and why treating them as competitors reveals a fundamental misunderstanding of the context-engineering problem.&lt;/p&gt;&#xA;&lt;p&gt;First lets talk about what are we talking about. An LLM skill, in the operational sense used by systems like Claude&#39;s Code or Web or various other agent frameworks, is a markdown file that injects structured instructions into the model&#39;s context at runtime. It tells the model &lt;em&gt;how to behave&lt;/em&gt;, &lt;em&gt;what tools to prefer&lt;/em&gt;, &lt;em&gt;what patterns to follow&lt;/em&gt;, and sometimes pre-loads domain-specific knowledge, API conventions, library quirks, output schemas that would otherwise require several turns of exploration or failure to discover.&lt;/p&gt;&#xA;&lt;p&gt;Skills work. When you know exactly what a task looks like, when the domain is well-understood and bounded, and when the instructions are stable across invocations, a well-written skill is extraordinarily effective. It collapses setup time, eliminates certain classes of model confusion, and produces more consistent outputs.&lt;/p&gt;&#xA;&lt;p&gt;But notice the implicit preconditions embedded in that last paragraph: &lt;em&gt;you know exactly what the task looks like&lt;/em&gt;. &lt;em&gt;The domain is well-understood.&lt;/em&gt; &lt;em&gt;The instructions are stable.&lt;/em&gt; These are not universal properties of agentic workloads. 
They are special cases.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-90-token-savings-claim-is-misleading&#34;&gt;The 90% Token Savings Claim Is Misleading&lt;/h2&gt;&#xA;&lt;p&gt;The argument that skills save 90% of context tokens usually rests on a comparison like this: &amp;quot;instead of having the agent make three tool calls to discover the schema, just put the schema in the skill file.&amp;quot; This is true and useful in exactly one scenario: when you already have the schema, it doesn&#39;t change, and every invocation needs it.&lt;/p&gt;&#xA;&lt;p&gt;In practice, this framing quietly assumes away the hardest part of the problem: &lt;strong&gt;context that needs to be built, not encoded&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Consider the difference between these two tasks:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;em&gt;&amp;quot;Analyze this ROOT file and produce a summary of the branch structure.&amp;quot;&lt;/em&gt;&lt;/li&gt;&#xA;&lt;li&gt;&lt;em&gt;&amp;quot;Find the most recent LHCb simulation request on GitLab for the BnoC working group and tell me its status.&amp;quot;&lt;/em&gt;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;The first task has a known shape. You could write a skill for it: instruct the model on ROOT file conventions, uproot idioms, what a good branch summary looks like. That skill would genuinely compress context and reduce noise.&lt;/p&gt;&#xA;&lt;p&gt;The second task cannot be encoded in a skill file because its answer does not exist until runtime. The relevant context (which MR, what its current status is, what comments have been left, what pipeline stage it&#39;s in) is &lt;em&gt;discovered&lt;/em&gt;, not &lt;em&gt;pre-known&lt;/em&gt;. No amount of markdown instructions substitutes for a tool that actually queries the GitLab API and returns live data. The skill tells the model how to reason. The MCP tool gives the model something to reason about. 
This is the core asymmetry that the &amp;quot;just use a skill&amp;quot; argument elides.&lt;/p&gt;&#xA;&lt;h2 id=&#34;mcp-tools-are-a-context-building-mechanism-not-a-behavior-encoding-mechanism&#34;&gt;MCP Tools Are a Context-Building Mechanism, Not a Behavior-Encoding Mechanism&lt;/h2&gt;&#xA;&lt;p&gt;The Model Context Protocol is architecturally oriented around a different problem than skills. MCP servers expose tools that the agent can invoke to retrieve, filter, and assemble context dynamically. The emphasis is on &lt;strong&gt;discovery&lt;/strong&gt;, finding information whose existence, structure, or current value could not have been anticipated at system-design time.&lt;/p&gt;&#xA;&lt;p&gt;A well-designed MCP server is essentially a context faucet. The agent doesn&#39;t know in advance what it will need; it queries the server, inspects the response, decides what&#39;s relevant, and proceeds. This is fundamentally an active, runtime-dependent process. The agent is not a passive recipient of pre-loaded instructions; it is an active participant in constructing the context it needs.&lt;/p&gt;&#xA;&lt;p&gt;This is why comparing MCP tools to skills as substitutes is like comparing a database to a config file. Both store information. The use cases are almost entirely non-overlapping.&lt;/p&gt;&#xA;&lt;p&gt;Beyond the philosophical mismatch, skills have practical failure modes that advocates underemphasize.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Staleness.&lt;/strong&gt; A skill that encodes API conventions is correct until the API changes. Skills require active maintenance. In rapidly evolving codebases or external services, the skill becomes a liability the moment its content diverges from ground truth. MCP tools query live systems and are structurally immune to this class of failure.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Authorship bottleneck.&lt;/strong&gt; To write a skill, you must already understand the domain well enough to encode it. 
For novel tasks, exploratory analyses, or unfamiliar systems, you don&#39;t have this knowledge. You need the agent to discover it. Skills require a human SME investment upfront that is often precisely what you&#39;re trying to offload to the agent in the first place.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Context inflation under generalization.&lt;/strong&gt; The temptation, once you&#39;ve bought into the skills-as-optimization frame, is to write increasingly comprehensive skills that cover more edge cases. This is the opposite of the promised token savings. Comprehensive skills balloon. They introduce ambiguity as instructions conflict. They create a maintenance surface that grows superlinearly with coverage.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Overfitting to anticipated tasks.&lt;/strong&gt; Skills optimize for the tasks you predicted. Agentic systems are often deployed precisely because the task space is too large or dynamic to predict exhaustively. A skill-heavy architecture implicitly re-centralizes the knowledge that distribution was supposed to eliminate.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-real-case-for-mcp-is-not-token-efficiency&#34;&gt;The Real Case for MCP Is Not Token Efficiency&lt;/h2&gt;&#xA;&lt;p&gt;Proponents of MCP tools sometimes make a tactical mistake by competing on the token-efficiency axis. That&#39;s a losing argument because skills, in their narrow domain of applicability, genuinely do use fewer tokens. 
The right argument is structural.&lt;/p&gt;&#xA;&lt;p&gt;MCP tools solve problems that token efficiency doesn&#39;t touch:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Live data access&lt;/strong&gt;: No skill file can tell you what the LHC beam energy is right now, what your CI pipeline returned three minutes ago, or what a colleague just pushed to the main branch.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Scoped retrieval&lt;/strong&gt;: A well-designed MCP server doesn&#39;t dump everything into context; it exposes tools for the agent to request precisely what it needs. This is &lt;em&gt;better&lt;/em&gt; token discipline than a broad skill, not worse.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Capability composition&lt;/strong&gt;: MCP tools can chain. An agent can use a search tool to identify relevant files, a fetch tool to retrieve them, an analysis tool to parse them, and a write tool to record findings, all in a single session, against live state, with no pre-encoded assumptions about what it would find.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The token argument is also somewhat moot as context windows expand. What doesn&#39;t become moot is the fundamental question of whether the information the agent needs &lt;em&gt;exists anywhere at authoring time&lt;/em&gt;. If it doesn&#39;t, no skill will supply it. None of this is a case against skills. 
It is a case for accurate categorization.&lt;/p&gt;&#xA;&lt;p&gt;Skills are the right tool when:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The task shape is well-understood and stable.&lt;/li&gt;&#xA;&lt;li&gt;The relevant domain knowledge is human-articulable and unlikely to change faster than the skill can be maintained.&lt;/li&gt;&#xA;&lt;li&gt;The goal is behavioral consistency (formatting conventions, reasoning patterns, output schemas) rather than information retrieval.&lt;/li&gt;&#xA;&lt;li&gt;You want to encode institutional or domain knowledge that a general-purpose model would otherwise lack.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;In these cases, a skill is not just efficient; it is the &lt;em&gt;correct&lt;/em&gt; abstraction. Asking an MCP server to answer &amp;quot;what&#39;s the idiomatic way to write a &lt;code&gt;RooFit&lt;/code&gt; PDF in this codebase&amp;quot; is misusing the tool. That&#39;s a job for skills.&lt;/p&gt;&#xA;&lt;p&gt;The productive framing is not &amp;quot;skills vs MCP&amp;quot; but &amp;quot;skills &lt;em&gt;and&lt;/em&gt; MCP, applied to their respective domains.&amp;quot; A mature agent architecture typically looks something like this:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Skills encode stable domain knowledge and behavioral constraints.&lt;/strong&gt; How to format output. Which patterns to prefer. What conventions to follow. How to reason about a class of problem.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;MCP tools build dynamic context.&lt;/strong&gt; What exists in the repository right now. What the current state of an external system is. What data the user&#39;s files contain. What search results are relevant to this specific query.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;These two mechanisms compose naturally. A skill might instruct the agent on how to interpret ROOT file structures; an MCP tool provides the actual file to interpret. 
The skill is the interpreter; the tool is the input.&lt;/p&gt;&#xA;&lt;p&gt;Treating them as substitutes is not just technically wrong; it leads to bad architectural decisions. Teams that go all-in on skills end up with brittle, high-maintenance instruction sets that can&#39;t adapt to live data. Teams that go all-in on MCP without any behavioral guidance end up with agents that know what data to fetch but don&#39;t know what to do with it.&lt;/p&gt;&#xA;&lt;p&gt;The &amp;quot;just use a skill&amp;quot; argument fails not because skills are bad but because it misunderstands what skills are for. Skills encode what you already know. MCP tools discover what you don&#39;t. Most non-trivial agentic tasks require both.&lt;/p&gt;&#xA;&lt;p&gt;The token savings framing is particularly worth resisting. It frames the problem as one of compression when the real challenge is one of knowledge availability. You cannot compress information that doesn&#39;t exist yet at system design time. And in production agentic systems, a significant fraction of the most important context (live state, external data, dynamic artifacts) is exactly that kind of information.&lt;/p&gt;&#xA;&lt;p&gt;Build your skills carefully, for the domains where they belong. Build your MCP servers for the rest. Stop asking which one you need. Start asking which problem each one solves.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>Register Spilling Analysis: How NVCC Manages the GPU Register File</title>
    <link href="https://blog.melashri.net/posts/register-spilling-analysis/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/posts/register-spilling-analysis/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-03T00:00:00Z</published>
    <updated>2026-03-03T00:00:00Z</updated>
    <summary>A deep technical exploration of register spilling in CUDA, from first principles of GPU register architecture through NVCC&#39;s allocation strategies, spill analysis, and practical optimization techniques.</summary>
    <content type="html">&lt;p&gt;If you are working with CUDA and GPU programming, you have probably heard the term &amp;quot;register spilling&amp;quot; at some point. Register spilling is a phenomenon that occurs when a CUDA kernel uses more registers than are available on the GPU, causing some of the register data to be spilled to local memory. This can lead to significant performance degradation, as accessing local memory is much slower than accessing registers.&lt;/p&gt;&#xA;&lt;p&gt;But to understand register spilling, we must first understand what registers &lt;em&gt;are&lt;/em&gt; in the context of a GPU, and why they occupy such a privileged position in the memory hierarchy.&lt;/p&gt;&#xA;&lt;p&gt;A GPU Streaming Multiprocessor (SM) contains a &lt;strong&gt;register file&lt;/strong&gt; which is a large, flat bank of 32-bit registers shared among all threads concurrently resident on that SM. On modern NVIDIA architectures (Ampere, Hopper), each SM provides &lt;strong&gt;65,536&lt;/strong&gt; &lt;em&gt;32-bit registers&lt;/em&gt;. These are the fastest storage available to a thread: access latency is effectively zero cycles (operands are read in the same cycle the instruction is issued), and bandwidth is enormous, on the order of tens of terabytes per second aggregate across the chip.&lt;/p&gt;&#xA;&lt;p&gt;Every thread executing on the SM is allocated a contiguous slice of this register file at launch time. The key constraint is this: the register file is &lt;strong&gt;statically partitioned&lt;/strong&gt; among all resident warps. If each thread in a kernel uses 32 registers, and each warp has 32 threads, then each warp consumes &lt;code&gt;32 × 32 = 1024&lt;/code&gt; registers. An SM with 65,536 registers can therefore host at most 64 warps simultaneously. 
If each thread uses 64 registers, that drops to 32 warps, halving &lt;strong&gt;occupancy&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;p&gt;This creates the fundamental tension that makes register spilling interesting: the compiler must balance &lt;strong&gt;per-thread register usage&lt;/strong&gt; (which determines computational throughput for each thread) against &lt;strong&gt;occupancy&lt;/strong&gt; (which determines the SM&#39;s ability to hide memory latency through warp-level parallelism).&lt;/p&gt;&#xA;&lt;p&gt;To reiterate, register spilling occurs when a kernel&#39;s live variable set exceeds the number of physical registers the compiler has allocated for each thread. When this happens, the compiler must &lt;strong&gt;evict&lt;/strong&gt; some register values to a slower level of the memory hierarchy: specifically, to &lt;strong&gt;local memory&lt;/strong&gt;, which despite its name resides in the same off-chip DRAM (or L2 cache) as global memory.&lt;/p&gt;&#xA;&lt;p&gt;Concretely, a &amp;quot;spill&amp;quot; manifests as a pair of instructions:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Spill store (&lt;code&gt;STL&lt;/code&gt;)&lt;/strong&gt;: Write a register value to the thread&#39;s local memory stack frame.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Spill load (&lt;code&gt;LDL&lt;/code&gt;)&lt;/strong&gt;: Later, read that value back from local memory into a register when it is needed again.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;Each of these instructions has a latency of &lt;strong&gt;hundreds of cycles&lt;/strong&gt; (200–800 cycles depending on L1/L2 cache hit rates), compared to the zero-cycle access of a register read. This is why spilling is costly: it transforms what should be a free operand access into a memory transaction that can stall the warp&#39;s execution pipeline.&lt;/p&gt;&#xA;&lt;p&gt;Each CUDA thread has a private &lt;strong&gt;local memory&lt;/strong&gt; region. 
&lt;code&gt;NVCC&lt;/code&gt; uses this region to store:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Spilled register values.&lt;/li&gt;&#xA;&lt;li&gt;Large arrays declared within a kernel that cannot be kept in registers.&lt;/li&gt;&#xA;&lt;li&gt;Compiler-generated temporaries for complex expressions.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The local memory address for thread &lt;code&gt;t&lt;/code&gt; in block &lt;code&gt;b&lt;/code&gt; is computed as an offset from a per-thread base address. The hardware coalesces local memory accesses across threads in a warp: &lt;code&gt;thread 0&lt;/code&gt; accesses address &lt;code&gt;base + offset&lt;/code&gt;, &lt;code&gt;thread 1&lt;/code&gt; accesses &lt;code&gt;base + offset + stride&lt;/code&gt;, and so on, so that a warp&#39;s spill loads/stores hit contiguous cache lines. This is important: it means spills are at least &lt;em&gt;coalesced&lt;/em&gt;, but they still pay the latency penalty of an L1/L2 access.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;NVCC&lt;/code&gt;&#39;s register allocation is a &lt;strong&gt;graph-coloring&lt;/strong&gt; problem operating on the intermediate representation (IR) after the PTX (Parallel Thread Execution) virtual ISA has been lowered to SASS (the actual machine ISA). 
The process unfolds in several phases:&lt;/p&gt;&#xA;&lt;h2 id=&#34;phase-1-liveness-analysis&#34;&gt;Phase 1: Liveness Analysis&lt;/h2&gt;&#xA;&lt;p&gt;The compiler performs a classic &lt;strong&gt;dataflow analysis&lt;/strong&gt; to determine, at each program point, which virtual registers are &lt;strong&gt;live&lt;/strong&gt;, meaning their values will be used by some future instruction before being overwritten.&lt;/p&gt;&#xA;&lt;p&gt;Consider this simplified kernel:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;__global__&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;void&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;example&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;A&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;B&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;C&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;N&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt; &lt;span 
class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockDim&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;threadIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;N&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;a&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;A&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;       &lt;span class=&#34;z-c1&#34;&gt;// v1 = load&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;b&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;B&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;       &lt;span class=&#34;z-c1&#34;&gt;// v2 = 
load&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;c&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;a&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;b&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;        &lt;span class=&#34;z-c1&#34;&gt;// v3 = v1 * v2&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;sinf&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;      &lt;span class=&#34;z-c1&#34;&gt;// v4 = sin(v3)&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;e&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;a&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;        &lt;span class=&#34;z-c1&#34;&gt;// v5 = v1 + v4    &amp;lt;- v1 is still live here!&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;b&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;e&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;        &lt;span class=&#34;z-c1&#34;&gt;// v6 = v2 * v5    &amp;lt;- v2 is still live here!&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-n&#34;&gt;C&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;             &lt;span class=&#34;z-c1&#34;&gt;// store v6&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The &lt;strong&gt;live ranges&lt;/strong&gt; are:&lt;/p&gt;&#xA;&lt;table&gt;&#xA;&lt;thead&gt;&#xA;&lt;tr&gt;&#xA;&lt;th&gt;Instruction&lt;/th&gt;&#xA;&lt;th&gt;Live-in set&lt;/th&gt;&#xA;&lt;/tr&gt;&#xA;&lt;/thead&gt;&#xA;&lt;tbody&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v1 = load&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{idx, A, B, C, N}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v2 = load&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{idx, v1, B, C, N}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v3 = v1*v2&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{v1, v2, C, idx}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v4 = sin(v3)&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{v1, v2, v3, C, idx}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v5 = v1+v4&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{v1, v2, v4, C, 
idx}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;v6 = v2*v5&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{v2, v5, C, idx}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;tr&gt;&#xA;&lt;td&gt;store v6&lt;/td&gt;&#xA;&lt;td&gt;&lt;code&gt;{v6, C, idx}&lt;/code&gt;&lt;/td&gt;&#xA;&lt;/tr&gt;&#xA;&lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;p&gt;The &lt;strong&gt;maximum register pressure&lt;/strong&gt; occurs at instruction &lt;code&gt;v4 = sin(v3)&lt;/code&gt;, where five virtual registers (&lt;code&gt;v1, v2, v3, C, idx&lt;/code&gt;) are simultaneously live. If the physical register budget is 4, the compiler must spill at least one.&lt;/p&gt;&#xA;&lt;h2 id=&#34;phase-2-interference-graph-construction&#34;&gt;Phase 2: Interference Graph Construction&lt;/h2&gt;&#xA;&lt;p&gt;The compiler builds an &lt;strong&gt;interference graph&lt;/strong&gt; where each node represents a virtual register and an edge connects two nodes if their live ranges overlap. Two virtual registers that are simultaneously live cannot share the same physical register.&lt;/p&gt;&#xA;&lt;p&gt;For the example above, &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt; interfere (both live from instruction 2 onwards through instruction 5 for &lt;code&gt;v1&lt;/code&gt; and instruction 6 for &lt;code&gt;v2&lt;/code&gt;). The chromatic number of this graph tells us the minimum number of physical registers needed.&lt;/p&gt;&#xA;&lt;h2 id=&#34;phase-3-graph-coloring-with-spilling&#34;&gt;Phase 3: Graph Coloring with Spilling&lt;/h2&gt;&#xA;&lt;p&gt;&lt;code&gt;NVCC&lt;/code&gt; uses a variant of the &lt;strong&gt;Chaitin-Briggs&lt;/strong&gt; graph coloring algorithm, adapted for the GPU&#39;s architectural constraints. 
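&lt;/p&gt;
&lt;p&gt;The mechanics can be caricatured with a deliberately naive allocator (greedy interval coloring rather than true Chaitin-Briggs): overlapping live ranges compete for the &lt;em&gt;k&lt;/em&gt; physical registers, and whatever cannot be colored is spilled. This is an illustrative sketch, not NVCC&#39;s actual algorithm.&lt;/p&gt;
```python
def overlaps(a, b):
    # closed intervals (start, end) intersect unless one ends before the other begins
    return not (b[0] > a[1] or a[0] > b[1])

def allocate(live_ranges, k):
    """Toy allocator: greedily color interfering live ranges with k registers."""
    assigned, spilled = {}, []
    for vreg, rng in sorted(live_ranges.items(), key=lambda kv: kv[1]):
        # physical registers already taken by overlapping (interfering) ranges
        busy = {assigned[other] for other, other_rng in live_ranges.items()
                if other in assigned and overlaps(rng, other_rng)}
        free = [r for r in range(k) if r not in busy]
        if free:
            assigned[vreg] = free[0]
        else:
            spilled.append(vreg)   # would become STL/LDL pairs in SASS
    return assigned, spilled

# Four mutually overlapping ranges but only three registers: one must spill.
print(allocate({"v1": (0, 4), "v2": (1, 5), "v3": (2, 6), "v4": (3, 7)}, k=3))
```
&lt;p&gt;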
The algorithm proceeds:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Simplify&lt;/strong&gt;: Iteratively remove nodes with degree less than &lt;em&gt;k&lt;/em&gt; (the number of available physical registers) from the graph, pushing them onto a stack.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Potential spill&lt;/strong&gt;: If no node has degree &amp;lt; &lt;em&gt;k&lt;/em&gt;, select a node to be a &lt;strong&gt;potential spill candidate&lt;/strong&gt; based on heuristics (discussed below), remove it, and mark it.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Select&lt;/strong&gt;: Pop nodes from the stack and assign colors (physical registers). If a potential spill node cannot be colored, it becomes an &lt;strong&gt;actual spill&lt;/strong&gt;: its value is stored to local memory.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Rewrite&lt;/strong&gt;: Insert &lt;code&gt;STL&lt;/code&gt; and &lt;code&gt;LDL&lt;/code&gt; instructions for each actual spill and re-run allocation if needed.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;When the interference graph is too dense to color, the allocator must choose &lt;em&gt;which&lt;/em&gt; virtual register to spill, and this decision is critical. 
&lt;code&gt;NVCC&lt;/code&gt; employs several heuristics:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;&lt;strong&gt;Cost-based spilling&lt;/strong&gt;: The compiler estimates the &amp;quot;spill cost&amp;quot; of each candidate as a function of:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Frequency of use&lt;/strong&gt;: A register used inside a loop body has high spill cost because every iteration would incur a spill load.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Definition-use distance&lt;/strong&gt;: A value defined far from its use is a better spill candidate than one used immediately after definition.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Rematerialization potential&lt;/strong&gt;: If the value can be cheaply recomputed (e.g., it is a constant, an address calculation, or a simple arithmetic expression of other live values), spilling it is effectively free: the compiler can &lt;em&gt;rematerialize&lt;/em&gt; it instead of loading from local memory.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;&lt;strong&gt;Loop-aware analysis&lt;/strong&gt;: &lt;code&gt;NVCC&lt;/code&gt; gives significant weight to loop nesting depth. A variable live across a loop body but only used outside the loop is a prime spill candidate: it can be spilled once before the loop and reloaded once after, rather than incurring per-iteration cost.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;h2 id=&#34;why-nvcc-spills-architectural-motivations&#34;&gt;Why NVCC Spills: Architectural Motivations&lt;/h2&gt;&#xA;&lt;p&gt;&lt;code&gt;NVCC&lt;/code&gt;&#39;s register allocation strategy is driven by several GPU-specific considerations that distinguish it from CPU register allocation:&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-the-occupancy-cliff&#34;&gt;1. The Occupancy Cliff&lt;/h3&gt;&#xA;&lt;p&gt;The register file is a hard-partitioned resource. 
The relationship between per-thread register count and maximum warps per SM is a step function that can be visualized as follows (for Ampere architecture with 65,536 registers):&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;Registers/thread    Max warps &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;Ampere SM, &lt;span class=&#34;z-m&#34;&gt;65536&lt;/span&gt; regs&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;32&lt;/span&gt;                &lt;span class=&#34;z-m&#34;&gt;64&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;40&lt;/span&gt;                &lt;span class=&#34;z-m&#34;&gt;48&lt;/span&gt;  &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;&amp;lt;- occupancy drops by 25%&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;48&lt;/span&gt;                &lt;span class=&#34;z-m&#34;&gt;40&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;64&lt;/span&gt;                &lt;span class=&#34;z-m&#34;&gt;32&lt;/span&gt;  &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;&amp;lt;- occupancy halved&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;80&lt;/span&gt;                &lt;span class=&#34;z-m&#34;&gt;24&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;96&lt;/span&gt;                &lt;span 
class=&#34;z-m&#34;&gt;20&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;128&lt;/span&gt;               &lt;span class=&#34;z-m&#34;&gt;16&lt;/span&gt;  &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;&amp;lt;- occupancy quartered&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;≤ &lt;span class=&#34;z-m&#34;&gt;255&lt;/span&gt;               &lt;span class=&#34;z-m&#34;&gt;8&lt;/span&gt;   &lt;span class=&#34;z-o&#34;&gt;(&lt;/span&gt;&amp;lt;- absolute minimum&lt;span class=&#34;z-o&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Notice the &lt;strong&gt;non-linearity&lt;/strong&gt;: going from 32 to 33 registers per thread drops maximum warps from 64 to 48, a 25% occupancy reduction from a single additional register. &lt;code&gt;NVCC&lt;/code&gt; is aware of these thresholds and may deliberately spill a few variables to keep register count at or below a step boundary.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-launch-bounds-and-explicit-hints&#34;&gt;2. 
Launch Bounds and Explicit Hints&lt;/h3&gt;&#xA;&lt;p&gt;CUDA provides the &lt;code&gt;__launch_bounds__&lt;/code&gt; qualifier to give the compiler information about the intended block size:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;__global__&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;void&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;__launch_bounds__&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-mi&#34;&gt;256&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;my_kernel&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;data&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;// ...&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Here, &lt;code&gt;256&lt;/code&gt; is the maximum threads per block and &lt;code&gt;4&lt;/code&gt; is the minimum blocks per SM. From &lt;code&gt;minBlocks = 4&lt;/code&gt; and &lt;code&gt;threadsPerBlock = 256&lt;/code&gt;, the compiler computes that at least &lt;code&gt;4 × (256/32) = 32&lt;/code&gt; warps must be resident simultaneously, requiring at most &lt;code&gt;65536 / (32 × 32) = 64&lt;/code&gt; registers per thread. 
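&lt;/p&gt;
&lt;p&gt;The implied cap can be sketched with the same arithmetic (again ignoring the allocation granularity that real &lt;code&gt;ptxas&lt;/code&gt; rounds to):&lt;/p&gt;
```python
REGISTER_FILE = 65536
WARP_SIZE = 32

def launch_bounds_reg_cap(max_threads_per_block, min_blocks_per_sm):
    """Per-thread register budget implied by __launch_bounds__."""
    resident_warps = min_blocks_per_sm * (max_threads_per_block // WARP_SIZE)
    return REGISTER_FILE // (resident_warps * WARP_SIZE)

print(launch_bounds_reg_cap(256, 4))   # 64 registers per thread, as above
print(launch_bounds_reg_cap(256, 8))   # 32: doubling resident blocks halves the budget
```
&lt;p&gt;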
&lt;code&gt;NVCC&lt;/code&gt; will then &lt;em&gt;aggressively spill&lt;/em&gt; to enforce this limit, even if the natural register usage would be higher.&lt;/p&gt;&#xA;&lt;p&gt;Without &lt;code&gt;__launch_bounds__&lt;/code&gt;, &lt;code&gt;NVCC&lt;/code&gt; uses a default heuristic (typically targeting ~32 registers per thread on recent architectures) and makes less aggressive spilling decisions.&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-the-maxrregcount-flag&#34;&gt;3. The &lt;code&gt;maxrregcount&lt;/code&gt; Flag&lt;/h3&gt;&#xA;&lt;p&gt;The compiler flag &lt;code&gt;--maxrregcount=N&lt;/code&gt; globally caps register usage per thread at &lt;em&gt;N&lt;/em&gt;. When a kernel&#39;s natural register demand exceeds &lt;em&gt;N&lt;/em&gt;, &lt;code&gt;NVCC&lt;/code&gt; must spill the difference. This is a blunt instrument, it applies uniformly and can cause excessive spilling in register-hungry kernels, but it is commonly used to tune occupancy across an entire compilation unit.&lt;/p&gt;&#xA;&lt;h3 id=&#34;4-predication-and-divergence-pressure&#34;&gt;4. Predication and Divergence Pressure&lt;/h3&gt;&#xA;&lt;p&gt;GPU kernels frequently contain conditional code where both branches must be considered for liveness, because threads in a warp may diverge. 
Consider:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;condition&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;expensive_computation_1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt; &lt;span class=&#34;z-k&#34;&gt;else&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;expensive_computation_2&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span 
class=&#34;z-n&#34;&gt;use&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;On a CPU, only one branch&#39;s registers are live at a time. On a GPU, predicated execution or warp-level divergence means the compiler may conservatively assume that variables from &lt;strong&gt;both branches&lt;/strong&gt; are simultaneously live, inflating register pressure and causing spills that would not occur in scalar compilation.&lt;/p&gt;&#xA;&lt;p&gt;Modern &lt;code&gt;NVCC&lt;/code&gt; versions perform &lt;strong&gt;predication-aware liveness analysis&lt;/strong&gt; that is more precise about this, but deeply nested divergent control flow still tends to inflate register pressure.&lt;/p&gt;&#xA;&lt;h2 id=&#34;analyzing-spills-practical-techniques&#34;&gt;Analyzing Spills: Practical Techniques&lt;/h2&gt;&#xA;&lt;p&gt;Now that we understand why spills happen, how can we analyze them in practice? &lt;code&gt;NVCC&lt;/code&gt; and NVIDIA&#39;s profiling tools provide several ways to observe and quantify spilling. These techniques are essential for diagnosing performance issues and guiding optimization efforts. 
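&lt;/p&gt;
&lt;p&gt;One practical note before diving in: the verbose summary that &lt;code&gt;ptxas&lt;/code&gt; prints (covered in the next subsection) is plain text, so it is easy to scrape when you want to track spills across many kernels or builds. The following is a small sketch, not an official tool; &lt;code&gt;parse_ptxas_log&lt;/code&gt; is a hypothetical helper that pulls register counts and spill bytes out of a captured compiler log:&lt;/p&gt;

```python
import re

# Sketch: parse the verbose summary emitted by `nvcc --ptxas-options=-v`.
# parse_ptxas_log is a hypothetical helper, not part of the CUDA toolkit.
def parse_ptxas_log(log):
    stats = {}
    kernel = None
    for line in log.splitlines():
        # A "Function properties" line names the (mangled) kernel.
        m = re.search(r"Function properties for (\S+)", line)
        if m:
            kernel = m.group(1)
            stats[kernel] = {}
        # Spill traffic is reported in bytes of stores and loads.
        m = re.search(r"(\d+) bytes spill stores, (\d+) bytes spill loads", line)
        if m and kernel:
            stats[kernel]["spill_stores"] = int(m.group(1))
            stats[kernel]["spill_loads"] = int(m.group(2))
        # Register usage per thread.
        m = re.search(r"Used (\d+) registers", line)
        if m and kernel:
            stats[kernel]["registers"] = int(m.group(1))
    return stats

log = """ptxas info    : Function properties for _Z15heavy_kernelPfS_S_i
    128 bytes stack frame, 96 bytes spill stores, 88 bytes spill loads
ptxas info    : Used 64 registers, 380 bytes cmem[0]"""
print(parse_ptxas_log(log))
```

&lt;p&gt;Diffing this dictionary between two builds makes spill regressions easy to spot.&lt;/p&gt;
&lt;p&gt;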
The only caveat is that the tools and metrics can be overwhelming, so I will focus on the most informative ones for spill analysis and will not cover the full breadth of Nsight Compute&#39;s capabilities or even try to explain the various occupancy and warp-level metrics that are also important for performance tuning.&lt;/p&gt;&#xA;&lt;h3 id=&#34;use-compiler-flag---ptxas-options-v&#34;&gt;Use Compiler flag: &lt;code&gt;--ptxas-options=-v&lt;/code&gt;&lt;/h3&gt;&#xA;&lt;p&gt;The most direct way to observe spills is the verbose output from &lt;code&gt;ptxas&lt;/code&gt;, the PTX assembler:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;nvcc --ptxas-options&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt;-v -o kernel kernel.cu&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;This produces output like:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Compiling entry &lt;span class=&#34;z-k&#34;&gt;function&lt;/span&gt; &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;_Z9my_kernelPfS_S_i&amp;#39;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Function properties &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; _Z9my_kernelPfS_S_i&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-m&#34;&gt;0&lt;/span&gt; bytes stack frame, &lt;span class=&#34;z-m&#34;&gt;0&lt;/span&gt; bytes spill stores, &lt;span class=&#34;z-m&#34;&gt;0&lt;/span&gt; bytes spill loads&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Used &lt;span class=&#34;z-m&#34;&gt;28&lt;/span&gt; registers, &lt;span class=&#34;z-m&#34;&gt;360&lt;/span&gt; bytes cmem&lt;span 
class=&#34;z-o&#34;&gt;[&lt;/span&gt;0&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When spilling occurs, you see nonzero values:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Function properties &lt;span class=&#34;z-k&#34;&gt;for&lt;/span&gt; _Z15heavy_kernelPfS_S_i&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-m&#34;&gt;128&lt;/span&gt; bytes stack frame, &lt;span class=&#34;z-m&#34;&gt;96&lt;/span&gt; bytes spill stores, &lt;span class=&#34;z-m&#34;&gt;88&lt;/span&gt; bytes spill loads&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Used &lt;span class=&#34;z-m&#34;&gt;64&lt;/span&gt; registers, &lt;span class=&#34;z-m&#34;&gt;380&lt;/span&gt; bytes cmem&lt;span class=&#34;z-o&#34;&gt;[&lt;/span&gt;0&lt;span class=&#34;z-o&#34;&gt;]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;The asymmetry between spill stores (96 bytes) and spill loads (88 bytes) is normal: some spilled values may be dead along certain paths or rematerialized instead of reloaded.&lt;/p&gt;&#xA;&lt;h3 id=&#34;sass-inspection-with-cuobjdump&#34;&gt;SASS Inspection with &lt;code&gt;cuobjdump&lt;/code&gt;&lt;/h3&gt;&#xA;&lt;p&gt;To see the actual spill instructions, disassemble the binary:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;cuobjdump -sass kernel.o &lt;span class=&#34;z-p&#34;&gt;|&lt;/span&gt; grep -E &lt;span class=&#34;z-s1&#34;&gt;&amp;#39;STL|LDL&amp;#39;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;&lt;code&gt;STL&lt;/code&gt; (Store to Local) and &lt;code&gt;LDL&lt;/code&gt; (Load from Local) are the SASS 
instructions corresponding to spill stores and loads. You can count their frequency, observe their placement relative to loop structures, and infer which variables were spilled.&lt;/p&gt;&#xA;&lt;h3 id=&#34;nsight-compute-profiling&#34;&gt;Nsight Compute Profiling&lt;/h3&gt;&#xA;&lt;p&gt;NVIDIA Nsight Compute provides detailed metrics for spill analysis; it is the most powerful tool in your arsenal. Key metrics to look at include:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;&lt;code&gt;l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_read&lt;/code&gt;&lt;/strong&gt;: This counts local memory read transactions (spill loads).&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&lt;code&gt;l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_write&lt;/code&gt;&lt;/strong&gt;: This counts local memory write transactions (spill stores).&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&lt;code&gt;smsp__sass_inst_executed_op_local_ld&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;smsp__sass_inst_executed_op_local_st&lt;/code&gt;&lt;/strong&gt;: These provide the direct counts of local load/store instructions executed.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;A high ratio of local memory traffic to global memory traffic is a strong indicator that spills are the performance bottleneck. But there are many nuances: if the spilled values are reused frequently and hit in L1 cache, the performance impact may be less severe than if they cause L1 misses. Keep in mind, too, that some spills are rematerialized rather than reloaded, so the local memory traffic metrics may undercount the true spill cost.&lt;/p&gt;&#xA;&lt;h3 id=&#34;nsight-compute-source-correlation&#34;&gt;Nsight Compute Source Correlation&lt;/h3&gt;&#xA;&lt;p&gt;Using &lt;code&gt;nvcc -lineinfo&lt;/code&gt;, Nsight Compute can correlate SASS instructions back to source lines. 
This allows you to identify &lt;em&gt;which source-level variables&lt;/em&gt; are being spilled, which is critical for targeted optimization.&lt;/p&gt;&#xA;&lt;h2 id=&#34;a-detailed-example-spill-pathology-and-resolution&#34;&gt;A Detailed Example: Spill Pathology and Resolution&lt;/h2&gt;&#xA;&lt;p&gt;Let&#39;s examine a realistic kernel with high register pressure and see how we can analyze and optimize it. Consider a kernel performing a stencil computation with multiple intermediate buffers, something like this:&lt;/p&gt;&#xA;&lt;div class=&#34;collapsible-code-wrapper collapsed&#34; data-expand=&#34;Show full stencil kernel&#34; data-collapse=&#34;Hide kernel&#34;&gt;&lt;button class=&#34;collapsible-code-toggle&#34; aria-expanded=&#34;false&#34;&gt;&lt;span class=&#34;toggle-icon&#34;&gt;▶&lt;/span&gt;&lt;span class=&#34;toggle-label&#34;&gt;Show full stencil kernel&lt;/span&gt;&lt;/button&gt;&lt;div class=&#34;collapsible-code-content&#34;&gt;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;__global__&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;void&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;stencil_3d&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;__restrict__&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span 
class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;__restrict__&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;output&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nz&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockDim&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;threadIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;j&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;blockIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockDim&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;threadIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;k&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;z&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;blockDim&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;z&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;threadIdx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;z&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt; 
&lt;span class=&#34;z-n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;j&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;j&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;k&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;k&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nz&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;j&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;k&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-c1&#34;&gt;// Load 7-point stencil&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;xp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;zm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span 
class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;zp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-c1&#34;&gt;// Compute second derivatives&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span 
class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2y&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2z&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;zp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;zm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-c1&#34;&gt;// Nonlinear diffusion coefficient&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;grad_sq&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span 
class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;zp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;zm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;zp&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;zm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;kappa&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;1.0f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-mf&#34;&gt;1.0f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;grad_sq&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span 
class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-c1&#34;&gt;// Cross-derivative terms (13-point stencil extension)&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_mm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2xy&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;0.25f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xy_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xy_mm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span 
class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_mm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2xz&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;0.25f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;xz_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xz_mm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span 
class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_mm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;Ny&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2yz&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;0.25f&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;yz_pp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_pm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yz_mp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span 
class=&#34;z-n&#34;&gt;yz_mm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;        &lt;span class=&#34;z-n&#34;&gt;output&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;kappa&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;d2x&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2y&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2z&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2xy&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2xz&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;d2yz&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;This kernel has enormous register pressure. 
At the point where &lt;code&gt;d2yz&lt;/code&gt; is being computed, the live set includes: &lt;code&gt;idx&lt;/code&gt;, &lt;code&gt;Nx&lt;/code&gt;, &lt;code&gt;Ny&lt;/code&gt;, &lt;code&gt;center&lt;/code&gt;, &lt;code&gt;d2x&lt;/code&gt;, &lt;code&gt;d2y&lt;/code&gt;, &lt;code&gt;d2z&lt;/code&gt;, &lt;code&gt;kappa&lt;/code&gt;, &lt;code&gt;d2xy&lt;/code&gt;, &lt;code&gt;d2xz&lt;/code&gt;, plus the four &lt;code&gt;yz_*&lt;/code&gt; temporaries, plus the output pointer, plus several address-computation intermediates. Compiling with &lt;code&gt;-Xptxas -v&lt;/code&gt;:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;ptxas info    : Used &lt;span class=&#34;z-m&#34;&gt;42&lt;/span&gt; registers, &lt;span class=&#34;z-m&#34;&gt;48&lt;/span&gt; bytes spill stores, &lt;span class=&#34;z-m&#34;&gt;40&lt;/span&gt; bytes spill loads&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;42 registers puts us in the &amp;quot;max 48 warps&amp;quot; occupancy bucket. The spills push some pressure to local memory. That&#39;s a problem because this kernel is likely memory-bound, and the spill-induced local memory traffic will further reduce effective bandwidth. But how do we fix it? We can think of several strategies:&lt;/p&gt;&#xA;&lt;h3 id=&#34;strategy-1-reduce-live-range-overlap&#34;&gt;Strategy 1: Reduce Live Range Overlap&lt;/h3&gt;&#xA;&lt;p&gt;This is the most obvious and often the most effective strategy. By restructuring the computation to minimize the number of simultaneously live intermediate values, we can reduce register pressure without changing the algorithm. 
In our case, it would be as the following:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;// Compute and accumulate terms incrementally&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;laplacian&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;0.0f&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;// X-derivative block&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; 
&lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-n&#34;&gt;laplacian&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;xm&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;// xm and xp are dead after this scope&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-c1&#34;&gt;// Y-derivative block&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span 
class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;idx&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;Nx&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-n&#34;&gt;laplacian&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;yp&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;z-mf&#34;&gt;2.0f&lt;/span&gt;&lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;center&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;ym&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;;&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;And so on for each term. By scoping intermediate values tightly, we reduce the maximum live set at any program point. 
The compiler can reuse the physical registers that held &lt;code&gt;xm&lt;/code&gt; and &lt;code&gt;xp&lt;/code&gt; for &lt;code&gt;ym&lt;/code&gt; and &lt;code&gt;yp&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;h3 id=&#34;strategy-2-recompute-instead-of-store&#34;&gt;Strategy 2: Recompute Instead of Store&lt;/h3&gt;&#xA;&lt;p&gt;If &lt;code&gt;kappa&lt;/code&gt; depends on gradient values and those gradient values are also needed for cross-terms, it may be cheaper to recompute the gradient components rather than keeping them live across many instructions. This is the &lt;em&gt;rematerialization&lt;/em&gt; strategy, trading ALU cycles (which are cheap on a GPU) for register pressure reduction.&lt;/p&gt;&#xA;&lt;h3 id=&#34;strategy-3---launch-bounds---tuning&#34;&gt;Strategy 3: &lt;code&gt;__launch_bounds__&lt;/code&gt; Tuning&lt;/h3&gt;&#xA;&lt;p&gt;If the kernel is latency-bound rather than throughput-bound, you might &lt;em&gt;accept&lt;/em&gt; lower occupancy in exchange for zero spills:&lt;/p&gt;&#xA;&lt;pre class=&#34;z-chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;__global__&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;void&lt;/span&gt; &lt;span class=&#34;z-nf&#34;&gt;__launch_bounds__&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-n&#34;&gt;stencil_3d&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;z-k&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;z-kt&#34;&gt;float&lt;/span&gt; &lt;span class=&#34;z-o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;z-n&#34;&gt;__restrict__&lt;/span&gt; &lt;span class=&#34;z-n&#34;&gt;input&lt;/span&gt;&lt;span 
class=&#34;z-p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;z-p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;z-p&#34;&gt;{&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;// With minBlocks=2 and 128 threads, the compiler has more&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;    &lt;span class=&#34;z-c1&#34;&gt;// registers per thread to work with, potentially eliminating spills&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;z-line&#34;&gt;&lt;span class=&#34;z-cl&#34;&gt;&lt;span class=&#34;z-p&#34;&gt;}&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;This is a deliberate architectural tradeoff: fewer concurrent warps, but each warp runs at full speed with no spill-induced stalls.&lt;/p&gt;&#xA;&lt;h2 id=&#34;nvccs-ptx-to-sass-pipeline-and-spill-decisions&#34;&gt;NVCC&#39;s PTX-to-SASS Pipeline and Spill Decisions&lt;/h2&gt;&#xA;&lt;p&gt;There is a common misconception that register spilling is a direct consequence of the PTX code generated by &lt;code&gt;NVCC&lt;/code&gt;. In reality, the spilling decision does not happen at the PTX level. PTX uses an unlimited virtual register set: a kernel&#39;s PTX may reference hundreds of virtual registers (&lt;code&gt;%f0&lt;/code&gt;, &lt;code&gt;%f1&lt;/code&gt;, ..., &lt;code&gt;%f127&lt;/code&gt;, ...) without concern for physical limits. The register allocation and spilling decisions are made later, during the &lt;code&gt;PTX-to-SASS&lt;/code&gt; compilation phase performed by &lt;code&gt;ptxas&lt;/code&gt;. This means that the PTX code you see is not a reliable indicator of whether spills will occur or how many registers will be used in the final SASS. 
The reasons are:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;PTX optimizations&lt;/strong&gt; (CSE, dead code elimination, constant propagation) may reduce or inflate the virtual register count before &lt;code&gt;ptxas&lt;/code&gt; sees it.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&lt;code&gt;ptxas&lt;/code&gt; performs its own optimizations&lt;/strong&gt;: instruction scheduling, register coalescing, and live-range splitting that can substantially change the spilling outcome relative to a naive analysis of the PTX.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;SASS-level instruction scheduling&lt;/strong&gt; is interleaved with register allocation: &lt;code&gt;ptxas&lt;/code&gt; may reorder instructions to reduce live-range overlaps, but some reorderings may &lt;em&gt;increase&lt;/em&gt; register pressure if they bring two previously non-overlapping live ranges into conflict.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;This is why analyzing spills from PTX alone is insufficient: the PTX register count bears little relation to the SASS register count. Always inspect the &lt;code&gt;ptxas&lt;/code&gt; verbose output or the SASS disassembly.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-register-pressure-vs-occupancy-tradeoff-a-quantitative-view&#34;&gt;The Register Pressure vs. Occupancy Tradeoff: A Quantitative View&lt;/h2&gt;&#xA;&lt;p&gt;People will always ask: &amp;quot;How many registers per thread should I use?&amp;quot; The answer is: it depends. And there is a relation with our beloved &lt;a href=&#34;https://blog.melashri.net/posts/cuda-occupancy/&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;misleading&lt;/a&gt; metric of occupancy. The relationship between register pressure, occupancy, and performance is non-monotonic and workload-dependent. 
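To make the register/occupancy coupling concrete, here is a small CPU-side sketch. The SM parameters are assumptions (an Ampere-class consumer SM: a 65,536-entry register file, 32 threads per warp, a 48-warp cap), and it deliberately ignores allocation granularity, shared-memory limits, and block-size limits:

```cpp
// Assumed SM parameters (Ampere-class consumer GPU); adjust for your target.
constexpr int kRegistersPerSM = 65536; // 32-bit registers in the SM register file
constexpr int kThreadsPerWarp = 32;
constexpr int kMaxWarpsPerSM  = 48;    // hardware warp cap per SM

// Register-limited warp count: how many warps fit in the register file
// when each thread holds regsPerThread registers, capped by the hardware limit.
int maxWarpsPerSM(int regsPerThread) {
    int regLimited = kRegistersPerSM / (regsPerThread * kThreadsPerWarp);
    return regLimited > kMaxWarpsPerSM ? kMaxWarpsPerSM : regLimited;
}

// maxWarpsPerSM(42)  == 48: still at the cap, matching the "max 48 warps" bucket.
// maxWarpsPerSM(64)  == 32, maxWarpsPerSM(128) == 16: heavier kernels lose occupancy.
```

This is a first-order estimate only; real occupancy also depends on register allocation granularity and shared-memory usage, so confirm the exact buckets with the occupancy section in Nsight Compute.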
Consider a kernel with arithmetic intensity &lt;math display=&#34;block&#34; style=&#34;display:inline-block;&#34;&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;/math&gt; (FLOPs per byte of memory traffic):&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Memory-bound kernels&lt;/strong&gt; (&lt;math display=&#34;block&#34; style=&#34;display:inline-block;&#34;&gt;&lt;mrow&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;mtext&gt;ridge&lt;/mtext&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt;): Performance scales with occupancy because the SM needs many warps in flight to saturate memory bandwidth. Spilling a few registers to increase occupancy from 50% to 75% can yield a net speedup, even though each individual thread is slower.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Compute-bound kernels&lt;/strong&gt; (&lt;math display=&#34;block&#34; style=&#34;display:inline-block;&#34;&gt;&lt;mrow&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;mo&gt;&amp;gt;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;q&lt;/mi&gt;&lt;mtext&gt;ridge&lt;/mtext&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/math&gt;): Performance scales with per-thread throughput. Additional warps provide diminishing returns because the SM&#39;s compute pipelines are already saturated. Here, spilling &lt;em&gt;hurts&lt;/em&gt;: each spill load occupies a memory pipeline slot that could be used for useful data, and the stall cycles directly reduce throughput.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Latency-bound kernels&lt;/strong&gt; (insufficient parallelism to hide any latency): Occupancy is critical, and moderate spilling is acceptable as long as the spill traffic hits L1 cache.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Roofline_model?useskin=vector&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;roofline model&lt;/a&gt; provides a framework for this analysis. 
At the ridge point where compute and memory ceilings intersect, the optimal register allocation strategy changes qualitatively.&lt;/p&gt;&#xA;&lt;h2 id=&#34;spills-and-the-l1-cache&#34;&gt;Spills and the L1 Cache&lt;/h2&gt;&#xA;&lt;p&gt;A critical architectural detail: spilled values go to local memory addresses, but these addresses are cached in the &lt;strong&gt;L1 data cache&lt;/strong&gt; (unified with shared memory on Volta+ architectures). If a warp spills a value and reloads it shortly after, the reload will likely hit L1 with a latency of ~30 cycles rather than the ~200+ cycles of an L2 or DRAM access.&lt;/p&gt;&#xA;&lt;p&gt;This means that &lt;strong&gt;not all spills are equally expensive&lt;/strong&gt;. A spill-reload pair within a tight loop, where the reloaded value stays hot in L1, costs ~30 cycles per access. Painful, but manageable. A spill at the top of a long computation with a reload at the bottom, where intervening memory traffic has evicted the spilled value from L1, costs 200–800 cycles. Devastating.&lt;/p&gt;&#xA;&lt;p&gt;&lt;code&gt;NVCC&lt;/code&gt;&#39;s spill heuristics attempt to account for this by preferring to spill values with short spill-reload distances (likely L1 hits) over values with long distances (likely L1 misses). There are also cases where the compiler may choose to spill a value that is only used once after a long computation, accepting the high latency because the alternative (keeping it live in a register) would cause even worse performance due to occupancy reduction. Another source of pressure is 64-bit data: double-precision and 64-bit integer values (&lt;code&gt;double&lt;/code&gt;, &lt;code&gt;long long&lt;/code&gt;) require two consecutive 32-bit registers (a &amp;quot;register pair&amp;quot;), which can easily push a kernel over the register limit and cause spills. 
This means a kernel using &lt;code&gt;double&lt;/code&gt; arithmetic faces roughly twice the register pressure of an equivalent &lt;code&gt;float&lt;/code&gt; kernel. On architectures where double-precision throughput is already reduced (consumer GPUs: 1/32 of FP32 rate on Ampere), the additional register pressure from spilling compounds the performance penalty. The compiler must also respect alignment constraints for register pairs, further restricting allocation flexibility and increasing the likelihood of spills.&lt;/p&gt;&#xA;&lt;p&gt;The full sequence, from source to spilled SASS, is:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;C++ Frontend&lt;/strong&gt; (&lt;code&gt;cudafe++&lt;/code&gt;): Parses CUDA, separates host/device code.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Device IR Optimization&lt;/strong&gt;: Inlining, loop unrolling, constant propagation, all of which can dramatically change register pressure.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;PTX Generation&lt;/strong&gt; (&lt;code&gt;cicc&lt;/code&gt;): Produces PTX with virtual (unlimited) registers.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;PTX Optimization&lt;/strong&gt; (&lt;code&gt;ptxas&lt;/code&gt; frontend): CSE, dead code elimination, peephole optimizations on PTX.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Liveness Analysis&lt;/strong&gt; (&lt;code&gt;ptxas&lt;/code&gt;): Computes live ranges for all virtual registers.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Interference Graph Construction&lt;/strong&gt;: Builds the conflict graph.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Graph Coloring with Spilling&lt;/strong&gt;: &lt;code&gt;Chaitin-Briggs&lt;/code&gt; variant allocates physical registers, introduces spills.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Spill Code Insertion&lt;/strong&gt;: &lt;code&gt;STL&lt;/code&gt;/&lt;code&gt;LDL&lt;/code&gt; instructions are inserted.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Post-Allocation Scheduling&lt;/strong&gt;: Instructions (including spill code) are 
scheduled to hide latencies.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;SASS Emission&lt;/strong&gt;: Final machine code with concrete register assignments and spill instructions.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;Understanding this pipeline, and knowing where in it to intervene (source restructuring at step 2, &lt;code&gt;__launch_bounds__&lt;/code&gt; at step 7, &lt;code&gt;--maxrregcount&lt;/code&gt; at step 7, manual PTX at step 3), is the key to effective register spill analysis and optimization on NVIDIA GPUs.&lt;/p&gt;&#xA;&lt;p&gt;As a final remark, I want to emphasize that register spilling is not inherently bad. It is a compiler-managed tradeoff between per-thread performance and SM-level parallelism. The goal is not to eliminate all spills, but to ensure that the spilling pattern aligns with the kernel&#39;s computational characteristics. A memory-bound kernel can afford, and may even benefit from, moderate spilling to increase occupancy. A compute-bound kernel with complex register-heavy arithmetic should be tuned to minimize spills, even at the cost of reduced occupancy. The tools exist to measure both: &lt;code&gt;ptxas -v&lt;/code&gt;, &lt;code&gt;cuobjdump -sass&lt;/code&gt;, and Nsight Compute metrics. The optimization loop is: measure register count and spill volume, profile actual performance, adjust source structure or compiler hints, and measure again.&lt;/p&gt;&#xA;</content>
  </entry>
  <entry>
    <title>Particle Data Group website crashes VSCode</title>
    <link href="https://blog.melashri.net/micro/vscode-pdg-crash/" rel="alternate" type="text/html"></link>
    <id>https://blog.melashri.net/micro/vscode-pdg-crash/</id>
    <author>
      <name>Mohamed Elashri</name>
    </author>
    <published>2026-03-01T00:00:00Z</published>
    <updated>2026-03-01T00:00:00Z</updated>
    <summary>VSCode crashes when copilot tries to visit the PDG website</summary>
    <content type="html">&lt;p&gt;I was working on a physics analysis and coding inside &lt;code&gt;VSCode&lt;/code&gt;, and sometimes I use copilot beyond code completion. The agent mode can be useful with the tool calling like web search, so I ask copilot agent some questions and explicitly ask it to search the web and return the answer (and source). I understand that I probably can just search using the browser myself but the temptation of not leaving the same window is too big to resist.&lt;/p&gt;&#xA;&lt;p&gt;So I was in the middle of creating a usual fit template using &lt;code&gt;RooFit&lt;/code&gt; where I needed to fix the fit parameters to the PDG values for some particles. So instead of changing to the browser and search for the PDG values, I just asked copilot agent to do it for me. I asked it to search for the PDG values for the &lt;math display=&#34;block&#34; style=&#34;display:inline-block;&#34;&gt;&lt;mrow&gt;&lt;mi&gt;J&lt;/mi&gt;&lt;mo lspace=&#34;0em&#34; rspace=&#34;0em&#34;&gt;⁄&lt;/mo&gt;&lt;mi&gt;ψ&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt;  and &lt;math display=&#34;block&#34; style=&#34;display:inline-block;&#34;&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;η&lt;/mi&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;/msub&gt;&lt;mo form=&#34;prefix&#34; stretchy=&#34;false&#34;&gt;(&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mi&gt;S&lt;/mi&gt;&lt;mo form=&#34;postfix&#34; stretchy=&#34;false&#34;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt; mass and natural width.&lt;/p&gt;&#xA;&lt;p&gt;So far, I noticed that it did something and prompted Fetch web pages dialog where it asked me to &lt;em&gt;&lt;strong&gt;Allow and Review&lt;/strong&gt;&lt;/em&gt; or &lt;em&gt;&lt;strong&gt;skip&lt;/strong&gt;&lt;/em&gt;. I clicked Allow and Review, but then VSCode window crashed immediately. At first, I thought it was just a coincidence, but I tried again and it happened again. I found it funny and wonder if this was this particular page that it tried to visit that caused the crash. 
So I tested the hypothesis by asking the copilot agent different questions that explicitly or implicitly require it to visit the PDG website, and I found that it always crashes when it tries to visit the PDG website. I also tried asking the copilot agent to search for other things that are not related to PDG, and it works fine without any crash. So I am pretty sure that it is something about the PDG website that causes the crash.&lt;/p&gt;&#xA;&lt;p&gt;An example of the question I asked is &amp;quot;Can you search in the PDG website what is the mass if J/psi particle ?&amp;quot; Something simple, as shown in the following screenshot:&lt;/p&gt;&#xA;&lt;img src=&#34;/images/micro/vscode_pdg/pdg.png&#34; alt=&#34;This will crash VSCode&#34; style=&#34;width: 50%;&#34;&gt;  &#xA;&lt;p&gt;Can you guess what happens if I click &amp;quot;Allow and Review&amp;quot;? Yes, you are right, it crashes &lt;code&gt;VSCode&lt;/code&gt; immediately. I have no idea why this happens, but I guess it might be something about the way the copilot agent tries to fetch the web page or maybe something about the PDG website itself. It is quite funny, and I hope it can be fixed in the future so that I can use the copilot agent to search for PDG values without crashing my &lt;code&gt;VSCode&lt;/code&gt;. I have filed a &lt;a href=&#34;https://github.com/microsoft/vscode/issues/298505&#34; target=&#34;_blank&#34; rel=&#34;nofollow noreferrer noopener&#34;&gt;bug report&lt;/a&gt; with the &lt;code&gt;VSCode&lt;/code&gt; team and I hope they can investigate and fix this issue soon.&lt;/p&gt;&#xA;&lt;p&gt;But I cannot resist the temptation to make a meme out of this and post it on CERN IT memes, the only place I like on the whole CERN Mattermost and the reason I open it every day. Also, the conspiracy theorist inside me cannot help but wonder if this is some kind of intentional sabotage by the PDG website to prevent people from using the copilot agent to access their data. 
Maybe they don&#39;t want people to easily access their data through the copilot agent and want them to go through the traditional way of searching in the browser. Who knows, maybe there is some hidden agenda behind this crash. Or maybe someone on the Copilot team is trying to take revenge on the field for not giving them an academic position in the past. Just kidding, but it is quite funny to think about the possibilities if I were a conspiracy theorist (or a theorist in general).&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: My non-particle-physics friends and readers might wonder what exactly the PDG website is and why it is important. The PDG website is the Particle Data Group website, a comprehensive database of particle physics data and information. It contains information about the properties of particles, such as their masses, lifetimes, decay modes, and so on. It is an essential resource for particle physicists and researchers in the field, as it provides a standardized reference for particle properties used in various analyses and calculations.&lt;/p&gt;&#xA;</content>
  </entry>
</feed>
