Suraj ghishadow

UNIX famously uses fork+exec to create processes, a simple API that is nevertheless quite tricky to use correctly and that comes with a bunch of problems. The alternative, spawn, as used by VMS, Windows NT and recently POSIX, fixes many of these issues but it overly complex and makes it hard to add new features.

prepare() is a proposed API to simplify process creation. When calling prepare(), the current thread enters “preparation state.” That means, a nascent process is created and the current thread is moved to the context of this process, but without changing memory maps (this is similar to how vfork() works). Inside the nascent process, you can configure the environment as desired and then call prep_execve() to execute a new program. On success, prep_execve() leaves preparation state, moving the current thread back to the parent's process context and returns (!) the pid of the now grownup child. You can also use prep_exit() to abort the child without executing a new process, it similarly returns the pid

These are simplified examples I came up with to go along with William Woodruff's excellent blog post on x86-64 addressing modes.

All the examples are compiled with GCC -O3 or -Os, which is indicative of what the compiler thinks the best option is in practice.

Base + Index

Use case: Indexing into a byte array (whose address is not fixed)

char f(char buf[], int index)

Why did you do this?

Relax, I only have one Sunday to work on idea, literally my weekend project. So I tried Deepseek to see if it can help. Surprisingly, it works and it saves me another weekend...

What is your setup?

Just chat.deepseek.com (cost = free) with prompts adapted from this gist.

Does it work in one-shot or I have to prompt it multiple times?

RISC-V Vector Extension for Integer Workloads: An Informal Gap Analysis

Note: To verify my RVI membership and idenity on this otherwise semi anonymous account: I'm Olaf Bernstein, you should be able to view my sig-vector profile, if you are a member of the vector SIG.

The goal of this document is to explore gaps in the current RISC-V Vector extensions (standard V, Zvbb, Zvbc, Zvkg, Zvkn, Zvks), and suggest instructions to fill these gaps. My focus lies on application class processors, with the expectation that suggested instructions would be suitable to become mandatory or optional instructions in future profiles.

I'll assume you are already familiar with RVV, if not, here is a great introduction and here the latest RISC-V ISA manual.

10x Optimization! An installer creation adventure

Background: creating an installer tool

Lately I've been working on rInstallFriendly v2.0, my simple and easy-to-use tool to create fancy software installers.

rInstallFriendly v2.0, some of its multiple visual styles available. Style is fully customizable!

FAQ on the xz-utils backdoor (CVE-2024-3094)

This is a living document. Everything in this document is made in good faith of being accurate, but like I just said; we don't yet know everything about what's going on.

Update: I've disabled comments as of 2025-01-26 to avoid everyone having notifications for something a year on if someone wants to suggest a correction. Folks are free to email to suggest corrections still, of course.

Background

Executing code on a GPU vs CPU

I have had working knowledge of GPUs for a very long time now. I remember playing around with NVIDIA RIVA 128 or something similar with DirectX when they were still 3D graphics accelerators. I have also tried to keep up with the times and did some basic shader programming on a contemporary NVIDIA or AMD GPU.

However, today's GPUs are necessary for another reason – the explosion of AI workloads, including large language models (LLMs). From a GPU perspective, AI workloads are just massive applications of tensor operations such as matrix addition and multiplication. However, how does the modern GPU execute them, which is much more efficient than running the workloads on a CPU?

Consider CUDA, NVIDIA's programming language that extends C to exploit data parallelism on GPUs. In CUDA, you write code for CPU (host code) or GPU (device code). CPU code is just mostly plain C, but CUDA extends the language in two ways: it allows you to define functions for GPUs (kernels) and also prov

TL;DR

When Riot Games introduces the Vanguard anti-cheat to League of Legends, you should STOP playing and you must NOT install the anti-cheat when you get the pop-up. Vanguard is a kernel-level anticheat and these anticheats operate at a privilege level HIGHER THAN YOUR OWN. The anti-cheat can do things that even YOU can't do, without asking or letting you know. It's like Riot installing a camera in every room of your house and getting a copy of every key inside.

Here are just a few examples of what they can do:

GPU Driven renderer design for Godot 4.x

Goals:

The main goal is to implement a GPU driven renderer for Godot 4.x. This is a renderer that happens entirely on GPU (no CPU Dispatches during opaque pass).

Additionally, this is a renderer that relies exclusively on raytracing (and a base raster pass aided by raytracing).

It is important to make a note that we dont want to implement a GPU driven renderer similar to that of AAA/Unreal, as example.

	# LLVM 17.0.1 Passes


	## default<O0>

	always-inline,
	coro-cond(
	coro-early,
	cgscc(
	coro-split