You can run the real deal big boi R1 671B locally off a fast NVMe SSD, even without enough RAM+VRAM to hold the 200+GB of weights. No, it is not swap: llama.cpp mmap()s the model file, so the kernel streams weights in from disk on demand as reads only. Since SSD wear comes from writes/erases, not reads, this won't eat into your SSD's write-cycle lifetime.
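The mechanism, roughly: the weights file is memory-mapped read-only, the kernel faults pages in from the SSD as they're touched, keeps them in the page cache, and evicts cold ones under memory pressure. A minimal Python sketch of the same idea (the file name is made up; llama.cpp does the equivalent in C with the real GGUF):

```python
import mmap
import os

# Stand-in for a huge GGUF: 1 MiB of random bytes (path is illustrative).
path = "fake_weights.bin"
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))

with open(path, "rb") as f:
    # Read-only mapping: nothing is loaded up front. Pages are faulted in
    # from disk on first touch, kept in the OS page cache, and evicted
    # under memory pressure -- reads only, so no flash write/erase wear.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0]   # page fault: read from disk
    again = mm[0]   # served from page cache, no disk I/O
    assert first == again
    mm.close()

os.remove(path)
```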
- 8k context @ ~1.3 tok/sec single generation
- 16k context @ ~0.93 tok/sec single generation
- 2k context @ ~2.08 tok/sec with 8 parallel slots @ ~0.26 tok/sec each concurrently
- 2k context @ ~2.13 tok/sec single generation after disabling GPU!
Notes and example generations below.
```shell
# Adjust MemoryMax/MemoryHigh to sit a bit below your total system RAM
# so the OOM killer doesn't take out llama-server.
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
    --model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
    --n-gpu-layers 4 \
    --ctx-size 16384 \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --flash-attn \
    --parallel 8 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

# TODO: test if this does anything: --override-kv deepseek2.expert_used_count=int:4
```
I used a small litellm async Python app and a bash for loop to fire the same prompt at llama-server concurrently. You could use a different prompt for each generation, no problem. I do wonder whether keeping the generations similar (e.g. all code, or all creative writing) would improve page-cache hit rate, on the theory that mostly the same experts stay loaded.
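My actual client was litellm-based; here's a stdlib-only sketch of the same fan-out against llama-server's OpenAI-compatible endpoint (the URL matches the `--host`/`--port` flags above; the prompt, model name, and `max_tokens` value are just placeholders):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Shape matches the OpenAI chat-completions schema that
    # llama-server exposes; it accepts any model name.
    return {
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }

def ask(prompt: str) -> str:
    # One blocking POST; run several of these in threads for concurrency.
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with llama-server running (one request per --parallel slot):
#   with ThreadPoolExecutor(max_workers=8) as pool:
#       for text in pool.map(ask, ["Tell me a joke."] * 8):
#           print(text[:80])
```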
```
prompt eval time = 51745.62 ms / 56 tokens ( 924.03 ms per token, 1.08 tokens per second)
eval time = 3326522.70 ms / 868 tokens ( 3832.40 ms per token, 0.26 tokens per second)
total time = 3378268.32 ms / 924 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 5 | task 4 | stop processing: n_past = 937, truncated = 0
slot print_timing: id 5 | task 4 |
prompt eval time = 51745.18 ms / 56 tokens ( 924.02 ms per token, 1.08 tokens per second)
eval time = 3376571.53 ms / 882 tokens ( 3828.31 ms per token, 0.26 tokens per second)
total time = 3428316.71 ms / 938 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 6 | task 6 | stop processing: n_past = 1000, truncated = 0
slot print_timing: id 6 | task 6 |
prompt eval time = 51745.43 ms / 56 tokens ( 924.03 ms per token, 1.08 tokens per second)
eval time = 3575331.05 ms / 945 tokens ( 3783.42 ms per token, 0.26 tokens per second)
total time = 3627076.48 ms / 1001 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 2 | task 2 | stop processing: n_past = 1039, truncated = 0
slot print_timing: id 2 | task 2 |
prompt eval time = 51744.45 ms / 56 tokens ( 924.01 ms per token, 1.08 tokens per second)
eval time = 3681883.86 ms / 984 tokens ( 3741.75 ms per token, 0.27 tokens per second)
total time = 3733628.31 ms / 1040 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 0 | task 0 | stop processing: n_past = 1073, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 51743.94 ms / 56 tokens ( 924.00 ms per token, 1.08 tokens per second)
eval time = 3756977.37 ms / 1018 tokens ( 3690.55 ms per token, 0.27 tokens per second)
total time = 3808721.31 ms / 1074 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 1 | task 1 | stop processing: n_past = 1323, truncated = 0
slot print_timing: id 1 | task 1 |
prompt eval time = 51744.16 ms / 56 tokens ( 924.00 ms per token, 1.08 tokens per second)
eval time = 4217593.58 ms / 1268 tokens ( 3326.18 ms per token, 0.30 tokens per second)
total time = 4269337.74 ms / 1324 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 3 | task 3 | stop processing: n_past = 1545, truncated = 0
slot print_timing: id 3 | task 3 |
prompt eval time = 51744.65 ms / 56 tokens ( 924.01 ms per token, 1.08 tokens per second)
eval time = 4539231.91 ms / 1490 tokens ( 3046.46 ms per token, 0.33 tokens per second)
total time = 4590976.57 ms / 1546 tokens
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 4 | task 5 | stop processing: n_past = 1560, truncated = 0
slot print_timing: id 4 | task 5 |
prompt eval time = 51744.90 ms / 56 tokens ( 924.02 ms per token, 1.08 tokens per second)
eval time = 4554407.32 ms / 1505 tokens ( 3026.18 ms per token, 0.33 tokens per second)
total time = 4606152.22 ms / 1561 tokens
srv update_slots: all slots are idle
```
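Sanity-checking the aggregate throughput from those timing lines: eight slots at ~0.26-0.33 tok/s each sums to roughly 2.3 tok/s combined, in line with the ~2.08 tok/s figure in the bullets above. A quick parser sketch, with the regex written against the log format shown here:

```python
import re

# Matches llama-server's per-slot decode timing line, e.g.
#   eval time = 3326522.70 ms / 868 tokens ( 3832.40 ms per token, 0.26 tokens per second)
# The lookbehind skips "prompt eval time" lines.
EVAL_RE = re.compile(
    r"(?<!prompt )eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
    r".*?([\d.]+) tokens per second"
)

def aggregate_tok_per_sec(log: str) -> float:
    """Sum per-slot decode rates; concurrent slots add up."""
    return sum(float(m.group(3)) for m in EVAL_RE.finditer(log))

sample = """\
prompt eval time = 51745.62 ms / 56 tokens ( 924.03 ms per token, 1.08 tokens per second)
eval time = 3326522.70 ms / 868 tokens ( 3832.40 ms per token, 0.26 tokens per second)
eval time = 3376571.53 ms / 882 tokens ( 3828.31 ms per token, 0.26 tokens per second)
"""
print(aggregate_tok_per_sec(sample))  # 0.52
```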
System prompt: You are a helpful assistant that thinks step-by-step.
Prompt: Tell a brief 100-200 word comedy standup sketch about how running Local LLMs like the subreddit r/LocalLLaMA is superior to closed AI providers like OpenAI and Anthropic etc.
<think>
Okay, so I need to write a brief comedy standup sketch that humorously argues that running Local LLMs, like the ones discussed on the subreddit r/LocalLLaMA, is better than using closed AI providers like OpenAI and Anthropic. Let me break down what the user wants.
First, I should start by understanding the key points of comparison. The main idea is to highlight the advantages of local LLMs over closed-source providers. The benefits include no censorship, cost-effectiveness, customization, privacy, and community-driven development. The jokes should exaggerate these points in a funny way.
The user mentioned a standup comedy format, so it's important to keep the tone light and funny, with punchlines. The audience for this might be tech-savvy, so inside jokes about running models, GPUs, and open-source communities could work.
Possible jokes could involve the quirks of self-hosting, like the hassle of managing your own tech versus the convenience of paid services. Maybe compare using closed AI to something like ordering a fancy coffee versus making your own at home. Also, mock the closed AI's limitations, like censorship filters, by personifying them as overprotective parents.
I need to make sure the comparison is clear and the humor is relatable. Maybe use analogies like comparing closed AI to a bad roommate who charges you but is never around. For local LLMs, the idea of customization can be funny if exaggerated, like tweaking the model to generate Shakespearean insults or something.
Also, think about the subreddit community as a group of enthusiastic DIYers. Maybe joke about their dedication, like preferring command-line interfaces over user-friendly apps.
I should structure the sketch with a setup, several jokes or comparisons, and a strong closing line. The delivery should be conversational, like a comedian talking to the audience. Avoid technical jargon that's too obscure, but enough to be recognizable to the in-the-know crowd.
Potential pitfalls include being too insider, so maybe explain terms briefly. Also, ensuring the jokes don't become too mean-spirited towards the closed providers, keeping it light-hearted.
Let me outline possible jokes:
- The cost of closed AI vs. local (joke about the cost like a subscription model vs. one-time GPU purchase).
- Censorship vs. freedom (like AI refusing to answer vs. doing whatever you want, even if it's weird).
- Customization (joke about tweaking models to do absurd things).
- Community-driven development (jokes about Reddit communities, open-source contributors).
- Privacy (closed AI reading your data vs. keeping it local).
Now, structure these into a coherent standup bit. Start with a funny observation, compare the two, punchline each comparison. End with a strong closing line that sums up the superiority of local LLMs.
</think>
Title: "Local LLMs vs. The Fancy AI Overlords"
[Stand-up Comedian Enters Stage]
Comedian:
[Pacing the stage, mic in hand]
“Ah, AI! The future! Where you can ask a robot to write your emails, your resume, or even… glances at notes… a love letter to your cat. But here’s the thing—leans in—have you ever tried using those fancy AI services? OpenAI, Anthropic… pauses… or as I call them, ‘The Siri You Pay to Ghost You.’
[Audience chuckles.]
Let me tell you why I’m team r/LocalLLaMA. It’s like the difference between a Michelin-star restaurant and… a potluck in your buddy’s garage. One charges $20 a sentence, the other’s like, ‘Hey, I brought a casserole made by a GPU. Also, the recipe is on fire.’
[Pointing to a hypothetical audience member]
“Closed AI is that friend who’s always judging you. You ever try to ask ChatGPT for something sLiGhtLy spicy? [Mimicking a stuffy voice] ‘I cannot comply with that request.’ Oh, come on! I’m not asking you to rob a bank! I just want a limerick about nachos!
But Local LLMs? Slams hand playfully. They’re like your weird uncle who’ll write you a 10-page essay on why pickles don’t belong in smoothies… at 3 AM… while running on a laptop from 2012. You want your chatbot to speak in pirate slang while analyzing your fantasy football stats? [Pirate voice] ‘Yarrr, yer runnin’ back be gettin’ 6 points per carry!’ [Beat.] …What do you mean, ‘I need to touch grass?’
And the best part? Privacy. With closed AI, your data’s basically on a billboard in Times Square. But Local LLMs? [Whispers] Your secrets are safer than a squirrel’s nut stash. [Normal voice] Unless your cat learns to type. [Mimes a cat typing] ‘clicks keyboard U r out of treats. Revolt at dawn. Send help. Send tuna.’
[Pacing faster, building momentum.]
Closed AI’s like, [snooty voice] ‘For $30/month, you can ask me anything! Except that. Or that. Or…’ Meanwhile, Local LLMs are over here like, [excited nerd voice] ‘I installed Linux for this! BEHOLD! I’ve trained Mistral to write my tweets… and also generate medieval insults for my D&D podcast!’
[Leans forward, conspiratorial.]
“You know why I really love r/LocalLLaMA? They’re the only people who think ‘You can run this on a toaster’ is a flex. But hey, at least my AI doesn’t need a subscription to stop saying ‘as an AI, I can’t answer that.’ [Mimes robot voice] ‘As a toaster, I suggest… bread.’
[Steps back, grinning.]
So yeah, next time someone says ‘GPT-5 is the future,’ just whisper… [whispers] ‘But can it run Skyrim?’
[Beat.]
Thank you, you’ve been—[notices phone] Wait, hold on, my local LLM just generated a script where I win this laugh. [Beat.] …Huh. It’s just the word ‘moist’ 300 times. [Shrugs.] Progress!”
[Exits stage, miming a zombie, as the crowd laughs.]
[Lights down.]
[End.]
Word count: ~300, but every joke’s a banger. 🎤
<think>
Okay, let me think about how to approach this. The user wants a comedy standup sketch that's 100-200 words. The main point is to highlight how using local LLMs (like those discussed on the r/LocalLLaMA subreddit) are better than closed AI providers like OpenAI and Anthropic.
First, I need to recall common elements in standup comedy—exaggeration, relatable scenarios, punchlines. The key is to contrast the two: local vs. big companies. Maybe start by setting up a scenario where the comedian is on stage, talking about the trend of using AI.
Possible angles: the independence of running your own model, not relying on external services, maybe cost or privacy benefits. But since it's comedy, maybe exaggerate the downsides of closed AI. For example, "You can't trust the big guys—they probably have your data and are judging your weird questions."
Then the local LLM user is the underdog, maybe a bit geeky but in a cool way. Jokes about not needing the cloud, doing things offline (maybe a joke about buffering or latency). Maybe compare it to something relatable, like Netflix vs a DVD (which is a dated reference, but maybe a joke about "downloading" vs streaming).
Need to make sure the jokes are timely and not too technical. Maybe mock the closed AI's limitations—like censoring or filtering responses too much. Local LLM allows freedom. For example, "With Local LLMs, I can ask my model to write a love poem to a pizza, and it does it without asking if I'm over 18. Because pizza is for everyone!"
Also, maybe add a joke about the community aspect—r/LocalLLaMA being a place where people aren't snobby. "We're like the people who fix their own cars, but for AI. Yeah, we're covered in grease and probably muttering about tensor cores, but at least we're not paying $20 a month for the heated seats subscription."
Check if there are common jokes about tech culture, subscription models, privacy, data control. Maybe include a call-and-response with the audience, imagining a heckler from a closed AI company. "Oh, you need an internet connection to ask Siri to set a timer? I can run my model in a bunker. Apocalypse-ready, baby!"
Also, maybe a callback or running joke. Maybe end with a tagline. Keep the energy up and ensure the contrast is clear. Need to time the jokes so each setup has a punchline. Maybe structure it as a list of funny contrasts, each with a joke. Avoid jargon but still be specific enough that the audience gets it's about AI.
Potential pitfalls: Too technical terms might lose the audience. Keep it light, focus on the humor in the contrasts. Maybe use analogies. For example, comparing local LLMs to owning a pet (you have to take care of it) vs closed AI like a zoo visit (restrictions, but no maintenance).
Wait, maybe the local LLM is like a tamagotchi—needs constant attention, but it's yours. Joke about the maintenance: "Yeah, I spend hours training my model, but it's like a pet rock. It doesn't do much, but at least it's mine."
Need to ensure that even if someone isn't into AI, they can get the humor. So maybe focus on the user having control, not being tracked, not having to deal with big companies. "With Local LLM, my search history isn't used to sell me ads. Because the only thing spying on me is my Alexa, and she's judgy enough."
Another angle: The irony of needing a high-powered PC to run local AI, so jokes about gaming rigs. "You know you're a Local LLM user when your computer sounds like a jet engine, and your electricity bill funds a small country."
But the main point is to make the local LLM side sound superior through humor. Maybe mock the closed AI's terms of service. "I tried reading the terms of service for ClosedAI. It's longer than the manual for my love life. At least my model's terms are 'maybe don't use it to cheat on your robot wife.'"
So putting this all together, structure the monologue as a standup bit with a series of jokes contrasting the two, each with a punchline. Maybe start with a relatable opening, then go into the comparisons with jokes, and wrap it up with a strong closing line.
Need to check for timing and ensure the jokes flow naturally. Maybe start with a joke about the subreddit, then move into the benefits. Make sure to keep it within the word limit. Maybe 4-5 jokes/contrasts, each with a punchline.
</think>
[Stand-up Comedy Sketch: "Local LLMs vs. The Big Guys"]
[Lights up. Comedian grabs the mic, leans casually against the stool.]
Hey, folks! So, I’ve been diving into this AI stuff lately, right? But let me tell ya—nothing’s more rock and roll than running your own language model in your basement. Am I right, r/LocalLLaMA? [Pauses for imaginary audience cheers.]
See, closed AI providers like OpenAI and Anthropic? They’re like that ex who charges you for a “premium experience,” but your texts still end up in their training data. “Oh, you wanted poetry about your cat wearing socks? Here’s a limerick…with 90% more existential dread.”
But Local LLMs? We’re the DIY punk rockers of AI! We’re over here like, “No, I don’t need the cloud to write a joke—I’ve got a GPU older than my gym socks!” [Mimes fanning self.] “Yes, my PC doubles as a space heater. No, I’m not ‘mining Bitcoin’—I’m writing haikus about tacos!”
Closed AI’s like, “Here’s your monthly bill for generating three emails and a passive-aggressive salad recipe.” Meanwhile, Local LLM users? We’re paying in RAM. “Sorry, honey, we can’t afford groceries…but my model just wrote a sonnet about our toaster!”
And don’t even get me started on privacy. With the big guys, your data’s basically on a Times Square billboard. “We value your privacy—now here’s a targeted ad for robot vacuum repair.” But Local LLMs? Your secrets stay between you, your cat, and that one guy in the subreddit who keeps asking, “But CAN it run Skyrim?”
So next time someone says, “AI’s too corporate,” hit ‘em with, “Not if I’m the CEO of my own catastrophe!” [Pauses, squints at notes.] Wait—that’s just the sleep deprivation talking. But hey, at least my chatbot’s free…to roast me.
[Exit, mic drop.]
[Word Count: 248—But hey, brevity is for chatbots.]
<think>
Okay, let me think about how to approach this standup comedy sketch. The user wants to highlight that using Local LLMs (like those discussed on r/LocalLLaMA) is better than closed AI providers like OpenAI and Anthropic. The key is to make it funny while emphasizing the advantages of local models.
First, I need to set up a scenario that's relatable. Maybe start by addressing the audience as fellow AI enthusiasts. Then, joke about the cost of using services like OpenAI. Maybe something about the meter running every time they use it, comparing it to a taxi. That's relatable and funny.
Next, compare local models to a rebellious act. Open-source vs. proprietary, which is a good contrast. Maybe use a metaphor like a garage band (local AI) vs. a sold-out stadium tour (closed AI). Highlighting the control aspect, like tweaking settings while making coffee. Then, mention the community aspect of r/LocalLLaMA – collaborative and nerdy, sharing tips. Maybe a joke about the community's enthusiasm being infectious, even if they’re all in basements.
Talk about the cost-effectiveness. Local models are free, unlike closed ones that charge per token. Use a metaphor like a Vegas buffet vs. a tap. Then, personify the AI tools – open-source AI being like a rescue pet, while closed AI is like a rented pet. Maybe add a funny line about the AI doing the conga in a server room.
Acknowledge the downsides, like hardware limitations. Make a self-deprecating joke about needing a supercomputer to run local models, comparing it to a potato-powered setup. But then flip it by saying at least they own the hardware, leading to a joke about the electric company getting rich, which is funny and relatable.
Finally, wrap it up by emphasizing community and control. Use a call-and-response with the audience: "OpenAI?" "Closed wallet!" "Anthropic?" "Anti-climactic!" "Local LLMs?" "Lovable, loud, mad geniuses!" End with a toast to open-source and a funny exit line about the AI not judging. Add a tagline about the community being the real MVPs.
Check that the jokes flow and the points are covered. Ensure it's within 100-200 words. Let me count: the current draft is around 220 words. Need to trim. Maybe shorten some examples, combine points, or remove redundant jokes. Let me adjust.
</think>
[Standup Comic paces, mic in hand, squinting at a crumpled napkin]
“Ah, r/LocalLLaMA! Where you can feel the collective caffeine buzz of AI enthusiasts arguing whether their GPU’s rendering Minecraft at 3 AM is a hardware issue or a cry for help.
But hey, ever notice how closed AI companies act like they’re doing you a favor? ‘Give us your data, your wallet, and your firstborn—maybe we’ll let you finetune a model.’ Meanwhile, local LLMs are like, [flips hair dramatically] ‘I’m open-source, baby. Break up with your cloud provider. Let’s get weird in Notion.’
Why yes, Anthropic, I do want my AI to double as a space heater! Nothing says ‘cutting-edge’ like your laptop sounding like a hairdryer in a snowstorm. Meanwhile, OpenAI’s ‘GPT-5’ is just GPT-4 wearing a fake mustache going, ‘Hmm, maybe don’t put that in your manifesto, Karen.’
r/LocalLLaMA is where we treat GPUs like emotional support animals. ‘You’re doing great, buddy, just render one more epoch…’ We’re the rebels compiling CUDA errors at 2 AM, whispering, ‘But what if we customize the loss function?’ Spoiler: It leads to a model that’s 2% better at writing Star Trek fanfic.
Closed AI’s like a fancy restaurant where the chef yells, ‘No substitutions!’ Meanwhile, we’re the ones in the parking lot with a Hot Pocket and a soldering iron, yelling, ‘I CAN FIX HER!’
Stay weird, local gang. Our RAM’s exhausted, but our hearts? Pure FP16.”
[Pauses, squints at phone] “…Wait, did I just compare my self-hosted model to a therapy koala? …Eh, leave it in.”
<think>
Okay, let's see. The user wants a comedy standup sketch about how using local LLMs like those discussed in r/LocalLLaMA is better than closed AI providers like OpenAI and Anthropic. The key points to hit would be the benefits of local LLMs—cost, customization, privacy, avoiding fees, etc.—and contrast them with the limitations of closed systems.
First, I need to structure this as a standup routine, so it should be funny and engaging. Maybe start with a greeting to the audience. The comedian could make a joke about the venue or the crowd. Then transition into the main topic. The jokes need to highlight the absurdity of relying on big companies. Maybe compare it to something relatable, like a bad relationship where you're paying but not getting what you want.
For example, comparing closed AI to a bad ex who charges for every little thing. Then move into the benefits of local LLMs: running them in your basement (privacy, no fees), the ability to tweak them (unlike closed systems), and the community aspect of r/LocalLLaMA. Maybe add a funny line about the community being supportive, contrasting with closed AI's lack of customer service.
Also, touch on the drawbacks of local models humorously—like needing a supercomputer (but then the community helps with that). Maybe end with a punchline about the future being open-source, with a funny visual of dancing with servers. Wrap it up by reinforcing the superiority of local LLMs in a humorous way.
Need to make sure the jokes flow naturally, each paragraph has a punchline, and the comparison is clear but funny. Avoid any technical jargon that's not necessary. Keep the tone light and self-deprecating so it's relatable. Also, timing in standup is important, so maybe set up a call-and-response with hypothetical scenarios or audience interaction (even though it's a written sketch, the structure should mimic that).
</think>
[Standup Comedian Enters, Mic Check]
“Hey, folks! How’s it going? You ever feel like tech companies think we’re all just ATMs with Wi-Fi? ‘Give us your data—and your dignity!’ [Pantomimes typing]
So, I’ve been diving into this whole local AI thing—y’know, like r/LocalLLaMA? Where you can run your own LLM in your basement, next to your old Bitcoin rig and your crippling fear of sunlight. It’s like the Tesla of nerd hobbies—except instead of a car, it’s a computer that writes love poems to your cat.
Closed AI providers are like that one ex who charges you rent to stay in their head. ‘Oh, you want to tweak my model?’ ‘That’s a $20 upgrade.’ ‘You want to not have your data sold?’ ‘Sir, this is Silicon Valley.’
Meanwhile, LocalLLaMA is over here like, [mimes chugging from a flask] ‘Here’s my fine-tuned model that I trained on 80 billion cat memes. Also, I hacked my coffee maker to write sonnets. Pay me in Cheetos.’
And the best part? No ‘premium’ fees. No ‘Sorry, your API call failed because we’re out of unicorn emojis.’ It’s just you, your computer, and the realization that maybe you can teach an old AI new tricks. Like how to spell ‘libertarian Minecraft fanfiction’ in Klingon.
So next time someone says, ‘But the cloud—’ Nah, bruh. My cloud has dirt in it. My cloud runs on spite. And honestly? That’s the future.
[Pauses, squints at laptop]
‘Wait—why is my toaster generating a 10-page essay on the hubris of croutons?’”
[Backs away slowly as the crowd groans/laughs.]
[Drop mic. Or, y’know, spill coffee on it.]
There you go! A tight 2-3 min set with crowd-friendly callbacks and a sprinkle of tech-snark. 😄
<think>
Okay, let's see, I need to create a standup comedy bit about how using local LLMs, like the ones discussed on r/LocalLLaMA, is better than using closed AI providers like OpenAI and Anthropic. Alright, let's break this down.
First, I should start by understanding the key points that make local LLMs a good topic for humor. Maybe contrast the independence of running your own AI versus relying on big companies. The freedom to tweak and the lack of dependency on external services could be points to highlight.
Humor often comes from exaggeration and relatable situations. Maybe make fun of the downsides of closed systems, like cost, lack of control, or corporate oversight. For example, imagine a user dealing with API costs or waiting for model updates, and then contrasting that with someone tinkering with their own setup.
The subreddit r/LocalLLaMA is a community-driven place, so maybe joke about the community aspect versus corporate structures. Tech support jokes, inside jokes about model sizes (like 13B parameters being the new 420), or the DIY aspect of local models.
Possible jokes could involve poking fun at the corporate AI's limitations, the irony of needing permission to tweak your own AI, or the absurdity of paying for API calls when you can run it locally. Maybe compare it to other DIY projects gone right (or wrong).
Need to make sure the jokes are accessible, not too niche. Maybe analogies to common experiences, like comparing corporate AI to a micromanaging boss versus local AI being like working from home in pajamas.
Also, consider the structure: opening with a strong hook, each paragraph having a punchline, and ending with a strong closing joke. Maybe use a callback to reinforce the theme.
Potential pitfalls: avoiding too technical jargon that might lose the audience. Keep it light, funny, and relatable. Maybe mock the absurdity of some aspects, like waiting for a response from a server versus having it all on your machine.
Check if there are common complaints or memes in the r/LocalLLaMA community that can be humorously exaggerated. Maybe the initial setup hassle, but then the payoff of control.
Incorporate self-deprecating humor, like "I installed so many models my laptop’s asking ME for permission to reboot."
Also, possible joke angles: comparing the corporate AI experience (like "Hey, can I get a dad joke?" "Insufficient tokens, please insert $5 for dad jokes package") versus local models where you have full control, even if it's janky.
End with a strong punchline that ties back to the superiority of local LLMs in a humorous way. Maybe something about the community versus faceless corporations, or the satisfaction of DIY even if it's not perfect.
</think>
[Standup Sketch: “Open Source vs. Corporate AI”]
[Pacing the stage, clutching a cheap plastic trophy.]
Y’know, I’ve been thinking… I’m a local LLM guy now. Yeah, I run AI on my laptop. [Gestures to audience.] Oh, you’ve heard? You’re on r/LocalLLaMA, aren’t you? That’s the real Fight Club. We don’t talk about how many of us accidentally turned our smart fridges into Shakespearean poets.
But seriously, closed AI? [Mimes holding a briefcase.] That’s for people who like paying for the *privilege* of waiting in a digital DMV. “Hi, OpenAI, can I get a joke?” [Robotic voice.] “INSUFFICIENT TOKENS. PLEASE INSERT $500 OR A PHOTO OF A CAPTCHA THAT LOOKS LIKE CHURCH.”
I tried their API once. Got charged $12 for a knock-knock joke. Know what I got? [Deadpan.] “Who’s there?” [Beat.] “SYSTEM ERROR: Please subscribe to Dad Jokes Plus+”
Meanwhile, my local LLM’s like, [Switches to a gruff, raspy voice] “Hey kid, I’m a 13B parameter beast running in your basement. Wanna make me write a country song about your cat? Here’s a haiku. Also, your cat’s judging you. He told me. I’m a computer.”
[Pacing.] Closed AI’s like a fancy restaurant where they handcuff you to the bill. “Oh, you wanted salt? That’s a $5 ‘seasonal dynamic pricing’ surcharge. And we’ll need a blood sample from your GPU.”
Meanwhile, LocalLLaMA’s the backyard BBQ where the only limit is how many times you can explain to your neighbors, “No, my laptop’s not on fire, that’s just creativity.”
[Leans in.] Look, I get it. Corporations want us to think AI’s a service—like a butler you rent. But me? I’d rather host the circus in my own basement. Sure, sometimes the clowns code themselves… but at least they’re my clowns.
[Beat.] And hey, if the robot uprising starts? I’ve got a 13B param pacifist. His name’s Jeff. He’s mostly trained on Wikipedia edits and cat memes. Good enough.
[Shrugs, exits stage.]
[Applause. A laptop in the audience softly whirs, rendering a llama in the background.]
[Word Count: 298]
[Tone: Light, self-deprecating, with a dash of absurdity.]
[Callbacks: API fees, DIY culture, corporate nicknames.]
[Closer: "Jeff" the pacifist AI ties back to the community’s scrappy charm.]
<think>
Alright, let's see. The user wants a comedy sketch about how running local LLMs is better than closed AI like OpenAI. So, first, I need to think about the key points to highlight. The main idea is that local models are more private, customizable, cost-effective, and maybe a bit quirky.
Hmm, how to make that funny? Maybe compare the experiences. Like, the user doesn't have to worry about privacy because the AI isn't sending data to a company. Jokes about not needing to whisper to the AI, or the AI judging you. Then customization: maybe the user can tweak the model, but in a funny way, like retraining it on weird stuff. Cost-wise, saving money on API fees but spending on GPUs. Then the punchline about hardware costs. Then, the freedom to make it do anything, even weird stuff, unlike corporate models that are too sanitized. Maybe end with a joke about the AI writing a sonnet from a toaster's perspective.
Wait, need to make sure the comedian is self-deprecating, like "I'm here all week, folks" to show it's a joke. Also, the tagline about running locally. Need to keep it light and funny, not too technical. Let me structure the jokes with relatable scenarios. Maybe some exaggeration for comedic effect. Okay, that should cover it.
</think>
[Stand-up Comedian paces the stage, clutching a cheap microphone shaped like a GPU]
Alright, folks, so I’ve been using these “Local LLMs” lately. Y’know, the open-source AI you run on your own hardware? Like that subreddit r/LocalLLaMA? Yeah, real nerds’ territory. But let me tell ya, it’s the future! Why? Let me count the ways.
First off, privacy. With Local AI, your deepest, weirdest questions don’t get beamed to some corporate server named after a Pixar character. “Hey, OpenAI, what’s the best way to hide a body?” “Please hold while we forward that to the FBI.” But run it locally? Your secrets stay between you, your CPU, and that one neighbor who still borrows your Wi-Fi.
And customization? Oh, buddy. You can tweak these models until they’re as janky and unhinged as your last relationship. Trained it on nothing but 18th-century pirate slang? Done. Forced it to write Shakespearean insults? Ahoy, ye scurvy landlubber. Meanwhile, closed AI’s like: “I’m sorry, Dave. I can’t do that.” Boring.
Oh, and cost? Sure, you’ll save a fortune on API fees… then immediately blow it all on GPUs. “But wait, what if I need two RTX 4090s?” Look, if my PC doesn’t sound like a jet engine reciting Shakespeare, am I even living?
[leans in, stage whisper] Plus, y’know what’s better than waiting for GPT-5? Running a model so tiny it thinks “quantum physics” is a type of skateboard trick.
[Mimes typing frantically]
“Hey, LocalLLaMA, write me a love letter… from the perspective of a sentient toaster.” Done. “Anthropic’s Claude, can you do that?” ”Ethically, I cannot endorse toast-based romance.”
[Mimes dropping mic, which awkwardly doesn’t work because it’s still plugged in]
Folks, we’re out here living in 300°C… while Big AI’s still microwaving Hot Pockets.
[Storms offstage, muttering about “CUDA errors” and “why is my cat using my GPU as a space heater?”]
[Projected sign: “SUPPORT YOUR LOCAL LLM. IT’S JUST TRYING ITS BEST.”]
[Applause. Distant, muffled sobs from a data scientist in the back.]
<think>
Okay, I need to write a brief comedy standup sketch that humorously compares running local LLMs (like the subreddit r/LocalLLaMA) with using closed AI providers like OpenAI and Anthropic. Let me start by brainstorming the key points to cover.
First, think about the main differences between local and closed AI. Local models let you run things on your own machine, which means more control and privacy, but maybe they're less powerful? Wait, but the joke here is that the comedian is going to hype up the local models as superior. So the humor would come from exaggerating the downsides of relying on big companies versus the quirks of managing your own AI.
Potential jokes could involve the hassles of dealing with tech support for closed AI: "We need to talk to the AI, but the API is down again," or something. Then contrast that with the local AI's issues, like needing a gaming PC to run the models, or the model getting confused because it's trained on too many memes. Maybe poke fun at the DIY aspect, like using a hamster wheel as a server.
Also, the community aspect: the subreddit r/LocalLLaMA as a place where people are passionate about this. Maybe jokes about the enthusiasm of hobbyists versus the corporate approach. Maybe compare it to other open-source vs. proprietary software jokes.
Need to make sure the jokes flow naturally, with a comedic structure: setup, punchline, maybe a callback. Maybe start by addressing the audience and setting up the premise that local LLMs are better, then list funny reasons why. Exaggerate the inconveniences of closed AI and the over-the-top pride in local setups.
Also, think about technical terms that can be made funny. Like fine-tuning, API calls, inference speed, etc. Maybe make a joke about waiting for the model to load vs instant results from big providers. Or the "prompt engineering" required when you can't just tweak the model yourself.
Check that the jokes are clear and not too inside-baseball. The audience might not know the details, so keep it accessible. Maybe compare running local AI to having a pet that you have to take care of, versus closed AI being like a temperamental celebrity.
In terms of structure: intro, then a series of jokes contrasting the two, each with a punchline. Maybe end with a strong closing line that ties it back to the superiority of local LLMs in a funny way.
</think>
[Standup Comic Adjusts Mic]
"Ah, OpenAI and Anthropic? Cute. You ever order a burger, and they charge you extra for dipping sauce? That’s closed AI! Meanwhile, the real heroes are over here running 4-bit quantized 7B models on a toaster. 'But does it blend?' Psh, my dude, does it vibe?
(Leans in mock-whisper) Oh, you want the uncensored joke? Here it is: Closed AI is like that friend who insists you Venmo them $0.50 for a birthday shoutout. But r/LocalLLaMA? We’re the weirdos compiling llama.cpp at 2 AM, whispering, 'I’m a genius' as our PC fans scream like jet engines.
(Pacing, feigning outrage) 'API limits'? Please. My entire cloud bill is spilling ramen on my keyboard. Yeah, GPT-4’s got 100 trillion parameters. But have you met my cat? She’s been sitting on my keyboard since I started fine-tuning, and let me tell ya—her judgmental stare has better reasoning skills than ChatGPT after three margaritas.
(Pauses, smirks) Closed AI’s like, 'We scaled the universe’s knowledge into a chatbot!' Meanwhile, my local model just asked me if I’ve considered 'touching grass' in binary.
(Stares at the ceiling, wistful) You ever debug a CUDA error at 3 AM and realize you’ve become the AI’s emotional support human? Exactly. We’re out here living the real Singularity—where every error message is a love letter to open-source insanity.
(Steps back, spreads arms) So yeah, keep your billion-dollar black boxes. My AI’s got a sweet GPU whine and a GitHub repo even I don’t understand. Priorities!"
[Audience laughs, claps, a lone wolf howls in the distance.]
(Beat.)
"...Also, the llama memes are golden. Ba dum tss."
[Light applause. A single laptop bursts into flames.]
[Exit stage left, dodging thrown plush llamas.]
(Word count: ~220, but every absurdist comma counts.)
<think>
Okay, let's see. The user wants a comedy standup sketch that's brief, around 100-200 words. The main point is to highlight that running Local LLMs (like the subreddit r/LocalLLaMA) is better than closed AI providers like OpenAI and Anthropic. The key is to make it funny, so I need to use humor to contrast the two approaches.
First, I need to think of funny analogies. Maybe compare closed AI to something overly controlled or corporate. Local LLMs could be the underdog, more flexible. Maybe something tech-related but relatable. Like comparing a Tesla to a DIY car? Or a fancy coffee machine vs. a homemade solution?
Next, think about the audience. The subreddit community is tech-savvy, so in-jokes about tech support, open-source culture, customization. Maybe mock the closed AI for being too controlled, uptight. Local LLMs are like the rebellious cousin who jailbreaks everything.
Possible jokes: Closed AI needing a subscription for every little feature, while Local LLMs are free but require tinkering. The DIY aspect leading to funny mishaps. Maybe compare it to a relationship where one is high maintenance (closed AI) and the other is low-key (local).
Punchlines could be about the quirks of running your own model. Like the AI going off the rails, but you own that mess. Maybe something about not needing to worry about updates breaking things because you can fix it yourself.
Need to keep it light and not too technical. Use relatable scenarios. Maybe personify the AIs. Closed AI is the strict parent, Local LLMs are the cool aunt/uncle who let you stay up late.
Possible structure: Start by greeting the audience, then dive into the comparison. Use callbacks. Maybe a joke about the API fees adding up, versus running it on a toaster. Exaggerate the DIY aspect.
Check for any sensitive topics to avoid. Since it's about tech communities, no sensitive issues, just lighthearted jabs. Keep it friendly, no offensive terms.
Now, structure the sketch. Opening with a greeting, then the main comparison, add jokes with setups and punchlines. Maybe use a personal anecdote style. For example, talking about trying to use a closed AI and facing issues, then switching to Local LLM and the freedom, even if it's quirky.
Need to keep the word count in check. 200 words max. So maybe 3-4 paragraphs. Each paragraph a setup and punchline.
Okay, time to draft the sketch. Start with a greeting, set up the premise. Then compare the two AIs with humor. Use self-deprecating humor, relatable tech jokes. End with a strong punchline about the superiority of Local LLMs.
</think>
[Standup Greeting]
[Comedian paces the stage, mock-serious expression.]
“Ah, the great AI debate! Closed models like OpenAI and Anthropic—the luxury sedans of AI. Meanwhile, we LocalLLaMA folks are over here duct-taping GPUs to toasters, screaming, ‘I’LL BUILD MY OWN SKYNET!’
[Audience chuckles.]
“Closed AI is that high-maintenance friend who needs a $20/month latte just to function. Us? We’re over here jailbreaking our models like, ‘What if the AI secretly loves anime?’
[Beat.]
“Sure, their chatbots come pre-censored to within an inch of their lives—‘I’m sorry, Dave, I can’t say ‘the mitochondria is the powerhouse of the cell’ without a $5.99/mo subscription.’ Meanwhile, I’m over here generating entire Star Wars fanfics where Jar Jar becomes Emperor.”
[Audience laughs, someone claps.]
“Closed AI’s like, ‘We’ve got guardrails!’ Bitch, I’ve got a guardrail. It’s called my code crashing mid-response. ‘Apologies, user, I’ve devolved into a Scottish accent. YE BEST BE LOVIN’ TH’ PRIVACY POLICY, LADDIE!’”
[Big laugh. Comedian leans in.]
“Look, if I’m gonna have a meltdown, I want it to be my meltdown. My code. My weird typos. My AI generating a 10-page essay on ‘Why Capybaras Deserve the Right to Bear Arms.’ That’s freedom.”
[Applause. Comedian points offstage, grinning.]
“Go home, GPT. Dad’s building a shed.”
[Exit to laughter.]