Peter pszemraj

@pszemraj
pszemraj / create_unified_mcqa.py
Created April 18, 2025 19:20
multiple‑choice dataset aggregator
#!/usr/bin/env python
"""
create_unified_mcqa.py – "batteries-included" multiple-choice aggregator
Handles all datasets listed in the conversation
Survives missing/renamed columns
Converts every `label` to pure int64 to avoid ClassLabel clashes
Explicitly casts features to ensure concatenation compatibility
Improved error handling and skipping for malformed examples
Limits warning/info messages per dataset
Fixes column mismatch error during cast
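The normalisation steps listed above (tolerating missing or renamed columns, coercing every label to a plain integer, skipping malformed examples) can be sketched in plain Python. This is a minimal illustration, not the gist's code: the alias tuples, `first_present`, `to_int_label`, and `normalize_example` are hypothetical names, and treating digit strings as 0-based indices is an assumption that will not hold for every dataset.

```python
from typing import Any, Dict, Optional, Sequence

# Hypothetical column aliases; the gist's real column names are not shown in this preview.
QUESTION_KEYS = ("question", "query", "prompt")
CHOICES_KEYS = ("choices", "options", "endings")
LABEL_KEYS = ("label", "answer", "answerKey")


def first_present(example: Dict[str, Any], keys: Sequence[str]) -> Any:
    """Return the value of the first key that exists and is not None, else None."""
    for key in keys:
        if key in example and example[key] is not None:
            return example[key]
    return None


def to_int_label(raw: Any, num_choices: int) -> Optional[int]:
    """Coerce an int, digit string, or single letter into a 0-based int label."""
    if isinstance(raw, bool):
        return None  # bool is a subclass of int, but is never a valid label
    if isinstance(raw, int):
        idx = raw
    elif isinstance(raw, str) and raw.strip().isdecimal():
        idx = int(raw.strip())  # assumption: digit strings are already 0-based
    elif isinstance(raw, str) and len(raw.strip()) == 1 and raw.strip().isalpha():
        idx = ord(raw.strip().upper()) - ord("A")  # "A" -> 0, "B" -> 1, ...
    else:
        return None
    return idx if 0 <= idx < num_choices else None


def normalize_example(example: Dict[str, Any], source: str) -> Optional[Dict[str, Any]]:
    """Map one raw record to a unified schema, or return None to skip it."""
    question = first_present(example, QUESTION_KEYS)
    choices = first_present(example, CHOICES_KEYS)
    if not isinstance(question, str) or not isinstance(choices, (list, tuple)) or not choices:
        return None
    label = to_int_label(first_present(example, LABEL_KEYS), len(choices))
    if label is None:
        return None
    return {
        "source": source,
        "question": question,
        "choices": [str(c) for c in choices],
        "label": label,
    }
```

Returning `None` for a malformed record lets the caller count and skip it instead of aborting the whole aggregation run.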
@pszemraj
pszemraj / async_pipeline.py
Last active April 7, 2025 22:51
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
reducto/RolmOCR model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts plain
text, and saves the combined text for each PDF into a corresponding .txt file
in the specified output directory.
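The concurrency pattern described above (one request per page, sent concurrently, results joined per PDF) can be sketched with the standard library alone. In this sketch `ocr_page` is a hypothetical placeholder that stands in for the actual request to the vLLM server, and `max_concurrency` is an assumed parameter name; the gist's own function names are not visible in this preview.

```python
import asyncio


async def ocr_page(page_index: int) -> str:
    """Placeholder for a single OCR request; a real version would call the server."""
    await asyncio.sleep(0)  # yield control to the event loop, as a network call would
    return f"text of page {page_index}"


async def ocr_document(num_pages: int, max_concurrency: int = 4) -> str:
    """Run OCR on all pages concurrently, capped by a semaphore, and join in page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_index: int) -> str:
        # At most max_concurrency requests are in flight at once.
        async with semaphore:
            return await ocr_page(page_index)

    # gather() returns results in the order the awaitables were passed,
    # so the combined text keeps the original page order.
    pages = await asyncio.gather(*(bounded(i) for i in range(num_pages)))
    return "\n\n".join(pages)
```

Calling `asyncio.run(ocr_document(3))` returns the three placeholder page texts joined by blank lines. The semaphore keeps a large PDF from flooding the server with one request per page all at once.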
@pszemraj
pszemraj / alternate_attn_report.md
Created April 4, 2025 14:39
deep research report by gpt-4.5

Alternate Attention Mechanisms for Sequence Modeling (2023–2025)

Transformer-style self-attention has been central to recent advances in language modeling, but its $\mathcal{O}(L^2)$ complexity (for sequence length $L$) motivates research into more efficient alternate attention mechanisms. This report surveys state-of-the-art methods from 2023–2025 that replace or augment standard self-attention in language sequence models. We organize methods by broad families – from linear approximations and sparsity-based variants to convolutional, state-space, and recurrent mechanisms – outlining each method’s motivation, technical formulation, empirical performance on language tasks, and efficiency characteristics.
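To make the quadratic cost concrete, standard scaled dot-product attention can be written as follows. The symbols $Q$, $K$, $V$ and the head dimension $d$ are notation introduced here for illustration and do not appear in the preview above.

$$
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{L \times d}
$$

The score matrix $Q K^\top$ has $L \times L$ entries, so forming it takes $\mathcal{O}(L^2 d)$ time and, if it is materialized, $\mathcal{O}(L^2)$ memory. Doubling the sequence length therefore roughly quadruples the cost of this step.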

Contents:

@pszemraj
pszemraj / fix_extensions.py
Created March 31, 2025 22:50
File Extension Fixer using Magika
#!/usr/bin/env python3
"""
File Extension Fixer using Magika
This script analyzes files using Google's Magika deep learning model to identify
their actual content types and fix incorrect file extensions.
pip install -U joblib magika tqdm
"""
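The rename step that follows detection can be sketched without any third-party dependency. The detector below is a plain magic-byte check used as a stand-in, not Magika: the gist uses Magika's model for that step, and its API is deliberately not called here. `SIGNATURES`, `detect_extension`, and `fix_extension` are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Optional

# Stand-in detector data: a few well-known magic-byte signatures.
SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"PK\x03\x04": ".zip",
}


def detect_extension(path: Path) -> Optional[str]:
    """Guess an extension from the first bytes of the file, or None if unknown."""
    with path.open("rb") as fh:
        head = fh.read(16)
    for magic, ext in SIGNATURES.items():
        if head.startswith(magic):
            return ext
    return None


def fix_extension(
    path: Path,
    detect: Callable[[Path], Optional[str]] = detect_extension,
    dry_run: bool = True,
) -> Path:
    """Return the corrected path; rename on disk only when dry_run is False."""
    detected = detect(path)
    if detected is None or path.suffix.lower() == detected:
        return path  # unknown content type, or extension already correct
    target = path.with_suffix(detected)
    if target.exists():
        return path  # never overwrite an existing file
    if not dry_run:
        path.rename(target)
    return target
```

Passing the detector in as a callable keeps the rename logic testable on its own, and a model-based detector could be swapped in behind the same signature. Defaulting to `dry_run=True` lets the planned renames be reviewed before any file is touched.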
sudo apt-get update && sudo apt upgrade -y
sudo apt-get install -y poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
git clone https://github.com/allenai/olmocr.git --depth 1
cd olmocr
pip install -q ninja
pip install -e ".[gpu]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
# clean up
pip cache purge && sudo apt-get autoremove -y
@pszemraj
pszemraj / layernorm_scaling.py
Last active March 26, 2025 03:08
LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class LayerNormScaling(nn.Module):
    """
    LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
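The body of the class is cut off in this preview, so the following is only a dependency-free sketch of the idea, assuming the scaling rule is to multiply the layer-normalized output by $1/\sqrt{\ell}$, where $\ell$ is the 1-based layer index. It uses plain lists instead of `torch` tensors, omits the learnable gain and bias, and the names `layer_norm` and `layer_norm_scaled` are hypothetical.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaled(x: List[float], layer_index: int, eps: float = 1e-5) -> List[float]:
    """Layer-normalize x, then scale by 1/sqrt(layer_index) (assumed rule, 1-based index)."""
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]
```

Under this rule the output of layer 4 has half the magnitude of the output of layer 1 for the same input, which damps the growth of activation variance in deeper layers.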
| Model | Average CR⬆️ | AGIEval Mean (Min, Max) | AGIEval CR | MMLU-Pro Mean (Min, Max) | MMLU-Pro CR | Math Mean (Min, Max) | Math CR | #Params (B) |
|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | 72.39 | 72.43 (65.34, 74.66) | 81.79 | 66.63 (55.16, 70.68) | 73.19 | 65.88 (64.58, 67.86) | 62.18 | 0 |
| mistralai/Mistral-Large-Instruct-2407 | 71.93 | 68.78 (61.41, 74.49) | 75.77 | 65.1 (50.28, 69.23) | 72.31 | 71.04 (69.66, 72.72) | 67.71 | 0 |
| meta-llama/Meta-Llama-3-70B-Instruct | 69.11 | 69.71 (60.77, 71.2) | 83.13 | 58.75 (49.3, 63.16) | 75.24 | 51.29 (49.66, 54.2) | 48.96 | 0 |
| 01-ai/Yi-1.5-34B-Chat | 58.43 | 63.89 | | | | | | |
@pszemraj
pszemraj / tensorboard_inspect.py
Last active March 11, 2025 00:33
CLI utility to quickly inspect the latest scalar values from TensorBoard logs.
#!/usr/bin/env python
"""
CLI utility to quickly inspect the latest scalar values from TensorBoard logs.
Dependencies:
pip install tbparse pandas fire tqdm
Usage:
python tensorboard_inspect.py --logdir ./path/to/logs
"""
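The core of "latest scalar values" is a reduction that keeps, for each tag, the value logged at the highest step. The sketch below shows that reduction on plain tuples. The gist itself reads the logs with `tbparse`, which is not called here, and `latest_scalars` and the `(tag, step, value)` record shape are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def latest_scalars(
    records: Iterable[Tuple[str, int, float]],
) -> Dict[str, Tuple[int, float]]:
    """Map each tag to the (step, value) pair with the highest step seen."""
    latest: Dict[str, Tuple[int, float]] = {}
    for tag, step, value in records:
        # Records may arrive out of order, so compare steps instead of trusting position.
        if tag not in latest or step >= latest[tag][0]:
            latest[tag] = (step, value)
    return latest
```

For example, given `("loss", 0, 2.0)`, `("loss", 10, 0.5)` and `("lr", 5, 0.001)`, the result maps `"loss"` to `(10, 0.5)` and `"lr"` to `(5, 0.001)`.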

bash/zsh util for readerlm-v2

Warning

Using this sends the contents of the (public) URL you pass in to Jina AI. The license for readerlm-v2 is CC BY-NC.

Here's the improved version of your smart_curl function that better handles filename extraction and works in Zsh:

function smart_curl() {