@and1truong
Created January 15, 2025 19:12
  • 20 Years of SRE: Highs and Lows
  • Scam or Savings? A Cloud vs. On-Prem Economic Slapfight
  • Is It Already Time To Version Observability? (Signs Point To Yes.)
  • Capacity Constraints Unveiled: Navigating Cloud Scaling Realities
  • Sharding: Growing Systems from Node-scale to Planet-scale
  • Product Reliability for Google Maps
  • Build vs. Buy in the Midst of Armageddon
  • Compliance & Regulatory Standards Are NOT Incompatible with Modern Development...
  • The Ticking Time Bomb of Observability Expectations
  • Synthesizing Sanity with, and in Spite of, Synthetic Monitoring
  • Migrating a Large Scale Search Dataset in Production in a Highly Available...
  • OIDC and CICD: Why Your CI Pipeline Is Your Greatest Security Threat
  • When Your Open Source Turns to the Dark Side
  • The Sins of High Cardinality
  • Optimizing Resilience and Availability by Migrating from JupyterHub to the...
  • 99.99% of Your Traces Are (Probably) Trash
  • Meeting the Challenge of Burnout
  • What We Want Is 90% the Same: Using Your Relationship with Security for Fun...
  • Thawing the Great Code Slush
  • Resilience in Action
  • Navigating the Kubernetes Odyssey: Lessons from Early Adoption and Sustained...
  • "Logs Told Us It Was Kernel – It Wasn't"
  • What Is Incident Severity, but a Lie Agreed Upon?
  • Hard Choices, Tight Timelines: A Closer Look at Skip-level Tradeoff Decisions...
  • Triage with Mental Models
  • Defence at the Boundary of Acceptable Performance
  • Lightning Talks
  • System Performance and Queuing Theory - Concepts and Application
  • It Is OK to Be Metastable
  • The Art of SRE: Building People Networks to Amplify Impact
  • Teaching SRE
  • Cross-System Interaction Failures: Don't Fail through the Cracks
  • Gray Failure: The Achilles’ Heel of Cloud-Scale Systems
  • The Invisible Door: Reliability Gaps in the Front End
  • Automating Disaster Recovery: The Ultimate Reliability Challenge
  • From Chaos to Clarity: Deciphering Cache Inconsistencies in a Distributed...
  • Patching Your Way to Compliance with a Small Team and a Pile of Technical Debt
  • Strengthening Apache Pinot's Query Processing Engine with Adaptive Server...
  • Taming the Linux Distribution Sprawl: A Journey to Standardization and...
  • Frontend Design in SRE
  • Measuring Reliability Culture to Optimize Tradeoffs: Perspectives from an...
  • Storytelling as an Incident Management Skill
  • Real Talk: What We Think We Know — That Just Ain’t So
  • What Can You See from Here?
  • Dude, You Forgot the Feedback: How Your Open Loop Control...
  • You Depend on Time, This Is How It Works and You Won’t...
  • SRE Saga: The Song of Heroes and Villains
  • The Frontiers of Reliability Engineering
  • I Can OIDC You Clearly Now: How We Made Static Credentials a...
  • OMG WTF SSO: A Beginner’s Guide to Single Sign-On...
  • Sailing the Database Seas: Applying SRE Principles at Scale
  • Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata...
  • Fixing Your Noisy Pager in 500 Easy Steps
  • Achieving Excellence: SLO Thresholds That Transform Service...
  • Selective Reliability Engineering: There Is No Single Source...
  • Why You’re (Probably) Doing Service Catalogs Wrong
  • Exploring the Unintended Consequences of Automation in Software
  • Rock around the Clock (Synchronization): Improve Performance...
  • Mnemonic Rules for Eponymous Laws or: There’s a Law for That!
  • SRE Stakeholders: A Spotter’s Guide
  • Panel Discussion: Is Reliability a Luxury Good?
  • Enhancing Elasticsearch Performance: Innovative Reindexing...
  • Lessons from Unix History
  • Treat Your Code as a Crime Scene
  • Finding the Capacity to Grieve Once More
  • Incident Groundhog Day
  • Anomaly Detection in Time Series from Scratch Using...
  • Generative AI: Beyond (Just) Hype
  • From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented...
  • Scheduling at Scale: eBPF Schedulers with Sched_ext
  • When Your SaaS Provider Goes out of Business – Lessons from...
  • Configuration Languages Are the Bane of Our Existence
  • Just Buy the Printer: Resilience in Action
  • Noisy Neighbors, through Networking
  • Taming Noisy Benchmark Results Using Change Point Detection
  • Enabling Product Scalability through Load Testing
  • NVMe/TCP Makes iSCSI Look like Fortran
  • The Silent Performance Killers: BIOS and Firmware Updates
  • How a Single API Endpoint Saved Us 3000 CPU
  • Managing the Risk of Software Supply Chain Attacks
  • When SRE and Security Teams Meet to Face a Crisis
  • How to Host a (Very) Popular Website for 30 Altairian...
  • How Snowflake Migrated All Alerts and Dashboards to a...
  • What If We Ask Linux to Do Cryptography for Us?
  • Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin
  • Lightning Talks
  • Monitoring Systems as a Service – Walking the Line between...
  • An Exploration in Storing Telemetry in Cloud Object Storage
  • Opening the Box: Diagnosing Operating-System Task-Scheduler...
  • Embrace Fleet Reboots and Make Them Boring
  • A Brief History of Release Engineering
  • Red Tide Revert
  • Riot Games: Evolution of Observability at the Gaming Company
  • A Powerful Logs Management Solution We All Have and Use but...
  • Blast Radius Reduction for Large-Scale Distributed Systems
  • AppStack: An Open Source Cloud Native Platform for Running...
  • Science Reliability Engineering for High Performance Computing
  • Get Your Non-SREs Oncall Ready!
  • Transforming Production Readiness
  • Energy Consumption of Datacenters
  • Are We Really Engineers?