@and1truong
Created January 15, 2025 19:18

2021 - SRE conferences

  • When Linux Memory Accounting Goes Wrong
  • Don't Follow Leaders or "All Models Are Wrong (and So Am I)"
  • Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity
  • What's the Cost of a Millisecond?
  • What To Do When SRE is Just a New Job Title?
  • Capacity Management for Fun & Profit
  • A Political Scientist's View on Site Reliability
  • Panel: Engineering Onboarding
  • Sparking Joy for Engineers with Observability
  • Panel: Observability
  • 10 Lessons Learned in 10 Years of SRE
  • Rethinking the SDLC
  • Elephant in the Blameless War Room—Accountability
  • How LinkedIn Performs Maintenances at Scale
  • Take Me Down to the Paradise City Where the Metric Is Green and Traces Are Pretty
  • Need for SPEED: Site Performance Efficiency, Evaluation and Decision
  • SLX: An Extended SLO Framework to Expedite Incident Recovery
  • Watching the Watchers: Generating Absent Alerts for Prometheus
  • A Principled Approach to Monitoring Streaming Data Infrastructure at Scale
  • Let's Bring System Dynamics Back to CS!
  • From 15,000 Database Connections to under 100—A Tech Debt Tale
  • MySQL and InnoDB Performance for the Rest of Us
  • Cache Strategies with Best Practices
  • Optimizing Cost and Performance with arm64
  • Ceci N'est Pas un CPU Load
  • Grand National 2021: Managing Extreme Online Demand at William Hill
  • Microservices above the Cloud—Designing the International Space Station for Reliability
  • Horizontal Data Freshness Monitoring in Complex Pipelines
  • How We Built Out Our SRE Department to Support over 100 Million Users for the World's 3rd
  • You've Lost That Process Feeling: Some Lessons from Resilience Engineering
  • Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19
  • DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond
  • What If the Promise of AIOps Was True?
  • Model Monitoring: Detecting and Analyzing Data Issues
  • Leveraging ML to Detect Application HotSpots [@scale, of Course!]
  • Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform
  • Designing an Autonomous Workbench for Data Science on AWS
  • Panel: OpML
  • When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field
  • Evolution of Incident Management at Slack
  • Hacking ML into Your Organization
  • Automating Performance Tuning with Machine Learning
  • Practical TLS Advice for Large Infrastructure
  • Learning More from Complex Systems
  • Of Mice & Elephants
  • User Uptime in Practice
  • Nothing to Recommend It: An Interactive ML Outage Fable
  • Improving Observability in Your Observability: Simple Tips for SREs
  • SRE for ML: The First 10 Years and the Next 10
  • Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform
  • Nine Questions to Build Great Infrastructure Automation Pipelines
  • Hard Problems We Handle in Incidents but Aren't Recognized
  • Experiments for SRE
  • Reliable Data Processing with Minimal Toil
  • SRE "Power Words"—the Lexicon of SRE as an Industry
  • How Our SREs Safeguard Nanosecond Performance—at Scale—in an Environment Built to Fail
  • Panel: Unsolved Problems in SRE
  • Beyond Goldilocks Reliability
  • A Retrospective: Five Years Later, Was Chaos Engineering Worth It?
  • The Origins of USAA's Postmortem of the Week
  • Cache for Cash—Speeding Up Production with Kafka and MySQL binlog
  • Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability
  • Games We Play to Improve Incident Response Effectiveness
  • Food for Thought: What Restaurants Can Teach Us about Reliability
  • Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots
  • Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries
  • Spike Detection in Alert Correlation at LinkedIn