2021 - SRE conferences

SREcon21

When Linux Memory Accounting Goes Wrong
Don't Follow Leaders or "All Models Are Wrong (and So Am I)"
Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity
What's the Cost of a Millisecond?
What To Do When SRE is Just a New Job Title?
Capacity Management for Fun & Profit
A Political Scientist's View on Site Reliability
Panel: Engineering Onboarding
Sparking Joy for Engineers with Observability
Panel: Observability
10 Lessons Learned in 10 Years of SRE
Rethinking the SDLC
Elephant in the Blameless War Room—Accountability
How LinkedIn Performs Maintenances at Scale
Take Me Down to the Paradise City Where the Metric Is Green and Traces Are Pretty
Need for SPEED: Site Performance Efficiency, Evaluation and Decision
SLX: An Extended SLO Framework to Expedite Incident Recovery
Watching the Watchers: Generating Absent Alerts for Prometheus
A Principled Approach to Monitoring Streaming Data Infrastructure at Scale
Let's Bring System Dynamics Back to CS!
From 15,000 Database Connections to under 100—A Tech Debt Tale
MySQL and InnoDB Performance for the Rest of Us
Cache Strategies with Best Practices
Optimizing Cost and Performance with arm64
Ceci N'est Pas un CPU Load
Grand National 2021: Managing Extreme Online Demand at William Hill
Microservices above the Cloud—Designing the International Space Station for Reliability
Horizontal Data Freshness Monitoring in Complex Pipelines
How We Built Out Our SRE Department to Support over 100 Million Users for the World's 3rd
You've Lost That Process Feeling: Some Lessons from Resilience Engineering
Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19
DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond
What If the Promise of AIOps Was True?
Model Monitoring: Detecting and Analyzing Data Issues
Leveraging ML to Detect Application HotSpots [@scale, of Course!]
Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform
Designing an Autonomous Workbench for Data Science on AWS
Panel: OpML
When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field
Evolution of Incident Management at Slack
Hacking ML into Your Organization
Automating Performance Tuning with Machine Learning
Practical TLS Advice for Large Infrastructure
Learning More from Complex Systems
Of Mice & Elephants
User Uptime in Practice
Nothing to Recommend It: An Interactive ML Outage Fable
Improving Observability in Your Observability: Simple Tips for SREs
SRE for ML: The First 10 Years and the Next 10
Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform
Nine Questions to Build Great Infrastructure Automation Pipelines
Hard Problems We Handle in Incidents but Aren't Recognized
Experiments for SRE
Reliable Data Processing with Minimal Toil
SRE "Power Words"—the Lexicon of SRE as an Industry
How Our SREs Safeguard Nanosecond Performance—at Scale—in an Environment Built to Fail
Panel: Unsolved Problems in SRE
Beyond Goldilocks Reliability
A Retrospective: Five Years Later, Was Chaos Engineering Worth It?
The Origins of USAA's Postmortem of the Week
Cache for Cash—Speeding Up Production with Kafka and MySQL binlog
Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability
Games We Play to Improve Incident Response Effectiveness
Food for Thought: What Restaurants Can Teach Us about Reliability
Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots
Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries
Spike Detection in Alert Correlation at LinkedIn

and1truong/2021 - SRE conferences.md
