@lusis
Created October 25, 2012 12:48
What happened here?

Random idea

I'm a big fan of the Ops School idea. I've struggled for years about how to "train up" someone coming into this field. So much of our skill set is forged in the fire of outages and troubleshooting.

One of the things that is both good and bad about system administration and operations is that we tend to see patterns. It's good in that we immediately see things that stand out. The downside is we tend to superimpose that pattern recognition in inappropriate ways.

We had an interesting issue happen yesterday at the dayjob. I won't go into exactly what it was here, but I had an idea based on some graphs I was looking at. It's part social experiment and part exercise in problem-solving skills.

Given the following image with no context, what do you think happened? What are some of the key indicator points that jump out and what pattern do they call to mind?

(In case the image resolution makes it hard to tell: there are 4 metrics in this graph.)

[image: a graph]

You can probably spot some common patterns in this graph. Here's a bit more historical context:

[image: a historical graph]

@obfuscurity

It also occurred to me that we could be looking at IO, but I don't think so, if only because one typically tracks reads/writes and iops/octets in the same graph (by convention, if not necessity). Instead, it looks to me like a simple latency graph... either web requests or something like time-to-completion of some action or script.
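
For the "time-to-completion of some action" reading, a series like that is easy to produce yourself. A minimal sketch, assuming the metrics end up in something like Graphite via its plaintext protocol on port 2003 (the gist doesn't say what the graphing stack is); the host, URL, and metric name below are made up:

```python
# Hypothetical sketch (not from the gist): time an action's completion and ship
# the latency to a Graphite/carbon plaintext listener so it shows up as a
# latency series. Host, port, URL, and metric name are all assumptions.
import socket
import time
import urllib.request

CARBON_HOST = "graphite.example.com"   # assumption: a carbon-cache host
CARBON_PORT = 2003                     # Graphite plaintext protocol default port

def send_metric(path, value, timestamp=None):
    """Send one datapoint using the 'path value timestamp' line format."""
    ts = int(timestamp or time.time())
    line = "{0} {1} {2}\n".format(path, value, ts)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

def timed_fetch(url):
    """Measure time-to-completion of a web request, in milliseconds."""
    start = time.time()
    urllib.request.urlopen(url, timeout=10).read()
    elapsed_ms = (time.time() - start) * 1000
    send_metric("app.frontpage.response_time_ms", round(elapsed_ms, 2))
    return elapsed_ms

if __name__ == "__main__":
    print("%.1f ms" % timed_fetch("http://example.com/"))
```

Run from cron or a loop, that gives exactly the kind of top-line latency series being guessed at here.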

@jjb commented Oct 25, 2012

my first simple guess looking at the first graph is that purple is memory usage and red is disk access events. memory runs out and VM kicks in. then the operator intervenes, kills some big processes, and both go down. VM doesn't drop to zero right away because some of what got swapped out to disk isn't needed right now, so there's no reason to swap it back in.

however that doesn't account for the metrics gap that @obfuscurity brought up. also, memory usage lines don't usually look so jagged, so it's probably not memory at all :)

@lusis could you tell us if the Y axis is the same units for all the metrics?
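
One way to test the "memory ran out and swap kicked in" guess on a live box is to watch the kernel's own counters rather than the graph. A minimal sketch, assuming a Linux host (nothing in the gist says what the systems are); the 60-second window is arbitrary:

```python
# Hypothetical sketch (not from the gist): sample swap-in/swap-out counters and
# free memory from /proc to see whether the box is actively swapping.
import time

def read_kv(path):
    """Parse simple 'key value' files like /proc/vmstat and /proc/meminfo."""
    out = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            out[parts[0].rstrip(":")] = int(parts[1])
    return out

def sample():
    vm = read_kv("/proc/vmstat")     # pswpin / pswpout are cumulative page counts
    mem = read_kv("/proc/meminfo")   # values are in kB
    return vm["pswpin"], vm["pswpout"], mem["MemFree"], mem["SwapFree"]

if __name__ == "__main__":
    swpin1, swpout1, _, _ = sample()
    time.sleep(60)
    swpin2, swpout2, free_kb, swap_free_kb = sample()
    # Sustained swap-out growth while MemFree shrinks is what this guess predicts.
    print("pages swapped in over window:  %d" % (swpin2 - swpin1))
    print("pages swapped out over window: %d" % (swpout2 - swpout1))
    print("MemFree: %d kB, SwapFree: %d kB" % (free_kb, swap_free_kb))
```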

@imbriaco

Guess: EBS got slow and affected top-line response time.
