The Sounds of Silence: Lessons from an 18-hour API outage
Sometimes applications behave "normally" by the strict definition of HTTP statuses, but under the surface something is terribly wrong. In 2017, Checkr's most important API endpoint went down for 12 hours without detection. In this talk I'll walk through the incident and how we responded (what went well and what could have gone better), and explore how we've since hardened our systems with simple monitoring patterns.
Paul hails from Denver, CO, where he works as an Engineering Manager at Checkr. He's passionate about building technology for the new world of work. In a former life, Paul was a competitive swimmer. He now spends most of his free time on dry land with his wife and three children.