It is recommended for all alerts to adhere to the follow guidelines: Keep conditions simple. Alerts should be actionable. Check for measured failure on critical paths rather than a lack of success. Alerts should not have special cases for routine maintenance. Consider how the alert check can fail.
The year was 2012 and operating a critical service at Netflix was laborious. Deployments felt like walking through wet sand. Canarying was devolving into verifying endurance (“nothing broke after one week of canarying, let’s push it”) rather than correct functionality. Researching issues felt like bouncing a rubber ball between teams, hard to catch the root cause and harder yet to stop from bouncing between one another. All of these were signs that changes were needed.
Ever wonder how Netflix serves a great streaming experience with high-quality video and minimal playback interruptions? Thank the team of engineers and data scientists who constantly A/B test their innovations to our adaptive streaming and content delivery network algorithms. What about more obvious changes, such as the complete redesign of our UI layout or our new personalized homepage? Yes, all thoroughly A/B tested.
Another day, another custom script to analyze an A/B test. Maybe you’ve done this before and have an old script lying around. If it’s new, it’s probably going to take some time to set up, right? Not at Netflix.
The Netflix experimentation platform (XP) was founded on two key tenets: (1) that democratizingcontributions will allow the platform to scale more efficiently, and (2) that non-distributed com-putation of causal models will lead to rapid innovations in statistical methodologies.