Embracing the chaos of chaos engineering

  • Start with a hypothesis: define your steady state (the measurable behavior of the system when it is healthy) and make an educated guess about what will happen to it in various failure scenarios.
  • Then introduce real-world events to test that guess: hardware or VM failure, state inconsistency, running out of processing power, memory, or time, dependency issues, rare conditions, traffic spikes, and service unavailability.
  • Next, do test runs in production on a properly pretested codebase (be cautious about doing this on safety-critical systems such as banking), then finalize your hypothesis based on how the real-world events affect your steady state.
  • You should communicate your results not only to engineers but also to support staff and community managers who face the public.
  • Use different tools to run your experiments, and make sure alerting and reporting systems are in place to minimize potential damage. Abort quickly if needed.
  • Once you have defined ideal metrics and potential effects, widen the scope of testing by changing the parameters or events you test each time, applying fixes as you go, until you find the points where the system really starts to break down.
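
The loop described in the list above (state a steady-state hypothesis, inject a fault, abort quickly, then widen the scope) can be sketched as a small simulation. This is only an illustrative sketch, not the method from the linked post: the service, the dependency, and every name and threshold here (`handle_request`, `flaky_dependency`, `steady_state_max_error`, `abort_error`, and so on) are hypothetical stand-ins, and a real experiment would measure a live system through its monitoring rather than a random number generator.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentResult:
    failure_rate: float     # injected fault probability for this round
    error_rate: float       # observed fraction of failed requests
    aborted: bool           # True if the abort threshold was crossed mid-run
    hypothesis_held: bool   # True if the steady state was maintained


def flaky_dependency(failure_rate: float, rng: random.Random) -> bool:
    """Hypothetical dependency call with an injected fault; True on success."""
    return rng.random() >= failure_rate


def handle_request(failure_rate: float, rng: random.Random, retries: int = 2) -> bool:
    """Hypothetical service under test: retries the dependency a few times."""
    return any(flaky_dependency(failure_rate, rng) for _ in range(retries + 1))


def run_experiment(
    failure_rate: float,
    requests: int = 1000,
    steady_state_max_error: float = 0.01,  # hypothesis: at most 1% of requests fail
    abort_error: float = 0.20,             # stop early if more than 20% fail
    min_sample: int = 100,                 # do not judge the abort rule on tiny samples
    seed: int = 0,
) -> ExperimentResult:
    rng = random.Random(seed)
    failures = 0
    for sent in range(1, requests + 1):
        if not handle_request(failure_rate, rng):
            failures += 1
        # Abort quickly: stop as soon as the damage exceeds the agreed limit.
        if sent >= min_sample and failures / sent > abort_error:
            return ExperimentResult(failure_rate, failures / sent, True, False)
    error_rate = failures / requests
    return ExperimentResult(
        failure_rate, error_rate, False, error_rate <= steady_state_max_error
    )


def widen_scope(failure_rates: List[float]) -> List[ExperimentResult]:
    """Raise the injected fault level step by step; stop after the first abort."""
    results = []
    for rate in failure_rates:
        result = run_experiment(rate)
        results.append(result)
        if result.aborted:
            break
    return results


if __name__ == "__main__":
    for r in widen_scope([0.0, 0.1, 0.3, 0.6, 1.0]):
        print(r)
```

Running the script prints one result per fault level, which is the kind of record worth sharing with support staff as well as engineers. The first level at which `hypothesis_held` turns false marks where a fix is needed before the scope is widened further.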

Full post here, 6 mins read