Game Day tests deliberately trigger failure modes in production systems so that teams can practice responding to unpredictable situations.

  • List all potential failure scenarios. Consider which parts of your infrastructure are completely safe, where your blind spots are, and what happens if a server runs out of space or you are hit by a DNS outage or DDoS attack.
  • Create a series of experiments to anticipate how things will break: what side effects may be triggered, whether alerts will be dispatched correctly, and whether downstream systems may be affected.
  • Test your human systems. Consider how team members need to interact when an incident unfolds.
  • Address the gaps and patch any holes you find. Check which hypotheses held up in practice and which did not. Establish a plan to correct the failures, then run a new Game Day test to check whether your hypotheses are now valid.
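
One way to keep track of the steps above is to write each experiment down with its hypothesis and then record what actually happened. The following is a minimal sketch, not a prescribed tool: the `Experiment` class, the `find_gaps` helper, and the sample scenarios and outcomes are all hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Experiment:
    """One Game Day experiment (hypothetical structure for illustration)."""
    scenario: str                           # failure to trigger, e.g. "DNS outage"
    hypothesis: str                         # what the team expects to happen
    alert_expected: bool = True             # should an alert be dispatched?
    alert_fired: Optional[bool] = None      # observed during the Game Day
    hypothesis_held: Optional[bool] = None  # observed during the Game Day
    downstream_affected: List[str] = field(default_factory=list)


def find_gaps(experiments: List[Experiment]) -> List[Experiment]:
    """Return executed experiments whose hypothesis failed or whose
    alerting did not match what was expected."""
    gaps = []
    for e in experiments:
        if e.hypothesis_held is None:
            continue  # experiment has not been run yet
        if not e.hypothesis_held or e.alert_fired != e.alert_expected:
            gaps.append(e)
    return gaps


# Made-up sample plan; the outcomes are invented for the example.
plan = [
    Experiment("server out of space", "writes fail over to a replica",
               alert_fired=True, hypothesis_held=True),
    Experiment("DNS outage", "clients keep using cached records",
               alert_fired=False, hypothesis_held=False,
               downstream_affected=["checkout"]),
    Experiment("DDoS attack", "rate limiting absorbs the load"),  # not run yet
]

for gap in find_gaps(plan):
    print(f"Re-test after fix: {gap.scenario}")
# prints "Re-test after fix: DNS outage"
```

Every scenario that `find_gaps` returns becomes an item in the correction plan and a candidate for the next Game Day test.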
