Game Day tests deliberately trigger failure modes in production systems so that teams can practice responding to unpredictable situations.
- List all potential failure scenarios. Consider which parts of your infrastructure are completely safe, where your blind spots are, and what happens if a server runs out of disk space or if there is a DNS outage or a DDoS attack.
- Create a series of experiments to anticipate how things will break: which side effects may be triggered, whether alerts will be dispatched correctly, and whether downstream systems may be affected.
- Test your human systems. Consider how team members need to interact when an incident unfolds.
- Address the gaps and patch any holes you find. Check which hypotheses held up in practice and which did not. Establish a plan to correct the ones that failed, then run a new Game Day test to check whether your hypotheses are now valid.
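
The loop described above (list scenarios, state a hypothesis for each, run the experiment, then follow up on the hypotheses that failed) can be tracked with a very small data structure. The Python sketch below is only an illustration, not a prescribed tool: the `Experiment` class, its field names, and the example hypotheses are all assumptions made for this example; only the failure scenarios come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Experiment:
    """One Game Day experiment (hypothetical structure, for illustration only)."""
    scenario: str                    # failure mode to trigger
    hypothesis: str                  # what the team expects to happen
    held_up: Optional[bool] = None   # None until the experiment has been run


def record_result(experiment: Experiment, held_up: bool) -> None:
    """Store whether the hypothesis held up when the failure was triggered."""
    experiment.held_up = held_up


def gaps(experiments: List[Experiment]) -> List[Experiment]:
    """Experiments whose hypothesis failed: fix these and re-test next Game Day."""
    return [e for e in experiments if e.held_up is False]


def not_yet_run(experiments: List[Experiment]) -> List[Experiment]:
    """Experiments that have been planned but not executed yet."""
    return [e for e in experiments if e.held_up is None]


# Example plan; the hypotheses are invented placeholders.
plan = [
    Experiment("server runs out of disk space",
               "a disk-usage alert pages the on-call engineer"),
    Experiment("DNS outage",
               "downstream services fail over without manual action"),
    Experiment("DDoS attack",
               "rate limiting keeps the service reachable"),
]

# Pretend the first two experiments were run during the Game Day.
record_result(plan[0], True)
record_result(plan[1], False)

for e in gaps(plan):
    print(f"GAP: {e.scenario} -> hypothesis failed: {e.hypothesis}")
```

Keeping `held_up` as a three-state value (`None`, `True`, `False`) separates experiments that were never run from hypotheses that were actually disproved, so the follow-up Game Day can cover both.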