• Define and measure service-level objectives (SLOs) to ensure data freshness, data correctness, and superior data isolation.
  • Plan for dependency failures by checking for overdependence on products that don’t meet their own SLOs.
  • Create and maintain system diagrams, process documentation and playbook entries that outline recovery from alert conditions.
  • Reduce hot-spotting by balancing out the workload across resources.
  • Utilize autoscaling.
  • Adhere to strict access control for privacy, security and data integrity.
  • Use idempotent and two-phase mutations to avoid duplication or storage of incorrect data in case of pipeline failure in the middle of a process.
  • Use checkpointing to store partially implemented processes.

Full post here, 5 mins read