- Define and measure service-level objectives (SLOs) for data freshness, data correctness, and data isolation.
- Plan for dependency failures by checking for overdependence on products that don’t meet their own SLOs.
- Create and maintain system diagrams, process documentation and playbook entries that outline recovery from alert conditions.
- Reduce hot-spotting by balancing out the workload across resources.
- Use autoscaling so that capacity follows the workload instead of being fixed in advance.
- Adhere to strict access control for privacy, security and data integrity.
- Use idempotent and two-phase mutations so that a pipeline failing partway through a process does not duplicate data or store incorrect data.
- Use checkpointing to save the state of partially completed processes so that they can resume after a failure instead of starting over.
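
To make the last two points concrete, here is a minimal Python sketch of how idempotent writes and checkpointing work together. It is an illustration under stated assumptions, not code from the original post: `IdempotentSink`, `run_pipeline`, the `crash_after` parameter and the JSON checkpoint file are all hypothetical names invented for this example, and it does not cover two-phase mutations.

```python
import json
import os
import tempfile


class IdempotentSink:
    """Hypothetical in-memory sink: writes are keyed by record ID."""

    def __init__(self):
        self.rows = {}

    def upsert(self, record_id, value):
        # A replayed record overwrites itself instead of creating a duplicate.
        self.rows[record_id] = value


def load_checkpoint(path):
    """Return the index of the next record to process (0 if none saved yet)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_pipeline(records, sink, checkpoint_path, crash_after=None):
    """Process records starting from the last checkpoint.

    crash_after simulates a failure after a record is written
    but before its checkpoint is saved.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        record_id, value = records[i]
        sink.upsert(record_id, value)
        if crash_after == i:
            raise RuntimeError(f"simulated crash after writing record {i}")
        save_checkpoint(checkpoint_path, i + 1)


def demo():
    records = [("a", 1), ("b", 2), ("c", 3)]
    sink = IdempotentSink()
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "checkpoint.json")
        try:
            run_pipeline(records, sink, ckpt, crash_after=1)
        except RuntimeError:
            pass  # the first run died mid-process
        resumed_from = load_checkpoint(ckpt)
        run_pipeline(records, sink, ckpt)  # the second run resumes
        return resumed_from, dict(sink.rows)


if __name__ == "__main__":
    print(demo())
```

In this sketch the first run fails after writing record `b` but before saving its checkpoint, so the second run resumes at index 1 and writes `b` again. Because the sink is keyed by record ID, the replay leaves exactly one row per record. With an append-only sink, the same replay would have stored `b` twice.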
Full post here, 5 mins read