Our adventures in scaling

  • Handling sudden activity spikes poses different challenges than scaling a rapidly growing user base.
  • Start by checking whether the databases are resource-constrained and therefore slowing down: look at hardware metrics for CPU, disk I/O and memory during the spikes.
  • If there are no spikes in those metrics, look higher up the infrastructure stack, at service resources, for increased resource acquisition times. Also check garbage collection activity, which can indicate whether the JVM heap or threads are the bottleneck.
  • Next, check network metrics for a constraint between services and databases - for example, whether the services’ database connection pools are consistently reaching their size limits.
  • To collect more metrics, log the latency of every transaction and collect those that exceed a defined threshold. Analyse these slow transactions across daily usage to determine whether removing the identified bottleneck would make a significant difference.
  • Some of the bottlenecks may be code-related, for example inefficient queries, a resource-starved service, or inconsistencies in the database's responses, so look at metrics on higher-level functioning and not just low-level system components.
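
The latency-logging step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it is written in Python for brevity even though the services discussed run on the JVM, and the `LatencyRecorder` class, its method names, and the 0.5-second threshold are all hypothetical. It records every transaction's latency, keeps the ones above the threshold, and reports what share of total recorded time each slow transaction type accounts for, which is one way to judge whether removing a bottleneck would make a significant difference.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Assumed threshold for illustration only; tune it per workload.
SLOW_THRESHOLD_SECONDS = 0.5


class LatencyRecorder:
    """Hypothetical helper: records all latencies and keeps the slow ones."""

    def __init__(self, threshold_seconds=SLOW_THRESHOLD_SECONDS):
        self.threshold_seconds = threshold_seconds
        self.total_seconds = 0.0
        self.total_count = 0
        # transaction name -> durations that exceeded the threshold
        self.slow = defaultdict(list)

    def record(self, name, duration_seconds):
        """Count every transaction; keep only those above the threshold."""
        self.total_seconds += duration_seconds
        self.total_count += 1
        if duration_seconds > self.threshold_seconds:
            self.slow[name].append(duration_seconds)

    @contextmanager
    def timed(self, name):
        """Time a block of code and record its latency under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def slow_share(self):
        """Fraction of all recorded time spent in slow transactions, per name."""
        if self.total_seconds == 0:
            return {}
        return {
            name: sum(durations) / self.total_seconds
            for name, durations in self.slow.items()
        }


if __name__ == "__main__":
    recorder = LatencyRecorder()
    with recorder.timed("example_transaction"):
        time.sleep(0.01)  # stand-in for real work
    recorder.record("checkout", 2.0)  # latencies could also come from logs
    print(recorder.slow_share())
```

A transaction type with a large share is a candidate worth fixing first; one that is slow but rare contributes little to overall time and may not justify the effort.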

