#devops
11 posts

Migrating functionality between large-scale production systems seamlessly

Lessons from Uber’s migration of its large and complex systems to a new production environment:

  • Incorporate shadowing to forward production traffic to the new system for observation, verifying there are no regressions; this also lets you gather performance statistics (see the first sketch after this list).
  • Use this opportunity to settle any technical debt incurred over the years, so the team can move faster and be more productive in the future.
  • Carry out validation on a trial-and-error basis. Don’t assume it will be a one-time effort; plan for multiple iterations before you get it right.
  • Have a data analyst on your team to find issues early, especially if your system involves payments.
  • Once you are confident in your validation metrics, roll out to production. Uber started with a test plan in which a couple of employees were dedicated to testing various success and failure cases, followed by a rollout to all Uber employees, and finally an incremental rollout to cohorts of external users (see the second sketch after this list).
  • Push for a quick final migration: the option to roll back is often misused and can prevent the migration from ever completing.
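
A minimal sketch of the shadowing idea in Go, assuming hypothetical legacy and new endpoints: serve every request from the legacy system, replay a copy to the new system off the request path, and log divergences for later analysis.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

const (
	legacyURL = "http://legacy.internal" // hypothetical endpoints
	newURL    = "http://new.internal"
)

// shadow serves traffic from the legacy system and replays a copy of each
// request to the new system, logging divergences for later analysis.
func shadow(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	// Serve the real response from the legacy system.
	legacyResp, err := forward(legacyURL, r, body)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer legacyResp.Body.Close()

	// Replay asynchronously so shadowing never adds user-facing latency.
	go func() {
		newResp, err := forward(newURL, r, body)
		if err != nil {
			log.Printf("shadow: new system error: %v", err)
			return
		}
		defer newResp.Body.Close()
		if newResp.StatusCode != legacyResp.StatusCode {
			log.Printf("shadow: status mismatch on %s: legacy=%d new=%d",
				r.URL.Path, legacyResp.StatusCode, newResp.StatusCode)
		}
	}()

	w.WriteHeader(legacyResp.StatusCode)
	io.Copy(w, legacyResp.Body)
}

// forward replays the original request against the given base URL.
func forward(base string, r *http.Request, body []byte) (*http.Response, error) {
	req, err := http.NewRequest(r.Method, base+r.URL.RequestURI(), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header = r.Header.Clone()
	return http.DefaultClient.Do(req)
}

func main() {
	http.HandleFunc("/", shadow)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```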

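And a sketch of cohort-based incremental rollout (the helper and its names are illustrative, not from the post): hash each user ID into 100 buckets and admit the first N percent, so raising the percentage only ever adds users to the rollout.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inCohort deterministically maps a user to one of 100 buckets and admits
// the first `percent` of them; the same user always lands in the same
// bucket, so cohorts grow monotonically as the percentage is raised.
func inCohort(userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < percent
}

func main() {
	for _, pct := range []uint32{1, 10, 50} {
		fmt.Printf("user-42 at %d%%: %v\n", pct, inCohort("user-42", pct))
	}
}
```
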
Full post here, 6 mins read

Improving incident retrospectives

  • Incident retrospectives are an integral part of any good engineering culture.
  • Too much focus is often placed on the incident’s trigger. The retrospective should instead review the incident timeline, identify remediation items, and find owners for those items.
  • Retrospectives should be used as an opportunity for deeper analysis of systems (both human and technical) and the assumptions that underlie them.
  • Finding remediation items should be decoupled from the retrospective process. This frees participants to conduct a deeper investigation, since they are not pressed to produce shallow explanations quickly.
  • It’s good practice to lighten the retrospective template you use, because no template can capture the unique characteristics of varied incidents. Sticking rigidly to a template also limits the open-ended questions that can be quite useful in evolving your systems in the right direction.

Full post here, 6 mins read

The 3 myths of observability

  • Myth #1 is that you will experience fewer incidents if you implement an observability strategy. Implementing a strategy has no impact on how often problems occur; what it gives you is enough telemetry data to solve a problem quickly when it arises.
  • Myth #2 is that buying an observability tool amounts to a strategy. Having an observability platform is not sufficient on its own; unless observability becomes core to your engineering efforts and your company culture, no tool can help.
  • Myth #3 is that implementing observability is cheap. Since observability is a core part of any modern tech infrastructure, think of your observability budget as a percentage of your overall infrastructure budget. The value a good observability program delivers in efficiency, speed, and customer satisfaction surpasses the costs it incurs.

Full post here, 4 mins read

Sampling in observability

  • You can use sampling APIs via instrumentation libraries that let you set sampling strategies or rates. For example, Go’s runtime.SetCPUProfileRate lets you set the CPU profiling rate.
  • Subcomponents of a system may need different sampling strategies, and the decision can be quite subjective: for a low-traffic background job you might sample every task, but for a handler with low latency tolerance you may need to downsample aggressively when traffic is high, or sample only when certain conditions are met.
  • Consider making the sampling strategy dynamically configurable, as this can be useful for troubleshooting (see the first sketch after this list).
  • If collected data tracks a system end to end and the collection spans more than one process, as with distributed traces or events, you might want to propagate the sampling decision from parent to child process through a header passed down.
  • If collecting data is inexpensive but transferring or storing it is not, you can collect 100% of the data and apply a filter later to minimize volume while preserving diversity in the sample, specifically retaining edge cases for debugging (see the second sketch after this list).
  • Never trust a sampling decision propagated from an external source; it could be a vector for a DoS attack.
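
A rough sketch of these ideas in Go (all names here are illustrative, not from the original post): a per-component sampler whose rate can be swapped at runtime, plus a decision helper that honours a propagated header from internal parents but re-decides for external callers.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"strconv"
	"sync/atomic"
)

// Sampler makes probabilistic keep/drop decisions at a rate that can be
// reconfigured at runtime, which is handy while troubleshooting.
type Sampler struct {
	permille atomic.Int64 // decisions kept per 1000
}

func NewSampler(permille int64) *Sampler {
	s := &Sampler{}
	s.permille.Store(permille)
	return s
}

func (s *Sampler) SetRate(permille int64) { s.permille.Store(permille) }

func (s *Sampler) Sample() bool { return rand.Int63n(1000) < s.permille.Load() }

// Different subcomponents, different strategies: keep everything for a
// low-traffic background job, downsample aggressively on a hot handler.
var (
	jobSampler     = NewSampler(1000) // 100%
	handlerSampler = NewSampler(5)    // 0.5%
)

const sampledHeader = "X-Sampled" // hypothetical propagation header

// decide honours a parent's decision propagated via header, but never
// trusts a flag from outside the trust boundary (a possible DoS vector).
func decide(r *http.Request, external bool) bool {
	if !external {
		if v, err := strconv.ParseBool(r.Header.Get(sampledHeader)); err == nil {
			return v // internal parent already decided for this trace
		}
	}
	return handlerSampler.Sample() // external or absent: decide locally
}

func main() {
	r, _ := http.NewRequest("GET", "http://svc.internal/work", nil)
	r.Header.Set(sampledHeader, "true")
	fmt.Println("internal parent:", decide(r, false)) // true: honoured
	fmt.Println("external caller:", decide(r, true))  // re-decided locally

	handlerSampler.SetRate(100) // raise to 10% while troubleshooting
}
```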

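And a sketch of the collect-everything-then-filter approach (again with made-up types): keep every error event as a debugging edge case, and cap the successful ones per route so the retained sample stays diverse.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Event is a made-up stand-in for a collected telemetry record.
type Event struct {
	Route string
	Err   error
}

// filter keeps all error events (edge cases worth debugging) and at most
// maxPerRoute of the successful ones per route, preserving diversity.
func filter(events []Event, maxPerRoute int) []Event {
	kept := []Event{}
	perRoute := map[string][]Event{}
	for _, e := range events {
		if e.Err != nil {
			kept = append(kept, e) // always retain edge cases
			continue
		}
		perRoute[e.Route] = append(perRoute[e.Route], e)
	}
	for _, group := range perRoute {
		// Shuffle so the capped subset is an unbiased sample of the group.
		rand.Shuffle(len(group), func(i, j int) { group[i], group[j] = group[j], group[i] })
		if len(group) > maxPerRoute {
			group = group[:maxPerRoute]
		}
		kept = append(kept, group...)
	}
	return kept
}

func main() {
	events := []Event{
		{Route: "/checkout"}, {Route: "/checkout"}, {Route: "/checkout"},
		{Route: "/search"}, {Route: "/search", Err: fmt.Errorf("timeout")},
	}
	fmt.Println("kept:", len(filter(events, 1))) // 1 error + 1 per route = 3
}
```
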
Full post here, 4 mins read

Learning DevOps as a software engineer

  • Focus on three things that improve production quality: monitoring/visibility, reliability, and software delivery.
  • Monitoring four signals - latency, request rate, saturation, and error & success rate - helps catch potential problems early (see the sketch after this list).
  • Analyzing which components can fail, and how their failure affects the system, should be an important step when building new services or refactoring existing ones.
  • Running end-to-end tests on staging and production is crucial.
  • A continuous delivery workflow is extremely important for reducing operational overhead and enabling faster delivery.
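
As a sketch of instrumenting those signals, assuming the Prometheus Go client (the metric names are made up): record a latency histogram and a request counter labelled by status code; saturation would come from separately exported resource metrics such as CPU, memory, or queue depth.

```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency plus request/error rate; hypothetical metric names.
var (
	latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
	}, []string{"path"})
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
	}, []string{"path", "code"})
)

// instrument wraps a handler to observe latency and count requests by
// status code, covering three of the four signals in one place.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next(rec, r)
		latency.WithLabelValues(path).Observe(time.Since(start).Seconds())
		requests.WithLabelValues(path, strconv.Itoa(rec.code)).Inc()
	}
}

// statusRecorder captures the status code so errors can be counted.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

func main() {
	http.HandleFunc("/ping", instrument("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```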

Full post here, 4 mins read