Back 2 Code

[ Code — Data Science — Ops ]

Read this first

Effective Monitoring and Alerting


A short note about this book, which I used in my work. First of all, two strong points. The first is that it deals with monitoring, alerting and reporting in general, that is to say independently of the tools used. This is both a strength and a weakness, since it would have been useful to identify families of tools suited to each use. Stepping back like this is not so common, and it makes room for higher-level concepts, for example the organization of monitoring in stacks, which is absolutely crucial, as well as general notions and definitions that apply in all circumstances, or almost. That brings us to the second strong point: definitions. In a professional context it is essential to rely on precise definitions that frame concepts most people have an unfortunate tendency to confuse, such as monitoring and alerting.

On the weak side, it lacks background and practical cases. If we...

Continue reading →

Error Budgets

SRE has found that roughly 70% of outages are due to changes in a live system.


Knowing this, there is no need to look any further for the reasons why SRE teams (or production teams, or whichever team gets called by angry customers) are so reluctant to change. If that is not enough, just remember that their objectives are almost certainly based on the reliability of the services they maintain.

On the other side, teams in charge of developing new products try to push their code into production as often as possible (agility, for better or worse, encourages this trend) to provide new features to customers or to fix their own bugs, and also because they are evaluated on their velocity.

So you end up with a kind of dichotomy between two populations that work on the same product but not at the same level, and that share neither the same vision nor the same objectives.



Continue reading →

Fun but so true

Some of my favourite quotes about software engineering.


If builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization.
– Gerald Weinberg

There are only two hard things in computer science: cache invalidation and naming things.
– Phil Karlton


Don’t comment bad code – rewrite it.
– B. W. Kernighan & P. J. Plauger

Refactoring is often compared to gardening; it is never finished.
– Scott Rosenberg


First, nothing is as permanent as a temporary fix. Most of these remain in place for the next year or two.
– Michael T. Nygard


Meetings are places where minutes are taken and hours are lost.
– Anonymous?

View →


I explain here how to interact with AWS either with the CLI (Command Line Interface) or with an IT automation tool: Ansible. Ansible is not the first tool that comes to mind for AWS (Serverless, Terraform or the built-in CloudFormation make more sense); however, Ansible can be useful if you just want to configure a few EC2 instances, especially if you already have an Ansible script somewhere around.


I’m using Anaconda as the Python distribution; it’s not required, but I find it practical to use. I’m assuming either Anaconda or Miniconda is already installed. Please refer to the Anaconda installation page if that’s not the case.
You will also need an AWS Access Key and Secret Key pair. If you do not know how to get them, check this blog post.

 Conda environment

Set up a fresh conda environment with the latest Python version and give it a meaningful name: aws.

$ conda...

Continue reading →

Chaos Engineering: an introduction


Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Principles of Chaos

Netflix has practised Chaos Engineering for a long time and is at the origin of what can now be called a discipline; some organizations have a dedicated chaos engineering team.
Testing is about performing well-defined tests in well-defined conditions and expecting a well-known answer. Chaos Engineering is more about conducting experiments by producing a failure and learning from it (how the system behaves). Depending on the result, it may be time to work on improving things.
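To make the difference with a classic test more concrete, here is a minimal sketch of an experiment in Python. Everything in it is hypothetical and chosen for illustration only: the `price_service` dependency, the `get_price` function that falls back to a default, and the `inject_failures` wrapper. It is not taken from any chaos engineering tool; it only shows the idea of producing a failure on purpose and observing how the system behaves.

```python
import random


def inject_failures(dependency, failure_rate, rng):
    """Wrap a dependency so that a fraction of its calls fail (the injected fault)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return dependency(*args, **kwargs)
    return wrapper


def price_service(item):
    """Hypothetical dependency of the system under experiment."""
    return {"apple": 1.2}.get(item, 0.0)


def get_price(item, backend, default=None):
    """System under experiment: degrades to a default when the backend fails."""
    try:
        return backend(item)
    except ConnectionError:
        return default


def run_experiment(failure_rate, calls=1000, seed=42):
    """Inject failures at the given rate and count the degraded answers."""
    rng = random.Random(seed)  # seeded so the experiment can be replayed
    backend = inject_failures(price_service, failure_rate, rng)
    results = [get_price("apple", backend) for _ in range(calls)]
    return sum(result is None for result in results)


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(f"failure rate {rate}: {run_experiment(rate)} degraded answers out of 1000")
```

The point is not the number itself but what it teaches: here the system keeps answering with a degraded value instead of crashing, and the experiment tells you how often that happens.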

Depending on the use case, chaos engineering can range from:

  • lower layers: at infrastructure level by rebooting an instance,
  • to higher layers: at service level by flooding an API with a kind of...

Continue reading →

Release It!


This book is a bible for any professional who wants to deploy a solution in production (which is normally the goal, not building a throwaway POC). It is a recognized reference, since it helped popularize certain patterns such as the circuit breaker, and it is at the top of all the must-read lists in the domain. It is full of good advice and feedback, since Michael T. Nygard has worked in the field in question, which is now called operations (and even SRE), on critical applications: mainly, but not only, big e-commerce sites.

His book is more about experience and the application of good practices than about theory and dogmatism, which is a good thing. He also has a true talent as a writer (I’m speaking about technical books, not literature). This book is extremely well written and humorous, and that greatly spices up the reading, which might not be fun at all according to the...

Continue reading →


I’m using Pelican for another blog dedicated to books (no one is perfect). For several needs, and mainly because I’m a nerd, I have developed several plugins. And I have discovered that the Pelican plugin mechanism is based on a small framework called Blinker.

Blinker provides fast & simple object-to-object and broadcast signaling for Python objects.

It provides a way for objects to communicate through signals (a kind of event). I found this way of working really handy and elegant, so I decided to take a closer look and talk a bit about it.

It can be installed using pip: $ pip install blinker
To demonstrate its usage I’ve made a completely dumb example:

A number generator sending signals and methods written to listen to these signals in order to print some information.

# The only import needed
from blinker import signal

# Signals definition
number_generator_number =

Continue reading →

The 4 Golden Signals + 1

The term 4 golden signals was introduced by the Google SRE team in the book Site Reliability Engineering [1]. The main definitions presented below are borrowed from this book.

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

 1 - Latency (Performance)

The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests. [1]

I often call it performance since it sounds more natural to most people. A distinction has to be made between the performance of successful requests and that of failed requests.

  • The first reason is obvious: you want to know the performance of the service when it delivers correct answers, without it being distorted by the performance of failed queries–which you should expect to be faster, see...
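The distinction between the latency of successful and failed requests can be sketched in a few lines of Python. The request log below is made-up sample data and the function name `latency_by_outcome` is hypothetical; the sketch only shows how fast failures pull the overall figure down.

```python
from statistics import median

# Hypothetical request log: (latency in milliseconds, request succeeded?)
REQUESTS = [
    (120, True),
    (95, True),
    (110, True),
    (15, False),
    (300, True),
    (12, False),
]


def latency_by_outcome(samples):
    """Return the median latency of successful, failed and all requests."""
    ok = [ms for ms, succeeded in samples if succeeded]
    failed = [ms for ms, succeeded in samples if not succeeded]
    everything = [ms for ms, _ in samples]
    return {
        "success_median_ms": median(ok) if ok else None,
        "failure_median_ms": median(failed) if failed else None,
        "overall_median_ms": median(everything) if everything else None,
    }


if __name__ == "__main__":
    print(latency_by_outcome(REQUESTS))
```

With this sample data the median of successful requests is 115.0 ms, while the overall median drops to 102.5 ms because the two failures (median 13.5 ms) return almost immediately.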

Continue reading →

Architecting for Scale


This book is simple and well organised. It covers the key topics you need to address if you want to build, deploy and operate large-scale applications. Here they are (I am not inventing anything, they are the five sections of the book):

  • Availability: learn techniques for building highly available applications, and for tracking and improving availability going forward
  • Risk management: identify, mitigate, and manage risks in your application, test your recovery/disaster plans, and build out systems that contain fewer risks
  • Services and microservices: understand the value of services for building complicated applications that need to operate at higher scale
  • Scaling applications: assign services to specific teams, label the criticalness of each service, and devise failure scenarios and recovery plans
  • Cloud services: understand the structure of cloud-based services, resource...

Continue reading →

Circuit Breaker


A circuit breaker is a well-known piece of technology used in almost every house. According to Wikipedia, it is

designed to protect an electrical circuit from damage caused by excess current, typically resulting from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Unlike a fuse, which operates once and then must be replaced, a circuit breaker can be reset (either manually or automatically) to resume normal operation.

OK, but what does this have to do with software architecture? The basic idea is that the same concept can be used to protect an entire system from a large cascading failure caused by the failure of a weak dependency. Here is an example of this kind of failure, given by Lee Atchison in his book Architecting for Scale [1].

A classic example of the pitfalls of ignoring dependency failure occurred in a real-life application...
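The software version of the idea, protecting the system by cutting off a failing dependency and allowing a later reset, can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class name, the `max_failures` and `reset_timeout` parameters and the injectable `clock` are all hypothetical choices made for this sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable, which makes testing easier
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to the weak dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to that dependency, and handle `CircuitOpenError` by returning a degraded answer instead of waiting on a service that is known to be down.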

Continue reading →