The effective engineer (Part-IIIb)

I have started reading “The Effective Engineer” by Edmond Lau again. I noted down these points while reading so that they can serve as a cheat sheet for me and for others. I strongly recommend buying the book and reading it at least once.

The book is divided into 3 parts, and my idea is to write one blog post for each. This post belongs to the third part of the series. You can read the Part-I here and Part-II here.

For Part-III, I am doing things a little differently. The third part of the book is considerably long, especially its second chapter, so I decided to break it into 3 sections, one for each chapter in the book. You can get a summary of the first chapter of Part-III.

Continuing from where we left off, here’s the 2nd section, on building long-term value…

Minimize operational burden

Embrace Operational Simplicity – Do simple things first

Complex architectures impose a maintenance cost in a few ways:
– Engineering expertise gets splintered across multiple systems.
– Increased complexity introduces more potential single points of failure.
– New engineers face a steeper learning curve when learning and understanding the new systems.
– Effort towards improving abstractions, libraries, and tools gets diluted across the different systems.

Build systems to fail fast – Fail fast to pinpoint the source of errors

Examples of failing fast include:
– Crashing at startup time when encountering configuration errors
– Validating software inputs, particularly if they won’t be consumed until much later
– Bubbling up an error from an external service that you don’t know how to handle, rather than swallowing it
– Throwing an exception as soon as possible when certain modifications to a data structure, like a collection, would render dependent data structures, like an iterator, unusable
– Throwing an exception if key data structures have been corrupted rather than propagating that corruption further within the system
– Asserting that key invariants hold before or after complex logic flows and attaching sufficiently descriptive failure messages
– Alerting engineers about any invalid or inconsistent program state as early as possible
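
The first two examples above (crashing at startup on configuration errors, and validating inputs that won’t be consumed until later) can be sketched roughly as follows. This is only an illustration: the names ConfigError, ServerConfig and load_config, as well as the config keys, are hypothetical and not taken from the book.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at startup when the configuration is unusable (hypothetical name)."""


@dataclass(frozen=True)
class ServerConfig:
    host: str
    port: int
    timeout_seconds: float


def load_config(raw: dict) -> ServerConfig:
    """Validate everything up front so a bad value fails here, not hours later."""
    required = ("host", "port", "timeout_seconds")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ConfigError(f"missing required config keys: {missing}")

    try:
        port = int(raw["port"])
        timeout = float(raw["timeout_seconds"])
    except (TypeError, ValueError) as exc:
        raise ConfigError(f"port and timeout_seconds must be numeric: {exc}") from exc

    # Descriptive messages make the source of the failure obvious
    if not 1 <= port <= 65535:
        raise ConfigError(f"port must be in 1..65535, got {port}")
    if timeout <= 0:
        raise ConfigError(f"timeout_seconds must be positive, got {timeout}")

    return ServerConfig(host=str(raw["host"]), port=port, timeout_seconds=timeout)
```

Calling load_config as the very first thing the program does means a typo in the configuration stops the process immediately, instead of surfacing later as a confusing connection error.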

Relentlessly Automate Mechanical Tasks – Automate mechanics over decision making

Ask yourself: Will I save more time overall by manually doing a particular task or by paying the upfront cost of automating the process?

– Time is our most valuable resource. Push relentlessly toward automation.
– Do not stop automating for the following reasons:
  – Not having time right now
  – Tragedy of the commons: nobody is interested in automating (self-interest over the group’s long-term interest)
  – Lack of familiarity with automation tools
  – Underestimating the future frequency of the task
  – Not internalising the time savings over a long time horizon
– Activities where automation can help include:
  – Validating that a piece of code, an interaction, or a system behaves as expected
  – Extracting, transforming, and summarizing data
  – Detecting spikes in the error rate
  – Building and deploying software to new machines
  – Capturing and restoring database snapshots
  – Periodically running batch computations
  – Restarting a web service
  – Checking code to ensure it conforms to style guidelines
  – Training a machine learning model
  – Managing user accounts or user data
  – Adding or removing a server to or from a group of services

Important: Automation can produce diminishing returns as you move from automating mechanics to automating decision-making. Given your finite time, focus first on automating mechanics. Simplify a complicated chain of 12 commands into a single script that unambiguously does what you want. Only after you’ve picked all the low-hanging fruit should you try to address the much harder problem of automating smart decisions.
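
As a rough sketch of “a single script” replacing a chain of manual commands: the snippet below runs a list of steps in order and stops at the first failure. The function name run_steps and the three placeholder steps are assumptions made for illustration; each placeholder just invokes the current Python interpreter so the sketch runs anywhere. In practice you would put your real test, build and deploy commands there.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for name, command in steps:
        print(f"==> {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"step '{name}' failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed += 1
    return completed


# Placeholder steps (hypothetical): replace with the real commands
STEPS = [
    ("run tests", [sys.executable, "-c", "print('tests ok')"]),
    ("build package", [sys.executable, "-c", "print('build ok')"]),
    ("deploy", [sys.executable, "-c", "print('deploy ok')"]),
]
```

Calling run_steps(STEPS) executes the whole chain. Note that the script only automates the mechanics; deciding whether it is the right moment to deploy is still left to a human.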

Make your batch process idempotent – Aim for idempotence and reentrancy

Scripts that execute a sequence of actions without human intervention are known as batch processes.
“An idempotent process produces the same results regardless of whether it’s run once or multiple times.”

– Idempotence gives you the ability to run infrequent processes at a more frequent rate than strictly necessary, to expose problems sooner.
– When idempotence isn’t possible, structuring a batch process so that it’s at least retryable or reentrant can still help.
– A retryable or reentrant process is able to complete successfully after a previous interrupted call.
– A process that’s not reentrant typically leaves side effects on some global state that prevents it from successfully completing on a retry.
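
One common way to get idempotence is to give every unit of work a unique id and record which ids have already been handled. The sketch below assumes a hypothetical batch job that applies credits to account balances; apply_credits and the data layout are made up for illustration, and plain in-memory structures stand in for a database.

```python
def apply_credits(balances, applied_ids, credits):
    """Apply each credit at most once, keyed by its unique id.

    Because handled ids are remembered, running the job twice, or rerunning it
    after an interruption, ends in the same state as a single clean run.
    """
    for credit_id, account, amount in credits:
        if credit_id in applied_ids:
            continue  # already handled by an earlier (possibly interrupted) run
        balances[account] = balances.get(account, 0) + amount
        applied_ids.add(credit_id)
    return balances


balances = {}
applied_ids = set()
credits = [("c1", "alice", 10), ("c2", "bob", 5), ("c3", "alice", 7)]

apply_credits(balances, applied_ids, credits)
apply_credits(balances, applied_ids, credits)  # second run changes nothing
```

In a real system the balance update and the id bookkeeping would need to be committed together (for example, in one database transaction); otherwise a crash between the two steps could still cause a double application.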

Running batch processes more frequently also allows you to handle assorted glitches transparently. A system check that runs every 5 to 10 minutes might raise spurious alarms because a temporary network glitch causes it to fail, but running the check every 60 seconds and only raising an alarm on consecutive failures dramatically decreases the chances of false positives. Many temporary failures might resolve themselves within a minute, reducing the need for manual intervention.
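
The “alarm only on consecutive failures” idea can be sketched as a small counter. The class name ConsecutiveFailureAlarm and the default threshold of 3 are assumptions chosen for illustration; the book does not prescribe a specific number.

```python
class ConsecutiveFailureAlarm:
    """Fire only after `threshold` failed checks in a row (hypothetical helper)."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result; return True when the alarm should fire."""
        if check_passed:
            # A single success means the glitch resolved itself
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


alarm = ConsecutiveFailureAlarm(threshold=3)
results = [True, False, True, False, False, False]
fired = [alarm.record(ok) for ok in results]
```

With these sample results, the isolated failure in the second check is ignored, and the alarm fires only on the last check, after three failures in a row.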

Hone your ability to respond and recover quickly – Plan and practice failure modes

– Netflix, Google, and Dropbox all assume that the unexpected and the undesired will happen.
– They practice their failure scenarios to strengthen their ability to recover quickly.
– They believe that it’s better to proactively plan and script for those scenarios when things are calm, rather than scramble for solutions during circumstances outside of their control.
– Ask “what if” questions and work through contingency plans for handling different situations:
  – What if a critical bug gets deployed as part of a release? How quickly can we roll it back or respond with a fix, and can we shorten that window?
  – What if a database server fails? How do we fail over to another machine and recover any lost data?
  – What if our servers get overloaded? How can we scale up to handle the increased traffic or shed load so that we respond correctly to at least some of the requests?
  – What if our testing or staging environments get corrupted? How would we bring up a new one?
  – What if a customer reports an urgent issue? How long would it take customer support to notify engineering? How long for engineering to follow up with a fix?
– Practicing our failure scenarios so that we can recover quickly applies more generally to other aspects of software engineering as well:
  – What if a manager or other stakeholder at an infrequent review meeting raises objections about the product plan? What questions might they ask, and how might we respond?
  – What if a critical team member gets sick or injured, or leaves? How can we share knowledge so that the team continues to function?
  – What if users revolt over a new and controversial feature? What is our stance and how quickly can we respond?
  – What if a project slips past a promised deadline? How might we predict the slippage early, recover, and respond?

 
