
I am halfway through the book Release It! by Michael Nygard and thought I would scribble down my first take on it. The book aims to prepare developers for the uncertainties of a production environment. Essentially, the code you write behaves differently in different environments, and most of the time what developers see (or even foresee) is its behavior in their dev/QA/load-test environments. By the time you are done with load and functionality testing, the assumption is that things are production ready, and so the code is pushed to production. The kudos mails are in and the team heads out for the project party :-).
Once in the production environment, given the right push, your code starts to get stressed and strained. This shows up as behavior that differs from what was seen in the other environments. As the application's workload grows, the resources available to it get strained until something finally breaks, bringing down the whole application. I am sure this is nothing new for a lot of people out there. I have heard of organizations resorting to crazy (but practical, given the circumstances they are in) procedures such as restarting their applications once every 4 hours to get around these limitations. If the stories are true, there are even more surprising things out there, like 400 restarts per day.
The book encourages developers to see things from a pessimistic point of view. Anything and everything outside your application's own memory can fail, so how do you plan to work around this? The memory itself can fail too, but I guess you can't really do anything about that. If you access the network, it could fail, so have you planned for it? What is your plan? If you store lots of data in a database and try to read it back, how does the sheer volume of data affect your application? If you integrate with third-party services such as a credit card payment service or an address verification service, how does their SLA affect yours? If a part of your system is in trouble, does it bring down your whole application? How can you degrade your application gracefully when parts of your system are down? Have you planned for the difference in capacity between your application and the applications you integrate with? These are just some of the issues that this book is trying to address. Some of the proposed solutions, such as:
- Timeout (even for thread waits)
- Fail Fast
- Circuit Breaker
etc. are not so difficult to implement. There are more solutions available in the book if you are interested. In today's applications, where integration with auxiliary services is key to improving the core value of your application, I think the book has a lot to offer. I hope to provide more input once I am done reading it.
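
To give a feel for how little code the circuit breaker idea needs, here is a minimal sketch in Python. It is my own illustration, not code from the book, and the names CircuitBreaker, CircuitOpenError, max_failures and reset_after are made up for the example. The idea is that after a few consecutive failures the breaker "opens" and rejects calls immediately (fail fast) instead of letting every request wait on a service that is already in trouble. After a cool-down it lets one trial call through to see whether the service has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a retry
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast, do not touch the service.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens
            # the circuit and restarts the cool-down.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The breaker only helps if the wrapped call can actually fail in bounded time, which is where the timeout item comes in: the function you pass to call() should carry its own timeout, for example urllib.request.urlopen(url, timeout=2). A caller that catches CircuitOpenError can then degrade gracefully, say by skipping address verification for a while, rather than hanging along with the remote service.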


