Technical Team's nightmare

11th February 2010 will go down in the history of the Sankalp Tech Team in bold letters. We are not particularly proud of this date. Yet the series of events that unfolded that evening has ensured that the organisation takes a whole new approach to handling its technical requirements. Sankalp Tech will never be the same.

At around 7:00 PM on this day, all Sankalp sites went down. The service provider withdrew all services without notice. As we scrambled to understand what had happened, we got a mail from the service provider saying that we had over-consumed CPU resources and that our sites were therefore being blocked!

Having no central office where all of us can meet often, we have made the workflows of Sankalp heavily dependent on internet-based communication. We have multiple intranet-like portals which allow each of our teams to work seamlessly, with very powerful information logging and exchange. The sites going down spells disaster for the organization. Even if the sites are down for just 24 hours, many situations can easily get out of hand.

We initiated a conversation with the support staff of our service provider and tried to understand what had happened. They explained that our account was consuming too much CPU and that we should upgrade from the Shared Hosting we currently use to a Virtual Private Server (VPS). This meant paying at least 5 times more than what we pay now! Sankalp volunteers pay from their own pockets to run the organization. We have to think 10 times before we spend an extra rupee. And here we had a situation where, to get things back online, we needed to pay a huge amount of money!

It took time to explain to the support staff that we needed the sites back online, and to request 2 weeks in which we would either bring our sites within the utilization limits or move to a VPS. Persistence paid off and the sites were put back online.

There was a sigh of relief within the team. At the same time, an investigation began into why this happened, how it could have been handled better and how to prevent it next time. One thing we realized was that we had faced a similar problem 2 months earlier. That incident was minor, but we had identified what could be done to fix it, and those fixes were posted as high-priority task items. Yet even after 2 months we had not actually worked on them. There was something fundamentally wrong in the way we were managing our systems.

To give some background: at first Sankalp had just one site, and the Tech Team was responsible for maintaining it. There was ample time at hand and some interesting things to do. As we understood the power of Drupal and started using more and more Drupal installations for various activities, the complexity increased. All the tasks that happen in the organization were systematized, and we had more than 6 Drupal installations running in parallel, receiving hundreds of hits daily. Each of these was a very custom, standalone setup. As teams began using them heavily, we started getting a lot of requests for enhancements and improvements. What was once a periodic update turned into daily bug fixes and upgrades. The Tech Team was tied up almost all the time, and its focus shifted from ensuring stability and scalability to providing features and keeping things going. This was a perfect recipe for disaster, and it struck!

As a learning from this event, we stopped taking random feature requests completely. We focused on doing one thing at a time for a longer duration, taking care of requirements well into the future. We requested people to do their homework before asking for features, to avoid frequent changes. All the time that was saved was used to learn the deeper aspects of our systems, one thing at a time.

The first major agenda item was the investigation into why this happened, how it could have been handled better and how to prevent it next time. For the first time we started looking at system-level issues. The work of a Linux sysadmin landed on our tables. We focused on identifying what could be done to get more performance out of our setups with fewer resources. A lot of research paid good dividends. We first learnt how to measure performance. Then we used this understanding to identify bottlenecks and fix them. Finally we put in place a regular monitoring structure to give us an early warning if something unplanned starts happening.
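To make the early-warning idea concrete, here is a minimal sketch in Python. It is not the monitoring we actually run; the function name `make_cpu_monitor`, the 80 percent limit and the window size are all assumptions chosen for illustration. It keeps the last few CPU usage readings and raises a warning only when their average crosses the limit, so a single short spike does not trigger an alert.

```python
from collections import deque
from statistics import mean


def make_cpu_monitor(limit_percent=80.0, window=5):
    """Return a function that records CPU readings and says when to warn.

    limit_percent and window are illustrative assumptions, not values
    taken from any hosting provider's policy.
    """
    recent = deque(maxlen=window)  # keeps only the latest `window` readings

    def record(sample_percent):
        recent.append(sample_percent)
        # Warn only once the window is full and the average is over the limit.
        return len(recent) == window and mean(recent) > limit_percent

    return record


# Example: feed in readings taken at regular intervals (made-up numbers).
record = make_cpu_monitor(limit_percent=80.0, window=3)
for reading in [90, 95, 85, 10]:
    if record(reading):
        print("Warning: sustained high CPU usage, latest reading", reading)
```

In a real setup the readings would come from whatever the host exposes, and the warning would be mailed to the team instead of printed.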

Phew! After all this, the Tech Team has a fresh feeling. We have learnt a new dimension of looking at web hosting. If we now consume more resources than allowed, we will be happy to move to a VPS, as we know that we have done our best to keep the resource usage low. The move to higher resources will then be justified :) At the same time, the recent change in the way we work, picking up one thing at a time and drilling into it, looks very refreshing and interesting for now.

Happy working on our sites. Happy surfing our sites.

Sankalp Unit