Building Software Brick by Brick: June 2008

Wednesday, June 25, 2008

Choosing Open Source Software

Of late Open Source Software (OSS) has become so critical to software companies that it is hard to imagine life without them. Although software vendors do not have code from the OSS community (although they would like to) because of licensing restrictions, their internal IT environments are replete with them. Consider this - Operating System: Linux, Development Environment: Eclipse, Issue tracking: Bugzilla, Source code control: SVN and the list goes on. A complete IT environment can be set up in matter of hours and we have multiple options at every turn.

Now let us turn our attention to a harder problem. Suppose one is releasing a product or a service. Can one use open source / free software as part of this? Obviously the first aspect to think is that of licensing. There are few tools out there that can be embedded into other software as part of a product without constraints. However the licensing constraints are not applicable for hosted services since the original software is not re-distributed for profit. So for the discussion lets move past licensing and see how do we choose open source components. The following anecdote is instructive.

Recently we were implementing a messaging system as part of which we had to have a Mail Transfer Agent (or in simpler words a server that can send, receive and deliver emails). We looked at various free options - theres tons of them some of them open source while others are without licensing costs. We eventually decided to go with qmail, primarily because we already had in house expertise on setting it up and running it. All was going very smooth until we did some high throughput performance tests. It turned out that the average delivery time for every message was increasing non-linearly with increase in load. It meant that with more load the system too significantly more time to deliver the messages. It was unacceptable.

We did a bit of profiling of the system and eventually could attribute the system behaviour to a queer problem. Its called as "The silly Qmail Syndrome" (http://qmail.jms1.net/silly-qmail.shtml). The gist of this problem is that when a very large number of messages are sent in a short period of time, qmail keeps itself busy doing only one set of tasks (either sending messages out or classifying incoming messages). This starves the other task and in turn dramatically increases the end to end delivery time. The site above gives the solution - comes to the community as a patch to the qmail code. The creators (Andre O and John S) have been kind to share this in an easily consumable form with the world. I shudder to think how our project would have been affected if we had not found this solution in time.

In conclusion I would like to go back to the point I wanted to make. When choosing Free or Open Source software to be part of critical business applications (wonder if there are there any non-critical business applications!) the most important thing one needs to look at is the community around the tool in question. The references below have more thoughts on what else to look for. In all using the right open source tools can amount to real cost savings while not compromising quality.

Some References

http://www.oss-watch.ac.uk/resources/beginners.xml: A good start for people who do not know the details of open source.
http://www.oss-watch.ac.uk/resources/tips.xml: Suggests top ten tips on how to select open source software. The focus is for enterprises to deploy them internally.
http://www.linuxdevices.com/articles/AT7621761066.html: Nokia built the 770 Internet Tablet mostly on open source. Read some interesting aspects of their story here.
http://www.dwheeler.com/oss_fs_eval.html: Good process for evaluating open source software systems
http://shearer.org/MTA_Comparison: Has a good comparison of the popular MTAs (SMTP servers). This is a good example of a typical study of open source / free software which we need to look at before we use them.

Tuesday, June 17, 2008

Insights into Java's popularity

Recently I was involved in a discussion on why Java has become so popular. As a result I did a bit of searching around. If you ask this question (Why has Java become so popular?) to people, the typical answers would be in the line of - Java is platform independent, Ease of use, Object Oriented, Multi-threaded and the like. However there is more to this than just good features of Java.

One of the most important ingredients to Java’s success is the marketing strategy employed by Sun. They did a great job on this one and I mean it in a positive way. Java (or Oak as it was called earlier) was invented for a very different purpose and for a different project. It was meant to network electronic devices. By definition it was expected that there would be varying environments involved, the programs had to be resilient in the face of devices coming on and going off the network and so on. However for whatever reason this idea did not take off and Java was orphaned. Sun probably actually did not know what to do with Java – the project went through many phases of closures and rebirths. However the explosive growth of the internet was round the corner. Java by design was suited to address some of the problems that came with this growth. At this point Sun made a great decision to put Java into the hands of the community across the world while still maintaining just enough control to enforce discipline. This was in a way innovative for the times – the developer community embraced Java like a long lost cousin. Java reached the stage it is today through this partnership. Has Sun made a lot of money in the process, I do not know, but it gave Java a great life ahead.

The other aspect of why Java grew so popular is that it is a dynamic language by design. The fact that Java is interpreted gives the designers the advantage of doing certain things differently. However there is still credit due to them for considering doing things differently. To understand the significance of the dynamism of Java, consider the following. Lets say there is a software system running for a particular purpose. As time progresses the needs of the users change and so do the expectations from this software. How does one add more functionality to this system?

The typical solution for the above problem is to build a new version of the software with additional/changed features. Then replace the older version with new one. The key to note here is that although the new and old versions of the software share the same code base, they are distinctly two different entities. This is where Java’s dynamism can be very helpful. One could build a system in Java that could be extended while on the run. A Java based Application Server is a very good example. We add new functionality (in fact complex new applications themselves) at run time without bringing down the Application Server itself. This of course does not mean that all Java programs are extensible. It just means that Java provides an easy platform. A running Java program can discover new things about code that it encounters for the first time through the mechanism of Reflection. The programmer still has to do the hard work of designing a system that harnesses these capabilities. That said let me also talk about the earlier C/C++ world. What I described above is also possible to be done using shared libraries (or dynamic linked libraries). But it is not as easy to do it – primarily because shared libraries were invented with a different goal in mind.

These two aspects of Java made it popular. Today its very well embraced by many communities and in fact many critical projects bet their money on Java.

Sunday, June 8, 2008

The Unsung Hero

Okay the title sounds like something from the history text book. But lets start with following problem definition:

Data about students in a college (typical of strength of about 500) are to be tracked. We need to be able to track about twenty attributes for every student like:

Opting for a particular subject, opting for hostel accommodation, Ladies and gents, etc.

Further we need to be able to answer queries like

Give all students who have opted for Artificial Intelligence and stay in the hostel

How you would organize the data structure and the corresponding algorithms for this problem? Think about this before you read further on.

If you recognized that these could be implemented as sets and the answers to the queries would be set operations, then you are on the right track. However what data structure did you think of using? If you thought of a structure like a list then you are not alone.

However there is a better structure suited for this situation - the bit vector. We will represent the data for students as bit vectors for each of the attribute - note that all the attributes need to be boolean. So we have a 500 bit vector for every subject, hostel accommodation and so on.
Then the answer to the queries like the one mentioned above can be obtained by a series of bitwise operations between different bit vectors. And there are some really cool algorithms to do the popular bitwise operations. This is very efficient both in terms of storage and time taken to process queries.

The question is why don't most of us think of this data structure? I believe its because our learning curriculum does not accord the kind of importance needed for such an elegant and efficient data structure. We are systematically "taught" to stay away from anything to do with bit operations. In fact I have seen good programmers with a bit operations phobia. I think it is time to revisit this mentality.

Let me end this discourse with a problem picked up from Jon Bentley's book Programming Pearls. The situation there was that a programmer needed to sort several thousand telephone numbers (seven digits long) stored in a file. The programmer was constrained by the fact that s/he could not read the entire set of numbers into main memory for lack of space.

The solution that was thought of was interestingly very elegant. An obvious note is that telephone numbers do not repeat. So the programmer represented every phone number (there can be only 99,99,999 of them since they are seven digits) by a bit. If the phone number existed in the file then the corresponding bit was set to one. So the bit vector could be initialized by one pass of the file on disk. Then sorting the numbers just meant walking through the bit vector and printing the (phone) numbers for which the bit was set to one.

I say three hurrays to our unsung hero The Bit Vector!

Building Software Brick by Brick