Tuesday, September 23, 2008

Static Code Analysis - some thoughts

If you do (or at some point did) C programming on Unix, you get introduced to static code analysis pretty early through tools like lint, cflow, ctags and cxref, although I was not familiar with the term at that time. So what is Static Code Analysis?

As the name suggests, it is a technique to analyze software without actually running it, which means the technique looks at the source code of the software. Note that it is not Software Testing, wherein we actually compile and run the application and feed it test data; Software Testing is a part of Dynamic Analysis. Typically static analysis tools are built on top of compiler front-ends. The information collected by the front-end is either run through sets of rules to find pre-defined patterns or stored for further analysis.
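
To make the "front-end plus rules" idea concrete, here is a minimal sketch in Python (my choice purely for illustration; the Unix tools named above work on C). Python's standard `ast` module plays the role of the compiler front-end, and a single made-up rule flags comparisons against `None` written with `==` or `!=`. The function name `find_none_comparisons` and the sample input are hypothetical.

```python
import ast

def find_none_comparisons(source):
    """Return (line, message) pairs for `== None` / `!= None` comparisons."""
    findings = []
    tree = ast.parse(source)  # the "front-end": source text -> syntax tree
    for node in ast.walk(tree):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left] + node.comparators
        for i, op in enumerate(node.ops):
            if isinstance(op, (ast.Eq, ast.NotEq)):
                pair = (operands[i], operands[i + 1])
                if any(isinstance(x, ast.Constant) and x.value is None for x in pair):
                    findings.append((node.lineno, "use 'is None' / 'is not None'"))
    return findings

# The sample is only parsed, never executed.
sample = "def f(x):\n    if x == None:\n        return 0\n    return x\n"
print(find_none_comparisons(sample))
```

Real tools carry hundreds of such rules, but each one has roughly this shape: walk the tree, match a pattern, report a location.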

So is analyzing source code without actually running the software of any use? It turns out that it is very useful - both for prevention of bugs and also for discovery of information. Primarily Static Code Analysis tools have been used for two purposes:

  • Making code more maintainable: Over time the software community has collected a lot of good coding practices. A tool can analyze source code to find whether any such practices are violated. More importantly, these kinds of tools can now be integrated into Integrated Development Environments (IDEs) so that a software developer is warned right at the moment s/he is making a mistake. These tools can also be run on existing software to point out weak areas or outright defects. Typically these kinds of issues were expected to be caught in Peer Reviews, but it's always a good idea to take help from software.

    Another way to make software more maintainable by using these tools is to rearrange the code (Beautify, Re-factor) so that it is well-formatted and adheres to good design practices. The resulting rearranged code is more elegant, more logical and hence more maintainable.

    Analysis of other non-functional characteristics like Security, Performance and Scalability is also studied, but is not as commonplace.
  • Finding design information in existing code: More often than not, software is written under artificially created "Time to Market" pressure. Another very misused term is "We use agile coding methodologies". Whatever the means, the end result is a large amount of poorly documented and poorly structured source code. Further, it is very likely that the team that inherits this source code is not the team that wrote it. At these times, tools that can extract design information from source code become critical to the success of the project.

    These are called Information Abstraction systems. They collect data from the source code and put it into data stores (like relational databases). Queries can then be built on these data stores to aid in discovering design information, for example: what are all the recursive functions in this software? Which classes 'use' a given class? This process is also called Reverse Engineering. There are tools that can generate 'intelligent' class diagrams from source code. Many a time such tools save the day for software developers.

    Another use of the information thus collected is to help a software developer navigate large code bases easily - anyone who has waded through millions of lines of code appreciates the value of such tools.
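
The Information Abstraction idea from the list above can be sketched in a few lines. This is only an illustration under my own assumptions: it uses Python's `ast` module to extract "who calls whom" facts, loads them into an in-memory SQLite table, and then answers two of the example questions with plain SQL. The table layout, the function names and the sample program are all hypothetical, and only direct self-calls are detected as recursion.

```python
import ast
import sqlite3

def extract_calls(source):
    """Yield (caller, callee) pairs for direct calls made inside top-level functions."""
    tree = ast.parse(source)
    for func in tree.body:
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    yield (func.name, node.func.id)

def build_store(source):
    """Load the extracted facts into an in-memory relational table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")
    db.executemany("INSERT INTO calls VALUES (?, ?)", extract_calls(source))
    return db

sample = '''
def fact(n):
    return 1 if n <= 1 else n * fact(n - 1)

def area(r):
    return 3.14 * square(r)

def square(x):
    return x * x
'''

db = build_store(sample)
# "What are all the recursive functions?" (direct recursion only)
recursive = [row[0] for row in db.execute(
    "SELECT DISTINCT caller FROM calls WHERE caller = callee ORDER BY caller")]
# "Who uses square?"
users_of_square = [row[0] for row in db.execute(
    "SELECT DISTINCT caller FROM calls WHERE callee = 'square' ORDER BY caller")]
print(recursive, users_of_square)
```

Once the facts sit in a table, every new question is just another query, which is what makes this approach attractive for large code bases.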
Over the last few years the importance of static analysis has steadily grown. Teams find that they can dramatically reduce their code review times if the reviews are preceded by a static analysis pass. Open source projects, where code reviews are logistically hard to arrange, depend on these techniques to ensure a minimum quality of the software. Today development environments from Eclipse, IBM Rational and Microsoft include a multitude of static analysis tools to make the developer's life easier. If one has to reach the next level as a developer then one needs to know how to take advantage of these tools to take care of the mundane basics and use one's time for the more important things in life (like getting a cup of coffee!)

This is definitely not exhaustive, but I wanted to put together some interesting aspects of static analysis.

Monday, August 18, 2008

Understanding Existing Code

Continuing from my last post on debugging, the next natural step is to look at how to understand existing code. While we are on this topic it is interesting to note that the overall software entropy in the world is ever increasing. What I mean by that is the number of lines of code of software written in the world keeps increasing. So far it has been the experience that systems that need software, keep needing it and never die out. They might change in shape and size but never die out. Take the example of mainframe systems. Even today many enterprises still employ systems from the 1970s although they are wrapped in modern connectors. Moreover the systems that have been decommissioned are replaced by more complex modern software systems. Would we ever reach a time when all software ever needed is written? It is hard to say but I would bet that this time would never come.

So whether we like it or not, we will be co-existing with a lot of software, and understanding existing software becomes critical. This article is about my thoughts on how to ease the life of the programmer who is banished to understanding others' code. More often than not there is little or no documentation, both inside the code and outside (what do you mean documentation, you have the code!). The challenge this programmer faces is to extract design ideas out of the code. Before a programmer embarks on this journey she needs to understand very clearly why she is doing it. It could be taking ownership of the software, fixing a defect or adding a new feature. The following thoughts come to mind.
  • The first step in understanding a software system is to go outside in. One needs to understand as much as possible about the execution context, functionality exposed to the user, external interfaces to the world and so on. This information is embedded in User Interfaces, so "playing around" with the system is obviously desirable. Sometimes this is not possible, and the programmer has to make do with reading user manuals, relying on user descriptions, or even using only reports and other artifacts generated by the system under consideration.
    • A neglected aspect in this regard is the set of test cases for the system. If the quality assurance for the system is up to date and complete, then the test cases are an extremely rich source of information about the behaviour of the system. The programmer needs to classify the test cases and start by looking at "end to end" cases. Executing these test cases on a running system is very desirable.
  • The next step is to understand the deployment aspects of the system. If one is faced with a compiled system, the build scripts embed a lot of information about the dependencies within the system. The process of getting from code to binary executables can be quite complex, and understanding this process is critical to understanding the software. There are a few tools that help the programmer in this process - for example, one tool converts build scripts into a graphical representation that is much easier to comprehend.
  • The next step is to start getting into the code. Many times programmers start understanding a software system inside out, that is, they start by looking at the code first. This is not a good idea as it can be very confusing and lead to wasted time. The better way is to go outside in. However, these three steps will need to be repeated back and forth many times. Coming back to understanding source code, there are quite a few advances in this area.
    • One way to start is to set up the source code as a "project" in an appropriate Integrated Development Environment (IDE). The IDE support for modern languages is phenomenal and one needs to harness this for understanding code. For one, navigating the code becomes extremely easy.
    • Another point to consider here is that Program Understanding has developed into a science in the last two decades and there are tools now available to automate this process.
      Using automatic static analysis tools makes a lot of sense when the code base the programmer is faced with is huge (this can be relative, but my opinion is anything larger than 100,000 lines of code). Many of these tools extract information about the software and store it in queryable systems. The data is then open to a lot of reporting and representations. The next level of detail is to put together the control and data flows that are relevant to one's investigations.
  • It is rare that a programmer gets into understanding large programs just for the sake of understanding them or for leisure. More often than not, the effort is driven by a need to change the behaviour of the software under consideration - either to add new functionality or to fix a defect. In this case the programmer has a definite goal. She can now start narrowing in on this goal. The ideal situation is to understand all the flows related to the module being changed and have a good sense of the impact of changing the code.
    • An interesting note here - as part of doing the above a programmer mentally "slices" the program in various ways. In computer science this is studied as "Program Slicing". There are a few tools that can help the programmer here, but more importantly having this theoretical awareness gives the programmer a better perspective on how to go about achieving one's goals.
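
For readers who have not met Program Slicing before, here is a deliberately tiny sketch of a backward slice, written under heavy simplifying assumptions: the program is straight-line code made only of simple `name = expression` assignments, with no branches, loops or function calls with side effects. The function name `backward_slice` and the sample program are my own, not taken from any slicing tool.

```python
import ast

def backward_slice(source, target):
    """Return line numbers of the assignments that can affect `target`.

    Only handles straight-line code made of simple `name = expr` statements.
    """
    statements = ast.parse(source).body
    needed = {target}   # variables whose values we still have to explain
    lines = []
    for stmt in reversed(statements):
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            continue
        name = stmt.targets[0].id
        if name in needed:
            lines.append(stmt.lineno)
            needed.discard(name)
            # every variable read on the right-hand side is now needed too
            needed |= {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
    return sorted(lines)

program = """\
price = 100
qty = 3
discount = 5
label = "order"
total = price * qty
total = total - discount
"""
print(backward_slice(program, "total"))
```

The slice for `total` leaves out the `label` line, which is exactly what a programmer does mentally when chasing one value through a module and ignoring everything unrelated.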

Hopefully these points help a software programmer or two faced with this dilemma. The ideal outcome of a program understanding effort is a complete set of documentation, both inside and outside the code. However, many times it's not feasible because of time and budget constraints. My personal opinion is that the organization needs to at least budget for internal commenting of the code. The programmers given the task of understanding code should also be charged with commenting it. This can be done incrementally: the programmer adds suitable comments in the modules that she is targeting. If this is not done then the knowledge discovered in this process remains in the minds of people, but maybe that is a good thing because it generates jobs for other programmers in a few years!!

Tuesday, July 15, 2008

How to be a Better Debugger

One of the complaints I have had with my computer science education is that I was never taught to debug code. Writing code is one thing, but getting it to work is another! Also it is interesting that our curriculum never emphasizes understanding existing code either. And more interesting is the fact that this has not changed in the last fifteen years. There are a whole bunch of new things added to the computer science subjects, but the required emphasis for debugging is still missing. I believe it is an extension of one of the human traits - we learn well how to speak but never to listen well :).

So I thought I would write a bit about how one can get good at debugging code. As I said, it's like learning to listen well. The easiest way to debug code is to ensure that the bug does not get introduced in the first place! Sound programming techniques are not the focus of this article, so I will not talk about them; besides, many times you are fixing someone else's code. Assuming that enough care is taken to start with, there will still be bugs. This article is about how to find them. Even today, after all the advances in compiler technology, debugging is a tough art. However, that is reason enough to motivate many people to get better at debugging.

Before we jump in, I want you to think about why defects are introduced in software in the first place. If you look at it there can be about 10-15 different reasons. Some that jump to mind are "Developer does not understand requirements well", "Developer does not understand the underlying platform well", "Plain old mistakes" and so on. If every bug one finds can be classified into one of these categories, it will help teams to avoid them in the future. Overall my point is that defects are good opportunities to learn about what's happening in the code and around the team. I will address the debugging process in the following bullets.

  • The first key to efficient debugging is to clearly understand what the software has to be doing. It sounds absurd but I have seen umpteen situations where the developers do not comprehend and more importantly do not agree on the software's behaviour. Getting this cleared out of the way eliminates a class of defects related to requirements understanding.
  • Debugging is like solving a murder mystery - you have the dead body (or the crashed program) from where you have to reason back about what happened. The good thing in programming is with luck a developer can recreate the situation where the program dies. So the next step in debugging is to be able to reproduce the bugs predictably. For complex defects this is easier said than done.
  • Typically a bug manifests itself in different ways - a memory (core) dump, a stack trace dump, an unexplained or unintended behaviour or just plain old software hang (in which the software does nothing!). So the next mantra in debugging is to collect as much information about the bug as possible. This step is under-rated by many, because once we find an issue we tend to stop and fix it. However we do not pay attention to all the flows in which the particular issue can surface. The typical consequence of not doing enough upfront research is the introduction of new defects while fixing the one at hand.
  • There are several ways of gathering the above information - stepping through debugger sessions, analysing existing logs, adding print statements to gather more information, etc. It is usually less time consuming to add print statements, and this is recommended. However, it has the side effect of strewing one's code with unnecessary statements - setting up a good logging framework upfront alleviates this problem. Moreover, these print statements can be switched on again if there is a later need.
  • The next step in the debugging process is to track down the point at which the bug was introduced. If developers follow the discipline of "one issue fixed per checkin" then it makes the debugging that much easier. Most often the recent changes are the culprits. Hence you need to develop skills in doing diffs and using the versioning system well.
  • You should note that debugging is a very intensive activity and needs a lot of preparatory time. Hence once a person dives into it, it would be a waste of time to switch to another task and come back later to debugging again. The context is lost and it does not help the process.
  • Look for patterns. When debugging hard-to-find problems, it pays to be systematic. In one instance we thought we had a sporadic bug. But when we charted its occurrences, it turned out that the bug appeared only around the start of the week. It was later traced down to an overrun of the array in which week days were stored. Half the battle is won if a developer can put together a pattern (based on time, environment, actions, input sections, etc) from the bug occurrences.
  • Learn as much as you can about the development and production environments. I have seen many developers use less than 10% of the cool facilities that IDEs give them. The primary reason is not being aware of them. These can range from simple time saving short cuts to enlightening stack trace analysers. Knowing the tools does not automatically find the bugs but makes the journey more enjoyable.
  • Know and understand "Debugging Blindness". Constant stares at the same piece of source code over extended periods of time can cause fatigue and make one miss things that are obvious. In one instance we had the following:

      
    if( i < l) { ... 


    It's hard to see that 'i' is being compared to the variable 'l' and not the constant '1'. It can be easily missed in a debugging session, which says something about what variable names to choose. The point is that it also pays to understand the psychological aspects of programming - both for the one who wrote the code and the one who is debugging it.
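
Going back to the bullet on print statements versus a logging framework, the "switch them on again later" idea might look like the following sketch. It uses Python's standard `logging` module; the logger name `billing` and the `apply_discount` function are made up for the example, and the output goes to an in-memory buffer only so that the effect is easy to see.

```python
import io
import logging

def make_logger(stream, level=logging.WARNING):
    """Build a logger that writes to `stream`; debug output is off by default."""
    logger = logging.getLogger("billing")   # hypothetical module name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger

def apply_discount(logger, price, percent):
    # This line stays in the code permanently; it is skipped when DEBUG is off.
    logger.debug("apply_discount price=%s percent=%s", price, percent)
    return price - price * percent / 100

out = io.StringIO()
log = make_logger(out)
apply_discount(log, 200, 10)        # nothing is written at WARNING level
quiet = out.getvalue()

log.setLevel(logging.DEBUG)         # switch the "print statements" back on
apply_discount(log, 200, 10)
verbose = out.getvalue()
print(repr(quiet), repr(verbose))
```

The debug line costs a level check when switched off, and nobody has to remember to delete it before the next release.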

If you have tried all the above and still cannot find the bug then close it in your bug tracking software with a note saying "Not reproducible" and let your customers find it for you :). We wish this was an option, but you and I know that it is not.

Thursday, July 3, 2008

What's the big deal about Seam

Recently we had a discussion about what's the big deal about Seam. If one looks at the game of cricket, the seam on the ball decides a lot about the game - particularly in bowler-friendly conditions. However, the seam that we were discussing was not related to cricket. It was the web framework called Seam (http://www.seamframework.org/). It's a relatively new framework on the horizon. This blog is about what's the big deal about this Seam!

To understand what's the big deal one needs some background on the related J2EE technologies, particularly Java Server Faces (JSF) and Enterprise Java Beans 3.0 (EJBs). These two are mature technologies now and enough developers have tried them out in various contexts. There are a few drawbacks of JSF that have been reported so far.
  • The most important one is that JSF requires an XML configuration file to manage beans in the backend and the navigation rules. The integration with the EJBs in the backend is not very smooth.
  • People consider JSF "less" object oriented and more a matter of XML tags. The string expressions here are not type safe.
  • JSF makes it hard to unit test and to debug by stepping through. Furthermore the generated HTML is very verbose.
  • Templates and Custom components are hard to use.
Overall the general consensus was that JSF is hard for the average developer to use with EJBs (version 3) in the backend. Even though these frameworks are complementary, they were built independently of each other. One has to write too much glue code. Seam was introduced with these problems in mind - the missing framework in between.

So let's see what's the big deal about Seam...
  • Seam is annotation based as opposed to using XML for configuration. What this means is that the developer can put the metadata related to a particular class right next to it instead of in a configuration file elsewhere.
  • Further, annotated objects become the standard way for a developer to write all code, which simplifies the application.
  • A more important concept Seam brings to the table is that of a context for every object (called conversations). In the earlier world the developer had to fend for herself when it comes to storing and retrieving information related to sessions. In the Seam world, one can annotate an object about the scope of its life and the framework takes care of storing and making the object available as needed. The framework understands that applications are stateful and helps to automatically manage that.
  • Seam achieves the contextual information movement using dependency bijection. This makes it possible for you to use two different tabs on the same browser instance to buy a CD and a book at the same time on your favourite website. Now that sounds powerful.
  • Under the hood Seam brings in a new way of looking at web applications. It shakes up the conditioning that we developers have been subjected to in terms of HTTP sessions.
  • Another good thing that Seam has done is that it has brought together best-of-breed libraries (for example iText for PDF document generation) for various aspects.

Overall Seam tries to bring Rails-like developer productivity to Java while still providing the raw power. It's not there yet, but it's trying! Are there any disadvantages to Seam? Well, it's a bit early to say, but here are a couple of thoughts I have.
  • First, because it's a different programming model, one needs to unlearn some of the stuff from the past and that is hard. So it feels that the learning curve for Seam is steep.
  • Secondly, since Seam relies on annotations, one needs to get the configurations right early on. Any change in configuration means a re-compile (whereas in the XML world it might have been an application reload).

As we discuss this, tons of developers are trying out Seam. Let's hope that we come to hear of more shortcomings of Seam and that the community fixes them. Eventually the roadmap is that Seam is being proposed as the Web Beans (JSR 299) specification in the next version of Java Enterprise Edition. One will have to wait and watch.

Wednesday, June 25, 2008

Choosing Open Source Software

Of late Open Source Software (OSS) has become so critical to software companies that it is hard to imagine life without it. Although software vendors do not include code from the OSS community in their products (much as they would like to) because of licensing restrictions, their internal IT environments are replete with it. Consider this - Operating System: Linux, Development Environment: Eclipse, Issue tracking: Bugzilla, Source code control: SVN, and the list goes on. A complete IT environment can be set up in a matter of hours and we have multiple options at every turn.

Now let us turn our attention to a harder problem. Suppose one is releasing a product or a service. Can one use open source / free software as part of this? Obviously the first aspect to think about is licensing. There are few tools out there that can be embedded into other software as part of a product without constraints. However, the licensing constraints are not applicable to hosted services since the original software is not re-distributed for profit. So for this discussion let's move past licensing and see how we choose open source components. The following anecdote is instructive.

Recently we were implementing a messaging system as part of which we had to have a Mail Transfer Agent (or in simpler words a server that can send, receive and deliver emails). We looked at various free options - there are tons of them, some open source and others simply without licensing costs. We eventually decided to go with qmail, primarily because we already had in-house expertise in setting it up and running it. All was going very smoothly until we did some high throughput performance tests. It turned out that the average delivery time for every message was increasing non-linearly with increase in load. It meant that with more load the system took significantly more time to deliver the messages. It was unacceptable.


We did a bit of profiling of the system and eventually could attribute the system behaviour to a queer problem. It's called "The silly Qmail Syndrome" (http://qmail.jms1.net/silly-qmail.shtml). The gist of this problem is that when a very large number of messages are sent in a short period of time, qmail keeps itself busy doing only one set of tasks (either sending messages out or classifying incoming messages). This starves the other task and in turn dramatically increases the end to end delivery time. The site above gives the solution, which comes to the community as a patch to the qmail code. The creators (Andre O and John S) have been kind enough to share this with the world in an easily consumable form. I shudder to think how our project would have been affected if we had not found this solution in time.

In conclusion I would like to go back to the point I wanted to make. When choosing Free or Open Source software to be part of critical business applications (wonder if there are any non-critical business applications!) the most important thing one needs to look at is the community around the tool in question. The references below have more thoughts on what else to look for. In all, using the right open source tools can amount to real cost savings while not compromising quality.

Some References

Tuesday, June 17, 2008

Insights into Java's popularity

Recently I was involved in a discussion on why Java has become so popular. As a result I did a bit of searching around. If you ask this question (Why has Java become so popular?) to people, the typical answers would be in the line of - Java is platform independent, Ease of use, Object Oriented, Multi-threaded and the like. However there is more to this than just good features of Java.

One of the most important ingredients to Java’s success is the marketing strategy employed by Sun. They did a great job on this one and I mean it in a positive way. Java (or Oak as it was called earlier) was invented for a very different purpose and for a different project. It was meant to network electronic devices. By definition it was expected that there would be varying environments involved, the programs had to be resilient in the face of devices coming on and going off the network and so on. However for whatever reason this idea did not take off and Java was orphaned. Sun probably actually did not know what to do with Java – the project went through many phases of closures and rebirths. However the explosive growth of the internet was round the corner. Java by design was suited to address some of the problems that came with this growth. At this point Sun made a great decision to put Java into the hands of the community across the world while still maintaining just enough control to enforce discipline. This was in a way innovative for the times – the developer community embraced Java like a long lost cousin. Java reached the stage it is today through this partnership. Has Sun made a lot of money in the process, I do not know, but it gave Java a great life ahead.

The other aspect of why Java grew so popular is that it is a dynamic language by design. The fact that Java code runs on a virtual machine that loads and links classes at run time gives the designers the advantage of doing certain things differently. However there is still credit due to them for considering doing things differently. To understand the significance of the dynamism of Java, consider the following. Let's say there is a software system running for a particular purpose. As time progresses the needs of the users change and so do the expectations from this software. How does one add more functionality to this system?

The typical solution for the above problem is to build a new version of the software with additional/changed features, and then replace the older version with the new one. The key thing to note here is that although the new and old versions of the software share the same code base, they are two distinct entities. This is where Java’s dynamism can be very helpful. One could build a system in Java that could be extended while it is running. A Java based Application Server is a very good example. We add new functionality (in fact complex new applications themselves) at run time without bringing down the Application Server itself. This of course does not mean that all Java programs are extensible. It just means that Java provides an easy platform. A running Java program can discover new things about code that it encounters for the first time through the mechanism of Reflection. The programmer still has to do the hard work of designing a system that harnesses these capabilities. That said, let me also talk about the earlier C/C++ world. What I described above can also be done using shared libraries (or dynamically linked libraries). But it is not as easy to do, primarily because shared libraries were invented with a different goal in mind.
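
To give a feel for what "discovering new things about code at run time" means, here is a small sketch. It is written in Python, whose introspection stands in for Java Reflection here; in Java the equivalent would go through `Class.forName` and `java.lang.reflect.Method`. The `Host` class, the `do_` naming convention and the `Greeter` component are all invented for this illustration and have nothing to do with how any real Application Server is built.

```python
import inspect

class Host:
    """A long-running host that accepts new components without a restart."""

    def __init__(self):
        self.commands = {}

    def register(self, component):
        # Reflection: examine an object the host has never seen before and
        # pick up every method whose name starts with "do_".
        for name, method in inspect.getmembers(component, inspect.ismethod):
            if name.startswith("do_"):
                self.commands[name[3:]] = method

    def run(self, command, *args):
        return self.commands[command](*args)

host = Host()   # the host is already "running" at this point

# A component written later, unknown to Host when Host was written.
class Greeter:
    def do_greet(self, who):
        return "hello " + who

    def helper(self):
        return "not exposed"

host.register(Greeter())
print(sorted(host.commands), host.run("greet", "world"))
```

The host never names `Greeter` or `do_greet` anywhere in its own code, which is the property that lets such a system grow while it keeps running.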

These two aspects of Java made it popular. Today it's very well embraced by many communities and in fact many critical projects bet their money on Java.

Sunday, June 8, 2008

The Unsung Hero

Okay, the title sounds like something from a history text book. But let's start with the following problem definition:

Data about students in a college (with a typical strength of about 500) are to be tracked. We need to be able to track about twenty attributes for every student like:
  • Opting for a particular subject, opting for hostel accommodation, Ladies and gents, etc.
Further we need to be able to answer queries like
  • Give all students who have opted for Artificial Intelligence and stay in the hostel
How would you organize the data structure and the corresponding algorithms for this problem? Think about this before you read further on.

If you recognized that these could be implemented as sets and the answers to the queries would be set operations, then you are on the right track. However what data structure did you think of using? If you thought of a structure like a list then you are not alone.

However there is a better structure suited for this situation - the bit vector. We will represent the data for students as one bit vector per attribute - note that all the attributes need to be boolean. So we have a 500 bit vector for every subject, for hostel accommodation and so on.
Then the answer to the queries like the one mentioned above can be obtained by a series of bitwise operations between different bit vectors. And there are some really cool algorithms to do the popular bitwise operations. This is very efficient both in terms of storage and time taken to process queries.
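
A minimal sketch of the student example follows, assuming each student is identified by an index from 0 to 499 and using a plain Python integer as the 500 bit vector. The helper names `make_vector` and `members_of` and the sample memberships are made up for illustration.

```python
def make_vector(n_students, members):
    """Pack a set of student indices (0..n_students-1) into one integer."""
    bits = 0
    for i in members:
        if not 0 <= i < n_students:
            raise ValueError("student index out of range")
        bits |= 1 << i
    return bits

def members_of(bits):
    """List the student indices whose bit is set, in increasing order."""
    result, i = [], 0
    while bits:
        if bits & 1:
            result.append(i)
        bits >>= 1
        i += 1
    return result

N = 500                                      # college strength from the text
ai = make_vector(N, [3, 17, 42, 499])        # opted for Artificial Intelligence
hostel = make_vector(N, [17, 42, 100])       # stays in the hostel
everyone = (1 << N) - 1                      # mask with all 500 bits set

ai_and_hostel = members_of(ai & hostel)               # set intersection
ai_not_hostel = members_of(ai & ~hostel & everyone)   # set difference
print(ai_and_hostel, ai_not_hostel)
```

Each query is a single bitwise operation over the whole college, whatever the number of students who match.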

The question is why don't most of us think of this data structure? I believe it's because our learning curriculum does not accord the kind of importance needed for such an elegant and efficient data structure. We are systematically "taught" to stay away from anything to do with bit operations. In fact I have seen good programmers with a bit operations phobia. I think it is time to revisit this mentality.

Let me end this discourse with a problem picked up from Jon Bentley's book Programming Pearls. The situation there was that a programmer needed to sort several thousand telephone numbers (seven digits long) stored in a file. The programmer was constrained by the fact that s/he could not read the entire set of numbers into main memory for lack of space.

The solution that was thought of was very elegant. An obvious note is that telephone numbers do not repeat. So the programmer represented every phone number (there can be at most 10,000,000 of them since they are seven digits long) by a bit. If the phone number existed in the file then the corresponding bit was set to one. So the bit vector could be initialized by one pass over the file on disk. Then sorting the numbers just meant walking through the bit vector and printing the (phone) numbers for which the bit was set to one.
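
The two passes described above can be sketched as follows. This is my own small rendering of the idea, not the code from the book: the function name `bitmap_sort`, the `limit` parameter and the sample numbers are assumptions, and the input is an in-memory list rather than a file on disk. It relies on the numbers being distinct and smaller than `limit`.

```python
def bitmap_sort(numbers, limit=10_000_000):
    """Sort distinct non-negative integers below `limit` using one bit each."""
    bits = bytearray((limit + 7) // 8)   # 1,250,000 bytes for seven-digit numbers
    for n in numbers:                    # pass 1: set the bit for each number seen
        bits[n >> 3] |= 1 << (n & 7)
    for byte_index, byte in enumerate(bits):   # pass 2: walk the bit vector in order
        if byte:
            for bit in range(8):
                if byte & (1 << bit):
                    yield byte_index * 8 + bit

phones = [5551234, 2345678, 9999999, 1000000, 2345677]
print(["%07d" % n for n in bitmap_sort(phones)])
```

All ten million possible numbers fit in about 1.25 MB, and no comparison between two phone numbers is ever made.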

I say three hurrays to our unsung hero The Bit Vector!