Sunday, June 5, 2011

Software Architects - Are There Different Types?

This is a question I have been asked a few times: are there different classes of architecture that we software professionals deal with? I wanted to address it in this post.

The most common kind of software architecture we practice in OPD (Outsourced Product Development) at Persistent is what we can call Software Product Architecture; the corresponding professional is the Product Architect. This is all about building a software product. The usual problems solved here are how to structure the software, what data structures to use, how to remove performance bottlenecks, and so on. The focus of the Product Architect is to build software that meets the requirements given by product management. In turn, product management derives its requirements from understanding the problem to be solved in a particular market. Typically such a product addresses problems faced by a vast majority of customers in that market.

The second common kind of software architecture is Solution Architecture, and the corresponding professional is the Solutions Architect. More often than not, a Solutions Architect solves a problem faced by a particular customer. Further, a solution thus built might include many software products working together, each solving one part of the puzzle. Thus a Solutions Architect solves a problem using the right mix of appropriate software products, integrated in an optimal way. If and when the problem faced by the customer becomes more commonplace, there is a chance that the solution can be turned into a software product.

The third important kind of architect associated with software is the Enterprise Architect. Many times an Enterprise Architect is confused with a person who works on software products used in the enterprise. Nothing could be further from the truth - an Enterprise Architect is concerned with "implementing" the right kind of structure to address the needs of the business the enterprise exists to serve. So Enterprise Architecture, in addition to software systems, can also include other capabilities of an organization - people, organization structure, information structure, technologies used, and so on. Enterprise Architecture is by definition far-reaching, as it also addresses an organization's changing needs as it evolves. Many people may contribute to building it, but not all contributors are Enterprise Architects.

A quick look at the skills needed by each of these architect classes throws more light on this discussion. The Product Architect needs a very sound basis in Computer Science - s/he must understand the implications of using one algorithm over another, one data structure over another, and so on. S/He needs a very good idea of what it takes to build robust software, and must be able to translate user requirements into implementable software components. The Solutions Architect, on the other hand, needs a very good handle on which off-the-shelf products solve a given problem, and a very good understanding of the issues involved in integrating them - so it goes without saying that this person has integration technologies at their fingertips. A good handle on the domain in which the solution is deployed is also critical. The Enterprise Architect, however, is akin to the godfather of the enterprise s/he is out to structure: a formidable personality who can influence the way people work, the way processes are structured, the investments made, and so on. As one can see, the skill sets here go beyond technical capabilities (which, by the way, are a necessity in all three classes).

Thus each of the above architects plays a major role in their own way. Note that, in addition to these terms, there are other architectures (like Application, Data, Hardware, Deployment, Network, System, etc.). However, I feel that most of these terms do not intersect much with each other and are mostly self-explanatory.

Sunday, March 29, 2009

Better Email Search

Email search based merely on keywords is no longer enough for users, primarily because of the Email Overload phenomenon. Over time email has become a huge storehouse of information. The problem is that this information is hidden inside tons of other transactional or low-long-term-value emails. Email software has features like folders, tagging and sorting, but these help in organizing emails, not so much in searching. Moreover, they are ineffective when the user does not, or cannot, find the time to tag and classify emails.

Historically, the search tool most commonly provided by email clients has been keyword-based scanning. However, this soon turned out to be woefully inadequate: the searching was done on the fly, and the amount of data to be scanned simply became too large for the tool to respond in time. The next generation of email search started with indexing the emails early on (as they arrive, or periodically). When a search needs to be performed, it is the index that is scanned. This turns out to be much faster, at the cost of the disk space needed to store the index. Google Desktop is by far one of the best indexing tools (Xobni, Microsoft and Yahoo have their own too).
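The indexing idea above can be sketched in a few lines. The following is a toy illustration in Python (my own sketch, not how Google Desktop or any real tool is implemented): an inverted index is built as messages arrive, and a query then scans only the index, never the mail store itself.

```python
# Toy inverted index: map each word to the set of message ids containing it.
from collections import defaultdict

index = defaultdict(set)

def add_message(msg_id, text):
    """Index a message as it arrives (or during a periodic indexing pass)."""
    for word in text.lower().split():
        index[word].add(msg_id)

def search(*keywords):
    """Intersect the postings of each keyword; no message body is scanned."""
    postings = [index[k.lower()] for k in keywords]
    return set.intersection(*postings) if postings else set()

add_message(1, "customer visit notes for the web application prospect")
add_message(2, "lunch plans")
add_message(3, "follow-up on the customer visit")

print(search("customer", "visit"))  # messages 1 and 3 match
```

The trade-off the post mentions is visible even here: `index` consumes extra memory/disk, but each query touches only the postings of its keywords.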

However, this search is still keyword based. Consider a situation I had recently. I participate in a lot of prospective customer visits. Recently I wanted to find the name of a prospect to whom I had presented our capabilities in building complex web applications. When I searched for the keywords - customer visits - Google Desktop threw up close to three thousand results in my email. I then tried to refine my keywords. I remembered that one of my colleagues was also in the discussion; typing his name cut the results in half, but it still did not help. You see where I am going with this.

Context Based Search
Keeping this in mind, at Persistent we designed a different approach to searching for emails. We call it Context Based searching, and it manifests as a tool called ViewMOR (https://ViewMOR.persistent.co.in/). People are adept at remembering contexts as opposed to names. In the above instance, I remembered that the prospective customer was high profile and that a lot of other senior people at Persistent were involved in the visit. So, in addition to the above search criteria, using ViewMOR I could search for emails that have more than six people in the thread. Using such contextual information I could narrow down the results in no time.

Some of our ideas of what constitutes a context are listed below.
  • Keywords (in various email headers and body)
  • People - or, to be more specific, the email addresses - involved in the conversation.
  • Cardinality (number of people) of the conversation.
  • Domains (the latter part of an email address) involved in the conversation.
  • Date ranges when the conversation took place.
  • Parameters used in previous searches.
  • Attachment names, types and a host of other things.

Thus we provide very rich semantics for email search (as opposed to mere keywords). Fortunately, email is very amenable to searching on these fields, since this information is captured in headers with clearly demarcated fields. The challenge is to capture this data and implement algorithms that can work with large amounts of it (typically represented as graph structures). Further, we need algorithms that can efficiently and quickly implement set operations (intersections, unions) on these large graphs. Main memory also becomes a constraint. We implemented all of this in ViewMOR; today it has become a robust platform for searching the large amounts of data hidden in email.
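The set operations mentioned above are the heart of the approach. A minimal sketch (my own illustration, not ViewMOR's actual implementation): each context criterion - keyword, participant, cardinality - yields a set of message ids, and a query is just the intersection of those sets.

```python
# Each message carries contextual fields, not just body text.
messages = {
    1: {"people": {"me@x.com", "colleague@x.com", "a@y.com", "b@y.com",
                   "c@y.com", "d@y.com", "e@y.com"}, "words": {"visit"}},
    2: {"people": {"me@x.com", "colleague@x.com"}, "words": {"visit"}},
    3: {"people": {"me@x.com", "a@y.com"}, "words": {"lunch"}},
}

def by_keyword(word):
    return {m for m, f in messages.items() if word in f["words"]}

def by_person(addr):
    return {m for m, f in messages.items() if addr in f["people"]}

def by_min_people(n):
    """Cardinality criterion: threads involving more than n people."""
    return {m for m, f in messages.items() if len(f["people"]) > n}

# "visit" AND colleague present AND more than six participants:
result = by_keyword("visit") & by_person("colleague@x.com") & by_min_people(6)
print(result)
```

Each criterion prunes the candidate set independently, which is why adding a contextual filter like cardinality narrows thousands of keyword hits down so quickly.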

In my next write-up I will discuss how one can deploy this search infrastructure in various environments.

References:
Although most email tools have some search implementation, there are few standalone tools available off the shelf. This space definitely needs more research.

Tuesday, September 23, 2008

Static Code Analysis - some thoughts

If you do (or at some point did) C programming on Unix, then you get introduced to static code analysis pretty early, through tools like lint, cflow, ctags and cxref - although I was not familiar with the term itself at the time. So what is Static Code Analysis?

As the name suggests, it is a technique for analyzing software without actually running it, which implies that the technique looks at the source code of the software in question. Note that it is not Software Testing, wherein we actually compile and run the application and feed it test data; testing is a part of Dynamic Analysis. Typically, static analysis tools are built on top of compiler front-ends: the information collected by the front-end is run through sets of rules to find pre-defined patterns, or stored for further analysis.
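The "front-end output run through rules" idea can be illustrated with Python's own `ast` module standing in for a compiler front-end: the source is parsed (never executed), and the tree is walked looking for one pre-defined pattern - here, a hypothetical rule flagging comparisons against None written with `==` instead of `is`.

```python
import ast

RULE = "comparison with None should use 'is', not '=='"

def check(source):
    """Parse the source (no execution) and apply one pattern-matching rule."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare) and isinstance(node.ops[0], ast.Eq):
            if any(isinstance(c, ast.Constant) and c.value is None
                   for c in node.comparators):
                findings.append((node.lineno, RULE))
    return findings

code = """x = load()
if x == None:
    print("empty")
"""
print(check(code))  # flags line 2
```

Note that `load()` is never called: the checker only parses, which is exactly the point of static analysis. Real tools apply hundreds of such rules over richer front-end data.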

So is analyzing source code without actually running the software of any use? It turns out that it is very useful - both for preventing bugs and for discovering information. Static Code Analysis tools have primarily been used for two purposes:

  • Making code more maintainable: Over time the software community has collected a lot of good coding practices, and it is possible for a tool to analyze source code and find where such practices are violated. More importantly, these tools can now be integrated into Integrated Development Environments (IDEs), so that a software developer is warned at the very moment s/he is committing a mistake. They can also be run on existing software to point out weak areas or outright defects. Typically these kinds of issues were expected to be caught in peer reviews, but it is always a good idea to take help from a tool.

    Another way these tools make software more maintainable is by rearranging code (beautifying, refactoring) so that it is well-formatted and adheres to good design practices. The resulting code is more elegant, more logical and hence more maintainable.

    Analyzing software for other non-functional characteristics like security, performance and scalability has been studied, but is not as commonplace.
  • Finding design information in existing code: More often than not, software is written under artificially created "time to market" pressure. Another much-misused phrase is "we use agile coding methodologies". Whatever the means, the end result is a large amount of poorly documented and poorly structured source code. Further, it is very likely that the team that inherits this source code is not the one that wrote it. At such times, tools that can extract design information from source code become critical to the success of a project.

    These are called Information Abstraction systems. They collect data from the source code and put it into data stores (like relational databases). Queries can then be built on these data stores to aid in discovering design information - for example: what are all the recursive functions in this software? Which classes 'use' a given class? This process is also called Reverse Engineering. There are tools that can generate 'intelligent' class diagrams from source code, and many a time such tools save the day for software developers.

    Another use of the information thus collected is to help a software developer navigate large code bases easily - anyone who has waded through millions of lines of code appreciates its value.
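The "find all recursive functions" query above becomes simple once call information has been extracted into a data store. A toy version, with the extracted call graph as a plain dict (a real information-abstraction system would populate a database from the parser's output; the function names here are made up):

```python
# Extracted call graph: function -> set of functions it calls.
calls = {
    "main":       {"parse", "report"},
    "parse":      {"parse_expr"},
    "parse_expr": {"parse"},     # mutually recursive with parse
    "report":     {"fmt"},
    "fmt":        set(),
}

def is_recursive(fn, graph):
    """A function is recursive if it can reach itself via call edges."""
    seen, stack = set(), list(graph.get(fn, ()))
    while stack:
        g = stack.pop()
        if g == fn:
            return True
        if g not in seen:
            seen.add(g)
            stack.extend(graph.get(g, ()))
    return False

recursive = sorted(f for f in calls if is_recursive(f, calls))
print(recursive)  # parse and parse_expr reach themselves
```

The same stored graph answers the "which classes use a given class" style of query by simply reversing the edges.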
Over the last few years the importance of static analysis has steadily grown. Teams find that they can dramatically reduce code review times if reviews are preceded by static analysis. Open source projects, where code reviews are logistically hard to arrange, depend on these techniques to ensure a minimum quality level. Today, development environments like Eclipse, IBM Rational and Microsoft's Visual Studio include a multitude of static analysis tools to make the developer's life easier. To reach the next level as a developer, one needs to know how to take advantage of these tools: let them take care of the mundane basics and use one's time for the more important things in life (like getting a cup of coffee!).

References:

This is definitely not exhaustive, but I wanted to put together some interesting aspects of static analysis.

Monday, August 18, 2008

Understanding Existing Code

Continuing from my last post on debugging, the next natural step is to look at how to understand existing code. While we are on this topic, it is interesting to note that the overall software entropy in the world is ever increasing - by which I mean the number of lines of software written in the world keeps growing. Experience so far has been that systems that need software keep needing it and never die out; they may change in shape and size, but they never disappear. Take mainframe systems: even today many enterprises employ systems from the 1970s, albeit wrapped in modern connectors. Moreover, the systems that have been decommissioned are replaced by more complex modern software systems. Will we ever reach a time when all the software ever needed has been written? It is hard to say, but I would bet that time will never come.

So whether we like it or not, we will be co-existing with a lot of software, and by definition understanding existing software becomes critical. This article contains my thoughts on how to ease the life of the programmer who is banished to understand someone else's code. More often than not there is little or no documentation, either inside the code or outside (what do you mean, documentation? you have the code!). The challenge this programmer faces is extracting design ideas out of the code. Before embarking on this journey, she needs to understand very clearly why she is doing it: it could be taking ownership of the software, fixing a defect or adding a new feature. The following thoughts come to mind.
  • The first step in understanding a software system is to go outside in. One needs to understand as much as possible about the execution context, the functionality exposed to the user, the external interfaces to the world and so on. Much of this information is embedded in the user interface, so "playing around" with the system is obviously desirable. Sometimes this is not possible, and the programmer has to satisfy herself with reading user manuals, relying on user descriptions, or using only reports and other artifacts generated by the system under consideration.
    • A neglected aspect in this regard is the system's test cases. If quality assurance for the system is up to date and complete, the test cases are an extremely rich source of information about its behaviour. The programmer needs to classify the test cases and start by looking at the "end to end" ones. Executing these test cases on a running system is very desirable.
  • The next step is to understand the deployment aspects of the system. If one is faced with a compiled system, the build scripts embed a lot of information about the dependencies within it. The process of getting from code to binary executables can be quite complex, and understanding this process is critical to understanding the software. There are a few tools that help the programmer here - for example, one tool converts build scripts into a graphical representation that is much easier to comprehend.
  • The next step is to start getting into the code. Many times programmers try to understand a software system inside out, that is, by looking at the code first. This is not a good idea, as it can be very confusing and lead to wasted time; the better way is to go outside in. That said, these three steps will need to be repeated back and forth many times. Coming back to understanding source code, there have been quite a few advances in this area.
    • One way to start is to set up the source code as a "project" in an appropriate Integrated Development Environment (IDE). The IDE support for modern languages is phenomenal, and one needs to harness it for understanding code - for one, navigating the code becomes extremely easy.
    • Another point to consider is that Program Understanding has developed into a science over the last two decades, and there are now tools available to automate parts of this process.
      Using automatic static analysis tools like these makes a lot of sense when the code base the programmer is faced with is huge (this is relative, but in my opinion anything larger than 100,000 lines of code). Many of these tools extract information about the software and store it in queryable systems; the data is then open to a lot of reporting and representation. The next level of detail is to put together the control and data flows relevant to one's investigation.
  • It is rare that a programmer gets into understanding large programs just for the sake of it, or for leisure. More often than not, the effort is driven by a need to change the behaviour of the software under consideration - either to add new functionality or to fix a defect. In this case the programmer has a definite goal and can start narrowing in on it. The ideal situation is to understand all the flows related to the module being changed and to have a good sense of the impact of changing the code.
    • An interesting note here: as part of doing the above, a programmer mentally "slices" the program in various ways. In computer science this is studied as "Program Slicing". There are a few tools that can help the programmer here, but more importantly, this theoretical awareness gives the programmer a better perspective on how to go about achieving her goals.
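The point above about build scripts can be made concrete. The sketch below is a hypothetical mini-parser (not any particular tool): it extracts target/prerequisite pairs from Makefile-style rules so that the dependency structure can be queried or rendered as a graph.

```python
def build_graph(makefile_text):
    """Collect 'target: prerequisites' lines into an adjacency mapping."""
    graph = {}
    for line in makefile_text.splitlines():
        # Skip recipes (tab-indented), comments and blank lines.
        if not line or line.startswith("\t") or line.lstrip().startswith("#"):
            continue
        if ":" in line:
            target, _, deps = line.partition(":")
            graph[target.strip()] = deps.split()
    return graph

mk = """app: main.o util.o
\tcc -o app main.o util.o
main.o: main.c util.h
\tcc -c main.c
util.o: util.c util.h
\tcc -c util.c
"""
print(build_graph(mk))
```

Even this toy mapping already answers useful questions - for instance, that both object files depend on `util.h`, so a change to that header forces most of the build to rerun.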

Hopefully these points help a software programmer or two faced with this dilemma. The ideal outcome of a program-understanding effort is a completed set of documentation, both inside and outside the code. However, many times this is not feasible because of time and budget constraints. My personal opinion is that the organization needs to at least budget for internal commenting of the code: the programmers tasked with understanding the code should also be charged with commenting it. This can be done incrementally - the programmer adds suitable comments to the modules she is targeting. If this is not done, the knowledge discovered in the process remains only in people's minds - but maybe that is a good thing, because it generates jobs for other programmers a few years down the line!!

Tuesday, July 15, 2008

How to be a Better Debugger

One of the complaints I have about my computer science education is that I was never taught to debug code. Writing code is one thing, but getting it to work is another! It is also interesting that our curriculum never emphasized understanding existing code either - and more interesting still, this has not changed in the last fifteen years. A whole bunch of new topics have been added to computer science curricula, but the required emphasis on debugging is still missing. I believe it is an extension of a human trait: we learn to speak well, but never to listen well :).

So I thought I would write a bit about how one can get good at debugging code. As I said, it is like learning to listen well. The easiest way to debug code is to ensure the bug does not get introduced in the first place! The focus of this article is not sound programming techniques, though - and many times you are fixing someone else's code anyway. Assuming enough care is taken to start with, there will still be bugs; this article is about how to find them. Even today, after all the advances in compiler technology, debugging remains a tough art. But that is reason enough to motivate many people to get better at it.

Before we jump in, I want you to think about why defects are introduced into software in the first place. If you look at it, there can be about 10-15 different reasons. Some that jump to mind are "the developer does not understand the requirements well", "the developer does not understand the underlying platform well", "plain old mistakes" and so on. If every bug found can be classified into one of these categories, it will help teams avoid them in the future. Overall, my point is that defects are good opportunities to learn about what is happening in the code and in the team. I will address the debugging process in the following bullets.

  • The first key to efficient debugging is to clearly understand what the software is supposed to do. It sounds absurd, but I have seen umpteen situations where the developers do not comprehend - and more importantly, do not agree on - the software's behaviour. Getting this out of the way eliminates a whole class of defects related to requirements understanding.
  • Debugging is like solving a murder mystery: you have the dead body (the crashed program), from which you have to reason backwards about what happened. The good thing in programming is that, with luck, a developer can recreate the situation in which the program dies. So the next step in debugging is to be able to reproduce the bug predictably. For complex defects this is easier said than done.
  • Typically a bug manifests itself in different ways: a memory (core) dump, a stack trace, unexplained or unintended behaviour, or just a plain old hang (in which the software does nothing!). So the next mantra of debugging is to collect as much information about the bug as possible. This step is under-rated by many, because once we find an issue we tend to stop and fix it, without paying attention to all the flows in which the issue can surface. The typical consequence of not doing enough upfront research is the introduction of new defects while fixing the one at hand.
  • There are several ways of gathering the above information: stepping through debugger sessions, analysing existing logs, adding print statements, and so on. Adding print statements is usually the least time-consuming and is recommended. However, it has the side effect of strewing one's code with unnecessary statements - a good logging framework put in place upfront alleviates this problem. Moreover, such log statements can be switched on again if there is a later need.
  • The next step in the debugging process is to track down when the bug was introduced. If developers follow the discipline of "one issue fixed per check-in", debugging becomes that much easier. Most often the recent changes are the culprits, so you need to develop the skills of doing diffs and using the version control system well.
  • You should note that debugging is a very intensive activity that needs a lot of preparatory time. Once a person dives into it, it is a waste to switch to another task and come back to debugging later: the context is lost, and that does not help the process.
  • Look for patterns. When debugging hard-to-find problems, it pays to be systematic. In one instance we thought we had a sporadic bug, but when we charted its occurrences it turned out that the bug appeared only around the start of the week. It was later traced to an overrun of the array in which the weekdays were stored. Half the battle is won if a developer can put together a pattern (based on time, environment, actions, input sections, etc.) from the bug occurrences.
  • Learn as much as you can about the development and production environments. I have seen many developers use less than 10% of the cool facilities that IDEs give them, primarily because they are not aware of them. These range from simple time-saving shortcuts to enlightening stack trace analysers. Knowing the tools does not automatically find the bugs, but it makes the journey more enjoyable.
  • Know and understand "Debugging Blindness". Staring at the same piece of source code over extended periods of time causes fatigue and makes one miss things that should be obvious. In one instance we had the following:

      
    if( i < l) { ... 


    It is hard to see that 'i' is being compared to the variable 'l' and not the constant '1', so it can easily be missed in a debugging session. This says something about which variable names to choose. The point is that it also pays to understand the psychological aspects of programming - both for the one who wrote the code and the one who is debugging it.

If you have tried all of the above and still cannot find the bug, close it in your bug-tracking software with a note saying "Not reproducible" and let your customers find it for you :). We wish this were an option, but you and I know that it is not.
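The bullet above about print statements versus a logging framework can be illustrated with Python's standard `logging` module (the `billing` logger and `compute_total` function are made up for the example): the debug output is written once, silenced in production, and switched back on later without touching the code.

```python
import logging

log = logging.getLogger("billing")
logging.basicConfig(format="%(name)s %(levelname)s: %(message)s")

def compute_total(items):
    total = 0
    for name, price in items:
        log.debug("adding %s at %s", name, price)  # replaces a print statement
        total += price
    log.info("total = %s", total)
    return total

log.setLevel(logging.WARNING)   # production: debug/info lines stay silent
compute_total([("book", 12), ("cd", 9)])

log.setLevel(logging.DEBUG)     # later need: switch the statements back on
compute_total([("book", 12), ("cd", 9)])
```

The key design point is that the statements stay in the code permanently; only the level changes, so the code is never "strewn" with throwaway prints.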

Thursday, July 3, 2008

What's the big deal about Seam

Recently we had a discussion about what the big deal is about Seam. In the game of cricket, the seam on the ball decides a lot about the game - particularly in bowler-friendly conditions. However, the seam we were discussing was not related to cricket: it was the web framework called Seam (http://www.seamframework.org/), a relatively new framework on the horizon. This post is about what the big deal is about this Seam!

To understand what the big deal is, one needs some background on the related J2EE technologies, particularly JavaServer Faces (JSF) and Enterprise JavaBeans 3.0 (EJB). These are mature technologies now, and enough developers have tried them out in various contexts. A few drawbacks of JSF have been reported so far.
  • The most important one is that JSF requires an XML configuration file to manage the backing beans and the navigation rules. The integration with the EJBs in the backend is not very smooth.
  • People consider JSF "less" object oriented and more a matter of XML tags. The string expressions used are not type-safe.
  • JSF makes it hard to unit test or to debug by stepping through. Furthermore, the generated HTML is very verbose.
  • Templates and Custom components are hard to use.
Overall, the general consensus was that JSF is hard for the average developer to use with EJB 3 in the backend. Even though the two frameworks are complementary, they were built independently of each other, and one has to write too much glue code. Seam was introduced with these problems in mind - the missing framework in between.

So let's see what's the big deal about Seam...
  • Seam is annotation-based, as opposed to using XML for configuration. What this means is that the developer can put the metadata related to a particular class right next to it, instead of in a configuration file elsewhere.
  • Further, annotated objects become the standard way for a developer to write all code, which simplifies the application.
  • A more important concept Seam brings to the table is a context for every object (called conversations). In the earlier world, the developer had to fend for herself when it came to storing and retrieving session-related information. In the Seam world, one annotates an object with the scope of its life, and the framework takes care of storing it and making it available as needed. The framework understands that applications are stateful and helps manage that state automatically.
  • Seam achieves this movement of contextual information using dependency bijection. This is what makes it possible to use two different tabs of the same browser instance to buy a CD and a book at the same time on your favourite website. Now that sounds powerful.
  • Under the hood Seam brings in a new way of looking at web applications. It shakes up the conditioning that we developers have been subjected to in terms of HTTP sessions.
  • Another good thing Seam has done is bring in best-of-breed libraries (for example, iText for PDF document generation) for various aspects.
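The conversation idea above can be caricatured in a few lines. This is an analogy only, sketched in Python rather than Seam code: each conversation gets its own state store, so two tabs - two conversations - in the same browser session do not clobber each other's state, which is exactly what a single per-session map fails to provide.

```python
# One state store per conversation id, instead of one per HTTP session.
conversations = {}

def scoped(conversation_id):
    """Return the state store for a conversation, creating it on first use."""
    return conversations.setdefault(conversation_id, {})

# Tab 1 and tab 2 share a browser session but hold separate conversations.
scoped("tab1")["cart"] = ["cd"]
scoped("tab2")["cart"] = ["book"]

print(scoped("tab1")["cart"], scoped("tab2")["cart"])  # independent carts
```

In Seam the developer never writes this plumbing; annotating a component with a conversation scope makes the framework do the keying and lookup.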

Overall, Seam tries to bring Rails-like developer productivity to Java while still providing the raw power. It is not there yet, but it is trying! Are there any disadvantages to Seam? Well, it is a bit early to say, but here are a couple of thoughts I have.
  • First, because it is a different programming model, one needs to unlearn some things from the past, and that is hard. So the learning curve for Seam feels steep.
  • Second, since Seam relies on annotations, one needs to get the configuration right early on. Any change in configuration means a recompile (whereas in the XML world it might have meant just an application reload).

As we discuss this, tons of developers are trying out Seam. Let us hope we come to hear more of its shortcomings and that the community fixes them. Eventually, the roadmap is for Seam to be standardized as the Web Beans (JSR 299) specification in the next version of Java Enterprise Edition. One can only wait and watch.

Wednesday, June 25, 2008

Choosing Open Source Software

Of late, Open Source Software (OSS) has become so critical to software companies that it is hard to imagine life without it. Although software vendors cannot ship code from the OSS community in their products (although they would like to) because of licensing restrictions, their internal IT environments are replete with it. Consider this - operating system: Linux; development environment: Eclipse; issue tracking: Bugzilla; source code control: SVN - and the list goes on. A complete IT environment can be set up in a matter of hours, with multiple options at every turn.

Now let us turn our attention to a harder problem. Suppose one is releasing a product or a service: can one use open source / free software as part of it? Obviously the first aspect to think about is licensing. There are few tools out there that can be embedded into other software and shipped as part of a product without constraints. However, the licensing constraints are often not applicable to hosted services, since the original software is not re-distributed for profit. So for this discussion let us move past licensing and look at how to choose open source components. The following anecdote is instructive.

Recently we were implementing a messaging system for which we needed a Mail Transfer Agent (in simpler words, a server that can send, receive and deliver emails). We looked at various free options - there are tons of them, some open source, others merely free of licensing costs. We eventually decided to go with qmail, primarily because we already had in-house expertise in setting it up and running it. All was going smoothly until we ran some high-throughput performance tests. It turned out that the average delivery time per message increased non-linearly with load: with more load, the system took significantly longer to deliver each message. That was unacceptable.


We did a bit of profiling and eventually attributed the behaviour to a queer problem called "the silly qmail syndrome" (http://qmail.jms1.net/silly-qmail.shtml). The gist of the problem is that when a very large number of messages are sent in a short period, qmail keeps itself busy with only one set of tasks (either sending messages out or classifying incoming ones). This starves the other task and in turn dramatically increases the end-to-end delivery time. The site above gives the solution, which comes to the community as a patch to the qmail code. The creators (Andre O and John S) have been kind enough to share it with the world in an easily consumable form. I shudder to think how our project would have been affected had we not found this solution in time.

In conclusion, I would like to return to the point I wanted to make. When choosing free or open source software to be part of critical business applications (one wonders whether there are any non-critical business applications!), the most important thing to look at is the community around the tool in question. The references below have more thoughts on what else to look for. All in all, using the right open source tools can amount to real cost savings without compromising quality.

Some References