Historically, the search tool most commonly provided by email clients was keyword-based scanning. However, this soon turned out to be woefully inadequate. The search was done on the fly, and the amount of data to be scanned simply became too large for the tool to respond in time. The next generation of email search started indexing emails ahead of time (either as they arrive or periodically). When a search needs to be performed, it is the index that is scanned. This turns out to be much faster, at the cost of the disk space needed to store the index. Google Desktop is one of the best indexing tools (Xobni, Microsoft and Yahoo have their own too).
However, this search is still keyword based. Consider a situation I faced recently. I participate in a lot of prospective customer visits. Recently I wanted to find the name of a prospect to whom I had presented our capabilities in building complex web applications. When I searched for the keywords "customer visits", Google Desktop threw up close to three thousand results from my email. I then tried to refine my keywords. I remembered that one of my colleagues was also with me in the discussion. Typing his name cut the results by half, but that still did not help. You see where I am going with this one.
Context Based Search
Keeping this in mind, at Persistent we designed a different approach to searching email. We call this Context Based Search, and it is implemented in a tool called ViewMOR (https://ViewMOR.persistent.co.in/). People are better at remembering contexts than names. In the above instance, I remembered that the prospective customer was high profile and that a lot of other senior people at Persistent were involved in the visit. Using ViewMOR, in addition to the above search criteria, I could restrict the search to email threads involving more than six people. Using such contextual information I could narrow down the results in no time.
Some of our ideas of what constitutes a context are listed below.
- Keywords (in various email headers and body)
- People, or to be more specific, the email addresses involved in the email conversation.
- Cardinality (number of people) of the conversation.
- Domains (the latter part of an email address) involved in the conversation.
- Date ranges when the conversation took place.
- Parameters used in previous searches.
- Attachment names, types and a host of other things.
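Several of the context fields listed above can be read directly from a message. The following is a minimal sketch using Python's standard `email` package, not a description of how ViewMOR does it. The function name `extract_context` and the keys of the returned dictionary are assumptions made for illustration, and the sketch looks at a single message rather than a whole thread.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_context(raw_message):
    """Return a dict of context fields for one raw RFC 5322 message."""
    msg = message_from_string(raw_message)

    # Gather every address that appears in the From, To and Cc headers.
    headers = (
        msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
    )
    people = {addr.lower() for _name, addr in getaddresses(headers) if addr}

    # The domain is the part of the address after the last "@".
    domains = {addr.rsplit("@", 1)[1] for addr in people if "@" in addr}

    date_header = msg.get("Date")
    # Attachment names, if any, come from the MIME parts of the message.
    attachments = [p.get_filename() for p in msg.walk() if p.get_filename()]

    return {
        "subject": msg.get("Subject", ""),
        "people": people,
        "domains": domains,
        "cardinality": len(people),  # number of distinct participants
        "date": parsedate_to_datetime(date_header) if date_header else None,
        "attachments": attachments,
    }
```

Applied to every message in a mailbox, a function like this yields the raw material for the per-field indexes that a context search needs.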
Thus we provide much richer semantics for email search (as opposed to mere keywords). Fortunately, email is very amenable to searching on these fields, since this information is captured in the message headers and the fields are clearly demarcated. The challenge is to capture this data and implement algorithms that can work with large amounts of it (typically represented as graph structures). Further, we need algorithms that can quickly and efficiently perform set operations (intersections, unions) on these large graphs. Main memory also becomes a constraint. We addressed all of these in ViewMOR. Today it has become a robust platform for searching large amounts of data hidden in email.
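To illustrate how set operations combine contextual criteria, here is a small sketch that keeps one in-memory index per field and intersects the matching sets of thread ids. It is a toy model under simplifying assumptions: the class `ContextIndex`, its method names and its parameters are hypothetical, plain Python sets stand in for the graph structures mentioned above, and nothing here addresses the memory constraints of a real mailbox.

```python
from collections import defaultdict


class ContextIndex:
    """Toy per-field index mapping a field value to a set of thread ids."""

    def __init__(self):
        self.by_keyword = defaultdict(set)
        self.by_person = defaultdict(set)
        self.by_domain = defaultdict(set)
        self.cardinality = {}  # thread id -> number of participants

    def add(self, thread_id, keywords, people):
        people = {p.lower() for p in people}
        for word in keywords:
            self.by_keyword[word.lower()].add(thread_id)
        for person in people:
            self.by_person[person].add(thread_id)
            self.by_domain[person.rsplit("@", 1)[-1]].add(thread_id)
        self.cardinality[thread_id] = len(people)

    def search(self, keywords=(), people=(), domains=(), min_people=0):
        # One candidate set per criterion; all criteria must match.
        sets = [self.by_keyword.get(k.lower(), set()) for k in keywords]
        sets += [self.by_person.get(p.lower(), set()) for p in people]
        sets += [self.by_domain.get(d.lower(), set()) for d in domains]

        if sets:
            # Intersect starting from the smallest set to keep work low.
            sets.sort(key=len)
            result = set(sets[0])
            for s in sets[1:]:
                result &= s
        else:
            result = set(self.cardinality)

        # Apply the cardinality filter last, on the narrowed result.
        return {t for t in result if self.cardinality[t] >= min_people}
```

With such an index, the search from my example (a keyword, a colleague's address and more than six participants) becomes `index.search(keywords=["visit"], people=["colleague@example.com"], min_people=7)`, where the address is of course a placeholder.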
In my next write-up I will discuss how one can deploy this search infrastructure in various environments.
References:
Although most email tools have some search implementation, there are few standalone tools available off the shelf. This space definitely needs more research. Some of the tools that we have seen:
- IBM OmniFind - http://www.alphaworks.ibm.com/tech/emailsearch
- ISYS Email Search - http://www.isys-search.com/technology/isysemail/
- Xobni - http://www.xobni.com/