Information flow part 5: Real-time information syndication and search on our intranet
The lack of real-time publishing and distribution (syndication) has been, and still is, a problem on our intranet. To ensure that the users on our intranet have the right information at the right time, the information has to be instantly available and constantly flowing. To make that possible we have decided to use Atom and PubSubHubbub!
Note: I use the concept of real-time as perceived by users, even though there is some latency (under one second), e.g. between publishing and indexing.
We have decided to use Atom as the distribution format because it is readable, extensible, and widely available. We already use Atom for our feeds. It also helps that important standards that we are about to start using, for example Activity Streams, are based on Atom. The OpenSearch standard, which we have implemented in our search, also uses Atom.
The problem is polling
The problem with information distribution has been the delay between publishing in one system and the re-publishing (syndication) of the information to other places and systems. Traditionally we have done this via RSS/Atom feeds. The problem with feeds is that the consuming system has to poll the feed repeatedly just to check whether there is new information. This puts a strain on many systems, since a lot of the RSS/Atom feed implementations I have come across are crap.
Take, for example, an RSS-feed built with the standard RSS template in our current Web Content Management (WCM) solution. The whole tree structure is traversed each time the RSS-feed is polled, to check whether there are any new items to add to the RSS-feed. This is problematic if a landing page/start page for a sub site on our intranet displays a couple of RSS-feeds from other sub sites, let’s say 5 feeds. Every time a user lands on the page, the page fetches all 5 RSS-feeds. Each one of the 5 RSS-feeds is ”constructed” by traversing the tree structure (which equals the navigation structure) and checking the database for new items. Let’s assume that we have a new visitor every 10 seconds for this particular landing page. That means 30 RSS-polls per minute (6 visitors × 5 feeds). Multiply this by the number of sub sites we have with RSS-feeds (≈100) and you might understand the problem.
Ok, I’m fully aware that this solution is neither well programmed, well implemented nor elegant. The template should not traverse the tree for each RSS poll. We should also use a cache. But remember that we do not want any delay between publishing and re-publishing (distribution/syndication), and a cache serves stale content until it expires. So a cache does not work very well.
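The arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope estimate: the function name is made up for illustration, and it assumes that each of the ≈100 sub sites has a landing page with similar traffic and a similar number of feeds, which the numbers above do not guarantee.

```python
def polls_per_minute(seconds_between_visitors: float, feeds_per_page: int) -> float:
    """Feed polls per minute caused by one landing page (hypothetical helper)."""
    visitors_per_minute = 60 / seconds_between_visitors
    return visitors_per_minute * feeds_per_page


# One landing page: a new visitor every 10 seconds, 5 feeds on the page.
per_page = polls_per_minute(seconds_between_visitors=10, feeds_per_page=5)

# Assumption: roughly 100 sub sites behave like this one.
sub_sites = 100
total = per_page * sub_sites

print(f"{per_page:.0f} polls/minute for one page")
print(f"{total:.0f} polls/minute across {sub_sites} sub sites")
```

And every one of those polls triggers a full tree traversal, whether or not anything new has been published.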
The solution is Push
The PubSubHubbub (PuSH) team has made a great video explaining the problems with polling and the promise of push through PubSubHubbub.
We can now distribute information in real-time between producer and consumer. Also the syndication of information is made easier since the PuSH-server can join several feeds together and distribute them as one feed to the subscriber.
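To make the producer/consumer roles a bit more concrete, here is a minimal sketch of the messages involved, using the parameter names from the PubSubHubbub spec (`hub.mode`, `hub.topic`, `hub.callback`, `hub.url`, `hub.challenge`). The function names and URLs are made up for illustration, and the code only builds the form bodies; it does not talk to a real hub and is not how our PuSH-server is implemented.

```python
from urllib.parse import urlencode, parse_qs


def subscribe_request(topic_url: str, callback_url: str) -> str:
    """Form body a subscriber POSTs to the hub to start receiving pushes."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the Atom feed we want to follow
        "hub.callback": callback_url,  # where the hub should deliver new entries
    })


def publish_ping(topic_url: str) -> str:
    """Form body a publisher POSTs to the hub when its feed has changed."""
    return urlencode({"hub.mode": "publish", "hub.url": topic_url})


def verification_response(query_string: str):
    """The hub verifies a subscription with a GET to the callback;
    the subscriber confirms by echoing hub.challenge back."""
    params = parse_qs(query_string)
    if params.get("hub.mode") == ["subscribe"] and "hub.challenge" in params:
        return params["hub.challenge"][0]
    return None


# Hypothetical URLs, for illustration only.
print(subscribe_request("https://news.intranet.example/feed.atom",
                        "https://portal.intranet.example/push-callback"))
print(publish_ping("https://news.intranet.example/feed.atom"))
```

The point is that the subscriber says "tell me when this feed changes" once, instead of asking "anything new?" over and over.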
Another rather nice thing is that we make the search system listen to the information flowing through the PuSH-server and index it as it arrives. So any information that is distributed through the PuSH-server is also immediately searchable! In the same way, the PuSH-server can send all sorts of signals to other systems for them to take action on. But the most important thing is that we publish and distribute information to all users in real-time.
We also use the OpenSearch standard to track ”keywords”: whenever new or updated information containing a tracked keyword is added to the index, it ends up in an Atom feed that we distribute through the PuSH-server. So we do not have to poll the search index…
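As a rough illustration of a search system "listening" to the PuSH-server: the hub delivers new entries as an Atom document to the subscriber's callback, and the subscriber can index them straight away. The sketch below uses a toy in-memory index; `TinyIndex`, `entries_from_notification` and the sample feed are hypothetical and have nothing to do with our actual search engine.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"


def entries_from_notification(body: str):
    """Yield id/title/summary for each Atom entry the hub pushed to us."""
    root = ET.fromstring(body)
    for entry in root.iter(ATOM + "entry"):
        yield {
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "summary": entry.findtext(ATOM + "summary", default=""),
        }


class TinyIndex:
    """Toy inverted index: word -> set of entry ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc: dict) -> None:
        text = f"{doc['title']} {doc['summary']}"
        for word in text.lower().split():
            self.postings[word].add(doc["id"])

    def search(self, word: str) -> set:
        return self.postings.get(word.lower(), set())


# Made-up notification body, shaped like what a hub would POST to the callback.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example sub site</title>
  <entry>
    <id>urn:example:news-1</id>
    <title>Canteen menu updated</title>
    <summary>New lunch options this week</summary>
  </entry>
</feed>"""

index = TinyIndex()
for doc in entries_from_notification(SAMPLE):
    index.add(doc)  # indexed as soon as the push arrives, no polling

print(index.search("canteen"))
```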
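A very simplified sketch of the keyword tracking idea: match documents against a tracked keyword and wrap the hits in an Atom feed carrying the OpenSearch `totalResults` element. The function name, the title-only matching and the sample documents are assumptions for illustration; a valid Atom feed also needs elements such as `id` and `updated`, which are left out here to keep it short.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def keyword_feed(keyword: str, documents: list) -> str:
    """Build a (simplified) Atom feed of the documents whose title contains the keyword."""
    matches = [d for d in documents if keyword.lower() in d["title"].lower()]

    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = f"Search results for {keyword}"
    ET.SubElement(feed, f"{{{OPENSEARCH_NS}}}totalResults").text = str(len(matches))

    for d in matches:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = d["id"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = d["title"]

    return ET.tostring(feed, encoding="unicode")


# Made-up documents, standing in for newly indexed information.
docs = [
    {"id": "urn:example:1", "title": "New travel policy"},
    {"id": "urn:example:2", "title": "Canteen menu updated"},
]
feed_xml = keyword_feed("policy", docs)
print(feed_xml)
```

A feed like this can then be handed to the PuSH-server as just another topic, so anyone tracking the keyword gets the hits pushed to them.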
As always I appreciate any feedback, comments or RTs.