Interesting Article: Interview with Patrick Chanezon
Read "Interview with Patrick Chanezon". They talk about Enterprise Syndication. HiT Syndicaat is mentioned as one of the two players in this address space.
Developing Enterprise Syndication Management Systems
Originally, when we started development of HiT Syndicaat, we had envisioned a rather heterogeneous pattern of usage. Basically, we expected the content to syndicate to be either data or text. In the former case, data is frequently updated. In the latter, content needs to be searched in sophisticated ways. For this reason, in our product architecture, we designed the repository layer to support both a text repository and a relational repository. For internal reasons, we started with the text repository support. Naturally, we immediately considered using Apache Lucene as our text search engine.
Things we noticed right away were the incredible performance of text searches (try searches at our demo RSS sites: HiT Syndicaat RSS Demo Feeds) and the relative slowness of updates when changes must be immediately visible (if changes are cached for later commitment, performance is outstanding). Soon, however, we realized that content syndication is not about real-time performance (as in online transaction processing) but rather about efficient and powerful text queries. For this reason, we decided to postpone development of the relational repository support and to stick to the Lucene-based text search engine for the foreseeable future.
Currently, as far as we know, not many web logs or RSS platforms are Lucene-based (check Powered by Lucene). This is rather surprising considering that RSS and web log content is predominantly text, and considering the efficiency of the Lucene software (let's not forget it is open-source...). What do you think?
Recently, in this web log, we described how we aggregated content obtained from BBC backstage RSS feeds, how we combined it into just one RSS feed and made that feed searchable through A9. We received a number of objections to this experiment. People told us that popular RSS search engines can already be queried using A9. Others objected that RSS content is just a fraction of the "real" content pointed to by the (optional) RSS link field contained in each RSS item. Both objections indeed raise reasonable doubts about the approach.
However, being able to carefully select just a small set of RSS feeds (possibly filtering them with feed-specific conditions) and make them collectively searchable with A9 has the advantage of visibility. By selecting a specific "column", A9 users make the conscious decision to keep "relevant" columns (i.e., subjects or data sources) visible at all times with query results. This will never happen when searching with a global RSS search engine in just one column.
The other objection, about RSS content being too "slim" compared to the "real" content, is a very valid one. Indeed, we are already working on this issue: rather than including the content pointed to by RSS link fields in the feed, we ONLY text-index it in our data store. This way, when executing a text search on RSS content, items can be returned whose links point to text satisfying the query request.
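To make the idea concrete, here is a minimal sketch of "index the linked text, but store only the RSS item". It is a toy inverted index in plain Python, not the HiT Syndicaat implementation and not Lucene code; the class name, method names and example.org links are all hypothetical. Lucene offers a comparable distinction between fields that are indexed and fields that are stored.

```python
import re
from collections import defaultdict


class LinkedTextIndex:
    """Toy inverted index: linked text is indexed for search but never stored."""

    def __init__(self):
        self.items = {}                   # link -> stored RSS item fields
        self.postings = defaultdict(set)  # word -> links whose text contains it

    @staticmethod
    def _tokens(text):
        # Lower-case the text and split it into alphanumeric words.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, link, title, description, linked_text):
        # Only the slim RSS fields are kept in the store.
        self.items[link] = {"title": title, "description": description, "link": link}
        # The linked page text contributes words to the index, then is discarded.
        for word in self._tokens(f"{title} {description} {linked_text}"):
            self.postings[word].add(link)

    def search(self, query):
        # Return stored items matching every word of the query.
        words = self._tokens(query)
        if not words:
            return []
        links = set.intersection(*(self.postings.get(w, set()) for w in words))
        return [self.items[link] for link in sorted(links)]


index = LinkedTextIndex()
index.add("http://example.org/a", "Paris talks", "Short summary",
          "Full article text mentioning Eurostar delays")
index.add("http://example.org/b", "Weather", "Rain expected",
          "Long forecast body")
print([item["link"] for item in index.search("eurostar")])
```

The query word appears only in the linked article, yet the matching RSS item is returned, and the article text itself never enters the store.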
We wish to proceed with our experiment and are available to A9-enable, free of charge, other RSS content pointed out to us by readers of this blog. So, is there anyone for A9 out there?
Recently, the BBC started an interesting initiative called backstage (http://backstage.bbc.co.uk/).
According to their web site, the motto is Build what you want using BBC content. So, we tried a little experiment. We aggregated and stored all the provided BBC RSS feeds into a single RSS feed (note: platform support provided by the HiT Syndicaat server). Our goal was to provide a demo environment to show the performance and resource utilization of our RSS server.
There is an interesting twist to our experiment, namely RSS search support. HiT Syndicaat supports searches and also supports the A9 search format.
So, we registered the BBC RSS feed collection with the A9 server and now you can run your queries on BBC content (starting May 18, 2005) within A9.
To do that, follow these instructions:
Go to http://a9.com
Login (or Register if necessary)
Go to http://a9.com/-/search/moreColumns.jsp
In the Search Columns edit box enter: RSS BBC. There, you will find our collection of all RSS BBC feeds coalesced into one feed and provided by our server. Add that column and run your searches.
Alternatively, if you want to run your searches directly from our server, use the following URL:
http://www.hitsyndicaat.com:8080/synd/_channels/ALL+BBC
followed by URL parameters as in the following examples:
http://www.hitsyndicaat.com:8080/synd/ALL+BBC?daterange=-m15 to return all RSS items in the last 15 minutes
http://www.hitsyndicaat.com:8080/synd/ALL+BBC?texttitle=Paris to return all RSS items containing Paris in the title
http://www.hitsyndicaat.com:8080/synd/ALL+BBC?description=Paris to return all RSS items containing Paris in the description
http://www.hitsyndicaat.com:8080/synd/ALL+BBC?daterange=-d3%20to%20-d2 to return all RSS items saved three days ago.
Please refer to the HiT Syndicaat documentation for further instructions on search syntax.
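For readers who prefer to build these search URLs from a script, the following is a small sketch in Python. It only reproduces the parameter names shown in the examples above (daterange, texttitle, description); the helper name build_search_url is our own invention for illustration, and any other parameter should be checked against the product documentation.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the examples above.
BASE_URL = "http://www.hitsyndicaat.com:8080/synd/ALL+BBC"


def build_search_url(**params):
    """Append search parameters to BASE_URL (hypothetical helper)."""
    if not params:
        return BASE_URL
    # quote_via=quote encodes spaces as %20, as in the date-range example.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url(daterange="-m15"))
print(build_search_url(texttitle="Paris"))
print(build_search_url(daterange="-d3 to -d2"))
```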
Early on in our design phase of HiT Syndicaat, we had a difficult technical decision to make. How were we going to develop our syndication management tools in such a way to make them simple to use and effective at the same time?
We had a few options to consider. Considering the architecture of our platform (Apache Tomcat, REST APIs and Lucene as backend database), initially we leaned towards standard JSP development. Its major pluses are the stability and maturity of the approach as well as the definite advantage of having a browser-based management console. In fact, all syndication and blogging platforms currently available (commercial and open-source) are browser-based applications (although most of them are PHP-based rather than J2EE applications). It is commonly agreed that browser applications are extremely effective for the average user since they do not require deployment and setup. There is also a definite similarity among all browser-based applications that makes the average user more comfortable once familiar with them.
Besides this option, we considered two other alternatives. In fact, we had major doubts about the JSP approach because on the one hand we figured that management consoles are complex GUI applications that over time get even more complex (by natural progression of the product). On the other hand, our typical user is not the typical "web surfer". In most instances, it is either systems management personnel or a "power user". These people are more inclined to appreciate effectiveness over GUI slickness.
So, we looked at DHTML development and smart-client development. The former approach is all the rage these days (see gmail and gmap by Google, or the discussions about AJAX). With smart DHTML development, by all means, your browser becomes a smart-client (especially when combined with asynchronous HTTP/XML calls to the backend server). Our major concern with scripting development was code stability and code maintenance. Until scripting environments mature significantly to become as robust a development environment as typical IDEs, we figured that quality and costs were not acceptable for our development schedule.
In the end, we were left with smart-client development. Considering that our target was the development of a powerful syndication management console, we were comforted in our decision by the fact that as far as we know most management consoles for DBMS are smart client applications (albeit not HTTP ones). Although many backend systems are provided with browser-based tools, full-feature management consoles are all stand-alone applications.
We think we overcame implicit limitations of the smart client approach by developing in Java/SWT (operating system portability) and by providing browser-based setup environments (to simplify application deployment and setup).
Open Search is a web tool that supports searches on multiple heterogeneous web data sources using a minimalist web query syntax and a minimalist XML/RSS schema for returning query result sets. Query syntax is based on query templates provided by web data sources when they register their URLs with the Open Search directory of compliant web data sources.
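As an illustration of how a query template works, the sketch below fills placeholders in a template URL. The {searchTerms} and {startPage} placeholder names follow the Open Search specification as we understand it; the example.org template and the fill_template helper are made up for this example.

```python
import re
from urllib.parse import quote


def fill_template(template, **values):
    """Replace each {name} placeholder with its URL-encoded value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {{{name}}}")
        return quote(str(values[name]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


# Hypothetical template, as a data source might register it.
template = "http://example.org/search?q={searchTerms}&page={startPage}"
print(fill_template(template, searchTerms="enterprise syndication", startPage=1))
```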
Clearly, the primary objective of this specification is to radically change the way web searches are currently supported by search engines. Rather than relying on ever-growing search repositories, web search platforms should perform concurrent searches on user-selected specialized searchable repositories. This way, multiple specialized search engines can be searched at the same time and their results coalesced into a combined XML document.
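The concurrent-search idea can be sketched as follows. This is only an illustration under our own assumptions: each data source is modelled as a plain Python callable returning (title, link) pairs, whereas a real client would issue HTTP requests, and the combined document is a simplified RSS 2.0 skeleton rather than a complete, validated feed. The function name search_all and the stub sources are hypothetical.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def search_all(sources, query):
    """Query every source concurrently and merge the results into one RSS string.

    `sources` maps a source name to a callable that takes the query and
    returns a list of (title, link) tuples.
    """
    names = list(sources)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        # pool.map preserves the order of `names`.
        result_lists = list(pool.map(lambda name: sources[name](query), names))

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Combined results for {query}"
    ET.SubElement(channel, "description").text = "Results coalesced from several sources"
    for name, results in zip(names, result_lists):
        for title, link in results:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = title
            ET.SubElement(item, "link").text = link
            # The category element labels which source produced the item.
            ET.SubElement(item, "category").text = name
    return ET.tostring(rss, encoding="unicode")


# Stub sources standing in for specialized search engines.
sources = {
    "news": lambda q: [(f"News about {q}", "http://example.org/news/1")],
    "blogs": lambda q: [(f"Blog post on {q}", "http://example.org/blogs/7")],
}
xml_text = search_all(sources, "Paris")
print(xml_text)
```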
In the Open Search approach, RSS 2.0 (i.e., the format used for returning search result sets) is just the result set XML schema specification. In fact in most cases, the internal data structure of the heterogeneous web data sources can be anything and typically, it will be neither RSS, nor XML (nor is the semantics of web data sources linked to syndication).
Even though in Open Search, RSS is only a convenient format for returning search results, there are no intrinsic reasons for not using the Open Search specification for searching syndication content.
Blogs can be very conveniently searched using Open Search query syntax. However, should "pure" syndication content (i.e., syndication documents not tied to a blog) be searchable with Open Search?
If a syndication management system is used to complement a Web Content Management system, it should definitely support the Open Search specification. In this context, an important extension to Open Search should be support of authenticated access. This would enable searches based on named subscriptions. We also believe that more advanced search capabilities (such as the use of the RSS category element, or restrictions on all RSS elements or attributes) would improve the quality of results. It should also be possible to return data source specific extensions to RSS in the result sets.
In our implementation of an enterprise syndication management system (HiT Syndicaat), we decided to Open Search-enable queries from the first release. This turned out to be a fairly light task because our URL-based query syntax was already richer than the minimalist Open Search query specification.
I am hoping to hear from HiT Syndicaat users about how useful they found our Open Search support.
With limited effort, WCM systems can be extended (and most of them already are) to return XML feeds linked to dynamically generated web content (see for instance the CMS Matrix for a complete roundup of RSS support in CMS). In most instances, when WCMs are in place, many of the functionalities of an RSS Syndication Management System (SyMS) are provided by the WCM itself.
Typically, this is accomplished by creating an infrastructure for feed management (feed creation and editing) and an infrastructure for feed content generation. Basically, when creating a new node (web page) in the WCM, the same node can be assigned (automatically or manually) both to the web hierarchy and/or to the RSS feed collection of items. This approach is definitely appropriate (and probably the optimal one) whenever there is an almost one-to-one relationship between what is shown on a web page and what is referred to by an RSS item.
However, in many situations this is not the case because web content is fairly static over a period of time, whereas syndication content is generally highly dynamic. Not only that, there are other mismatches between the two models:
- Syndication content lives long (see The Long Tail). It should also be searchable for a long time. Typically, we do not have the same requirements for web content. Given the long life of syndication content, a syndication system should be provided with the means for letting external applications query its content (see for instance the OpenSearch http://opensearch.a9.com/specification) and the means to store its content indefinitely.
- Syndication content has limited graphical requirements. In many instances, syndication content could do without HTML. In other instances, when HTML content is present (typically pointed to by links in RSS items), editing of its graphical content should require only a few keystrokes (possibly, from a set of pre-configured templates). For editors, item creation should be a simple yet powerful action.
- Management of syndication feeds should be highly dynamic. A powerful syndication system should support a dynamic model for the creation, editing and removal of feeds. Power users should be provided with the ability to affect the number of feeds and the choice of who is the potential target for the feed. Compared to typical usage patterns in web development, syndication feeds are closer to weblogs than to portals. Editors (not site administrators) are often empowered to create content and feeds.
- Another major difference between RSS feed management and WCM is the ability of RSS to carry content from heterogeneous data sources, whereas WCM data comes from a central repository. RSS Syndication Management Systems (at least according to our view of RSS SyMS) should be totally programmable by means of REST XML APIs. This way, content could come from ERP systems, Office applications, databases, etc. Similarly, RSS content should be searchable and accessible to heterogeneous systems (not only the web browser) such as mobiles, multi-media players, iPods, etc.
Given the differences, we believe that a hybrid management system (i.e., WCM + Syndication Management System, which for simplicity we refer to as Syndication Content Management) should have the following components:
- a light-weight content authoring GUI interface to manually edit content and create syndication HTML templates;
- a syndication management system;
- a set of web content aggregators (to simplify the creation of automatic content to syndicate);
- a query processor (compliant with the OpenSearch architecture) to make the syndication content searchable over the internet.