Search Engine and Indexing of Dynamic Content

What’s the problem?

When loading a dynamic site built with GWT, Dojo, YUI or any other Ajax-based approach, it is difficult (and sometimes impossible) for a search engine to obtain your content for indexing. The initial load of the site may contain only a simple HTML element (often a div or span tag). That tag is then filled in by JavaScript using asynchronous calls to the server (Ajax) after the page has loaded.

The search engines never get to see the real content.

Example

I’ll use our course catalog as an example. You can browse the course catalog by going to http://www.scispike.com/training-general/course-list.html.

image[2]

We have hundreds of courses, and the list of courses changes frequently. It is essential for us that the site allows us to modify course outlines and course categories with minimal hassle. In practical terms, this means the course descriptions and their outlines must be retrieved from a dynamic source. The page above has two dynamic areas: the course catalog (where the list of courses and their categories is displayed) and the search box for courses.

image9[8]

The course list is rendered from a persistent data source that is easily modifiable. That is, we have the ability to upload courses without redeploying the site. To make this happen, we have written some client-side JavaScript (using GWT, so we actually wrote it in Java) that, once the page has loaded, dynamically fetches the current course catalog from the persistent source.

The problem, of course is… what does the search engine see?

This is literally all they see! This is pretty tragic for us because it means that web crawlers can’t find our web pages and hence nobody will find our courses.
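To make the effect concrete, here is a rough Java sketch of how little text survives when a page is read without running its JavaScript. It is only an approximation of a crawler, not how any real search engine works, and the class name, the `courseCatalog` id and the `catalog.js` file are made up for illustration.

```java
import java.util.regex.Pattern;

/** Rough approximation of what a crawler that does not run JavaScript can read. */
final class CrawlerViewSketch {

    // Script and style elements, including their bodies, carry no indexable text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");

    // Any remaining tag, for example <div id="..."> or </body>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    /** Returns the text left after removing scripts, styles and all tags. */
    static String visibleText(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical initial download of an Ajax page:
        // an empty placeholder element plus the script that would fill it in later.
        String initialHtml = "<html><body>"
                + "<div id=\"courseCatalog\"></div>"
                + "<script src=\"catalog.js\"></script>"
                + "</body></html>";

        // Prints "[]": there is no text at all for an indexer to work with.
        System.out.println("[" + visibleText(initialHtml) + "]");
    }
}
```

The same function returns the course titles as soon as they are part of the HTML the server sends, which is the idea behind the server-generated pages described in the rest of this post.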

Sitemaps

The web crawlers allow you to set up an XML file called a sitemap. A sitemap lists all the pages on your site and gives other search hints to the search engines. It is a really simple format (you can read more about it at http://www.sitemaps.org/). A sitemap file looks like this:

The file contains a description of all the files you want the search engines to index. It also contains hints as to how frequently the pages change (which tells the search engines how often they ought to crawl the site) and the relative priority of the pages (a hint that some pages are more important than others).

Each of our course outlines can be obtained by a unique URL. This allows us to define the sitemap containing all the pages on the site.

To make this work, we have to make sure we also generate the sitemap from the persistent source. I hence created a special servlet that reads the content from the database and generates the sitemap on the fly.
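A minimal sketch of such a generator could look like the following. It is not the servlet that runs on our site: the class name, the example.com URLs and the idea of passing the URLs in as a plain list (instead of reading them from the database) are assumptions made to keep the example self-contained. The element names and the namespace come from the sitemap protocol published at sitemaps.org.

```java
import java.util.Arrays;
import java.util.List;

/** Builds a sitemap document from a list of page URLs. */
final class SitemapSketch {

    /** Escapes the characters that must be entity-escaped inside XML text. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")   // must run first, or later entities get double-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /**
     * Returns the sitemap XML. changeFreq (for example "weekly") and priority
     * (for example "0.8") are passed through unchecked and applied to every URL.
     */
    static String buildSitemap(List<String> urls, String changeFreq, String priority) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            xml.append("  <url>\n");
            xml.append("    <loc>").append(escapeXml(url)).append("</loc>\n");
            xml.append("    <changefreq>").append(changeFreq).append("</changefreq>\n");
            xml.append("    <priority>").append(priority).append("</priority>\n");
            xml.append("  </url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Hypothetical outline URLs; a real generator would read them from the database.
        List<String> urls = Arrays.asList(
                "http://www.example.com/outline?id=1&lang=en",
                "http://www.example.com/outline?id=2&lang=en");
        System.out.print(buildSitemap(urls, "weekly", "0.8"));
    }
}
```

In a servlet, the returned string would be written to the response with an XML content type. Generating it on every request means a newly uploaded course shows up in the sitemap without any redeployment.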

All we had to do then was go to the main search engines and submit the URL of the sitemap. I did that yesterday and noticed that it takes some time before the search engines start the crawl, so I’m waiting eagerly for the result, but at least in theory this should work well.

Convenience list

When I went to Yahoo to submit my sitemap, I could only find an interface for submitting individual URLs. It would be tragic if I had to go to the Yahoo site to submit a new URL every time we added or changed a course, so I decided that we really need to provide a way for crawlers that do not use sitemaps to find our outlines.

I want humans to use our course-list. It has all kinds of fancy features (try out the search box and look at the auto completion). However, I want the crawlers to see the pages using standard HTML. What I decided to do is to also generate a simple list on the server. The list has no frills (it is as simple a page as possible), but crawlers can navigate to it from the index page.
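Generating such a list takes very little code. The sketch below is an illustration under assumptions, not the code behind our site: the class name, the course titles and the paths are invented, and the courses are passed in as a map from URL to title instead of being read from the database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Renders a plain, script-free HTML page with one link per course outline. */
final class LinkListSketch {

    /** Escapes text so it is safe inside HTML content and double-quoted attributes. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds the page; the map keys are outline URLs and the values are course titles. */
    static String buildLinkPage(Map<String, String> titlesByUrl) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>All course outlines</title></head><body>\n");
        html.append("<ul>\n");
        for (Map.Entry<String, String> course : titlesByUrl.entrySet()) {
            html.append("  <li><a href=\"").append(escapeHtml(course.getKey())).append("\">")
                .append(escapeHtml(course.getValue())).append("</a></li>\n");
        }
        html.append("</ul>\n");
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Hypothetical courses; LinkedHashMap keeps them in insertion order.
        Map<String, String> courses = new LinkedHashMap<>();
        courses.put("/courses/java-basics.html", "Java Basics");
        courses.put("/courses/uml.html", "UML & Modeling");
        System.out.print(buildLinkPage(courses));
    }
}
```

Because every link is an ordinary anchor in the HTML the server returns, a crawler needs no JavaScript to follow it.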

The list is reachable only if you know its URL (check out this URL); nothing visible on the site links to it. Our front page has a hidden link defined as follows:

A crawler will find the link and hence chase down all the outlines. This allows me to submit only one page (http://www.scispike.com). When pages change, all-links.html changes as well. I guess I could have skipped the hidden link on the front page and submitted the hidden page instead, but I think this is better: I am pretty sure crawlers will (eventually) find the front page, so new search engines should also pick up the hidden URL.
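The reason a hidden link can work is that a simple crawler reads the markup and ignores how the page is styled. The sketch below illustrates that idea with a deliberately naive link extractor. It is an assumption about crawler behavior, not a description of any particular search engine, and the anchor in the example is hypothetical markup, not the actual link on our front page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive link extraction: collects href values without looking at styling. */
final class LinkExtractorSketch {

    // Matches <a ... href="..."> with a double-quoted href; other quoting styles are ignored.
    private static final Pattern HREF =
            Pattern.compile("(?is)<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"");

    /** Returns every double-quoted href found in an anchor tag, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical front page: one visible link and one link hidden with CSS.
        String frontPage = "<html><body>"
                + "<a href=\"training-general/course-list.html\">Courses</a>"
                + "<a style=\"display:none\" href=\"all-links.html\">All outlines</a>"
                + "</body></html>";

        // Both links are found; the display:none style makes no difference here.
        System.out.println(extractLinks(frontPage));
    }
}
```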

Summary

When creating a dynamic site, you have to pay particular attention to how your dynamic content is published to the search engines. A typical Ajax application will not be found by the search engines because the content is not provided in the initial HTML download, but is rendered after the page has loaded or after some interaction on your site.

You may submit a sitemap to the search engines that informs them of content that crawlers cannot reach through site navigation (you would typically list the reachable content in it as well).

In the example above, I was in luck, because all the content is available through a unique URL. Only the catalog is not crawlable. To solve this unique situation, I used an approach as shown below:

image[3]

I created some server side code that reads the dynamic content and generates a sitemap on the fly. The server also generates a static page (actually it is a dynamic page, but it will appear to the crawlers as static) containing all the links.

I did not want the list of links to be visible to the users (they have a nice dynamic course catalog that they can use), so I provided a hidden link to the file that only the crawlers would use.
