What’s the problem?
The search engines never got to see the real content.
I’ll use our course catalog as an example. You can browse the course catalog by going to http://www.scispike.com/training-general/course-list.html.
We have hundreds of courses, and the list of courses changes frequently. It is essential for us that the site lets us modify course outlines and course categories with minimal hassle. In practical terms, this means the course descriptions and their outlines must be retrieved from a dynamic source. The page above has two dynamic areas: the course catalog (where the list of courses and their categories is displayed) and the search box for courses.
The problem, of course, is: what does the search engine see?
This is literally all they see! This is pretty tragic for us because it means that web crawlers can’t find our web pages and hence nobody will find our courses.
The search engines let you provide an XML file called a sitemap. A sitemap is a simple XML document that lists all the pages on your site and gives the search engines additional crawling hints. It is a really simple format (you can read more about it at http://www.sitemaps.org/). A sitemap file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.scispike.com/training-general/course-list.html</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
The file contains an entry for every page you want the search engines to index. It also contains hints about how frequently each page changes (telling the search engine how often it ought to re-crawl the site) and the relative priority of the pages (a hint that some pages are more important than others).
Each of our course outlines can be obtained by a unique URL. This allows us to define the sitemap containing all the pages on the site.
To make this work, the sitemap must be generated from the same persistent source as the pages themselves. I therefore created a special servlet that reads the content from the database and generates the sitemap on the fly.
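The database access and the servlet plumbing are specific to our site, but the core of the servlet is a simple transformation from a list of page URLs to the sitemap XML. Here is a minimal sketch of that transformation; the class name, the placeholder URL, and the changefreq/priority values are illustrative, not the actual production code:

```java
import java.util.List;

// Sketch of the sitemap generation done inside the servlet.
// The real code reads the course-outline URLs from the database;
// here they are simply passed in as a list.
public class SitemapGenerator {

    public static String buildSitemap(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            xml.append("  <url>\n");
            xml.append("    <loc>").append(url).append("</loc>\n");
            // Hints to the crawler: how often the page changes and how
            // important it is relative to the rest of the site.
            xml.append("    <changefreq>weekly</changefreq>\n");
            xml.append("    <priority>0.8</priority>\n");
            xml.append("  </url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }
}
```

In the servlet, this string is simply written to the response; the crawler cannot tell that the sitemap was generated on the fly rather than stored as a file.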
All we had to do then was go to the main search engines and submit the URL of the sitemap. I did that yesterday, and I noticed that it takes some time before the search engines start to crawl, so I'm eagerly awaiting the result; at least in theory this should work well.
When I went to Yahoo to submit my sitemap, I could only find an interface for submitting individual URLs. It would be tragic if I had to go to the Yahoo site to submit a new URL every time we added or changed a course, so I decided we really needed a way for crawlers that do not use sitemaps to find our outlines.
I want humans to use our course list. It has all kinds of fancy features (try out the search box and look at the auto-completion); however, I want the crawlers to see the pages as standard HTML. What I decided to do is also generate a simple list on the server. The list has no frills (it is as simple a page as possible), but crawlers can navigate to it from the index page.
You can see this list if you know the URL (check out this URL), but nothing visible on the site points to it. Instead, our front page has a hidden link defined as follows:
<div style="display: none">
  <p><a href="all-links.html">All pages on this site</a></p>
</div>
When creating a dynamic site, you have to pay particular attention to how your dynamic content is published to the search engines. A typical Ajax application will not be found by the search engines because the content is not present in the initial HTML download; it is rendered only after the page has loaded or after some interaction on your site.
You can submit a sitemap that informs the search engines of content that crawlers cannot reach through site navigation (you would typically list the reachable content as well).
In the example above, I was in luck because all the content is available through unique URLs; only the catalog itself is not crawlable. To solve this situation, I used the following approach:
I created some server-side code that reads the dynamic content and generates a sitemap on the fly. The server also generates a static page (actually a dynamic page, but it appears static to the crawlers) containing all the links.
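The generation of that crawler-friendly link page follows the same pattern as the sitemap. A minimal sketch, again assuming the titles and URLs come from the database (the class name and the title-to-URL map are placeholders for illustration):

```java
import java.util.Map;

// Sketch of generating the no-frills, crawler-friendly link page.
// The real servlet reads course titles and URLs from the database;
// here they are passed in as a title -> URL map.
public class AllLinksPage {

    public static String buildPage(Map<String, String> courses) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>All pages on this site</title></head><body>\n");
        html.append("<ul>\n");
        for (Map.Entry<String, String> course : courses.entrySet()) {
            // Plain anchors only: no scripts, no styling, nothing the
            // crawler has to execute in order to discover the link.
            html.append("  <li><a href=\"").append(course.getValue()).append("\">")
                .append(course.getKey()).append("</a></li>\n");
        }
        html.append("</ul>\n</body></html>\n");
        return html.toString();
    }
}
```

Because every link is an ordinary anchor in the served HTML, any crawler that reaches this page can follow it to every course outline without executing a line of JavaScript.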
I did not want the list of links to be visible to users (they have a nice dynamic course catalog they can use instead), so I provided a hidden link to the file that only crawlers would follow.