There has been endless webmaster speculation
and worry about the so-called "Google Sandbox" - the
indexing time delay for new domain names - rumored to
last for at least 45 days from the date of first "discovery"
by Googlebot. This recognized listing delay came to
be called the "Google Sandbox effect."
Ruminations on the algorithmic elements of this sandbox
time delay have ranged widely since the indexing delay
was first noticed in spring of 2004. Some believe it
hinges on a single element of search engine
optimization, such as linking campaigns. Link
building has been the focus of most discussion, but
others have focused on the size of a new site, its
internal linking structure, or simply a fixed time
delay as the most relevant algorithmic elements.
Rather than contribute to this speculation and further
muddy the Sandbox, we'll be looking at a case study
of a site on a new domain name, established May 11,
2005, and its specific site structure, submission activity,
and external and internal linking. We'll see how this plays
out in search engine spider activity vs. indexing dates
at the top four search engines.
Ready? We'll give dates and crawler action in daily
lists and see how this all plays out on this single
new site over time.
* May 11, 2005 - Basic text for a large site posted on a newly
purchased domain name, going live by day's end. Search
friendly structure implemented with text linking, making
full discovery of all content possible by robots. Home
page updated, with 10 new text content pages added daily.
Submitted site at Google's "Add URL" submission page.
* May 12 - 14 - No visits by Slurp, MSNbot, Teoma or
Google. (Slurp is Yahoo's spider and Teoma is from Ask
Jeeves.) Posted link on WebSite101 to new domain at Publish101.com.
* May 15 - Googlebot arrives and eagerly crawls 245
pages on new domain after looking for, but not finding
the robots.txt file. Oooops! Gotta add that robots.txt
file!
* May 16 - Googlebot returns for 5 more pages and stops.
Slurp greedily gobbles 1480 pages and 1892 bad links!
Those bad links were caused by our email masking meant
to keep out bad bots. How ironic that Slurp likes these.
* May 17 - Slurp finds 1409 more masking links &
only 209 new content pages. MSNbot visits for the first
time and asks for robots.txt 75 times during the day,
but leaves when it finds that file missing! Finally
get around to adding robots.txt by day's end to stop
Slurp crawling email masking links and let MSNbot know
it's safe to come in!
* May 23 - Teoma spider shows up for the first time
and crawls 93 pages. Site gets slammed by BecomeBot,
a spider that hits a page every 5 to 7 seconds and strains
our resources with 2409 rapid fire requests for pages.
Added BecomeBot to robots.txt exclusion list to keep
'em out.
* May 24 - MSNbot has not shown up for a week, ever
since finding the robots.txt file missing. Slurp is
showing up every few hours looking at robots.txt and
leaving again without crawling anything now that it
is excluded from the email masking links. BecomeBot
appears to be honoring the robots.txt exclusion but
asks for that file 109 times during the day. Teoma crawls
139 more pages. Another bad bot called aipbot crawled
2306 pages. Blocked 'em with robots.txt to keep them
out.
* May 25 - We realize that we need to re-allocate server
resources and redesign the database, and this requires changes
to URLs, which means all previously crawled pages are
now bad links! Implement subdomains and wonder, what
now? Slurp shows up and finds thousands of new email
masking links, as the robots.txt was not moved to the new
directory structures. Spiders are getting error pages
upon new visits. Scampering to put out fires after wide-ranging
changes to site, we miss this for a week. Spider action
is spotty for 10 days until we fix robots.txt.
* June 4 - Teoma returns and crawls 590 pages! No others.
* June 5 - Teoma returns and crawls 1902 pages! No others.
* June 6 - Teoma returns and crawls 290 pages. No others.
* June 7 - Teoma returns and crawls 471 pages. No others.
* June 8-14 - Odd spider behavior, looking at robots.txt
only.
* June 15 - Slurp gets thirsty, gulps 1396 pages! No
others.
* June 16 - Slurp still thirsty, gulps 1379 pages! No
others.
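The robots.txt fixes described in the log above can be
sanity-checked before any spider sees them. Here is a
minimal sketch in Python using the standard library's
robots.txt parser. The rule set is only our guess at what
such a file might look like, and the /mailmask/ path is a
made-up stand-in, since the log does not say where the
email masking links lived.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The /mailmask/ path is an assumption made
# for illustration; it is not the real Publish101.com layout.
ROBOTS_TXT = """\
User-agent: BecomeBot
Disallow: /

User-agent: aipbot
Disallow: /

User-agent: *
Disallow: /mailmask/
"""


def build_parser(text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, agent, path):
    """Return True if the named crawler may fetch the given path."""
    return parser.can_fetch(agent, "http://publish101.com" + path)


parser = build_parser(ROBOTS_TXT)

# Short crawler names are used here rather than full User-Agent strings.
for agent in ("Googlebot", "Slurp", "BecomeBot", "aipbot"):
    for path in ("/articles/page1.html", "/mailmask/abc123"):
        verdict = "allowed" if allowed(parser, agent, path) else "blocked"
        print(agent, path, verdict)
```

Crawlers request robots.txt from the root of each host, so
a move to subdomains like the one on May 25 means every
subdomain needs its own copy of the file.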
So we'll take a break here at the five-week point and
take note of the very different behavior of the top
crawlers. Googlebot visits once and looks at a substantial
number of pages but doesn't return for over a month.
Slurp finds bad links and seems addicted to them, as
it stops crawling good pages until it is told to lay
off the bad liquor, er, that is, links, by getting robots.txt
to slap Slurp to its senses. MSNbot visits looking for
that robots.txt and won't crawl any pages until told
what NOT to do by the robots.txt file. Teoma just crawls
like crazy, takes breaks, then comes back for more.
This behavior may imitate the differing personalities
of the software engineers who designed them. Teoma is
tenacious and hard working. MSNbot is timid, needs
instruction and some reassurance it is doing the right
thing, and picks up pages slowly and carefully. Slurp has
an addictive personality and performs erratically on a
random schedule. Googlebot takes a good long look and
leaves. Who knows whether it will be back, and when?
Now let's look at indexing by each engine. As of this
writing on July 7, each engine shows differing
indexing behavior as well. Google shows no pages indexed
although it crawled 250 pages nearly two months ago.
Yahoo has three pages indexed in a clear aging routine
that doesn't list any of the nearly 8,000 pages it has
crawled to date (not all itemized above). MSN has 187
pages indexed while crawling fewer pages than any of
the others. Ask Jeeves has crawled more pages to date
than any search engine, yet has not indexed a single
page.
Each of the engines will show the number of pages indexed
if you use the query operator "site:publish101.com"
without the quotes. MSN 187 pages, Ask none, Yahoo 3
pages, Google none.
The daily activity not listed in the three weeks since
June 16 above has not varied dramatically, with Teoma
crawling a bit more than other engines, Slurp erratically
up and down and MSN slowly gathering 30 to 50 pages
daily. Google is absent.
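Daily counts like the ones in this case study come from
the server's access log. As a rough sketch of how such a
tally might be produced in Python, assume the log is in
Apache's combined format. The sample lines, the 192.0.2.1
address and the exact user-agent strings below are invented
for illustration and are not taken from the Publish101.com
logs.

```python
import re
from collections import Counter

# Lowercase substrings to look for in the User-Agent field.
BOT_MARKERS = {
    "googlebot": "Googlebot",
    "slurp": "Slurp",
    "msnbot": "MSNbot",
    "teoma": "Teoma",
    "becomebot": "BecomeBot",
    "aipbot": "aipbot",
}

# Matches the date, request path and user agent of a combined-format line.
LOG_LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]*\] '            # [15/May/2005:10:12:01 -0700]
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" '    # "GET /page.html HTTP/1.1"
    r'\d{3} \S+ '                            # status code and size
    r'"[^"]*" "(?P<agent>[^"]*)"'            # referer and user agent
)


def tally_crawls(lines):
    """Count requests per (day, crawler, kind), kind being robots.txt or pages."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for marker, name in BOT_MARKERS.items():
            if marker in agent:
                is_robots = match.group("path") == "/robots.txt"
                kind = "robots.txt" if is_robots else "pages"
                counts[(match.group("day"), name, kind)] += 1
                break
    return counts


def sample_line(stamp, path, agent):
    """Build one made-up log line for the demonstration below."""
    return (f'192.0.2.1 - - [{stamp} -0700] "GET {path} HTTP/1.1" '
            f'200 512 "-" "{agent}"')


GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1)"
SAMPLE_LOG = [
    sample_line("15/May/2005:10:12:01", "/robots.txt", GOOGLE),
    sample_line("15/May/2005:10:12:05", "/page1.html", GOOGLE),
    sample_line("15/May/2005:10:12:09", "/page2.html", GOOGLE),
    sample_line("16/May/2005:03:40:11", "/page1.html",
                "Mozilla/5.0 (compatible; Yahoo! Slurp)"),
    sample_line("17/May/2005:08:01:30", "/robots.txt", "msnbot/1.0"),
    sample_line("17/May/2005:08:02:00", "/page1.html",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
]

counts = tally_crawls(SAMPLE_LOG)
for (day, bot, kind), hits in sorted(counts.items()):
    print(day, bot, kind, hits)
```

Counting robots.txt requests separately from page requests
makes it easy to spot a spider that keeps asking for
robots.txt without crawling anything.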
The linking campaign has been minimal, with posts to discussion
lists, a couple of articles and some blog activity.
Looking back over this time it is apparent that a listing
delay is actually quite sensible from the view of the
search engines. Our site restructuring and bobbled robots.txt
implementation seem to have abruptly stalled crawling,
but the indexing behavior of each engine displays a distinctly
different policy for each major player.
The sandbox is apparently not just Google's playground,
but it is certainly tiresome after nearly two months.
I think I'd like to leave for home, have some lunch
and take a nap now.
Back to class before we leave for the day, kiddies. What
did we learn today? Watch early crawler activity and
be certain to implement robots.txt early and adjust
often for bad bots. Oh yes, and the sandbox belongs
to all search engines.
About The Author: Mike Banks
Valentine is a search engine optimization specialist
who operates WebSite101 Ecommerce Tutorial and will
continue to report on this case study chronicling search indexing
of Publish101.com.