"I am webby and I think webby" - AjiNIMC aka Aji Issac Mathew - "I thought and I wrote".

I am Aji Issac Mathew, also known as AjiNIMC on various forums. I am webby and I think webby. I am a part-time blogger, and this blog documents my experiences and what I have learned.


Google Bot and Cache

Sep 19

As I promised in my previous post, I am writing about Googlebot and the cache. Before getting into it, let's understand how a search engine works and the role spiders/bots play in it.
[Figure: search engine diagram. This is the simplest diagram of a search engine cache.]
Here the spiders/bots/robots crawl the web pages and store them in the page repository (huge databases. If you have used CVS, SVN or any other version control app, you will understand the word "repo" better; in simpler terms, a storehouse). Then the algorithm is applied to the cached pages to produce the SERPs (Search Engine Results Pages). So ranking depends directly on the cached pages, not on what you currently have on your pages. The crawling logic for the spider is also redefined, both for individual sites and in general.
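To make the "ranking depends on the cache" point concrete, here is a toy sketch in Python. Everything in it (the `live_web` and `repository` dictionaries, the `crawl` and `search` functions, the example URL) is made up for illustration and says nothing about how a real search engine is built.

```python
# Toy model: search only ever reads the repository (cache), never the live web.
live_web = {"example.com/a": "old text about cats"}  # hypothetical live page
repository = {}                                       # hypothetical page repository


def crawl(url):
    """Copy the current live page into the repository."""
    repository[url] = live_web[url]


def search(term):
    """'Rank' (here: just filter) using cached copies only."""
    return [url for url, text in repository.items() if term in text]


crawl("example.com/a")
live_web["example.com/a"] = "new text about dogs"  # the page changes after the crawl

before_recrawl = search("dogs")  # the cached copy still talks about cats
crawl("example.com/a")           # the bot visits again and refreshes the cache
after_recrawl = search("dogs")
```

Until the second `crawl`, the new wording on the live page is invisible to `search`, which is the situation described above.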

Sometimes you will see from your log files that Google is visiting your pages but not caching them (if you think Google is not visiting your pages, check your log format, and also check robots.txt). There can be various reasons for this (filters, bans, etc., though with filters and bans I doubt whether Google visits the pages at all). One of the reasons is “no modification since last visit”.

With SVN we use svn diff to find the modifications; in Linux we simply use diff. Similarly, Google checks whether the page has been modified since its last visit. IMO it would be a criminal offense to repeat what the gurus and gods of search engines have already documented so well.
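For readers who have not used diff, here is a minimal Python sketch of the same idea using the standard `difflib` module. The `page_diff` function and the two sample snapshots are my own illustration; this is not a claim about how Google compares pages.

```python
import difflib


def page_diff(last_content, current_content):
    """Return unified-diff lines between two page snapshots (empty if identical)."""
    return list(difflib.unified_diff(
        last_content.splitlines(),
        current_content.splitlines(),
        fromfile="last_visit",
        tofile="current",
        lineterm="",
    ))


old = "Welcome\nUpdated: 18 Sep"   # made-up snapshot from the last visit
new = "Welcome\nUpdated: 19 Sep"   # made-up snapshot from the current visit

unchanged = page_diff(old, old)    # no differences, so an empty list
changed = page_diff(old, new)      # lines starting with "-" and "+" mark the change
```

An empty result corresponds to "no modification since last visit"; any non-empty result means something on the page changed.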

I commented on Matt’s blog, but have had no answer yet:

As usual, great post Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, the cache systems, ranking algos, etc. This post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.

  1. Say a page has not been updated for the last few days: will the frequency of Google's visits be adjusted accordingly (fewer visits)?
  2. Is a small change, like a date update or feeds, enough of a change to avoid a 304 (Not Modified) for Googlebot?

According to me,

  • Answer 1: Yes, the frequency will change. In the diagram, see how the central logic redefines the bot's logic.
  • Answer 2: So far, I do not think Google takes bytes into consideration for “If modified since”. As a programmer, you can always create a file of the modified content and check the size of the modification.
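Answer 1 can be pictured with a small sketch. This is purely my assumption of what such feedback could look like (double the wait when nothing changed, halve it when something did); the function name `next_visit_interval` and the bounds are hypothetical, and this is not Google's actual scheduler.

```python
def next_visit_interval(current_hours, page_changed, min_hours=1.0, max_hours=720.0):
    """Hypothetical policy: revisit sooner after a change, later after no change."""
    if page_changed:
        proposed = current_hours / 2   # page is active: come back sooner
    else:
        proposed = current_hours * 2   # nothing new: back off
    # Keep the interval inside the assumed bounds.
    return max(min_hours, min(max_hours, proposed))


interval = 24.0
for changed in [False, False, True]:   # two unchanged visits, then one change
    interval = next_visit_interval(interval, changed)
```

With these made-up numbers the interval goes 24, 48, 96 and then back to 48 hours, so a page that sits unchanged gets visited less often.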

Sometime in the future we may well see:

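// Hypothetical future check: count the page as modified only when the diff exceeds Y bytes.
// CatchTheDiff() and SizeInBytesForFile() are placeholder helpers; Y is an unspecified threshold.
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > Y) return true;
    return false;
}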

The current function might be:

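// Guess at the current behaviour: any non-empty diff counts as a modification.
// CatchTheDiff() and SizeInBytesForFile() are placeholder helpers, as above.
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > 0) return true;
    return false;
}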
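If you want to play with the idea, here is a runnable Python sketch of both variants in one function. The names `changed_bytes` and `is_modified_since`, the way changed bytes are counted, and the sample pages are all my own assumptions; it only illustrates the threshold idea and does not describe what Google really does.

```python
import difflib


def changed_bytes(last_content, current_content):
    """Count UTF-8 bytes removed from the old page plus bytes added in the new one."""
    matcher = difflib.SequenceMatcher(None, last_content, current_content, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += len(last_content[i1:i2].encode("utf-8"))
            total += len(current_content[j1:j2].encode("utf-8"))
    return total


def is_modified_since(last_content, current_content, threshold_bytes=0):
    """threshold_bytes=0 mirrors the 'current' guess; a larger value plays the role of Y."""
    return changed_bytes(last_content, current_content) > threshold_bytes


old = "Hello world. Date: 18"   # made-up page at the last visit
new = "Hello world. Date: 19"   # same page with only the date changed

any_change = is_modified_since(old, new)                      # threshold 0
big_change = is_modified_since(old, new, threshold_bytes=50)  # hypothetical Y = 50
```

With a threshold of 0 the date change alone is enough to count as modified; with a byte threshold such a tiny edit would no longer be enough.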

As I have mentioned, add feeds, dates and some dynamic content to your pages to get fresh cache dates. I have always learned that search engines like pages with fresh content, and a search engine considers a page fresh if it has been modified since the last visit. Also, if you care about bandwidth, you can save some of what Googlebot consumes by returning proper HTTP 304 (Not Modified) responses. If you have any doubts, ask and I will try to answer within my limits :).
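For the bandwidth part, here is a minimal sketch of the server-side decision, written in Python with only the standard library. The function name `conditional_status` and the sample date are made up; a real site would normally let its web server or framework handle the `If-Modified-Since` header, so treat this as an illustration of the logic only.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the date the client sent, else 200.

    last_modified is assumed to be a timezone-aware datetime.
    """
    if not if_modified_since:
        return 200                      # no conditional header: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                      # unparsable header: send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304                      # nothing new: headers only, no body
    return 200


last_modified = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # made-up timestamp
header = format_datetime(last_modified, usegmt=True)              # what the bot would send back

unchanged_status = conditional_status(header, last_modified)
updated_status = conditional_status(header, last_modified + timedelta(hours=1))
```

A 304 carries no body, which is where the bandwidth saving comes from; as soon as the page's modification time moves past the date in the header, the full page is served again with a 200.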


This post was written by AjiNIMC aka Web Kotler at 8:14 pm under category Tech Talks




3 Comments »

  1. hi i wnt to know more about seo
    vijay on November 1, 2006 - 7:00 pm

  2. the article is useful.
    web development chennai on January 8, 2007 - 3:59 pm

  3. Thanks, great to hear that you found it helpful.
    AjiNIMC on January 8, 2007 - 10:37 pm
