House of Fusion
Home of the ColdFusion Community


December 02, 2008

Home /  Groups /  ColdFusion Talk (CF-Talk)

How does Security affect search engine spiders?

Author:
Doug Boude (rhymes with 'loud')
10/12/2008 05:27 PM

Hi all. I am curious if anybody knows how securing a site affects a search engine spider's ability to crawl it. For instance, if I have my entire site secured by means of authentication so that any page request is redirected to the login page if the appropriate security creds are not present in session, do spiders receive the same treatment? Are they also prohibited by my security from crawling any page except the login page? If this is true, what can I do to allow spiders to have access to crawl content but still apply security to regular "human" visitors? My only thought on that is to detect the fact that they are a spider (not sure how to do that though) and not implement security in that case. Thanks for your ideas and thoughts. Feel free to email them to me at dougboude@gmail.com. Doug  :0)

Author:
Adrian Lynch
10/12/2008 05:38 PM

Oooooooo, not sure that's gonna be the best solution.

Firstly, if you let Google crawl the secure content, I'll go to Google and view the cached version of your content.

Secondly, the way to detect if it's Google can be spoofed. It'll make itself known to you via the user agent, which any client can fake. Dump the CGI scope to find that.

Thirdly, I reckon, and I'll need someone else to confirm this, that if you let Google in to index content that's not available to a non-member and Google finds out, it'll penalise you.

Adrian
Building a database of ColdFusion errors at http://cferror.org/
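
To illustrate the second point, here is a minimal sketch in Python (not CFML, purely for brevity) of why a user-agent check is weak. In ColdFusion the same string is available as CGI.HTTP_USER_AGENT. The function name and the sample strings are made up for this example; the check only tells you what the client claims to be, not who it is.

```python
def claims_to_be_googlebot(user_agent: str) -> bool:
    """Return True if the User-Agent header merely *claims* to be Googlebot.

    This is a plain substring test. The header is sent by the client,
    so anyone can type the same text into a script.
    """
    return "googlebot" in user_agent.lower()


# A string in the style Googlebot sends, a hand-typed imitation,
# and an ordinary browser string (all three are illustrative).
genuine_looking = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
hand_typed = "Googlebot/2.1 (sent by a script, not by Google)"
ordinary_browser = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

for ua in (genuine_looking, hand_typed, ordinary_browser):
    # The first two are indistinguishable to this check.
    print(claims_to_be_googlebot(ua), ua)
```

Because the imitation passes just as easily as the real thing, skipping security on the strength of this test would open the site to anyone who copies the string.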

Author:
s. isaac dealey
10/12/2008 05:54 PM

I agree with Ade. My general practice is to not secure reads for anything I want Google to index, and to apply security only on pages where a user is entering some information that might require it. A lot of forum systems do this: you can browse the forum without being logged in, but you have to log in if you want to post something.

--
s. isaac dealey  ^  new epoch
isn't it time for a change?
ph: 781.769.0723
http://onTap.riaforge.org/blog
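
The "open reads, secured writes" practice described above could be sketched roughly as follows, in Python for brevity. Treating GET and HEAD as "reads" is an assumption made for this example, as are the names READ_ONLY_METHODS and login_required; a real application would decide per page which actions count as reads.

```python
# HTTP methods that only read content; treated as public in this sketch.
READ_ONLY_METHODS = {"GET", "HEAD"}


def login_required(method: str, logged_in: bool) -> bool:
    """Return True when the request should be sent to the login page."""
    if method.upper() in READ_ONLY_METHODS:
        # Anyone, including a search engine spider, may browse.
        return False
    # Anything that submits data (e.g. posting to the forum) needs a session.
    return not logged_in
```

With this rule a spider never meets the login page while crawling, and no bot detection is needed at all.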

Author:
Michael Dinowitz
10/12/2008 05:56 PM

There is an option in AdSense to bypass login-based security in order to index pages for ads. While your pages may not have ads on them, using this option guarantees that Google will get through your security.

Author:
Adrian Lynch
10/12/2008 06:10 PM

And not include what it finds in its index?

Adrian

Author:
Michael Dinowitz
10/12/2008 06:36 PM

Correct.

https://www.google.com/adsense/list-auth

"Use this section to allow the AdSense crawler to access pages that are behind a login. Our crawler will access these pages only to determine content for ad targeting purposes and will fully comply with Google's privacy policy <http://www.google.com/privacy.html>."

While they are determining content for ad targeting, they are looking at the page's content. This content will not show up in their index, but logic says that it will be used to affect your ranking. I've seen search results that show a site and, when I click on it, I get the login. This says that Google has indexed past the login.

Author:
Claude Schneegans
10/13/2008 12:46 PM

>>My only thought on that is to detect the fact that they are a spider (not sure how to do that though) and not implement security in that case.

Oops, not a good idea. There are mainly two sorts of spiders: good bots (e.g. Google) and bad bots (e.g. those looking for mail addresses to spam). In neither case should they be reading your pages: good bots because there is no need to index secured pages, and bad bots because they should be banned from any page anyway.

So just let the login page do its work: good bots will never try to submit the login form, and bad bots may try, but with no password they'll be kicked out anyway.

Author:
Matt Robertson
10/13/2008 03:34 PM

Is there a good bot/bad bot list? Not that I would trust it, but it can't hurt to at least look at whether it's feasible to use it as another weapon in the arsenal. I have an IP- and bot-identifying based system that works pretty well, but I'm always up for newer and better info.

--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com

Author:
Claude Schneegans
10/13/2008 04:31 PM

>>is there a good bot/bad bot list?

Not that I know of. Anyway, one cannot rely on user agents, which can be faked so easily. Personally, I let just a few known bots in, based on the IP address, the only parameter that cannot be faked. For every other request, I have some tools that automatically analyze every visitor according to criteria such as:
- Does it read robots.txt?
- Does it fall into some robot trap?
- Does it read robots.txt but read forbidden pages anyway?
- Does it request pages at too high a rate?
- Does it read JavaScript but not execute it?
- Does it skip reading the CSS?
- Does it clearly identify itself in the user agent or not?
etc...

... and of course, the presence of DECLARE or http in URLs is the first test ;-)

>>I have an IP- and bot-identifying based system that works pretty well but I'm always up for newer and better

Such a system can only identify good bots for sure, but not bad bots and fakes. And the problem is not with good bots, but with bad guys. I also have a white list and a black list, but their only purpose is to bypass the rest of the tests.
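
As a rough illustration of the approach described above (not Claude's actual tooling, which the post does not show), here is a Python sketch that checks the white list and black list first and only then looks at behaviour. The field names, the 60-requests-per-minute threshold, and the rule that any single signal means "block" are all assumptions for this example. The CSS and user-agent criteria are left out to keep it short.

```python
from dataclasses import dataclass

# Assumed threshold; the post only says "too high a rate".
MAX_REQUESTS_PER_MINUTE = 60


@dataclass
class VisitorStats:
    """Per-visitor observations, assumed to be collected elsewhere."""
    ip: str
    read_robots_txt: bool = False
    read_forbidden_page: bool = False
    hit_robot_trap: bool = False
    requests_per_minute: float = 0.0
    fetched_javascript: bool = False
    executed_javascript: bool = False


def bad_bot_signals(v: VisitorStats) -> list:
    """Return a description of every suspicious behaviour observed."""
    signals = []
    if v.hit_robot_trap:
        signals.append("fell into a robot trap")
    if v.read_robots_txt and v.read_forbidden_page:
        signals.append("read robots.txt but fetched a forbidden page")
    if v.requests_per_minute > MAX_REQUESTS_PER_MINUTE:
        signals.append("request rate too high")
    if v.fetched_javascript and not v.executed_javascript:
        signals.append("fetched JavaScript without executing it")
    return signals


def classify(v: VisitorStats, whitelist: set, blacklist: set) -> str:
    """Return "allow" or "block"; the lists only bypass the behaviour tests."""
    if v.ip in whitelist:
        return "allow"
    if v.ip in blacklist:
        return "block"
    return "block" if bad_bot_signals(v) else "allow"
```

For example, `classify(VisitorStats(ip="203.0.113.7", hit_robot_trap=True), set(), set())` blocks the visitor, while the same statistics from a white-listed address are allowed without running the tests.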

Author:
Claude Schneegans
10/13/2008 05:01 PM

>>Not as I know, anyway, one cannot rely on user agents which can be faked so easily.

Just to illustrate this: as I was writing my last message, I received a notice from my server reporting a new bad bot detected. Its user agent is
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; SpamBlockerUtility 10.2.217.0)"
and it was trapped because
"p=releases';DECLARE%20@S%20CHAR(4000);SET%20@S=CAST(0x4445434C4152452040........"
was found in the URL.

I just wonder what this "SpamBlockerUtility" is supposed to block ;-)
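
A test of the "DECLARE or http in the URL" kind might look something like the Python sketch below. It is only a guess at the idea, not the code that trapped this request: the function name and the exact pattern are assumptions, and the check will also flag legitimate parameters that happen to carry a URL.

```python
import re
from urllib.parse import unquote

# Whole word "declare", or "http" anywhere, in any letter case (assumed pattern).
SUSPICIOUS = re.compile(r"\bdeclare\b|http", re.IGNORECASE)


def query_looks_hostile(query_string: str) -> bool:
    """Return True if the URL-decoded query string matches the pattern."""
    # Decode %20 and friends first so encoded payloads are seen as plain text.
    return bool(SUSPICIOUS.search(unquote(query_string)))


# The (truncated) query string quoted in the post above.
reported = "p=releases';DECLARE%20@S%20CHAR(4000);SET%20@S=CAST(0x4445434C4152452040"
print(query_looks_hostile(reported))
print(query_looks_hostile("p=releases"))
```

Running a test like this before anything else also keeps such requests out of later log analysis.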

Author:
Matt Robertson
10/13/2008 05:45 PM

Oh yes... I found out early on that I would get all bent out of shape and cranky doing log analysis without filtering out that declare stuff before it hit the logs in the first place. I figured there was no point in relying on user agent info, but wanted to see if anyone had anything that I might pick over.

Good old FunWebProducts ... a.k.a. "I Am a Moron" ...

--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com

Author:
Larry Juncker
10/13/2008 05:56 PM

Looks to me as though it is blocking SQL injection attacks....

Author:
Claude Schneegans
10/13/2008 06:19 PM

>>Looks to me as though it is blocking SQL injection attacks....

It doesn't block anything, it SENDS SQL injection attacks! MY application blocked it. ;-)

