How to block offline browsers A.K.A site rippers

  • jdelasko
    New Member
    • Jan 2007
    • 4
    • 3.5.x

    How to block offline browsers A.K.A site rippers

    There are a lot of offline browsers that will sit on your site and download every single file. You don't need one of these on your site sucking up your bandwidth and slowing your site down for legitimate users. Here's how to deal with them:

    Put the following lines in a .htaccess file in your site root directory (this requires Apache with mod_rewrite enabled). They will return a 403 (Forbidden) error to ANY IP address using one of these offline browsers. The list below covers most currently known offline browsers:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    RewriteRule ^.* - [F,L]



    Replace the last line above with this optional RewriteRule and any offline browser will be redirected to the site you specify. This is handy if you simply want them off your site immediately. For instance, you can redirect them to that site that emails you all that prescription drug spam:


    RewriteRule /*$ http://www.yourdomain.com [L,R]


    Just replace the 'yourdomain.com' with whatever web site you choose and the site ripper will be redirected there.

    Edit: in my original post, somehow a lone formatting tag, [/small], snuck in at the end of the above line of code. Sorry about that.... remove it or the code won't work... you'll get an internal server error. The code above is all correct and tested.

    Below is the RewriteRule I am currently using on my site:

    RewriteRule /*$ http://english-111745432354.spampoison.com [L,R]


    This rule redirects these bots to a site that will provide them with an endless supply of dynamically generated fake email addresses so that the user is provided with a gigantic collection of useless email addresses that has no commercial value. It's an excellent line to use if you want to waste a lot of their time.
    Last edited by jdelasko; Sat 27 Oct '07, 8:13am.
  • devilsown
    Member
    • Aug 2006
    • 80
    • 3.6.x

    #2
    Great post
    Devilsown Water Injection
    Grand prix forum

    • Dean C
      Senior Member
      • Mar 2002
      • 4571
      • 3.5.x

      #3
      Nice post indeed, though your rules could be optimized quite a lot. Also, wouldn't it be nice if all these site rippers delivered a custom user-agent like the "honest" ones in your list?
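
      As a rough idea of what that optimization might look like, the sketch below groups the user-agent names into regex alternations so that only two conditions are evaluated. It is abbreviated to a handful of names from the original list (the rest would be added the same way), and `RewriteRule ^ - [F]` is swapped in as a shorter equivalent of the original final rule, since `[F]` already implies `[L]`. This is not a tested drop-in replacement for the full list.

      ```apache
      RewriteEngine On
      # Names that must appear at the start of the User-Agent string,
      # grouped into a single alternation (abbreviated list).
      RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|ChinaClaw|Download\ Demon|GetWeb!|Teleport\ Pro|WebZIP|Wget|Zeus) [OR]
      # Names matched anywhere in the string, case-insensitively,
      # as in the original HTTrack and Indy Library conditions.
      RewriteCond %{HTTP_USER_AGENT} (HTTrack|Indy\ Library) [NC]
      # Return 403 Forbidden for every matching request.
      RewriteRule ^ - [F]
      ```

      As in the original rules, the last RewriteCond carries no [OR] flag, and spaces inside a pattern are escaped with a backslash.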
      Dean Clatworthy - Web Developer/Designer

      • Reece^B
        Senior Member
        • Jan 2006
        • 290
        • 3.6.x

        #4
        What's the point of site rippers? Is it to slow down your site on purpose, or is it for archive purposes like waybackmachine.org?

        • jdelasko
          New Member
          • Jan 2007
          • 4
          • 3.5.x

          #5
          Originally posted by Reece^B
          What's the point of site rippers? Is it to slow down your site on purpose, or is it for archive purposes like waybackmachine.org?
          Most of the software companies that provide offline browsers market them by saying something like "Download entire websites and view them at your convenience."

          The truth is, offline browsers are primarily used by hackers and spammers. These people use these tools to download every single file they can from your website in an attempt to gather any personal information they can, or simply email addresses. Any of these companies that say these tools are for legitimate purposes are just blowing smoke. For instance, what good would downloading an entire vBulletin site be to the average user? The average user isn't going to set up a local PHP server to be able to view the site, which is based on PHP, and besides, without the SQL databases the files themselves are useless. The main things site rippers are after include email addresses, photos and multimedia, or any other personal information they can get their hands on.
          Last edited by jdelasko; Sat 27 Oct '07, 8:02am.

          • noppid
            Senior Member
            • May 2003
            • 625
            • 2.3.2

            #6
            jdelasko, actually, they get the site delivered to them in static HTML and can publish it and it will work fine. It just won't be dynamic or update unless they pull the updates and add them too.

            The threat is real. I use geo IP and IP blocking of rogue datacenters to do the same. Since I have done this, the site is much better off.
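
            For the IP-blocking part, a minimal sketch might look like the following. It assumes Apache 2.4 with mod_authz_core and an AllowOverride setting that permits these directives in .htaccess. The range 203.0.113.0/24 is a reserved documentation range used here as a hypothetical placeholder for a real datacenter range. The actual configuration used on the site mentioned above is not shown in this thread, so treat this as an illustration only.

            ```apache
            # Allow everyone except clients in the listed range.
            # 203.0.113.0/24 is a placeholder, not a real offender.
            <RequireAll>
                Require all granted
                Require not ip 203.0.113.0/24
            </RequireAll>
            ```

            Unlike the user-agent rules, this still works when a ripper fakes its user-agent string, but it only helps for address ranges you already know about.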
            Computer Help Forum
            An informed rider makes their first destination the motorcycle forum at rider info.

            • jdelasko
              New Member
              • Jan 2007
              • 4
              • 3.5.x

              #7
              Originally posted by noppid
              jdelasko, actually, they get the site delivered to them in static HTML and can publish it and it will work fine. It just won't be dynamic or update unless they pull the updates and add them too.

              The threat is real. I use geo IP and IP blocking of rogue datacenters to do the same. Since I have done this, the site is much better off.

              Actually, they will download and store just about any file type.

              • Darkblade
                Senior Member
                • Jul 2004
                • 690
                • 3.6.x

                #8
                Thanks for the post!

                About the last line, do I have to replace "RewriteRule ^.* - [F,L]" with "RewriteRule /*$ http://english-111745432354.spampoison.com [L,R]"? I'm kinda confused here and wanted to make sure.
                Metal Gear Forums - Discussion on the popular series of computer and console stealth-based games.

                My Mods: Coming Soon | My Tutorials: Coming Soon

                • ChrisLM2001
                  Senior Member
                  • May 2003
                  • 1451
                  • 3.6.x

                  #9
                  Originally posted by jdelasko
                  Any of these companies that say these tools are for legitimate purposes are just blowing smoke. For instance, what good would downloading an entire vbulletin site be to the average user?
                  Yeah, the average user won't be downloading a whole site, but a designer might, if they're sick of a site with seizure ridden Flash ads, and a DTD and markup from the Stone Age.

                  Somewhere out there is a program that allows redesigning sites based on your specifications (more so than what browsers allow with a simple stylesheet replacement) -- that proggie's name escapes me, but it is also a good alternative.

                  I use a site archiver on a couple of sites, mainly because I feel those sites won't be around much longer (this is especially true of videogame sites, which seem to vanish within 3 years after a game is released). These tools spider the site to your specifications (including deep linking), and you can choose whether you want an exact copy or a modified one (like removing the ads).
                  "Anyone who conducts an argument by appealing to Authority
                  is not using his intelligence, he is just using his memory."
                  ~~~
                  Leonardo da Vinci

                  • jdelasko
                    New Member
                    • Jan 2007
                    • 4
                    • 3.5.x

                    #10
                    Originally posted by Darkblade
                    Thanks for the post!

                    About the last line, do I have to replace "RewriteRule ^.* - [F,L]" with "RewriteRule /*$ http://english-111745432354.spampoison.com [L,R]"? I'm kinda confused here and wanted to make sure.

                    You can use either line but NOT both. The two rewrite rules just do different things.


                    RewriteRule ^.* - [F,L]

                    If used, this will simply return an HTTP 403 (Forbidden) error for every page that the site ripper requests. The site ripper may spend a little time on your site, but will give up after enough 403 errors.


                    RewriteRule /*$ http://english-111745432354.spampoison.com [L,R]

                    This one does something a little more nasty. As soon as one of the site rippers shows up on your site, it will be redirected immediately to the URL specified in this line. The particular URL I use, *.spampoison.com, is a website that will recognize the visitor as a site ripper and provide it with an endless supply of dynamically generated pages that are full of dynamically generated, useless email addresses. The site ripper thinks it has struck gold, but will produce an email list that is useless and of no commercial value.

                    This last Rewrite Rule is just a way of fighting back at these bandwidth thieves. You can use any url you want in this last rule. Perhaps there's a site that sends you tons of annoying email spam, that you'd like to redirect site rippers to.
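
                    To make the either-or choice concrete, here is a sketch of how the end of the .htaccess file might look. The condition list is abbreviated to two entries (the full list from the first post goes in its place), and www.example.com is a placeholder URL, not a recommendation. Only one of the two RewriteRule lines should be active at a time, because the RewriteCond lines apply only to the first RewriteRule that follows them.

                    ```apache
                    RewriteEngine On
                    # Abbreviated condition list; the last condition has no [OR].
                    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
                    RewriteCond %{HTTP_USER_AGENT} ^Wget

                    # Option 1: answer every matching request with 403 Forbidden.
                    RewriteRule ^.* - [F,L]

                    # Option 2: redirect instead. To use it, delete the Option 1
                    # rule above and uncomment the line below (placeholder URL).
                    # RewriteRule /*$ http://www.example.com/ [L,R]
                    ```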
