robot text file

  • TheNewOne
    Senior Member
    • Aug 2011
    • 1033
    • 4.2.5

    [Forum] robot text file

    What is best to put in here? This is what I have for now:
    Code:
    #****************************************************************************
    # robots.txt
    #     : Robots, spiders, and search engines use this file to determine which
    #       content they should *not* crawl while indexing your website.
    #     : This system is called "The Robots Exclusion Standard."
    #     : It is strongly encouraged to use a robots.txt validator to check
    #       for valid syntax before any robots read it!
    #
    # Examples:
    #
    # Instruct all robots to stay out of the admin area.
    #     : User-agent: *
    #     : Disallow:   /admin/
    #     : Disallow:   /moderator/
    #
    # Restrict Google and MSN from indexing your images.
    #     : User-agent: Googlebot
    #     : Disallow:   /images/
    #     : User-agent: MSNBot
    #     : Disallow:   /images/
    #****************************************************************************
    User-agent: *
    Disallow:
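For reference, the two active lines in the file above (an empty `Disallow:` under `User-agent: *`) block nothing at all. A minimal sketch of how that can be checked with Python's standard `urllib.robotparser`; the bot name `ExampleBot` and the helper `is_allowed` are made up for illustration, and real crawlers may interpret files differently from this parser:

```python
from urllib.robotparser import RobotFileParser

# The two active lines from the first post; the commented banner is omitted.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical crawler name.
    for path in ("/", "/admin/", "/search.php"):
        print(path, is_allowed(ALLOW_ALL, "ExampleBot", path))
```

Every path comes back as allowed, so this file only becomes useful once real `Disallow` paths are added.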
  • Wayne Luke
    vBulletin Technical Support Lead
    • Aug 2000
    • 74123

    #2
    Mine contains the following:

    Code:
    User-agent: Baiduspider 
    Disallow: /
    
    User-agent: BoardTracker
    Disallow: /
    
    User-agent: Gigabot
    Disallow: /
    
    User-agent: Twiceler
    Disallow: /
    
    User-agent: Slurp
    Crawl-delay: 2
    
    User-agent: msnbot
    Crawl-delay: 2
    
    User-agent: *
    Disallow: *.js
    Disallow: /clientscript/
    Disallow: /cpstyles/
    Disallow: /customavatars/
    Disallow: /customprofilepics/
    Disallow: /images/
    Disallow: /ajax.php
    Disallow: /attachment.php
    Disallow: /calendar.php
    Disallow: /cron.php
    Disallow: /editpost.php
    Disallow: /global.php
    Disallow: /image.php
    Disallow: /inlinemod.php
    Disallow: /joinrequests.php
    Disallow: /login.php
    Disallow: /member.php
    Disallow: /memberlist.php
    Disallow: /misc.php
    Disallow: /moderator.php
    Disallow: /newattachment.php
    Disallow: /newreply.php
    Disallow: /newthread.php
    Disallow: /online.php
    Disallow: /poll.php
    Disallow: /post.php
    Disallow: /postings.php
    Disallow: /printthread.php
    Disallow: /private.php
    Disallow: /profile.php
    Disallow: /register.php
    Disallow: /report.php
    Disallow: /reputation.php
    Disallow: /search.php
    Disallow: /sendmessage.php
    Disallow: /showgroups.php
    Disallow: /showpost.php
    Disallow: /subscription.php
    Disallow: /threadrate.php
    Disallow: /usercp.php
    Disallow: /usernote.php
    Translations provided by Google.

    Wayne Luke
    The Rabid Badger - a vBulletin Cloud demonstration site.
    vBulletin 5 API
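A quick way to see what a file like the one in post #2 does is to run a few checks against it before uploading. This is only a sketch using Python's standard `urllib.robotparser` on an abridged copy of the rules; `ExampleBot` and `/index.php` are hypothetical names chosen for the test, and this parser does not understand wildcard rules such as `*.js`, so that line is left out:

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the file in post #2 (only a few of the Disallow lines).
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: Slurp
Crawl-delay: 2

User-agent: *
Disallow: /clientscript/
Disallow: /search.php
Disallow: /member.php
"""


def load(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = load(ROBOTS_TXT)
    # "ExampleBot" and "/index.php" are hypothetical, used only for the check.
    print("Baiduspider /index.php:", parser.can_fetch("Baiduspider", "/index.php"))
    print("ExampleBot /search.php:", parser.can_fetch("ExampleBot", "/search.php"))
    print("ExampleBot /index.php:", parser.can_fetch("ExampleBot", "/index.php"))
    print("Slurp crawl delay:", parser.crawl_delay("Slurp"))
```

Baiduspider is refused everywhere, an unnamed bot is refused only on the listed paths, and Slurp gets the 2-second delay.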

    Comment

    • TheNewOne
      Senior Member
      • Aug 2011
      • 1033
      • 4.2.5

      #3
      If I install your file, will it cause any problems anywhere?

      Comment

      • setishock
        Senior Member
        • Jun 2005
        • 1334
        • 4.2.x

        #4
        It shouldn't, as the robots.txt file is just a roadmap for the bots. It lets them know what's off limits.
        If you'd like to get the skinny on how to put one together, there's Google. Just pop in the search term robots.txt.
        ...

        Comment

        • TheNewOne
          Senior Member
          • Aug 2011
          • 1033
          • 4.2.5

          #5
          Say I put it like this. What do you think? Or can you or anyone else think of a better one, or a better way to put it?
          Code:
          #****************************************************************************
          # robots.txt
          #     : Robots, spiders, and search engines use this file to determine which
          #       content they should *not* crawl while indexing your website.
          #     : This system is called "The Robots Exclusion Standard."
          #     : It is strongly encouraged to use a robots.txt validator to check
          #       for valid syntax before any robots read it!
          #
          # Examples:
          #
          # Instruct all robots to stay out of the admin area.
          #     : User-agent: *
          #     : Disallow:   /admin/
          #     : Disallow:   /moderator/
          #
          # Restrict Google and MSN from indexing your images.
          #     : User-agent: Googlebot
          #     : Disallow:   /images/
          #     : User-agent: MSNBot
          #     : Disallow:   /images/
          #****************************************************************************
          User-agent: Baiduspider 
          Disallow: /
          User-agent: BoardTracker
          Disallow: /
          User-agent: Gigabot
          Disallow: /
          User-agent: Twiceler
          Disallow: /
          User-agent: Slurp
          Crawl-delay: 2
          User-agent: msnbot
          Crawl-delay: 2
          User-agent: *
          User-agent: *
          Disallow: *.js
          Disallow: /clientscript/
          Disallow: /cpstyles/
          Disallow: /customavatars/
          Disallow: /customprofilepics/
          Disallow: /images/
          Disallow: /ajax.php
          Disallow: /attachment.php
          Disallow: /calendar.php
          Disallow: /cron.php
          Disallow: /editpost.php
          Disallow: /global.php
          Disallow: /image.php
          Disallow: /inlinemod.php
          Disallow: /joinrequests.php
          Disallow: /login.php
          Disallow: /member.php
          Disallow: /memberlist.php
          Disallow: /misc.php
          Disallow: /moderator.php
          Disallow: /newattachment.php
          Disallow: /newreply.php
          Disallow: /newthread.php
          Disallow: /online.php
          Disallow: /poll.php
          Disallow: /post.php
          Disallow: /postings.php
          Disallow: /printthread.php
          Disallow: /private.php
          Disallow: /profile.php
          Disallow: /register.php
          Disallow: /report.php
          Disallow: /reputation.php
          Disallow: /search.php
          Disallow: /sendmessage.php
          Disallow: /showgroups.php
          Disallow: /showpost.php
          Disallow: /subscription.php
          Disallow: /threadrate.php
          Disallow: /usercp.php
          Disallow: /usernote.php

          Comment

          • Wayne Luke
            vBulletin Technical Support Lead
            • Aug 2000
            • 74123

            #6
            That would work. You can always adjust it later. Just a note: Baidu is China's answer to Google. I don't target China, and that spider was eating all my bandwidth, so I don't need my site indexed on their search engine. If you want Chinese visitors, remove the line with Baiduspider in it and the line after it:


            User-agent: Baiduspider
            Disallow: /

            You can probably search here and on Google for other examples.

            Comment

            • TheNewOne
              Senior Member
              • Aug 2011
              • 1033
              • 4.2.5

              #7
              thanks will try that

              Comment

              • setishock
                Senior Member
                • Jun 2005
                • 1334
                • 4.2.x

                #8
                I did some diving into the world of Baidu's spiders and came up with some interesting tidbits.
                They only check robots.txt once every 2 days, when the DNS for their server updates.
                The 119 and 123 IP ranges are not registered with ARIN. This leads to htaccess IP errors when trying to block the IPs from cPanel.
                Sneaky little s***s...
                My email to the company has not been responded to as of yet. But if they remain true to form they should soon.
                ...

                Comment

                • TheNewOne
                  Senior Member
                  • Aug 2011
                  • 1033
                  • 4.2.5

                  #9
                  This is what I have now, and even with it set up like this, bots still go to some of the pages in this list. Do I have it wrong in some way?
                  Code:
                  robots.txt
                  Robots, spiders, and search engines use this file to determine which
                  content they should *not* crawl while indexing your website.
                  This system is called "The Robots Exclusion Standard."
                  It is strongly encouraged to use a robots.txt validator to check
                  for valid syntax before any robots read it!
                  Examples:
                  Restrict Google and MSN from indexing your images.
                  User-agent: Googlebot
                  Disallow: /images/
                  User-agent: MSNBot
                  Disallow: /images/
                  User-agent: Baidu Spider
                  Disallow: /images/
                  
                  User-agent: BoardTracker
                  Disallow: /
                  User-agent: Gigabot
                  Disallow: /
                  User-agent: Twiceler
                  Disallow: /
                  User-agent: Slurp
                  Crawl-delay: 2
                  User-agent: msnbot
                  Crawl-delay: 2
                  User-agent: *
                  User-agent: *
                  Disallow: *.js
                  Disallow: /admin/
                  Disallow: /moderator/
                  Disallow: /clientscript/
                  Disallow: /cpstyles/
                  Disallow: /customavatars/
                  Disallow: /customprofilepics/
                  Disallow: /images/
                  Disallow: /forum/ajax.php
                  Disallow: /forum/attachment.php
                  Disallow: /forum/album.php
                  Disallow: /forum/thanks.php
                  Disallow: /forum/thanks.php?do=statistics
                  Disallow: /forum/calendar.php
                  Disallow: /forum/cron.php
                  Disallow: /forum/editpost.php
                  Disallow: /forum/global.php
                  Disallow: /forum/image.php
                  Disallow: /forum/inlinemod.php
                  Disallow: /forum/joinrequests.php
                  Disallow: /forum/login.php
                  Disallow: /forum/member.php
                  Disallow: /forum/memberlist.php
                  Disallow: /forum/misc.php
                  Disallow: /forum/private.php
                  Disallow: /forum/online.php
                  Disallow: /forum/online.php?s=
                  Disallow: /forum/moderator.php
                  Disallow: /forum/newattachment.php
                  Disallow: /forum/newreply.php
                  Disallow: /forum/newthread.php
                  Disallow: /forum/poll.php
                  Disallow: /forum/post.php
                  Disallow: /forum/postings.php
                  Disallow: /forum/printthread.php
                  Disallow: /forum/private.php
                  Disallow: /forum/profile.php
                  Disallow: /forum/register.php
                  Disallow: /forum/report.php
                  Disallow: /forum/reputation.php
                  Disallow: /forum/search.php
                  Disallow: /forum/sendmessage.php
                  Disallow: /forum/showgroups.php
                  Disallow: /forum/subscription.php
                  Disallow: /forum/showpost.php
                  Disallow: /forum/threadrate.php
                  Disallow: /forum/usercp.php
                  Disallow: /forum/usernote.php
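Two things in the file in post #9 look like candidates for trouble: the banner lines have lost their leading `#`, so they are no longer comments (and the Googlebot/MSNBot examples become live rules), and `Baidu Spider` is written with a space, while the token used elsewhere in this thread is `Baiduspider`. A small, hedged sketch of a line checker that flags both patterns; `lint_robots` and its list of known fields are my own illustration, not an official validator:

```python
# Field names this sketch accepts; anything else is reported.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}


def lint_robots(text: str) -> list[str]:
    """Return human-readable warnings for suspicious robots.txt lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop trailing comments, then skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if not sep or field not in KNOWN_FIELDS:
            problems.append(
                f"line {number}: not a directive or comment: {raw.strip()!r}"
            )
        elif field == "user-agent" and " " in value:
            problems.append(
                f"line {number}: user-agent token contains a space: {value!r}"
            )
    return problems


if __name__ == "__main__":
    sample = """\
robots.txt
Examples:
User-agent: Baidu Spider
Disallow: /images/
# a real comment
User-agent: *
Disallow: /forum/search.php
"""
    for problem in lint_robots(sample):
        print(problem)
```

Running it on the sample reports the two banner lines and the `Baidu Spider` token. It does not prove those are the reason bots still reach the listed pages; bots can also simply ignore robots.txt or work from a cached copy.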

                  Comment

                  • setishock
                    Senior Member
                    • Jun 2005
                    • 1334
                    • 4.2.x

                    #10
                    Under the msnbot crawl delay you have "User-agent: *" twice. If any of the bots get to be a real pest and you have cPanel, you can go into the IP Deny Manager and put the IP in. It creates an htaccess file and parks it in the right location. I did that for the Baidu bots and stopped them from getting in altogether.
                    ...
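For anyone without cPanel, the same kind of block can be written by hand. The sketch below only generates the text of older Apache 2.2-style `deny from` lines; that cPanel writes something similar is my assumption, the addresses are reserved documentation ranges used as placeholders (not Baidu's real ranges), and Apache 2.4 setups may need the newer `Require` syntax instead:

```python
import ipaddress


def deny_lines(networks):
    """Build Apache 2.2-style .htaccess lines denying the given IPs or CIDR ranges.

    Raises ValueError if an entry is not a valid IP address or network.
    """
    lines = ["order allow,deny", "allow from all"]
    for item in networks:
        # strict=False lets "203.0.113.7/24" normalise to "203.0.113.0/24".
        network = ipaddress.ip_network(item, strict=False)
        lines.append(f"deny from {network.with_prefixlen}")
    return lines


if __name__ == "__main__":
    # Placeholder addresses from the reserved documentation ranges.
    print("\n".join(deny_lines(["203.0.113.7", "198.51.100.0/24"])))
```

Validating each entry first avoids pasting a malformed address into .htaccess, which can otherwise take the whole site down with a server error.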

                    Comment

                    • TheNewOne
                      Senior Member
                      • Aug 2011
                      • 1033
                      • 4.2.5

                      #11
                      I removed the second User-agent line, but they are still going to links in the file that they should not be visiting.
                      Last edited by TheNewOne; Sat 27 Aug '11, 11:38am.

                      Comment

                      • John Lester
                        Senior Member
                        • Jul 2000
                        • 412
                        • 4.1.x

                        #12
                        I'm curious why you don't have /admincp/ and /modcp/ on the disallow list. This is what I'm currently using for my robots.txt file. It's only been in place a few days and has stopped the Baidu spider, though one of the robots.txt check programs doesn't like the fact that some individual agents have been named and the all-inclusive * is used as well *shrug*

                        Code:
                        User-agent: Baiduspider 
                        Disallow: /
                        
                        User-agent: BoardTracker
                        Disallow: /
                        
                        User-agent: Gigabot
                        Disallow: /
                        
                        User-agent: Twiceler
                        Disallow: /
                        
                        User-agent: Slurp
                        Crawl-delay: 10
                        
                        User-agent: msnbot
                        Crawl-delay: 10
                        
                        User-agent: *
                        Disallow: *.js
                        Disallow: /admincp/
                        Disallow: /chat/
                        Disallow: /clientscript/
                        Disallow: /cpstyles/
                        Disallow: /customavatars/
                        Disallow: /customprofilepics/
                        Disallow: /customgroupicons/
                        Disallow: /images/
                        Disallow: /install/
                        Disallow: /modcp/
                        Disallow: /signaturepics/
                        Disallow: /ajax.php
                        Disallow: /attachment.php
                        Disallow: /calendar.php
                        Disallow: /cron.php
                        Disallow: /editpost.php
                        Disallow: /global.php
                        Disallow: /image.php
                        Disallow: /inlinemod.php
                        Disallow: /joinrequests.php
                        Disallow: /login.php
                        Disallow: /member.php
                        Disallow: /memberlist.php
                        Disallow: /misc.php
                        Disallow: /moderator.php
                        Disallow: /newattachment.php
                        Disallow: /newreply.php
                        Disallow: /newthread.php
                        Disallow: /online.php
                        Disallow: /poll.php
                        Disallow: /post.php
                        Disallow: /postings.php
                        Disallow: /printthread.php
                        Disallow: /private.php
                        Disallow: /profile.php
                        Disallow: /register.php
                        Disallow: /report.php
                        Disallow: /reputation.php
                        Disallow: /search.php
                        Disallow: /sendmessage.php
                        Disallow: /showgroups.php
                        Disallow: /showpost.php
                        Disallow: /subscription.php
                        Disallow: /threadrate.php
                        Disallow: /usercp.php
                        Disallow: /usernote.php
                        BrainTalk is a support group for friends, family, caregivers, and patients with neurological disorders and other health related diagnosis.

                        BrainTalk Communities Inc
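Whatever the checker was objecting to, one consequence of mixing named agents with `*` is worth knowing: a crawler that finds a group with its own name uses that group only and does not also apply the `*` rules. A minimal sketch with Python's standard `urllib.robotparser`, using a cut-down version of the file above; `ExampleBot` and the helper `can_fetch` are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Cut-down version of the file in post #12.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: /search.php
"""


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Slurp matches its own group, which has a delay but no Disallow lines.
    print("Slurp:", can_fetch(ROBOTS_TXT, "Slurp", "/search.php"))
    # "ExampleBot" is a hypothetical crawler that falls through to "*".
    print("ExampleBot:", can_fetch(ROBOTS_TXT, "ExampleBot", "/search.php"))
```

Here Slurp is allowed to fetch /search.php while the unnamed bot is not. If Slurp and msnbot should obey the long Disallow list too, the list would need repeating under their own groups.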

                        Comment

                        • Zachery
                          Former vBulletin Support
                          • Jul 2002
                          • 59097

                          #13
                          Why tell ANYONE where that is?

                          Comment

                          • Wayne Luke
                            vBulletin Technical Support Lead
                            • Aug 2000
                            • 74123

                            #14
                            Originally posted by John Lester
                            I'm curious why you don't have /admincp/ and /modcp/ on the disallow list?
                            1) If your AdminCP and ModCP are protected so a bot gets a 401 Unauthorized or 403 Forbidden error, you don't need them in the list. My AdminCP is restricted to my IP address and then requires a username and a 40-character password. The ModCP requires a username and a 40-character password.

                            2) Bots should never see links to those areas.

                            Comment

                            • Zachery
                              Former vBulletin Support
                              • Jul 2002
                              • 59097

                              #15
                              On top of what Wayne said: there shouldn't normally ever be a link to those pages, so bots are not going to find them. But you're telling the world where they are by having them in your robots.txt.

                              Comment


                              • ar15dcm
                                ar15dcm commented
                                What about the footer links?

                              • Wayne Luke
                                Wayne Luke commented
                                The footer links for the AdminCP and ModCP do not show for guest users. They show if the user has permission to access them. Guests do not. You should not have anything in your robots.txt referring to the AdminCP or ModCP. Doing so just gives hackers more ammunition.
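To make the point about disclosure concrete: robots.txt is a public file, so anyone can read the paths listed in it. The sketch below just pulls out the `Disallow` paths from some robots.txt text; `disclosed_paths` is a hypothetical helper written for this illustration:

```python
def disclosed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path found in the robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        # Ignore comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    sample = """\
User-agent: *
Disallow: /admincp/
Disallow: /modcp/
Disallow:
"""
    # Prints the control panel locations exactly as a visitor could read them.
    print(disclosed_paths(sample))
```

Anything that shows up in that list is being advertised, which is why protecting the control panels with access restrictions and passwords is the better approach than listing them.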