This topic is closed.

  • #31

    I just read on this thread that BoardReader used over 300 MB of bandwidth in spidering their forum.

    For those of us with very large forums, this could amount to quite a bit. I'm guessing they spider often, to keep their content current.

    Does anyone have any info on this?


    • #32

      That was correct a long time ago. Like I always say, we are always trying to improve our crawler to make better use of, and reduce, the amount of bandwidth we need to index a site. One of our solutions for large sites is the ability to crawl a forum back only a designated number of days. For example, when we crawled fanhome for the very first time (over five million posts), we only crawled back 10 days. The first crawl is always the most intensive, of course, because each crawl after that runs only once per day and goes back only one day. There are other savings that help not only big sites but also small ones (we don't crawl HTML, which is also a large saving). We might start crawling some HTML so we can add it to our cache, so that we don't seem to be taking the copyright of the pages. Hope this helps.
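A minimal sketch of the look-back window described above, assuming the crawler has a list of threads with a last-post timestamp. The 10-day and 1-day figures come from the post; the function names, the `last_post` field, and the idea of filtering a thread list are illustrative assumptions, not BoardReader's actual crawler.

```python
from datetime import datetime, timedelta

FIRST_CRAWL_DAYS = 10  # look-back quoted in the post for a first crawl
DAILY_CRAWL_DAYS = 1   # look-back quoted in the post for each later daily crawl


def crawl_cutoff(now, is_first_crawl):
    """Return the oldest last-post time the crawler would still fetch."""
    days = FIRST_CRAWL_DAYS if is_first_crawl else DAILY_CRAWL_DAYS
    return now - timedelta(days=days)


def threads_to_fetch(threads, now, is_first_crawl):
    """Keep only threads whose last post falls inside the look-back window.

    `threads` is a hypothetical list of dicts with a `last_post` datetime.
    """
    cutoff = crawl_cutoff(now, is_first_crawl)
    return [t for t in threads if t["last_post"] >= cutoff]


if __name__ == "__main__":
    now = datetime(2004, 3, 23, 12, 0)
    threads = [
        {"id": 1, "last_post": now - timedelta(days=2)},
        {"id": 2, "last_post": now - timedelta(hours=5)},
        {"id": 3, "last_post": now - timedelta(days=30)},
    ]
    print([t["id"] for t in threads_to_fetch(threads, now, is_first_crawl=True)])
    print([t["id"] for t in threads_to_fetch(threads, now, is_first_crawl=False)])
```

The point of the window is that bandwidth scales with recent activity rather than with the total size of the archive, which is why a five-million-post board does not have to be fetched in full.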



      • #33
        Originally posted by spurdon

        Great ideas on the site submissions and the cache layouts. I'm not sure how much work that would be, though. As far as our crawling goes, we try to crawl only open, public sites that have been submitted to us, but sometimes it is hard to verify who is submitting those sites. We do, however, respect robots.txt, so hopefully that helps some. I still think that maybe closing the cache would be the best way to stay away from harming people's copyright. We will see what happens with that. I'm going to try a few different things with the cache to see if that solves some of the problems.

        I just submitted your site. It will take about a week before you can see your site on Boardreader. I am backlogged with site submissions now, so when I get to yours you may find an email telling you it has been done again.

        Thanks for the comments,

        hey scott,

        Any idea what's happening with the spiderings? It's been 8 days now and not a single visit from Mr. BoardBot. Not that I'm in a big hurry to get my site spidered or anything, just wondering what the latest news is and when I can expect to be spidered.




        • #34

          Sorry for the delay. I have been out of town and I have to enter them into our spider. I will do this today and you should be indexed within the next day or two. Let me know if it doesn't get done. Thanks,



          • #35
            Recently I've been blocking various iffy spiders from my site to protect the privacy of my users (prevent Google caching images from our site, prevent e-mail harvesters, etc., but permit Inktomi, Yahoo and Google to spider main content (but not cache it)).

            This is all very cool and is working well.

            Though today a user agent appeared that I'd not seen before:

            I also took note because the bot didn't identify itself clearly: there was no URL to the bot info (for example, NPBot clearly has a link that advises on what it is, how to remove it, etc. Yahoo Slurp and the others all generally identify themselves clearly).

            I don't really have a problem with Boardreader spidering my pages, but I would like to be able to opt out of the cache, through either an HTTP header or a META tag.
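To make the request above concrete, here is a rough sketch of how a crawler could check for a cache opt-out before storing a page. It assumes the widely used `noarchive` directive in a `robots` META tag, plus an `X-Robots-Tag` response header that some search engines honor. Nothing in this thread says BoardReader supports either one; `may_cache` and the parser class are hypothetical names.

```python
from html.parser import HTMLParser


def _split_directives(value):
    """Turn 'index, NoArchive' into {'index', 'noarchive'}."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split_directives(attr.get("content", ""))


def may_cache(html, headers=None):
    """Return False if the page or its headers ask not to be cached."""
    # Header names are compared case-insensitively.
    lowered = {k.lower(): v for k, v in (headers or {}).items()}
    directives = _split_directives(lowered.get("x-robots-tag", ""))

    parser = _RobotsMetaParser()
    parser.feed(html)
    directives |= parser.directives

    return "noarchive" not in directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="index, noarchive"></head></html>'
    print(may_cache(page))
    print(may_cache("<p>hello</p>"))
```

A site owner would then only need to add the META tag to the forum template, with no per-site email required.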

            I would also like clarification on Jelsoft's/vBulletin's role as a technology partner, specifically because the site mentions that vBulletin is an open source product, which it is not (the source may be distributed, but it remains proprietary and protected by copyright, and cannot be redistributed or published).

            Frankly, I'm a sceptic... and whilst I like the idea of a dedicated message board index I am hesitant to give boardreader access to my sites if I have no control over the caching and use of my pages (i.e. will it be possible to request removal of a page, if for example an incident on my site leads to a page being deleted as it contains libellous information?).

            I therefore seek clarity on the relationship between BoardReader and Jelsoft in how it impacts or affects users (or is it merely that Jelsoft explained what their URLs look like?), as well as info from BoardReader about the long term use of our content and the tools in which to control usage of that content.

            Thus far I've banned user agents quite heavy-handedly. At the moment I'm still letting BoardReader through on the 6 vBulletin sites I'm running, but I'll need a lot of peace of mind to keep it that way.

            Hope that doesn't come across as too aggressive... I'm just very protective of my users' privacy and data, as well as our liability risks. And whilst search engine indexing is a good thing (it brings in new users), I need to find the balance between bad spiders and indexing (which affect privacy and liability) and discoverability (so that when people search on popular engines for terms that we cater for, we are in the results).


            David K
            London Fixed-gear and Single-speed


            • #36
              i tried to register...

              This is an automatically generated Delivery Status Notification.

              Delivery to the following recipients failed.

              That's the end of that!


              • #37

                I just added your board. Thanks for the interest. A good way to submit a site is through this link:



                Thanks for the questions. I just sent you a more detailed explanation through email. Please let me know if you have any other questions. Basically, we try our best to offer a solution to message board searching, but at times we struggle with how to offer features that have come to be expected from the bigger search engines. Cache is one such thing. Anyone can opt out of this feature by sending us an email, or when they sign up to get indexed. As for the boardreaderbot, I will look into a dedicated page for this. Lastly, we do not index profile info or harvest emails or the like. We are just trying to provide enthusiasts in the message board community with another way to connect with others with similar interests. We are ever growing, and we love these questions and comments to help us build a better environment for search.

                Thanks again for the questions, and shoot me a line if any other questions or comments come up.




                • #38
                  Personally, I think BoardReader is fine for me. Since your boards are already open to the public, BR is just like Google, which freely indexes your forums.

                  If you don't want them to do that, I'm sure they have some way of not indexing your forums like robots.txt for Google? I'm not entirely sure though.


                  • #39
                    Gary W.,

                    I'm glad you like it. We do respect all robots.txt files, and also emails from users and/or webmasters when they don't want to be crawled. Thanks again for the positive comments.
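For anyone wondering what the robots.txt route looks like in practice, below is a small sketch using Python's standard `urllib.robotparser`. The user-agent token `BoardReader` and the `forum.example` URLs are assumptions for illustration only; the exact string the bot sends is not given in this thread, so check your own access logs before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one spider out entirely,
# and keep every other spider out of /private/ only.
ROBOTS_TXT = """\
User-agent: BoardReader
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` is allowed to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    thread = "http://forum.example/showthread.php?t=1"
    print(may_crawl(ROBOTS_TXT, "BoardReader", thread))
    print(may_crawl(ROBOTS_TXT, "Googlebot", thread))
    print(may_crawl(ROBOTS_TXT, "Googlebot", "http://forum.example/private/inbox"))
```

Note that robots.txt only controls crawling. Whether an already indexed page stays in a cache is a separate question, which is why the opt-out by email is mentioned alongside it.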



                    • #40
                      Just submitted RG.


                      We'll see how it goes!


                      • #41
                        i got on there a few times

                        i do agree though, the fact that all branding is removed is troublesome, because everyone works hard at it all. it's like a package: content, design and members.


                        • #42
                          RapidGaming, I just got your site set up. Thanks for the sub!


                          No question on branding, content and membership. These are important and valuable things. One point I would like to make is that our cache has a very simple structure, as you all can see. We cannot place all the HTML back into our cache, because the more info we pull back and store from a board, the more bandwidth we all use, which means more money spent and more hardware needed to serve full page views of that cached thread. We are not like other search engines that can afford wonderful cached structures; we are offering a small view of what that post talked about. It is #1 for us to try and take a user back to the original page, but maybe one day we can offer a better snapshot!!

                          This is a catch-22, because some like the cache (we try to state in the cache where that content has come from) and some find it wrong to repost a deleted thread.

                          As you all know, message board data is always up/down and all over the place. Cache in my view is a way to help MB users read past posts while a site is down or after a thread has been removed. Time will tell what to do about cached pages, but offering more features for our cache is not possible at this time. Please continue to offer new suggestions; they are all GREAT for us to know about.

                          THANKS, SCOTT
                          Last edited by spurdon; Tue 23rd Mar '04, 11:42am.


                          • #43

                            BoardReader is a very good tool.

                            But, as many people mentioned, with the archive it would be nice if it were clearly stated (at the top and at the bottom) that the page the person is viewing is cached from http://url/

                            That would remove the problem of copyright, hopefully.

                            As far as removing old pages goes, I personally find access to old pages nice. I never delete old threads at my forums. The idea of bulletin boards in general was that data is available at any time after its posting, as opposed to real-time chats, so I'd hate to see that feature go.

                            All the best,

                            HFT Online - Professional computer help ... with a personal touch


                            • #44

                              Good points. I agree. You can click in the top of the cached page and go to the home page of the site that the thread was from. We will look into adding more links within the cache to reinforce this point.




                              • #45
                                I asked my site not be cached and it is cached anyways.
                                Running vB since 4-14-2002