Using PHP to create a mirror of vB threads to HTML files

  • Dave Baker
    Member
    • Jul 2000
    • 50

    Using PHP to create a mirror of vB threads to HTML files

    Hello! I'd like to enable users to search my vB using my Excite for Web Servers ("EWS") search engine (EWS searches can be made in plain English, the results include "more like this one" links, and it has other features that I like better than those in the default vB search engine).

    EWS wants local HTML files to read, and it digests them once a night via a cron job.

    So I need to convert vB threads to local html files (basically mirroring my site, as it exists at a particular time in the middle of each night). Then I'll have EWS read the html files.

    Here's the code I came up with, based on a nice hack posted elsewhere (http://vbulletin.com/forum/showthread.php?threadid=1092) on these boards. I call it "all_threads_with_mirror.php" and I run it by typing http://mysitename.net/all_threads_with_mirror.php into a browser. I wonder if anybody could suggest a more efficient way of getting at the threads in vB and of writing the resulting data to individual HTML files on my site's hard drive. Thanks!!

    Code:
    <? 
    require("global.php"); 
    
    $MIRROR_DIR = "/www/vhosts/mysitename/mirror";
    
    $threads=$DB_site->query("SELECT threadid,title FROM thread 
       WHERE visible=1 ORDER BY lastpost DESC"); 
    
    while ($threadarray = $DB_site->fetch_array($threads)) { 
       $threadid = $threadarray["threadid"]; 
       $title = $threadarray["title"]; 
       $title_in_html = htmlspecialchars($title);
       print "<a href=\"search/$threadid.php\">$title_in_html</a><br>\n"; 
       set_time_limit(60);
       $thread_text = "";
       if (!$file=fopen("http://mysitename.net/showthread.php?threadid=$threadid" , "r")) {
          // fopen() returns false if the page couldn't be opened
          echo("Could not open http://mysitename.net/showthread.php?threadid=$threadid");
       } else {
          while (!feof($file)) {    // Continue until feof() is true
             $thread_text .= fgetc($file);
          }
          fclose($file);
       }
       if (!$filetowrite=fopen("$MIRROR_DIR/$threadid.html" , "w")) {
          // fopen() returns false if the file couldn't be opened
          echo("Could not open $MIRROR_DIR/$threadid.html for writing");
       } else {
          fputs($filetowrite, $thread_text);
          fclose($filetowrite);
       }
    } 
    ?>
    I'm able to get about two HTML files written per second this way, of an average size of about 20k. But with 7,400 threads this takes about an hour. Server load on my Linux box typically goes up from about 0.2 to a shade under 2.0.

    Maybe there's PHP magic I don't know about, or I'm doing it inefficiently?

    [Edited by Dave Baker on 10-13-2000 at 08:42 PM]
  • Dave Baker
    Member
    • Jul 2000
    • 50

    #2
    Well, I just had one thought, anyway -- now that I've got this 150 megs of data saved (took over an hour), all I need to do is run the script once a night and re-save any threads that have changed in the past 24 hours ... even if I have 50 threads that are changed during a day, that takes less than a minute.

    So

    Code:
    $threads=$DB_site->query("SELECT threadid,title FROM thread 
       WHERE visible=1 ORDER BY lastpost DESC");
    now can be changed to read

    Code:
    // Only fetch stuff that's changed in last 24 hours; 
    // we'll run this php script once a night
    
    $searchdate = 1;  
    $datecut=time()-($searchdate*86400);
    
    $threads=$DB_site->query("SELECT threadid,title FROM thread
       WHERE visible=1 AND thread.lastpost >= $datecut ORDER BY lastpost DESC");


    • Dave Baker
      Member
      • Jul 2000
      • 50

      #3
      Reading and writing the files one line at a time rather than one character at a time seems to be easier on the server: load is a bit lower than before. The number of files being processed per minute is only slightly higher, though.

      Code:
      <? 
      require("global.php"); 
      
      $MIRROR_DIR = "/www/vhosts/mysitename/mirror";
      
      $threads=$DB_site->query("SELECT threadid,title FROM thread 
         WHERE visible=1 ORDER BY lastpost DESC"); 
      
      while ($threadarray = $DB_site->fetch_array($threads)) { 
         $threadid = $threadarray["threadid"]; 
         $title = $threadarray["title"]; 
         $title_in_html = htmlspecialchars($title);
         print "<a href=\"search/$threadid.php\">$title_in_html</a><br>\n"; 
         set_time_limit(60);
         if (!$filetowrite=fopen("$MIRROR_DIR/$threadid.html" , "w")) {
            echo("Could not open $MIRROR_DIR/$threadid.html for writing");  // fopen() returns false on failure
            continue;  // Nothing to write to, so skip this thread
         }
         $thread_text_line = "";
         if (!$threadfile=fopen("http://mysitename.net/showthread.php?threadid=$threadid" , "r")) {
            echo("Could not open http://mysitename.net/showthread.php?threadid=$threadid");  // fopen() returns false on failure
         } else {
            while (!feof($threadfile)) {    // Continue until feof() is true
               $thread_text_line = fgets($threadfile, 255);  // Reads at most 254 bytes of a line per call
               fputs($filetowrite, $thread_text_line);
            }
            fclose($threadfile);  // Only close the handle if it was actually opened
         }
         fclose($filetowrite);
      } 
      ?>
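
The change above cuts the number of read calls per page by moving from single characters to lines. A minimal sketch of the same idea taken one step further, copying in fixed-size blocks with fread(), might look like the following. The function name copy_stream_in_blocks and the 8192-byte block size are assumptions made for illustration; they are not part of the script posted in this thread.

```php
<?php
// Hypothetical helper (not from the original script): copy everything from
// an open read handle to an open write handle in fixed-size blocks.
// Returns the number of bytes written.
function copy_stream_in_blocks($source, $dest, $block_size = 8192)
{
    $total = 0;
    while (!feof($source)) {
        $block = fread($source, $block_size);
        if ($block === false || $block === '') {
            // Read error or no more data: stop instead of looping forever
            break;
        }
        $written = fwrite($dest, $block);
        if ($written === false) {
            // Write error: give up on this copy
            break;
        }
        $total += $written;
    }
    return $total;
}
```

Both arguments are handles returned by fopen(), so inside the loop over threads this would sit where the fgets()/fputs() pair is now. Whether it is measurably faster on a given server is something to test rather than assume.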


      • Dave Baker
        Member
        • Jul 2000
        • 50

        #4
        Yet another speed enhancement -- now I'm getting about 160 HTML files (one per thread) created per minute (up from about 120 files per minute) --

        Code:
        <? 
        require("global.php"); 
        
        $MIRROR_DIR = "/www/vhosts/mysitename/mirror";
        
        $threads=$DB_site->query("SELECT threadid,title FROM thread 
           WHERE visible=1 ORDER BY lastpost DESC"); 
        
        while ($threadarray = $DB_site->fetch_array($threads)) { 
           $threadid = $threadarray["threadid"]; 
           $title = $threadarray["title"]; 
           $title_in_html = htmlspecialchars($title);
           print "<a href=\"search/$threadid.php\">$title_in_html</a><br>\n"; 
           set_time_limit(60);
           if (!$filetowrite=fopen("$MIRROR_DIR/$threadid.html" , "w")) {
              echo("Could not open $MIRROR_DIR/$threadid.html for writing");  // fopen() returns false on failure
              continue;  // Nothing to write to, so skip this thread
           }
           // file() reads the whole page into an array of lines in one call
           $thread_array = file("http://benefitsboards.net/showthread.php?threadid=$threadid");
           $thread_string = implode('', $thread_array);
           fputs($filetowrite, $thread_string);
           fclose($filetowrite);
        } 
        ?>
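
For readers on later PHP versions, the file() plus implode() step above can be sketched with file_get_contents() (PHP 4.3 and up) and file_put_contents() (PHP 5 and up), which did not exist when this thread was written. The helper name mirror_page and the ".tmp" suffix are assumptions for illustration only. Fetching a URL this way also assumes allow_url_fopen is enabled on the server.

```php
<?php
// Hypothetical helper (not from the original script): fetch $source, which
// may be a URL or a local path, in one call and save it as $target.
// Returns true on success, false if fetching or writing fails.
function mirror_page($source, $target)
{
    $html = @file_get_contents($source);
    if ($html === false) {
        // Fetch failed: leave any existing mirror file untouched
        return false;
    }
    // Write under a temporary name first, so a failed write never leaves a
    // truncated file where the search engine would pick it up.
    $tmp = $target . '.tmp';
    if (file_put_contents($tmp, $html) === false) {
        return false;
    }
    return rename($tmp, $target);
}
```

Inside the loop this would be called once per thread with the showthread.php URL and "$MIRROR_DIR/$threadid.html". Unlike opening the target with "w" up front, a failed fetch here does not wipe out last night's copy of the thread.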


        • dwh
          Senior Member
          • Sep 2000
          • 1224
          • 3.0.0 Release Candidate 4

          #5
          gr8 script gr8 idea. Do u know if there's any way to call php scripts from the command line? Doesn't work for me??

          Would be gr8 to put this in cron.
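
On the command-line question, a small sketch may help, assuming a command-line PHP binary is installed on the server (not every host has one). It checks how the script was started and reads one optional argument. The function names and the "all" / "recent" mode strings are made up for illustration.

```php
<?php
// Hypothetical helper: true when the script was started by the command-line
// PHP binary rather than through the web server.
function running_from_cli()
{
    return php_sapi_name() === 'cli';
}

// Hypothetical helper: pick the run mode from an argument list shaped like
// $argv (script name first, then the arguments).
function mirror_mode_from_args($args)
{
    // Default to the cheap incremental run; rebuild everything only when
    // the first argument is exactly "all".
    if (isset($args[1]) && $args[1] === 'all') {
        return 'all';
    }
    return 'recent';
}
```

With something like this in place, a shell call such as "php all_threads_with_mirror.php all" would request a full rebuild, and the same command without the argument could go in a nightly crontab entry. The binary name "php" and the script location are assumptions that depend on the server.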


          • dwh
            Senior Member
            • Sep 2000
            • 1224
            • 3.0.0 Release Candidate 4

            #6
            Originally posted by Dave Baker
            Well, I just had one thought, anyway -- now that I've got this 150 megs of data saved (took over an hour), all I need to do is run the script once a night and re-save any threads that have changed in the past 24 hours ... even if I have 50 threads that are changed during a day, that takes less than a minute.

            So

            Code:
            $threads=$DB_site->query("SELECT threadid,title FROM thread 
               WHERE visible=1 ORDER BY lastpost DESC");
            now can be changed to read

            Code:
            // Only fetch stuff that's changed in last 24 hours; 
            // we'll run this php script once a night
            
            $searchdate = 1;  
            $datecut=time()-($searchdate*86400);
            
            $threads=$DB_site->query("SELECT threadid,title FROM thread
               WHERE visible=1 AND thread.lastpost >= $datecut ORDER BY lastpost DESC");
            So if you run this less often than once every 24 hours, it will miss threads?

            I was thinking of making this script accept a parameter and rebuild everything or just yesterday's files. And to run via cron so that it would just update yesterday's threads....
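
One way around the missed-threads concern, sketched here under assumptions, is to remember when the script last ran and use that time as the cutoff instead of a fixed 24 hours. The function names and the idea of a small state file are hypothetical; nothing like this appears in the script posted above.

```php
<?php
// Hypothetical helper: work out the cutoff timestamp for the thread query.
// $state_file holds the Unix timestamp of the previous successful run.
function mirror_cutoff($state_file, $now, $fallback_days = 1)
{
    if (is_readable($state_file)) {
        $last_run = (int) trim(file_get_contents($state_file));
        if ($last_run > 0 && $last_run <= $now) {
            // Use the recorded time of the previous run as the cutoff
            return $last_run;
        }
    }
    // No usable record of a previous run: fall back to a fixed window
    return $now - $fallback_days * 86400;
}

// Hypothetical helper: record the start time of this run. Meant to be
// called only after the run has finished without errors.
function mirror_save_run_time($state_file, $now)
{
    return file_put_contents($state_file, (string) $now) !== false;
}
```

The value returned by mirror_cutoff() would take the place of $datecut in the quoted query. Saving the start time of the run, rather than the end time, means a thread that changes while the script is running gets picked up again the next night.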


            • Shaman
              Senior Member
              • Jun 2000
              • 295

              #7
              Use a program that does it through TCP/IP. There are a number of them like UDMSearch that work really well.
              http://racing.kos.net
              http://www.rumour.com/
