impex painfully slow

  • ctbk
    New Member
    • Mar 2006
    • 3
    • 3.5.x

    impex painfully slow

    Clean import, old forum has ~ 2.000.000 posts, 34.000 users.

    Source - fudforum 2.5.0
    Target - vBulletin 3.5.0
    Module - 007 - Import Posts

    We are trying to migrate our forum to vB, but the ImpEx process is far too slow. After running into memory problems in an earlier, smaller test import, we set the number of posts to import per page to 200.

    The first pages were crunched through quite fast, but each page took longer and longer as the import proceeded. We are now at 30 minutes per page, and often, after the PHP process finishes working, the import stops without moving on to the next batch of posts (the browser never receives the generated page; maybe this is due to a timeout between Apache and the separate PHP process).
    Last received page was:

    From : 32600 :: To : 32800

    31' 35" of CPU time (and counting) ago.

    I doubt it can make it to 2.000.000, not with this exponential growth of the processing time.

    We are running PHP in CGI mode with a 256 MB memory limit and max_execution_time set to 120000 seconds. We got no errors, but at this speed it seems impossible to complete the task.

    What can we try? Why is it so slow? (I mean, c'mon, analyzing and converting 200 messages can't take that long!) Could we try disabling some test or process in the script?
  • Jerry
    Senior Member
    • Dec 2002
    • 9137
    • 1.1.x

    #2
    It's not going to be the analysis of the messages. With fud, it's the huge source files that ImpEx has to open and seek through to the file offset to find each message.

    It has to do that for every post.

    I would raise the per-page count, not lower it, if you have your PHP memory at 256 MB, as that will cut down on the startup overhead.
    I wrote ImpEx.

    Blog | Me


    • ctbk
      New Member
      • Mar 2006
      • 3
      • 3.5.x

      #3
      Originally posted by Jerry
      It's not going to be the analysis of the messages. With fud, it's the huge source files that ImpEx has to open and seek through to the file offset to find each message.

      It has to do that for every post.

      I would raise the per-page count, not lower it, if you have your PHP memory at 256 MB, as that will cut down on the startup overhead.
      We are trying with 5000 posts per page, but it slows down anyway. What I don't understand is why the first pages are quite fast and then it gets more and more sluggish.


      It is now working on From : 25000 :: To : 30000, and so far it has taken 35 minutes; CPU is at 99.9%, memory use at 5%. Is there anything else we can try?
      Thanx


      • Jerry
        Senior Member
        • Dec 2002
        • 9137
        • 1.1.x

        #4
        Originally posted by ctbk
        We are trying with 5000 posts per page, but it slows down anyway. What I don't understand is why the first pages are quite fast and then it gets more and more sluggish.
        Because at first the position it's looking for is near the top of the file, and progressively it has to fseek further into the file. The internal workings of ImpEx are the same regardless of where the post text comes from; it's just fetching the text that changes per post.

        Originally posted by ctbk
        It is now working on From : 25000 :: To : 30000, and so far it has taken 35 minutes; CPU is at 99.9%, memory use at 5%. Is there anything else we can try?
        Thanx
        In the source, can you move threads into temporary forums? My thinking is that with a lot of a little, ImpEx should be much quicker than with a little of a lot.

        i.e. lots of small forum files as opposed to a few huge ones.
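
The offset-based lookup described in this post can be sketched in PHP as follows. This is a minimal illustration, not ImpEx's actual get_post(): the function name read_post_at and the sample file contents are hypothetical, and it assumes each post is stored as a byte range (offset plus length) inside a larger message file.

```php
<?php
// Hypothetical helper (NOT the ImpEx implementation): return $length bytes
// starting at byte $offset of the message file at $path.
function read_post_at(string $path, int $offset, int $length): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // fseek() returns 0 on success, -1 on failure.
        if (fseek($handle, $offset) !== 0) {
            throw new RuntimeException("Cannot seek to offset $offset");
        }
        // fread() rejects a length below 1, so handle empty posts explicitly.
        $data = $length > 0 ? fread($handle, $length) : '';
        if ($data === false) {
            throw new RuntimeException("Cannot read $length bytes");
        }
        return $data;
    } finally {
        fclose($handle);
    }
}

// Demo with a made-up two-post file.
$demoPath = tempnam(sys_get_temp_dir(), 'msg');
file_put_contents($demoPath, "first postsecond post");
echo read_post_at($demoPath, 10, 11), "\n"; // prints "second post"
unlink($demoPath);
```

In this sketch the file is reopened for every post, which is the per-post overhead described above. On a regular file, fseek() moves straight to the requested offset without reading the bytes it skips.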

        • ctbk
          New Member
          • Mar 2006
          • 3
          • 3.5.x

          #5
          I got it! The size of the message file wasn't to blame; it was this line:

          Code:
          $try->set_value('nonmandatory', 'pagetext', $this->fudforum_html($this->html_2_bb($this->get_post($file, $post_details['foff'], $post_details['length']))));
          These functions were where the CPU spent all its time. (Maybe a bad regex? Some posts took 10 to 20 minutes to process.) The outcome wasn't satisfactory either: all the HTML was still in the converted posts, and it showed on the board.
          The size of the message files is really not a problem. After changing that line as follows:
          Code:
          $try->set_value('nonmandatory', 'pagetext',$this->get_post($file, $post_details['foff'], $post_details['length']));
          the import goes super fast, and the HTML-to-BB translation will be done in a separate script I assembled and tested.
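
One way to pin down which posts make a conversion step this slow is to time the step per post and flag the outliers. The sketch below is an assumption-laden illustration: timed_convert, the one-second threshold, and the stand-in bold-tag converter are all hypothetical and are not part of ImpEx.

```php
<?php
// Hypothetical wrapper: run any conversion callable on one post's text and
// report how long it took, so pathological posts can be logged and inspected.
function timed_convert(callable $convert, string $text, float $slowSeconds = 1.0): array
{
    $start = microtime(true);
    $result = $convert($text);
    $elapsed = microtime(true) - $start;

    return [
        'text'    => $result,
        'seconds' => $elapsed,
        'slow'    => $elapsed > $slowSeconds,
    ];
}

// Stand-in converter (NOT fudforum_html() or html_2_bb()): bold tags only.
$toBbcode = function (string $html): string {
    $converted = preg_replace('#<b>(.*?)</b>#si', '[b]$1[/b]', $html);
    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $converted ?? $html;
};

$outcome = timed_convert($toBbcode, 'Hello <b>world</b>');
echo $outcome['text'], "\n"; // prints "Hello [b]world[/b]"
```

Logging the post id whenever 'slow' is true would show whether a handful of unusual posts or every post is responsible for the CPU time.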


          • Jerry
            Senior Member
            • Dec 2002
            • 9137
            • 1.1.x

            #6
            Taking out the preg_replace and using cleaner.php is the other option, yes!

            I'll have to profile it though, to see why it's eating the CPU with the HTML parsing.

            • readordie
              Member
              • Sep 2005
              • 60
              • 3.5.x

              #7
              What file is this line found in? Because I'm having the same problem.


              • Boof
                New Member
                • Jun 2006
                • 3

                #8
                find . -type f -exec grep -H '$this->fudforum_html($this->html_2_bb' {} \;

                shows that similar code exists in two files.

                systems/fudforum/007.php and systems/fudforum/010.php

                It is not exactly the same as above. Code searched is @version $Revision: 1724 $


                • infocruceros
                  New Member
                  • Jun 2008
                  • 4
                  • 3.7.x

                  #9
                  It's simply incredible that Jerry or the ImpEx development team hasn't fixed this big problem with the HTML processing two and a half years later.

                  I have spent many hours over the last few weeks dealing with extremely slow ImpEx speed while importing 800.000 Snitz posts on a Core 2 Duo server with 4 GB of RAM.

                  I deleted the HTML processing functions and now the import runs fast. I think it is more professional, and faster, to process the HTML with MySQL statements at the end of the import. Please implement that in the next ImpEx version!
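
A rough sketch of what "process the HTML with MySQL statements" could look like is a single UPDATE built from nested REPLACE() calls. Everything here is an assumption for illustration: build_replace_update is a hypothetical helper, the table and column names (post, pagetext) and the tag map are examples, and the helper only builds the SQL string without running it. It also assumes the tags contain no backslashes.

```php
<?php
// Hypothetical helper: build one UPDATE statement that applies a series of
// literal tag substitutions with nested MySQL REPLACE() calls.
function build_replace_update(string $table, string $column, array $tagMap): string
{
    $expr = $column;
    foreach ($tagMap as $html => $bbcode) {
        // Double any single quotes so the literals stay valid SQL strings.
        $from = str_replace("'", "''", (string) $html);
        $to   = str_replace("'", "''", (string) $bbcode);
        $expr = "REPLACE($expr, '$from', '$to')";
    }
    return "UPDATE $table SET $column = $expr";
}

echo build_replace_update('post', 'pagetext', [
    '<b>'  => '[b]',
    '</b>' => '[/b]',
]), "\n";
// prints: UPDATE post SET pagetext = REPLACE(REPLACE(pagetext, '<b>', '[b]'), '</b>', '[/b]')
```

MySQL's REPLACE() matches case-sensitively and only handles fixed strings, so tags with attributes or mixed case would still need a separate pass.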


                  • Mopquill
                    Member
                    • Feb 2008
                    • 65
                    • 3.7.x

                    #10
                    Give him a break. ImpEx is a very good product, and it's one little bug. You don't need to call it incredible, and besides, 800,000 posts is A LOT. I don't think I know of any other converter out there capable of doing all that work any faster.
                    Fun, ROMs, gaming, emulations, and a friendly community!

                    http://www.emulysianfields.com
