3 Replies Latest reply on Oct 5, 2012 9:02 AM by philmodjunk

    How to parse out the number of pages from html code

    PeterMontague

           I used the following calculation to get the number of pages in a book from HTML source code:

            

           Let ( [ T = this::Child Source Code ;
                     start = Position ( T ; "<li><b>Paperback:</b> " ; 1 ; 1 ) + 22 ;
                     end = Position ( T ; " pages</li>" ; 1 ; 1 )
                   ] ;
                   Trim ( Middle ( T ; start ; end - start ) )
                  )
           This worked while I was only dealing with paperback books, because "Paperback:" was the first instance of that word in a long piece of source code. When hardcover books came along, the calculation stopped working. There are about 40 different book formats that could appear in place of "Paperback". Is there another way to get this calculation working, or a way to make it work for all the different formats?

        • 1. Re: How to parse out the number of pages from html code
          philmodjunk

               Using just "Paperback:" and "Hardcover:" for the example:

               Let ( [ T = this::Child Source Code ;
                         startPap = Position ( T ; "<li><b>Paperback:</b> " ; 1 ; 1 ) + 22 ;
                         startHard = Position ( T ; "<li><b>Hardcover:</b> " ; 1 ; 1 ) + 22 ;
                         start = Max ( startPap ; startHard ) ;
                         end = Position ( T ; " pages</li>" ; 1 ; 1 )
                       ] ;
                       Trim ( Middle ( T ; start ; end - start ) )
                      )
               This assumes that "Paperback:" and "Hardcover:" will not both appear in T at the same time.
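               To handle all 40 formats, add another startXXX variable for each one and include it in the Max function as well. Note that the + 22 offset is the length of the search string, which happens to be the same for "Paperback:" and "Hardcover:"; a format label of a different length needs its own offset.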
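
               As a rough sketch of how that could look when the labels differ in length, the version below uses Length ( ) for each offset instead of a hard-coded 22, and treats a label that is not found as 0 so it cannot win the Max. The "Mass Market Paperback:" label is only an assumed example of a longer format name, and startPos / endPos are illustrative variable names:

               Let ( [ T = this::Child Source Code ;
                         tagPap = "<li><b>Paperback:</b> " ;
                         tagMass = "<li><b>Mass Market Paperback:</b> " ;   // assumed example of a longer label
                         posPap = Position ( T ; tagPap ; 1 ; 1 ) ;
                         posMass = Position ( T ; tagMass ; 1 ; 1 ) ;
                         // Position returns 0 when the label is missing, so only add the offset on a match
                         startPap = If ( posPap > 0 ; posPap + Length ( tagPap ) ; 0 ) ;
                         startMass = If ( posMass > 0 ; posMass + Length ( tagMass ) ; 0 ) ;
                         startPos = Max ( startPap ; startMass ) ;
                         // look for the closing text only to the right of the label
                         endPos = Position ( T ; " pages</li>" ; Max ( startPos ; 1 ) ; 1 )
                       ] ;
                       If ( startPos > 0 and endPos > startPos ;
                            Trim ( Middle ( T ; startPos ; endPos - startPos ) ) ;
                            "" )
                      )

               Like the version above, this still assumes that only one of the labels appears in T.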
                
               PS: I don't suppose you could just get the position of the colon (:)? If the colon appears in all 40 formats, and this is the first (or only) place a colon appears in T, you can simply get its position. This could also work if it is always the Nth colon in T.
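
               A minimal sketch of the colon idea, assuming the colon you want really is the Nth one in T (N = 1 here is just a placeholder you would have to check against your own source code) and that the page count sits within the 40 characters after it:

               Let ( [ T = this::Child Source Code ;
                         N = 1 ;   // assumption: which occurrence of ":" comes just before the page count
                         colonPos = Position ( T ; ":" ; 1 ; N ) ;
                         rest = Middle ( T ; colonPos + 1 ; 40 )   // a short window of text after the colon
                       ] ;
                       // blank out the closing </b> tag so the first word left is the page count
                       If ( colonPos > 0 ; LeftWords ( Substitute ( rest ; "</b>" ; " " ) ; 1 ) ; "" )
                      )

               HTML source usually contains other colons (in URLs, for instance), so N may be hard to pin down in practice.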
          • 2. Re: How to parse out the number of pages from html code
            PeterMontague

                 I've found a pattern: "<li><b>" occurs only a limited number of times.

                 In all of the source code I checked, the first occurrence of it looks like this:

                 <li><b>Hardcover:</b> 196 pages</li>

                 This is a possible calculation I could use to parse out the number of pages:

                  

                 Let ( [ T = this::Child Source Code ;
                           start = Position ( T ; "<li><b>" ; 1 ; 1 ) + 22 ;
                           end = Position ( T ; "</li>" ; 1 ; 1 )
                         ] ;
                         Trim ( Middle ( T ; start ; end - start ) )
                        )

                 But it won't work for a book format whose name has more or fewer letters, because the + 22 offset is fixed. Is there a workaround?

            • 3. Re: How to parse out the number of pages from html code
              philmodjunk
                   Let ( [ T = this::Child Source Code ;
                             BookPos = Position ( T ; "<li><b>" ; 1 ; 1 ) ;
                             bPos = Position ( T ; "</b>" ; BookPos ; 1 ) ; // find the first </b> to the right of <li><b>
                             L = Length ( T )
                           ] ;
                           LeftWords ( Right ( T ; L - bPos - 4 ) ; 1 )   // first word to the right of bPos + 4
                          )
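
                   A guarded variation on the same idea, sketched under the assumption that the page count is always followed by " pages</li>" as in the sample line quoted earlier. It returns an empty result instead of stray text when either marker is missing. The names liPos, startPos, endPos and raw are illustrative only:

                   Let ( [ T = this::Child Source Code ;
                             liPos = Position ( T ; "<li><b>" ; 1 ; 1 ) ;
                             // only search for </b> if <li><b> was found
                             bPos = If ( liPos > 0 ; Position ( T ; "</b>" ; liPos ; 1 ) ; 0 ) ;
                             startPos = bPos + 4 ;   // first character after "</b>"
                             endPos = If ( bPos > 0 ; Position ( T ; " pages</li>" ; startPos ; 1 ) ; 0 ) ;
                             raw = If ( endPos > startPos ; Middle ( T ; startPos ; endPos - startPos ) ; "" )
                           ] ;
                           GetAsNumber ( raw )   // number result rather than text
                          )

                   Because it never refers to the format name, it should not matter whether the label is "Paperback:", "Hardcover:" or any of the other formats.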