16 Replies Latest reply on Nov 8, 2012 10:42 AM by philmodjunk

    How can I parse out the book description?

    PeterMontague

            

           I want to parse out the book description, which is in the last paragraph of the source below.

           I tried the following calculation, but it gave me a blank field.

            

           Let ( [ T = this::Child Source Code ;
                     start = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">\"" ; 1 ; 1 ) + 53 ;
                     end = Position ( T ; "</p></div>" ; 1 ; 1 )
                   ] ;
                   Trim ( Middle ( T ; start ; end - start ) )
                  )
           Does anyone know what I did wrong?
           How can I parse out the book description?

            

       <h2>Book Description</h2>

            

            

            

            

      <div class="buying"><span class="byLinePipe">Publication Date: </span><span style="font-weight: bold;">30 Jun 2011</span> <span class="byLinePipe"> | ISBN-10:</span><span style="font-weight: bold;"> 1933952695 </span> <span class="byLinePipe"> | ISBN-13:</span><span style="font-weight: bold;"> 978-1933952697</span> <span class="byLinePipe"> | Edition: </span><span style="font-weight: bold;">1</span></div>

            

            

            

            

        <div class="content">

          <div id="outer_postBodyPS" style="overflow:hidden; z-index: 1; height: 200px;">

            <div id="postBodyPS" style="overflow: hidden;">

               <div><p>VueScan is the world's most widely used software interface for digitizing film and prints on flatbed and film scanners. This powerful yet affordable program supports over 1500 scanners and 321 digital camera RAW file types, and is available for Mac OS X, Windows, and Linux. <br/><br/>  Much more than a simple scanner program, VueScan allows you to perform functions such as color restoration, adding sharpening filters, adjusting white balance, rotating images, and batch scanning multiple images. It also provides output to a variety of formats including TIFF, JPEG, and searchable PDF files (even all three simultaneously). The Pro version outputs to the RAW format and provides options for color adjustments, and more. <br/><br/>  Despite its popularity, the documentation for VueScan does not provide enough information to use the full power of the system and makes it difficult to get started. The VueScan Bible is the missing manual for new, experienced, and prospective users of VueScan.</p></div>

            

        • 1. Re: How can I parse out the book description?
          philmodjunk

               Preliminary testing (copying this text into a global field, then running the calculation in the Data Viewer) reveals that start is getting the value 53 instead of the value it should receive. In other words, Position is returning 0 because the search text is not found.

               Carefully copying and pasting the match text, inserting the needed backslashes, and comparing the result to what you have here reveals that you have an extra backslash and quote at the end of the text.

               You have: start = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">\"" ; 1 ; 1 ) + 53 ;
                  I used : start = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ; 1 ; 1 ) + 66 ;

               After that, I got the description text.

                

          • 2. Re: How can I parse out the book description?
            philmodjunk

                 Note that it may be wiser to use this expression for end, so that the search for the closing tags begins at start instead of at the top of the source:

                 end = Position ( T ; "</p></div>" ; start ; 1 )
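
                 Putting the two corrections together, the calculation might look like the sketch below. It also keeps the Position result in its own variable so that a failed match returns an empty result instead of text starting at character 66. The names tag and found are made up here, and the guard is an assumption about the behaviour you want, not something from the original calc.

                 Let ( [ T = this::Child Source Code ;
                         // search text without the stray \" at the end
                         tag = "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ;
                         found = Position ( T ; tag ; 1 ; 1 ) ;   // 0 if the tag is not in T
                         start = found + 66 ;                     // 66 = offset used in the test above
                         end = Position ( T ; "</p></div>" ; start ; 1 )   // search from start, not from 1
                       ] ;
                       If ( found and end ; Trim ( Middle ( T ; start ; end - start ) ) )
                     )

                 Note that Trim only removes leading and trailing spaces, so any returns around the text would still need separate handling.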

            • 3. Re: How can I parse out the book description?
              PeterMontague

                   Hi Phil. I've been working on the product description again. It works only sometimes and should be working more often. I've found some differences in the text preceding it, e.g.

                    

                   <div class="productDescriptionWrapper">
                         <div><p>Sascha Steinhoff is a computer expert by training and a photographer by passion. He used several rainy seasons in Galway, Ireland to learn everything he needed to know about scanners and scanning software. He soon realized that a good scan requires as much work and know-how as a good camera shot, and that a scanning workflow is essential to get the job done efficiently. <br/><br/>  Sascha recently gave up his position as an editor of a leading German technical magazine to move to Bangkok, Thailand, where he is currently working as a freelance journalist and project manager. <br/><br/>  Complementary information about the scanning techniques described in this book, as well as as a place where Sascha will reply to inquiries, can be found on his website at http://scanguru.info.</p></div>
                         
                         <div class="emptyClear"> </div>
                    
                   Here is another example:
                    
                        <div class="productDescriptionWrapper">
                              A guide to travelling, tasting and buying wine in the French wine regions. The book illustrates easy-to-follow routes through some of the most attractive and historical wine regions in the world. It gives advice on visiting wineries, how to survive a tasting and how to find bargains.
                              
                              <div class="emptyClear"> </div>
                         
                        Sometimes there are <p> tags and sometimes there aren't. How can I legislate for this?

                    

              • 4. Re: How can I parse out the book description?
                philmodjunk

                     Looks like you should be using "<div class="productDescriptionWrapper">" to find the beginning of the description instead of "<div id=\"postBodyPS\" style=\"overflow: hidden;\">".

                     You also have HTML tags embedded in the text.

                     I think a substitute function can be used to remove the extra tags.

                     A first cut at it:

                     Let ( [ T = child::Source;
                              start = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 ) + 39;
                              end = Position ( T ; "</div>" ; start ; 1 )
                            ] ;

                              Substitute ( Middle ( T ; start ; end - start ) ;
                                                 ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<div class=\"emptyClear\">" ; "" ] )
                           )

                     But we may need a custom function to use in place of substitute. I'm not at all sure that the above expression will remove every possible tag from the description text. We may need a recursive function that removes all text enclosed in the <> characters.
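
                     A minimal sketch of such a recursive custom function follows. The name StripTags, its single parameter text, and the variables tagStart and tagEnd are all hypothetical, and defining custom functions requires FileMaker Pro Advanced. It removes the first <...> pair it finds and calls itself on the remainder until no complete pair is left:

                     // StripTags ( text )   hypothetical custom function
                     Let ( [ tagStart = Position ( text ; "<" ; 1 ; 1 ) ;
                             tagEnd = If ( tagStart ; Position ( text ; ">" ; tagStart ; 1 ) ; 0 )
                           ] ;
                           If ( tagStart and tagEnd ;
                                // drop everything from "<" through ">" and recurse on what is left
                                StripTags ( Left ( text ; tagStart - 1 ) & Middle ( text ; tagEnd + 1 ; Length ( text ) ) ) ;
                                text
                              )
                         )

                     To keep paragraph breaks, substitute the <br/> tags with ¶ first and then pass the result through StripTags. It would also remove any ordinary text that happens to sit between a < and a later >, and FileMaker limits recursion depth, so it is safer to apply it to the extracted description than to the whole page source.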

                • 5. Re: How can I parse out the book description?
                  PeterMontague

                       Hi Phil,

                       That's not quite working yet.

                       On the first source code I tried, the result was the description preceded by some empty spaces. There were also too many carriage returns in the result. That was on this source code:

                  http://www.amazon.co.uk/The-VueScan-Bible-Everything-Scanning/dp/1933952695/ref=sr_1_1?ie=UTF8&qid=1349549013&sr=8-1

                  On the next source code I tried, the result seemed to be the whole of the source code put into the description field:

                  http://www.amazon.co.uk/Discovering-History-Certificate-Patsy-McCaughey/dp/1906623481/ref=sr_1_1?ie=UTF8&qid=1349687152&sr=8-1

                        

                        

                  • 6. Re: How can I parse out the book description?
                    philmodjunk

                         Remember that you can use Trim to remove extra space at the beginning and end of the extracted text. My suggested calc replaces the <br/> tag with returns. If that inserts more returns than you want you can change that to just remove these tags instead of replacing them with returns.

                         When I test the source from the first link in a database using the unmodified calculation, I get:

                         <blank line here>
                               Sascha Steinhoff is a computer expert by training and a photographer by passion. He used several rainy seasons in Galway, Ireland to learn everything he needed to know about scanners and scanning software. He soon realized that a good scan requires as much work and know-how as a good camera shot, and that a scanning workflow is essential to get the job done efficiently.

                           Sascha recently gave up his position as an editor of a leading German technical magazine to move to Bangkok, Thailand, where he is currently working as a freelance journalist and project manager.

                           Complementary information about the scanning techniques described in this book, as well as as a place where Sascha will reply to inquiries, can be found on his website at http://scanguru.info.

                         As far as I can tell, the second source example does not have a description. Here's a modified expression that handles missing descriptions:

                          Let ( [ T = child::Source;
                                  start = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 ) + 39;
                                  end = Position ( T ; "</div>" ; start ; 1 ) ;
                                  Desc = If ( start ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                     ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<div class=\"emptyClear\">" ; "" ] ) )
                                ] ;
                                If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                              ) // let

                         If there is no start text, the expression returns no characters.

                          

                    • 7. Re: How can I parse out the book description?
                      PeterMontague

                           Thanks Phil. That is not working with this source code where there is no product description.

                           http://www.amazon.co.uk/The-Business-Guide-Effective-Writing/dp/1850912912/ref=sr_1_1?ie=UTF8&qid=1349813447&sr=8-1

                      • 8. Re: How can I parse out the book description?
                        philmodjunk

                             Ooops.

                             Make it:

                             Let ( [ T = child::Source;
                                      start = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 ) + 39;
                                      end = Position ( T ; "</div>" ; start ; 1 ) ;
                                      Desc = If ( start > 39 ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                         ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<div class=\"emptyClear\">" ; "" ] ) )
                                    ] ;
                                    If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                  ) // let
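
                             The reason for the change: start is Position ( ... ) + 39, so when the wrapper tag is missing, Position returns 0 and start is 39, which If ( start ; ... ) still treats as true. An alternative sketch that avoids the magic number keeps the Position result in a separate variable. The names tag and found are introduced here for illustration only:

                             Let ( [ T = child::Source ;
                                     tag = "<div class=\"productDescriptionWrapper\">" ;
                                     found = Position ( T ; tag ; 1 ; 1 ) ;   // 0 when the tag is absent
                                     start = found + Length ( tag ) ;         // Length ( tag ) is 39
                                     end = Position ( T ; "</div>" ; start ; 1 ) ;
                                     // only extract when both the opening tag and a closing </div> were found
                                     Desc = If ( found and end ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ] ; ["<div class=\"emptyClear\">" ; "" ] ) )
                                   ] ;
                                   If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                 ) // let

                             Testing found directly means the check does not have to change if the search text, and therefore the offset, is edited later.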

                        • 9. Re: How can I parse out the book description?
                          PeterMontague

                               I've adapted this as sometimes the description is shown in a different way. It works most of the time, but it sometimes returns a huge amount of text, e.g. on this webpage: I can't find the startOverflow text anywhere in the source, but the startWrapper text is there. So why is startWrapper not working for me?

                                

                               Let ( [ T =this::Child Source Code ;
                                  startWrapper = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 );
                                  startOverflow = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ; 1 ; 1 ) + 57;
                                  start = Case ( startWrapper and startOverflow; Min ( startWrapper ; startOverflow ) ;
                                                               startWrapper ; startWrapper ;
                                                               startOverflow
                                                             ) ;
                               end = Position ( T ; "</div>" ; start ; 1 ) ;
                                        Desc = If ( start > 39 ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                           ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<i>" ; "" ] ; ["</i>";""]; ["<br />" ; ¶]; ["</P>";"¶"]; ["</P>";""]; ["<EM>"; ""]; ["</EM>"; ""]; ["<div class=\"emptyClear\">" ; "" ] ) )
                                      ] ;
                                      If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                    ) // let
                                
                          • 10. Re: How can I parse out the book description?
                            philmodjunk

                                 I suggest a temporary modification of your calculation:

                                 Let ( [ T =this::Child Source Code ;
                                    startWrapper = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 );
                                    startOverflow = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ; 1 ; 1 ) + 57;
                                    start = Case ( startWrapper and startOverflow; Min ( startWrapper ; startOverflow ) ;
                                                                 startWrapper ; startWrapper ;
                                                                 startOverflow
                                                               ) ;
                                 end = Position ( T ; "</div>" ; start ; 1 ) ;
                                          Desc = If ( start > 39 ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                             ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<i>" ; "" ] ; ["</i>";""]; ["<br />" ; ¶]; ["</P>";"¶"]; ["</P>";""]; ["<EM>"; ""]; ["</EM>"; ""]; ["<div class=\"emptyClear\">" ; "" ] ) )
                                        ] ;
                                        //If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                        List ( startWrapper ; startOverflow ; end )
                                      ) // let
                                 This will allow you to inspect the actual values of these temporary variables to see whether any of them is getting a value you don't expect. With those values in hand, "hand execute" your Case and Middle functions to see if you can spot why you are getting the incorrect result that you are reporting.
                                  
                            • 11. Re: How can I parse out the book description?
                              PeterMontague

                                   Thanks Phil. I'll try that. 

                              • 12. Re: How can I parse out the book description?
                                PeterMontague

                                     I'm sorry I don't really understand what temporary values I'm supposed to expect. 

                                     I got a return of numbers from http://www.amazon.co.uk/Series-Unfortunate-Events-Calendar-2005/dp/1405216611/ref=sr_1_1?ie=UTF8&qid=1352285957&sr=8-1

                                     0

                                     122663
                                     122859
                                • 13. Re: How can I parse out the book description?
                                  philmodjunk

                                       Here's the analysis I was trying to get you to do:

                                  List ( startWrapper ; startOverflow ; end )

                                       Means that the values computed are:

                                       startWrapper = 0
                                       StartOverflow = 122663
                                       end = 122859

                                       That indicates that "productDescriptionWrapper" was not found in this source.

                                       That means we have this expression for start:

                                       start = Case ( 0 and 122663; Min ( startWrapper ; startOverflow ) ;
                                                                       0 ; startWrapper ;
                                                                       122663 )

                                       which means that start = startOverflow = 122663

                                       and end will be the first instance of "</div>" to the right of StartOverflow or 122663. 122859 seems like a reasonable value for that.

                                       That means that our Middle function is set up as:

                                       Middle ( Source ; 122663 ; 122859 - 122663 )

                                       That all seems very unlikely to return an incorrect result once you edit your calculation to remove the List function and go back to using Substitute.
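
                                       Running the same hand execution for the opposite case, a page where the wrapper tag is present but the postBodyPS tag is not, shows one way the calc can return a huge block of text. The numbers here are hypothetical: say startWrapper = 118000. Because startOverflow is Position ( ... ) + 57, a failed match gives 57 rather than 0, so the first test in the Case is true, Min ( 118000 ; 57 ) picks 57, and Middle returns whatever lies between character 57 and the first "</div>" after it. A sketch of one possible fix adds each offset only after its tag has been found. The variable names are illustrative, and Length ( tag ) is used in place of the hand-counted offsets, so some leading white space may remain:

                                       Let ( [ T = this::Child Source Code ;
                                               wrapperTag = "<div class=\"productDescriptionWrapper\">" ;
                                               overflowTag = "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ;
                                               posWrapper = Position ( T ; wrapperTag ; 1 ; 1 ) ;
                                               posOverflow = Position ( T ; overflowTag ; 1 ; 1 ) ;
                                               // a missing tag stays 0 instead of turning into the bare offset
                                               startWrapper = If ( posWrapper ; posWrapper + Length ( wrapperTag ) ; 0 ) ;
                                               startOverflow = If ( posOverflow ; posOverflow + Length ( overflowTag ) ; 0 ) ;
                                               start = Case ( startWrapper and startOverflow ; Min ( startWrapper ; startOverflow ) ;
                                                              startWrapper ; startWrapper ;
                                                              startOverflow ) ;
                                               end = If ( start ; Position ( T ; "</div>" ; start ; 1 ) ; 0 ) ;
                                               Desc = If ( start and end ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                          ["<br/>" ; ¶] ; ["<br />" ; ¶] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ] ; ["<div class=\"emptyClear\">" ; "" ] ) )
                                             ] ;
                                             If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                           ) // let

                                       With this version a page that has neither tag gives start = 0, so Desc stays empty and the calc returns no characters.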

                                  • 14. Re: How can I parse out the book description?
                                    philmodjunk

                                         Once again, I pasted the page source into my test file and then your modified calculation appears to work, producing:

                                         Lemony Snicket's Calendar of Unfortunate Events 2005 - with 13 months.  Illustrated by Brett Helquist. This version includes UK/Eire notable dates in addition to the US/Canadian dates.

                                         The exact calculation that I used was:

                                         Let ( [ T =this::Child Source Code ;
                                            startWrapper = Position ( T ; "<div class=\"productDescriptionWrapper\">" ; 1 ; 1 );
                                            startOverflow = Position ( T ; "<div id=\"postBodyPS\" style=\"overflow: hidden;\">" ; 1 ; 1 ) + 57;
                                            start = Case ( startWrapper and startOverflow; Min ( startWrapper ; startOverflow ) ;
                                                                         startWrapper ; startWrapper ;
                                                                         startOverflow
                                                                       ) ;
                                         end = Position ( T ; "</div>" ; start ; 1 ) ;
                                                  Desc = If ( start > 39 ; Substitute ( Middle ( T ; start ; end - start ) ;
                                                                     ["<br/>" ; ¶] ; ["<br>" ; ""] ; ["<div>" ; "" ] ; ["<p>" ; "" ] ; ["</p>" ; "" ]; ["<i>" ; "" ] ; ["</i>";""]; ["<br />" ; ¶]; ["</P>";"¶"]; ["</P>";""]; ["<EM>"; ""]; ["</EM>"; ""]; ["<div class=\"emptyClear\">" ; "" ] ) )
                                                ] ;
                                                If ( Left ( Desc ; 1 ) = ¶ ; Right ( Desc ; Length ( Desc ) - 1 ) ; Desc )
                                              ) // let