3 Replies Latest reply on Oct 5, 2012 1:59 PM by philmodjunk

    How to parse the edition

    PeterMontague

           The edition of a book appears in a long piece of source code, on the line that always contains the second occurrence of "<li><b>".

           E.g.

            

           <li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>
           or
                <li><b>Publisher:</b> Atlas Press; New edition edition (Feb 1992)</li>
                 
                Sometimes the edition information is included. Sometimes it is not.
                 
                I don't want to include the word "edition".
                 
                How should I script for this?

            

        • 1. Re: How to parse the edition
          philmodjunk

               It looks to me like what you want is always the first word to appear after the first semicolon (;) to the right of "Publisher:".

               Let ( [ T = YourText Field ;
                         PubPos = Position ( T ; "Publisher:" ; 1 ; 1 ) ;
                         ParenPos = Position ( T ; "(" ; PubPos ; 1 ) ;
                         SemiPos = Position ( Middle ( T ; PubPos ; ParenPos - PubPos ); ";" ; PubPos ; 1 ) ; //only search for a ; between "Publisher:" and "("
                         L = Length ( T )
                       ] ;
                        If ( SemiPos ; LeftWords ( Right ( T ; L - SemiPos ) ; 1 ) )
                     )

               If there is no semicolon, presumably there is no edition information to parse. In that case SemiPos will contain 0, which evaluates as false, and the expression returns null (an empty result).
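
               As a quick illustration of that last point, here is a small sketch for the Data Viewer. The sample string is made up for the example and is not from your data:

               Let ( [ T = "Example Press (Jan 2000)" ;          // hypothetical sample with no semicolon
                       SemiPos = Position ( T ; ";" ; 1 ; 1 )    // Position returns 0 when ";" is not found
                     ] ;
                       If ( SemiPos ; "has edition info" )       // 0 is treated as false and there is no default result, so this returns empty
                     )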

          • 2. Re: How to parse the edition
            PeterMontague

                  

                 The script is not working.
                 It may not have been the first instance of "publisher:". So I modified it. It is definitely the first case of 
                 "<li><b>Publisher:</b>:"
                  
                 Here is what I did:
                  
                      Let ( [ T = this::Child Source Code ;
                                PubPos = Position ( T ; "<li><b>Publisher:</b>:" ; 1 ; 1 ) ;
                                ParenPos = Position ( T ; " (" ; PubPos ; 1 ) ;
                                SemiPos = Position ( Middle ( T ; PubPos ; ParenPos - PubPos ); "; " ; PubPos ; 1 ) ; //only search for a ; between "Publisher:" and "("
                                L = Length ( T )
                              ];
                               If ( SemiPos ; LeftWords ( Right ( T ; L - SemiPos ) ; 1 ))
                            )
                       
                      It's still not working, though. It's leaving a blank field. Below are examples of what the line looks like in a number of cases.
                       
                 <li><b>Publisher:</b> Pogue Press; 1 edition (2 Jun 2010)</li>
                      <li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>
                           <li><b>Publisher:</b> Swallows Tale Pr (Aug 1985)</li>
                            

                  

            • 3. Re: How to parse the edition
              philmodjunk

                   That'll teach me to suggest a complex calculation like this without testing it in the Data Viewer!

                   This expression worked with each of your samples:

                   Let ( [ T = "<li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>" ;
                             PubPos = Position ( T ; "Publisher:" ; 1 ; 1 ) ;
                             ParenPos = Position ( T ; "(" ; PubPos ; 1 ) ;
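                             SemiPos = Position ( Middle ( T ; PubPos ; ParenPos - PubPos ) ; ";" ; 1 ; 1 ) ; //only search for a ; between "Publisher:" and "("; start at 1 because this position is relative to the Middle ( ) result, not to T
                             L = Length ( T )
                           ];
                            If ( SemiPos ; LeftWords ( Right ( T ; L - SemiPos - PubPos + 1 ) ; 1 ) ) //PubPos + SemiPos - 1 is the position of the ; within T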
                           //List ( PubPos ; ParenPos ; SemiPos )
                         )

                   Including the HTML tags will also work and be a bit safer, but if you do that, make sure that you use "<li><b>Publisher:</b>", not "<li><b>Publisher:</b>:".
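
                   A minimal, untested sketch of that tag-based variant, for pasting into the Data Viewer. The variable names Tag and Chunk are introduced here for illustration only, and the literal in T stands in for your own field. The semicolon search starts at 1 because Position ( ) is applied to the extracted substring, so its result is relative to that substring rather than to the whole field:

                   Let ( [ T = "<li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>" ;  // replace with your field
                           Tag = "<li><b>Publisher:</b>" ;                       // no trailing colon after </b>
                           PubPos = Position ( T ; Tag ; 1 ; 1 ) ;
                           ParenPos = Position ( T ; "(" ; PubPos ; 1 ) ;
                           Chunk = Middle ( T ; PubPos ; ParenPos - PubPos ) ;   // text from the tag up to, but not including, "("
                           SemiPos = Position ( Chunk ; ";" ; 1 ; 1 )            // relative to Chunk; 0 if there is no semicolon
                         ] ;
                           If ( PubPos and ParenPos and SemiPos ;
                                LeftWords ( Middle ( Chunk ; SemiPos + 1 ; Length ( Chunk ) ) ; 1 )  // first word after the semicolon
                              )
                         )

                   With the samples posted above, this should return 1 for the Pogue Press line, 1st for the Jossey Bass line, and an empty result for the Swallows Tale Pr line, which has no semicolon.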