7 Replies Latest reply on Oct 5, 2012 8:34 AM by philmodjunk

    How to parse the date

    PeterMontague

            

           I'm trying to parse out the year of publication from the HTML below. It contains the only occurrence of "Publication Date:" in a long piece of text.

            

      <div class="buying"><span class="byLinePipe">Publication Date: </span><span style="font-weight: bold;">5 Jan 2009</span> </div>

            

           I want to parse out the last word before "</span>".

            

           Any suggestions?

        • 1. Re: How to parse the date
          philmodjunk

               </span> appears more than once. I would guess that by "last word" you want the year from the date: 2009.

               Let ( [ T = Yourtable::YourTextField ;   
                         PubPos = Position ( T ; "Publication Date:" ) ;
                         EndSpan = Position ( T ; "</span>" ; PubPos ; 2 )
                      ];
                      RightWords ( Left ( T ; EndSpan ) ; 1 )
                     )
                

                

          • 2. Re: How to parse the date
            PeterMontague

                 I adapted your script like this:

                  

                 Let ( [ T = this::Child Source Code ;   
                           PubPos = Position ( T ; "Publication Date:"; 1; 1 ) ;
                           EndSpan = Position ( T ; "</span>" ; PubPos ; 2 )
                        ];
                        RightWords ( Left ( T ; EndSpan ) ; 1 )
                       )
                  
                 This works about half of the time, but sometimes it returns "span" as the result.
                 Have I done something wrong?
            • 3. Re: How to parse the date
              philmodjunk

                   Good catch on the typo.

                   I suggest carefully examining the text in Child Source Code. PubPos might evaluate as 0 if the exact text used does not appear in it. EndSpan can return the wrong value if the </span> text immediately after the year is not always the second time it appears to the right of PubPos.
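
                   Both of those failure cases can be made visible instead of returning a misleading word. The following is only a sketch, assuming the same field name as in the calculation above; the variable W and the four-digit check are illustrative additions, not part of the original calculation.

```
Let ( [ T = this::Child Source Code ;
        PubPos = Position ( T ; "Publication Date:" ; 1 ; 1 ) ;
        // search for </span> only when the label was actually found
        EndSpan = If ( PubPos > 0 ; Position ( T ; "</span>" ; PubPos ; 2 ) ; 0 ) ;
        W = If ( EndSpan > 0 ; RightWords ( Left ( T ; EndSpan ) ; 1 ) ; "" )
      ] ;
      // accept the word only if it is exactly four digits; otherwise return empty
      If ( Length ( W ) = 4 and Length ( Filter ( W ; "0123456789" ) ) = 4 ; W ; "" )
    )
```

                   Records where this returns an empty result are the ones whose text does not match the assumed layout, so they are the ones worth inspecting by hand.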

              • 4. Re: How to parse the date
                PeterMontague

                     I've found a pattern. In the source code that I've studied, the date always appears after the second occurrence of '<li><b>'.

                     Here is a typical line containing it: 

                     <li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>

                     Does anyone have a way to parse this out?

                • 5. Re: How to parse the date
                  philmodjunk

                       That's quite different from your original example:

                  <div class="buying"><span class="byLinePipe">Publication Date: </span><span style="font-weight: bold;">5 Jan 2009</span> </div>

                  vs.

                       <li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>

                       Note that neither "Publication Date:" nor </span> appears in this text. You have several threads open dealing with different parsing challenges. Are you sure you've responded to the correct thread?

                  • 6. Re: How to parse the date
                    PeterMontague

                         Hi Phil. Yes, I can see that the date was in a different place earlier, but I think the date will be more consistent if I take it from here instead. I have a number of threads going, each looking to parse a different part of the code. I hope it's not confusing!

                    • 7. Re: How to parse the date
                      philmodjunk

                           The question I have:

                           Is the format of this text:

                           <li><b>Publisher:</b> Jossey Bass; 1st edition (25 Oct 1993)</li>

                           consistent in every case? Is there always the text Publisher:</b> followed by a date in parentheses to the right of it? You'll note that I often include "weasel words" with my suggestions, where I document the assumptions about your text format that I used in designing the calculation.

                           Let ( [ T = this::Child Source Code ;   
                                     PubPos = Position ( T ; "Publisher:"; 1; 1 ) ;
                                     rParenPos = Position ( T ; ")" ; PubPos ; 1 )   //find the first ) to the right of "Publisher:"
                                  ];
                                  RightWords ( Left ( T ; rParenPos - 1 ) ; 1 )
                                 )
                           Note that a slight modification of this would extract the complete publication date into a date field and then Year ( yourdatefield ) could be used whenever you need the year of publication.
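
                           One possible form of that modification is sketched below. It assumes the date always has the form "25 Oct 1993" (day, English month name or three-letter abbreviation, four-digit year) and that the first "(" after "Publisher:" opens the date; a publisher name that itself contains parentheses would break it. The variable names lParenPos, DateText and MonthNum are illustrative.

```
Let ( [ T = this::Child Source Code ;
        PubPos = Position ( T ; "Publisher:" ; 1 ; 1 ) ;
        lParenPos = Position ( T ; "(" ; PubPos ; 1 ) ;      // first ( to the right of "Publisher:"
        rParenPos = Position ( T ; ")" ; lParenPos ; 1 ) ;   // first ) to the right of that (
        DateText = Middle ( T ; lParenPos + 1 ; rParenPos - lParenPos - 1 ) ;   // e.g. 25 Oct 1993
        // map the first three letters of the month name to a month number 1..12
        MonthNum = Ceiling ( Position ( "JanFebMarAprMayJunJulAugSepOctNovDec" ;
                                        Left ( MiddleWords ( DateText ; 2 ; 1 ) ; 3 ) ; 1 ; 1 ) / 3 )
      ] ;
      Date ( MonthNum ; GetAsNumber ( LeftWords ( DateText ; 1 ) ) ; GetAsNumber ( RightWords ( DateText ; 1 ) ) )
    )
```

                           With a date result type, this can populate the date field, and Year ( yourdatefield ) then returns the year. If the month text is not recognized, MonthNum evaluates to 0 and the resulting date will be wrong, so the same caveats about consistent text apply.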