13 Replies Latest reply on Nov 8, 2012 10:35 AM by philmodjunk

    How to parse the url of an image

    PeterMontague

           I was hoping to parse the URL of an image out of a block of HTML in a generic way.

           Here is the text:

            

                        id="original-main-image"
           I want the url that comes before: id="original-main-image"
            

        • 1. Re: How to parse the url of an image
          philmodjunk

               <img src = "

               would seem to be a pretty easy tag to find and gain the position of. This should work out very similar to many of the other parsing questions that you've asked.

               Have you tried using one of them as an example and modifying it to work for this?

          • 2. Re: How to parse the url of an image
            PeterMontague

                 Hi Phil. There are actually about 40 instances of img src = " in the text. But there is only one instance of id="original-main-image".

            Most of the help you gave me was to find text that came after the text string. Anyway I've given it a go. I've tried the following:

                  

                 Let ( [ T = this::Child Source Code ;
                              start = Position ( T ; "<img src=\"" ; 1 ; 1 ) +  11;
                              end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 )
                         ] ;
                         Trim ( Middle ( T ; start ; end - start ) )
                        )
            • 3. Re: How to parse the url of an image
              philmodjunk

                   Once you have found the position of id="original-main-image", you can pass that position as the start and a negative occurrence number to search from right to left, beginning at that position.

                   Let ( [ T = this::Child Source Code ;
                          end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                          start = Position ( T ; "<img src=\"" ; end ; -1 )
                          ];
                          Trim ( Middle ( T ; start + 10 ; end - start - 13 ) )
                         )
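
                   The right-to-left search described above can also be modelled outside FileMaker. What follows is a minimal Python sketch, not FileMaker code: the function name image_url_before_marker and the sample HTML are made up for illustration. Here str.rfind plays the role of Position with an occurrence of -1, and the URL is cut at its closing quote rather than with a fixed offset such as - 13, which is an assumption about how the src attribute ends.

```python
def image_url_before_marker(html, marker='id="original-main-image"'):
    """Return the src URL of the <img> tag that carries `marker`, or ''."""
    tag = '<img src="'
    end = html.find(marker)              # position of the unique id attribute
    if end == -1:
        return ""
    start = html.rfind(tag, 0, end)      # nearest '<img src="' left of the marker
    if start == -1:
        return ""
    url_start = start + len(tag)
    url_end = html.find('"', url_start)  # closing quote of the src attribute
    if url_end == -1 or url_end > end:
        return ""
    return html[url_start:url_end]


# Hypothetical sample: two images, only the second one carries the id.
sample = (
    '<img src="http://example.com/thumb.jpg" alt="">'
    '<img src="http://example.com/main.jpg" id="original-main-image">'
)
print(image_url_before_marker(sample))
```

                   Cutting at the closing quote means the result does not depend on how many characters happen to sit between the URL and the id attribute.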

              • 4. Re: How to parse the url of an image
                PeterMontague

                     Thanks Phil. I adjusted it a little bit as follows:

                Let ( [ T = this::Child Source Code ;
                       end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                       start = Position ( T ; "<img src=\"" ; end ; -1 )
                       ];
                       Trim ( Middle ( T ; start + 10 ; end - start - 25 ) )
                      )

                • 5. Re: How to parse the url of an image
                  philmodjunk

                       As long as it works. I was tweaking it against your sample source in the Data Viewer when I used - 13.

                  • 6. Re: How to parse the url of an image
                    PeterMontague

                         I've noticed that the get image function sometimes fails because the page sources are inconsistent. So I've adapted it a bit to try to cater for both cases.

                          

                         Let ( [ T =this::Child Source Code ;
                                   endPos = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                   startImgsrc = Position ( T ; "<img src=\"" ; endPos ; -1 ) ; 
                                   startRegisterImage = Position ( T ; "(\"original_image\"," ; 1 ; 1) + 22 ; 
                                   endHref = Position ( T ; "\",\"<a href=" ; 1 ; 1 ) ; 
                                   end = Case ( endPos and endHref; Min ( endPos ; endHref ) ;
                                                         endPos ; endPos ;
                                                         endHref
                                                       ) ;
                                   start = Case ( startImgsrc and startRegisterImage; Min ( startImgsrc ; startRegisterImage ) ;
                                                         startImgsrc ; startImgsrc ;
                                                         startRegisterImage
                                                       )
                                 ] ;
                                 If ( start ; Trim ( Middle ( T ; start ; end - start ) ) )
                          )
                         Any tips?
                    • 7. Re: How to parse the url of an image
                      philmodjunk

                           Am I correct that there are two different forms of this end tag? "original-main-image" and "original_image"?

                           Why do you need endHref here? Since it searches from the beginning of the source code, that part of your expression may be finding a completely different href tag than the one that belongs to the image URL you are trying to extract.

                      • 8. Re: How to parse the url of an image
                        PeterMontague

                             Yes there are two different end tags. But they are original main image and href. That's why I am using that endHref. 

                        • 9. Re: How to parse the url of an image
                          philmodjunk

                               Yes, but why do you need it? The original calculation didn't.

                               And please note that, as I already posted, it is likely the source of your trouble as the position function looks for this text starting with the first character in the source.
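
                               To illustrate the point about the starting position, here is a minimal Python sketch (not FileMaker code; the function text_between and the sample string are hypothetical). Searching for the closing delimiter from the place where the opening delimiter was found, rather than from the first character, avoids matching an earlier, unrelated link.

```python
def text_between(source, opener, closer):
    """Return the text between `opener` and the first `closer` that follows it."""
    start = source.find(opener)
    if start == -1:
        return ""
    start += len(opener)
    end = source.find(closer, start)   # search begins at `start`, not at 0
    if end == -1:
        return ""
    return source[start:end]


# Hypothetical source: an unrelated link appears before the image registration.
sample = (
    'x","<a href="/other">...'
    'registerImage("original_image", "http://example.com/a.jpg","<a href="/zoom">'
)
opener = 'registerImage("original_image", "'
closer = '","<a href='

# Searching from the first character finds the earlier, unrelated link,
# so the "end" would come before the "start".
naive_end = sample.find(closer)
print(naive_end < sample.find(opener))
print(text_between(sample, opener, closer))
```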

                          • 10. Re: How to parse the url of an image
                            PeterMontague

                                  

                                 Let ( [ T =this::Child Source Code ;
                                           endPos = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                           startImgsrc = Position ( T ; "<img src=\"" ; endPos ; -1 ) ; 
                                           startRegisterImage = Position ( T ; "(\"original_image\"," ; 1 ; 1) + 20 ; 
                                           endHref = Position ( T ; "\",\"<a href=" ; 1 ; 1 ) ; 
                                           end = Case ( endPos and endHref; Min ( endPos ; endHref ) ;
                                                                 endPos ; endPos ;
                                                                 endHref
                                                               ) ;
                                           start = Case ( startImgsrc and startRegisterImage; Min ( startImgsrc ; startRegisterImage ) ;
                                                                 startImgsrc ; startImgsrc ;
                                                                 startRegisterImage
                                                               )
                                         ] ;
                                         If ( start ; Trim ( Middle ( T ; start ; end - start ) ) )
                                  )
                                 This got me the url of the image on this webpage:

                            http://www.amazon.co.uk/Series-Unfortunate-Events-Calendar-2005/dp/1405216611/ref=sr_1_1?ie=UTF8&qid=1352285957&sr=8-1

                            I was trying to accommodate the different endings of the image URLs in the source code of the page above and in the source code of the following page:

                            http://www.amazon.co.uk/Santa-Co-Cross-Stitch-Chart/dp/B009CYL0JU/ref=sr_1_1?ie=UTF8&qid=1352287486&sr=8-1

                            Would you mind suggesting a way I can get the url of images with both types of endings please? I'm at a bit of dead end.

                                  

                            • 11. Re: How to parse the url of an image
                              philmodjunk

                                   Apologies, but my time is limited today and I am deliberately trying not to spell out every detail, not only to save time, but to help you apply previous examples to the new expression. I've tried to get you to look at this part of your expression:

                                   endHref = Position ( T ; "\",\"<a href=" ; 1 ; 1 )

                                   The first 1 (the start parameter) is very likely all that you need to change here: specify a different starting point to keep this expression from returning the location of some other hyperlink in the source.

                                   When I use the original calculation:

                              Let ( [ T = this::Child Source Code ;
                                     end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                     start = Position ( T ; "<img src=\"" ; end ; -1 )
                                     ];
                                     Trim ( Middle ( T ; start + 10 ; end - start -25) )
                                    )

                              after pasting the source into my test file, it correctly parses out the image URL for both page sources. I can find "original-main-image" in both sources. Sure your example pages are the right ones?

                              • 12. Re: How to parse the url of an image
                                PeterMontague

                                     I looked at the source code of http://www.amazon.co.uk/Series-Unfortunate-Events-Calendar-2005/dp/1405216611/ref=sr_1_1?ie=UTF8&qid=1352285957&sr=8-1 and I used the find function in safari and there is no "original-main-image" in it. 

                                     I have adapted a custom function and it looks like this. It is getting images from both types of pages for me now.

                                      

                                          Let ( [ T = this::Child Source Code ;
                                                 end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                                 start = Position ( T ; "<img src=\"" ; end ; -1 )
                                                 ];
                                                 Trim ( Middle ( T ; start + 10 ; end - start - 25 ) )
                                                ) & 
                                          Parse ( this::Child Source Code ; "registerImage(\"original_image\", \"" ; "\",\"<a href=\"+'\"'+\"" ; 1 ) 
                                           
                                          Sometimes it downloads a blank default .gif and replaces the image I had uploaded from FileMaker Go.
                                           
                                          To adapt to this scenario I've modified it.
                                           
                                If ( Let ( [ T = this::Child Source Code ;
                                       end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                       start = Position ( T ; "<img src=\"" ; end ; -1 )
                                       ];
                                       Trim ( Middle ( T ; start + 10 ; end - start - 25 ) )
                                      ) & 
                                Parse ( this::Child Source Code ; "registerImage(\"original_image\", \"" ; "\",\"<a href=\"+'\"'+\"" ; 1 )  = "http://g-ecx.images-amazon.com/images/G/02/nav2/dp/no-image-no-ciu._V192200227_AA300_.gif" ;
                                "" ;
                                Let ( [ T = this::Child Source Code ;
                                       end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
                                       start = Position ( T ; "<img src=\"" ; end ; -1 )
                                       ];
                                       Trim ( Middle ( T ; start + 10 ; end - start - 25 ) )
                                      ) & 
                                Parse ( this::Child Source Code ; "registerImage(\"original_image\", \"" ; "\",\"<a href=\"+'\"'+\"" ; 1 )  )
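
                                The same idea (try the img tag, fall back to the registerImage call, and discard the placeholder .gif) can be sketched in Python. This is a model of the logic only, not FileMaker code. The helper names and the sample pages in the test are hypothetical, and cutting each URL at its closing quote is an assumption about the page source. The URL is computed once and then compared, instead of repeating the whole expression twice.

```python
# Placeholder image URL quoted in the post above.
PLACEHOLDER = (
    "http://g-ecx.images-amazon.com/images/G/02/nav2/dp/"
    "no-image-no-ciu._V192200227_AA300_.gif"
)


def from_img_tag(html):
    """URL from the <img> tag that carries id="original-main-image", or ''."""
    tag = '<img src="'
    end = html.find('id="original-main-image"')
    if end == -1:
        return ""
    start = html.rfind(tag, 0, end)      # search leftwards from the id
    if start == -1:
        return ""
    start += len(tag)
    close = html.find('"', start)        # closing quote of the src attribute
    return html[start:close] if -1 < close < end else ""


def from_register_image(html):
    """URL from a registerImage("original_image", "...") call, or ''."""
    opener = 'registerImage("original_image", "'
    start = html.find(opener)
    if start == -1:
        return ""
    start += len(opener)
    close = html.find('"', start)        # search forwards from the opener
    return html[start:close] if close != -1 else ""


def image_url(html):
    """Try both page layouts; return '' for the placeholder image."""
    url = from_img_tag(html) or from_register_image(html)
    return "" if url == PLACEHOLDER else url
```

                                In a FileMaker calculation the equivalent step would be to store the combined result in a Let variable and test that variable, rather than writing the expression out twice.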

                                      

                                • 13. Re: How to parse the url of an image
                                  philmodjunk

                                       From the link that you originally posted: