How to parse the url of an image
I was hoping to generically parse out the URL of an image from a block of HTML.
Here is the text:
<img src = "
would seem to be a pretty easy tag to find and gain the position of. This should work out very similar to many of the other parsing questions that you've asked.
Have you tried using one of them as an example and modifying it to work for this?
Hi Phil. There are actually about 40 instances of img src = " in the text. But there is only one instance of id="original-main-image".
Most of the help you gave me was to find text that came after the text string. Anyway I've given it a go. I've tried the following:
Once you have the position of id="original-main-image", you can use it as the starting point and pass a negative occurrence number so that Position searches from right to left from there.
Let ( [ T = this::Child Source Code ;
end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
start = Position ( T ; "<img src=\"" ; end ; -1 ) ] ;
Trim ( Middle ( T ; start + 10 ; end - start - 13 ) ) )
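If the fixed offsets turn out to be fragile (the number subtracted depends on how much text sits between the URL and the id attribute), one possible variation is to look for the closing quote of the src attribute instead. This is only a sketch: it assumes the URL is wrapped in double quotes and contains no quote characters itself, and the names marker, tag, urlStart and urlEnd are made up for illustration.

```
Let ( [
  T = this::Child Source Code ;
  // position of the attribute that identifies the main image
  marker = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ;
  tag = "<img src=\"" ;
  // search right to left from the marker for the nearest opening tag
  start = Position ( T ; tag ; marker ; -1 ) ;
  // the URL begins right after the tag text
  urlStart = start + Length ( tag ) ;
  // the URL ends at the next double quote
  urlEnd = Position ( T ; "\"" ; urlStart ; 1 )
] ;
  // return an empty result if any of the pieces was not found
  If ( marker > 0 and start > 0 and urlEnd > urlStart ;
    Middle ( T ; urlStart ; urlEnd - urlStart ) ;
    "" )
)
```

Because the length is computed from the two quote positions, nothing needs to be re-tuned if other attributes appear between the URL and the id.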
Thanks Phil. I adjusted it a little bit as follows:
Let ( [ T = this::Child Source Code ; end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ; start = Position ( T ; "<img src=\"" ; end ; -1 ) ]; Trim ( Middle ( T ; start + 10 ; end - start - 25 ) ) )
As long as it works. I tuned it against your sample source in the Data Viewer, which is where the - 13 came from.
I've noticed that the get image function sometimes fails because the page sources are inconsistent. So I've adapted my get image function a bit to try to cater for both cases.
Am I correct that there are two different forms of this end tag? "original-main-image" and "original_image"?
Why do you need EndHref here? Since it searches from the beginning of the source code, that part of your expression may be finding a completely different Href tag than that of the particular image URL that you are trying to extract.
Yes there are two different end tags. But they are original main image and href. That's why I am using that endHref.
Yes, but why do you need it? The original calculation didn't.
And please note that, as I already posted, it is likely the source of your trouble as the position function looks for this text starting with the first character in the source.
I was trying to accommodate the difference between the endings of the image URLs in the above URL's source code and in the source code of the following URL:
Would you mind suggesting a way I can get the URL of images with both types of endings please? I'm at a bit of a dead end.
Apologies, but my time is limited today and I am deliberately trying not to spell out every detail, not only to save time but also to help you apply previous examples to the new expression. I've tried to get you to look at this part of your expression:
endHref = Position ( T ; "\",\"<a href=" ; 1 ; 1 )
The first 1 (the start parameter, shown in red in the original post) is very likely all that you need to change here: specify a different starting point to keep this expression from returning the location of some other hyperlink in the source.
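To see the effect of the start parameter in isolation, here is a small throwaway expression you could paste into the Data Viewer. The text "aXbXc" and the names fromTop and fromThree are invented for the demonstration and have nothing to do with the page source.

```
Let ( [
  T = "aXbXc" ;
  // start = 1: finds the first X in the whole text
  fromTop = Position ( T ; "X" ; 1 ; 1 ) ;
  // start = 3: skips the first X and finds the next one
  fromThree = Position ( T ; "X" ; 3 ; 1 )
] ;
  fromTop & " / " & fromThree
)
```

This returns 2 / 4. The search string is the same both times; only the starting point decides which occurrence is reported.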
When I use the original calculation:
Let ( [ T = this::Child Source Code ; end = Position ( T ; "id=\"original-main-image\"" ; 1 ; 1 ) ; start = Position ( T ; "<img src=\"" ; end ; -1 ) ] ; Trim ( Middle ( T ; start + 10 ; end - start - 25 ) ) )
after pasting the source into my test file, it correctly parses out the image URL for both page sources. I can find "original-main-image" in both sources. Are you sure your example pages are the right ones?
I looked at the source code of http://www.amazon.co.uk/Series-Unfortunate-Events-Calendar-2005/dp/1405216611/ref=sr_1_1?ie=UTF8&qid=1352285957&sr=8-1 using the find function in Safari, and there is no "original-main-image" in it.
I have adapted a custom function and it looks like this. It is getting images from both types of pages for me now.
From the link that you originally posted: