If you're using version 12, you should be using the newer web-scraping technique: the Insert from URL script step.
This will insert the resultant code from a page into a field of your choosing.
BUT (and this is a BIG caveat), every web site may handle 404 errors differently. For example, take a look at this:
It's technically a 404, but the web application is presenting its own error message. There IS a "404" text reference in the page code itself.
What I would recommend doing is just flagging "suspect" 404 records to manually review later.
For example, script it like this:
Go to Record/Request/Page [ First ]
Loop
  Insert from URL [ table::code ; table::url ]
  If [ PatternCount ( table::code ; "404" ) > 0 ]
    Set Field [ table::flag ; "SUSPECT" ]
  End If
  Go to Record/Request/Page [ Next ; Exit after last ]
End Loop
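The same flagging logic can be sketched outside FileMaker; here's a rough Python equivalent (the record structure and sample page-source strings are invented for illustration):

```python
# Flag a record "SUSPECT" when the downloaded page source contains "404",
# mirroring PatternCount ( table::code ; "404" ) > 0 in the script above.

def flag_suspect(code: str, keyword: str = "404") -> str:
    # FileMaker's PatternCount is case-insensitive, so compare case-folded
    return "SUSPECT" if keyword.lower() in code.lower() else ""

# Stand-ins for the table's records; the HTML snippets are made-up examples
records = [
    {"code": "<h1>404 - page not found</h1>", "flag": ""},
    {"code": "<h1>Welcome to the site</h1>", "flag": ""},
]

for rec in records:  # the Loop ... Go to Record [Next] pass over records
    rec["flag"] = flag_suspect(rec["code"])

print([rec["flag"] for rec in records])  # ['SUSPECT', '']
```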
PS - Welcome to Technet!
Hey, that just might be it. I forgot about PatternCount.
That could also work for sites that show words like "incorrect", as in the example you gave.
Thanks for the welcome.
I prefer this revamp they have done. It makes me want to contribute or ask more....
The reason I recommended flagging and manually reviewing pages was the high probability of false positives.
If you use something like PatternCount(), it will return false positives whenever a legitimate live page (not a 404) happens to contain the string you are searching for.
For example, if I scraped this TechNet page, it would be a false positive, because we've used the strings "404" and "incorrect" numerous times by now.
Hopefully the number of pages you need to review will be small. You can chain keyword checks together too, like this:
If [ PatternCount ( table::code ; "404" ) > 0 or PatternCount ( table::code ; "incorrect" ) > 0 or PatternCount ( table::code ; "missing" ) > 0 ]
  Set Field [ table::flag ; "SUSPECT" ]
End If
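In the same hypothetical Python sketch, chaining the checks amounts to testing several keywords at once (the keyword list just mirrors the examples in this thread):

```python
# Chain several keyword checks, like the multi-condition If above.

SUSPECT_KEYWORDS = ("404", "incorrect", "missing")

def is_suspect(code: str) -> bool:
    text = code.lower()  # PatternCount is case-insensitive; match that
    return any(kw in text for kw in SUSPECT_KEYWORDS)

print(is_suspect("Error: the page you requested is missing"))  # True
print(is_suspect("All systems operational"))                   # False
```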
The only thing that gets in the way of labeling a site as 404 is whether the error text actually appears in the downloaded code (when it does, this works great).
For a JavaScript-driven site, text rendered by script doesn't show up in the downloaded source, so those pages don't get labeled as 'suspect', even if they blaze 404 across the page.
But otherwise, I followed what you said, and the extra info you gave helps too... and so far it is looking great.
I also added a pause loop that waits about a minute for the code to download: when the code field is still empty, the script keeps pausing until the field is filled, then moves on.
Thanks for writing back again, Mike.
For Insert from URL you shouldn't need to pause the script. It won't continue to the next script step until the page code has loaded and been inserted into the target field.
The pause WAS necessary in your original method, where you would have needed to wait for a webviewer to load first before scraping.
As for the JS, look at the page's code to see if there is any discernible pattern you can identify. Even if it's just a tag like <body onLoad('not found')>, you can scrape for it if it's unique.
Adding multiple conditions will make your script more robust and reliable for catching all types of 404s.
Just a thing I noticed: Insert from URL doesn't seem to work on mine; instead it shows error 507, which I do find a bit hard to understand:
"Value in field failed calculation test of validation entry option"
It does work quite well if I use GetLayoutObjectAttribute, though.
I am rather excited to see Insert From URL in action though...they say it's got many possibilities.
It's Insert from URL
You need to pass the script step a URL, as you would in the address bar of a browser (e.g. http://www.google.com, google.com, or www.google.com).
Here's the technical page on the script step:
Hi Mike (sorry I didn't reply for a few days)
Yes, it doesn't seem to accept a variable $url set from a field that has a URL in it (e.g. if <<URL>> contains "http://www.google.com", this isn't accepted...).
Which may mean that if Set Variable / Set Field doesn't work, I might have to copy/paste instead. The step takes a calculation, so I would have thought that just putting in the field where all the URLs are stored would suffice.
Thanks for your help in working this out.
Actually, the 404 status code is sent in the HTTP header (which you don't see in the browser or WebViewer, but which is interpreted by the browser).
If you have FileMaker Server or Server Advanced, you could use the FM API for PHP and write a PHP script that
- reads out the URL field from one or several records
- uses the PHP curl functions (see http://stackoverflow.com/questions/408405/easy-way-to-test-a-url-for-404-in-php ) to read out the HTTP header
- in case of a 404 (or a 302 redirect), updates a database field of the record(s) with the status
This script could then be called in a web viewer or using an Open URL script step.
(I remember having done that with XML/XSLT CWP and a Java extension function for XSLT.)
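For what it's worth, the header-based check suggested above can also be sketched in Python rather than PHP/curl. Reading the HTTP status code directly sidesteps the keyword false positives discussed earlier (this is a rough sketch, not the thread's actual setup):

```python
# Check a URL's HTTP status code from the response header.
# Note: urlopen follows redirects by default, so a 302 would normally be
# resolved to the final page's status; disable redirects if you need to see it.
import urllib.error
import urllib.request

def http_status(url: str) -> int:
    """Return the HTTP status code for url (e.g. 200 or 404)."""
    req = urllib.request.Request(url, method="HEAD")  # headers only, no body
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code rides the error
```

A record could then be flagged "SUSPECT" whenever http_status() returns 404, with no dependence on the page's wording.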