10 Replies Latest reply on Aug 1, 2013 1:18 AM by mbraendle

    Running script through website to see if it is a 404

    happyez

      Heya marvellous techpeople

       

      I have tried many different ways to work this out, and my knowledge only gets me so far...

       

      Does anyone know a way to get a script to tell whether a website returns a 404 or, more specifically, to detect the message that appears at the bottom right of the web viewer saying that the 'source cannot be located' (or some such)? That way, I can script through the thousands of URLs I have and clear (Set Field to "") all the ones that don't work.

       

      I tried using GetLayoutObjectAttribute to do this ("content" or "source") but to no avail.

       

      Anyone have any idea?

        • 1. Re: Running script through website to see if it is a 404
          mikebeargie

          If you're using version 12, you should be using the latest webscraping technique using the Insert From URL script step.

           

          This will insert the resultant code from a page into a field of your choosing.

           

          BUT (and this is a BIG caveat), every web site may handle 404 errors differently. For example, take a look at this:

          http://www.filemaker.com/asdfjasjdfas.html

           

          It's technically 404, but a web application is giving you a resulting error message. There IS a "404" text reference in the code itself.

           

          What I would recommend doing is just flagging "suspect" 404 records to manually review later.

           

          For example, script it like this:

           

          Freeze Window
          Go to Record - First
          Loop
            Insert From URL ( table::url into table::code field )
            If ( PatternCount ( table::code ; "404" ) > 0 )
              Set Field - table::flag = "SUSPECT"
            End If
            Go to Record - Next (exit after last)
          End Loop

           

          PS - Welcome to Technet!

          • 2. Re: Running script through website to see if it is a 404
            happyez

            Hi Mike

             

            Hey, that just might be it. I forgot about PatternCount.

             

            That could also work for sites that use the word "incorrect", like the example you gave.

             

            Thanks.

            • 3. Re: Running script through website to see if it is a 404
              happyez

              Thanks for the welcome.

               

              I prefer this revamp they have done. It makes me want to contribute or ask more....

              • 4. Re: Running script through website to see if it is a 404
                mikebeargie

                The reason I recommended flagging and manually reviewing pages was the high probability of false positives.

                 

                If you use something like patterncount(), it will return false positives anytime a legit live page (not 404/not found) contains the string of text you are searching for in the patterncount() function.

                 

                For example, if I scraped this technet page, it would be a false positive, because we've used the strings "404" and "incorrect" numerous times now.

                 

                Hopefully the number of pages you need to review will be small. You can also chain keyword checks together like this:

                 

                If ( PatternCount ( table::code ; "404" ) > 0 or PatternCount ( table::code ; "incorrect" ) > 0 or PatternCount ( table::code ; "missing" ) > 0 )
                  Set Field - table::flag = "SUSPECT"
                End If
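
To make the false-positive risk concrete, here is a minimal Python sketch of the same chained keyword check, run outside FileMaker. The function name `is_suspect` and the sample strings are hypothetical, and the case-insensitive matching is an assumption meant to mirror how PatternCount behaves.

```python
# Keywords taken from the chained check above; extend as needed.
SUSPECT_KEYWORDS = ("404", "incorrect", "missing")


def is_suspect(page_source, keywords=SUSPECT_KEYWORDS):
    """Return True if any keyword occurs anywhere in the page source."""
    lowered = page_source.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# A real error page is flagged, as intended.
error_page = "<h1>404</h1><p>The page you requested is missing.</p>"

# A live page is flagged too, because "404" appears inside "4045".
live_page = "<p>Your order #4045 has shipped.</p>"

# A live page with none of the keywords passes.
clean_page = "<p>Welcome to our home page.</p>"
```

Here `is_suspect(live_page)` is a false positive, which is why the flag is best treated as "review this record" rather than "delete this URL".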

                • 5. Re: Running script through website to see if it is a 404
                  happyez

                  The only catch with labelling a site as 404 this way is that it only works when the page's code is downloaded as text (which is great when it is).

                   

                  For a JavaScript-driven site, it seems that any text produced by the script doesn't show up in the downloaded code. So those don't get labelled as 'suspect', even if they blaze 404 across the page.

                   

                  But otherwise, I followed what you said, and the extra info you gave helps too... and so far it is looking great.

                   

                  I also added a loop with a Pause Script step that waits for up to about a minute for the code to download. The loop keeps pausing until there is text in the code field, then the script moves on.

                   

                  Thanks for writing back again, Mike.

                  • 6. Re: Running script through website to see if it is a 404
                    mikebeargie

                    For Insert From URL you shouldn't need to pause the script. It should not continue to the next script step until the page code has loaded and been inserted into the target field.

                     

                    The pause WAS necessary in your original method, where you would have needed to wait for a webviewer to load first before scraping.

                     

                    As for the JS, look at the code of the page to see if there is any discernable pattern that you can identify. Even if there's a tag like <body onLoad('not found')>, you can scrape it if it's unique.

                     

                    Adding multiple conditions will make your script more robust and reliable for catching all types of 404s.
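
One way to narrow the match, sketched here in Python under stated assumptions: instead of searching the whole page, look only inside the `<title>` element and require "404" to stand alone as a number. The choice of `<title>` as the "unique pattern" and the name `title_signals_not_found` are illustrative and not something every site will honour, so this is a heuristic, not a reliable test.

```python
import re

# Capture the contents of the first <title> element, if any.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# "404" as a standalone number, or the phrase "not found".
NOT_FOUND_RE = re.compile(r"\b404\b|not\s+found", re.IGNORECASE)


def title_signals_not_found(page_source):
    """Return True if the page title looks like an error-page title."""
    match = TITLE_RE.search(page_source)
    if match is None:
        return False  # No title at all: nothing to go on.
    return NOT_FOUND_RE.search(match.group(1)) is not None


error_page = "<html><head><title>404 - Page Not Found</title></head></html>"
live_page = "<html><head><title>Room 4045 booking</title></head></html>"
```

Because the body text is ignored, a live page that merely discusses 404 errors is less likely to be flagged, at the cost of missing error pages that keep a normal title.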

                    • 7. Re: Running script through website to see if it is a 404
                      happyez

                      Hi Mike

                       

                      Just a thing I noticed: Insert From URL doesn't seem to work on mine. Instead it shows error message 507, which I do find a bit hard to understand:

                      "Value in field failed calculation test of validation entry option"

                       

                      The field that Insert pulls the URL from ("URL") and the field it inserts into ("404_code") are both text fields without any validation. They are indexed, but I don't think that makes a difference. Just text fields. Maybe they need to be calculation fields?? Or maybe the URL has to not be in javascript...

                       

                      It does work quite well if I use GetLayoutObjectAttribute.

                       

                      I am rather excited to see Insert From URL in action though...they say it's got many possibilities.

                      • 8. Re: Running script through website to see if it is a 404
                        mikebeargie

                        It's Insert from URL

                         

                        You need to pass the script step a URL, as you would in the address bar of a browser (e.g. http://www.google.com or google.com or www.google.com).

                         

                        Not sure what you're trying to do passing in javascript.

                         

                        Here's the technical page on the script step:

                        http://www.filemaker.com/12help/html/scripts_ref1.36.46.html

                        • 9. Re: Running script through website to see if it is a 404
                          happyez

                          Hi Mike (sorry I didn't reply for a few days)

                           

                          Yes, it doesn't seem to accept a variable $url set from a field that has a URL in it (e.g. if <<URL>> contains "http://www.google.com", this isn't accepted...).

                           

                          Which may mean that if Set Variable/Set Field doesn't work, I might have to copy/paste instead... The step opens a calculation box, so I would have thought that just putting in the field where all the URLs are stored would suffice.

                           

                          Thanks for your help in working this out.

                          • 10. Re: Running script through website to see if it is a 404
                            mbraendle

                            Actually, the 404 status code is sent in the HTTP header (which you don't see in the browser or WebViewer, but which is interpreted by the browser).

                             

                            If you have FileMaker Server or Server Advanced, you could use the FM API for PHP and write a PHP script that

                             

                            - reads out the URL field from one or several records

                            - uses the PHP curl functions (see http://stackoverflow.com/questions/408405/easy-way-to-test-a-url-for-404-in-php ) to read out the HTTP header

                            - in case of a 404 (or a 302, redirect), updates a database field of the record(s) with the status

                             

                            This script could then be called in a web viewer or using an Open URL script step.

                             

                            (I remember having done that with XML/XSLT CWP and a Java extension function for XSLT.)
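
The header-based check can be sketched without PHP as well. The following is a minimal Python version using only the standard library, not the FM API for PHP or the curl approach from the linked answer. The names `fetch_status` and `classify_status` and the flag labels are hypothetical, and writing the result back into a FileMaker field is left out.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        # HEAD requests headers only. Some servers reject HEAD, so a GET
        # fallback may be needed in practice.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # urlopen follows redirects, so this is the final status.
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is on the exception.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def classify_status(code):
    """Map a status code to a flag value (labels are illustrative)."""
    if code is None:
        return "UNREACHABLE"
    if code == 404:
        return "NOT FOUND"
    if 300 <= code < 400:
        return "REDIRECT"
    if 200 <= code < 300:
        return "OK"
    return "SUSPECT"
```

A call such as `classify_status(fetch_status(url))` for each stored URL would give one flag per record. Note that sites which answer with status 200 and an error page in the body would still need the keyword check discussed earlier in the thread.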