5 Replies Latest reply on Jul 25, 2012 5:08 PM by BobGossom

    Convert Scraped Website Content to Just Text

    DrewTenenholz

      All --

       

      I have a text field that contains some scraped website content. (It displays well in a WebViewer, for example.) What I really need to do is to pull out the text content (which is really NOT styled text, but simply text) into a different field. Does anyone know any tricks for doing this?

       

      Sincerely,

      Drew Tenenholz

        • 1. Re: Convert Scraped Website Content to Just Text
          BenHutson

          Hi Drew,

           

          You will need a method to parse the text which you have retrieved from the website. Typically it will contain HTML tags and CSS styling. Which techniques you use to parse the data depends on the language the content is written in and how it is laid out.

           

          There is an interesting article here on how to parse any language:

           

          http://www.soliantconsulting.com/blog/2011/12/how-to-parse-json-or-any-other-language-in-filemaker

           

          There is also a good custom function here, which could be tweaked to suit the data you are trying to parse:

           

          http://www.briandunning.com/cf/559

           

          Otherwise you could post a sample of the data, and I or someone else could help you with a script to parse the information.

           

          Depending on whether you are using FM12 or a previous version, and on the end result you are trying to achieve, there may be different options for you to try.

           

          Best Regards

           

          Ben

          • 2. Re: Convert Scraped Website Content to Just Text
            DrewTenenholz

            Ben --

             

            Thanks for the links; it's useful to know that no one has already invented the particular wheel I need, which is what I had suspected.

             

            I've had quite a lot of experience with text parsing, enough to know that I don't want to go through the heck of it with this particular sub-piece of a larger project.  It's not that I find it hard, just tedious, and in this particular case pretty pointless. (I don't want to bore you with the details.)

             

            What I think I'd like to be able to do is take the 9,000 records with HTML content in them (from my set of 31,000) and basically run the routine from BBEdit called Markup>Utilities>Translate, which can convert Text <--> HTML, stripping/adding all valid HTML markup and converting entity codes.  It's slick, fast, and best of all, already done.

             

            So, can anyone help with either Automator or AppleScript (or, as a complete alternative, something that can run in ScriptMaster/Groovy) that can send a chunk of text out for processing and get me back the result?

             

            -- Drew Tenenholz

            • 3. Re: Convert Scraped Website Content to Just Text
              Mike_Mitchell

              I don't know about ScriptMaster, but what about using SmartPill and just running it through the PHP strip_tags function?

               

              Just a thought.

               

              Mike
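For readers who want to see what a strip-tags pass involves, here is a minimal sketch in Python (not the PHP function Mike mentions), using only the standard library. The names `TextExtractor` and `html_to_text` are hypothetical, the list of block-level tags is an arbitrary choice, and how the result gets back into a FileMaker field is not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumption: these tags should produce a line break in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<p>Fish &amp; chips</p><p>Caf&eacute; <b>open</b></p>"))
```

Unlike a plain tag remover, this also drops script and style bodies and decodes entity codes, which is closer to what the BBEdit Translate command is described as doing above.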

              • 4. Re: Convert Scraped Website Content to Just Text
                sporobolus

                on 2012-07-24 15:19 DrewTenenholz wrote

                So, can anyone help with either Automator or AppleScript (or completely alternatively something that can run in ScriptMaster/Groovy) that can send a chunk of text out for processing and get me back the result?

                 

                there are dozens of ways to do this; passing it to BBEdit is a fairly awkward option, but it would go something like this …

                 

                assuming you will loop through the records you'll process, before your loop send a message to BBEdit to create a blank document:

                 

                tell app "BBEdit" to make new document
                

                 

                then as you loop through your records, execute the following as Native AppleScript

                 

                   -- implied "tell app FMP" context
                   copy cellValue of cell "html" of current record to raw_html
                
                   tell application "BBEdit"
                     tell window 1 of front document
                       set contents of it to raw_html
                       translate html to text with tag removal and entity conversion without create new document
                       copy contents to the_text
                     end tell
                   end tell
                   display dialog the_text -- debugging aid; remove when looping over many records
                
                   copy the_text to cell "text" of current record
                

                 

                you might also consider exporting to a set of files and using a BBEdit text factory
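The export-then-batch idea can be sketched outside BBEdit as well. The following is a rough Python illustration, not a text factory: it assumes each record was exported as a UTF-8 `.html` file into one folder, and the names `strip_markup` and `convert_folder` are hypothetical. The regex-based stripping is deliberately crude (it would keep script and style bodies, for example), so treat it as a starting point only.

```python
import html
import re
from pathlib import Path

# Crude pattern: anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(source):
    """Drop tag-shaped text first, then decode entity codes."""
    return html.unescape(TAG_RE.sub("", source)).strip()


def convert_folder(in_dir, out_dir):
    """Write one .txt file into out_dir for every .html file in in_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob("*.html")):
        dest = out_path / (src.stem + ".txt")
        text = strip_markup(src.read_text(encoding="utf-8"))
        dest.write_text(text, encoding="utf-8")
        written.append(dest)
    return written
```

Tags are removed before entities are decoded so that an encoded `&lt;b&gt;` in the page text survives as literal text instead of being stripped.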

                • 5. Re: Convert Scraped Website Content to Just Text
                  BobGossom

                  Drew,

                   

                  What I do is navigate to the website with the web viewer, get the source of the page with a Set Field step and GetLayoutObjectAttribute ( "<object name>" ; "Content" ), and then parse it in FMP. Note that you have to name the web viewer object.

                   

                  I like not having to involve any 3rd party software/processes (beyond the default browser, linked to the Web Viewer).

                   

                  Bob Gossom