2 Replies Latest reply on Jan 23, 2013 12:44 PM by Stephen Huston

    Data Capture vs OCR Software

    c2m2e

      I’m appealing to all my techie contacts for help with finding the right software to solve a problem. I don’t know if one software product can accomplish the two tasks we need done, but I think I’m looking for either an OCR product or a Data Capture product or both.

       

      The first task we need to address is taking existing PDFs and match-importing their data into our FileMaker Pro database. I’ve experimented with OCR software (ABBYY FineReader Express for Mac) to get the PDFs converted to Excel spreadsheets. The problem is that, while it works fine with PDFs already in a table or spreadsheet-type format, it doesn’t work well with PDFs in other formats. For example, we have tradeshow attendance directories that are laid out in non-uniform columns. Using the OCR software leaves the city, state, and zip all in one cell, and the first name, last name, and title in another. This won’t work for database importing (unless there is a way around that).

       

      The second task we have is similar, but I don’t think OCR software will work. We have access to web-based membership lists we also want to match import to our database. They can’t be downloaded and printing as PDF leads to the same problem as above. I’m not experienced with data capture software but from what I’ve read it seems like it could work. I just can’t find one for Mac with a demo that I can test. ListGrabber (http://www.egrabber.com/listgrabberstandard/) looks like the right type of product, but they don’t have a Mac version.

       

      I would greatly appreciate any insight or suggestions!

       

      Many Thanks!

        • 1. Re: Data Capture vs OCR Software
          RubenVanDenBoogaard

          Hi Eric,

           

          A PDF can contain selectable text and/or images.  If you use the PDF function in OS X you get a PDF with selectable text (unless you are making a PDF of an image).  Try opening the PDF in Preview and use the text tool to select the text.

           

          If the PDF already contains text, you can grab the text portion of the PDF using the 360Works Scribe plugin.

           

          If the text grabbed from the pdf has a logical order, you could use a looping script to split the text into records.
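
          A minimal sketch of such a splitting loop, written in Python rather than as a FileMaker script. It assumes each directory entry is a block of lines separated from the next by a blank line; that layout, the name `split_records`, and the sample text are all hypothetical, so the rule for where a record ends would need adjusting to the real PDF text.

```python
def split_records(text):
    """Split extracted text into records, one per blank-line-separated block."""
    records = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the record being collected.
            records.append(current)
            current = []
    if current:
        # The last record may not be followed by a blank line.
        records.append(current)
    return records


# Made-up sample in the assumed layout.
sample = """Jane Doe, Sales Manager
Acme Corp
Springfield, IL 62701

John Smith, Buyer
Widget Co
Portland, OR 97201
"""

for record in split_records(sample):
    print(record)
```

          Each record comes back as a list of lines, which can then be mapped to fields before importing.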

           

          If the PDF is an image (because you ran a document through a scanner or something), you can use ABBYY FineReader to read the text and use the above steps to split it into separate records. OCR text is less regular (and less error-free) than text embedded in a PDF.
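
          For the combined cells described in the original post (city, state, and zip in one cell; first name, last name, and title in another), a rough Python sketch of the field split might look like this. It assumes US-style "City, ST 12345" values and "First Last, Title" values; both patterns and both function names are assumptions for illustration, and OCR errors will cause some cells not to match, so unmatched cells return `None` for manual review.

```python
import re

# Assumed layout: "City, ST 12345" or "City, ST 12345-6789".
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def split_city_state_zip(cell):
    """Return (city, state, zip), or None if the cell does not match."""
    match = CITY_STATE_ZIP.match(cell)
    if match is None:
        return None  # leave for manual review rather than guess
    return match.group("city"), match.group("state"), match.group("zip")


def split_name_title(cell):
    """Split 'First Last, Title' into (first, last, title).

    Assumes the title follows the first comma and the last name is the
    final word of the name part.
    """
    name, _, title = cell.partition(",")
    parts = name.split()
    if len(parts) < 2:
        return None
    return " ".join(parts[:-1]), parts[-1], title.strip()


print(split_city_state_zip("Springfield, IL 62701"))
print(split_name_title("Jane Doe, Sales Manager"))
```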

           

           

          There are several ways to grab text from the web.  You can grab the source of a Web Viewer, or use a plugin to get the source of a URL.
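
          Once the page source is in hand, the rows still have to be pulled out of the HTML. Here is a small sketch using only Python's standard library, assuming the membership list is rendered as an HTML table (an assumption; the real page may use other markup). The names `TableRows`, `rows_from_html`, and `rows_from_url` are made up for this example, and `rows_from_url` will not work as-is for pages behind a login.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_from_html(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_from_url(url):
    """Fetch a page and parse its table rows (no login handling)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return rows_from_html(response.read().decode(charset, errors="replace"))
```

          The same `rows_from_html` function can be fed source text copied out of a Web Viewer instead of fetched by URL.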

           

          Hope that helps,

           

          Best regards,

           

          Ruben van den Boogaard

          Infomatics Software

          ruben@infomatics.nl

          • 2. Re: Data Capture vs OCR Software
            Stephen Huston

            Eric Cunningham wrote, in part:

             

            The second task we have is similar, but I don’t think OCR software will work.  We have access to web-based membership lists we also want to match import to our database.  They can’t be downloaded and printing as PDF leads to the same problem as above.

             

            I have worked with some web-based systems where the reports are also available as comma-separated value (.CSV) files. These can be imported into FileMaker without a hitch. Try contacting your web source's support team to see if CSVs can be made available to you.
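
            If the source cannot supply CSVs and the records end up being extracted some other way, they can still be written out as CSV before importing. A minimal Python sketch, assuming each record is a dictionary; the field names and the function name `records_to_csv` are hypothetical.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Return CSV text with a header row and one line per record dict."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up record; values containing commas are quoted by the csv module.
members = [{"first": "Jane", "last": "Doe", "city": "Springfield, IL"}]
print(records_to_csv(members, ["first", "last", "city"]))
```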