    Converting Word docs into records



      I have a client who has their data in Word documents. Specifically, each document constitutes a record and contains a four-column table. Generally, columns 1 & 3 are field names (the labels do have colons), while columns 2 & 4 are data. There are some exceptions:

      the 12th row has a sketch in columns 3 & 4

      the 20th row has a picture in columns 1 & 2

      Several other cells are merged either vertically or horizontally

      My difficulty is that even after I have it in a spreadsheet, the data isn't in the normal layout of rows for records and columns for fields, because these documents were actually created as reports. There are several thousand documents with about 35 fields each. Does anyone have ideas on how to make this data importable to a db?


        • 1. Re: Converting Word docs into records

          It depends on how consistently the data is arranged in the files. Word documents can be formatted any way the user wants, so if you have different formats in different files, you have a real mess.

          Here are some very general suggestions:

          Open the documents and save them as plain text with the .txt file extension.

          Put them in a folder in the documents directory.

          Experiment with Get ( DocumentsPath ) to figure out how to get a return-separated list of just the file names in that folder in Documents.

          Use a loop to load each file name and its file path, one at a time, into a variable that you can use with an Import Records step to import the data as tab-delimited data. Import the data into an interim table of text fields.

          Once you can get the data from the entire batch of documents imported, set up a script that moves the data into the correct records of the actual working table or tables where such data should be stored. Depending on the complexity of your data and tables, this may be a fairly straightforward script that uses Set Field By Name to load fields with data, or it could be very complex.
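
          As an alternative to doing the reshaping inside FileMaker, the same steps (list the files, loop over them, read the tab-delimited text, move label/value pairs into one record per document) could be sketched outside it, here in Python. This is only a rough sketch under assumptions: it assumes the plain-text export puts one table row per line with tabs between cells, that every label ends in a colon, and the names parse_record, folder_to_csv and source_file are made up for illustration. Rows without a colon-terminated label (the sketch and picture rows, for instance) are simply skipped.

```python
import csv
from pathlib import Path


def parse_record(text):
    """Turn tab-delimited rows of (label, value, label, value) into one dict."""
    record = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("\t")]
        # Walk the cells in (label, value) pairs: columns 1-2, then 3-4.
        for label, value in zip(cells[0::2], cells[1::2]):
            # Assumption: real field labels end with a colon; anything else is skipped.
            if label.endswith(":"):
                record[label.rstrip(":").strip()] = value
    return record


def folder_to_csv(folder, out_path):
    """Write one CSV row per .txt file in folder; return the number of rows."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        record = parse_record(path.read_text(encoding="utf-8", errors="replace"))
        record["source_file"] = path.name  # hypothetical extra column for tracing
        records.append(record)

    # Collect every field name seen, in first-seen order, for the header row.
    fieldnames = []
    for record in records:
        for name in record:
            if name not in fieldnames:
                fieldnames.append(name)

    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

          The resulting CSV has one row per document and one column per label, which is the shape an ordinary import expects. Documents that are missing a field get an empty value in that column, so inconsistent files show up as gaps rather than shifted data.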

          • 2. Re: Converting Word docs into records

            You are on a Mac. Is the client also on a Mac, or did they provide a Word .doc created on Windows?

            TextWrangler is a free, scriptable text editor with automation capabilities, from the company that created BBEdit.

            Automator is an automation tool included with OS X.   https://www.google.com/search?q=automator

            Exploring the results of all the export options in Word on one of the files could be useful.

            Excel could remove columns one and three.   https://www.google.com/search?q=os+x+automator+excel
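
            For comparison, dropping columns one and three can also be done on a tab-delimited export without Excel. The Python sketch below is only an illustration (values_only is a made-up name): it keeps the data cells in columns 2 and 4 and discards the labels. Because the labels are gone, this only works if every document lists its fields in the same order, so position alone identifies each field.

```python
import csv
import io


def values_only(tsv_text):
    """Return the data cells (columns 2 and 4) of each row, in reading order."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    values = []
    for row in reader:
        # Indexes 1 and 3 are columns 2 and 4; shorter rows yield fewer cells.
        values.extend(row[1:4:2])
    return values
```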