1 Reply Latest reply on Aug 13, 2014 2:48 AM by Mike_Mitchell

    Extracting/Filtering Data based on Dates (but not in a date field)

    brianj77

      I have some data which was exported from an Outlook file (extracted to .csv format)

      I’m trying to extract/filter all messages within a date range but the date range is embedded in the body of the email so when I imported it I have one big text field containing text and dates

       

      These dates covered are 2010 - 2014 but I need to filter out the data for only 2013-2014 i.e. any ‘conversation or contact’ which took pace within July 2013 to July 2014 using the embedded dates as the date of the contact.

       

      Most of the data contains the ‘Sent’ with the dates (Sent: Wednesday, January 11, 2012 4:31 AM) but not always. Sometimes it is just ‘mentioned’ on the body of the text

       

       

      Typical email body data (with identities hidden for this purpose)

      Other data is just regular email ’threaded conversations’

       

      =======body data==========================

       

      From: “Web Enquiries" <enquiries@1111111.com>

      To: <asasasas@2222.net>

      Sent: Wednesday, January 11, 2012 4:31 AM

       

      …..text in body of email etc ……….

       

      blah blah i arrive on 21 March 2012 ...blah blah

      =================================

       

       

      Can anyone suggest a solution please?

       

      Thanks for your help

       

      Cheers

      Brian

        • 1. Re: Extracting/Filtering Data based on Dates (but not in a date field)
          Mike_Mitchell

          Brian -

           

          Having seen just a snippet, I can't give you precise instructions, but it appears you have a text parsing exercise. I would suggest a process along these lines:

           

          1) Determine if you have a string that can serve as a delimiter between messages. Perhaps you can use something like:

           

               From: "

           

          If all the messages begin with that, then you can use it to mark the beginning of a message.

           

          2) Using the Position and Middle functions, loop over the text in your Ginormous Monolithic Text Field and separate out the messages into individual records in another table. Just go from one occurrence of your delimiter to the next.

           

          3) You can use a similar process using the string Sent: to extract the date information (i.e., anything between Sent: and the next carriage return).

           

          The ugly part would be trying to extract a date that's embedded in the body text. This is especially true if you have a variety of formats involved. For example, if you have one person who says, "21 March 2012", another who writes "3/21/2012" and another who writes "March 21st", you'll have a mess on your hands. (I haven't seen the data, so I can't say for certain that's what you have.) But one thing you can try is use a return-delimited list of all month names, like this:

           

          January

          February

          March

          April

          May

           

          and so forth. Loop over that list and see if you can detect any instances of those words in the body text. If you find one, check to see if there are any 2- or 4-digit numbers in the range you need (13, 14, 2013, 2014). Flag those records for later examination.

           

          Someone else may have a better idea, but trying to extract variably-formatted text from a random jumble is usually a matter of trial and error based on what's in the particular data stream.

           

          HTH

           

          Mike

          1 of 1 people found this helpful