4 Replies Latest reply on Aug 16, 2011 11:57 AM by CloudMeister

    Duplicate Records Scenario

    CloudMeister

      Hi all,

      Since I am still relatively new to FileMaker, I would like to see whether any of you have a better idea of how to handle the following scenario:

      The Problem:

      The customer database has no customers table, but it contains some 24,000 orders, each including customer address information.

      The data contains wild typographical errors and inconsistent styles (for example, "East Northbrook Drive" vs. "E. Northbrook Dr.").

      The Vision:

      Extract unique customers from that database into a customer table, and relate that table to the orders table.

      My Solution:

      Using the 360Works ScriptMaster plugin module TextSimilarity, we can calculate how similar two records are. We compare the following fields: Billing Company Name, Billing Street Address, and Billing ZIP Code.
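      I don't know TextSimilarity's internals, but as a rough stand-in for the idea of field-by-field similarity scoring, here is a minimal Python sketch using the standard library's difflib.SequenceMatcher (the record field names are my own invention, not from the actual file):

      ```python
      from difflib import SequenceMatcher

      def similarity(a: str, b: str) -> float:
          """Similarity ratio in [0, 1]; 1.0 means the strings match exactly
          (comparison is case-insensitive)."""
          return SequenceMatcher(None, a.lower(), b.lower()).ratio()

      def record_similarity(rec_a: dict, rec_b: dict) -> float:
          """Average the field-level ratios over the three compared fields."""
          fields = ("billing_company", "billing_street", "billing_zip")
          return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

      a = {"billing_company": "Acme Corp",
           "billing_street": "East Northbrook Drive",
           "billing_zip": "60062"}
      b = {"billing_company": "Acme Corp.",
           "billing_street": "E. Northbrook Dr.",
           "billing_zip": "60062"}
      print(record_similarity(a, b))  # somewhere between 0 and 1
      ```

      A pair like the one above scores well below 1.0 on the street field alone, which is why a tuned threshold, rather than exact matching, is needed.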

      To compare the 24,000 records to each other, I have written a script that does the following:

      1) Sets a variable for a draft customer ID (this variable is incremented as it is used).

      2) Loops through all records that do not yet have a draft customer ID assigned, giving similar records the same draft ID. Records with no similar counterparts get their own individual ID.
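      The two steps above can be sketched as follows. This is a minimal Python illustration, not the actual FileMaker script; the field names, the threshold, and the difflib-based score are all my assumptions:

      ```python
      from difflib import SequenceMatcher

      FIELDS = ("billing_company", "billing_street", "billing_zip")

      def record_similarity(a: dict, b: dict) -> float:
          """Mean per-field similarity; a stand-in for the plugin's score."""
          return sum(
              SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in FIELDS
          ) / len(FIELDS)

      def assign_draft_ids(records: list, threshold: float = 0.85) -> None:
          """Walk the records; whenever one has no draft ID yet, give it the
          next ID and share that ID with every later unassigned record that
          scores above the threshold."""
          next_id = 1
          for i, rec in enumerate(records):
              if "cust_id" in rec:
                  continue
              rec["cust_id"] = next_id
              for other in records[i + 1:]:
                  if "cust_id" not in other and record_similarity(rec, other) >= threshold:
                      other["cust_id"] = next_id
              next_id += 1

      records = [
          {"billing_company": "Acme Corp", "billing_street": "East Northbrook Drive", "billing_zip": "60062"},
          {"billing_company": "Acme Corp.", "billing_street": "East Northbrook Drive", "billing_zip": "60062"},
          {"billing_company": "Zeta GmbH", "billing_street": "Hauptstrasse 5", "billing_zip": "10115"},
      ]
      assign_draft_ids(records)
      print([r["cust_id"] for r in records])  # → [1, 1, 2]
      ```

      The inner loop over all later unassigned records is what makes the run quadratic, which matches the multi-day runtime.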

      The script is running as I write this. Even though it is running locally on one computer, it looks like it will take more than three days.

      After that, a human will still have to go through the individual records to check for errors and make corrections as needed.

      As this is an extremely time-consuming method, I am looking for something a little more performant. Any ideas?

       

      Regards Ernst


        • 1. Re: Duplicate Records Scenario
          philmodjunk

          No matter how you slice it, I think you'll need that human element to review and evaluate possible "near matches" to see if they are truly identical given how "messy" your data is.

          I do suggest you create a related customer table. Instead of using the variable to assign customer IDs, I'd create records in this table and use an auto-entered serial number field to assign the customer ID numbers. The script could use the data in your existing table to create records with the customer name and other customer-specific data, while also assigning a unique customer ID number.
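          The table structure Phil describes can be mimicked outside FileMaker; here is a small SQLite sketch in which an integer primary key plays the role of the auto-entered serial number (table and column names are illustrative, and the sample rows are made up):

          ```python
          import sqlite3

          conn = sqlite3.connect(":memory:")
          conn.executescript("""
          CREATE TABLE customers (
              customer_id INTEGER PRIMARY KEY,  -- stands in for the auto-entered serial
              company TEXT, street TEXT, zip TEXT
          );
          CREATE TABLE orders (
              order_id INTEGER PRIMARY KEY,
              customer_id INTEGER REFERENCES customers(customer_id),
              company TEXT, street TEXT, zip TEXT
          );
          """)

          # Order rows with embedded address data, as in the original file.
          conn.executemany(
              "INSERT INTO orders (order_id, company, street, zip) VALUES (?, ?, ?, ?)",
              [(1, "Acme Corp", "East Northbrook Drive", "60062"),
               (2, "Acme Corp", "East Northbrook Drive", "60062")])

          # One customer record per distinct address; the serial key becomes
          # the customer ID and is written back onto the matching orders.
          distinct = conn.execute(
              "SELECT DISTINCT company, street, zip FROM orders").fetchall()
          for company, street, zip_code in distinct:
              cur = conn.execute(
                  "INSERT INTO customers (company, street, zip) VALUES (?, ?, ?)",
                  (company, street, zip_code))
              conn.execute(
                  "UPDATE orders SET customer_id = ? "
                  "WHERE company = ? AND street = ? AND zip = ?",
                  (cur.lastrowid, company, street, zip_code))
          conn.commit()
          ```

          In real data the addresses would first need the fuzzy draft-ID pass, since a `DISTINCT` on messy strings would still treat "East Northbrook Drive" and "E. Northbrook Dr." as two customers.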

          During the review process, fields with similar name/address data can be fairly easily combined to link to a single record in this new table with the help of some fairly simple scripting.

          This won't speed up the process at all, but should leave you with a much better table structure when you are done that you can use from this point going forward to better manage your data.

          • 2. Re: Duplicate Records Scenario
            CloudMeister

             Thanks for the reply, Phil.

             Actually, I am doing something quite similar. The script now creates the customer ID, which will serve not only to identify probable unique candidates but also as the UID for the customer table.

             In other words, using the draft customer ID, I will sort out all of the customer information and replicate it into the new customer table.

             However, because my script analyzes all records that have not yet been assigned (25,000 at the start), it takes almost two minutes to complete the loop for a single record. Even assuming that assigning duplicates speeds things up as it goes, at that rate it could take up to 13 days to complete the cycle.
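             To put rough numbers on the pairwise cost, and to sketch one common mitigation, blocking on an exact field such as ZIP so similarity is only scored within each block (field name assumed, and whether ZIP is clean enough to block on is a judgment call for this data set):

             ```python
             from collections import defaultdict

             def comparisons(n: int) -> int:
                 """Pairwise comparisons needed for n records: n choose 2."""
                 return n * (n - 1) // 2

             def block_by_zip(records: list) -> dict:
                 """Group records by normalized ZIP; candidates for the same
                 customer almost always share a ZIP, so cross-block
                 comparisons can be skipped entirely."""
                 blocks = defaultdict(list)
                 for rec in records:
                     blocks[rec["billing_zip"].strip()].append(rec)
                 return blocks

             print(comparisons(25_000))      # 312487500 comparisons done naively
             print(1_000 * comparisons(25))  # 300000 if split into 1,000 blocks of 25
             ```

             The trade-off is that a typo in the ZIP itself would hide a duplicate, so a reviewer still matters.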

            Any ideas how to speed that up?

             • 3. Re: Duplicate Records Scenario
              philmodjunk

               Not without knowing a lot more about how your table is designed, how the script is written, and how the TextSimilarity tool works.

               • 4. Re: Duplicate Records Scenario
                CloudMeister

                 I just found my solution; regrettably, it has nothing to do with FileMaker.

                 For anyone trying to eliminate duplicate records that are merely similar, not necessarily exact duplicates, the solution is FuzzyDupes!

                 It works extremely fast and gives high-quality output.

                http://www.kroll-software.de/products/fuzzydupes_e.asp

                 Check it out!