5 Replies Latest reply on Dec 10, 2011 1:21 PM by old_cedar

    Handling and Managing Duplicates


      I'm attempting to write a script for a database that currently contains half a million records, with more added all the time. The script is supposed to handle and manage duplicates for the company's backup database. What they want is for the script to find out whether a record has duplicates (possibly several); if so, the script compares the comments. If the comments are identical, the script deletes the secondary record; if they are different, the comments are merged. The primary record is determined by a numerical field containing dates, known as "last updated". The part I'm currently struggling with is identifying the last updated field within the script: how would I base the merge on the comments as well as the contents of the last updated field? If anyone has some helpful advice, it would be most appreciated.

      I'm also toying with a variation of this script that handles and manages duplicates in a much different manner: I duplicate the table in the relationships graph, create a self-join linked on a unique (key) field, and then put a portal of the duplicated table on a form layout of the original table, so you can scroll through and get an idea of how many duplicates there are. I'm also trying to get the script to go to all of the portal records and combine their fields. If anybody could offer some tips, hints, or any form of advice whatsoever, I'd greatly appreciate it.


      Copies of the script I'm using can be made available upon request.

        • 1. Re: Handling and Managing Duplicates

          Hi, user (sorry, no name given):


          Does each record have a unique ID? How do you determine in the first place whether to try comparing two records?


          If it is as "simple" as searching for duplicate unique IDs, then you just use your self-relationship based on unique ID to compare the last updated date of the current record with that of the first related matching record, and similarly the contents of the comments in both. Roughly speaking, you could loop through found records, tag the records to be merged and deleted, then go back and merge and delete the tagged records.
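The loop-tag-merge pass described above can be sketched outside FileMaker to check the logic. This is a minimal Python sketch, not the poster's actual script; the record structure and field names (unique_id, last_updated, comments) are assumptions standing in for the real schema:

```python
from collections import defaultdict

def merge_duplicates(records):
    """Group records by unique ID; keep the most recently updated as primary.
    If a duplicate's comments differ from the primary's, append them to the
    primary; either way the duplicate is dropped from the result."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["unique_id"]].append(rec)

    merged = []
    for recs in groups.values():
        # Newest last_updated first: that record becomes the primary.
        recs.sort(key=lambda r: r["last_updated"], reverse=True)
        primary = recs[0]
        for dup in recs[1:]:
            if dup["comments"] and dup["comments"] != primary["comments"]:
                primary["comments"] += "\n" + dup["comments"]
        merged.append(primary)
    return merged
```

In FileMaker terms, the grouping corresponds to the self-relationship on the unique ID, and the sort corresponds to comparing the last updated field of the current record against the related records.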


          I'm sure there's more to your situation, though.




          Erik Wegweiser

          Intelligent Database

          Boston, MA


          • 2. Re: Handling and Managing Duplicates

            I'd use a number of steps.


            1. Export the table and experiment there until I'm happy the result is correct.

            2. Add a "utility" field, which can be used to store a tag/note.

            3. Create a relationship based on the criteria necessary to determine what a duplicate is.

            4. Use the relationship to identify any duplicates.

            5. Show only the duplicates.

            6. Loop through them, comparing each with its parent and modifying the parent record if necessary. Tag the changed parent records.

            7. Locate the changed records and check the results.

            8. If the results are OK, import the changed fields in the changed records back to the source.
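Steps 4 through 6 are the heart of that plan. Here is a minimal Python sketch of the tagging pass; the field names and the tag values written into the "utility" field are assumptions for illustration, and in FileMaker this would be a looping script writing into that utility field:

```python
def tag_duplicates(records):
    """Walk the records in order: the first occurrence of each key becomes
    the parent; later occurrences are tagged as duplicates. Parents whose
    comments were extended are tagged 'changed' so they can be found and
    re-imported later (steps 7-8)."""
    seen = {}  # key -> parent record
    for rec in records:
        key = rec["unique_id"]
        if key not in seen:
            seen[key] = rec
            rec["utility"] = "parent"
        else:
            rec["utility"] = "duplicate"
            parent = seen[key]
            if rec["comments"] not in parent["comments"]:
                parent["comments"] += "\n" + rec["comments"]
                parent["utility"] = "changed"
    return records
```

Locating the changed records (step 7) then reduces to finding records whose utility field reads "changed".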




            • 3. Re: Handling and Managing Duplicates

              I thought of marking them, but with so many records it didn't seem feasible. The question I asked recently I've already taken care of; I now have a much trickier problem at hand. If you wouldn't mind helping me with it, please respond to this email with your thoughts; if not, I appreciate your initial help and wish you the best.

              My script currently runs through all the records: it sorts them, then goes from the first record to the next, scanning for duplicates using an extended chain of If functions. If a record proves to be a duplicate, it then merges the comments. If the record is deleted, the remaining records naturally shift up to take its place, putting the marker on the very record you next need to compare to the primary record; but the script doesn't know that, so it takes a step forward and has effectively skipped a record. This can be counteracted by placing a Go to Record step so that the marker is put back on the primary record. One would imagine that this would solve the weird pattern in which the script merges the duplicates, but it hasn't, so that's where I'm at, trying to get it to work.
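The skipping behavior described here is the classic delete-while-stepping-forward bug, and it is easy to see in miniature outside FileMaker. In this Python sketch the list stands in for the sorted found set and the index for the current-record marker; it is an illustration of the pattern, not the poster's script:

```python
def delete_dups_buggy(values):
    """Mirrors the problem: deleting the current record and then stepping
    forward skips the record that slid into the freed slot."""
    i = 0
    while i < len(values):
        if values[i] == "dup":
            del values[i]
        i += 1  # bug: still advances even after a delete
    return values

def delete_dups_fixed(values):
    """Fix: only advance when nothing was deleted; this is the equivalent
    of not doing Go to Next Record after Delete Record, since the next
    record has already moved into the current position."""
    i = 0
    while i < len(values):
        if values[i] == "dup":
            del values[i]  # the next record moves into position i
        else:
            i += 1
    return values
```

With two duplicates in a row, the buggy version leaves one behind, which may explain the "weird pattern" in which some duplicates survive the merge.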

              • 4. Re: Handling and Managing Duplicates

                How would I set up a relationship?

                • 5. Re: Handling and Managing Duplicates



                  The relationship method is simple and quick.  Assume that you have a unique identifier field [ tablex::UniqueID ] and a creation timestamp field [ tablex::cTime ].


                  1.     Set up a self-join relationship on that field. [tablex::UniqueID to tablex::UniqueID, which will result in tablex::UniqueID = tablex 2::UniqueID]
                          Remember, there may be more than one duplicate.


                  2.     In the relationship, set the sort order on the creation timestamp to descending. [newest at the top]
                          This maintains the older record [the original] as the preferred, maintained record.


                  3.     Create a third field, Warning, where [ If( tablex::cTime = tablex 2::cTime, "", "Duplicate" ) ]

                          [Remember, when not specified, only the first record of a relationship is referenced in an equation.]


                  4.     Use a script trigger to execute a script that merges the records [fields] when the Warning field is modified.  [blank to "Duplicate" and back to blank]


                  5.     Be sure the script also deletes the offending duplicate record.


                  Note: Use the Exact function to compare the respective fields you want to merge.  If there is an exact match, then there is no need to merge or to execute further when the Warning field changes back to blank.  Remember, it is the related record that triggers the "Duplicate" warning and is the duplicate.


                  This will run automatically whenever a record is active; it is meant for ongoing problems.

                  For cleaning up past issues, set up a batch [found-set] looping script that finds all "Duplicate" records and executes the merge.  Exit the loop when there are no more found records.
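As a sanity check of that batch logic, here is a minimal Python sketch. The record structure is an assumption, plain equality stands in for FileMaker's Exact() function, and in FileMaker this would be a Perform Find followed by a loop over the found set:

```python
def batch_cleanup(records):
    """Repeatedly 'find' flagged duplicates, merge each into its earlier
    counterpart, delete it, and exit when the find comes up empty."""
    def find_duplicates():
        by_key = {}
        flagged = []
        for rec in records:
            key = rec["unique_id"]
            if key in by_key:
                flagged.append((by_key[key], rec))  # (original, duplicate)
            else:
                by_key[key] = rec
        return flagged

    while True:
        flagged = find_duplicates()
        if not flagged:  # no more found records: exit the loop
            break
        for original, dup in flagged:
            if original["comments"] != dup["comments"]:  # Exact() stand-in
                original["comments"] += "\n" + dup["comments"]
            records.remove(dup)  # delete the offending duplicate
    return records
```

The outer loop re-runs the find after each pass, which matches exiting the batch script only when the found set is empty.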


                  Be warned: such cleanups on large record volumes require some time.  The amount of time depends greatly on the system components and your scripting skills.  Good luck.


                  Message was edited by: old_cedar