4 Replies Latest reply on Jul 28, 2009 8:15 AM by LaRetta_1

    how to compare and erase duplicate records based on multiple fields

    var

      Title

      how to compare and erase duplicate records based on multiple fields

      Post

      The database contains 5 fields:

       

      SLIDE ID  

      ARTIST NAME

      PLACE

      DATE

      DESCRIPTION

       

      the issue being.... upon the last import --- all records, instead of the last 150, were imported from a copy of the original file---- thus creating duplicates of 100000+ records........

      the problem being... some of the field entries had been modified on the original(i.e. spelling of artist name, corrected place or date, etc.....).

      Solution?:  in order to "delete" all records whereby the fields match exactly and  "mark" all records which are duplicates based on the SLIDE ID field but were modified in one or more of the other fields indicated above (i.e. ARTIST NAME, PLACE, DATE, DESCRIPTION).....  At that point I could manually delete the small number of records which had been altered without having to wade through the entire database.   

      Also... in order for this not to happen in the future.... what is the appropraite script to run when you are wanting to import only a selected number of records? the business had previously had fmp files for each year-- which were later merged into one file, but 1994 had somehow been skipped--- so i would need to import only that year.  I have now created a new field = DATE MODIFIED, so in the future it would be easier to see the latest modification if duplicates were to arise again......

      ANY SUGGESTIONS?? OR ROBUST SCRIPTS?  

      This position has now been allocated to me and I am now trying to teach myself the appropriate protocols.

       

        • 1. Re: how to compare and erase duplicate records based on multiple fields
          LaRetta_1
            

          Hi Var,

           

          Is the SlideID the unique serial within the file?  If so, were the records within FileMaker modified or were the incoming records modified BEFORE they were imported?

           

          You can find all duplicate SlideIDs (if it should be unique) by performing a find for ! within the SlideID field.  If the record set is unsorted, the first will be the original and the second will be the duplicate. 

           

          I don't want to go into specific script for you until I am sure we are on same page here.  We need clearer rules here.  Are changes made (before you get them) to the imported records or are changes made to the records within FileMaker and which record do you want to keep?  For future, if you always only want to ADD records into FM which aren't already there (and NOT update records with external changes), then you can set the SlideID to validation 'unique' and 'validate always' and you won't get duplicates any more. 

          • 2. Re: how to compare and erase duplicate records based on multiple fields
            var
              

            Perhaps i should elaborate on the contents of the database.......  it is a Photographic Archive of works from the 1970's to the present, whereby the SLIDE ID consists of the Year - Roll # of the film/Contact sheet.... (i.e.  1977-310), whereby the  other fields pertain to the artists name, location, date, etc....... when the image was taken.   Thru time certain names of people have been added, spelling corrections, or specific places and dates changed or entered based upon the "accuracy" and "updates" needed from when the original database was configured.  Therefore, all fields have the potential to be updated as "errors" are found or "corrections" are needed but the SLIDE ID- is the unique identifier and is rarely, if ever, changed.  The issue is that some of the data had been altered prior to and after recent SLIDE IDs were imported from a copy to the original.  Only the SLIDE IDs which were "new" (this years/2009 entries) were meant to be imported, but instead the entire database became duplicated (except for the last 300 entries- since they did not exist in the original).  What i was trying to do is omit all of the entires whose fields are exactly redundant, and then detect the SLIDE IDs which are redudant by the ID, but where the other fields have been changed- and then omit the entry which is incorrect .  Like I stated, only 100-200 fields should have been changed (where errors were found or additions needed) in the past month---- but the archive consists of 100s of 1000s of SLIDE IDs/rolls of film/contact sheets.    

            This is the first time that this has happened, usually "new" data is entered only to the "original", but due to an assignment abroad- the only copy of the database i had at the time was one i had on a drive from a back-up that had been performed 3 weeks prior-  therefore, only the "new" entries were meant to be imported to the "original" when i returned to the office where the original file resides.  

            • 3. Re: how to compare and erase duplicate records based on multiple fields
                

              @var:

               

              do you really need to compare all the records? The reason i ask is that when you import records the new record will always be saved in order of entry. So the first batch will contain 1977-nnn...2008-nnn then another batch will have exactly that same order again 1977-nnn...2008-nnn and then your newest records will be shown in the DB. So you have 3 groups of records in your DB. Now supposing that you only changed data in the original (thus first group of) records, the way to find the duplicate records that you want to delete:

              Find all records

              Go to the first record and copy ID ... enter find mode and find that ID, that should result in 2 records.

              The second record should be the first of the unwanted records, so that's the record you should select.

              Find all records again ... now choose "omit multiple records" from the menu and enter the current recordnumber minus one (the current recordnumber is in the statusarea) en click omit. The records will be omitted and if there is actually no difference between the old records and the records you have imported to much, the record that is selected after the omit should be the first record of 2009 and the found set is the set of records that you would like to keep.

              Now find the omitted records and throw them away.

               

              When editing a database this large, people do a find first and then edit the result..  that's usually the first and oldest record, so by keeping the oldest record you're pretty safe.

               

              To prevent this happening in the future, start using a real recordID. There are several ways to do that but my favorite way to use an entry-timestamp-field plus a serialnumber-field and concatenate these two fields to form the ID and if you want to make it even more unique you could add a field with your NIC: Substitute ( GetValue ( Get ( SystemNICAddress ) ; 1 ) ; ":" ; "" ) and add that to the concatenation. (When you import data from a duplicate database make sure that you import with autoenter-features off)

               

              For making sure that you don't import duplicate i cannot give you a certain recipe, there are lost of possible ways to do that. Mostly it depends on your way of working, the demands you have etc. Using a real recordID is just a start :smileyhappy:

               

              regards, Menno

              • 4. Re: how to compare and erase duplicate records based on multiple fields
                LaRetta_1
                   You say you want to ‘manually delete the small number of records which had been altered.’  Then you say that the ‘issue is that some of the data had been altered prior to and after recent SLIDE IDs were imported from a copy to the original. ‘ If there are two identical SlideID records (one was already in your table and the other was imported in) and changes might have been made to either the first OR the second record (and you have no modification date) then how will you know which to delete?  You can concatenate all the fields together in a calculation (result is text) as: SlideID & “_” & Name & “_” & … etc Then perform a search on this calculation for duplicates – using !, sort by this calc and keep the first one, deleting the rest (using a loop) because they will be exact duplicates and are unnecessary.  I wouldn’t bother with a relationship since this is a one-time issue.  You will be left with records which are 1) either unique and there is only one of them or 2) two copies of the same SlideID which must be manually compared to determine which to keep. To find the second set of duplicates, base it on SlideID this time and search for !.  Sort the records on SlideID and begin comparing the two records.  And always back up before working on your data like this. 

                To import only 1994 from another FMP file, open the other file and perform a find first for the records and THEN import.  And always back up first.  You used the phrase that your SlideID is “rarely, if ever, changed“, indicating that it consisted of “the Year - Roll # of the film/Contact sheet.... (i.e.  1977-310).”  If it can EVER EVER change, it should not be used as the uniqueID.