7 Replies Latest reply on May 11, 2017 8:06 AM by Cécile

    Script to manage duplicates before uploading 1200 new contacts

    Cécile

      Background

      I "inherited" a  database of about 40k contacts (logged in inconstistently over 17y. by a succession of employees) where there might be already several duplicates. I haven't had time in the last 18 mo that I have started this position to actually figure out a process to clean up the old stuff but when I upload a new complaint, I normally check the DB to see if the person record already exists. Duplicates can be spotted by similar addresses, emails, etc. When we only have a name to go by, if the writing style is strikingly similar, we assume it is the same person otherwise we create a new record. If the name and email is the same but the address is different, we combine them, put in the file notes the old address and destroy the duplicate.

       

      Relevant for scripting purposes

      So I received a massive amount of complaints for a specific event which is impossible for me alone to log one by one so I have created a process to extract all the info from the emails. It is now in an Excel table which I prepped to upload the contacts and create the new complaints "en masse".

       

      I want to find a way to:

       

      1. Compare the new contacts from the Excel list to those in the database
      2. Get the list of all the contacts for which there might already be a record(s) in the DB

       

      Bonus points!

      I don't think - but don't know - if this list of contacts for which there might already be a corresponding record in the DB will be big so I am planning to go through that list manually. However if the list was more than 30 contacts, it would be great to have the result showing like this:

       

      Full Name
      (xls)
      Full Name
      (FM)
      First Name
      (xls)
      First Name
      (FM)
      Last Name
      (xls)

      Last Name

      (FM)

      Prov.
      (xls)
      Prov.
      FM

      Email

      (xls)

      Email

      FM

      Bill BrownWilliam BrownBillWilliamBrownBrownONONBWB@me.caBWB@me.ca
      Bill BrownBill H. BrownBillBill H.BrownBrownONONbbrown@gmail.com
      Bill BrownHenry Klout-BrownBillHenryBrownKlout-BrownONMBBWB@me.caBWB@me.ca
      Claire KittClaire KittsClaireClaireKittKittsBCBCkittyc@live.com
      Claire KittClara KClaireClaraKittKBCBC
      Claire KittClaireKittkittyc@live.comkittyc@live.com
      Claire KittMary C KittsClaireMary CKittKitts
        • 1. Re: Script to manage duplicates before uploading 1200 new contacts
          mikebeargie

          All of this is possible, but some of it could be highly complex.

           

          Usually how I handle this is uploading the contacts into a temporary table, and establish a temporary relationship to my actual contacts table based on a list of potential duplicate keys.

           

          I then present the user a list of the temporary contacts where there are >0 matches in the related table, requiring them to manually approve or deny those contacts for import. I usually display a portal of the related records so the user can decide if it's a match or not. Usually displaying full name, address and phone/email for both the temp record and the suspect duplicates.

           

          Once the user is "done", they manually press a button that deletes any duplicates, and imports the temp table into the main contacts table.

           

          If you want a columnar view like you displayed you will need to do a pivot table, which is the really complex part.

           

          Usually I establish relationships based on an auto-enter "listkey" field, with a calc like this:

          List(
          Filter ( email ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

          Filter ( email2 ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

          Filter ( fullname ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

          Filter ( phone1 ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

          Filter ( phone2 ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

          Filter ( cell ; 1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP")

          )

          I do that on both the temporary and production side, essentially stripping any special characters or spaces out of those fields and putting them in to a list so you can relate the temp side to the production side.

          • 2. Re: Script to manage duplicates before uploading 1200 new contacts
            Cécile

            Thank you for responding so quickly. I forgot to mention that I am a beginner at developping in FM... I am not sure I understand what you mean by temporary and production side. Like a dev DB and a production DB?

            I am not fully grasping the knowledge surrounding the components of that solution.

             

            Let me rephrase to see if I understood enough what I am supposed to do so that I do not do a big booboo.

             

            1) Ok I create a table: ContactsToCheck

            • with the following fields: CtC_Email, CtC_Phone, CtC_Full name, CtC_Province, CtC_First Name, CtC_Last Name, CtC_ListKey, CtC_id, _CiDid).
              (where _ as a suffix means a foreign key)

             

            2) In my existing ContactsInDatabasetable:

            • I have the fields CiD_Email, CiD_Phone, CiD_Full name, CiD_Province, CiD_First Name, CiDLast Name, CiDFull Name,CiD_id
            • I add the following fields CiD_ListKey, _CtCid).

             

            3) In the fields CiD_ListKey and CtC_ListKey which I set as calculation fields, I put the following calculation:

             

            List (

            Filter ( CtC_Email;1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

            [etc.])

            )

            List (

            Filter ( CiD_Email;1234567890qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP");

            [etc.])

            )

             

            4) I create a joined[<-is this what it is?]tablePotential_Duplicates

            • with the following fields: PD_ListKeyISsimilar, PD_ListKeyISsame, CiD_ListKey, CtC_ListKey, _CtCid, _CiDid).

             

            5) I establish a relationship by linking PD_id to  CiD_id and CtC_id fields in their respective tables

             

            6) I make a layout: POTENTIAL DUPLICATES based on table Potential_Duplicates,

            • with request results of If PD_ListKeyISsimilar=Yes
            • presenting fields ContactsToCheck::CtC_id, ContactsInDatabase::CiD_id, PD_ListKeyISsame (Yes/No)
            • making the _id fields mouse over pop-ups where contact info is presented in the pop-up

             

            7) User can press button that will refine list to

            • request results of If PD_ListKeyISsame=Yes
            • presenting fields ContactsToCheck::CtC_id, ContactsToCheck::CtC_Full Name, ContactsInDatabase::CiD_id

             

            So is this ok? I am very unfamiliar with creating relationship without damaging the db. (Yes I work on the live one so I usually do that stuff in the morning so we can restore if I make a bad change... I know, I know... I need to learn how to work on the development db but I don't know how to proceed when it looses synchronicity with the active db. So I usually practice on a local db and then repeat my steps in the live one. I know it is not good practice. But I am crash teaching myself bits by bits...

            • 3. Re: Script to manage duplicates before uploading 1200 new contacts
              mikebeargie

              CTC is your temp table, by production I meant the final "in production" data that your users are seeing, that would be your CiD table

               

              You were going well until step #4. There is no third table here, you should be joining CTC to CID based on the ListKey field that you created in both table.

               

              Then from a list view based on CTC, show a portal of records from CID.

              1 of 1 people found this helpful
              • 4. Re: Script to manage duplicates before uploading 1200 new contacts
                Cécile

                Ok. I will work on this and post back if I hit a road block. Thank you!!

                • 5. Re: Script to manage duplicates before uploading 1200 new contacts
                  richardsrussell

                  De-duping is the bane of database developers everywhere. Long term, you're going to want to be proactive rather than reactive, and even then you'll find yourself having to deal with this kind of problem way more often than any sentient being should ever have to. But here are a couple of ideas that will help lessen the pain:

                   

                  (1) Start off your data-entry process by clicking on a "Create" button that will take you to a screen that displays only 1 field, a Global Text field called "Family Name Default". There type in the family name of the person you're thinking about adding to your database. Then click on a "Continue" button that will either (A) start the creation of a new record (with the default name already entered in "Family Name") if no other person with that family name already exists in the database OR (B) take you to another screen with a portal that displays all the people on file whose "Family Name" matches the default you just typed in. Next to each such name will be a "See" button that'll take you to that person's record, if you think it's the same person you're currently considering. At the bottom of the screen will be another "Continue" button that will re-enact Option A above if none of the people in the portal are the ones you want.

                   

                  (2) In addition to "Family Name", have another field called "Former Name", into which you can store the former family name of someone who's adopted a new one (as frequently happens in cases of marriage and divorce).

                   

                  There are some tricks that allow you to find all occurrences of multiple-word family names (like "Smith Jones" bringing up all Smiths and all Joneses) or hyphenated ones (like "Miller-Sanchez" bringing up all Millers and all Sanchezes). If you'd like to see a sample file of how this works, e-mail me directly at RichardSRussell@tds.net and I'll send you a file that does it.

                  • 6. Re: Script to manage duplicates before uploading 1200 new contacts
                    Cécile

                    Those are interesting ideas but not applicable for us. We receive complaints in 5 modalities: letter, email, fax, webform contact form, directly or redirected by other organisations. As a result, fields cannot be mandatory and people love to make it un-standardizable. We even get anonymous and instead of just calling them anonymous, I started to nickname them to their recognizable trait (Mr. Hates Turtles, Ms. Profanity Crusade, etc.) so that we can put together the "Frequent Filers" material or ease the search when we stumble on a complaint made by a "real" full name person which sounds strangely similar to an anonymous case.

                    Basically it has to be done by a human being and we have to tolerate a certain level of messy. So that is why I took the habit of checking for duplicate at creation time. But with 1200 new records at once, I have to find a way to reduce the list of those to check to only the "potential ones", hence my question.

                    • 7. Re: Script to manage duplicates before uploading 1200 new contacts
                      Cécile

                      Well that was interesting....

                       

                      listKey.jpg

                      The results of the listkey calculation as suggested above (I added the missing quote in front of the string). It finds the first similarity and stops there. Giving the weirdest associations. Since in the filtering list email comes first, we can see here that it looked for matching digits and returned pairs that match on that respect.