5 Replies Latest reply on Jun 3, 2015 3:17 PM by philmodjunk

    Retrieving a list of duplicate values via a relationship

    calexmac

      Title

      Retrieving a list of duplicate values via a relationship

      Post

      I am trying to work out how to retrieve a list of duplicate values in the fastest possible way from a big table and I would like the duplicated values in a value list.

       

      The table in question has a lot of rows (~25 million).  The table has a primary key field - PKey -  and a field - X - that can legitimately hold duplicate values.

       

      As a result of prior steps, I have a value list of primary keys, and I would like to get all values of X from the records that have a primary key represented in this list.  This operation gets run many times for different primary key value lists, so speed is paramount.

       

      Using the primary key value list as a multi-key into the big table works well.  Its fast.

      I have also tested FMPs duplicate query operator - ! - and that works well - its also fast, even when operating on millions of records.

      But this is where I run aground…. I want the results in a list (not just a found set).  So I could do that from either the match TO side of the relationship or from inside the big table itself.

       

      Both have a problem.  Using the relationship, I don’t know how to configure the relationship (based on the primary key value list) to find only the duplicated X values (if that were possible, it would be easy to get the list - which is a very fast operation).  And from the context of the table itself, the ! operator will find all the duplicates, but then getting them into a list - and quickly - is a problem.

       

      Any suggestions?

       

        • 1. Re: Retrieving a list of duplicate values via a relationship
          philmodjunk

          Instead of a value list, why not a portal placed inside a popover. That will take up no more room than a value list formatted field, but provides display options not possible for a value list.

          But your criteria for specifying what values will appear in that value list remain the main challenge. Can you describe in more detail what you are trying to do with this list?

          • 2. Re: Retrieving a list of duplicate values via a relationship
            calexmac

            The value list will be used elsewhere in the database, not for presentation to a user, but as input to another set of scripts.  These scripts are performing locality sensitive hashing - i.e. finding similar items amongst a large data set.  The 'duplicates' in the list are the similar items that have been found in the ~25 million records.  

            Ideally I would like to get the duplicates in a list back via the relationship.  This would be tidy and, if supported by some native FMP config, likely fast (that is an assumption).  But I simply cannot figure that out.

            So my current test bench is experimenting with a 4 step process:

            1  a GTRR to go to the big table (which is fast enough)

            2  a !-based duplicate constrain query on an unsaved calculated field (acting as a proxy) (which is sort of fast)

            3  a transfer of the target field from the found set via a summary function using 'as a list' (not particularly fast)

            4   and then a plug-in custom function that returns only the uniques from the list of step 3 (which is very fast).

            Speed-wise, these combined steps are ok, but not great. When the GTRR lands on ~80k of records (a typical number), then steps 2-4 take about 1 second.  That might sound quick, and it is certainly better than what I have achieved before.... but in a single simulation run (which this module is part of), these steps will be executed almost 50k times, so the run time is about 14 hours.

             

            • 3. Re: Retrieving a list of duplicate values via a relationship
              philmodjunk

              The value list will be used elsewhere in the database, not for presentation to a user, but as input to another set of scripts.  These scripts are performing locality sensitive hashing - i.e. finding similar items amongst a large data set.  The 'duplicates' in the list are the similar items that have been found in the ~25 million records.  

              Don't see why that needs a value list to do if the user is not selecting from that value list. That seems a needless complication.

              I'm wondering if a self join relationship combined with a script can "flag" records that are "duplicates" at the time the record is created. If the number of related records via the self join is 1, it's unique. If it's 2, both records need to be flagged. If it's more than 2, the new record needs to be flagged. Your relationship would then match to the "flag" value. This handles each new record as it is added or each existing record as it's modified (though the logic would be a bit more complex for records that get changed). For a large set of existing records, you need to set up a one time batch operation that does the needed flagging. That one time operation could take a looong time, but once done, this might then work...

              • 4. Re: Retrieving a list of duplicate values via a relationship
                calexmac

                That is an interesting idea.  And we have just prototyped it and indeed it does work on a test bench.  In production, new records are added to the big table regularly, so I we need to think through how that process would be affected.  A fair few downstream scripts written by others in the team rely on the duplicates in a list, so the refactoring effort is not small... 

                • 5. Re: Retrieving a list of duplicate values via a relationship
                  philmodjunk

                  And don't forget that both editing data and deleting records can result in a change in which records have duplicates...