
    Need Fast Method to Compare MultiKey Lists

    jlb

      I am looking to see if anyone has a fast method for comparing two long/large multikey fields.

       

      In the current situation:

       

      FieldA - Multi-Key List of SN's - 31,961

      FieldB - Multi-Key List of SN's - 14,281

       

      FieldC - Multi-Key List of SN's from A, not in B.

       

      I have tried the custom function ZapValues from Brian Dunning's website, and

      I created a script that loops through the lists

       

      (Script Image Attached)

      Screen Shot 2011-12-12 at 4.27.29 PM.png

       

      Both approaches result in the correct answer for FieldC, but they come with a HUGE time expense.

       

      Any ideas on a better (faster) method?

       

      JLB

        • 1. Re: Need Fast Method to Compare MultiKey Lists
          usbc

          You could use a calc field based on PatternCount.

          Start the calc with a Case statement to determine which field has the most entries, then branch the calculation appropriately.

          Alternatively, you could use the same approach with a Replace.
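
          A rough sketch of that idea as a recursive custom function (the function and parameter names are placeholders; it skips the "which list is longer" branching and just shows the PatternCount test applied to each value):

            // Hypothetical custom function: ValuesNotInList ( listA ; listB )
            // Returns the values of listA (duplicates kept) that do not appear in listB.
            // Note: plain recursion like this hits FileMaker's recursion limit long before
            // 30,000+ values, so it only illustrates the idea.
            Case (
              IsEmpty ( listA ) ; "" ;
              Let ( [
                v = GetValue ( listA ; 1 ) ;
                rest = RightValues ( listA ; ValueCount ( listA ) - 1 )
              ] ;
                If ( PatternCount ( ¶ & listB & ¶ ; ¶ & v & ¶ ) = 0 ; v & ¶ ; "" )
                  & ValuesNotInList ( rest ; listB )
              )
            )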

           

          Chuck

          • 2. Re: Need Fast Method to Compare MultiKey Lists
            BruceHerbach

            Can you use a relationship? Put the FieldB list into a global field and set a not-equal relationship to FieldA. Then use the List function to pull what is left. I think it should be all of the FieldA values that don't have a match in FieldB.

             

            Bruce

            • 3. Re: Need Fast Method to Compare MultiKey Lists
              mrwatson-gbs

              If you just want to compare the fields for equality (as sets ignoring repeated values, i.e. "1¶2¶1" equals "1¶2"), you can:

               

              1. Create 2 value lists - one based on FieldA, one based on FieldB. To separate the values from the contents of other records you can either:
                1. copy the two values to a one-record table, so that the value lists are based on only the required content, or
                2. create a self-relation "MyTable.ThisRecord" (where ID=ID) and refer to MyTable.ThisRecord::FieldA and MyTable.ThisRecord::FieldB
              2. Compare the contents of the two value lists using:

                ValueListItems ( Get( FileName ) ; "FieldA" ) = ValueListItems ( Get( FileName ) ; "FieldB" )

                where "FieldA" and "FieldB" are the names of the two value lists.
              • 4. Re: Need Fast Method to Compare MultiKey Lists
                thosliot

                Bruce

                 

                Unfortunately that won't work - in fact the relationship is effectively a Cartesian join, since for each value in the global there is at least one matching (i.e. non-equal) value in FieldA.

                 

                cheers

                 

                Tom

                 

                On 13 Dec 2011, at 05:40, BruceHerbach wrote

                 

                Can you use a relationship? Put the FieldB list into a global field and set a not-equal relationship to FieldA. Then use the List function to pull what is left. I think it should be all of the FieldA values that don't have a match in FieldB.

                 

                • 5. Re: Need Fast Method to Compare MultiKey Lists
                  BruceHerbach

                  Tom,

                   

                  I had a similar problem to solve: I set up a global field to hold a list of IDs and compared them against a list of IDs for an event. Attached is an image of the relationship. The left side of the bottom entry is the global field used to store the current list of IDs. The right side is the ID field from a TO with the available IDs. Doing a List of the zzid field from the right side yields just the IDs not in the global field. From your description, I think this is what you are looking for. I used a script to pull a list of IDs to put in the global; this would be your FieldA list. The right side would be a TO showing the source of your FieldB list.

                   

                  The key to this may be the way FileMaker is handling it. You are correct that for any single ID all the other IDs should show up. What happens here is that FileMaker seems to look at the list as a whole: if an ID in the global field has a match in the list of records on the right side, that record is eliminated from the found set, so the rest of the IDs no longer see it and it doesn't show up when I use the List function to get a list of IDs. In short, it doesn't act as a Cartesian join; it just shows the list of non-matching IDs.

                   

                  The drawback to this is that you end up adding TOs to the relationship graph. The advantage is that it can be very fast, as long as the field is indexed.
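
                  If the relationship behaves as described above, the "pull what is left" step is just a List ( ) over the related TO. A minimal sketch, with invented TO and field names (gIDList is the global multikey on the left, SourceB::SerialNumber the indexed key on the right):

                    // Hypothetical names; relationship: gIDList ≠ SourceB::SerialNumber
                    // Assuming the ≠ multikey relationship filters as described in this reply,
                    // this returns the related serial numbers that have no match in the global list.
                    List ( SourceB::SerialNumber )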

                   

                  Hope this is helpful

                  Bruce

                  • 6. Re: Need Fast Method to Compare MultiKey Lists
                    SEG-IT-Support

                    When you're using the custom function, do you know how long it takes to compare both lists? What counts as a "huge" time expense for you? I've tried something similar in one of my solutions using a calculated field (indexed) and it didn't seem so bad; it took approximately 10 seconds the first time to compare both lists (approx 10,000 values).

                    • 7. Re: Need Fast Method to Compare MultiKey Lists
                      jlb

                      Huge = minutes. 

                       

                      Of course, this depends on the number of values in each list, but I need a method that will not cause the system to show a spinning wheel, as has been the case for every approach (with the exception of the loop script I created, which just took wait time, no wheel). The time elapsed to run the loop script was 4 minutes and 41 seconds. From my testing of other methods, I get similar execution times.

                       

                      This tool will be used by the executive management team to forecast the # of recipients who are targeted to get a campaign. It will be unacceptable to have them wait an unknown period of time for the results to be returned.

                       

                      I am really open to any approach.

                      I will be testing some other theories to achieve what I need, including exporting records to a single "working" table, creating TOs, calcs, etc. to get my list, grabbing the unique list for my needs, and then deleting the working record.

                       

                      Thanks for all the replies. I will post my method of choice once I find one that meets the speed requirements I must have.

                       

                      JLB

                      • 8. Re: Need Fast Method to Compare MultiKey Lists

                        Hi JLB,

                         

                        Have you tried the AntiFilterValues Custom Function from Brian Dunning's website?

                         

                        It would at least avoid the scripted routine that you needed for the ZapValues CF.

                         

                        The author of AntiFilterValues, Bruce Robertson, claims it is fast, and he won't have made that claim lightly. You are, however, going to provide it with a sizeable challenge, so we'll be interested to see what gains you can achieve.
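
                        For anyone following along, usage would presumably look something like the sketch below. The parameter order is an assumption on my part, mirroring FileMaker's own FilterValues ( textToFilter ; filterValues ); check the definition on Brian Dunning's site:

                          // Assumed parameter order, mirroring FilterValues ( textToFilter ; filterValues )
                          // Result: the values of FieldA that are not present in FieldB
                          AntiFilterValues ( FieldA ; FieldB )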

                         

                        HTH,

                         

                        John

                        • 9. Re: Need Fast Method to Compare MultiKey Lists
                          jlb

                          John,

                           

                          I did see that function earlier, but had not used it. Since you mentioned it, I thought I would incorporate it now to test speed and to allow other viewers to get an idea of how long it took to execute.

                           

                          As stated in my first post:

                          FieldA - Multi-Key List of SN's - 31,961

                          FieldB - Multi-Key List of SN's - 14,281

                           

                          I did see a vast improvement; the time to execute was 56 seconds the first time and 57 seconds the second test. The resulting value list was also correct.

                           

                          Ran a second test with a smaller list for FieldB:

                          FieldA - Multi-Key List of SN's - 31,961

                          FieldB - Multi-Key List of SN's - 202 (much smaller)

                           

                          Time elapsed was only 1 second.

                           

                          Obviously, the size of the lists is a factor in speed. However, in most situations that arise, my lists are going to be on the larger side, simply due to the nature of our business. I'm not completely satisfied that I've found the best approach, but this technique will definitely be in the tool belt from now on.

                           

                          For other viewers: the AntiFilterValues custom function works great; keep in mind that the size of your own field lists will drive performance.

                           

                          JLB

                          • 10. Re: Need Fast Method to Compare MultiKey Lists
                            thosliot

                            Bruce,

                             

                            I'm sorry I misunderstood, I thought you meant a relationship between the two lists of values, rather than between one list and the source of the other.

                             

                            (BTW it's not me that's looking for a solution, but 'JLB')

                             

                            cheers

                             

                            Tom

                            • 11. Re: Need Fast Method to Compare MultiKey Lists

                              Hi JLB,

                               

                              Thank you for the feedback. I'm pleased it has worked for you given the challenge that your task posed.

                               

                              Also good to know that Bruce's claim for speed was in fact realised with your lengthy lists.

                               

                              Cheers,

                               

                              John

                              • 12. Re: Need Fast Method to Compare MultiKey Lists
                                sporobolus

                                on 2011-12-12 15:30 jlb wrote

                                FieldA - Multi-Key List of SN's - 31,961

                                FieldB - Multi-Key List of SN's - 14,281

                                 

                                FieldC - Multi-Key List of SN's from A, not in B.

                                 

                                I have tried the custom function ZapValues from Brian Dunning's website, and

                                  I created a script that loops through the lists

                                 

                                unless you can find a way for this to be handled by an index, this is the sort of problem which tends to perform poorly in FileMaker once the data set reaches some "largish" size

                                 

                                if you are running Mac OS X, i would consider using the shell, where solutions are often a few orders of magnitude faster than FileMaker; there is also a rich world of tutorial information available; after a quick search i chose to test the three solutions proposed here, which use comm, awk, or a short Python script:

                                 

                                http://stackoverflow.com/questions/2159988/finding-set-complement-in-unix

                                 

                                in summary i tested with sets similar to yours and found that all the suggested approaches timed at roughly 1/20 second of CPU time; awk & Python produced correct results, though the awk result retained duplicates from the A set, while the Python result was de-duped; comm's result was wrong

                                 

                                to use any of these, you'd have to export the multikeys as two files, a.txt and b.txt for example, and the line ends must be converted from CR (the default on Macs) to LF — a shell script can do this:

                                 

                                tr '\r' '\n' <filemakeroutput.txt >readyforshell.txt
                                

                                 

                                (and before you re-import the result, do the opposite)

                                 

                                you can see my test method and results below; i modified the Python script to make it easier to execute from FileMaker; comm included some keys from the B set; i didn't diagnose it fully at first, but it looked like the unwanted keys had fewer digits than most keys. update: comm's failure had to do with lines repeated more times in the A set than in the B set; if you have no dupes, this won't be a problem

                                 

                                i set this up in a BBEdit shell worksheet, so shell commands are intermingled with results below; the following will be a good test of Jive's code markup from email

                                 

                                # create random sets of keys (including some dupes)
                                jot -r 35000 -p 100000 > /tmp/a.txt
                                jot -r 15000 -p 100000 > /tmp/b.txt
                                
                                # how quickly does this awk expression produce (A - B)?
                                time awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/b.txt /tmp/a.txt >/tmp/c.txt
                                
                                real     0m0.072s
                                user     0m0.056s
                                sys     0m0.005s
                                
                                
                                # how quickly does comm produce (A - B)?
                                time comm -23 <(sort /tmp/a.txt) <(sort /tmp/b.txt) >/tmp/d.txt
                                
                                real     0m0.064s
                                user     0m0.056s
                                sys     0m0.011s
                                
                                # how quickly does this python script produce (A - B)?
                                time /usr/bin/python -c "open('/tmp/e.txt', 'w').write(''.join(set(open('/tmp/a.txt')) - set(open('/tmp/b.txt'))))"
                                
                                real     0m0.081s
                                user     0m0.036s
                                sys     0m0.019s
                                
                                # note the different lengths of the result sets
                                wc -l /tmp/c.txt /tmp/d.txt /tmp/e.txt
                                    30180 /tmp/c.txt
                                    30877 /tmp/d.txt
                                    25451 /tmp/e.txt
                                    86508 total
                                
                                # are there dupes in the results? compare to numbers above ...
                                # awk produced duplicates
                                sort -u /tmp/c.txt | wc -l
                                    25451
                                
                                # comm produced duplicates
                                sort -u /tmp/d.txt | wc -l
                                    26081
                                
                                # Python did not produce duplicates
                                sort -du /tmp/e.txt | wc -l
                                    25451
                                
                                # apply the same set operations to check results ...
                                # check the results by computing the complement of each result against the two original sets;
                                # when the set sizes are compared we should see:
                                # (C - A) = null
                                # (C - B) = C
                                
                                # check using awk again
                                # awk result is good
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/a.txt /tmp/c.txt | wc -l
                                        0
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/b.txt /tmp/c.txt | wc -l
                                    30180
                                
                                # comm result is in error on the comparison to B ...
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/a.txt /tmp/d.txt | wc -l
                                        0
                                # (C - B) < C
                                # !! comm has produced a set with some elements of B in the result
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/b.txt /tmp/d.txt | wc -l
                                    30180
                                
                                # Python result is good
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/a.txt /tmp/e.txt | wc -l
                                        0
                                awk 'FNR==NR{a[$0];next} (!($0 in a))' /tmp/b.txt /tmp/e.txt | wc -l
                                    25451
                                
                                # check by using comm again (we already don't trust comm)
                                # awk result is good
                                comm -23 <(sort /tmp/c.txt) <(sort /tmp/a.txt) | wc -l
                                        0
                                comm -23 <(sort /tmp/c.txt) <(sort /tmp/b.txt) | wc -l
                                    30180
                                
                                # comm result is in error again, though comm computes it differently
                                comm -23 <(sort /tmp/d.txt) <(sort /tmp/a.txt) | wc -l
                                        0
                                comm -23 <(sort /tmp/d.txt) <(sort /tmp/b.txt) | wc -l
                                    30247
                                
                                # Python is good
                                comm -23 <(sort /tmp/e.txt) <(sort /tmp/a.txt) | wc -l
                                        0
                                comm -23 <(sort /tmp/e.txt) <(sort /tmp/b.txt) | wc -l
                                    25451
                                
                                # check the results by using Python again
                                # awk result is good
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/c.txt')) - set(open('/tmp/a.txt'))))" ; wc -l /tmp/f.txt
                                        0 /tmp/f.txt
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/c.txt')) - set(open('/tmp/b.txt'))))" ; wc -l /tmp/f.txt
                                    25451 /tmp/f.txt
                                
                                # comm result is in error again
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/d.txt')) - set(open('/tmp/a.txt'))))"; wc -l /tmp/f.txt
                                        0 /tmp/f.txt
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/d.txt')) - set(open('/tmp/b.txt'))))"; wc -l /tmp/f.txt
                                    25451 /tmp/f.txt
                                
                                # Python result is good
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/e.txt')) - set(open('/tmp/a.txt'))))"; wc -l /tmp/f.txt
                                        0 /tmp/f.txt
                                /usr/bin/python -c "open('/tmp/f.txt', 'w').write(''.join(set(open('/tmp/e.txt')) - set(open('/tmp/b.txt'))))"; wc -l /tmp/f.txt
                                    25451 /tmp/f.txt
                                
                                
                                

                                 

                                Message was edited by: steve harley (fixed line breaks Jive introduced)

                                 

                                Message was edited by: steve harley (found reason for comm failure)

                                • 13. Re: Need Fast Method to Compare MultiKey Lists
                                  sporobolus

                                  just to note that using {code} did not stop Jive from botching the long lines in my email above; i will try to fix it on the website so cut & paste will work
