and what I want to do is either mark or delete the redundant phrases as below.
Can you give an example? I don't see any phrases marked or deleted as redundant. Do you mean that "The cat sat on the" is redundant because another record has "The Cat sat on the mat"?
Yes, that's right. The first record should be retained; the others are redundant and thus should be marked.
However, if there were a record "The cat sat on the" with a frequency of 2 then that should not be marked or deleted as there might be another word following e.g. "stairs".
Sorry if not clear!
So the following entries in red are redundant and the ones in black are not?
The cat sat
The cat sat on the mat
The cat sat on the stairs
Do you have FileMaker Advanced? I can conceive of a simple custom function and a self join relationship that will successfully match values between records.
cfWordList ( Words )
Case ( WordCount ( Words ) < 2 ; Words ;
List ( LeftWords ( Words ; 1 ) ; cfWordList ( RightWords ( Words ; WordCount ( Words ) - 1 ) ) ) )
Then you can define a multi-value key field, cPhraseMatchKey, as: cfWordList ( YourTextField )
Define a calculation field with a number result, cWordCount, as: WordCount ( YourTextField ). I think you have this field already.
and use both fields in a relationship defined like this:
YourTable::cPhraseMatchKey = YourTable 2::cPhraseMatchKey AND
YourTable::cWordCount < YourTable 2::cWordCount
Note, if I have this correct, the second pair of match fields should allow "The Cat Sat" to match to "The Cat Sat On The Mat" but not the reverse.
This, BTW, can also be used to calculate the frequency. You can then find all records where the frequency is greater than 1 and use Go To Related Records to bring up a group of redundant phrase records. Sort them by word count so that you keep the phrase with the most words, then mark or delete the rest.
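For anyone who wants to check the matching logic outside FileMaker, here is a rough Python sketch of what cfWordList returns and how the two relationship predicates combine. It is only an approximation under stated assumptions: Python's split() stands in for FileMaker's word functions (which treat punctuation differently), the comparison is assumed to be case-insensitive, and the names cf_word_list and related are hypothetical helpers, not FileMaker features.

```python
def cf_word_list(words: str) -> list[str]:
    """Mimic the recursive custom function: first word, then recurse on the rest."""
    parts = words.split()  # assumption: whitespace split approximates FileMaker's word rules
    if len(parts) < 2:
        return parts
    return [parts[0]] + cf_word_list(" ".join(parts[1:]))


def related(a: str, b: str) -> bool:
    """Mimic the relationship from record a to record b.

    First predicate: the multi-value keys share at least one value.
    Second predicate: a has fewer words than b.
    """
    key_a = {w.lower() for w in cf_word_list(a)}  # assumption: case-insensitive match
    key_b = {w.lower() for w in cf_word_list(b)}
    return bool(key_a & key_b) and len(a.split()) < len(b.split())


print(cf_word_list("The cat sat"))
print(related("The Cat Sat", "The Cat Sat On The Mat"))
print(related("The Cat Sat On The Mat", "The Cat Sat"))
```

Note that in this sketch a multi-value key matches as soon as any single word is shared, so two phrases that only have "The" in common would also be related when the word counts differ. The related set may therefore need a further check that the shorter phrase really occurs inside the longer one.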
Thanks again for your input. I think you have what I'm after. I already have the frequencies. And my definition of redundancy is where an identical string of words is contained within another record with the same frequency.
So if I have a string of 25 words all with a frequency of 25, the longest one is to be retained and all the others are to be marked for deletion.
However, if any of the strings have a frequency of less than 25, then they should not be marked for deletion.
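The redundancy rule described above can be sketched in Python as a sanity check before building it in FileMaker. This is a minimal illustration, not a finished solution: it assumes "contained within" means the words appear as a contiguous run anywhere in a longer phrase, it compares case-insensitively, and the function names and the sample frequencies are made up for the example.

```python
def contains_phrase(short: list[str], long: list[str]) -> bool:
    """True if the words of `short` appear as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def mark_redundant(records: list[tuple[str, int]]) -> list[bool]:
    """Flag each (phrase, frequency) record that is redundant.

    A record is redundant when a longer phrase with the same frequency
    contains it. A differing frequency means it is kept.
    """
    tokenised = [(phrase.lower().split(), freq) for phrase, freq in records]
    flags = []
    for words, freq in tokenised:
        flags.append(any(
            len(other) > len(words)
            and other_freq == freq
            and contains_phrase(words, other)
            for other, other_freq in tokenised
        ))
    return flags


# Hypothetical sample data: the frequencies are invented for illustration.
records = [
    ("The cat sat on the mat", 1),
    ("The cat sat on the", 1),   # same frequency as the longer phrase: redundant
    ("The cat sat on", 2),       # different frequency: kept
]
print(mark_redundant(records))
```

With this sample data, only the second record is flagged, because it is the only one contained in a longer phrase with an equal frequency.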
Is this the type of work you do professionally? If so perhaps I can consult you professionally for a "budget" solution off list?
You may click the icon to the left of my comments to send me a private message.