Similarity clustering problem

Question asked by mbraendle on Jan 27, 2012
Latest reply on Jul 24, 2014 by mr_scott

Finding records that are equal (e.g. are duplicates) using the ! operator and grouping them is a standard task in FileMaker.

However, I would like to cluster records that are similar (e.g. which may have some words in common). I.e. the task is to calculate a similarity measure and to relate the keys of the records which are above a given similarity threshold in a join table. And store that similarity measure too.

The calculation of the similarity measure is quite straightforward (and is well-known from Stanton's theory on text retrieval from the early 70ies):

1. for each text record, one extracts the unique words and stores them in a word-wordcount index (one can also exclude stopwords in this step), using a script. A relation is made between the text records and word-wordcount index via the text record key.
2. from the word-wordcount index, one creates the so-called reverse index, which contains only unique words over the whole record set (either by script, or by exporting and grouping the words outside FM and reimporting the deduplicated words). A relation is made between the reverse index and word-wordcount index using the word itself as key. In addition, in the word index the unique words are numbered sequentially (yielding a pointer for each word). The N unique words span a N-dimensional space.
3. Each text record now spans a N-dimensional vector of unique words, the components of the vector being the pointers of the unique words. Most of the components will be 0.
4. The similarity measure between two text records is now calculated Stanton's formula, just by determining the angle between the two vectors (or better the cosinus of the angle, which is the normalized scalar product of the two vectors). E.g. if the angle is 0 (or the cosinus is 1), the two text vectors perfectly match.

(I have used steps 1 and 2 in another context, see presentation given at german FileMaker Konferenz 2010).

The problem of this approach is that one runs easily into a combinatorial explosion with database size. The number of similarity measures to calculate grows as N*(N-1)/2 for N records.

E.g. for a database with about 67'000 records as I have one would have to calculate 2.27 billion combinations.

Even if one restricts oneself to records that have only the same number of words, I still need to calculate 167 million combinations.

I can still break down the clustering by language (stored in a separate field), but that will not help much.

Do you have any brilliant idea how one could accelerate the process? Maybe using a multi-self-join?

Best regards,

Martin