7 Replies Latest reply on Jul 24, 2014 4:14 PM by mr_scott

# Similarity clustering problem

Finding records that are equal (i.e. duplicates) using the ! operator and grouping them is a standard task in FileMaker.

However, I would like to cluster records that are merely similar (e.g. records that have some words in common). The task is therefore to calculate a similarity measure, to relate the keys of record pairs whose similarity lies above a given threshold in a join table, and to store that similarity measure as well.

The calculation of the similarity measure itself is quite straightforward (and is well known from Salton's work on text retrieval from the early 1970s):

1. For each text record, extract the unique words with a script and store them in a word/word-count index (stopwords can also be excluded in this step). A relation is made between the text records and the word/word-count index via the text record key.
2. From the word/word-count index, create the so-called reverse index, which contains only the words that are unique across the whole record set (either by script, or by exporting the words, deduplicating them outside FileMaker, and reimporting them). A relation is made between the reverse index and the word/word-count index using the word itself as key. In addition, the unique words in the reverse index are numbered sequentially, yielding a pointer for each word. The N unique words span an N-dimensional space.
3. Each text record now corresponds to an N-dimensional vector: the pointers of its unique words select the vector components, and the word counts supply the component values. Most of the components will be 0.
4. The similarity measure between two text records is then calculated with Salton's formula, by determining the angle between the two vectors (or better, the cosine of the angle, which is the normalized scalar product of the two vectors). If the angle is 0 (i.e. the cosine is 1), the two text vectors match perfectly.
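Steps 3 and 4 can be sketched outside FileMaker in a few lines. This is a minimal illustration, not the poster's actual scripts; the function name and the simplified whitespace tokenization are my own assumptions:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b, stopwords=frozenset()):
    """Cosine of the angle between two word-count vectors (Salton's measure)."""
    va = Counter(w for w in text_a.lower().split() if w not in stopwords)
    vb = Counter(w for w in text_b.lower().split() if w not in stopwords)
    # Scalar product over the words both vectors share; every other
    # component product is zero, so only shared words contribute.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Identical word sets give a cosine of 1, disjoint word sets give 0, regardless of word order.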

(I have used steps 1 and 2 in another context; see the presentation given at the German FileMaker Konferenz 2010.)

The problem with this approach is that one easily runs into a combinatorial explosion as the database grows: the number of similarity measures to calculate grows as N*(N-1)/2 for N records.

E.g. for a database with about 67'000 records, as in my case, one would have to calculate about 2.27 billion combinations.

Even if one restricts the comparison to records that have the same number of words, I still need to calculate 167 million combinations.
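The pair count is just the number of unordered pairs; a one-liner makes the scale concrete (exactly 67,000 records gives roughly 2.24 billion pairs, in line with the "about 2.27 billion" figure for the actual record count):

```python
def pairwise_comparisons(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# For 67,000 records this is on the order of 2.2 billion pairs:
print(pairwise_comparisons(67_000))  # 2244466500
```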

I can still break down the clustering by language (stored in a separate field), but that will not help much.

Do you have any brilliant idea how one could accelerate the process? Maybe using a multi-self-join?

Best regards,

Martin

• ###### 1. Re: Similarity clustering problem

Martin,

I'm not clear on all of your parameters for each step of the calc, but perhaps some of these ideas are germane:

- You can easily turn each text field into a word list by substituting a carriage return for every space, and for other word separators as appropriate (dashes, slashes and the like).

- This field may be helpful in creating your word database, although you'd need a method to turn each delimited field into a record. A looping script over the found count is one way to do it within FMP; it could also be done with an export and some manipulation in a text editor.

- At this point you'll have a record for each word in the found count of master records. A self-join in this [interim] database and the Count function will give you the count of each word. This could be used to update the word count in the master word database, or to add new words as appropriate.

- The field in the first step, which breaks the field into a word list, could be used to create a link to the master word database. You could then sum the count field in the master word DB across this relationship, for use in your final calc.

- I don't know if you have to account for repetitions of words in the original text field; if so, you could also handle this in the looping script.

- It's unclear whether you have to deal with word combinations, i.e. the occurrence of a string of (n) words in a row. If so, you could also have the looping script set a field to the combinations, but this would add greatly to the index size and to the complexity of the script.
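The tokenizing and counting steps described above can be sketched as follows. This is an illustration under my own assumptions: the separator set is a guess, and in FileMaker the same work would be done with Substitute() and a looping script rather than Python:

```python
from collections import Counter

SEPARATORS = " -/\t\n"  # space, dash, slash, etc.; adjust as appropriate

def word_list(text):
    """Turn a text field into a word list by replacing every separator
    with a carriage return, then splitting, as described above."""
    for sep in SEPARATORS:
        text = text.replace(sep, "\r")
    return [w for w in text.split("\r") if w]

def word_counts(records):
    """Aggregate per-word counts across all records, i.e. the master
    word database with its count field."""
    counts = Counter()
    for text in records:
        counts.update(w.lower() for w in word_list(text))
    return counts
```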

I'm certain there are several important issues that are missing in the above, but is any of it helpful? If so, post back and we can discuss it in more detail.

1 of 1 people found this helpful
• ###### 2. Re: Similarity clustering problem

Bob,

thank you for your answer. Actually, the creation of the indexes (as described by you, and by me in steps 1 and 2) has already been solved in another project, where I made these word lists for a 23 GB database (text only!) of about 840'000 OCRed pages. So I can reuse the scripts and calculations from that project.

The real challenge is calculating the similarities between pairwise records; this task scales with the square of N (N being the number of records). It is usually said that a task should scale with at most N log N (which is less than N squared) so that it can be handled by computing power that increases according to Moore's law.

The problem is to find criteria that eliminate as many of the pairwise combinations as possible from the calculation of similarities. It's somewhat of a vicious circle: to do this, one would need to know the similarities in advance ...

At the moment, I think I have two criteria (or techniques) that are based on the content of the text field itself:

• finding identical records (similarity = 1) with ! (which is trivial)
• finding maximally non-coinciding record pairs (similarity = 0) using a (multi) self-join relation between the word-delimited fields you mentioned. This should work well once the stopwords have been eliminated. (Since the database contains several languages (English: 68%, German: 30%, French: 1%, and about 20 other languages in the remaining 1%), at least English and German stopword lists are required.)

Knowing how many of the words match in the above-mentioned multi-self-join relation would already be a big help. Maybe the word/word-count table can be used for that.
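One way to obtain shared-word counts without ever touching all N*(N-1)/2 pairs is an inverted index over the word table: record pairs that share no word are simply never generated. This sketch is my own illustration, not part of the thread's solution:

```python
from collections import defaultdict, Counter

def candidate_pairs(records, stopwords=frozenset()):
    """Map each word to the records containing it, then count, for every
    pair of records that co-occurs under at least one word, how many
    distinct words that pair shares.  Pairs with no word in common
    (similarity = 0) never appear at all."""
    postings = defaultdict(set)  # word -> set of record keys
    for key, text in records.items():
        for w in set(text.lower().split()) - stopwords:
            postings[w].add(key)
    shared = Counter()  # (key_a, key_b) -> number of shared distinct words
    for keys in postings.values():
        ordered = sorted(keys)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return shared
```

The catch is that very frequent words still generate huge candidate sets, which is exactly why stopword elimination matters so much here.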

Another aspect which hasn't been discussed is word order and semantics, which may come into play. E.g. "seawater analysis" and "analysis of seawater" have similarity = 1 (after elimination of the stopword "of") and do mean the same thing; but in the case of "elements of crystallography" vs. "crystallography of the elements", the statistical similarity is also 1 while the meaning is different. This is probably not solvable without additional criteria (for example, classification).
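One cheap way to let word order influence the statistical measure is to add adjacent-word bigrams as extra vector dimensions, as in this sketch (my own illustration, not from the thread). Note the honest caveat: bigrams penalize both reorderings equally, so they distinguish word order but cannot by themselves tell the meaning-preserving case from the other; that still needs additional criteria, as noted above:

```python
def features(text, stopwords=frozenset()):
    """Bag of words plus adjacent-word bigrams (formed after stopword
    removal), so that word order affects the feature set."""
    words = [w for w in text.lower().split() if w not in stopwords]
    return set(words) | {f"{a} {b}" for a, b in zip(words, words[1:])}
```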

• ###### 3. Re: Similarity clustering problem

Hi, Martin:

A product called "ScriptMaster" from 360Works has a module called "Text Similarity". This is its description:

This calculates how similar two pieces of text are to each other, using the Levenshtein distance method ( http://en.wikipedia.org/wiki/Levenshtein_distance ).

The lower the number returned, the more similar the two texts are.
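360Works doesn't publish the module's internals, but the Levenshtein distance it references has a standard dynamic-programming form, sketched here for illustration:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Identical strings give 0, matching the "exact same source and target = 0 value" behavior described below.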

Here is the screenshot of the "Text Similarity" ScriptMaster module.

Like you, I am interested in an efficient method of calculating an index value, say between 0 and 50, or 1 to 100. With this, I could provide a search-and-filter interface that returns a set of found records displayed in a filtered portal of results, with a similarity threshold value of, say, 84 to 100, sorted in descending order by their similarity value. This would allow us to begin "scrubbing" through results, scripting a method of archiving them while remapping key associations throughout the solution (or using the "RefreshFM" product from Goya, the developer of BaseElements).

In testing (playing around with it), I found that it accounts for the minor differences (the exact same source and target returns a value of 0) between things such as Macdonald, MacDonald, Mac Donald, McDonald, McDonnell, old Macdonald, ☀️…(EMOJI farm tractor and animals HERE), etc. I've been thinking about how I'd like something to work (ideally) all week, and then recalled today the "play time" I had with this ScriptMaster module. (ScriptMaster includes a companion FileMaker file with 89 various libraries that do some amazing things, and they are featured often in Matt Petrowsky's FileMakerMagazine.com video techniques.)

I'll be looking to get a few moments with Jesse and Sam of 360Works at DevCon so I can discuss the business problem, then solicit their input on how I could best achieve what I want in the simplest, most practical way.

Martin, I hope you'll catch this update to your post from more than 2 years ago, as I'm eager to get your input — or find out how you solved it.

- - Scott


• ###### 4. Re: Similarity clustering problem

It sounds like you've got document classification based on a bag-of-words model figured out. This is handy if you start out knowing what classes you want to classify the documents into; but, as you're seeing, document classification techniques were not designed to figure out on their own what the clusters should be. Cluster analysis may be the place to look for more ideas. In particular, switching to a tf-idf model could help with dimension reduction before applying a clustering algorithm targeted at high-dimensional data.
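A tf-idf weighting, as suggested, down-weights words that appear in many records, so that common vocabulary stops dominating the vectors. A minimal sketch (the function name and whitespace tokenization are my own assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each raw word count (tf) by log(N / document frequency).
    A word occurring in every record gets weight 0; rare words are
    boosted, which helps dimension reduction before clustering."""
    n = len(docs)
    bags = [Counter(d.lower().split()) for d in docs]
    df = Counter()  # document frequency: in how many records each word occurs
    for bag in bags:
        df.update(bag.keys())
    return [{w: tf * math.log(n / df[w]) for w, tf in bag.items()}
            for bag in bags]
```

The resulting weighted vectors can be fed into the same cosine-similarity comparison described earlier in the thread.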

• ###### 5. Re: Similarity clustering problem

Volker Krambrich and I gave a presentation on search strategies at FileMaker Konferenz 2013. Similarity searching starts on page 31:

http://de.slideshare.net/fmkonferenz/fmk-2013-suchstrategien-martin-braendle-volker-krambich

• ###### 6. Re: Similarity clustering problem

That's very nice, Martin. Thank you. I am almost finished translating the example file and companion slide show into English, and am anxious to play around with it.

Best regards,

- - Scott

• ###### 7. Re: Similarity clustering problem

Hi, Martin:

I'm a bit uncertain about the logic flow: which records must be created manually, and which records get created automatically by the 5 scripts in the example file. It would be great if you could also provide a summary or overview of the user flow, as not all of the scripts show a starting "context" or layout name (though I do see "Go to Layout" script steps inside some scripts).

For instance, I don't know whether I must create a Titel and an Urheber record AND then create a companion CT_Titel_Urheber record, or which record must be created in order for the other records to be created automatically.

• Titel (Title)
• CT_Titel_Urheber
• Urheber (Creator)
• WortIndex (Word Index)
• Similarity
  • Titel1 (Title::id_Titel to Similarity::id_Titel1)
  • Titel2 (Title::id_Titel to Similarity::id_Titel2)
• Log
• StopWords
• StopWordRule

I know I will have many more questions, as I am eager to make an efficient and fast data mining feature within my solution.