I think this would be much easier to do if your records in the first table were structured like this:
// and so forth...
In other words, you'd have 22 records of one word each instead of 2 records with 11 different word fields. That structure would permit a simple relationship to match a given word record in the Corpus table to all instances of that word in the first table. It also permits sentence fragments of any size you choose to work with, instead of limiting you to a maximum of 11 words.
SentenceFrags::Word = Corpus::Word
A calculation field defined in Corpus as Count ( SentenceFrags::Word ) would give the total count of all instances of that word in the SentenceFrags table. The second count will be trickier. What "found set" are you referring to there? It may require a script that is performed immediately after performing the find.
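For readers outside FileMaker, the effect of that Count calculation can be sketched in Python. The rows below are invented for illustration, but they mirror the proposed one-word-per-record structure:

```python
# Each tuple stands in for one SentenceFrags record: (word, fragment_id).
# The rows are invented for illustration.
from collections import Counter

frag_words = [
    ("For", 1), ("the", 1), ("first", 1), ("time", 1), ("as", 1),
    ("part", 1), ("part", 2), ("the", 2), ("deal", 2),
]

# Analogous to Count ( SentenceFrags::Word ) evaluated from a Corpus
# record: count every related record holding that word.
word_counts = Counter(word for word, _ in frag_words)
print(word_counts["part"])  # 2
```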
Thank you, PhilModJunk.
I was referring to the results of a search for a particular word occurring in field 'word06' only -- for instance the word 'part'.
In this new structure, I'm not sure how to rebuild the fragment once it is broken into different records, and that's crucial so that users can read the search results as sentence fragments and not as individual words. Maybe by adding another field? Eg:
ID   Word      Fragment
1    For       1
2    the       1
3    first     1
4    time      1
5    as        1
6    part      1
7    of        1
8    the       1
9    school    1
10   project   1
11   that      1
12   and       2
13   since     2
14   it        2
15   is        2
16   all       2
17   part      2
18   of        2
19   the       2
20   same      2
21   deal      2
22   we        2
Question: In this new structure, how would I restrict the search to those records that used to be field 'word06' in my previous structure (records 6, 17, 28, 39, etc)?
It can be done. In fact, the sentence fragment would be stored in its own table with a relationship to such a word-list table. An added field could record each word's position, which would then make all of the other requirements possible. But before I dive into such detail, I can't help noticing that separating the words in a text field into a list of separate words is very easy to do and can be done in a variety of ways. Thus, this elaborate structure (both the original and my suggested "improvements") might not be necessary.
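To illustrate how easy that splitting is (FileMaker's own WordCount and MiddleWords functions can do it natively), here is a minimal Python sketch using the first fragment from the table above:

```python
# Break a stored sentence fragment into one (position, word) row per
# word -- the shape of the proposed word-list table.
fragment = "For the first time as part of the school project that"

words = fragment.split()
rows = [(i + 1, w) for i, w in enumerate(words)]

print(len(rows))  # 11
print(rows[5])    # (6, 'part')
```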
Can you tell me a bit more about what you are doing in this project? Sometimes it's a good idea to stop and rethink before getting so involved in the trees that we lose sight of the forest.
Sure. I'm trying to build a 'concordancer', which is a tool that helps linguists find word patterns in text. There are several concordancers available out there, but all face problems when dealing with large quantities of text. I plan to have millions of records in my database.
The typical output of a concordancer can be seen, for instance, in the Corpus of Contemporary American English at americancorpus.org, and consists of listings of 'text fragments' containing a particular word (or other string).
I have scripts that turn whole texts into these sentence fragments, which I can then import into FM.
I'm inclined to go with this structure then:
Fragments::_pk_FragID = Frag_Words::_fk_FragID
Corpus::Word = Frag_Words::Word
Corpus::Word = WordsByPosition::Word AND
Corpus::gWordPosition = WordsByPosition::Position
WordsByPosition and Frag_Words would be two table occurrences of the same data source table. The fields in this table would include:
Word (text and foreign key to Corpus)
_fk_FragID (Number and foreign key to Fragments)
Position (Number, the position of the word in its related Fragment)
In Corpus you could define two calculation fields:
cTotalCount as: Count ( Frag_Words::Word )
and cPositionCount as: Count ( WordsByPosition::Word )
gWordPosition would be a global number field where you would specify the position of the word so that cPositionCount can then compute the desired count.
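As a sketch of how cPositionCount behaves (Python here, with invented rows; gWordPosition is modeled as a plain variable):

```python
# Each tuple stands in for one WordsByPosition record:
# (word, fragment_id, position). Rows are invented for illustration.
frag_words = [
    ("part", 1, 6),
    ("part", 2, 6),
    ("part", 3, 2),
    ("the", 1, 2),
]

g_word = "part"      # the Corpus word being examined
g_word_position = 6  # plays the role of the global gWordPosition field

# Analogous to Count ( WordsByPosition::Word ): only rows matching both
# the word and the chosen position are related, so only they are counted.
position_count = sum(
    1 for word, _, pos in frag_words
    if word == g_word and pos == g_word_position
)
print(position_count)  # 2
```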
Hmm, another option comes to mind that would give you a simpler table structure.
A summary report with no body, just a sub-summary part in Frag_Words based on the Word field (shown when sorted by Word), could be created. Then a "Count Of" summary field could be used to give you both counts. This approach would eliminate the Corpus table (replaced by Frag_Words) and the WordsByPosition table occurrence. You would lose, however, any cases of a zero count, so that might not work for you.
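The sub-summary report amounts to a sort-then-group-and-count. A Python sketch of the same idea (word list invented for illustration):

```python
# Sorting by word and counting each group mirrors a sub-summary part
# with a "Count Of" summary field. Words that never occur simply do
# not appear -- the zero-count drawback mentioned above.
from itertools import groupby

frag_words = ["the", "part", "the", "of", "part", "the"]

for word, group in groupby(sorted(frag_words)):
    print(word, sum(1 for _ in group))
# of 1
# part 2
# the 3
```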
Sorry, PhilModJunk, I'm running into problems:
-What did you mean by 'the position of the word' in 'gWordPosition' in the Corpus table? The position of a word changes depending on which fragment it is in.
-Should the 'Fragments' table include all 11 words of each fragment (as in the original post) or just one word (as in the modified structure)?
I've placed the database online at philmodjunk.fp7
Thank you very much!
You indicated earlier that you wanted a word count specific to a found set. When I asked for more info, you indicated that if you search for "The" in the 10th word field, you then wanted a count for that word in that position (at least, that's how I interpreted it). If you go to Corpus, you can perform a find for the word "The", enter 10 in gWordPosition, and the calculation field I specified will give you the count I understood you to want here. The field is a global field (the global storage option is specified), so it stores only one value for the entire field. It is used only to temporarily restrict your word count to a specific position in the sentence fragment. If the database is shared over a network, different users can use this global field to specify different word positions at the same time without interfering with one another.
The fragments table would include the sentence fragment in a single field. It can be as many or as few words as you want. One of the advantages to the change in structure is that you are no longer restricted to 11 word fragments. You can use any size that works for you.
thank you very much, PhilModJunk!