9 Replies Latest reply on Mar 11, 2011 5:18 PM by tonyberber

    word frequency?

    tonyberber

      I have a database in FM Pro 11 with two tables. The first one (named concordances) holds sentence fragments, in which each word is in a separate field. I use it to search for occurrences of particular words in field word06, for instance the word 'part':

      Id 1

      word01: for
      word02: the
      word03: first
      word04: time
      word05: as
      word06: part
      word07: of
      word08: the
      word09: school
      word10: project
      word11: that

      Id 2
      word01: and
      word02: since
      word03: it
      word04: is
      word05: all
      word06: part
      word07: of
      word08: the
      word09: same
      word10: deal
      word11: we

      etc. up to record 771.

      The second table (named corpus) has all the word frequencies in the whole database:

      Id 1
      word: the
      frequency: 28373


      Id 2
      word: of
      frequency: 24761

      ...

      Id 39383
      word: part
      frequency: 771

      etc. up to record 101576.

      What I'd like to do is to calculate the joint frequencies of words in fields word01 through word06, and then match these with the frequencies already in table corpus. This would look something like this:


      Id 1
      word: the 
      frequency in found set: 36
      frequency in corpus: 28373

      Id 2
      word: of 
      frequency in found set: 22
      frequency in corpus: 24761

      Thank you in advance for any pointers!
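As a rough illustration of the calculation requested above (in Python, not FileMaker), the sketch below counts the words in fields word01 through word06 over a found set and pairs each count with the word's corpus frequency. The function name `joint_frequencies` and the list-based record layout are assumptions made for the example; the sample words and the three corpus figures are the ones quoted in the post.

```python
from collections import Counter

# Two sample records from the concordances table (values taken from the post).
concordances = [
    {"id": 1, "words": ["for", "the", "first", "time", "as", "part",
                        "of", "the", "school", "project", "that"]},
    {"id": 2, "words": ["and", "since", "it", "is", "all", "part",
                        "of", "the", "same", "deal", "we"]},
]

# Corpus frequencies quoted in the post (only these three are given there).
corpus = {"the": 28373, "of": 24761, "part": 771}


def joint_frequencies(records, corpus, first=1, last=6):
    """Count words in fields word<first>..word<last> across the records and
    pair each count with the word's corpus frequency (None if unknown)."""
    counts = Counter()
    for rec in records:
        counts.update(rec["words"][first - 1:last])
    return {w: (n, corpus.get(w)) for w, n in counts.items()}


# Found set: records whose word06 is 'part'.
found_set = [r for r in concordances if r["words"][5] == "part"]
result = joint_frequencies(found_set, corpus)
```

With only the two sample records, `result["part"]` pairs a found-set count of 2 with the corpus frequency 771; words absent from the three quoted corpus entries are paired with `None`.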

        • 1. Re: word frequency?
          philmodjunk

          I think this would be much easier to do if your records in the first table were structured like this:

          ID    Word
          1      For
          1      the
          1      first
          1      time
          1      as
          1      part
          1      of
          1      the
          1      school
          1      project
          1      that
          2      and
          2      since
          2      it
          2      is
          2      all
          // and so forth...

          In other words, you'd have 22 records of one word each instead of 2 records with 11 different word fields. That structure would permit a simple relationship link to match a given word record in the corpus table to all instances of that word in the first table. It also permits sentence fragments of any size you choose to work with instead of limiting you to a maximum of 11 words.

          SentenceFrags::Word = Corpus::Word

          A calculation field defined in Corpus as Count ( SentenceFrags::Word ) would give the total count of all instances of that word in the SentenceFrags table. The second count will be trickier. What "found set" are you referring to there? It may require a script that is performed immediately after performing the find.
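To make the restructuring concrete, here is a minimal Python sketch (not FileMaker code) that converts the two sample fragments from one-record-per-fragment into one-row-per-word, then counts rows per word, which is roughly what a Count over the word relationship would return. The variable names `wide`, `long_rows`, and `total_count` are hypothetical.

```python
from collections import Counter

# Wide layout: one record per fragment (sample values from the original post).
wide = {
    1: ["for", "the", "first", "time", "as", "part",
        "of", "the", "school", "project", "that"],
    2: ["and", "since", "it", "is", "all", "part",
        "of", "the", "same", "deal", "we"],
}

# Long layout: one (fragment ID, word) row per word.
long_rows = [(frag_id, word)
             for frag_id, words in wide.items()
             for word in words]

# Rough analogue of counting the related word records for each corpus word.
total_count = Counter(word for _, word in long_rows)
```

The two 11-word fragments become 22 rows, and nothing in the long layout depends on a fragment having exactly 11 words.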

          • 2. Re: word frequency?
            tonyberber

            Thank you,  PhilModJunk.

            I was referring to the results of a search for a particular word occurring in field 'word06' only -- for instance the word 'part'.

            In this new structure, I'm not sure how to rebuild the fragment once it is broken into different records, and that's crucial so that users can read the search results as sentence fragments and not as individual words. Maybe by adding another field? Eg:

            ID    Word       Fragment
            1     For        1
            2     the        1
            3     first      1
            4     time       1
            5     as         1
            6     part       1
            7     of         1
            8     the        1
            9     school     1
            10    project    1
            11    that       1
            12    and        2
            13    since      2
            14    it         2
            15    is         2
            16    all        2
            17    part       2
            18    of         2
            19    the        2
            20    same       2
            21    deal       2
            22    we         2

            etc.

            Question: In this new structure, how would I restrict the search to those records that used to be field 'word06' in my previous structure (records 6, 17, 28, 39, etc)?
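One way to reason about this question, sketched in Python rather than FileMaker: if every fragment has exactly 11 words and the IDs are assigned serially, a row's position inside its fragment can be derived from its ID with modular arithmetic, and a fragment can be rebuilt by joining its words in ID order. The fixed length of 11 and the helper names `position` and `rebuild` are assumptions for the example; an explicitly stored position field would avoid relying on a fixed fragment length.

```python
FRAGMENT_LENGTH = 11  # assumption: every fragment has exactly 11 words

words = ("for the first time as part of the school project that "
         "and since it is all part of the same deal we").split()

# One row per word: (ID, word, fragment), numbered serially from 1.
rows = [(i + 1, w, i // FRAGMENT_LENGTH + 1) for i, w in enumerate(words)]


def position(row_id, length=FRAGMENT_LENGTH):
    """Position of a row within its fragment, derived from its serial ID."""
    return (row_id - 1) % length + 1


# Rows that correspond to the old 'word06' field (IDs 6, 17, 28, ...).
sixth = [r for r in rows if position(r[0]) == 6]


def rebuild(rows, fragment):
    """Reassemble one fragment by joining its words in ID order."""
    return " ".join(w for _, w, f in sorted(rows) if f == fragment)
```

For the sample data, `sixth` holds the rows with IDs 6 and 17, and `rebuild(rows, 2)` gives back the second fragment as a readable string.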


            • 3. Re: word frequency?
              philmodjunk

              It can be done. In fact, the sentence fragment would be stored in its own table with a relationship to such a word list table. An added field could record the word's position, which would then make all of the other requirements possible. But before I dive into such detail, I can't help noticing that separating the words in a text field into a list of separate words is very easy to do and can be done in a variety of ways. Thus, this elaborate structure (both the original and my suggested "improvements") might not be necessary.

              Can you tell me a bit more about what you are doing in this project? Sometimes it's a good idea to stop and rethink before getting so involved in the trees that we lose sight of the forest.

              • 4. Re: word frequency?
                tonyberber

                Sure. I'm trying to build a 'concordancer', which is a tool that helps linguists find word patterns in text. There are several concordancers available out there, but all face problems when dealing with large quantities of text. I plan to have millions of records in my database.

                The typical output of a concordancer can be seen in the Corpus of Contemporary American English, for instance, at americancorpus.org, and consists of listings of 'text fragments' containing a particular word (or other string).

                I have scripts that turn whole texts into these sentence fragments, which I can then import into FM.

                • 5. Re: word frequency?
                  philmodjunk

                  I'm inclined to go with this structure then:

                  Fragments---<Frag_Words>---Corpus---<WordsByPosition

                  Fragments::_pk_FragID = Frag_Words::_fk_FragID

                  Corpus::Word = Frag_Words::Word

                  Corpus::Word = WordsByPosition::Word AND
                  Corpus::gWordPosition = WordsByPosition::Position

                  WordsByPosition and Frag_Words would be two table occurrences of the same data source table. The fields in this table would include:
                  Word (text and foreign key to Corpus)
                  _fk_FragID (Number and foreign key to Fragments)
                  Position (Number, the position of the word in its related Fragment)

                  In Corpus you could define two calculation fields:
                  cTotalCount as:  Count ( Frag_Words::Word )
                  and cPositionCount as:  Count ( WordsByPosition::Word )

                  gWordPosition would be a global number field where you would specify the position of the word so that cPositionCount can then compute the desired count.
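The two counts can be pictured with a small Python sketch (an analogy, not FileMaker code). Each row stands for one record of the shared word table, `total_count` plays the role of cTotalCount, and `position_count` plays the role of cPositionCount with its second argument standing in for the value typed into gWordPosition. The sample rows are a hypothetical subset of the two fragments from the original post.

```python
# Rows of the shared word table: (word, fragment ID, position in fragment).
frag_words = [
    ("the", 1, 2), ("part", 1, 6), ("of", 1, 7), ("the", 1, 8),
    ("part", 2, 6), ("of", 2, 7), ("the", 2, 8),
]


def total_count(word):
    """Analogue of cTotalCount: all rows that match on the word alone."""
    return sum(1 for w, _, _ in frag_words if w == word)


def position_count(word, g_word_position):
    """Analogue of cPositionCount: rows that match on word AND position."""
    return sum(1 for w, _, p in frag_words
               if w == word and p == g_word_position)
```

For example, `total_count("the")` counts every "the" row, while `position_count("the", 8)` counts only those sitting in position 8; changing the position argument changes the second count without touching the data.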

                  Hmm, another option comes to mind that would give you a simpler table structure.

                  A summary report with no body, just a Sub Summary Part based on Frag_Words, could be created for records sorted by Word. Then a "Count Of" Summary field could be used to give you both counts. This approach would eliminate the Corpus table (replaced by Frag_Words) and the WordsByPosition table occurrence. You would lose, however, any cases of a zero count, so that might not work for you.
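The zero-count caveat can be seen in a short Python sketch (an analogy only; the word lists are hypothetical). Grouping the word rows themselves yields a row only for words that occur, whereas driving the count from a separate corpus list also reports words that never occur.

```python
from collections import Counter

corpus_words = ["the", "of", "part", "school"]      # hypothetical corpus list
found_words = ["the", "part", "of", "the", "part"]  # hypothetical found set

# Grouped summary: only words that actually occur get a row.
summary = Counter(found_words)

# Corpus-driven count: every corpus word gets a row, including zeros.
by_corpus = {w: summary.get(w, 0) for w in corpus_words}
```

Here "school" has no entry at all in `summary`, but appears with a count of 0 in `by_corpus`.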

                  • 6. Re: word frequency?
                    tonyberber

                    thank you so much,  PhilModJunk! I'm implementing this and will report back!

                    • 7. Re: word frequency?
                      tonyberber

                      sorry, PhilModJunk, I'm running into problems:

                      - What did you mean by 'the position of the word' in 'gWordPosition' in table corpus? The position of a word changes depending on which fragment it is in.

                      - Should table 'fragments' include all 11 words of each fragment (as in the original post) or just one word (as in the modified structure)?

                      I've placed the database online at  philmodjunk.fp7

                      Thank you very much!


                      • 8. Re: word frequency?
                        philmodjunk

                        You indicated earlier that you wanted a word count specific to a found set. When I asked for more info, you indicated that if you search for "The" in the 10th word field, you then wanted a word count for that word in that position. (At least that's how I interpreted this.) If you go to Corpus, you can perform a find for the word "The", enter 10 in gWordPosition, and the calculation field I specified will then give you the count I understood you to want here. gWordPosition is a global field (the global storage option is specified), so it stores only one value for the entire field. It is only used to temporarily restrict your word count to a specific position in the sentence fragment. If the database is shared over a network, different users can use this global field to specify a different word position all at the same time without interfering with one another.

                        The fragments table would include the sentence fragment in a single field. It can be as many or as few words as you want. One of the advantages of the change in structure is that you are no longer restricted to 11-word fragments. You can use any size that works for you.

                        • 9. Re: word frequency?
                          tonyberber

                          thank you very much, PhilModJunk!