0 Replies Latest reply on Mar 12, 2011 5:59 AM by tonyberber

    Google corpus in FileMaker

    tonyberber

      I'm importing part of the Google n-gram corpus (http://googlesystem.blogspot.com/2008/05/using-googles-n-gram-corpus.html) into FileMaker Pro 11. The part I'm working on contains millions of three-word sequences found on the web.

      It is structured as follows:

      id, w1, w2, w3, freq
      1,deep,freeze,makes,56
      2,deep,fryer,is,316
      3,deep,impact,makes,107

      This is how the data are presented, and I cannot change this structure because millions of records are involved.

      Each record is an actual sequence; for instance, record 1, 'deep freeze makes' is a sequence found on the web and it occurs 56 times (at the time Google prepared the corpus).

      I can search for any particular word in any of the three positions (w1, w2, w3), which returns all sequences that contain that word in that position. The search is really fast!

      What I'd like to get is a summary of the words found in a search. For example, if I search for 'deep' in w1 and it returns the three records above, then the summary report should look something like this:


      search word: deep
      position: w1
      summary (sorted by frequency):
      deep: 3
      makes: 2
      freeze: 1
      fryer: 1
      impact: 1
      is: 1
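
      To make the tally concrete, here is a minimal sketch in Python (outside FileMaker, purely to illustrate the counting logic, not a FileMaker solution). The function name `summarize` and the inline sample data are assumptions for illustration; each word is counted once per occurrence in w1, w2, or w3 of the matching records, and ties are broken alphabetically.

```python
import csv
import io
from collections import Counter

# The three sample records from the post, in the same column layout.
SAMPLE = """id,w1,w2,w3,freq
1,deep,freeze,makes,56
2,deep,fryer,is,316
3,deep,impact,makes,107
"""


def summarize(rows, search_word, position):
    """Tally every word in w1..w3 of the records whose `position` column
    equals `search_word`. (Hypothetical helper, not part of FileMaker.)"""
    counts = Counter()
    for row in rows:
        if row[position] == search_word:
            counts.update(row[col] for col in ("w1", "w2", "w3"))
    # Highest count first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize(rows, "deep", "w1")
for word, count in summary:
    print(f"{word}: {count}")
```

      Note that this counts records, not the Google frequency in the freq column; weighting by freq would be a different report.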


      The words occurring near the search word cannot be predicted in advance, so the list of words in the summary has to be built from whatever the search returns.

      Any help appreciated!