8 Replies Latest reply on May 24, 2014 1:31 AM by mardikennedy

    File size and indexing conundrums

    mardikennedy

      Hi,

       

      We are doing a bit of data rescue from a venerable, non-FMP source, and we're getting odd results on FMP file size and the indexing contribution, tho'.

       

      File size, imported data, full indexing: 825 MB

      File size, compressed copy 1: 539 MB

      Make a fresh copy, switch off all indexes and uncheck the option to rebuild them. Save a compressed copy.

      File size, compressed copy 2: 539 MB (unexpected)

      File size, clone: 256 KB

      File size, using clone and importing data from compressed copy 2: 719 MB (very unexpected, no indexes)

      File size, compressed copy 3: 539 MB (mmm)

       

      This would suggest that

      a) indexes have no overhead (highly implausible), or

      b) indexing parameters are ignored

       

      Bug, or really special feature? (Windows 7; file size figures via Properties)
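
      In case it helps with reproducing: the figures above are whole-MB roundings from Properties, but a byte-accurate check can be scripted. A minimal sketch in Python (the file names and extension here are illustrative placeholders, not the actual files):

      import os

      # Byte-accurate sizes; Windows Properties rounds for display.
      # File names below are examples only.
      for name in ("imported_indexed.fmp12", "compressed_copy_1.fmp12",
                   "clone.fmp12"):
          size = os.path.getsize(name)
          print(f"{name}: {size / 1024**2:.1f} MB ({size:,} bytes)")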

       

      All the best,

      Mardi

        • 1. Re: File size and indexing conundrums
          taylorsharpe

           Did you just set indexing to Automatic, or did you actually click "All" in the indexing dialog box? If you didn't click the "All" button, the indexes were not created and won't be until they're needed, e.g. by a search on the field or by a relationship. That would be my first guess as to why the size is still 539 MB.

          • 2. Re: File size and indexing conundrums
            mardikennedy

            No, I switched OFF the indexing - I wanted to REMOVE it.  I also wanted to prevent it from being rebuilt.

            • 3. Re: File size and indexing conundrums
              taylorsharpe

               I suspect you are not comparing a version whose indexes are completely built to an unindexed one. I wanted to know whether the indexed version that came out the same size really had its indexes created, or was just set to create them automatically as needed. If the latter is true, then you were basically comparing two unindexed files, and that is why they were both the same size.

              • 4. Re: File size and indexing conundrums
                mardikennedy

                 I started with a FULLY INDEXED file, i.e. all sorts of Find operations completed instantly.

                 

                I then tried to REMOVE the indexing.  (I wanted to know how much of the file size could be attributed to the indexing.)

                 

                File sizes after the various experiments would suggest that either indexes incur no overhead, or that something is left over.

                 

                The new files definitely do not appear to have indexes - any kind of search operation is very, very slow.

                 

                I've also done a workstation restart in case it was a Windows thing - no change.

                 

                Hope that clarifies things,

                M

                • 5. Re: File size and indexing conundrums
                  mardikennedy

                   Rethinking.  My error was that I didn't believe I *really* had 539 MB of data - I was expecting it to be less, because it seemed like such basic text etc.  However, there are 320k records, so I guess even 'little' ones add up.

                   

                   Thereafter, given that a few fields are quite long text entries and the indexing for QuickFind is therefore 'involved', I expected the overall indexing component to be more than half the total file size.  (Which it almost was, but because I didn't trust the 'data size', I didn't check it this way.)

                   

                   The data was extracted from the source as a series of .tab files, so I went back to these and yes, they add up to roughly 539 MB.  So that really is the data count.

                   

                   The interesting discovery (for me) is that comprehensive indexing (I'd tested every field) only added 'half again', i.e. 539 + 286 = 825 MB.

                   

                  The silly assumptions we make!  (some of us)
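
                   For the record, the back-of-the-envelope check I should have done first (Python; the sizes are the MB figures above, via Properties):

                   # Sizes in MB, straight from the figures above.
                   indexed_size = 825        # fully indexed file
                   data_size = 539           # compressed copy / sum of source .tab files
                   record_count = 320_000

                   index_overhead = indexed_size - data_size        # 286 MB
                   overhead_ratio = index_overhead / data_size      # ~0.53, i.e. 'half again'
                   avg_record_kb = data_size * 1024 / record_count  # ~1.7 KB per record

                   print(f"index overhead: {index_overhead} MB ({overhead_ratio:.0%} of the data)")
                   print(f"average record size: {avg_record_kb:.1f} KB")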

                  • 6. Re: File size and indexing conundrums
                    ch0c0halic

                    I’ve included a lot of miscellaneous information. I hope this adequately explains why your test may not be testing what you think it is testing.

                     

                     When FMP does an import it uses far more space than is required. That's why saving a compressed copy after a large import always recovers space.

                     

                     Data is stored in 'blocks' in FMP. When data is entered manually, each block may be filled to 50-80% before a new block is created. When importing, however, FMP takes a very conservative approach and creates more blocks than the data needs, and the fill percentage of individual blocks may vary a lot.

                     

                     Indexing is done while importing (a very good reason to turn off indexing if the import is very large). But FMP may not have built any field indexes if the fields were never indexed in the first place. Meaning: if you never did something that required an index to be built (for example, doing a Find), FMP won't automatically build an index during import. So turning off indexing, and not rebuilding indexes during the save, may have had no effect because no indexes had been built yet.

                     

                     NOTE: Even if you turn off indexing, FMP will still create the indexes required by relationships.

                     

                     The indexing data (part of the record's metadata) is stored separately from the data, and it will vary greatly in size depending on the number of unique values in each indexed field. On first opening the file, FMP automatically adds blocks to an empty index space in anticipation of the requirement.
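
                     As a rough illustration of the block-fill point - a toy calculation only, not FMP's actual algorithm, and the 4 KB block size is an assumption:

                     BLOCK_SIZE_KB = 4  # assumed block size, for illustration only

                     def on_disk_mb(data_mb, fill_factor):
                         """MB on disk if every block is only `fill_factor` full."""
                         blocks = (data_mb * 1024 / BLOCK_SIZE_KB) / fill_factor
                         return blocks * BLOCK_SIZE_KB / 1024

                     for fill in (0.5, 0.8, 1.0):
                         print(f"fill {fill:.0%}: ~{on_disk_mb(539, fill):.0f} MB")
                     # fill 50%: ~1078 MB; fill 80%: ~674 MB; fill 100%: ~539 MB

                     By that toy measure, the 719 MB result from importing into the clone works out to about 75% average fill (539/719), which is at least consistent with a conservative import.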

                    • 7. Re: File size and indexing conundrums
                      mardikennedy

                      Hi Jimmy,

                       

                       Info much appreciated.  I won't be able to answer or frame extra questions until tonight/tomorrow, but more then,

                       

                      Regards,

                      Mardi

                      • 8. Re: File size and indexing conundrums
                        mardikennedy

                         Info is very welcome.  The actual task is an experiment for a 'disaster recovery' strategy.  The db in question is an old BASIS one (conventional mainframe setup with Prod and Dev environments).  The initial task was to see whether the data could be ingested by FMP via .tab, exported back out as .tab, and then successfully imported by Dev BASIS.  The orig db includes date and time fields, though mostly straight text.  Dev BASIS is mean - if even one record fails a validation, the whole batch gets tossed.
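
                         Since one bad record gets a whole batch tossed, a pre-flight check on each exported .tab before it goes to Dev BASIS can save a round trip.  A minimal sketch in Python (the column layout, date/time formats and file name are invented for illustration):

                         import csv
                         from datetime import datetime

                         DATE_FMT, TIME_FMT = "%Y-%m-%d", "%H:%M:%S"  # assumed formats
                         EXPECTED_COLS = 12                           # assumed field count

                         def bad_rows(path):
                             """Yield (line_number, reason) for rows likely to fail validation."""
                             with open(path, encoding="utf-8", newline="") as f:
                                 for n, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
                                     if len(row) != EXPECTED_COLS:
                                         yield n, f"expected {EXPECTED_COLS} fields, got {len(row)}"
                                     else:
                                         try:
                                             datetime.strptime(row[0], DATE_FMT)   # date column
                                             datetime.strptime(row[1], TIME_FMT)   # time column
                                         except ValueError as e:
                                             yield n, str(e)

                         for line, reason in bad_rows("export.tab"):
                             print(f"line {line}: {reason}")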

                         

                         Thus far, less than half the data has been ingested, and it has arrived piecemeal, in about 35 separate ports.

                         

                         FTR, it's been unexpectedly successful.  The FMP field defs include date and time fields, and the export is being done in UTF-8.  The only Dev BASIS problems thus far have been unrelated to the FMP contribution to the project.  (Still need to do further testing in Dev BASIS re functionality, but a successful port is a good sign.)

                         

                         Naturally, these imports could (and should) be done sans indexing, but I personally wanted to know just how good FMP could be on 'deep text' searches.  Answer: bloody fantastic.  It was fun to show off to non-FMP colleagues.  (After the first few ports, I did searches to trigger indexing, and with the later ports the indexing continued to be updated.  Slowish, but QuickFind worked a treat.)

                         

                         The ongoing experiment requires both (truly) unindexed files (sticking to the actual disaster recovery brief) and fully indexed ones (hey, I'm halfway there, so why not?), because then there are all sorts of other interesting possibilities.

                         

                         Also FTR, the BASIS source is unexpectedly non-relational, hence the very small clone (256 KB) and the ease of imports.

                         

                         I'm not sure whether to do the next tranche of imports (probably end of next week) into the indexed or the unindexed versions.  Is it better to keep updating the indexes in smallish increments, or to do it in one mega-chug overnight?  My gut feeling is to do it in increments (now approx. 10 mins per import), but the behaviour might change moving from 230k records towards 670k records.

                         

                         A future data massage project holds temptations, and thus far it seems more practical to excise the 'fat text' fields into a separate table (?no index?), do the processing there, and bring just the results back in.  To put it another way: Replace operations are relatively fast in fields with little data but very slow in fields with a lot of data.  Tips welcome.
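
                         To make that excise-and-return shape concrete, here's a sketch of doing the heavy string work outside FMP and re-importing only (key, result); the field layout and the transformation are hypothetical:

                         import csv

                         def massage(text):
                             # Stand-in for the real processing, e.g. normalising whitespace.
                             return " ".join(text.split())

                         with open("fat_text.tab", encoding="utf-8", newline="") as src, \
                              open("results.tab", "w", encoding="utf-8", newline="") as dst:
                             reader = csv.reader(src, delimiter="\t")
                             writer = csv.writer(dst, delimiter="\t")
                             for key, fat in reader:                  # assumes columns: key, fat text
                                 writer.writerow([key, massage(fat)])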

                         

                        All the best,

                        Mardi