30 Replies · Latest reply on Nov 22, 2015 10:45 PM by everyman

    Indexing very large datasets

    everyman

      What is the conventional (or unconventional) wisdom on importing very large datasets?

       

      We have a database that grows by about 3.1 million records a year, all in one annual batch.  Much of this data (mostly the text data) is fairly static and doesn't change from year to year; I have put it into separate tables.  But the meat of the thing is a set of number fields and number calcs; we get this stuff about this time every year and import the new annual data.  For better or worse, almost all of these fields need to be searchable --- which means importing and indexing take longer every year.

       

      Are there any accepted "best practices" for importing large datasets?  I am importing (from an Access table) into a single-table "import" FileMaker database.  That file is set up as an external data source in my main database.  (Since the import table and its data are only needed once a year, I moved them to a separate file --- deleting 3.1M records takes a while.)  The import routine uses lookups to create new records and will (if necessary) update existing records in the more static tables.
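      Schematically, the create-or-update pass amounts to an import with matching on the parcel key; something along these lines, with placeholder file, table, and field names (the real schema is much wider):

          # schematic sketch only: placeholder names throughout
          Go to Layout [ "Assessment" ]
          Import Records [ With dialog: Off ; "ImportStaging.fmp12" ;
              Update matching records in found set ; Add remaining data as new records ]
          # matching is on the parcel/tax ID; lookups then pull the static
          # property and owner text from their own tables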


      The basic structure is three tables: property, owner, tax assessment.  The assessment table has new entries for each year; each entry links to the property description and the owner information.


      Thanks for any light you can shed on this.  We need to keep the import time as brief as possible.  (It generally takes on the order of a weekend to import the data.)  I'd like to keep that fairly linear: if after a few years it begins to look like three days (and then four and five) I'll be back to keeping it in an annual file.

        • 1. Re: Indexing very large datasets
          siplus

          When deleting lots of records, PSOS can do wonders.
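          Roughly like this (script and layout names are just examples): a one-line caller on the client, and a server-side script that does the actual delete.

              # client side: hands the work to the server and returns immediately
              Perform Script on Server [ "Delete Staging Records" ; Wait for completion: Off ]

              # "Delete Staging Records" (runs on the server)
              Go to Layout [ "Staging" ]
              Show All Records
              Delete All Records [ With dialog: Off ]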

           

          Importing large amounts of data while in Freeze Window, on a layout with no fields showing, and with Allow User Abort set to Off is faster.
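          As a rough sketch (the layout name is just an example; the source is whatever you import from):

              Allow User Abort [ Off ]
              Freeze Window
              Go to Layout [ "Blank - no fields" ]
              Import Records [ With dialog: Off ; <your source> ; Add ]
              Go to Layout [ original layout ]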

           

          Saving a clone (no records) and importing into it, then replacing the old shared file with the new one is faster than deleting all records then importing.

           

          Maybe you can get the data monthly, instead of yearly?

          • 2. Re: Indexing very large datasets
            everyman

            When deleting lots of records, PSOS can do wonders.

            Noted, and I'll do that as soon as I can run FMS 14.  (Right now we are stuck at FMS 12 due to OS constraints.)


            Importing large amounts of data while in Freeze Window, on a layout with no fields showing, and with Allow User Abort set to Off is faster.

            Yes.  We already import using empty layouts.  The real holdup appears to be indexing the new data, which takes longer as the file grows.  (That said, importing into a table of about two dozen fields is significantly faster than importing into a table of six or eight dozen fields.)


            Saving a clone (no records) and importing into it, then replacing the old shared file with the new one is faster than deleting all records then importing.

            Agreed.  That's actually why I moved the import data to a separate file that I *only* care about for a few days a year.  Otherwise I have to take the live file offline for as long as it takes to clone it and import the data, or warn users not to make changes or add notes for the duration.  With the import (from Access) in a single table defined as an external data source, I really don't care how long it takes to delete all those records and empty the file.


            Maybe you can get the data monthly, instead of yearly?

            Not possible, unfortunately.  We get this stuff every November, and it has to be up and running by early December (but the earlier the better).  What I have done is arrange the new data in descending order of importance and utility, so that the more frequently-used counties are available first.

            • 3. Re: Indexing very large datasets
              siplus

              Well, one could open the empty clone and turn off indexing for all fields, then index when necessary. Data won't be indexed while importing, only when a field is first searched.

               

              But even without this, I think another good improvement when importing that much data would be to use a machine with SSD disk(s).

              • 4. Re: Indexing very large datasets
                garyods1

                Having had some experience with large real estate data sets, here is what you might consider.

                 

                • First, work with a generous amount of RAM
                • Nowadays an inexpensive SSD will speed you up immensely
                • As siplus suggested, clone your databases
                • Using your cloned updating databases, set indexing off on all indexed fields (save a copy of this/these empty databases too)
                • Write a script to import the records and then perform finds on every field you want indexed; run this script when you go home at night (see the sketch after this list)
                • You should return to fully indexed database(s) the next morning
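                A minimal sketch of that overnight script, with placeholder layout and field names (it assumes "Automatically create indexes as needed" is left on, so the first find on each field builds its index):

                    Set Error Capture [ On ]
                    Allow User Abort [ Off ]
                    Freeze Window
                    Go to Layout [ "Blank" ]
                    Import Records [ With dialog: Off ; "NewYear.mdb" ; Add ]
                    # one throwaway find per field you want indexed
                    Go to Layout [ "Assessment" ]
                    Enter Find Mode [ Pause: Off ]
                    Set Field [ Assessment::LandValue ; ">0" ]
                    Perform Find []
                    Enter Find Mode [ Pause: Off ]
                    Set Field [ Assessment::ImprovementValue ; ">0" ]
                    Perform Find []
                    # ...repeat the Enter Find Mode / Set Field / Perform Find trio
                    # for every remaining searchable field
                    Show All Records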

                 

                However, with a generous SSD this won't take much time.  You mentioned Access; you might find that some data manipulation goes faster in Access.  With large datasets I still use Visual FoxPro 9 to parse and set up files, then import into FileMaker for my user interface.

                • 5. Re: Indexing very large datasets
                  bigtom

                  garyods1 wrote:

                   

                  • First, work with a generous amount of RAM

                  This is an easy thing to do.

                  • Nowadays an inexpensive SSD will speed you up immensely

                  I would agree that SSD storage is a huge change, more than you would expect, and also cheap for what you get. Enterprise SSDs are getting more and more reasonable. The reads and writes are so much faster. I suppose you could also look at keeping the source file on a different SSD so there is just one-way traffic for the import.

                  • Using your cloned updating databases, set indexing off on all indexed fields (save a copy of this/these empty databases too)

                  This just makes sense.

                   

                  Hardware is one of the easiest ways to make improvements and is often overlooked. I once deleted 2M records and it did not take long. SSDs are nice that way.

                  • 6. Re: Indexing very large datasets
                    everyman

                    Thanks.  Turning off indexing does make sense to speed up the import.  That said, I'm concerned about re-indexing 6 million records next year, and 9 million the year after, and then 12 million...

                     

                    What I'm getting is that it boils down to this:

                     

                    1. Clone the database to SSD and turn off indexing in the clone
                    2. Import existing data into the clone
                    3. Import new data into the clone
                    4. Turn on indexing (scripted "Perform Find" on all indexed fields)
                    5. Upload the newly-indexed file to the server

                     

                    In a perfect world we'd use SSD for the server disks/array.

                     

                    Have I got this right?

                    • 7. Re: Indexing very large datasets
                      everyman

                      You mentioned Access; you might find that some data manipulation goes faster in Access.  With large datasets I still use Visual FoxPro 9 to parse and set up files, then import into FileMaker for my user interface.

                      We get the data as Access files from external sources, but we don't have Access in-house (all-Mac shop).  I have found that defining the .mdb's as external ODBC sources is preferable to importing from .tab files -- much faster and generally much easier to deal with.  It also eliminates the wait (which can be days or weeks) for people to export their Access data to .tab files.
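                      For what it's worth, the ODBC pull boils down to a single scripted Import Records step against the DSN; roughly like this (the DSN, table, and field names here are made up):

                          Import Records [ With dialog: Off ;
                              DSN: "AssessorData" ;    # ODBC DSN pointing at the .mdb
                              SQL: "SELECT ParcelID, LandVal, ImpVal FROM Assessments" ;
                              Add ]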

                       

                      The thought did strike me to look at using PostgreSQL for the backend, with FMP as the user interface.  But just experimenting with that is a months-long project, and that's time I don't have when the idea occurs to me (which is every November).

                      • 8. Re: Indexing very large datasets
                        bigtom

                        everyman wrote:

                         

                        In a perfect world we'd use SSD for the server disks/array.

                        Just do it.

                        • 9. Re: Indexing very large datasets
                          greatgrey

                          Just thinking: how would it work to have each year's data in its own table/file, with some type of portal setup to show your search results?  Not sure if practical, but it would keep import time down and not affect the current data.

                          P.S.

                          And as the data aged, once it got to, say, 5 years old, you could merge several years into one file, e.g. 2001 to 2005, 2006 to 2010.

                          • 10. Re: Indexing very large datasets
                            everyman

                            greatgrey wrote:

                             

                            Just thinking: how would it work to have each year's data in its own table/file, with some type of portal setup to show your search results?  Not sure if practical, but it would keep import time down and not affect the current data.

                            I've thought about doing this, but I don't see it as practical.  With standalone files, there's no good way to list all years together; there's no good way to see year-over-year changes or compare two consecutive years, let alone two arbitrary years.  At least, none that is straightforward.  I suppose it could be done with a lot of extra scripting and (for example) using ExecuteSQL() and virtual portals.  Bottom line: if I can keep the annual import time down to a weekend, it'll be fine.
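                            For the record, this is the kind of thing I mean: with per-year tables, a year-over-year comparison ends up as a calculation like the one below (table and field names are made up), with the result fed to a virtual portal.

                                // land-value change for one parcel, assuming table occurrences
                                // named Assess2014 and Assess2015
                                ExecuteSQL (
                                    "SELECT a15.ParcelID, a15.LandVal - a14.LandVal
                                       FROM Assess2015 a15
                                       JOIN Assess2014 a14 ON a14.ParcelID = a15.ParcelID
                                      WHERE a15.ParcelID = ?" ;
                                    "|" ; "¶" ; $parcelID )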

                            • 11. Re: Indexing very large datasets
                              bigtom

                              How far away is the Access data? Same network? How is the connection? Could this be an issue?

                               

                              I really think a hardware change to SSDs is going to be your best bet. If you have a RAID, swapping in one SSD at a time and letting the array rebuild (if your controller supports it) is an option that keeps you running during the upgrade.

                              • 12. Re: Indexing very large datasets
                                garyods1

                                Sorry, I had to wait for my wife to retire.

                                 

                                Given the date timing of this, I'm guessing you are updating annual assessor's tax data in your files.

                                 

                                Your import file probably only needs to be indexed on the Tax ID.

                                 

                                I'm guessing your primary file is your Property file with 3M+ records, related to the Owner file with the same number of records; for fractional ownership I'd relate a second Owners2 file with the additional ownership information.  Then there is the Assessment info, also related to the Property file; this is probably the file that's growing each year.  The good news is that this file is probably relatively small: not too many fields, mostly numeric data.  Consider this: modify your Assessment info file to contain 5 years of data depth.  So you would have fields like yr1 land val, yr1 imp val, etc. (for all your current fields), then in the same record yr2 land val, yr2 imp val, etc., through year 5.

                                 

                                Now each year when you update, replace the values in the yr5 fields with the values in the yr4 fields, yr4 with yr3, yr3 with yr2, and yr2 with yr1; then clear out the yr1 fields for the new import data.  Once you have imported the new year's values, you will have 5 years' worth of data (15 million records' worth of assessment data at your current volume), easily accessible with only 3M+ records.  Of course you will have to create an archive file once you go over 5 years, but that can be processed when you aren't under the time pressure of the annual updates.
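                                The yearly roll-forward itself can be a handful of Replace Field Contents steps run against all records; a sketch for one value (field and layout names are examples; repeat the block for each value you carry):

                                    Go to Layout [ "Assessment" ]
                                    Show All Records
                                    # shift the oldest year first so nothing is overwritten too early
                                    Replace Field Contents [ With dialog: Off ; Assessment::yr5_land_val ; Replace with calculation: Assessment::yr4_land_val ]
                                    Replace Field Contents [ With dialog: Off ; Assessment::yr4_land_val ; Replace with calculation: Assessment::yr3_land_val ]
                                    Replace Field Contents [ With dialog: Off ; Assessment::yr3_land_val ; Replace with calculation: Assessment::yr2_land_val ]
                                    Replace Field Contents [ With dialog: Off ; Assessment::yr2_land_val ; Replace with calculation: Assessment::yr1_land_val ]
                                    Replace Field Contents [ With dialog: Off ; Assessment::yr1_land_val ; Replace with calculation: "" ]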

                                 

                                You might also consider storing sales data in a separate file; this will keep you from having to archive owner info annually and lets you save only the changes.

                                 

                                Hope this helps

                                • 13. Re: Indexing very large datasets
                                  RubenVanDenBoogaard

                                  If it is a once-a-year job, taking the file off the server, doing the import on a fast SSD-equipped machine (+ save as compacted copy?), and getting it back on the server will be the easiest option.

                                   

                                  Switching indexing off and on is faster, but I wonder if that is still the case after a few years when your database already has a couple of million records in it.

                                   

                                  Another option could be to import the data into an FM file, upload that to the server, and use a server-side script to import x records every night until all the records are in the original database. That would spread out the load on the system over x nights.
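                                  A sketch of what that nightly server-scheduled script could look like (everything here is a placeholder; it copies up to 5000 staging records per run and flags them as done):

                                      Set Error Capture [ On ]
                                      Go to Layout [ "Staging" ]
                                      Enter Find Mode [ Pause: Off ]
                                      Set Field [ Staging::Imported ; "=" ]    # "=" finds records with an empty Imported flag
                                      Perform Find []
                                      Set Variable [ $count ; Value: 0 ]
                                      Go to Record/Request/Page [ First ]
                                      Loop
                                          Exit Loop If [ $count >= 5000 or Get ( FoundCount ) = 0 ]
                                          Set Variable [ $parcel ; Value: Staging::ParcelID ]
                                          Set Variable [ $land ; Value: Staging::LandValue ]
                                          Go to Layout [ "Assessment" ]
                                          New Record/Request
                                          Set Field [ Assessment::ParcelID ; $parcel ]
                                          Set Field [ Assessment::LandValue ; $land ]
                                          Go to Layout [ "Staging" ]
                                          Set Field [ Staging::Imported ; 1 ]
                                          Set Variable [ $count ; Value: $count + 1 ]
                                          Go to Record/Request/Page [ Next ; Exit after last: On ]
                                      End Loop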

                                   

                                  Hope that helps

                                  • 14. Re: Indexing very large datasets
                                    Mike_Mitchell

                                    everyman wrote:

                                     

                                    I'd like to keep that fairly linear: if after a few years it begins to look like three days (and then four and five) I'll be back to keeping it in an annual file.

                                     

                                    That might not be an entirely bad idea. Here's why:

                                     

                                    1) By re-doing this import every year, your indexes are taking a beating. They'll eventually have to be rebuilt when they don't work right anyway.

                                     

                                    2) As you note, the time involved in the ever-increasing data load is eventually going to be prohibitive, regardless of what you do.

                                     

                                    What you could do instead is build a separate interface file, using a one-to-one relationship. Import only the key fields associated with the main data archive, plus a key field for the year. Then you can do your searching there by pointing the key field and year to the appropriate year's data file. Searching will be only a tiny bit slower than doing it all in one file, because the data archive files can still be indexed. But you only have to do it once.
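                                     Schematically, the interface file ends up with one small table and one relationship per year file, something like this (names are invented):

                                         Interface::ParcelID  =  Archive2015::ParcelID
                                         Interface::Year      =  Archive2015::Year

                                     You search on a layout based on Interface and display the related archive fields; the year file's own indexes do the heavy lifting.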
