Nobody can tell how long it will take; it all depends on your hardware. I certainly would not want to run this on a laptop. You need a machine with lots of RAM and very fast disks to make this efficient. A server, probably.
The machine itself is pretty fast: it has 16 GB of RAM and an 8-core i7 CPU. The disk where I have the files is a 4-disk RAID 5 array connected to the laptop by USB 3. The concern is more that I will probably need the laptop for other things. I was hoping it would only take 1-2 days, so I could let it run over the weekend and have it all done...
I guess another question is this: if I host this basically read-only table (file) on FileMaker Server, backups would likely take a really long time. But since the data won't change, I could just keep offline backups and skip server-side backups for this file. If that's the plan, would there really be a reason to go through externalizing the containers?
Keeping the files internal will force a complete 800 GB backup if you change a comma.
You may get the prize for the largest FileMaker file.
Just curious ... how long did the original import take?
Strange that the transfer to external storage would be so much slower. Perhaps this external disk is the bottleneck.
> "RQST_TREE" has about 800 GB of BLOB data spread out across over 1 million records. ODBC import was able to pull all this data into filemaker as container data, and exporting the field contents proves that the original documents are intact and readable
That is a very good point. I do have it configured to use secure storage. I bet the encryption is killing me. I will restart the job with that turned off.
The original import took somewhere between 25 and 30 hours over a gigabit network. If I have to do this again, I'll probably set up the field defaults and storage options after importing just a few records, then import the rest with the file already configured for external storage. That way FileMaker won't have to do a huge file compact afterward to recover all the empty space.
I'll keep you posted on the performance of this file, especially as it relates to using 360Works Scribe to pull keywords from all the supported container data for searching.
How are you finding Scribe's performance? Are you using it with native electronic documents or scanned PDFs with hidden OCR? Is it pulling the text as expected, or missing parts?
Pretty much all the documents I'm working with are native electronic documents (Word, Excel, PowerPoint, PDF, TXT, and some others), and so far the extraction of text content has been pretty good. The problems I've found are with scripting this for the huge number of documents in the file, and with the fact that the imported data doesn't carry the proper document name and extension on the container. Since the column type in SQL was varbinary, there is no name (all files come in as Untitled.dat), but the name and extension are stored in two other columns. I've been able to export the container contents to a temp file with a calculated name, and then read the contents of that file with Scribe. I also have the script call a Send Event to delete the document when it's done reading it.
The problem is that the deletion spawns a new process on every loop iteration, and it sometimes falls behind. This has caused the export to fail, since records with different IDs can share the same filename. (I should probably concatenate the record UID into the filename to make it unique.) This, or some other issue with calling so many external events, caused FileMaker to hang during the extraction, and the consistency check on an 800 GB file takes a LONG time.
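The uniqueness fix mentioned above (concatenating the record UID into the temp filename) can be sketched outside FileMaker. This is just an illustration of the idea; the function name and parameters are mine, not anything from Scribe or FileMaker:

```python
import os
import tempfile

def unique_temp_path(record_uid, doc_name, doc_ext, temp_dir=None):
    """Build a collision-proof temp filename by prefixing the record UID.

    doc_name and doc_ext stand in for the two SQL columns that hold the
    original document name and extension; record_uid is the record's
    unique ID. All names here are illustrative assumptions.
    """
    temp_dir = temp_dir or tempfile.gettempdir()
    # Prefixing the UID means two records that both hold "report.pdf"
    # can never overwrite each other's temp file during the loop.
    return os.path.join(temp_dir, f"{record_uid}_{doc_name}.{doc_ext}")

# Two records with the same document name now map to distinct paths,
# and the file can be deleted in-process right after Scribe reads it,
# instead of handing the delete off to a separate Send Event process.
path_a = unique_temp_path(1001, "report", "pdf")
path_b = unique_temp_path(1002, "report", "pdf")
```

Deleting the temp file synchronously in the same step that read it (rather than spawning an external delete on each iteration) also removes the race where the deletions fall behind the exports.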
I've also checked out using ScriptMaster to rename the file within the container. That works, but it doesn't really just rename it; rather, it streams the binary data out of the container and puts the stream back into the original field under a new name, which will slow down the process as well.
I figure the best approach is still externalizing the container data, so at least it will be a bit easier to recover if I do something that causes the indexing job to crash.
Secure storage is actually the faster option for this process. The "encryption" is not of the container data, but of the storage structure naming conventions. "Secure" also protects against the OS problems encountered if you try to save thousands of files into the same OS-level directory, which can grind this process to a halt even with just a few thousand records. Stay "secure."
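FileMaker's actual secure-storage layout is proprietary, but the principle described above (obfuscated directory naming that also fans files out across many folders) is easy to illustrate generically. This sketch is my own, not FileMaker's scheme: it hashes a filename and uses the leading hex characters as nested bucket directories so no single OS folder accumulates tens of thousands of entries.

```python
import hashlib
import os

def bucketed_path(base_dir, filename, levels=2, width=2):
    """Spread files across hashed subdirectories.

    Illustrative only: shows why hashed fan-out avoids the OS slowdown
    of dumping thousands of files into one flat directory. Not the
    actual FileMaker secure-storage layout.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Take the first few hex chars as nested bucket names, e.g. "ab/cd/"
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(base_dir, *buckets, filename)

print(bucketed_path("containers", "invoice_42.pdf"))
```

With two levels of two hex characters each, files are spread over up to 65,536 buckets, so each directory stays small even with millions of records.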
It seems the data is encrypted by default.
e.g. "Alternatively, you can choose to keep the data in its native format through open storage"
The help article is here:
> The "encryption" is not of the container data, but of the storage structure