FMSA 11 meltdown w/data loss (again)
I've been having recurring incidents of FM Server Advanced 11 crashing and taking a good deal of data with it. Previously, it seemed to be Java-related (see http://forums.filemaker.com/posts/051c6481a0 and http://forums.filemaker.com/posts/a98433496a for descriptions of that).
I'm not certain whether today's crash was the same thing, since I wasn't able to connect to the iStat server running on the machine, but I did come up with a little new info about what happened. I'll try to be as concise and organized as possible:
FileMaker Server Advanced 188.8.131.52
Mac Mini Server Edition (headless), OS X Server 10.6.6, 8GB RAM, 120GB SSD w/94GB free
1. Came in this morning and nobody could connect to the FM server. Server would ping, but VNC/SSH/iStat connections were unresponsive. Rebooted the server. Users began to report loss of data they had entered the previous day.
2. Checked the FMSA logfiles. Events stopped being logged around 1:20 PM yesterday with no further log entries until the server reboot this morning. An informal poll of the staff seemed to indicate that the data they lost was entered after 1:20 PM.
3. Hourly backups continued to run through 8 PM yesterday and then stopped (they're set to run 24/7).
4. The daily backup successfully ran at 4 AM today.
So: FMSA was functioning on at least some level up through 4 AM today. Users were successfully connected to hosted files through at least 6:30 PM yesterday. The hourly backup from 8 PM yesterday contained the data found to be missing from the hosted files this morning.
It's as if FMSA stopped writing data to disk for the actively hosted files (and the logfiles) some time around 1:20 PM yesterday, even though by all external appearances it was running normally.
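In case it helps anyone check their own logs for the same kind of silent stretch, here is a rough Python sketch of how I'd look for gaps between consecutive log entries. It assumes each log line starts with a 19-character timestamp like 2011-01-01 13:20:03; that layout is my assumption for illustration, not a documented FMSA format, and the function name and the sample lines are made up.

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, max_gap=timedelta(hours=1),
                  ts_format="%Y-%m-%d %H:%M:%S"):
    """Return (previous, next) timestamp pairs that are more than max_gap apart.

    Assumes (hypothetically) that each entry begins with a 19-character
    timestamp matching ts_format. Lines that do not parse are skipped.
    """
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line[:19], ts_format))
        except ValueError:
            continue  # not a timestamped entry, ignore it
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]

if __name__ == "__main__":
    # Made-up sample lines for illustration, not real FMSA log output.
    sample = [
        "2011-01-01 13:19:58 some event",
        "2011-01-01 13:20:03 some event",
        "2011-01-02 08:45:00 server started",
    ]
    for before, after in find_log_gaps(sample):
        print("No entries between", before, "and", after)
```

Something like this only shows when logging went quiet; it says nothing about why.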
I have only 64MB allocated for the cache, and the cache flush interval is only 1 minute. I set them to those values in an attempt to get FMSA to write data to disk frequently so we'd have minimal loss when it exploded. Yet by the time we discovered the crash, it looked like data hadn't been written for about twenty hours, including a five-hour period (roughly 1:20 PM to 6:30 PM) when users were working in hosted files with no apparent problems.
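One thing I'm considering as a crude early warning is a small watchdog that flags files in a folder whose modification time hasn't changed recently. Below is a minimal Python sketch of that idea. The function name and the threshold are hypothetical, and I don't know how reliably a hosted file's modification time tracks FMSA's cache flushes, so treat this as an experiment rather than a proven check.

```python
import os
import time

def stale_files(directory, max_age_seconds, now=None):
    """Return names of regular files in directory not modified within max_age_seconds.

    The assumption (unverified for FMSA) is that an actively used hosted file
    gets its modification time updated when data is flushed to disk.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only look at regular files; skip subfolders.
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            stale.append(name)
    return stale
```

Run from cron every few minutes during working hours against the folder holding the hosted files, a non-empty result could trigger an email. It would need tuning for files that are legitimately idle.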
This is massively frustrating, and in my case, career-threatening. If anyone can put two and two together from the above info and come up with any sort of educated guess about what happened, I would be immensely thankful. Please.