FileMaker Server 220.127.116.11
OS: Windows Server 2012 R2 Standard
FileMaker Server 18.104.22.1688
OS: Windows Server 2008 R2 Standard
FileMaker Pro Advanced Client 22.214.171.1245
OS: Windows 10
(The client is not really important here; it is only used to make the bug easier to replicate and observe.)
We originally encountered the issue on FileMaker Server 126.96.36.1995 running on a Mac mini.
Description of the problem
We discovered this issue in one of our use cases, where we had the specific requirement that certain scripts on the server may not run simultaneously. We therefore made use of a locking mechanism that relies on FM's unique validation.
Before our script begins its tasks, it first navigates to a Lock table and attempts to create a lock record. Every type of lock record has a specific name and this must be unique, because there may never be more than one lock of a certain type. So there is a unique validation constraint on the "Name" field in our Lock table (Always validate, no override).
We then use Set Error Capture to capture the commit error (504). If there is no error, the script has obtained the lock and can continue.
Once it is done, it removes the lock, allowing others to obtain it. If it encounters error 504, a lock with the same name already exists and we revert our newly created record.
When we fail to obtain the lock, we wait a moment and try again, repeating until a specified time limit is exceeded, at which point we stop trying. The reasoning is that the lock may be released at any moment, so retrying for a while should increase the probability of obtaining it.
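The acquire, retry and release cycle described above can be sketched as follows. This is only an analogy in Python: an in-memory SQLite table with a UNIQUE constraint stands in for the Lock table with its unique validation, and `sqlite3.IntegrityError` plays the role of error 504. The function names, timeout and pause values are illustrative assumptions, not taken from our FileMaker scripts.

```python
import sqlite3
import time


def open_lock_table():
    """Create a stand-in Lock table whose name column must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lock (name TEXT NOT NULL UNIQUE)")
    return conn


def try_acquire(conn, name):
    """Try to create the lock record; False means the lock already exists."""
    try:
        with conn:  # commits on success, rolls back (reverts) on error
            conn.execute("INSERT INTO lock (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:  # analogous to FileMaker error 504
        return False


def release(conn, name):
    """Delete the lock record so that another session can obtain it."""
    with conn:
        conn.execute("DELETE FROM lock WHERE name = ?", (name,))


def acquire_with_retry(conn, name, timeout=120.0, pause=0.1):
    """Keep trying to obtain the lock until the time limit is exceeded."""
    deadline = time.monotonic() + timeout
    while True:
        if try_acquire(conn, name):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(pause)
```

A caller would wrap its work as `if acquire_with_retry(conn, "LCKName"): ...` followed by `release(conn, "LCKName")` once the work is done.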
Because we rely on the FM database engine to prevent duplicate locks, this looks like a very secure mechanism. Unfortunately, when multiple server-side sessions are constantly attempting to commit the lock record, our locking mechanism can fail. The moment the lock is released, a race condition occurs and multiple records are committed at the same time.
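We do not know how the unique validation is implemented inside FileMaker Server, so the following is purely an illustration of the kind of race that would produce this symptom, not a description of the actual engine. It assumes, hypothetically, that the uniqueness check and the commit are two separate steps. A barrier forces every simulated session to finish its check before any of them commits, which makes the outcome deterministic here; in reality the overlap would depend on timing.

```python
import threading


def simulate_non_atomic_unique_check(sessions=10, name="LCKName"):
    """Simulate sessions that validate uniqueness and commit in two steps."""
    table = []  # stand-in for the Lock table (hypothetical model)
    all_checked = threading.Barrier(sessions)

    def session():
        # Step 1: validation. The table is still empty for every session.
        is_unique = name not in table
        # Force the worst case: everyone validates before anyone commits.
        all_checked.wait()
        # Step 2: commit. Every session believes its record is unique.
        if is_unique:
            table.append(name)

    threads = [threading.Thread(target=session) for _ in range(sessions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table
```

With ten sessions the returned table holds ten records with the same name, which mirrors the duplicate LCK records we observe.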
How to replicate
To replicate the issue we have included a demo file of our locking mechanism. In it we simply launch 10 server-side scripts that constantly try to obtain the lock in a loop. They keep looping for 120 seconds with a 0.1 second pause. The pause between attempts is intentionally a little shorter than in our original use case, to increase the chance of the problem occurring.
- Host the demo file on a FileMaker Server. Make sure Maximum Simultaneous Script Sessions is set above 10. (This is to prevent interference from any strange behavior of FMS when the maximum number of sessions is exceeded, which would be completely unrelated to our issue.)
- Full access login: User: Admin Pass: Admin
- You will see an empty lock table. To start the demo, simply click on the green run button.
- The script runs and a single lock record (LCK) with the name "LCKName" is created. (The server session that started earliest will probably claim the lock before the others.) We now have one good LCK record and 9 server-side sessions that keep trying to obtain the lock but constantly fail (error 504 in the server error log).
- Now delete the LCK record by clicking on the bin. Whether the bug occurs is partly a matter of luck, so if you again see just one LCK record, delete it again, and keep doing so until multiple LCK records are visible. When no more LCK records appear, click run again. With the settings in this demo, however, you should have an almost 100% chance of encountering the bug on the first run.
- When multiple LCK records appear with the same name, you have successfully replicated the issue. You can then see that these records were committed at exactly the same millisecond.
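If you prefer to check the result outside the layout, for example on exported records, duplicates can be found by grouping on the lock name. The snippet below is a small sketch; the `(name, commit timestamp)` record shape and the sample values in the usage note are made up for illustration and are not taken from the demo file.

```python
from collections import defaultdict


def find_duplicate_locks(records):
    """Return {name: sorted commit timestamps} for names that occur more than once.

    `records` is an iterable of (name, commit_timestamp) pairs (assumed shape).
    """
    by_name = defaultdict(list)
    for name, committed_at in records:
        by_name[name].append(committed_at)
    return {name: sorted(ts) for name, ts in by_name.items() if len(ts) > 1}
```

For instance, calling it with two made-up "LCKName" records that share one timestamp returns a single entry for "LCKName" listing both timestamps, while a table with one record per name returns an empty dict.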
For now no solid workaround exists for this issue. A temporary solution in our use case was to run only once and then quit, instead of retrying. This drastically reduced the chance of the bug occurring.