38 Replies. Latest reply on May 12, 2016 5:22 PM by NickLightbody

    What is happening when FileMaker Server becomes overloaded (and how to avoid it)

    NickLightbody

      Summary: we will describe, discuss and illustrate the statistics that enable you to understand the why and how of FileMaker Server performance and suggest means of delivering a predictable and acceptable performance to your users.

      Why this is important

      FileMaker Server 13 is a wonderful and very reliable product - provided - as with any product - you recognise, understand and work within its limits.

      However, Server is a binary product - in the sense that it either performs “good” or it performs “bad” - very slowly - but very reliably - as it grinds through its backlog until its load has reduced sufficiently for it to catch up on its queued calls and return to “good” mode.

      There is really very little middle ground - so when you look at the server statistics and watch the graph crawling along the floor - thinking that you are not really using its full capacity - you may in fact be deluding yourself - as we will illustrate.

      An understanding of what server hardware resource is required to ensure that a specific number of users receive a consistently good service is clearly essential but such information is - surprisingly - a little hard to come by.

      FMI themselves say that Server requires a separate CPU core - effectively a separate CPU - to handle each concurrent remote call efficiently in a smaller deployment, and slightly fewer cores per call in larger situations. That’s a “remote call” - not a “remote user”. Depending on the complexity of what it is being asked to do, Server can handle quite a few calls a second - so it would be interesting to work out how many, and perhaps to relate the number of remote calls to calls per user.

      Whilst FMI’s own technical recommendations are a good starting point they can appear more than a little conservative when compared with many people’s own experience - where often 4 cores appear to support 20 or more users - so what is going on?

      Research

      To investigate this we are testing FileMaker Server with “virtual clients” - server side scripts whose "completion" we do not await. We can therefore send off a series of autonomous scripts - each simulating a client using Server - from a single client side UI. We then watch the CPU history and statistics in the Server Admin Console, watch the event log recording the statistics for each transaction, and load the Server to the point of near choking by adding or disconnecting virtual clients.

      These current comments apply only to FMP or FMGo connections - not yet to WebDirect - which we will test on another occasion.

      Using admin console statistics

      The true load factor on FileMaker Server is not the number of clients but the number and frequency of remote calls, one of the 11 statistics observable in the FileMaker Server Admin Console under Statistics. Remember that each of these numbers is a sample of a single moment in Server's operation - the moment the sample is taken - so turn the sample frequency in Admin Console up to every 3 seconds to get a better idea of what is happening. Turn it down again when it is not required, since measuring anything also affects what you are measuring - in this case by creating load - so a higher sample frequency will slow down normal operation. For our investigation we need to turn on the following three statistics:

      (1) Remote calls/sec - this represents the server load - each call being a significant set of instructions
      (2) Remote calls in Progress - at any single moment - when the sample is taken - often zero
      (3) Wait time (ms)/call - this shows the effect of load - the output - delivering the user experience

      The point at which performance and hence user experience starts suffering is indicated by spikes in the Wait time (yellow) but is determined by the Remote Calls in Progress (pale blue) moving above the floor of the display and remaining there.
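
      The rule above can be sketched in Python, assuming the three statistics have been copied out of the Admin Console into a list of samples. The field names and the `sustained` threshold are hypothetical and not part of FileMaker Server; the sketch only separates an isolated spike from Remote Calls in Progress lifting off the floor and staying there.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One reading of the three statistics (hypothetical field names)."""
    calls_per_sec: float
    calls_in_progress: int
    wait_ms_per_call: float


def first_sustained_congestion(samples: List[Sample], floor: int = 0,
                               sustained: int = 3) -> Optional[int]:
    """Return the index at which calls_in_progress has been above `floor`
    for `sustained` consecutive samples, or None if that never happens."""
    run = 0
    for i, s in enumerate(samples):
        run = run + 1 if s.calls_in_progress > floor else 0
        if run >= sustained:
            return i
    return None


# Made-up readings: one isolated spike, then a sustained lift-off.
history = [
    Sample(20, 0, 1), Sample(35, 2, 40), Sample(30, 0, 2),
    Sample(55, 12, 300), Sample(58, 15, 450), Sample(60, 18, 600),
]
print(first_sustained_congestion(history))
```

      With these made-up readings the isolated spike is ignored and the sustained run at the end is flagged.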

      Choking on "busy" users

      A typical situation is shown in fig 1 - where “Busy” virtual users are being added rapidly to a 4 core MacMini host - there is a small spike at 14:14 - but after additional users arrive to bring the total to 15 at 14:15 - at 14:16 the Remote Calls in Progress lifts off the floor and stays up with values of 12 - 18. The server has insufficient speed to recover until the load is significantly reduced - hence the queue is congested - server chokes - so everything slows and nearly stops.

       

      fig 1 - choking with busy users

       

      Choking is simply that - the rate of new calls on server exceeding its ability to deal with them - hence an ever increasing queue builds up which takes time to be dealt with by the Server and hence cleared.

      This choking characteristic is why folk may be misinforming themselves when they look at the stats and think their server has much more spare capacity available than is in fact the case - the moment the Remote Calls in Progress exceeds the number of cores available in the CPU - the risk of a suddenly escalating choke arises. The choke develops very rapidly - each delay multiplying further delays behind it - so the apparent surplus capacity disappears in an instant.
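
      The build-up of the queue can be illustrated with a toy model in Python. It assumes, purely for illustration, that the server clears a fixed number of calls each second and that everything else waits; the rates are invented and are not measurements from these tests.

```python
def backlog_over_time(arrivals_per_sec, capacity_per_sec):
    """Return the queue length at the end of each second.

    arrivals_per_sec: list of new calls arriving in each second.
    capacity_per_sec: calls the server can complete per second (assumed fixed).
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_sec:
        # Whatever cannot be cleared this second carries over to the next.
        backlog = max(0, backlog + arrivals - capacity_per_sec)
        history.append(backlog)
    return history


# Load just under capacity: the queue stays on the floor.
print(backlog_over_time([45] * 5, capacity_per_sec=50))
# Load just over capacity: the queue grows every second.
print(backlog_over_time([55] * 5, capacity_per_sec=50))
# The queue only drains once the load drops well below capacity.
print(backlog_over_time([55] * 5 + [30] * 3, capacity_per_sec=50))
```

      A real choke escalates faster than this, because each delayed call also slows the calls behind it; the toy model only shows why a small excess over capacity is enough to start the queue growing.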

      Supporting more less active users

      However, if users with a lower level of activity are introduced - in this case “Fairly Inactive users” - server will support a much higher number - as shown in fig 2. Until, that is, many of them do something load creating at the same time - in which case congestion and a choke will arise but will likely be short lived - because the general load is low, so there is little load bearing on top of the congestion to escalate the choke.

       

      fig 2 - supporting more less active users

       

      Observations

      We can observe from fig 1 that a level of below 50 Remote Calls / sec (dark blue) seems sustainable for this server - but that when the level moves above 50 the server moves beyond its ability to clear the backlog without significant delay. However, there must be more to it than that since in fig 2 - at 15:08 - Server suffers a minor congestion with Remote Calls / sec below 30.

      As the server has 4 cores we can make an initial theory that each core can safely handle 10 - 13 calls a second - but that when that capacity is exceeded choking will result. This clearly requires refinement based on the inconsistency in the preceding paragraph.
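
      The arithmetic behind this initial theory, and the way the fig 2 observation falls outside it, can be written out as a short Python sketch. The function name is hypothetical; the per-core figures of 10 and 13 are the ones proposed above.

```python
def sustainable_range(cores, per_core_low=10, per_core_high=13):
    """Calls/sec the initial theory says a server can sustain (low, high)."""
    return cores * per_core_low, cores * per_core_high


low, high = sustainable_range(4)
print(low, high)  # 40 52

# The two observations from the figures, checked against the theory.
for label, calls_per_sec in [("fig 1 choke", 50), ("fig 2 congestion", 30)]:
    verdict = "explained" if calls_per_sec >= low else "not explained"
    print(label, calls_per_sec, verdict)
```

      The choke at around 50 calls a second sits inside the 40 - 52 range, while the congestion at under 30 does not, which is the inconsistency that still needs explaining.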

      Getting the best out of FileMaker Server

      The foundation of the performance we are observing is the speed and capacity of the server hardware - virtual or real - hosting FileMaker Server, the number of CPU cores, their speed in GHz, the amount of RAM in GB, the amount of FMS cache you have selected in the FMS Admin Console, the efficiency of the operating system, the efficiency with which FileMaker Server itself deals with its work and uses the 64 bit architecture it has available and - finally - how well written is the solution/App you are running.

      In order to get the best out of our deployment we can consider the following options:

      A. ensure that we have a good idea of the load that will be created by our intended user cohort - perhaps split them into less active and busy users and estimate that 2 less active users are roughly equivalent to one busy user;

      B. plan to support no more than say 4 busy users - or equivalent less active users - per CPU core with something like a mid range MacMini - provided you fit as much RAM as it will take - which is currently 8 or 16 GB depending on the age of the machine;

      C. provided our solution is well written and efficient consider using cloud based hosting on virtual machines so that server resource can easily be increased if required to cope with increased demand;

      D. ensure that we are using the most up-to-date version of FileMaker Server available as this software becomes faster with every new version that is released;

      E. consider writing a token controlled flow-control system to regulate the load that your solution/App applies to FileMaker Server. This is like trains travelling in opposite directions on a single line - a train - or load creating instruction to server - is only permitted to use the line when it has obtained a token. By restricting the supply of tokens we can control the number of calls being heaped upon Server - to protect it from being over burdened. This is what we have done and fig 3 illustrates the detailed performance we obtained when testing several WAN servers recently - including with token flow-control turned on and off.

       

      fig 3 - detailed results of testing three different WAN servers

       

      We assess deployments using a Productivity Index for the system's capacity (larger is better) and a User Experience Index (smaller is better) to predict the goodness - or otherwise - of the user experience. When we turned Token flow-control off you can see - in the red boxes - that the Productivity declined from 51 to 45 and the User Experience worsened from 25 to 410;

      F. consider improving the efficiency of our solution/App - the less we ask FileMaker Server to do the more it will get done and solutions which were written or started many years ago will certainly be capable of being much improved with a more modern approach to designing and building. Things that took hundreds of steps a few years ago can now be done in very few with a commensurate reduction in server load. We need to simplify our solution and play to FileMaker strengths. Consider removing all features that are really not required by most users. It can be surprising how much you can improve performance by just removing unnecessary scripting, relationships and layout objects;

      G. we must of course use styles and themes well - bite the bullet and get rid of classic if we have delayed.
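
      The rules of thumb in options A and B can be combined into a rough sizing calculation. This is only a Python sketch of that arithmetic; the function name is hypothetical, and the ratios (2 less active users to 1 busy user, 4 busy users per core) are the estimates given above rather than anything published by FMI.

```python
import math


def cores_needed(busy_users, less_active_users,
                 busy_per_core=4, less_active_per_busy=2):
    """Estimate CPU cores from the user mix, using the rules of thumb above."""
    # Option A: convert less active users into busy-user equivalents.
    busy_equivalent = busy_users + less_active_users / less_active_per_busy
    # Option B: allow no more than `busy_per_core` busy users per core.
    return math.ceil(busy_equivalent / busy_per_core)


# 10 busy + 12 less active users = 16 busy equivalents.
print(cores_needed(busy_users=10, less_active_users=12))  # 4
```

      Rounding up keeps the estimate on the cautious side, which suits a server that degrades as abruptly as described here.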

      Here is an additional comment from Wim Decort - thanks Wim!:

       

      "Some calls will obviously take longer to execute than other ones, which will increase the back-log and the overall "wait time/call".

       

      Of the 4 traditional bottlenecks (memory, network bandwidth, disk i/o and processing power), network and memory are usually not the problem but they can be.

       

      For the other two FMS has a stats counter that ties them to the call:

      - elapsed time per call

      - disk i/o per call

       

      Spikes or sustained high numbers in elapsed time per call can be a symptom of bad design (as in: doing a search on an unstored calc across 1,000,000 records,...)

       

      All in all these stats will tell you where you should spend your resources to make the deployment fit the solution."

       

      Where Wim refers to memory - "can be a problem" - we should bear in mind that the server side user sessions created to host WebDirect will impact on memory - according to FMI 256 MB of RAM is used for each WD client - this means that on a lightly resourced Server a few WD clients could take up sufficient memory to reduce the performance of Server for Pro & Go clients.

       

      The next testing I plan will try to investigate this relationship and work out the actual effect of WD connections on real world performance.

       

      Here is a note on the testing regime we used - as requested - thanks Greg!:

       

      The following client types were defined for use in testing:

      The type of client is defined by their Level of Activity - as follows

      Type               Cycles         Delay 1, 2
      1 fairly inactive  random 0-18    random 0-18 + 300 secs
      2 busy             random 0-9     random 0-18 secs
      3 intense          random 0-9     0.1 sec

      Notes on randomisation:

      (1) Random Delay - we take a random 6 digit number and add together the leftmost and rightmost digit to give a random number between 0 and 18 - this is used to control the number of seconds delay

      (2) Random Cycles - we take the leftmost digit of a 6 digit random number to randomise the number of editing cycles of the content between 0 and 9 cycles
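
      The two randomisation rules can be sketched in Python as follows. The original runs as FileMaker scripts, so this is only an illustration; the function names are hypothetical, and allowing leading zeros in the 6 digit number is an assumption.

```python
import random


def six_digits(rng):
    """A random 6 digit string; leading zeros are allowed (an assumption)."""
    return f"{rng.randrange(1_000_000):06d}"


def random_delay_secs(digits):
    """Note (1): leftmost digit plus rightmost digit, giving 0 to 18."""
    return int(digits[0]) + int(digits[-1])


def random_cycles(digits):
    """Note (2): the leftmost digit, giving 0 to 9 edit cycles."""
    return int(digits[0])


rng = random.Random()
d = six_digits(rng)
print(d, random_delay_secs(d), random_cycles(d))
```

      One property worth knowing: the sum of two digits is not evenly spread, so delays near 9 seconds turn up far more often than delays of 0 or 18 seconds.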

       

      Format - each client has a standard task to undertake:

      wait for available token

      create a new record - through a relationship - a data index record and a simultaneous one-to-one data record with a load of auto-enters

      release the token to permit another user to use it

      edit the content field with a random number of edit Cycles and a Delay 1 between each edit

      commit and save the record - data is synced between the data index and the one-to-one record

      pause for Delay 2 before starting again
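
      The shape of this task, with the token held only while the record is created, can be sketched in Python using threads in place of server side scripts. It is an analogy rather than the FileMaker implementation: `TokenPool`, `virtual_client` and the short sleep standing in for record creation are all hypothetical.

```python
import threading
import time


class TokenPool:
    """A fixed supply of tokens; a client must hold one to create a record."""

    def __init__(self, tokens):
        self._sem = threading.BoundedSemaphore(tokens)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak_in_use = 0

    def __enter__(self):
        self._sem.acquire()  # wait for an available token
        with self._lock:
            self.in_use += 1
            self.peak_in_use = max(self.peak_in_use, self.in_use)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.in_use -= 1
        self._sem.release()  # release the token for another client
        return False


def virtual_client(pool, repeats, created):
    for _ in range(repeats):
        with pool:
            time.sleep(0.001)  # stand-in for "create a new record"
            created.append(1)
        # Edit cycles, commit and Delay 2 would follow here, without a token.


pool = TokenPool(tokens=2)
created = []
threads = [threading.Thread(target=virtual_client, args=(pool, 5, created))
           for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(created), pool.peak_in_use)
```

      However many clients are started, the number creating records at any one moment never exceeds the supply of tokens, which is the property the flow-control relies on.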

       

      Note: When testing we normally tell the first Virtual user to do the task 50 times, the second 100, the next 150 etc so that as their work concludes after different periods we get to review performance from the Events Log with a gradually declining number of users. These test runs, depending on the power of the server, typically accumulate between 2,000 and 10,000 records for analysis.

       

       

       

      Nick Lightbody - Nov 27th 2014 (updated Nov 30th)
