1 2 Previous Next 16 Replies Latest reply on Dec 24, 2016 12:59 PM by TonyWhite

    Is there any reason to store categorical data numerically rather than as text?

    JacobHickman

      I am converting an old database from MS Access to FM Pro, and in the old system we numerically coded all categorical data, such as religious affiliation (1=Catholic, 2=Protestant, etc.), ethnicity, kinship relation, place-names, etc. As I recall, back in the day this was to facilitate exporting data to the statistical program we were using (SPSS), which can now handle text-based categories better than it used to. I am trying to decide whether there is any benefit to storing these categories as numerical codes (with related tables that translate codes to text-based categories), or whether it would be better practice to just simply store the categories in the tables as text (brother, sister, etc. as opposed to 6, 7, 101, etc.). I have some Filemaker development experience, but I am no pro. The central purpose to this solution is storing survey data and related multimedia files collected with surveyed households for social science data analysis (including exporting data and related demographics to data analysis software packages).

       

      Thoughts?

       

      In sum, Is there any reason to store categorical data numerically rather than as text, such as for places (geographical locations) or kinship relationships (brother, sister, paternal grandmother, etc.)?

        • 1. Re: Is there any reason to store categorical data numerically rather than as text?
          Magnus Fransson

          Hi Jacob,

           

          I knew it's not a reason, but I can say that I and many with me prefer numerical categories, as you describe. Of course I do have reasons but I lack the pedagogical skill to describe them, other the very vaguely.

           

          With best regards Magnus Fransson.

          • 2. Re: Is there any reason to store categorical data numerically rather than as text?
            nickj

            guess a lot of the history behind databases and using simple numeric data for fields dates back to the original use of punchcards (Hollerith, IBM et al) to program and store data. So simple numeric entries for fields was mandatory. This was the case into the 50's where punch cards migrated to mag tape, again storage and manipulation of large volumes of data meant that brevity in storage was needed.

            Cue the rise of the PC where initially storage space was limited using floppy disks (I remember those huge floppy disks) and a heady max of 20mb RAM on an 8088 platform. Again storage was at a premium, even the HDD's were small (and initially the size of a plane carry on case). So size mattered both on its storage and processing requirement.

             

            Now .. my desktop PC has 32GB RAM, 2TB of HDD space and more CPU cores than you could shake a stick at and the GPU's double up as heavy weight numeric processors. The server side is even more OTT!!

             

            So do you need to use numbers in a well normanlised database??
            The answer is more one of scale and ease of management:
            Small scale probably NO because you have sufficient spec to manipulate large meaningful text fields and the software to manipulate them relatively easily.
            Large scale probably YES, it still takes clock cycles to move data and shifting a number is going to be quicker than a long text name. For example 11 or Zorastrianism.

             

            The absolute answer is really, it depends what you are doing, the scale of the data and the scale of 'power' you have to hand.

            • 3. Re: Is there any reason to store categorical data numerically rather than as text?
              Mike_Mitchell

              Building on what nickj mentions, the only real advantage is performance when the data load gets large. FileMaker will index numeric fields considerably faster than text fields, which speeds up pretty much any operation that depends on the index. However, this only generally matters at very large data loads for practical purposes.

               

              There are some other considerations as well. One reason why UUIDs for primary keys have become popular is for sync. Serial numbers are easy, quick, and built-in, but they're simply unworkable for a sync strategy - duplicates are inevitable. So you'd need a numeric UUID (several methods exist to do this), assuming the expected scale is large enough to matter.

               

              The use of a key (numeric or otherwise) is a good idea from a database standpoint (free text fields tend to be ... problematic). However, user experience must be taken into consideration, so human-readable text is important. Therefore, I recommend using keyed tables as you currently do, but with value lists or other methods to make them user-friendly.

               

              HTH

               

              Mike

              • 4. Re: Is there any reason to store categorical data numerically rather than as text?
                philmodjunk

                However, there are advantages to storing the actual text. When debugging issues, for example, and you look at the raw data in a table view, you have the actual value and don't have to link in a related field to see it.

                 

                And you don't have to add all those extra tables to your relationship--which can take back in some areas the speed efficiencies that you gained in others by using an ID. Also, keep in mind that a UUID may have more characters than the actual value and thus there is no efficiencies to be had with that type of ID here.

                 

                There is also another disadvantage not mentioned here to using the actual text instead of a reference ID:

                If you set up your value list, create say 5,000 records that store a particular value selected from it and then find that you have to change that text--say to correct an error. You have to track down and update all 5,000 records where with a reference number/UUID, you simply update one record in the reference table.

                 

                So I tend to use both methods in different circumstances, balancing the pros against the cons.

                2 of 2 people found this helpful
                • 5. Re: Is there any reason to store categorical data numerically rather than as text?
                  erolst

                  philmodjunk wrote:

                  If you set up your value list, create say 5,000 records that store a particular value selected from it and then find that you have to change that text--say to correct an error. You have to track down and update all 5,000 records where with a reference number/UUID, you simply update one record in the reference table.

                  Here's another reason to store categories as an entity in their own right and reference them by key: this allows you to sort them on something else than their name, simply by adding an additional field for sortOrder or priority etc.

                  2 of 2 people found this helpful
                  • 6. Re: Is there any reason to store categorical data numerically rather than as text?
                    jbante

                    Other posters have already mentioned some of the pros for coding categories as numbers:

                    1. Normalization - Lots of numbers take less space to store than the raw text values. Some folks rightly mention that the UUIDs we often like to use for sync-prone data may not necessarily have a size advantage, but I think for the purpose of coding statistical categories, serial numbers starting from 1 really are the better default to start with.

                    2. Also normalization - You can edit the wording of the category labels without changing the meaning of the data.

                     

                    One benefit I haven't seen yet is localization and changing names over time. We'd like for interfaces expressed in different languages to mean the same thing, and coded categories rather than raw text are a good way to accomplish that. I don't suppose the original poster's application is working with data over a historically substantial time scale, but things like place and category names do change, and the same words come to mean different things at different times, and we may still want to see all the names used to mean the same category all in one place.

                    2 of 2 people found this helpful
                    • 7. Re: Is there any reason to store categorical data numerically rather than as text?
                      taylorsharpe

                      Adding to Jeremy's comment, using a numeric code tied to a related table of names can prove better than letting people freely type in options because freeform typing may get things with the same meaning spelled differently like catholic, Catholic, Cath., Roman Catholic. If you have a related and normalized table, it can be used as your value list to make sure people select the code related to one normalized spelling or acronym, etc. 

                       

                      Speaking of which, I like to use UUID's for primary key's, but convert them to decimal UUID's since FM can index and search them faster.  And I hide the UUID's from the end user so they only see the value list of options for the field. 

                      • 8. Re: Is there any reason to store categorical data numerically rather than as text?
                        philmodjunk
                        using a numeric code tied to a related table of names can prove better than letting people freely type in options because freeform typing may get things with the same meaning spelled differently like catholic, Catholic, Cath., Roman Catholic. If you have a related and normalized table, it can be used as your value list to make sure people select the code related to one normalized spelling or acronym, etc.

                        That really doesn't have anything to do with using codes for the values. A value list of names can keep the user from entering their own values just as readily. It's a good point about database design, but not one that argues for or against this particular issue.

                         

                        You also need to keep "scale" in mind. We all have "categories" or value lists that are very small and very unlikely to need changing. They are the ones where the extra complexity of setting up such a "reference table" may not justify the benefits you get from doing so.

                        • 10. Re: Is there any reason to store categorical data numerically rather than as text?
                          JacobHickman

                          Thank you all for the input into this issue. I think that, given my case and some of the issues pointed out, I am going to store the data numerically and used reference tables to store the text values for each code.

                           

                          And now a question: I take it that the best way to do this would be to:

                          1) create a table occurrence (T.O.) for every instance where I want to 'decode' a numerical value for a given field

                          2) create a relationship between the code in the main data table to the numerical code in the T.O. to decode it

                          3) set the field in the layout where I want the text field displayed to display the text value from the codebook T.O.

                           

                          Do any of you see a more efficient or better way of doing this?

                           

                          The reason that I see that I cannot simply create a relationship between a single T.O. and every field that it decodes is that if I have multiple fields in a single record that rely on one codebook, then the layout will only display the code for the first field for that record, and not the subsequent records. EXAMPLE:

                           

                          I have a single codebook for all locations (e.g., 1000=Thailand, 1001=Bangkok, 1002=Chiang Mai, 3000=Korea, etc.), and I have a data table where we have stored (as numerical codes), the birthplace by city (field 1), county (field 2) and country (field 3), as well as other locations for that individual, such as locations of migration for work. I found that if I link a single table occurrence (T.O.) to each location field, the layout will not properly display all locations for that record. This is how I concluded that I need a separate T.O. with a unique relationship to each decoded field in a record.

                           

                          Do any of you see any problems with my logic or suggest better ways of going about displaying the text values for data stored as numerical codes in the main data tables?

                          • 11. Re: Is there any reason to store categorical data numerically rather than as text?
                            philmodjunk

                            Pop up menus can enter the ID number but display the corresponding text without need for an explicitly defined relationship.

                            • 12. Re: Is there any reason to store categorical data numerically rather than as text?
                              Mike_Mitchell

                              I suggest you have a separate table containing one record for each combination of person and location. Then, you only have one location field per record to deal with. This is commonly referred to as a join table, and it is generally the best way to implement a many-to-many join ( which is what you're describing).

                               

                              HTH

                              • 13. Re: Is there any reason to store categorical data numerically rather than as text?
                                JacobHickman

                                Doesn't a one-way relationship between each location field in a record and the single list of locations (although each requiring its own Table Occurrence in the defined relationships) do the same thing, but more simply than a join table (which I understand conceptually, but have not set one up myself)?

                                 

                                It strikes me that in my case its more of a series of one-to-many relationships (the many just happens to be the same in every case) than a single domain of many-to-many relationships. Maybe I didn't describe my case clearly enough, but for each location field in a record there is a single list of locations. It just so happens that if I defined the relationship using a single table occurrence, a layout will only display the field in the location table occurrence for the first field in the layout.

                                 

                                Does that make sense?

                                • 14. Re: Is there any reason to store categorical data numerically rather than as text?
                                  philmodjunk
                                  the birthplace by city (field 1), county (field 2) and country (field 3)

                                  This would be three different value lists

                                   

                                  as well as other locations for that individual, such as locations of migration for work.

                                  These would not necessarily be put in the same record as the birthplace and thus, if you are recording multiple locations, could use a join table where you have

                                   

                                  Person-----<Location_Record>------Codebook

                                   

                                  Location_Record is the Join table that mike referred to.

                                  1 2 Previous Next