7 Replies Latest reply on Jul 18, 2017 5:54 AM by CarlSchwarz

    Is there a script that allows to split a string of words without spaces in the original words?

    drlins

      When importing text from PDFs of articles, it regularly happens that a string of words ends up in a text field without any spaces (e.g. textfieldwithoutanyspaces). Is there a script that can split the string back into the original words?

        • 1. Re: Is there a script that allows to split a string of words without spaces in the original words?
          taylorsharpe

          You won't find a FM script to do that because it would be huge.  But I bet you can find a RESTful API that you can call with Insert from URL and have it return the result into a FM script variable.  Maybe from Google or Bing.

          • 2. Re: Is there a script that allows to split a string of words without spaces in the original words?
            philmodjunk

            Once the text is in the field with no spaces, there's nothing FileMaker can use to identify where to put them.

             

            I have seen plug-ins, however, that can extract content from PDFs. You might research those to see if there's a better way to get that text in the first place.

            • 3. Re: Is there a script that allows to split a string of words without spaces in the original words?
              William-Porter

              CAVEAT LECTOR! This is not a useful post. You should probably do something else.

               

              *

               

              I'm not sure that I agree with my friend Taylor that such a script would have to be huge. It would certainly have to be clever and it would also certainly involve recursion. But it might not be all that long.

               

              It's interesting to think about how you'd go about writing such a script (or custom function). Say the string is

               

              WHENINTHECOURSEOFHUMANEVENTS

               

              My guess -- after about 7 seconds of deep thought -- is that for starters you'd need a reference word list of at least a thousand words. Depending on the texts you're trying to parse, you might need 2000 or 4000. If you're trying to parse Shakespeare, something around (what is it?) 30,000 words would be helpful.

               

              Your recursive function or script would then have to start nibbling at the string looking for word matches. Say you start at the beginning:

               

              • W not a known word (matches nothing in reference word list)
              • WH not a word
              • WHE not a word
              • WHEN -- bingo! matches something in word list

               

              At which point you might pull "WHEN" out of the string and proceed to parse

               

              INTHECOURSEOFHUMANEVENTS

               

              Since the opening of the Declaration of Independence consists entirely of words that would be in your reference list, I think this string would be (relatively) easy to parse. Of course there are potential problems in there: partial strings that are themselves words, for example,

               

              HEN

              NINTH

              UMA (if you look for names)

              MAN

              MANE

              VENT(S)

               

              But at least with this example string, these ambiguities would never be observed and so would never cause a problem, because the first word-match that the script would find in each case would be the right one, and you'd soon be done.
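
              The nibbling steps above can be sketched in a few lines. This is a minimal Python sketch rather than FileMaker code, and not a finished solution: the WORDS set and the greedy_split name are made up for illustration, and the word list is deliberately tiny. It always takes the shortest known prefix and never backtracks.

```python
# Hypothetical reference word list; a real one would hold thousands of words.
WORDS = {"WHEN", "IN", "THE", "COURSE", "OF", "HUMAN", "EVENTS"}


def greedy_split(text, words=WORDS):
    """Peel off the shortest known prefix, over and over, with no backtracking.

    Returns a list of words, or None if the scan gets stuck.
    """
    result = []
    start = 0
    while start < len(text):
        # Grow the candidate one letter at a time: W, WH, WHE, WHEN ...
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in words:
                result.append(candidate)
                start = end
                break
        else:
            # No prefix starting here matches anything in the word list.
            return None
    return result


print(greedy_split("WHENINTHECOURSEOFHUMANEVENTS"))
```

              One caveat on the tiny list: as soon as it also contains a one-letter word such as "I", this shortest-match version takes "I" right after WHEN and then stalls on NTHE..., so a real word list would probably need backtracking even for this string.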

               

               

              But with other strings this wouldn't be true. Say the text was from a document about efforts to cool off lions in the summer, and a new technique has become popular among zoologists:

               

              MANEVENTSAREALLTHERAGE!

               

              The process's first pass would find "Man, even Tsar..." before failing. The correct result should be "Mane vents are all the rage!" (Work with me here...)
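
              Recovering from a false start like that is where the recursion comes in. The following is a rough Python sketch (again not FileMaker code) of a backtracking version, assuming a made-up WORDS set and a hypothetical segment function; punctuation such as the "!" is assumed to have been stripped beforehand. When a choice leads to a dead end, it goes back and tries the next longer word.

```python
from functools import lru_cache

# Hypothetical word list, just big enough to show the false start.
WORDS = frozenset(
    {"MAN", "MANE", "EVEN", "VENTS", "TSAR", "ARE", "ALL", "THE", "RAGE"}
)


def segment(text, words=WORDS):
    """Return the first complete split found as a tuple, or None if there is none."""

    @lru_cache(maxsize=None)  # remember dead ends so they are not re-explored
    def solve(start):
        if start == len(text):
            return ()  # reached the end: everything before this was matched
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
                # Otherwise this candidate led nowhere; try a longer one.
        return None

    return solve(0)


print(segment("MANEVENTSAREALLTHERAGE"))
```

              With this list the function first tries MAN, EVEN, TSAR, gets stuck, and backs up until MANE, VENTS works. It only checks that every piece is a known word; it makes no attempt to judge whether the result makes sense.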

               

              Of course if you did not just want to match words, but had any hope of actually checking the result to see if it made sense, that WOULD be a major job.

               

              Will

               

              p.s. This is how all ancient texts actually looked, by the way. Early papyri of Greek, Roman and Biblical texts were all in caps and not only had virtually nothing in the way of punctuation, they generally had no word separators. The Latin word for "read" is lego, legere, which basically means "pick" or "pick out". (That sense is still there in the second half of the word "se-lect" in English.)

              • 4. Re: Is there a script that allows to split a string of words without spaces in the original words?
                beverly

                Good post, Will! Hieroglyphs, yes! Emojis, full-circle, but I hate them.

                 

                Agreed, with all: get the text otherwise. Spell check can tell you what seems wrong to a point, but the carbons must still take over from the silicons.

                Beverly

                • 5. Re: Is there a script that allows to split a string of words without spaces in the original words?
                  patricia

                  Love your post Will.

                   

                  Best approach would be to catch the problem before it happens. 360Works Scribe works with parsing text from PDFs. I think other groups have plug-ins that work with PDFs.

                  You need to solve the problem before you have text with no spaces.

                  patricia

                  • 6. Re: Is there a script that allows to split a string of words without spaces in the original words?
                    greatgrey

                    First I would look for capital letters and put a space before them. One problem: without other checks, PTA would become P T A. Also, I would start with the long words and work down to the shorter words.

                    However, a good OCR program on such PDFs would be more accurate.
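
                    The capital-letter idea can be sketched with a regular expression. This is only a Python illustration under the assumption that the original capitalization survived the import; the split_on_capitals name is hypothetical. To avoid the P T A problem, it inserts a space only where a lowercase letter meets a capital, or where a run of capitals is followed by a capitalized word.

```python
import re

# Zero-width match points:
#   1) between a lowercase letter and an uppercase letter ("theLocal")
#   2) between a run of capitals and a capitalized word ("PTAMeets")
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_on_capitals(text):
    """Insert a space at each boundary while keeping acronyms like PTA together."""
    return BOUNDARY.sub(" ", text)


print(split_on_capitals("TheLocalPTAMeetsOnMonday"))
```

                    For this input the sketch gives "The Local PTA Meets On Monday". It does nothing for text that is all lowercase or all caps, where a word list is still needed.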

                    • 7. Re: Is there a script that allows to split a string of words without spaces in the original words?
                      CarlSchwarz

                      Are you copy/pasting the text from the PDF?  I found that using a different PDF reader, e.g. using Preview instead of Acrobat Reader on OS X, would produce different results.  Still completely garbled, however.

                      Best bet is to use a proper tool that extracts text from PDF documents.