When importing text from PDFs of articles, it regularly happens that a string of words ends up in a text field without any spaces (e.g. textfieldwithoutanyspaces). Is there a script that can split the string back into the original words?
You won't find a FM script to do that because it would be huge. But I bet you can find a RESTful API that you can call with Insert from URL and have it return the result into a FM script variable. Maybe from Google or Bing.
Once the text is in the field with no spaces, there's nothing FileMaker can use to identify where to put them.
I have seen plug-ins, however, that can extract content from PDFs. You might research those to see if there's a better way to get that text in the first place.
CAVEAT LECTOR! This is not a useful post. You should probably do something else.
I'm not sure that I agree with my friend Taylor that such a script would have to be huge. It would certainly have to be clever and it would also certainly involve recursion. But it might not be all that long.
It's interesting to think about how you'd go about writing such a script (or custom function). Say the string is
My guess -- after about 7 seconds of deep thought -- is that for starters you'd need a reference word list of at least a thousand words. Depending on the texts you're trying to parse, you might need 2000 or 4000. If you're trying to parse Shakespeare, something around (what is it?) 30,000 words would be helpful.
Your recursive function or script would then have to start nibbling at the string looking for word matches. Say you start at the beginning:
At which point you might pull "WHEN" out of the string and proceed to parse
Since the opening of the Declaration of Independence consists entirely of words that would be in your reference list, I think this string would be (relatively) easy to parse. Of course there are potential problems in there: partial strings that are themselves words, for example,
UMA (if you look for names)
But at least with this example string, these ambiguities would never be observed and so would never cause a problem, because the first word-match that the script would find in each case would be the right one, and you'd soon be done.
But with other strings this wouldn't be true. Say the text was from a document about efforts to cool off lions in the summer, and a new technique has become popular among zoologists:
The process's first pass would find "Man, even Tsar..." before failing. The correct result should be "Mane vents are all the rage!" (Work with me here...)
Of course if you did not just want to match words, but had any hope of actually checking the result to see if it made sense, that WOULD be a major job.
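The word-matching part (not the sense-checking part) can be sketched outside FileMaker in a few lines of Python. This is only a rough illustration of the recursive idea described above, not a tested solution: the tiny WORDS list and the function name segmentations are made up for the example, and the two test strings are reconstructed from the quotes in this post. It tries every word that matches at the current position, recurses on the rest, and collects every way of splitting the whole string.

```python
from functools import lru_cache

# Hypothetical reference word list; a real one would need thousands of entries.
WORDS = frozenset({
    "WHEN", "IN", "THE", "COURSE", "OF", "HUMAN", "EVENTS",
    "MAN", "MANE", "EVEN", "EVENT", "TSAR", "VENTS", "ARE", "ALL", "RAGE",
})


def segmentations(text, words=WORDS):
    """Return every way to split text into words from the reference list."""
    text = text.upper()
    max_len = max(len(w) for w in words)

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the string means the split so far is complete.
        if start == len(text):
            return ((),)
        found = []
        # Nibble: try the shortest candidate first, then longer ones.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in words:
                # Keep this word only if the remainder can also be split.
                for rest in solve(end):
                    found.append((piece,) + rest)
        return tuple(found)

    return [" ".join(parts) for parts in solve(0)]


print(segmentations("WHENINTHECOURSEOFHUMANEVENTS"))
print(segmentations("MANEVENTSAREALLTHERAGE"))
```

With this word list the first string has exactly one split, while the lion string has two: "MAN EVENTS ARE ALL THE RAGE" and "MANE VENTS ARE ALL THE RAGE". Dead ends such as "MAN EVEN TSAR" are dropped automatically, but picking the split that actually makes sense is still the hard part.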
p.s. This is how all ancient texts actually looked, by the way. Early papyri of Greek, Roman and Biblical texts were all in caps and not only had virtually nothing in the way of punctuation, they generally had no word separators. The Latin word for "read" is lego, legere, which basically means "pick" or "pick out". (That sense is still there in the second half of the word "se-lect" in English.)
Good post, Will! Hieroglyphs, yes! Emojis, full-circle, but I hate them.
Agreed, with all: get the text otherwise. Spell check can tell you what seems wrong to a point, but the carbons must still take over from the silicons.
Love your post Will.
The best approach would be to catch the problem before it happens. 360Works Scribe can parse text from PDFs. I think other groups have plug-ins that work with PDFs.
You need to solve the problem before you have text with no spaces.
First I would look for capital letters and put a space before them. One problem: without other checks, PTA would become P T A. Also, I would start with the long words and work down to the shorter words.
However, a good OCR program on such PDFs would be more accurate.
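For what it's worth, the two heuristics above (split at capitals, and match long words before short ones) might look something like this in Python. It is a rough sketch under assumptions, not a FileMaker script: the function names, the regular expression and the small word list are all invented for illustration. The capitals rule only splits where a lowercase letter meets a capital, or where a run of capitals is followed by a capitalized word, which is one possible "other check" that keeps PTA in one piece.

```python
import re

# Split where lowercase meets uppercase, or where a run of capitals
# is followed by a capitalized word (so "PTALast" becomes "PTA Last").
CAPITAL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_on_capitals(text):
    """Insert a space at each capital-letter boundary."""
    return CAPITAL_BOUNDARY.sub(" ", text)


def longest_first_split(text, words):
    """Greedy split: at each position take the longest matching word.

    Returns the words found and whatever text could not be matched.
    """
    text = text.upper()
    max_len = max(len(w) for w in words)
    found, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + size] in words:
                found.append(text[pos:pos + size])
                pos += size
                break
        else:
            break  # no word starts here, so stop and report the leftover
    return found, text[pos:]


# Hypothetical word list, just large enough for the example.
WORDS = {"MAN", "MANE", "EVEN", "EVENTS", "TSAR", "VENTS",
         "ARE", "ALL", "THE", "RAGE"}

print(split_on_capitals("JoinedThePTALastYear"))
print(longest_first_split("MANEVENTSAREALLTHERAGE", WORDS))
```

Longest-first happens to work on this string, but it does not backtrack, so a long match that is wrong can still leave leftover text that fits no word.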
Are you copying and pasting the text from the PDF? I found that using a different PDF reader, e.g. Preview instead of Acrobat Reader on OS X, would produce different results. Still completely garbled, however.
Best bet is to use a proper tool that extracts text from PDF documents.