It's tough to get a computer to understand the vagaries of the English language. You'll probably have to proofread in the end no matter what. For example, since FM can't read English, it can't discriminate between "a tom" and "atom". Which one is correct? Is the context chemistry or cats?
The closest you may get is to eliminate "floating" letters except for "a" and "I". You'll still get some false positives, such as "O", and more of them if the language isn't English. This may get you 80% of the way.
You can do this in a couple of ways. A recursive custom function or script is one. A lengthy Substitute() calc may be the simplest.
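To make the "floating letters" idea concrete, here is a rough sketch in Python rather than FileMaker, so it would still need translating into a custom function or script. The names `join_letter_runs` and `find_floating_letters` are made up for illustration, and treating a run of two or more single-letter tokens as one broken word is my own assumption about what "eliminate" should mean here.

```python
# Single letters that are legitimate English words on their own (assumption).
KEEP = {"a", "A", "I"}

def join_letter_runs(text):
    """Glue runs of two or more single-letter tokens into one word."""
    out, run = [], []

    def flush():
        # A lone single letter is left as it was; longer runs are joined.
        if run:
            out.append("".join(run))
        run.clear()

    for tok in text.split(" "):
        if len(tok) == 1 and tok.isalpha():
            run.append(tok)
        else:
            flush()
            out.append(tok)
    flush()
    return " ".join(out)

def find_floating_letters(text):
    """List single-letter tokens other than "a"/"A"/"I" for proofreading."""
    return [t for t in text.split(" ")
            if len(t) == 1 and t.isalpha() and t not in KEEP]

print(join_letter_runs("T h i s is a t e s t"))
print(find_floating_letters("He saw a cat , I think , O x"))
```

Note that the first call turns "a t e s t" into "atest", which is exactly the "a tom" versus "atom" problem: the code cannot tell whether the "a" was a word of its own, so proofreading is still needed.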
Have you confirmed that there is actually a space character between the letters, or might this be some form of text style pasted into your field along with the text?
If it's a text style, this can easily be removed by adding the following auto-enter calculation into any text field where a user might be pasting styled text:
TextFormatRemove ( Self )
And be sure to clear the "do not replace existing text..." check box.
I already use the TextFormatRemove ( Self ).
Another consideration might be em space, flush space, en space, quarter space etc -- I thought, perhaps, the _FilterXML custom function worked on those. If not, then perhaps you or someone else knows how to identify these particular characters?
I have seen references to these other hidden characters in http://www.indiscripts.com/blog/public/data/idcs4-special-characters/en_InDesignCS4-CS5-SpecialChars.pdf but I do not know how to identify them, let alone remove them, if they are what I'm experiencing.
Perhaps it would be more practical to specify the characters that are permitted and use the Filter function?
Admittedly, this is a long list, but it would enable you to not have to concern yourself with the specific character creating the space so long as it is not the normal space character.
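As an illustration of the whitelist approach, here is a small Python sketch of what Filter() would be doing. It is not FileMaker code: `ALLOWED` and `filter_text` are hypothetical names, and the permitted set (ASCII letters, digits, punctuation, the normal space and a newline) is only an assumption about what you would want to keep.

```python
import string

# Characters allowed to survive (assumed list; extend it for accented letters etc.).
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \n")

def filter_text(text, allowed=ALLOWED):
    """Keep only the permitted characters, in their original order."""
    return "".join(ch for ch in text if ch in allowed)

# "\u2003" is an em space; it is dropped because it is not in ALLOWED.
print(filter_text("a\u2003b c"))
```

One thing to watch: a whitelist removes the odd character instead of replacing it, so two words separated only by an unusual space would end up joined. Anything legitimate that you forget to list, such as accented letters, disappears as well.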
You can also highlight such a "mystery" space in the field and use the Code function in FileMaker Advanced's data viewer to determine the character code for that mystery character.
You can use the Code() function to determine exactly what character the "space" is.
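For comparison, the same kind of inspection can be done outside FileMaker. This Python sketch lists every whitespace-like character in a string with its position, code point and Unicode name; `describe_spaces` is a made-up helper name, and I am assuming that the code point reported here matches what Code() returns for a single character.

```python
import unicodedata

def describe_spaces(text):
    """Return (position, code point, Unicode name) for each space-like character."""
    found = []
    for i, ch in enumerate(text):
        # Zs = space separators, Cf = invisible format characters.
        if ch.isspace() or unicodedata.category(ch) in ("Zs", "Cf"):
            found.append((i, ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return found

# A normal space (32) followed later by an em space (8195).
for entry in describe_spaces("a b\u2003c"):
    print(entry)
```

If every entry comes back as 32, the gaps are ordinary spaces and not one of the special spacing characters.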
This may seem silly, but have you tried pasting the text into something "dumb" (like Notepad) to see what happens? I have a suspicion that the source text is encoded UTF-16, and somewhere the encoding is getting lost in the transfer. That doesn't really solve your problem, but if it looks correct pasted somewhere outside of FileMaker, that may give you a starting point. I have been known to paste things into Notepad and then copy them out again just to ensure the correct encoding.
David - nice function! Wish I had it last week!! =)
So, I applied the Code() function to a sampling of the "spaces" I've been experiencing - and I get 32 as a result.
... does this mean I have a bunch of "real" spaces?
Or is it still possible there's some underlying encoding which I need to tackle? (I'd rather tackle underlying encoding or hidden characters, as that would be preferable to visually inspecting 1000's of records and fields.)
btw - domiller - not a silly suggestion - I used this method last week as part of figuring out my need for the xml filter custom function.
You can look up the code function in Help to see codes for common nonprinting characters. And 32 is the code for a space.
When I run into this problem and it is more than a few spaces, I use OmniPage to read the document. It usually takes care of the extra spaces, but then you have to watch out for extra line breaks and sometimes paragraphs being run together. Even so, the result is usually much better than a copy and paste from the PDF.
You are lucky if you are able to copy text from a PDF document. Often the text is mixed up from different sections of the document. How well it copies out often depends on the software that was used to create the document.
There is no easy free way to copy/paste from a PDF document. There are paid tools out there that make it easier.
I wonder: is there a way to "detect" a pattern using wildcards?
For example: is there a way to find an occurrence of "x x x x x" whereby the x's stand for any non-space character and the "space" is code 32? I don't need a pattern count, just an awareness that it exists.
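Outside FileMaker this kind of wildcard test is a regular expression. As far as I know the calculation engine has no built-in regex, so treat the following Python as a sketch of the pattern only, not something you can paste into a calc. The name `has_spaced_letters` and the threshold of at least three single characters in a row are my assumptions.

```python
import re

# Three or more single non-space characters, each separated by one code-32 space,
# and not attached to a longer word on either side.
SPACED = re.compile(r"(?<!\S)(?:\S ){2,}\S(?!\S)")

def has_spaced_letters(text):
    """True if the text contains a run like "x x x"; no counting, just detection."""
    return SPACED.search(text) is not None

print(has_spaced_letters("T h i s is odd"))
print(has_spaced_letters("this is a test"))
```

A lone "a" or "I" between normal words does not trigger it, because the pattern needs at least three single characters in a row.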
CarlSchwarz - thank you for the perspective.