• Extracting Document into a text field, while keeping the format of the document extracted
In a project that I did the primary object was a line. Lines belonged to paragraphs and also to pages, sections, chapters, etc. We used sub-summary reports to allow us to display the lines as paragraphs, pages or sections.
We had our data provided in an SGML format so we had all the data we needed. We didn't have to mess around with Word or PDF.
• When conducting a search and a phrase is found, copying the phrase and highlighting its location in the document.
We would do a search and then display all the found lines with their page and chapter references. The user then selects a line and is taken to the page, displayed in a sub-summary report.
The difficulty we experienced was searching for a phrase that ran across a line break. If the search didn't find any results, we had to split the search phrase and search for the two parts. That wasn't always successful because it was easy to get false positives. We ended up using calculation fields to generate whole paragraphs, which could then be searched. That worked well, even though it was a bit slower.
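As a rough illustration of the "generate whole paragraphs, then search" idea, here is a minimal Python sketch. It is not the FileMaker calculation we used; the `Line` record, its field names, and the `search` helper are all hypothetical, and it assumes the lines arrive in document order.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Line:
    paragraph: int  # hypothetical paragraph id
    page: int       # hypothetical page number
    text: str


def build_paragraphs(lines):
    """Join each paragraph's lines into one searchable string.

    Assumes `lines` is in document order, so that lines of the same
    paragraph are adjacent (groupby only groups consecutive items).
    """
    paragraphs = {}
    for para_id, group in groupby(lines, key=lambda ln: ln.paragraph):
        group = list(group)
        joined = " ".join(" ".join(ln.text for ln in group).split())
        paragraphs[para_id] = (joined, group[0].page)
    return paragraphs


def search(lines, phrase):
    """Return (paragraph id, first page) for paragraphs containing the phrase."""
    needle = " ".join(phrase.lower().split())
    if not needle:
        return []
    return [
        (para_id, page)
        for para_id, (text, page) in build_paragraphs(lines).items()
        if needle in text.lower()
    ]
```

Because the phrase is matched against the joined paragraph, a phrase that wraps from one line to the next is found in a single pass, without splitting it into two searches. A phrase that spans two paragraphs is still not found.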
I created a tool for a task and I certainly wouldn't rate it as a full-on 'digital library', however it has elements that are similar to your needs.
In my case, I have a bunch of old text docs. As in hardcopy. These have been scanned as image pdfs, and then opened in Acrobat and converted to Word docs, via Acrobat's OCR engine. Sometimes the result is surprisingly clean text and sometimes it's very ugly indeed (depending on the source).
In the nebulous funded future, resources would be allocated to cleaning up that text, however that day may never come.
Thus, the tool's job in the interim is to store the data in as usable a form as possible.
It's set up in FMPA v12.02. I have one container field - interactive - which could be used for either the scanned image pdf or its subsequent text pdf, but my preference is the text pdf. They look identical, but the post-OCR'd version allows you to do good text searches.
Alongside, I have another field which is just a text box. I select all the text from the Word doc and simply paste it into that text field. This also is very easily searchable via Finds, QuickFinds and ReplaceFinds.
If resources allowed, the text would be cleaned, with the original image available for easy reference. FTR, the docs range from 10 to 30 pages each.
Getting to your query re highlighting the word or phrase in the relevant record, well that still seems to be a bit laborious...
If you manually invoke the Find/Replace command (Edit menu), it'll do a great job of highlighting each instance in that text field, for each record.
Unfortunately, I haven't been able to use the same command in a script and have it work as well. As a workaround, I've also tried adding a Find to the script to first isolate the records with the word or phrase, and then using the other command to actually highlight them in the field, but so far it isn't working.
FTR, you can also do record-by-record searches within the container field for specific, highlighted text, if its content is an OCR'd PDF. Select the container field, press Cmd-F or Ctrl-F, enter your search term and then click through the occurrences.
In summary, if your content is also in a text field, the manual Find/Replace command is efficient; however, if you use the Find/Replace script step, you don't get the same results.
Tip (ex Martin Crossman): With that container field, switch off the thumbnails, so that you'll be able to click through more than 50 records. If thumbnail generation is left on, you'll end up with an error message if you try to click through too many records.
Mardi & Malcom, thanks for the information.
What I did was convert the PDF and Word docs into text documents and import them into the Document field (i.e. a text field). Then I use the 24U SimpleHighlightText plug-in, which can highlight the text. I have sent them a question, which has yet to get a response: whether it is possible to get the position of the highlighted text in the searched document, so I can use a "browse next" to jump to the position grabbed by the plug-in.
I have tried Cmd-Shift-F and yes, it did work out fine, as you rightly said. I hope FileMaker will be able to build a script that uses the command...
Anyway, thank you so much for the enlightenment.
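For the "position of the highlighted text" question above, the underlying logic is just collecting the offset of every match and stepping through them. The Python sketch below shows that idea only; it says nothing about what the 24U plug-in exposes, and `find_positions` and `next_hit` are hypothetical helper names.

```python
import re


def find_positions(text, phrase):
    """Return the 0-based start offset of every non-overlapping,
    case-insensitive occurrence of `phrase` in `text`."""
    if not phrase:
        return []
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


def next_hit(positions, current):
    """Return the first offset after `current`, wrapping around to the
    first hit; return None when there are no hits at all."""
    for pos in positions:
        if pos > current:
            return pos
    return positions[0] if positions else None
```

A "browse next" button would keep the current offset, call `next_hit`, and select from that offset for the length of the phrase. Note that these offsets are 0-based, whereas FileMaker's own Position function counts from 1, so they would need adjusting if carried over.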
You don't need a plug-in for hit-highlighting. A custom function will do, and it can even be configured to match umlauts (such as ae = ä) and diacriticals.
The best solution is provided by 360Works: http://www.360works.com/filemaker-pdf-plugin/
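To show what such a function has to do, here is a minimal sketch in Python rather than as a FileMaker custom function. The equivalence table, the `[[` / `]]` markers, and the function names are assumptions chosen for illustration; a real solution would apply text styling instead of inserting markers.

```python
import re

# Assumed equivalences: a digraph in the query also matches the single
# character, and vice versa. Extend the table for other diacriticals.
EQUIVALENTS = {"ae": "ä", "oe": "ö", "ue": "ü", "ss": "ß"}
REVERSE = {char: digraph for digraph, char in EQUIVALENTS.items()}


def build_pattern(query):
    """Compile a case-insensitive regex that treats e.g. 'ae' and 'ä' as equal."""
    parts = []
    i = 0
    while i < len(query):
        pair = query[i:i + 2].lower()
        char = query[i].lower()
        if pair in EQUIVALENTS:
            parts.append(f"(?:{pair}|{EQUIVALENTS[pair]})")
            i += 2
        elif char in REVERSE:
            parts.append(f"(?:{REVERSE[char]}|{char})")
            i += 1
        else:
            parts.append(re.escape(query[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def highlight(text, query, open_mark="[[", close_mark="]]"):
    """Wrap every match of `query` in `text` with the given markers."""
    if not query:
        return text
    pattern = build_pattern(query)
    return pattern.sub(lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)
```

For example, `highlight("Das Mädchen und das Maedchen", "maedchen")` marks both spellings. The original spelling in the text is left untouched; only the markers are added around it.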
This plug-in allows you to extract the data from a pdf and store for indexing purposes.
In addition, you can use the built-in container functions in FM12 to store the original PDF, or use SuperContainer (also from 360Works) if you require a more sophisticated external storage solution.
Thanks for the information