10 Replies Latest reply on Feb 23, 2015 11:20 AM by disabled_Luna.media

    Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?

      Hi there,

       

      I am interested in being able to search in PDFs that are imported into my FileMaker solution. Has anybody ever tried to integrate or access PDF catalogs?

       

      I found the PDF Manipulator Plugin for FileMaker Pro, which promises to extract the text of a PDF for each individual page, so it should be possible to build a search that points to the PDF file and the page for each hit.

       

      Or does anybody have other interesting ideas for a solution?

       

      Luna

        • 1. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
          wimdecorte

          The text extraction is a function of that plugin and not natively available in FM itself.   So  you'll have to go with the plugin or some other non-FM way to extract the data and then get that data into FM.

          • 2. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?

            Yes, thanks - but does anybody have hints for robust but not too expensive solutions? The plugins I have found in the meantime, including MBS, are getting quite expensive.

             

            An option could also be to use Automator somehow (dropping a document on an Automator app and looking for a temporary text file as output). But is it wise to build such solutions?
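
The "drop a document, get a text file back" idea can also be sketched without Automator. The following is only a minimal sketch, assuming the `pdftotext` command-line tool (from Xpdf/Poppler) is installed and on the PATH; the function names `split_pages` and `extract_pages` are made up for illustration. It relies on `pdftotext` ending each page with a form feed character, so the output can be split into one text per page.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into pages (each page ends with a form feed)."""
    pages = text.split("\f")
    # The form feed after the last page leaves an empty trailing chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


def extract_pages(pdf_path):
    """Run pdftotext on one file and return a list with one string per page."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" writes to stdout
        capture_output=True,
        check=True,
    )
    return split_pages(result.stdout.decode("utf-8"))
```

The per-page strings could then be written to a temporary text file and imported into FileMaker, one record per page.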

             

            Luna

             

             

             

             


            • 3. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
              monkeybreadsoftware

              MBS Plugin can use PDFKit on Mac and do the text extraction.

               

              Or cross platform for Mac+Win with DynaPDF, but that costs more.

              • 4. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
                mbraendle

                How many PDFs? And what's the total size of them?

                 

                Plugins may not be the right way to go.  In our case, PDFManipulator proved much too slow to do mass extraction of text.

                 

                There are many aspects to extracting and searching full text data:

                - time to extract the data

                - time to index the data

                - information retrieval - FileMaker does not have the right tools to do full text searching (stop word removal, word stemming, word-distribution weighted index, proximity operators, ...)

                - display of results (hit highlighting, relevance ranking, snippet generation, ...)

                 

                See http://www.filemaker-konferenz.com/2010/downloads/27.%20Mai%20Donnerstag/Riesige%20Datenbanken/FMK%202010%20FileMaker%20…

                 

                Information retrieval software such as Apache Lucene or Xapian may be better suited to your task.
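
To make the indexing and retrieval aspects listed above a bit more concrete, here is a toy inverted index with stop word removal that maps each word to the (document, page) pairs it occurs on. This is only a sketch under stated assumptions: the stop word list, the function names and the page dictionary layout are invented for illustration, and it does no stemming, weighting or ranking, which is exactly what engines like Lucene or Xapian add.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


def build_index(pages):
    """pages: dict mapping (document, page_number) -> extracted page text."""
    index = defaultdict(set)
    for key, text in pages.items():
        for word in tokenize(text):
            index[word].add(key)
    return index


def search(index, query):
    """Return the pages that contain every non-stop word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

Each hit is a (document, page) pair, so a result list can point directly to the PDF and the page.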

                • 5. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?

                  Thanks for the many aspects you brought up. I will also read your paper!

                   

                  I have to handle some tens of thousands of PDF pages, each page about 150 kB.

                   

                  Extraction and indexing can be done on a schedule, or the text extraction can be done while importing new PDF pages (as I only have to import a few hundred every few weeks). Extraction does not seem to be too slow with e.g. the MBS plugin.

                   

                  I have also already thought about searching the extracted text (as eSQL and LIKE cost too much time). As I do not want to switch completely to other programs but keep using FileMaker, I was thinking of using a parallel MySQL database, as I can use its fulltext index.

                   

                  But more advanced features like "stop word removal, word stemming, word-distribution weighted index" etc., and also highlighting (not possible with MBS as far as I know), might not be possible (or I will have to look for PHP/MySQL solutions).
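
Highlighting and snippet generation do not necessarily have to come from the plugin or the database; they can be done on the extracted page text after the search has returned a hit. A minimal sketch, assuming plain extracted text is available (the function name `snippet` and the bracket markers are arbitrary choices for illustration):

```python
import re


def snippet(text, term, width=30, mark=("[", "]")):
    """Return an excerpt around the first hit of term, with the hit marked.

    Returns None if the term does not occur in the text.
    """
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return (
        text[start:match.start()]
        + mark[0] + match.group(0) + mark[1]
        + text[match.end():end]
    )
```

The markers could be replaced by whatever the display layer needs, for example styled text in a FileMaker field or HTML tags in a web viewer.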

                   

                   

                  Is a fulltext index something for a "feature request"?

                   

                   

                   

                   

                   


                  • 6. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
                    wimdecorte

                    Luna.media wrote:

                     

                     

                     

                    Is a fulltext index something for a "feature request"?

                     

                     

                     

                    I don't think so: Defining field indexing options

                     

                    You can have every word in the field indexed to speed up searches.  I would not use eSQL and LIKE just to find the proper records.  The regular FM find will be much faster here.

                     

                    The other features that you are after are specific to a "document management system", which is just one type of database driven system.  So don't expect FM to build those DMS features for us.  There are applications on the market that do this already (Martin mentions a few, I like Lucene).  And those can be integrated with FM.

                    • 7. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
                      jrenfrew

                      Remember that a PDF, while it may contain the glyphs that you and I interpret as a text string, may in fact not contain that string as text.

                       

                      There are many ways to create a PDF file that renders the information you 'interpret' by seeing it, but PDF was never designed as an editing format; it is merely a container for the objects that make up the visual representation.

                      That's one reason that text extraction strategies are notoriously complex, and are not guaranteed to produce the results you might expect.

                      • 9. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?
                        mbraendle

                        pdftotext testExtract.pdf --> Violets are blue  (including two carriage returns and a form feed character). Hex dump:

                         

                        0000: 56 00 69 00 6F 00 6C 00 65 00 74 00 73 00 20 00  V.i.o.l.e.t.s. .
                        0010: 61 00 72 00 65 00 20 00 62 00 6C 00 75 00 65 00  a.r.e. .b.l.u.e.
                        0020: 0D 00 0D 00 0C 00                                ......
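
The alternating `00` bytes in the dump indicate 16-bit little-endian text (UTF-16LE). As a small illustration, the bytes can be decoded and the trailing control characters stripped like this (the variable names are arbitrary):

```python
# The bytes from the hex dump above, copied verbatim.
raw = bytes.fromhex(
    "56 00 69 00 6F 00 6C 00 65 00 74 00 73 00 20 00 "
    "61 00 72 00 65 00 20 00 62 00 6C 00 75 00 65 00 "
    "0D 00 0D 00 0C 00"
)

# Each character occupies two bytes, low byte first: UTF-16 little-endian.
text = raw.decode("utf-16-le")

# Normalise carriage returns and strip the trailing line breaks and form feed.
clean = text.replace("\r", "\n").rstrip("\n\f")

print(repr(text))   # 'Violets are blue\r\r\x0c'
print(repr(clean))  # 'Violets are blue'
```

Any import into FileMaker would need to take this encoding and the trailing form feed into account.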

                         

                        Old documents, although OCRed, may also pose a challenge, because fonts for these are often not available.

                         

                        Example from an excerpt of a document of 1862:

                         

                        czb_1862_1_1021.png

                         

                        And this is what OCR and text extraction made of it. And now imagine what your text index will look like if many very small text fragments (even single characters) have to be stored. Not to speak of the OCR errors (compare line 5 in the image above and in the text below). One will not have a chance finding this with a FileMaker index. One will have a slight chance using information retrieval engines with probabilistic methods.

                         

                         

                        Schweflige S ä u r e , Reaction ders. auf

                        Kupfer (H. Reinsen), 72.

                        — — Verhalten der wässrigen — bei

                        200 (Wöhler), 992.

                        S c h w e i n e g al le, Bestandtbeile der —

                        .(A. Strecker), 782.

                        S e b a c y l s ä u r e , ein Oxydationsproduct

                        des Walraths IA. E . Arppe), 1008.

                        Seide, Färbung ders. mittels Goldlö-

                        sung (Lapouraille), 32.

                        Selen, Verhalten gegen Metalllösungen

                        (Th. Parkman), 811.

                        Se n f ö l , Darstellung (Dragendorff), 543.

                        Serpentin (F. A. Genth), 746.

                        Siedepunkt wässriger Säuren als Kri-

                        terium für die chemische Zusammen-

                        setzung derselben (H. E. Roscoe), 883.

                        Silber, Darstellungvon reinem — aus

                        kupferhaltigem (Berlandt), 174.

                        — elektrische Leitungsfähigkeit (Mat-

                        thiessen u. v. Bose), 417.

                        — Reduction durch Elektrolyse (Bec-

                        querel), 773.

                        — Zustände des auf nassem Wege redu-

                        cirten — (H. Vogel), 513.

                        Silberhaltige R ü c k s t ä n d e , Aufar-

                        beitung von — (Hehn), 416.

                        S i l b e r o x y d , arsenigsaures (Ch. L .

                        Bloxam), 911.

                        Silicium, specif. Wärme (V. Regnault),

                        443.

                        Skolopsit (Rammeisberg), 538.

                        Sodafabrication in England (W.

                        Gossage), 40.

                        Solaniciu (C. Zwenger u. A. Kind), 781.

                        Solanidin (C. Zwenger u. A.Kind), 781.

                        Solanin, Einwirkung concentrirter

                        Salzsäure auf — (C. Zwenger und A.

                        Kind), 780.

                        S o l a r ö l , Darstellung (Düllo), 252.

                        Sombrerit (T. L . Phipson), 574.

                        Soolen der Saline Sülz (A. Virck), 404.

                        Soolquelle v. Dürkheim, 203.

                        Spartein (E. F. Mills), 700.

                        Specifisches Gewicht fester Kör-

                        per, Bestimmung (F. G. Schaffgotsch),

                        549.

                        Speci fi sc he W ä r m e d e r E l e m e n t e

                        (V. KegnaultT 442.

                        u. Atomvolumen der Elemente (H.

                        Weikart), 113.

                        Spectralanalyse (W.A.Miller), 321;

                        (A. Mitscherlich), 604.

                        Speichel, Ammoniaknitrit im •— (C. F.

                        Schönbein), 639.

                        S t ä r k e , quantitative Bestimmung (Dra-

                        gendorff), 523.

                        — Salpetersäureverbindnngen (A. Be-

                        champ), 865.

                        Stahl, Constitution dess. (Hezner), 352.

                        Stahlfabrication, Anwendung von

                        Titan bei der — (R. Mushet), 954.

                        Staurolith (A. Mitscherlich), 593.

                        — künstliche Bildung von — (H. Deville),

                        658.

                        Staurotid (F. A. Genth), 743.

                        S t e i n k o h l e n l e u c h t g a s , Ent-

                        zündungstemperatur des — (E. Frank-

                        land), 1007.

                        Stein kohlen theer, neue organische

                        Basen darin (G. Thenius), 53.

                        St e i n ö l d a m p f zur Verbesserung des

                        Leuchtgases, 45.

                        Stickstoff als Pflanzennahrungsmittel

                        (F. Stohmann), 452 fg.

                        — als Vertreter des Wasserstoffs in or-

                        ganischen Verbindungen (P. Griess),

                        465.

                        — Bestandtheil der Darmgase (E. Rüge),

                        347.

                        — Bestimmung des - im Roheisen (C.

                        TJllgrenh 950.

                        •—• Verwandtschaft des — zu den Me-

                        tallen (H. Geuther u. F. Briegleb),

                        793.

                        S tick s to ff magne s ium (H. Geuther

                        u. F. Briegleb), 791.

                        Stinkthier, das Oel des •— (Swarts),

                        787.

                        Strontian, Löslichkeit des schwefel-

                        sauren — (A. Virck), 402.

                        — u. Baryt, Nachweisung des — in

                        Kalkgesteinen (Engelbacb), 830.

                        Strontium im Meteorstein vom Cap-

                        land (Erigelbach), 877.

                        — Polysulphurete des — (E. Schöne),

                        613.

                        S t r y c h n i n , Auffindung bei Vergif-

                        tungen (J. Reese), 557.

                        — Einwirkung von Bromäthylen auf —

                        (E. Ménetriès), 145,

                        — Nachweisung dess. (J. Erdmann), 236.

                        — Scheidung dess. von Colloidsubstanzen

                        durch Dialyse (Th. Graham), 941.

                        S u l p h ä t h y l e n b r o m ü r (A. Huse-

                        mann), 501.

                        S u l p h ä t h y l e n c h l o r ü r (A. Huse-

                        mann), 502.

                        S u l p h ä t h y l e n i o d i i r (A. Husemann),

                        502.

                        S u l p h ä t h y l e n o x y d (A. Husemann),

                        500, 503.

                        Snlphaldehyd (A. Husemann), 504.

                        Sulp ha nil i dsäu r e (R. Schmitt), 213.

                        Sulphhydrate, die dem Glycerin ent-

                        sprechenden — (L. Carius), 993.

                        Sulphide der Alkoholrodicale (L. Cä-

                        rius), 588.

                        S u l p h o c h l o r b e n z o e s ä u r e (R. Otto),

                        953.
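
Part of the fragmentation in the text above comes from letter-spaced headwords ("S ä u r e"). A rough heuristic for that one problem is to join runs of three or more single letters separated by single spaces before indexing. This is only a sketch (the function name is made up), and it does nothing for irregular splits such as "S tick s to ff magne s ium" or for genuine OCR misreadings:

```python
import re

# A run of at least three single letters separated by single spaces,
# not attached to a longer word on either side. [^\W\d_] matches one letter.
_LETTER_SPACED = re.compile(r"(?<!\w)(?:[^\W\d_] ){2,}[^\W\d_](?!\w)")


def collapse_letterspacing(line):
    """Join letter-spaced words, e.g. 'S ä u r e' -> 'Säure'."""
    return _LETTER_SPACED.sub(lambda m: m.group(0).replace(" ", ""), line)
```

Applied to the first line of the excerpt, this turns "Schweflige S ä u r e , Reaction ders. auf" into "Schweflige Säure , Reaction ders. auf", while lines with initials such as "Serpentin (F. A. Genth), 746." are left unchanged.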

                        • 10. Re: Has anybody ever integrated PDF catalogs or other forms of full-text-integration of PDFs?

                          I am no expert in PDF or pdftotext... But since you in particular are dealing with such amounts of scanned magazines, maybe it is possible to have the old font redesigned and then used in the OCR process. This might be a way to improve the accuracy...

                           

                           

                          On 22.02.15 at 23:50, mbraendle wrote:

                          Old documents, although OCRed, may also pose a challenge, because fonts for these are often not available.