13 Replies Latest reply on Aug 2, 2013 2:06 PM by francishunger

    Working with text

    francishunger

      Hi,

       

      Could I get some hints on how to proceed with this? I have a flat text file.

       

      For each set (see below), beginning with the keyword "SET", I need to extract the lines beginning with the numbers 100, 551 and 903 into one data record, so that I end up with something like this:

       

      ID    | 100                | 551                                       | 903
      Set 1 | Rooyen, Lindy$cvan | !041160983!Pretoria$4ortg                 | $eDE-101
            |                    | !041226488!Stellenbosch$4ortw$vStudienort | $rDE-101
            |                    | !040323994!Kopenhagen$4ortw               |
            |                    | !040231186!Hamburg$4ortw                  |
      Set 2 |                    |                                           | $eDE-101
            |                    |                                           | $rDE-101

       

       

      Original Data:

       

       

      SET: S3 [2781] TTL: 6 PPN: 1023980339 SEITE1 .

       

       

      Eingabe: 1240:05-07-12 Änderung: 1240:05-07-12 19:58:49 Status: 1240:05-07-12

       

       

      005 Tp3

      006 http://d-nb.info/gnd/1023980339

      011 f

      012 v

      035 gnd/1023980339

      043 XC-ZA;XA-DK;XA-DE

      100 Rooyen, Lindy$cvan

      400 Van Rooyen, Lindy$vSicherheitsvw.

      550 !948570091!Literaturwissenschaftlerin$4berc

      550 !041300769!Juristin$4beru

      550 L.L.M., Stellenbosch 1998$4akad

      550 Master in skandinavischer und engl. Literatur, Universität Hamburg 2011$4akad

      551 !041160983!Pretoria$4ortg

      551 !041226488!Stellenbosch$4ortw$vStudienort

      551 !040323994!Kopenhagen$4ortw

      551 !040231186!Hamburg$4ortw

      675 LCAuth (05.07.2012)

      678 $bSüdafrikanische Literaturwissenschaftlerin und Juristin aus zweisprachiger Familie, lebte von 1998-2002 in Dänemark und seit 2002 in Deutschland

      903 $eDE-101

      903 $rDE-101

       

       

       

       

      SET: S3 [2781] TTL: 7 PPN: 1023723506 SEITE1 .

       

       

      Eingabe: 1240:26-06-12 Änderung: 1240:26-06-12 16:57:54 Status: 1240:26-06-12 deutsche Autorin

       

       

      005 Tp3

      006 http://d-nb.info/gnd/1023723506

      008 piz

      011 f

      012 v

      035 gnd/1023723506

      043 XA-DE;XA-GB;XC-SD

      100 Kramer, Julia

      375 f

      510 !1005156778!Act for Transformation Gemeinnützige eG$4affi

      550 M.A. conflict resolution der Universität Bradford, UK$4akad

      550 !043281494!Entwicklungshelferin$4berc

      550 Friedensfachkraft$4beru

      550 !947661522!Gärtnerin$4beru

      678 $bProjektkoordinatorin in der Friedensbewegung; arbeitete 2008-2008 als Friedensfachkraft des DED (Deutscher Entwicklungsdienst) im Sudan

      903 $eDE-101

      903 $rDE-101

       

      Thanks in advance.

        • 1. Re: Working with text
          mbraendle

          This is MARC format interrupted with screens.

           

          If it were MARC format only, without the screen information

            SET: S3 [2781] TTL: 6         PPN: 1023980339                           SEITE1 .

           

           

          Eingabe: 1240:05-07-12 Änderung: 1240:05-07-12 19:58:49 Status: 1240:05-07-12

           

          you could use one of the MARC2XML conversion tools and then do an XML import using an XSLT.

          Otherwise, it will become very difficult to extract the items (also because there are subfields in MARC).

           

          How did you get this information? Can you obtain the pure MARC data only?

          • 2. Re: Working with text
            mikebeargie

            You need to parse this text out with nested looping.

             

            First you need to loop through each "SET" -

             

            set variable $text = your text file

            set variable $setcount = patterncount ( $text ; "SET: " )

            LOOP

              set variable $i = $i + 1

               new record

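               set variable $start = Position ( $text ; "SET: " ; 1 ; $i )

               set variable $end = Position ( $text ; "SET: " ; 1 ; $i + 1 )

               set variable $set = Middle ( $text ; $start ; If ( $end = 0 ; Length ( $text ) ; $end - $start ) )

                    PARSING SUB-LOOPS HERE (see below)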

               Exit Loop If ( $i = $setcount )

            END LOOP

             

            Then inside of each $set, you need to parse things out in sub-loops for each number you are extracting -

             

            set variable $lines = patterncount ($set ; "100 ")

            LOOP

               set variable $ii = $ii + 1

               set field table::100 - extract the line

               Exit Loop If - $ii = $lines

            END LOOP

            set variable $ii = ""

             

            set variable $lines = patterncount ($set ; "551 ")

            LOOP

              set variable $ii = $ii + 1

              set field table::551 - extract the line

              Exit Loop If - $ii = $lines

            END LOOP

            set variable $ii = ""

             

            REPEAT SUBLOOP PATTERN FOR EACH COLUMN

             

            This is pretty complicated, as you might guess, but you can build a fairly robust script to process the same files over and over again into the records you need for each SET value.
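For comparison, here is a minimal Python sketch of the same nested idea (outer pass per SET, inner pass per line). It is not FileMaker code; the function names `split_sets` and `extract` are made up for illustration, and it assumes every record starts on a line beginning with "SET:" and every field line starts with its tag followed by a space.

```python
WANTED = ("100", "551", "903")  # tags the original poster asked for

def split_sets(text):
    """Group the non-empty lines of the dump by record; a 'SET:' line opens a record."""
    records, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("SET:"):
            current = []
            records.append(current)
        elif current is not None and stripped:
            current.append(stripped)
    return records

def extract(lines, wanted=WANTED):
    """Collect the values of the wanted tags; a tag may occur several times."""
    record = {tag: [] for tag in wanted}
    for line in lines:
        tag, _, value = line.partition(" ")  # compare the leading tag only
        if tag in record:
            record[tag].append(value)
    return record

sample = """\
SET: S3 [2781] TTL: 6 PPN: 1023980339 SEITE1 .
Eingabe: 1240:05-07-12
100 Rooyen, Lindy$cvan
550 !041300769!Juristin$4beru
551 !041160983!Pretoria$4ortg
551 !040231186!Hamburg$4ortw
903 $eDE-101

SET: S3 [2781] TTL: 7 PPN: 1023723506 SEITE1 .
100 Kramer, Julia
903 $eDE-101
903 $rDE-101
"""
records = [extract(lines) for lines in split_sets(sample)]
```

Because only the first token of each line is compared, a tag number that happens to appear inside a value is ignored.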

            • 3. Re: Working with text
              mikebeargie

              Also, this is thrown off if ANY of the text strings appear in the data you are extracting. PATTERNCOUNT does not tell the difference between a tag and a value.

               

              So if you're trying to extract the 551 lines from a line that looks like this:

              551 !040231551!Hamburg$4ortw

               

              it would be thrown off.

              • 4. Re: Working with text
                mikebeargie

                This sounds a lot easier than what I suggested. I have no idea what MARC data is, but setting up a recurring XML import would at least be a lot more reliable in terms of data integrity instead of extracting with patterncount and nested looping through records.

                • 5. Re: Working with text
                  francishunger

                  Thanks a lot for the answer. I already asked for structured data, but it is not available. So I have to muddle through ...

                  • 6. Re: Working with text
                    mbraendle

                    The records look like authority records.

                     

                    If you follow the link in each record, you will land on the record page of the German National Library. On the right you will find the RDF/XML representation of this record.

                     

                    You can just add /about/rdf to the URL, and you get the RDF/XML file directly.
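A small Python sketch of that URL construction, assuming the link is taken from the "006" line of each record. The helper names `link_from_006` and `rdf_url` are hypothetical, and nothing is downloaded here.

```python
def link_from_006(line):
    """Return the URL from a '006 http://...' line, or None for any other line."""
    tag, _, value = line.strip().partition(" ")
    value = value.strip()
    if tag == "006" and value.startswith("http"):
        return value
    return None

def rdf_url(record_url):
    """Append /about/rdf to a record link, as described in the post above."""
    return record_url.rstrip("/") + "/about/rdf"

example = rdf_url(link_from_006("006 http://d-nb.info/gnd/1023980339"))
```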

                    • 7. Re: Working with text
                      francishunger

                      I would need to look for 551 at the beginning of a line only
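Outside FileMaker, matching a tag only at the beginning of a line can be sketched with an anchored regular expression. This is a Python illustration, not a FileMaker function; `lines_with_tag` is a made-up name, and it assumes the tag is followed by at least one space or tab.

```python
import re

def lines_with_tag(text, tag):
    """Return the values of all lines that start with the given tag."""
    # re.MULTILINE makes ^ and $ match at every line start and end,
    # so a tag number inside a value can never match.
    pattern = re.compile(r"^[ \t]*" + re.escape(tag) + r"[ \t]+(.*)$", re.MULTILINE)
    return [m.group(1).rstrip() for m in pattern.finditer(text)]

text = (
    "550 !041300551!Juristin$4beru\n"
    "551 !040231551!Hamburg$4ortw\n"
    "   551 !041160983!Pretoria$4ortg\n"
)
hits = lines_with_tag(text, "551")
```

A plain substring count of "551" over the same text would also count the occurrences inside the reference numbers, which is the problem described earlier in the thread.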

                      • 8. Re: Working with text
                        mbraendle

                        You can also extract the number from the URL, and then use

                         

                        https://portal.dnb.de/opac.htm?method=requestMarcXml&idn={number}

                         

                        e.g.

                         

                        https://portal.dnb.de/opac.htm?method=requestMarcXml&idn=1023723506

                         

                        This returns the MARCXML file.

                         

                        So with a clever script and a custom function to parse out the URL, you could feed a WebViewer and download the XML.
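As a rough Python sketch of the "parse out the URL" step: pull the number from the end of the d-nb.info link and put it into the request URL quoted above. The name `marcxml_url` is hypothetical, the pattern assumes identifiers consist of digits, hyphens and possibly an X, and whether the portal still answers this request is not checked here.

```python
import re

# URL pattern quoted in the post above; {number} is the record identifier.
MARCXML_TEMPLATE = "https://portal.dnb.de/opac.htm?method=requestMarcXml&idn={number}"

def marcxml_url(gnd_url):
    """Build the MARCXML request URL from a link like http://d-nb.info/gnd/<number>."""
    match = re.search(r"/gnd/([0-9Xx-]+)/?$", gnd_url.strip())
    if match is None:
        return None  # not a link of the expected shape
    return MARCXML_TEMPLATE.format(number=match.group(1))

example = marcxml_url("http://d-nb.info/gnd/1023723506")
```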

                        • 9. Re: Working with text
                          francishunger

                          You know your way around. I'll look into this.

                          • 10. Re: Working with text
                            mbraendle

                            Well, being partly a librarian, I should know these things.

                            • 11. Re: Working with text
                              beverly

                              LOL. There is no 'partly'. You're a Librarian who does lots of interesting things otherwise! It's all good.

                               

                              MB, Have you tried parsing as text (not XML) the data the OP has now? And is it more bother than it's worth? Do you use XSLT with the XML data? Would that be a deterrent for OP? and/or do you have the XSLT that OP could purchase? (And save $ by saving the time needed to start from nothing).

                               

                              -- sent from my iPhone4 --

                              Beverly Voth

                              --

                              • 12. Re: Working with text
                                mbraendle

                                Many questions, Beverly. Answered in your order:

                                 

                                • No. Only OP knows the exact format. What we see here is what he pasted in (as far as I know MARC, there might be also control characters in it).
                                • Depends. Just catching a specific line may be simple. If done by simple FileMaker script / text function parsing, I would write a sort of "line-eater": always extract one line, scrape off the MARC header and determine whether it is a MARC header (if it's a SET I would create a new record; if it's neither SET nor a MARC header I would trash the line), then implement a parsing rule per MARC header. The problem is the subfields that start with ${someletter}. There may be none, there may be many, and {someletter} may vary as well (in the MARC format it has a meaning). What is in between the exclamation marks is a reference to another authority record, e.g. !041160983! above points to https://portal.dnb.de/opac.htm?method=simpleSearch&query=idn%3D041160983. BTW: One can build the same with XSLT, if one likes (I did that once to parse a list of book offers which was just a large chunk of text: add an enclosing XML tag, then just use the substring functions in XSLT). Having done line-parsing with FileMaker scripts, XSLT, various programming languages or the JumboMarker software by Peter Murray-Rust, I can tell that there are always pitfalls.
                                • I would.
                                • Don't know.
                                • No, don't have. But I know somebody in the US who is fond of XSLT and willing to help (for cash) (and can also point to people in Germany if needed).
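The "line-eater" idea from the second answer could look roughly like this in Python. This is only a sketch under assumptions: the names `classify`, `parse_value` and `eat_lines` are invented, a header is taken to be exactly three digits followed by whitespace, and every "$" is treated as the start of a subfield, which would go wrong if a literal "$" appeared in the data.

```python
import re

TAG_LINE = re.compile(r"^(\d{3})\s+(.*)$")  # three-digit header, then the value

def classify(line):
    """Return ('set', None), ('field', (tag, value)) or ('junk', None)."""
    stripped = line.strip()
    if stripped.startswith("SET:"):
        return "set", None
    match = TAG_LINE.match(stripped)
    if match:
        return "field", (match.group(1), match.group(2))
    return "junk", None

def parse_value(value):
    """Split a value into the text before the first '$' and (code, text) subfields."""
    head, *rest = value.split("$")
    subfields = [(part[0], part[1:]) for part in rest if part]
    return head, subfields

def eat_lines(text):
    """Eat one line at a time: SET opens a record, fields are parsed, junk is dropped."""
    records = []
    for line in text.splitlines():
        kind, payload = classify(line)
        if kind == "set":
            records.append([])
        elif kind == "field" and records:
            tag, value = payload
            records[-1].append((tag, *parse_value(value)))
    return records
```

A per-header parsing rule, as suggested above, would then replace the single generic `parse_value` with one function per tag.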

                                 

                                Re your first comment: Love this clarity, Beverly - you found a box for me. My boss is talking of "information scientist". This cross-over of fields has both its strengths and weaknesses.

                                • 13. Re: Working with text
                                  francishunger

                                  Since this was just a favour for a friend, I did it pretty roughly. With search and replace in MS Word, I separated the individual records with ∆ (each new record starts with "SET") and then replaced all paragraph characters with ø. Then I replaced each ∆ with a paragraph character again and saved the result as a text file to be imported into Excel, so that I had one record per cell.

                                   

                                  The Excel file was imported into FileMaker, where I now had 1400 records, and I replaced the ø with paragraphs again. With a script I looped through each field as proposed by Mike, but maybe a bit simpler:

                                   

                                  Count the lines per record

                                  Loop

                                  Start at line 3, omitting the header

                                  put line into variable $line

                                   

                                  if the first three characters match "550"

                                  write $line to the field 550 (after stripping off "550 ")

                                  endif

                                   

                                  ...

                                   

                                  Endloop

                                   

                                  Then I exported that to Excel and explained to my friend how to use Excel's search and replace to get rid of any unwanted data such as !234059820!
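For anyone who would rather script that last clean-up step than do it by hand, here is a hedged Python sketch. It assumes the unwanted references are always a run of digits, optionally ending in X, enclosed in exclamation marks; `strip_references` is a made-up name.

```python
import re

# Matches references such as !234059820! (digits, optionally ending in X).
REFERENCE = re.compile(r"!\d+[Xx]?!")

def strip_references(value):
    """Remove !number! references from a field value and leave the rest untouched."""
    return REFERENCE.sub("", value)

cleaned = strip_references("!041160983!Pretoria$4ortg")
```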

                                   

                                  Thanks to everybody for the comments!