Sorry, but every web page is unique. There is no universal parsing calculation. If you post an example of the page, or at least its URL, maybe someone can come up with something to help you get started.
Even after you figure out a parsing methodology, the vendor will eventually change the page and you'll have to work out a new methodology for the new page.
If the vendor has an API to retrieve the data, for example a SOAP or other web service, it would be much better to use that.
And some sites, if they discover you are scraping their site, will prosecute. Be careful what you do with other people's data from the web. See if you can get the data from a feed.
I set up specific feeds for many of my clients who needed to give their data to others. These were secured so no one else could get at them.
Yes, great points. Faculty enter their publications into a form on our website and it goes into our database. It isn't uncommon for more than one faculty member to be on a paper, so the system collects multiple duplicate records. This would be fine if they all submitted the citation the exact same way. But believe it or not, some faculty will enter an incorrect title, or list the authors differently. They also enter the PubMed number, and what I want to do is grab the citation from there, as it will be correct. This is all public data.
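For what it's worth, the duplicate problem described above could be approached by grouping submissions on the PubMed number rather than on the title or author text. Here is a rough Python sketch of that idea, not anything from the actual system: the field names ("pmid", "title") and the functions normalize_pmid, group_by_pmid and duplicate_pmids are all made up for illustration, and the PMID shown is a placeholder.

```python
import re
from collections import defaultdict


def normalize_pmid(raw):
    """Pull the first run of digits out of whatever was typed, or None if there is none."""
    match = re.search(r"\d+", str(raw))
    return match.group(0) if match else None


def group_by_pmid(submissions):
    """Group submitted records by normalized PMID.

    `submissions` is assumed to be a list of dicts with a hypothetical
    "pmid" key. Records with no usable PMID end up under the key None.
    """
    groups = defaultdict(list)
    for record in submissions:
        groups[normalize_pmid(record.get("pmid"))].append(record)
    return dict(groups)


def duplicate_pmids(groups):
    """Return the PMIDs that were submitted more than once."""
    return sorted(p for p, recs in groups.items() if p is not None and len(recs) > 1)


if __name__ == "__main__":
    # Placeholder data: same paper entered by two people with different titles.
    submissions = [
        {"pmid": "PMID: 12345678", "title": "Title as typed by author one"},
        {"pmid": 12345678, "title": "Title typed differently by author two"},
        {"pmid": "", "title": "Entry with no PubMed number"},
    ]
    groups = group_by_pmid(submissions)
    print(duplicate_pmids(groups))
```

Each group with more than one record could then be collapsed to a single citation fetched from PubMed, with the faculty members linked to that one record.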
I received this back from PubMed:
We have a suite of services called eUtilities which I think will meet your needs. Please see the documentation here:
I'll dig in and see how this can assist me.
Back in the day, there was a product called 'EndNote', which was far and away the best tool for all sorts of reference maintenance. They did everything I would have wanted to do for gathering, maintaining, and using a database of citations. Rather than reinvent the wheel, I'd suggest looking into what they are still doing. At some point, I thought Microsoft bought them and made them part of MS-Word....
I recall a tool, something like a web viewer, that enabled the user to search a variety of on-line resources for citations, then collect the data into their local table structure with a single click.
After you build an EndNote database, you could certainly export the data in any number of standardized and customized data formats. So, if you really wanted to, you could use it as the gathering and formatting tool for your FileMaker system, reducing a lot of development time and submission errors.
Drew, EndNote would be of no help to this project in any way. Thx.
You might have to use NCBI's E-utility. See if the answer to this post helps:
Also here's the NCBI E-utility page: http://www.ncbi.nlm.nih.gov/books/NBK25500/ . It looks like tools can return the article's information in XML format.
http://hublog.hubmed.org/archives/001518.html mentions that there is an API that returns XML, which again points to the NCBI E-utils page above.
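To give an idea of what working with that XML might look like, here is a minimal Python sketch that builds an ESummary request URL for a list of PubMed IDs and pulls the title and authors out of a response. Some caveats: the function names are my own, the sample XML is hand-written placeholder data in what I understand to be the older DocSum/Item layout of ESummary, and the live response may differ (it can carry more fields or a newer layout), so check it against the E-utilities documentation before relying on it. The sketch does not make any network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esummary_url(pmids):
    """Build an ESummary URL for one or more PubMed IDs (comma-separated)."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "xml",
    }
    # safe="," keeps the commas readable instead of percent-encoding them
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(params, safe=",")


def parse_esummary(xml_text):
    """Return {pmid: {"title": ..., "authors": [...]}} from ESummary-style XML.

    Assumes the DocSum/Item layout shown in SAMPLE_XML below.
    """
    root = ET.fromstring(xml_text)
    records = {}
    for doc in root.findall("DocSum"):
        pmid = doc.findtext("Id")
        items = {item.get("Name"): item for item in doc.findall("Item")}
        title = items["Title"].text if "Title" in items else None
        author_list = items.get("AuthorList")
        authors = [a.text for a in author_list.findall("Item")] if author_list is not None else []
        records[pmid] = {"title": title, "authors": authors}
    return records


# Hand-written placeholder response, not real PubMed data.
SAMPLE_XML = """<eSummaryResult>
  <DocSum>
    <Id>12345678</Id>
    <Item Name="Title" Type="String">Placeholder title</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Author A</Item>
      <Item Name="Author" Type="String">Author B</Item>
    </Item>
  </DocSum>
</eSummaryResult>"""

if __name__ == "__main__":
    print(build_esummary_url([12345678]))
    print(parse_esummary(SAMPLE_XML))
```

In a FileMaker setup the same URL could presumably be fetched with whatever HTTP facility is available and the XML parsed there; the Python above is only meant to show the shape of the request and the response handling.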