-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KGL Pubfactory journals #3009
Add KGL Pubfactory journals #3009
Conversation
Hello Brendan :) I added a few comments mostly about extracting stuff using selectors. In addition, I think the regex in the translator metadata is a bit too complex to understand and debug (personally I had to rely on a visualizer), and it may miss some "multiple" links. My observation is that with Pubfactory, the journal-article links all look like this: domain; a path containing "view/journals/"; followed by intervening paths depending on the content type; followed by some path containing "xml" This would translate to the RE (not valid JS syntax, but something to be put in Scaffold) ^https://[^/]+/view/journals/.+\.xml\b which should capture both individual-article, issue-view, and the "overview" URLs. But there's a catch with URLs like this:
or For those I'm tempted to use the following (again for Scaffold) ^https://[^.]*journals\.[^/]+/search\b This would catch domains like So the end result would be ^https://([^/]+/view/journals/.+\.xml|[^.]*journals\.[^/]+/search)\b If my caution about the the "domain pattern" should be unnecessary, the regex could be further simplified to ^https://[^.]*journals\.[^/]+/(view/journals/.+\.xml|search)\b How do you think? |
Thanks for the extremely detailed and helpful comments! I've incorporated everything, including the new regex - I agree that my regex was overly complex and hard to read. Will submit a new commit soon, still working on another issue that turned up while I was doing testing. |
It appears that some (all?) Brill journals are using KGL Pubfactory too. Example: https://brill.com/view/journals/tpao/109/1-2/article-p1_1.xml Another candidate for migration. |
Thanks for spotting this! Yes, at a glance all the Brill journals I looked at seem to use PubFactory as a platform. It doesn't appear that there's any migration necessary - I have the PubFactory translator already running in my browser extension for testing, and the PubFactory factory regex appears to be the top match for these Brill journal URLs, so it's defaulting to using the PubFactory translator anyway. |
|
Oh, I should be more specific. I think the Brill journals use KGL Pubfactory, but I'm not sure about the books and reference works. |
@zoe-translates yes, exactly - I was just looking at the journals. Brill publishes a ton of books, relatively few journals. We definitely still need the Brill translator for their books. |
In that case, maybe just leaving the Brill translator handle all Brill content (which will happen once KGL is set to priority 200, I think) is the right move |
OK, I've set KGL PubFactory translator as priority 200, so that the Brill translator handles all Brill content, as you suggested. Thanks! |
There's another issue with the translator that I haven't been able to resolve: tests are running successfully in Scaffold and returning I've found a couple possible explanations for this, but haven't figured out how to resolve the issue. One is simply extremely slow page load times: from the Network monitor in Firefox, I'm getting I've tried doing this, but can't find an API request or JSON that the page is being generated from when I look at Network traffic. Another possibility: I found this thread from you, @adam3smith, on a problem with the AMS translator, which no longer seems to exist: https://groups.google.com/g/zotero-dev/c/2gCD5-mM_Cg/m/tbJH6siaUWkJ. If we're indeed talking about the same AMS here, it appears that this could be the same issue with the content-type meta tag being improperly placed in the page. In that case, it would appear that there isn't a solution for this issue. |
I'm afraid that there's little you can do to give it a fix here, because the page you cited as an example wasn't valid HTML. Even if Firefox could "fix" it and display the page as if it were valid, its implementation of the JS API couldn't parse it in the intended way (although in this case there's no "correct" solution). And this is not just with Firefox -- neither Chromium nor Safari could do it. In the EM translator, you have var metaTags = doc.head.getElementsByTagName("meta");
Z.debug("Embedded Metadata: found " + metaTags.length + " meta tags.");
if (forceLoadRDF /* check if this is called from doWeb */ && !metaTags.length) {
if (doc.head) {
Z.debug(doc.head.innerHTML
.replace(/<style[^<]+(?:<\/style>|\/>)/ig, '')
.replace(/<link[^>]+>/ig, '')
.replace(/(?:\s*[\r\n]\s*)+/g, '\n')
);
}
else {
Z.debug("Embedded Metadata: No head tag");
}
} Here the |
Thanks for this explanation! When I look at the Inspector on the fully loaded page in the browser (https://journals.ametsoc.org/view/journals/amsm/59/1/amsmonographs-d-18-0011.1.xml), it appears that the |
The Inspector shows a live version of the DOM, which isn't just the result of parsing the document source. First the browser will try to parse the source, and then apply sanitization, to "fix" the errors it can fix. And on top of that, the document may be mutated by JS code executed by the browser. But the parser features exposed by the JS API does just that, the parsing. There's no browser, and therefore, there's no extra processing (sanitization of anomalies in the source), and no execution of the document's JS code after that. So when translation is performed on the live document, the translator code has access to the live DOM via the browser. But when it must fetch some additional source and parse it using JS API (via Zotero functions like The latter is what happens when the code processes One workaround is to apply some "fixing" by ourselves in the translator. For example: translators/Philosopher's Imprint.js Lines 82 to 87 in cc7c3c7
Here the code is dealing with exactly the same problem (malformed input -> But this must be done carefully, if ever, lest it mutates the DOM of a live page. (This hack is no longer necessary for the aboveshown translator because the sources have been fixed.) |
let em = await translator.getTranslatorObject(); | ||
em.itemType = 'journalArticle'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a PR, introducing a flag to Embedded Metadata.js
that tells it to look for the <meta>
tags in the body in addition to the head: PR #3083.
If that one gets merged, here you could add
if (doc.querySelector("body > meta")) {
em.searchForMetaTagsInBody = true;
}
and that would be all you need to do in order to work around the bad HTML.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this extremely clear explanation! This all makes sense now. So I'll wait to make a new PR with searchForMetaTagsInBody
, once your PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're welcome <3 that PR is an "if" though. If it gets merged, you could do that. But if not, there's probably not going to be too much ill consequence with the "copy body meta tags to head" hack, if you restrict the hack to the scraping of multiples and leave the live-scraping of a doc unchanged (i.e. relying on the browser's sanitization).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Merged now!)
Hey, thanks for clarifying that - given that this translator doesn't address an urgent issue, I don't mind letting it sit for awhile and see if your PR gets merged. Otherwise, I'll try implementing the more hack-y solution. |
And update tests.
KGL PubFactory is a hosting platform for over 1300 academic journals. While individual PubFactory journal articles could previously be saved to Zotero using the Embedded Metadata translator, this new translator addresses a couple of issues:
multiple
for journal issue landing pages, e.g. https://journals.ashs.org/hortsci/view/journals/hortsci/58/5/hortsci.58.issue-5.xmlabstractNote
for journals that include the word Abstract in the Abstract field, e.g. https://journals.ametsoc.org/view/journals/atsc/79/12/JAS-D-22-0008.1.xml. This addresses American Meteorological Society: No longer uses Atypon #2997