optimization: tokenize HTML or process textually entirely #15
Another interesting measurement: peak memory usage during conversion is reduced by about 150 MB when using HTML tokenization instead of HTML parsing.
I decided not to work on this for the time being, unless it becomes a blocker for anything. Help welcome :).
The post-processing for cross-reference detection is necessary only for man pages written in the old man(7) language, which is not semantic; there, references are usually written with the .BR or .IR macros. I think this should really be improved in mandoc itself, also as a way of working on #56. Until then, you can probably detect whether a manual is written in man(7) or mdoc(7) and post-process only the former 😉
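The suggested detection could be sketched roughly like this in Go. This is an illustrative heuristic, not debiman's actual code: it relies on the convention (which mandoc itself uses for format guessing) that mdoc(7) documents open with the .Dd macro, while man(7) pages open with .TH.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isMdoc reports whether the manpage source looks like mdoc(7) rather
// than man(7). Heuristic sketch: skip comments and blank lines, then
// check whether the first macro is .Dd (mdoc) or something else (man).
func isMdoc(source string) bool {
	scanner := bufio.NewScanner(strings.NewReader(source))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, `.\"`) {
			continue // skip blanks and roff comments
		}
		return strings.HasPrefix(line, ".Dd")
	}
	return false
}

func main() {
	fmt.Println(isMdoc(".Dd $Mdocdate$\n.Dt CRONTAB 5")) // mdoc: true
	fmt.Println(isMdoc(".TH CRONTAB 5"))                  // man(7): false
}
```

With that check in place, the expensive cross-reference post-processing would only run for pages where isMdoc returns false.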
That’s orthogonal to the issue in this ticket, I think: we do our own cross-referencing for internationalization.
Could you elaborate on what else is necessary to post-process besides the example?
Have a look at debiman/internal/convert/convert.go, line 249 at commit 7d479b8.
Post-processing consists of 3 steps:
Notably, ③ finds cross-references even if they include formatting directives (such as the italic tag in the example). Internationalization in this context means linking to the best language match for the target, as viewed from the source. For example, if the user is browsing manpages in Danish, but the target is only available in Norwegian and English, then we link to the Norwegian version. However, if the target is only available in, say, Italian and English, we’d link to the English version. mandoc doesn’t know which manpages are available in which language (at least in the way we’re invoking it), so doing language matching when cross-referencing is out of scope for mandoc, I think.
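The matching behavior described above could be sketched as follows. This is not debiman’s actual matcher; the related-languages table and the English fallback are assumptions made to reproduce the Danish/Norwegian/Italian example.

```go
package main

import "fmt"

// related maps a browsing language to languages its readers can likely
// also read, in order of preference. Illustrative assumption only.
var related = map[string][]string{
	"da": {"no", "sv"}, // Danish readers cope with Norwegian, Swedish
	"no": {"da", "sv"},
}

// bestLanguage picks the best available translation of a cross-reference
// target for a user browsing in the given language: exact match first,
// then a related language, then English as the fallback.
func bestLanguage(browsing string, available []string) string {
	avail := make(map[string]bool)
	for _, l := range available {
		avail[l] = true
	}
	if avail[browsing] {
		return browsing // exact match
	}
	for _, l := range related[browsing] {
		if avail[l] {
			return l // closest related language
		}
	}
	return "en" // fall back to English
}

func main() {
	fmt.Println(bestLanguage("da", []string{"no", "en"})) // Norwegian
	fmt.Println(bestLanguage("da", []string{"it", "en"})) // English
}
```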
Fair point, the stripping must be a remnant of when we didn’t use -Ofragment. We should remove it eventually (pull requests welcome!) |
Tokenizing shaves off about 1 minute from a 6-minute rendering of Debian unstable.
The code is not entirely straightforward to port due to the HTML-tag-agnostic cross-reference detection (e.g. for <i>crontab</i>(5)), which requires us to keep state after all. If we could improve mandoc’s cross-reference detection and id generation, we could probably get away with processing the HTML textually, which has the potential to shave off another 30 seconds.
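The state-keeping described above can be sketched over a token stream: the detector must remember the last text token so that when a later text token starts with “(5)”, the reference name can be recovered even though a formatting tag sat between them. The token type and regexes below are illustrative stand-ins, not debiman’s actual implementation (which consumes real HTML tokens).

```go
package main

import (
	"fmt"
	"regexp"
)

// token is a tiny stand-in for an HTML token stream: tag tokens are
// skipped, text tokens carry content. In real code this would come
// from an HTML tokenizer.
type token struct {
	isTag bool
	text  string
}

var nameRe = regexp.MustCompile(`([A-Za-z0-9_.:-]+)$`)
var sectRe = regexp.MustCompile(`^\(([1-9][a-z]*)\)`)

// findXrefs keeps state across tag boundaries: when a text token starts
// with a section like "(5)", the reference name is the trailing word of
// the previous text token, even if a tag like <i> sits between them.
func findXrefs(tokens []token) []string {
	var refs []string
	var lastText string
	for _, t := range tokens {
		if t.isTag {
			continue // formatting tags do not reset the state
		}
		if m := sectRe.FindStringSubmatch(t.text); m != nil {
			if n := nameRe.FindString(lastText); n != "" {
				refs = append(refs, n+"("+m[1]+")")
			}
		}
		lastText = t.text
	}
	return refs
}

func main() {
	// Models the HTML fragment: see <i>crontab</i>(5) for details
	toks := []token{
		{text: "see "},
		{isTag: true, text: "<i>"},
		{text: "crontab"},
		{isTag: true, text: "</i>"},
		{text: "(5) for details"},
	}
	fmt.Println(findXrefs(toks)) // [crontab(5)]
}
```

A purely textual pass over the serialized HTML would avoid even this token bookkeeping, but only if mandoc emitted the cross-references (or ids) itself, which is the improvement suggested above.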