optimization: tokenize HTML or process textually entirely #15
Another interesting measurement: peak memory usage during conversion is reduced by about 150 MB when using HTML tokenization instead of HTML parsing.
I decided not to work on this for the time being, unless it becomes a blocker for anything. Help welcome :).
The post-processing for cross-reference detection is necessary only for man pages written in the old man(7) language, which is not semantic; there, references are usually written with the .BR or .IR macros. I think this should really be improved in mandoc itself, also as a way of working on #56. Until then, you can probably detect whether a manual is written in man(7) or mdoc(7) and post-process only the former 😉
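The suggested detection could be sketched roughly like this in Go. This is an illustrative heuristic, not debiman's actual code: it relies on the convention (which mandoc itself uses for format guessing) that mdoc(7) documents open with the .Dd macro, while man(7) pages open with .TH.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isMdoc reports whether the manpage source looks like mdoc(7) rather
// than man(7). Heuristic sketch: skip comments and blank lines, then
// check whether the first macro is .Dd (mdoc) or something else (man).
func isMdoc(source string) bool {
	scanner := bufio.NewScanner(strings.NewReader(source))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, `.\"`) {
			continue // skip blanks and roff comments
		}
		return strings.HasPrefix(line, ".Dd")
	}
	return false
}

func main() {
	fmt.Println(isMdoc(".Dd $Mdocdate$\n.Dt CRONTAB 5")) // mdoc: true
	fmt.Println(isMdoc(".TH CRONTAB 5"))                  // man(7): false
}
```

With that check in place, the expensive cross-reference post-processing would only run for pages where isMdoc returns false.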
That’s orthogonal to the issue in this ticket, I think: we do our own cross-referencing for internationalization.
Could you elaborate on what else is necessary to post-process besides the example?
Have a look at debiman/internal/convert/convert.go, line 249 at commit 7d479b8.
Post-processing consists of 3 steps:
Notably, ③ finds cross-references even if they include formatting directives (such as the italic tag in the example). Internationalization in this context means linking to the best language match for the target, as viewed from the source. For example, if the user is browsing manpages in Danish, but the target is only available in Norwegian and English, then we link to the Norwegian version. However, if the target is only available in, say, Italian and English, we’d link to the English version. mandoc doesn’t know which manpages are available in which language (at least in the way we’re invoking it), so doing language matching when cross-referencing is out of scope for mandoc, I think.
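The matching behavior described above could be sketched as follows. This is not debiman’s actual matcher; the related-languages table and the English fallback are assumptions made to reproduce the Danish/Norwegian/Italian example.

```go
package main

import "fmt"

// related maps a browsing language to languages its readers can likely
// also read, in order of preference. Illustrative assumption only.
var related = map[string][]string{
	"da": {"no", "sv"}, // Danish readers cope with Norwegian, Swedish
	"no": {"da", "sv"},
}

// bestLanguage picks the best available translation of a cross-reference
// target for a user browsing in the given language: exact match first,
// then a related language, then English as the fallback.
func bestLanguage(browsing string, available []string) string {
	avail := make(map[string]bool)
	for _, l := range available {
		avail[l] = true
	}
	if avail[browsing] {
		return browsing // exact match
	}
	for _, l := range related[browsing] {
		if avail[l] {
			return l // closest related language
		}
	}
	return "en" // fall back to English
}

func main() {
	fmt.Println(bestLanguage("da", []string{"no", "en"})) // Norwegian
	fmt.Println(bestLanguage("da", []string{"it", "en"})) // English
}
```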
Fair point, the stripping must be a remnant of when we didn’t use -Ofragment. We should remove it eventually (pull requests welcome!) |
Tokenizing shaves off about 1 minute from a 6-minute rendering of Debian unstable.
The code is not entirely straightforward to port due to the HTML-tag-agnostic cross-reference detection (e.g. for <i>crontab</i>(5)), which requires us to keep state after all. If we could improve mandoc’s cross-reference detection and id generation, we could probably get away with processing the HTML textually, which has the potential to shave off another 30 seconds.
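The state-keeping described above can be sketched over a token stream: the detector must remember the last text token so that when a later text token starts with “(5)”, the reference name can be recovered even though a formatting tag sat between them. The token type and regexes below are illustrative stand-ins, not debiman’s actual implementation (which consumes real HTML tokens).

```go
package main

import (
	"fmt"
	"regexp"
)

// token is a tiny stand-in for an HTML token stream: tag tokens are
// skipped, text tokens carry content. In real code this would come
// from an HTML tokenizer.
type token struct {
	isTag bool
	text  string
}

var nameRe = regexp.MustCompile(`([A-Za-z0-9_.:-]+)$`)
var sectRe = regexp.MustCompile(`^\(([1-9][a-z]*)\)`)

// findXrefs keeps state across tag boundaries: when a text token starts
// with a section like "(5)", the reference name is the trailing word of
// the previous text token, even if a tag like <i> sits between them.
func findXrefs(tokens []token) []string {
	var refs []string
	var lastText string
	for _, t := range tokens {
		if t.isTag {
			continue // formatting tags do not reset the state
		}
		if m := sectRe.FindStringSubmatch(t.text); m != nil {
			if n := nameRe.FindString(lastText); n != "" {
				refs = append(refs, n+"("+m[1]+")")
			}
		}
		lastText = t.text
	}
	return refs
}

func main() {
	// Models the HTML fragment: see <i>crontab</i>(5) for details
	toks := []token{
		{text: "see "},
		{isTag: true, text: "<i>"},
		{text: "crontab"},
		{isTag: true, text: "</i>"},
		{text: "(5) for details"},
	}
	fmt.Println(findXrefs(toks)) // [crontab(5)]
}
```

A purely textual pass over the serialized HTML would avoid even this token bookkeeping, but only if mandoc emitted the cross-references (or ids) itself, which is the improvement suggested above.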