Parsing strategy for tables?

xoofx · March 13, 2016, 12:18am

The breaking change that you are in favour of would alter a whole established rule (backtick code spans, inline HTML, link labels, etc. don’t require a specific escape for |) just because one extension may need to parse the |. You prefer to change this rule (which is a general end-user rule) rather than an implementation detail (inline parsing should come after block parsing, or inline parsing should not modify block structure; does the end user really care about this?).

It is fine to try to make life easier for developers, but end-user experience and requirements usually come with a higher priority… So yes, more generally, we have a different perspective on software development here.

MathieuDuponchelle · March 19, 2016, 5:22pm

Well, I’m of the opinion that the separation of block and inline parsing is a good design decision, which should not be discarded lightly. Note that I don’t think a single pipe in a given line should be enough to make it a table; that also seems to me like it would break end-user expectations.

Requiring either a line prefix, or a vertical rule (dashes on the second line) would seem judicious to me, and would make the cases you’re worried about very unlikely anyway.
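A minimal sketch of the second option (a line of dashes on the second line), assuming a simplified delimiter-row syntax. The `DELIMITER_ROW` pattern and the `starts_table` helper are hypothetical names for illustration, not part of any existing implementation:

```python
import re

# Assumed delimiter-row syntax: runs of dashes with optional alignment
# colons, separated by pipes, e.g. "| --- | :-: |" or "--- | ---".
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def starts_table(lines, i):
    """Return True if lines[i] looks like a header row and lines[i + 1]
    looks like a delimiter row."""
    if i + 1 >= len(lines):
        return False
    header, delimiter = lines[i], lines[i + 1]
    # Require a pipe on both lines so that a plain "---" under a paragraph
    # (a setext heading) is not mistaken for a table.
    return (
        "|" in header
        and "|" in delimiter
        and DELIMITER_ROW.match(delimiter) is not None
    )
```

With this rule, a prose line such as `cat file | grep foo` followed by more prose is never treated as a table, because the second line is not a delimiter row.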

jgm · March 19, 2016, 10:58pm

@xoofx I believe @MathieuDuponchelle’s proposal is not to change the behavior of backtick code spans generally, so that you always need to escape | characters in them, but just to require these escapes on table row lines. The backslash would be removed before the cell contents are passed to the inline parser.
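A rough sketch of that approach, assuming the row has already been recognised as a table row by the block parser. The `split_row` helper is a hypothetical name, and the handling of a doubled backslash before a pipe is deliberately left out:

```python
import re

# Split on "|" only when it is not preceded by a backslash.
# Simplifying assumption: "\\|" (an escaped backslash followed by a pipe)
# is not treated specially here.
UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_row(line):
    """Split one table row into cell strings at block-parsing time."""
    line = line.strip()
    # Drop the optional leading and trailing pipes.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|") and not line.endswith("\\|"):
        line = line[:-1]
    cells = UNESCAPED_PIPE.split(line)
    # Remove the escaping backslash before the cell contents are handed
    # to the inline parser.
    return [cell.strip().replace("\\|", "|") for cell in cells]
```

For the row ``| `a \| b` | c |`` this yields the two cells `` `a | b` `` and `c`, so the inline parser sees an ordinary code span containing a pipe.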

I find both alternatives fairly bad, but if we’re going to have pipe tables we will have to make a choice. I don’t think the priority of block to inline parsing is just an implementation detail; it’s a conceptual thing that users sometimes need to know about in order to predict how something will be rendered.

For example,

- [this
- is](notalink)

So, breaking this priority may end up confusing users in just the same way as requiring a backslash before | in backtick code spans. They both go against user expectations. I wish I saw a good alternative.

xoofx · March 21, 2016, 9:24am

Hm, I completely agree with this and didn’t mean to say otherwise. What I meant was that inline parsing can modify block structure, which was the problem in my original post. I would like to support existing Markdown flavours (most notably GitHub, pandoc, PHP Markdown Extra) for pipe tables. As @MathieuDuponchelle noted, they may diverge in some corner cases, but roughly, in their implementations, a pipe inside a backtick code span is protected and does not count as a table pipe marker. So I had to implement this behaviour. I was not saying that this must be the spec for a future pipe table extension within CommonMark. The escape \| might be a reasonable constraint for the spec, but until it becomes a widely used standard, I will have to support the existing behaviours (however heterogeneous their “specs” are).
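To illustrate the behaviour described here, the following is a simplified sketch of a row splitter in which pipes inside backtick code spans do not count as cell separators. It is not taken from any of the implementations mentioned, which differ in their corner cases; the function names are hypothetical, and leading or trailing pipes are not trimmed:

```python
def find_closing_run(line, start, length):
    """Return the index of the next backtick run of exactly `length`
    characters at or after `start`, or -1 if there is none."""
    i = start
    while i < len(line):
        if line[i] == "`":
            j = i
            while j < len(line) and line[j] == "`":
                j += 1
            if j - i == length:
                return i
            i = j
        else:
            i += 1
    return -1


def split_row_respecting_code(line):
    """Split a row on pipes, skipping pipes inside backtick code spans
    and pipes escaped with a backslash."""
    cells, current = [], []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch == "\\" and i + 1 < n:
            # Keep the escape sequence intact; an escaped pipe never splits.
            current.append(line[i:i + 2])
            i += 2
        elif ch == "`":
            j = i
            while j < n and line[j] == "`":
                j += 1
            close = find_closing_run(line, j, j - i)
            if close == -1:
                # Unmatched backticks are literal text.
                current.append(line[i:j])
                i = j
            else:
                # Copy the whole code span, pipes included.
                end = close + (j - i)
                current.append(line[i:end])
                i = end
        elif ch == "|":
            cells.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    cells.append("".join(current).strip())
    return cells
```

Note that this splitter has to understand code span syntax, which is inline-level knowledge, while deciding the block-level cell structure. That coupling is exactly the point under debate in this thread.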

MathieuDuponchelle · March 25, 2016, 1:14pm

Hm, I completely agree with this and didn’t mean to say otherwise. What I meant was that inline parsing can modify block structure, which was the problem in my original post.

You seem to have changed your mind since the post where you were giving me software engineering lessons and literally saying so. Anyway, you’re playing with words here: if inline parsing can modify block structure, then block parsing doesn’t truly have priority over inline parsing, and the boundaries are not clear-cut.

I would like to support existing Markdown flavours (most notably GitHub, pandoc, PHP Markdown Extra) for pipe tables.

Well good for you, our idea here is to define a specification though, not be compatible with N incompatible markdown formats.

So I had to implement this behaviour. I was not saying that this must be the spec for a future pipe table extension within CommonMark. The escape \| might be a reasonable constraint for the spec, but until it becomes a widely used standard, I will have to support the existing behaviours (however heterogeneous their “specs” are).

That could have been clearer. As for your stated goal, good luck then!

MathieuDuponchelle · March 25, 2016, 1:24pm

I find both alternatives fairly bad, but if we’re going to have pipe tables we will have to make a choice. I don’t think the priority of block to inline parsing is just an implementation detail; it’s a conceptual thing that users sometimes need to know about in order to predict how something will be rendered.

Yep, that was pretty much my thinking when you first stuck my nose into this issue in https://github.com/jgm/cmark/issues/100#issuecomment-189892502 (sorry for linking to this heavy thread, but it’s relevant). In my opinion, if we’re going to have a consistent specification, then the block priority rule is to be enforced. I don’t think it’s bad as long as it’s enforced consistently and made clear from the get-go.

xoofx · March 25, 2016, 11:11pm

Sorry for that. I think that from the beginning I made it pretty clear that I was focusing on implementation (I was always talking about parsing and not specs, even if they are related). On the other hand, admit that you silently dragged me into a discussion of specs without telling me your goals. I realized that a bit late and responded in a post above that I would rather see CommonMark 1.0 out before digging too much into the details of a spec. But yeah, sorry, I came back into a discussion about specs that was bound to lead to divergent points of view. Anyway, seeing that this debate is not that easy, it rings a bell: it will not be easy to get everyone to agree when the devs who support the current pipe table behaviour for thousands of users come to the (pipe) table of negotiation…

MathieuDuponchelle · March 25, 2016, 11:24pm

Sure, thanks for stating this. I have to admit your previous answer came across (to me) as a bit condescending, and I kept a mental note to get back to you in some petty way when the time came. That’s done now.

As for your initial post, it was clear that you were focusing on implementation, true, but on top of commonmark’s syntax, which means that in any case there will be some level of incompatibility with existing markdown formats for piped tables, and I think any syntax extension on top of commonmark should disregard corner cases and known shortcomings of existing formats, and directly go for a satisfying syntax (even if it misses, it’s nice to try and see what works).

Regarding your actual work, I strongly suggest you have a look at my proposal for syntax extension support in libcmark itself, as it would spare you the trouble of maintaining a separate C# implementation for the sake of syntax extensibility. I’ve implemented inline extensions too, so if you really feel like going down the dirty, mucky road of parsing tables at inline-parsing time, you should be able to do so pretty easily. Discussion is still at https://github.com/jgm/cmark/issues/100 , implementation at https://github.com/MathieuDuponchelle/cmark/commits/extensions_draft_3 , feedback is much appreciated.

xoofx · March 26, 2016, 12:39am

Yeah, thanks, I like to go down dirty! I have already worked on extensions for the past few weeks and didn’t want to rely on any existing implementation (not even looking at how things were done, just as I started from scratch for my CommonMark implem). I have finished implementing 13+ extensions as of now, mostly focusing on major ones but also including things like grid tables (which support parsing multiline blocks separated into columns). Going through all these extensions was a good way to iteratively challenge the design of a somewhat satisfying extension system (one that includes block parser, inline parser and renderer extensibility). I will share it on GitHub as well when I’m done with some other (intra)related projects.

MathieuDuponchelle · March 26, 2016, 12:56am

Why not work on this in the reference implementation instead of reinventing the wheel, though? Shouldn’t C be portable enough that there is no need to go for vendor-locked-in languages?

xoofx · March 26, 2016, 1:37am

Good question! First, my main development language for the past 8 years has been C#. I worked in C/C++ for years before that (and Java, and several others), and I dislike too many things in these languages to enjoy using them anymore, especially when I can develop something much more efficiently in a different language with an acceptable 20-40% performance degradation (at the same feature level). Also, C# has been evolving in some great directions over the past 2 years: it is no longer really vendor locked-in, the runtime and compilers are MIT-licensed on GitHub, and it now runs on Linux and Mac OS X.

I also had a look at CommonMark.NET, but as its implementation is a port of the C implem, it was not suited to the kind of extensibility I was looking for. Also, I was not really happy about some design decisions in their implem, and the work done in their new implem branch was not going in the kind of direction/simplicity I’m seeking. And well, I have seen many cases in practice where it was not possible to make “radical” changes to an existing project you don’t “own”. Sometimes the wheels are so different (e.g. square vs. circle) that things will not move the same way!

I also think that it has many benefits for the CommonMark initiative that implems are not coming from the same implem seed

MathieuDuponchelle · March 26, 2016, 4:15am

I also think that it has many benefits for the CommonMark initiative that implems are not coming from the same implem seed

I actually think that is detrimental. The only reimplementation which I perceive as having any value is the JavaScript one, as browsers will not let you execute native code (at least not yet), and I don’t know whether transpiling the C implementation to JavaScript would be a viable solution.

As for the other implementations, I simply see them as wasted energy and useless fragmentation when all they do is reimplement the same parsing strategy in a different language. libcmark’s upstream is willing to make these “radical” changes you’re interested in, it’s of course difficult to get things upstream because there’s a requirement of quality, but getting through that admittedly frustrating process will mean better code for everyone in the end.

chrisalley · March 26, 2016, 4:43am

Having numerous implementations may be less efficient, but having a CommonMark ecosystem (rather than a single implementation) will make the spec more robust in the long run. For example, as each implementer has to understand the spec before implementing it, any issues with the spec will be seen by more eyes.

MathieuDuponchelle · March 26, 2016, 4:49am

But conversely, any issues with the implementation(s) will be seen by fewer eyes.

chrisalley · March 26, 2016, 5:35am

This is true. In cases of inconsistencies in other implementations, I think that is a good reason to improve the spec tests (which all CommonMark implementations should pass).

xoofx · March 26, 2016, 5:35am

To give you a glimpse: if I wanted to implement this in C, it would be a major rewrite of the implem (and would be pretty similar to my C# implem), not an incremental evolution, and I would do it very differently from how it is done today. Even in your branch’s inlines.c, the switch case is hardcoded in the old-school way in this method, even if try_extensions is added as an escape. That is, by the way, far from performance friendly: you need to go through all extensions every time a character is not handled by the static cases to find a good candidate, whereas all of this should be cached in advance and performed by a simple table lookup.
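A minimal sketch of the table-lookup idea, for illustration only. It reflects neither the cmark branch nor the C# parser discussed here; the `InlineExtension` class, its `triggers` attribute and the placeholder `try_parse` method are all hypothetical:

```python
class InlineExtension:
    """Hypothetical inline extension that declares its trigger characters."""

    def __init__(self, name, triggers):
        self.name = name
        self.triggers = triggers

    def try_parse(self, text, pos):
        # Placeholder: a real extension would try to parse at `pos` and
        # return None on failure. Here it just reports who was called.
        return (self.name, pos)


def build_dispatch_table(extensions):
    """Build the character -> extensions map once, before parsing starts."""
    table = {}
    for ext in extensions:
        for ch in ext.triggers:
            table.setdefault(ch, []).append(ext)
    return table


def parse_inlines(text, table):
    """Consult only the extensions registered for the current character."""
    matches = []
    for pos, ch in enumerate(text):
        for ext in table.get(ch, ()):
            result = ext.try_parse(text, pos)
            if result is not None:
                matches.append(result)
                break
    return matches


extensions = [InlineExtension("math", "$"), InlineExtension("strike", "~")]
table = build_dispatch_table(extensions)
matches = parse_inlines("a $x$ ~b~", table)
```

Ordinary characters cost one dictionary lookup each, and no extension is called for them, whereas a linear scan would ask every registered extension about every unhandled character.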

In my code, everything, including standard Markdown parsing, is done through plugins. I can disable backtick code spans if I want, an extension could decide to change the opening character for heading text and replace # with @, for example, etc. The main parsing method for inlines is just a loop of 100 lines, and everything else is self-contained, with one feature per file (instead of one big blob file like inlines.c). So yes, that justifies what I said about “radical” changes.

Also, as @chrisalley noted about implems strengthening the specs, it is typically what happened while I implemented CommonMark from scratch: See for example issue #395

MathieuDuponchelle · March 26, 2016, 6:06am

Even in your branch’s inlines.c, the switch case is hardcoded in the old-school way in this method, even if try_extensions is added as an escape. That is, by the way, far from performance friendly: you need to go through all extensions every time a character is not handled by the static cases to find a good candidate, whereas all of this should be cached in advance and performed by a simple table lookup.

That’s a very trivial remark. The check is only done for special characters, which makes it mostly harmless, and making this use a table would take around 5 minutes if profiling showed it was indeed draining performance.

So yes, that justifies what I said about “radical” changes.

Not really, that simply shows you did this in your parser, which you develop in a closed-source manner, making it difficult to compare it factually and benchmark against the current cmark implementation.

xoofx · March 26, 2016, 8:50am

I will release it as soon as the API is stable, the project name is “secured”, another related project is finished, and the proper website and documentation are done… It will be published on GitHub and announced on this forum. It will also come with benchmarks comparing it to other .NET solutions and to cmark for raw performance (for now, at least 20-40% slower, but not feature-wise equivalent). So, stay tuned…

MathieuDuponchelle · April 2, 2016, 11:23pm

Cool! Care to share the list of extensions you have currently implemented, and detail the issues you might have had, if any?

xoofx · April 3, 2016, 1:53pm

So far, I have implemented the following extensions:

abbreviations
auto identifiers and auto link (similar to pandoc)
css bootstrap (to output some classes specific to bootstrap for some elements)
custom containers (mostly block ::: and inline ::)
definition lists
Emoji/Emoticons
Emphasis extra (strikethrough, subscript, superscript, inserted, marked)
Figures (as described here)
Footnotes
Special/Attached attributes (allows attaching CSS attributes to the current block or the previous inline with the syntax {...}; works with all blocks and all inlines, including headings and fenced code blocks)
Softline as Hardline
Lettered list (a. b., A. B., i. ii. iv., etc.)
Mathematics (block/inline escape $$...$$, and $...$ )
Medias (renders an image differently when its URL has a video or music MIME type, or links to YouTube/Vimeo)
Smartypants
Pipe tables
Grid tables (slightly extended version of pandoc)

No real issues. Maybe attached attributes are a bit special, as they need hooks into some other block parsers (for fenced code blocks, to parse the attributes before parsing the info string, or for headings, before removing the optional trailing ###).

In terms of performance, the side effect of adding proper extensibility for all these extensions (including the ability to remove/modify standard CommonMark parsing) has slowed the whole thing down by around 20% (in C#), which is quite acceptable.