Feature request: automatically generated ids for headers

matmuchrapna · June 29, 2015, 1:06pm

Why can’t this feature be delegated to extensions?

After some time, everybody will be able to choose the implementation they prefer.

jgm · June 29, 2015, 5:02pm

My main worry about automatically generated header IDs is that, in order to ensure uniqueness, you have to add subscripts or use some other mode of disambiguation. And then the problem is that the ID of a particular element might change due to changes elsewhere in the document, which can lead to broken links. @Crissov’s proposal is quite nice, and would substantially reduce the need for such disambiguation, but not eliminate it entirely. (Maybe it would eliminate it enough?)

For internal links it’s nice if the Markdown renderer creates both the target IDs and the links, as is done in pandoc:

## My header

See [My header], or [above][My header].

But of course there’s still a problem about disambiguation, and links from outside the document still need to know the generated ID.
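To make the stability concern concrete, here is a minimal Python sketch of one common disambiguation scheme: slugify the heading text and append a numeric suffix to repeats. The names `slugify` and `assign_ids` are hypothetical and the scheme is an assumption for illustration, not what pandoc or any other implementation is confirmed to do.

```python
import re


def slugify(text):
    """Lowercase, drop punctuation, and join the remaining words with hyphens."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")


def assign_ids(headings):
    """Give each heading an ID; repeated slugs get -1, -2, ... appended.

    Deliberately naive: it does not guard against a heading whose own
    text happens to slugify to something like "setup-1".
    """
    seen = {}
    ids = []
    for heading in headings:
        base = slugify(heading)
        count = seen.get(base, 0)
        seen[base] = count + 1
        ids.append(base if count == 0 else f"{base}-{count}")
    return ids


before = assign_ids(["Setup", "Usage", "Setup"])
# Adding another "Setup" heading at the top shifts the suffix of the last one.
after = assign_ids(["Setup", "Setup", "Usage", "Setup"])
print(before[-1], "->", after[-1])
```

Under this scheme the last heading's ID changes even though the heading itself was not edited, so any external link to the old ID would now land on a different heading.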

hulkur · June 29, 2015, 8:34pm

Being new to markdown I don’t know all the right syntax but I want to mention my use-case for referencing headers.

Previously I used my own markup and renderer, where I had auto-generated IDs for headers and a {TOC} tag that generated a table of contents for these headers. This way I could change headers as needed and not worry about changing them in the ToC.

Of course, it had the above-mentioned problems: auto-generated IDs are not constant over time and not predictable, so they are not linkable from outside. For that I also added explicit IDs.

Looking at the spec, I think the most understandable approach would be to use reference syntax and explicit IDs (implicitly generated IDs can be added). Usage would not be limited to the ToC; it could also be used internally in the text.

I think something like the following would work:

Table of Content
1. [header_ref]
1.1 [section_ref]
1.2 [other_ref]

[header_ref]: # Top Header {#explicitID}
Some text with mention of [section_ref] to link internally.
Maybe this could also have [section_ref](some other link text) for internal linking

[section_ref]: ## Section Title
Some more text

There is still the problem of reordering headers, but this is easier to handle with only references to worry about (references don’t change that often).

The ToC could be fully auto-generated (post-processing) if requested (like my {TOC} tag), but that is a whole other issue.

On the topic of conflicting auto-generated IDs: add a prefix like md- or mdref-.
You can’t account for all cases, but you can make an educated guess at what “will work in most cases”.
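The post-processing ToC and the md- prefix mentioned above could be combined roughly as in the following Python sketch. It is an illustration under assumptions: `build_toc` is a hypothetical name, only ATX headings are recognised, and closing `#` sequences and fenced code blocks are not handled.

```python
import re

# An ATX heading: one to six '#', at least one space, then the title.
ATX_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def build_toc(markdown, prefix="md-"):
    """Return a nested Markdown list linking to prefixed, generated heading IDs."""
    entries = []
    for line in markdown.splitlines():
        match = ATX_HEADING.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        title = match.group(2)
        # Assumed ID scheme: lowercase, runs of non-alphanumerics become one hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        indent = "  " * (level - 1)
        entries.append(f"{indent}- [{title}](#{prefix}{slug})")
    return "\n".join(entries)


doc = "# Top Header\nSome text\n## Section Title\nSome more text\n"
print(build_toc(doc))
```

The renderer would need to emit the same prefixed IDs on the headings themselves for these links to resolve.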

asbjornu · January 28, 2016, 8:39am

Sorry for bumping this old thread, but it doesn’t seem to have reached consensus and I think Markdown/Commonmark really needs this, so I want to give @zwol a .

The author may not wish to allow headings to be linked to. For example, the headings may be subject to change in the future (we discussed reordering headings above), so a link to the overall document may be preferred.

A link to a non-existing header will become a link to the overall document, so HTML takes care of that use case right out of the box. I thus find this to be a weak argument against adding an id attribute to all headers.

I also want to add that I agree with @an3ss in using the text content of the header as the reference, leaving to the implementation how the IDs are generated. Mandating how the resulting HTML needs to look is in my opinion not required; just that it should be possible for Commonmark itself to make self-references to headers defined within the same document.

This of course requires an id attribute to be added to all headers, but it does not need to have the same algorithm for how these id attributes are generated across implementations. I thus think this is a pretty simple and small addition to the Commonmark language.

chrisalley · January 29, 2016, 9:02am

If the headings are reordered, an old link to the heading would now point to the wrong heading, which could mislead the reader. For this reason, the author may not wish to allow direct links to the heading if it is subject to change later.

If the algorithms are different, and a CommonMark parser is swapped out with another CommonMark parser, the IDs may no longer be the same. This strikes me as problematic; it’s preferable that links to headings continue to work across implementations.

Crissov · January 31, 2016, 2:42pm

That’s only true if ordinal IDs were being used (e.g. for multiple headings with the same textual content). That’s a border case and alternatives which avoid the problem have been demonstrated.

Internal links would still work, because IDs are not used directly by authors.

If this were indeed considered a problem to be solved by the spec, the solution would be explicit ID overrides. I’m not 100% sure how this should look. I present two variants here:

# Variant 1
## Implicit ID

Paragraph with links to [implicit ID] and [explicit ID][#ExplicitID]. 

  [Implicit ID]: #ExplicitID

(Both links are the same if the parser supports explicit IDs, 
 otherwise both links will fail in output, 
 because there is no `ID` attribute value ‘ExplicitID’.)

# Variant 2
## Implicit ID

Paragraph with links to [implicit ID] and [explicit ID][#ExplicitID]. 

  [#Implicit ID]: #ExplicitID

(Only the second link works – perhaps – if the parser supports explicit IDs, 
 otherwise only the first link works, 
 unless the output format supports multiple IDs per element.)

  [Implicit ID]: http://example.com/overwritten

(Both links now work if the parser supports explicit IDs, 
 but they have different targets, 
 otherwise only the first link, to an external site, works.)

chrisalley · February 1, 2016, 8:54am

My point was in response to @asbjornu’s comment stating that the algorithms for generating the IDs could be different. If indeed the algorithms are different then the generated links (from those IDs) could be different.

Regarding the reordering of headings, I’m in agreement that the IDs could be generated in a way that ensures uniqueness in most, but not all, cases. Whether this can be done in a way that keeps the URLs aesthetically pleasing, I’m not sure. If the URL is considered part of a website’s design, then flat ordinal IDs might be preferable (to the designer/author) over longer IDs which concatenate heading strings together but are less prone to duplication (as you suggested earlier, @Crissov). Aesthetic considerations of the generated IDs shouldn’t be overlooked here, because designers may wish to make their URLs beautiful and easy to read and write.
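For comparison with flat ordinal IDs, here is a small Python sketch of what "concatenating heading strings" might look like: each ID is built from the slugs of the heading and its ancestors. This is one assumed reading of the idea, not a restatement of @Crissov’s actual proposal, and `hierarchical_ids` is a hypothetical name.

```python
import re


def slugify(text):
    """Lowercase and replace runs of non-alphanumerics with a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def hierarchical_ids(headings):
    """headings: list of (level, text) in document order.

    Returns one ID per heading, made of the ancestor slugs joined by hyphens.
    """
    path = []  # (level, slug) pairs for the current chain of ancestors
    ids = []
    for level, text in headings:
        # Leave any section that is at the same depth or deeper.
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, slugify(text)))
        ids.append("-".join(slug for _, slug in path))
    return ids


outline = [(1, "Install"), (2, "Linux"), (1, "Usage"), (2, "Linux")]
print(hierarchical_ids(outline))
```

The two "Linux" headings get distinct IDs without ordinals, at the cost of longer fragments; identical siblings under the same parent would still collide.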

tmpfs · February 2, 2016, 11:56am

I agree with this; however, maybe implicit IDs are the first thing to support.

Conflicting IDs between Markdown rendering and a containing HTML document should be resolved by a preprocessor (or some post-processing or validation) so I don’t think that’s too much of an issue if IDs were automatically generated.

If IDs are automatically generated then it needs to be clearly specified, but this sounds like a feature extension and post 1.0.

However, if they are implicit then I think something like:

#headingid Heading Title
Heading Title
=============heading-id

is the cleanest and is in tune with the info string on fenced code blocks. However, you would not be able to use spaces in the ID, though I (and most people, I think) would consider spaces in an ID bad practice.

I believe I saw a discussion about requiring a single space after the # in ATX headings, that would also need to be ratified for the above to be possible.

Jeremy_Morton · February 2, 2016, 1:59pm

My 2 cents: I pretty much agree with hulkur here.

It would be a nice feature to have anchor links in Markdown, but given the problems of auto-generating them based on heading text (heading text changing, identical headings being re-ordered) it should probably require an explicit ID. The explicit ID could either be applied to the header or even just put anywhere in the document in order to generate an empty a tag with that ID, to link to that point in the document. Normalize the ID by lowercasing it, changing spaces to dashes, and removing all other non-dash punctuation.

I would also prefix the IDs in an attempt at giving them a unique namespace (obviously one can never guarantee this unless one has access to the entire HTML document but it’s a reasonable precaution) - perhaps markdown-anchor-? For example:

[# Step 2 - config]{Step 2!}
Configure the software by doing stuff...
More text...
Here is somewhere inline you can []{inline-link}link to.

… generates:

<h1 id="markdown-anchor-step-2">Step 2 - config</h1>
<p>Configure the software by doing stuff...
More text...
Here is somewhere inline you can <a id="markdown-anchor-inline-link"></a>link to.</p>

If there are any duplicate anchor IDs, the parser should warn the user (or maybe refuse to render the Markdown until the duplicates are removed).

Any links in the document could then be normalized so that they linked to the generated fragment IDs:

[Link to heading](#STEP-2)

[Link to inline](#inline link)

… generates:

<p><a href="#markdown-anchor-step-2">Link to heading</a></p>
<p><a href="#markdown-anchor-inline-link">Link to inline</a></p>

JavaScript could optionally be used to allow linking to fragments without the markdown-anchor- prefix, much as GitHub does.
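The normalization and duplicate check described in this post can be sketched in Python as follows. The function names are hypothetical; the rules (lowercase, spaces to dashes, strip other punctuation, add the markdown-anchor- prefix) follow the description above, and treating the underscore as punctuation is an assumption.

```python
import re

PREFIX = "markdown-anchor-"


def normalize_anchor(raw):
    """Lowercase, turn whitespace into dashes, drop other punctuation, add the prefix."""
    text = raw.strip().lower()
    text = re.sub(r"\s+", "-", text)
    # Keep letters, digits and dashes; the underscore is treated as punctuation here.
    text = re.sub(r"[^\w-]|_", "", text)
    return PREFIX + text


def find_duplicates(raw_ids):
    """Return normalized anchors that occur more than once, in first-repeat order."""
    seen = set()
    duplicates = []
    for raw in raw_ids:
        anchor = normalize_anchor(raw)
        if anchor in seen and anchor not in duplicates:
            duplicates.append(anchor)
        seen.add(anchor)
    return duplicates


print(normalize_anchor("Step 2!"))
print(normalize_anchor("inline link"))
print(find_duplicates(["Step 2!", "STEP-2", "inline-link"]))
```

Applying the same function to both the declared IDs and the link targets is what lets `#STEP-2` and `{Step 2!}` meet at the same fragment.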

Crissov · February 2, 2016, 11:15pm

I prefer overwriting implicit IDs with extended reference link syntax, i.e. an indirect approach. If there needed to be direct explicit IDs (and classes) anyhow, I always thought the obvious way was like this (remember line suffixes?):

# Heading Title # .class #headingid "title"

Heading Title
============= #heading-id .class @for

jgm · February 2, 2016, 11:30pm

I agree that different sites may have different needs for automatically generated header IDs. So I’d hate to specify this in the spec. But it might be worth adding implicit links to headers, leaving the assignment of IDs up to the implementation. They might work like this: [My first section] links to a section with contents My first section, unless the reference label [My first section] is explicitly defined in the document. If there are multiple sections with contents My first section, it links to the first one.

This would be fairly simple to implement and would solve most everyday section linking needs. For more control, we could consider a way to specify IDs explicitly. In pandoc and several other implementations, you do it this way:

# Heading {#myid}

[EDIT: of course, leaving the precise IDs undefined complicates testing.]
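The lookup rule described above (an explicit reference definition wins, otherwise the first heading with matching contents) can be sketched in Python like this. The names and data shapes are hypothetical, and the label normalization only approximates CommonMark’s matching rules (case folding plus whitespace collapsing).

```python
def normalize_label(label):
    """Collapse whitespace and case-fold, approximating reference label matching."""
    return " ".join(label.split()).casefold()


def resolve_reference(label, definitions, headings):
    """Resolve [label] to a link destination, or None if nothing matches.

    definitions: dict mapping explicitly defined reference labels to URLs.
    headings: list of (text, id) pairs in document order; how the IDs were
    assigned is left to the implementation.
    """
    key = normalize_label(label)
    # 1. An explicit link reference definition takes precedence.
    for defined_label, url in definitions.items():
        if normalize_label(defined_label) == key:
            return url
    # 2. Otherwise link to the first heading whose contents match.
    for text, heading_id in headings:
        if normalize_label(text) == key:
            return "#" + heading_id
    return None


headings = [("My first section", "sec-1"), ("Other", "sec-2"), ("My first section", "sec-3")]
print(resolve_reference("My first section", {}, headings))
print(resolve_reference("My first section", {"my first section": "http://example.com/"}, headings))
```

Because the author only ever writes the heading text, the generated IDs (`sec-1` here) never appear in the source, which is why they can differ between implementations without breaking internal links.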

tmpfs · February 3, 2016, 1:30am

I prefer this for legibility and alignment. I think we should address classes and other attributes separately but your suggestion does allow for them neatly.

zzzzBov · February 3, 2016, 1:35am

That’s exactly why I recommended we drop this feature, but it didn’t seem to get noticed much and everyone kept on discussing it as though it’s a good idea to violate SRP for a one-size-fits-nobody approach to adding some [id] attributes.

tmpfs · February 3, 2016, 2:07am

Then it makes sense to me to do it that way. I do however like the ability to specify IDs; if the {#myid} syntax is already familiar to people then I suggest we follow that convention too (I haven’t used it yet).

I agree that different sites may have different needs for automatically generated header IDs

For me, that rules out automatic IDs as part of Commonmark. I suggest any automatic ID generation is left to a processor.

leaving the assignment of IDs up to the implementation.

I worry that implementations would treat this so differently that it could cause headaches for users wishing to switch implementations.

After more consideration, I prefer that Commonmark not generate IDs automatically, and that implicit IDs be implemented, likely following the existing {#myid} convention, although I do wonder if those curly braces can be dropped.

Crissov · February 3, 2016, 8:20am

The syntax with curly braces may have one advantage in that it can be applied to all kinds of blocks and even inline markup. Like all explicit metadata it makes the input source code, which, in the case of Markdown/Commonmark, is always also the simplest form of output, less readable. IOW, it’s against the spirit.

The “prefixed line suffixes” variant I introduced could work well for headings (both ATX and Setext) and fenced blocks as well as thematic breaks, but needs further work for other types:

With optional line suffix / terminator

1. Enumerated list item . #ID
* Bullet list item * .class
> Quotation < "title"

Without optional line suffix / terminator

1. Enumerated list item #ID
* Bullet list item .class
> Quotation "title"

I’d expect #ID and .class to work surprisingly well, except that the former may clash with hashtags, but the title syntax derived from links is just asking for trouble. You could still add optional curly braces for ambiguous cases, of course.

chrisalley · February 3, 2016, 8:40am

The use of square brackets is quite common. I can see the [My first section] syntax creating links in Markdown documents where a link was not intended by the author. Unless you meant the reference in reference style links only? e.g. [Click/tap this text to visit my first section][My first section]. I can’t think of any strong objections to this latter syntax. But…

It complicates testing and there’s the problem of the IDs being different between implementations. As an opt-in extension, having a consistent algorithm for generating the IDs would be useful, as this would allow a subset of CommonMark documents to all count on external links to headings not breaking if the CommonMark implementation is swapped out with another implementation. If such an extension existed then you could define a link such as…

[Click/tap this text to visit my first section](#my-first-section)

…and the heading…

## My First Section

…and count on it working across all CommonMark implementations which opt in to the Implicit IDs extension. It’s not going to solve all of the issues raised in this topic (reordering headings, etc.), but for a sizeable number of documents (wiki articles, forum posts, etc.) it would probably be reliable enough for their use cases. For documents that require more certainty in how the IDs are generated we could have a separate explicit IDs extension and leave it up to the application developer to choose which extension (implicit IDs, explicit IDs, or both) to use.

Crissov · February 3, 2016, 2:30pm

No, authors must never be required to use (and determine first) the actual ID. Some may choose to do that, though.

With default settings, Pandoc is the only current implementation available in Babelmark that seems to get it right.

zwol · February 3, 2016, 3:44pm

For the nth time, this is a request for mandatory generation of ids for all headers, as a core component of the specification. An extension is not good enough. Optional-to-implement is not good enough. Only-if-the-author-does-something is not good enough. It must be in the core, it must be mandatory to implement, and it must apply to all headers. Only that will move us toward a world where all HTML documents always have IDs on all of their headers.

All the picayune stuff this request keeps getting sidetracked on - what the IDs actually are, how the author can control them, whether the author should be able to opt-out some headers (no), etc - is not as important as the principle.

Mandatory generation of IDs.
For all headers.
In the core specification.

asbjornu · February 3, 2016, 4:57pm

Not if the identity of the ID includes the position of the header in the outline. Not that I think such a thing should be mandated by the specification, but it can be explained in an informative section on “best practice”.

I agree that it’s preferable, but I don’t see it as an absolute requirement. It’s a nice-to-have and something that can be achieved given a best practice algorithm, but since we don’t yet know what that algorithm will look like, I think the feature can be added to the Commonmark syntax and eventually that algorithm will surface. When it does, it can be added as a reference to the core language specification.

I don’t see what problem an explicit ID is solving here. While I do agree it should be possible to have explicit IDs, they will suffer from the exact same synchronicity problem as implicit IDs. The link referencing the anchor will have to reference something. That something can change, whether it is an explicit ID or the text of a header.

Jeremy_Morton · February 3, 2016, 9:50pm

I don’t see what problem an explicit ID is solving here. While I do agree it should be possible to have explicit IDs, they will suffer from the exact same synchronicity problem as implicit IDs. The link referencing the anchor will have to reference something. That something can change, whether it is an explicit ID or the text of a header.

It is solving the problem that the auto-generated ID will change if you change the heading text or the ordering of 2 identically-named headings. Your anchor ID will stay the same unless you explicitly change it, meaning that your links to it will not get broken.
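A short Python sketch of that difference, using the pandoc-style `{#myid}` suffix mentioned earlier in the thread. The regular expression and the fallback slug rule are assumptions for illustration, and `heading_id` is a hypothetical name.

```python
import re

# ATX heading with an optional trailing {#explicit-id} attribute.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)(?:\s+\{#([A-Za-z][\w:.-]*)\})?\s*$")


def heading_id(line):
    """Return the explicit ID if present, else an ID generated from the text.

    Returns None when the line is not an ATX heading.
    """
    match = HEADING.match(line)
    if match is None:
        return None
    text, explicit = match.group(2), match.group(3)
    if explicit:
        return explicit
    # Assumed fallback: lowercase, runs of non-alphanumerics become one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


# Rewording the heading changes the generated ID but not the explicit one.
print(heading_id("## Installing the tool"), heading_id("## Installation guide"))
print(heading_id("## Installing the tool {#install}"), heading_id("## Installation guide {#install}"))
```

The explicit ID only breaks links when the author edits the ID itself, whereas the generated one breaks whenever the heading is reworded.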