Metadata in documents

It would be good to have a list of YAML capabilities to support natively. I can see binary data being useful in certain contexts.

Well, if you can get a list of core features that need to be kept, then that can be a start. (And maybe we can forward that to either the YAML or diet-YAML team.)

A good yardstick could perhaps be whether it can convert from/to JSON/JSON5 at the least (if I remember correctly, JSON is not a very large specification).


Alternative proposed name for this stripped down YAML:

cYAML - Core YAML - C YAML : Small fast minimally specced YAML with fast C reference parser. (I think at minimum, you should be able to convert between JSON and YAML)

I want to make the point that there are two ways in which we might
think of a YAML metadata section fitting in to documents.

A. One is to think of YAML metadata blocks as part of CommonMark,
adding them to the spec, etc. With this method, metadata would be part of
the AST produced by a CommonMark parser. (That’s how it is in pandoc.)

B. Another is to think of a document like:

    ---
    title: My title
    author: Me
    ---

    Starts here...

as really the combination of two separate documents – one, a YAML
document, which ends at the second ---, the other, a CommonMark
document.

A processor would then divide the two documents, parse the YAML document
with a proper YAML parser (or, if it likes, just skip it), and parse
the rest with a CommonMark parser.

The processor could use the values gleaned from the YAML document in
any way it likes – interpolating them into templates, for example,
and even perhaps running their string contents through a CommonMark
parser. The CommonMark spec wouldn’t need to know about this.
(This is how it is done in Jekyll, for example.)
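
A minimal sketch of option B, for illustration only. The function name `split_front_matter` is hypothetical, and the rules it applies (the block is recognized only when the very first line is exactly `---`, and it ends at the next line that is exactly `---`) are assumptions, not something any spec defines. It only divides the text; the YAML part would then go to a real YAML parser and the rest to a CommonMark parser.

```python
def split_front_matter(text):
    """Split a hybrid document into (yaml_source, commonmark_source).

    Assumption: metadata is recognized only if the first line is exactly
    '---', and it ends at the next line that is exactly '---'.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        return None, text  # no metadata block: everything is CommonMark
    for i in range(1, len(lines)):
        if lines[i].rstrip("\r\n") == "---":
            # Lines between the markers are YAML; the rest is CommonMark.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return None, text  # unterminated block: treat the whole file as CommonMark


doc = "---\ntitle: My title\nauthor: Me\n---\n\nStarts here...\n"
yaml_source, body = split_front_matter(doc)
```

Neither half is interpreted here, which is the point of option B: the CommonMark parser never sees the metadata.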

@jgm, this is already the second time there seems to be something missing from your message:

First:

Second:

Does that email envelope (in the top right corner) mean that those replies somehow come through email, which might eat part of your message?

I have edited my post above. I did use email, and I note that I only indented the code block 3 spaces. For some reason, this caused Discourse to omit everything following. (I also see that if you click on the envelope icon, you can see the unedited contents of the original email.)

I guess your distinction boils down to whether CommonMark is “just” a markup language or also a file format. For example, HTML is clearly the latter, as it also specifies doctypes (<!DOCTYPE html>), and a head element containing metadata.

P.S.

Seems only the poster can do that, at least it doesn’t work for me.

Below is a reply from the yaml-core mailing list, from Ingy dot Net (ingy@ingy.net):


Thanks for bringing this up to the yaml-core mailing list. I’m not sure
even where to start. I’ll throw out some random points that come to mind:

  • YAML was designed to be a full, cross-language, data serialization
    language
    • It is just a current state of affairs that people use it mostly for
      trivial purposes like config files
    • There are minimal (not yaml.org approved) YAML implementations,
      that only exist in a particular language like Perl’s YAML Tiny
      https://metacpan.org/release/YAML-Tiny
  • I started the YAML2 discussions https://github.com/yaml/YAML2/wiki 3
    years ago to make YAML less complex without losing its powers
  • I’m working on a Pegex based YAML implementation that will generate
    parsers in all YAML languages from a single grammar
  • There are only 3 major differences between YAML and JSON (at the data
    model level):
    1. References
    2. Tags/types
    3. Non-string mapping keys
  • YAML implementations can be complete, full-stack, or minimal
    text→native

I think that the YAML spec documents cause implementor confusion because it
is unclear what needs to be implemented. These are my opinions on what
should be properly conveyed:

  • The YAML 1.2 syntax as specced is correct. (Though a 2.0 could make
    it simpler)
  • The default schema should only support JSON types: Str, Num, Bool,
    Null, Map, Seq
    • ie no Date, Set, OMap or any other should be made available by
      default
  • Only true/false/null (from JSON) should be implicitly recognized. Not
    the Yes/No/True/False/… options.
  • Merge key is something that should only be available as a plugin. This
    was just an idea we threw out, and for some legacy reasons some of the
    implementors implemented it and some did not.
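
To make the implicit-typing point concrete, here is a rough sketch of how a plain (unquoted) scalar might be resolved under a JSON-only default schema. The function name `resolve_plain_scalar` is hypothetical, and using JSON's number grammar for Num is an assumption; the post above does not spell out the number syntax.

```python
import re

# Assumption: JSON's number grammar stands in for "Num" in the default schema.
_JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?")

# Only the three JSON constants are recognized implicitly.
_CONSTANTS = {"true": True, "false": False, "null": None}


def resolve_plain_scalar(text):
    """Map a plain scalar to Bool, Null, Num, or Str."""
    if text in _CONSTANTS:
        return _CONSTANTS[text]
    if _JSON_NUMBER.fullmatch(text):
        # A fraction or exponent makes it a float, otherwise an int.
        if any(c in text for c in ".eE"):
            return float(text)
        return int(text)
    # Yes/No/True/False, dates, and everything else stay plain strings.
    return text
```

Under this sketch `Yes`, `True`, and `2014-11-22` all load as strings, so a loader and a JSON parser would agree on every scalar type.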

It seems we need a YAML implementors guide. I’m thinking that what you are
seeking could be part of this. I would encourage people not to fork YAML to
a simpler form, but to simply make weaker/simpler implementations according
to an agreed upon guide. Here are some basic thoughts on how this might
look:

  • Format is called YAML
    • .yaml and .yml extensions are used
  • Implementations can be called SimpleYaml or somesuch
  • Basic Loader restrictions:
    • Explicit Tags throw error on parse
    • Flow forms throw error on parse (except empty [] {} which have no
      block form)
    • JSON schema as above
    • Anchor/Alias throw error on parse
    • Non-string (plain/quoted) keys throw errors
    • No stack. Loader == Parser→Constructor
  • Dumper restrictions:
    • Dumpers must produce streams loadable by Loader above
    • Streams must be loadable by any more complex loader

In conclusion, there are ways to make YAML simpler on many levels without
forking it. I personally am interested in discussing them.

Consider joining #yaml on irc.perl.org to discuss further.

Cheers, Ingy

I strongly oppose this feature and hope that we can see through the flaws in this idea enough to drop it. As a standard, markdown should stay simple and focused on ease of implementation and universal compatibility. By introducing a data format - of any kind and any level of complexity - we will be introducing a feature that complicates this medium and cripples how other libraries can work with it.

Markdown can stand on its own, but metadata cannot. It must have an ultimate purpose, like being passed to a template engine as context to be used in templates, being passed to renderers/parsers as an options object, or whatever is required for the use case. Given that, we need to allow implementors to use whatever solution makes the most sense for parsing metadata and use the markdown parser they want for parsing markdown.

By implementing metadata, markdown will now have “compatibility issues”. As it stands, markdown has a clear purpose in life, which makes it easy to see how it fits into any application. This will not be the case if data enters the picture. The problem is that, regardless of best intentions, this feature will never be able to satisfy the needs of every user, parser, renderer, template engine, or implementor who might need such data. This means that other solutions will still need to be implemented for parsing data, which not only complicates decisions and implementation strategies, but will virtually guarantee confusion among users who want to use both this solution and the implementor’s solution, or some combination of those things.

Markdown is not a data format, but it will be if this is implemented. We’ll need to decide which data format is correct, how much is “just enough”, who the consumers will be, etc. and this slippery slope will ultimately lead to religious battles over how much data is too much and: 1) why “my favorite data format isn’t supported”, 2) “can I use this data along with my jekyll front matter, or instead of it? because then I can’t use all of jekyll’s features”, etc. etc.

Data formats are use-case specific, and should not be related to “file type”: e.g. there are many document data and front matter parsers for many use cases, and none of them have any specific relationship to markdown. Why are we trying to create one? In other words, since front matter parsers will parse front matter from any file type (e.g. markdown, handlebars templates, HTML documents, whatever), if this feature is implemented, how should users format their data when both templates and markdown files are used? Should they ask the front matter parsing library to adopt the format you decide on here for handlebars templates (not going to happen)?

Parsing front matter is trivial. One can write a front-matter parser to extract data from a document in ~20 sloc, the result of which is a nice, clean string of pure markdown and an object of data that was created from whatever language the implementer preferred to use. By implementing this feature in markdown, you will greatly complicate this task by necessitating strategies for data conflict resolution and so on.

You raise some excellent points @jonschlinkert.

Is there a reason why metadata needs to be placed in the same file? A separate YAML file that points to the Markdown file could solve the Jekyll use case at least. Separation of concerns.

Yes. That’s convenient sometimes. For example, in blog posts (title).

But I’m not sure such posts should be parsed directly by a markdown parser, without a preprocessor.

I think it’s much safer if it’s in a block generic directive, since displaying metadata between platforms (html, paper, etc…) is highly variable, and data (and thus metadata) in general are much more fragile than normally human typed text.

Maybe metadata should be thought of as “recommended” best practice when used as settings for various generic directives (basically restricted YAML syntax). As for embedding metadata within a document (rather than risking it getting lost by placing it in a separate file), it is best used with its own generic directive.

For those that need it maybe we can support these generic directives:

!metadata :~ Included in all parsers, but a stripped down YAML, to cover most obvious use cases. This should remain small and mostly unchanged throughout the life of CommonMark.

!YAML :~ optional extension (included in fatter parsers like pandoc) of full YAML

!json :~ optional extension of full JSON

So basically, keep the default metadata syntax as small as possible, and make full metadata support optional. Hopefully this addresses @jonschlinkert’s concerns that this would make CommonMark too complex and unwieldy.

I think that the CommonMark spec should contain at least some very simple and minimal metadata format, so that applications that rely on a CommonMark parser would be able to use at least trivial key-value pairs out of the box, for example ^([0-9a-z]+):([^\n]*)$. If the app needs more, it can store base64-encoded data, a JSON object, or anything else in one value; for everyone else it will be just a string.
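
As a rough illustration of how little code such a format would need, here is the pattern from the paragraph above applied line by line. The helper name `parse_metadata` is hypothetical, and stripping whitespace around values is an assumption; the regex itself says nothing about it.

```python
import json
import re

# The pattern quoted above; MULTILINE makes ^ and $ match at each line.
METADATA_LINE = re.compile(r"^([0-9a-z]+):([^\n]*)$", re.MULTILINE)


def parse_metadata(block):
    """Return a dict of key -> raw string value (hypothetical helper)."""
    # Assumption: whitespace around a value is not significant.
    return {key: value.strip() for key, value in METADATA_LINE.findall(block)}


meta = parse_metadata('title: My title\nauthor: Me\nextra: {"tags": ["a", "b"]}\n')

# To the parser every value is just a string; an application that wants
# structure decodes the value itself, here as JSON.
tags = json.loads(meta["extra"])["tags"]
```

Note that the pattern as written only accepts lowercase keys, so a line such as `Title: x` would simply be ignored.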

Since metadata is in fact application-specific, an application that requires something very complex can be expected to implement that on its own but it would help other developers if the minimum is already in place.

Another nice thing the spec could do is to list common metadata keys, such as Author, Title, Description etc. - so that content management systems have a reference from where to read such attributes.

I think I agree with all of @jonschlinkert’s comments here.

But since it will presumably be common to have “hybrid” documents with some metadata at the top, followed by CommonMark text, I wonder if it would make sense to have the spec define a recognizer for front-matter metadata, so that all CommonMark parsers would know to skip this and go right to the text.

For example: if the document starts with a line containing ---, skip to the next line containing just --- or ..., and start parsing CommonMark after that.

This would fall far short of specifying a metadata format. Between the opening and closing metadata signs, you could have anything you like – so, you could use YAML, or JSON, or XML, or lua tables, or a custom key-value store. Parsing this would be application-specific, but conforming CommonMark parsers would know to skip it.

The advantage is that, with this feature, you could run your hybrid metadata/CommonMark file through any CommonMark parser and get good results, not the garbage that would result if the metadata were parsed as CommonMark.
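
A sketch of what such a recognizer might look like, assuming the rule as stated above. The name `skip_front_matter` is hypothetical, and two details are assumptions the proposal leaves open: trailing whitespace on a marker line is tolerated, and an unterminated block means nothing is skipped.

```python
def skip_front_matter(text):
    """Return only the part of `text` that should be parsed as CommonMark."""
    lines = text.splitlines(keepends=True)
    # The opening marker must be the very first line of the document.
    if not lines or lines[0].rstrip() != "---":
        return text
    for i, line in enumerate(lines[1:], start=1):
        # Either '---' or '...' closes the block; its content is never inspected.
        if line.rstrip() in ("---", "..."):
            return "".join(lines[i + 1:])
    return text  # assumption: no closing marker means nothing is skipped


body = skip_front_matter('---\n{"title": "T"}\n...\n# Heading\n')
```

Because the content between the markers is never examined, the same function works whether it holds YAML, JSON, XML, or anything else.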

There are several topics here requesting an official “Do Not Even Attempt to Parse This Section” delimiter or block element.

That seems like the safer, saner choice.

Jekyll is the most popular program for static site generation according to https://www.staticgen.com/

As is widely known here, it uses --- to --- for encasing YAML entries.

Do any other programs adopt the same format? The concept of treating --- to --- or ... as a ‘do not parse’ section for CommonMark makes sense, but it is pretty much a ‘do not parse’ command that could only be safely included at the top, due to the potential clash with the --- horizontal rule (unless I am mistaken). Should there be a more general delimiter or fencing character for a “do not parse” command?

Either way, Jekyll style “do not parse” section is a good approach for dealing with metadata in an impartial manner.


Extra: we could perhaps avoid the ‘clash with horizontal rule’ by simply disallowing empty lines between the --- markers.

+++ mofosyne [Nov 22 14 11:04 ]:

Any other programs that adopts the same format?

Among those known to me: Pandoc, Hakyll, Gitit.

Some other implementations were mentioned by @lu_zero in the document titles topic.

Also, Middleman.

Okay, well it seems pretty clear then. “Do not parse or show” :

http://talk.commonmark.org/t/jekyll-style-do-not-show-sections/918

This is not quite a “Do not parse, but show” island, but rather a “Do not parse or show” island. We need a different syntax for those who want “no parsing, but show in HTML”.

The “Do not parse or show” will be useful for comments, application specific metadata, etc…

An implementers’ guide is recommended to build a general consensus on optional metadata interpretation, so that most simple documents can have readable metadata (e.g. stripped down YAML). But the core of CommonMark will not include metadata interpretation (it could have a hook, though).

If you want to render the data as HTML, why not just use a description list?

That’s fine. What I meant is a way for text to go straight to HTML/doc/etc. without any parsing (but not hidden from external parsing), e.g. a no-Markdown island. Not visual metadata.

No-Markdown islands are an altogether different kind of data, closer to code blocks than to the type of metadata used by Jekyll. For reference, there’s already a topic about no-Markdown islands. I agree that (since these two features are quite different) they should have different syntax.