Less restrictive bare keys #337

Closed
mwanji opened this issue Jun 30, 2015 · 17 comments

Comments

@mwanji
Contributor

mwanji commented Jun 30, 2015

Bare keys are currently restricted to A-Za-z0-9_- but I don't get the rationale. The only character that really needs to be escaped is the dot (.). Is anything else, including spaces, a problem for parsers or for easy comprehension?

@BurntSushi
Member

This is the PR that merged it with the rationale: #283

@mwanji
Contributor Author

mwanji commented Jul 12, 2015

The justifications seem to be: "easy to understand", "guides users to choose simple key names" and "eliminate any weirdness that could come from having to deal with undelimited Unicode". I may be underestimating the difficulty of dealing with undelimited Unicode, but I disagree somewhat.

However, in languages other than English, not being able to use accented characters might make things more difficult and less clear.

A technical problem that arises is that quoted keys make round-tripping from a class to TOML and back difficult in some cases. For example: a class with äbc = 5 becomes "äbc" = 5 in TOML. So translating it back to code requires some perhaps surprising heuristics.

@BurntSushi
Member

> So translating it back to code requires some perhaps surprising heuristics.

Can you elaborate? It seems like you'd have to scan the key name to determine whether it needs quotes or not.

> I may be underestimating the difficulty of dealing with undelimited Unicode, but I disagree somewhat.

Yes. Reasonable people can definitely disagree on this point. I tend to like keeping unquoted identifiers simple because it makes it easier for the human writing the config to reason about when quotes are needed.
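
To make the "scan the key name" step concrete, here is a minimal Python sketch of how a serializer might decide whether a key needs quotes. It assumes the bare-key character set quoted at the top of this thread (A-Za-z0-9_-). `format_key` is a hypothetical helper, not part of any TOML library, and its escaping covers only backslashes and double quotes.

```python
import re

# Bare-key character set as described in this thread: A-Za-z0-9_-
BARE_KEY = re.compile(r"[A-Za-z0-9_-]+")


def format_key(key: str) -> str:
    """Return the key as it could be written in a TOML document (sketch)."""
    if BARE_KEY.fullmatch(key):
        return key  # safe to emit without quotes
    # Otherwise emit a quoted key. Only backslash and double quote are
    # escaped here, which is a simplification.
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    for key in ["street-address", "äbc", "a.b", ""]:
        print(f"{key!r} -> {format_key(key)}")
```

With this rule the writer of the config never has to guess: anything outside the small ASCII set, including the empty key, gets quotes.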

@mwanji
Contributor Author

mwanji commented Jul 13, 2015

> Can you elaborate? It seems like you'd have to scan the key name to determine whether it needs quotes or not.

Yes, but is that what the user expects? Different libraries handle this differently. Compare JS libs toml-node and toml-j0.4:

# TOML input, referred to as input in JS
"ä" = 5
toml_node.parse(input) // => { "ä": 5 }
toml_j04.parse(input) // => { ä: 5}

If I then use tomlify-j04 to convert them back to TOML:

# from toml-node output
"\"ä\"" = 5.0

# from toml-j0.4 output
"ä" = 5.0

The restricted expressiveness of bare keys relative to programming language variable names leads to unhelpful disagreements between libraries. Mine, toml4j does the same as toml-node, while toml-rb (from what I can make out) follows toml-j0.4. This could perhaps be resolved in the spec or in toml-test, but I think lifting the restrictions on bare keys would reduce the scope of the ambiguity.

Also, this restriction discriminates a bit against languages other than English. For example, French, Greek or Chinese users have to quote all their keys, or write them in English. That isn't necessarily simpler or easier to understand, from their point of view.
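
The toml-node / toml-j0.4 difference shown above comes down to whether the parser treats the quotes as part of the key. A rough Python sketch of the two interpretations follows. `parse_key_keep_quotes`, `parse_key_strip_quotes` and `write_key` are hypothetical stand-ins for illustration, not the actual code of any library mentioned here, and escape sequences inside quoted keys are ignored.

```python
import re

BARE_KEY = re.compile(r"[A-Za-z0-9_-]+")  # bare-key set from this thread


def parse_key_keep_quotes(raw: str) -> str:
    """Interpretation 1: the quotes are part of the key name."""
    return raw


def parse_key_strip_quotes(raw: str) -> str:
    """Interpretation 2: the quotes are only delimiters (escapes ignored)."""
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]
    return raw


def write_key(key: str) -> str:
    """Quote a key on output unless it is a valid bare key."""
    if BARE_KEY.fullmatch(key):
        return key
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    raw = '"ä"'  # the key exactly as written in the TOML input
    print(write_key(parse_key_keep_quotes(raw)))   # quotes get escaped and re-quoted
    print(write_key(parse_key_strip_quotes(raw)))  # round-trips unchanged
```

Under the first interpretation every round trip adds another layer of escaped quotes, which matches the `"\"ä\""` output reported above.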

@BurntSushi
Member

It looks like toml_node gets it wrong or doesn't know about quoted keys. (Quoted identifiers are a relatively recent addition.) In other words, this isn't a disagreement between libraries; it's a compliance issue with the spec itself.

> but I think lifting the restrictions on bare keys would reduce the scope of the ambiguity.

What exactly is the ambiguity? Can you point it out in the spec?

> Also, this restriction discriminates a bit against languages other than English. For example, French, Greek or Chinese users have to quote all their keys, or write them in English. That isn't necessarily simpler or easier to understand, from their point of view.

It isn't necessarily more complex either, but I could see how some might consider this a negative of restricted identifiers.

@mwanji
Contributor Author

mwanji commented Jul 13, 2015

> It looks like toml_node gets it wrong or doesn't know about quoted keys.

Are you saying that parsers should ignore the quotes when creating a data structure from a TOML input? E.g. "ä" = 5 should produce { ä: 5 }? My thinking was that the keys used to manipulate the data structure in code should be the same as the ones in the TOML input, quotes and all.

@BurntSushi
Member

> Are you saying that parsers should ignore the quotes when creating a data structure from a TOML input? E.g. "ä" = 5 should produce { ä: 5 }?

Uh, ya. I never even considered your alternative interpretation! That seems like something that could be clarified in the spec.

@ghost

ghost commented Jul 13, 2015

For BinaryMuse/toml-node, an issue about quoted keys was filed months ago: BinaryMuse/toml-node#21

It seems that not many people really cared about it, so I made my own library, jakwings/toml-j0.4, and learned some PEG parsing techniques for fun. Thanks for using it. :)

> Also, this restriction discriminates a bit against languages other than English. For example, French, Greek or Chinese users have to quote all their keys, or write them in English. That isn't necessarily simpler or easier to understand, from their point of view.

I'm Chinese. Even the equals sign =, brackets [] and periods . are not always easy to type while I am typing Chinese characters; it depends on my input method. (Furthermore, I am using a modified version of TOML for my own simplicity.)

But for Latin-originated languages and keyboards, typing these characters is not that hard?

@mwanji
Contributor Author

mwanji commented Jul 13, 2015

> But for Latin-originated languages and keyboards, typing these characters is not that hard?

It depends on which language your keyboard is in. Some are easier than others, but in general they're not more than a 2-key combo away. How do Chinese programmers type in these symbols, considering that they are very common across all programming languages?

@ghost

ghost commented Jul 13, 2015

@mwanji Oh, this is an embarrassing problem. ;-) Most of us just use ASCII characters, except for comments and string contents. And nearly all input methods provide an ASCII mode, or we can just switch off the IME.

@BinaryMuse

Aside: I apologize for the delay on BinaryMuse/toml-node#21. Quoted keys didn't work for the longest time; this was just a bug in the parser. It should be good to go in the latest version.

@ChristianSi
Contributor

I reluctantly agreed when the restriction on bare keys was introduced, but I was never happy with it. The problem is that it introduces a strong bias in favor of English-only vocabularies which TOML didn't have before.

Considering as a totally arbitrary example that my config file includes the following keys:

author
translator
street-address
city
postcode

That works fine, but assuming my app is targeted at German users and therefore uses German keys:

Autor
Übersetzer
Straße
Ort
Postleitzahl

Now I have to tell my users that they need quotes around "Übersetzer" and "Straße" while they can use the other keys unquoted. That would be annoying and confusing.

I can also tell them to use quotes around all keys. That works and is less confusing, but also makes TOML a bit less convenient to read and write. (That may be a matter of disagreement, but I certainly find it inconvenient that I have to quote all keys in JSON!)

I would therefore suggest reconsidering this restriction and allowing (more or less) arbitrary Unicode letters in bare keys. Definitions of identifiers in languages such as JavaScript, Java or XML could provide a starting point for such a generalization, as they all avoid the "English preferred" bias.
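
The mixed quoting described for the German keys can be checked mechanically. The short Python sketch below applies the A-Za-z0-9_- rule from the top of this thread to both key lists; `needs_quotes` and `keys_needing_quotes` are hypothetical helper names used only for this illustration.

```python
import re

BARE_KEY = re.compile(r"[A-Za-z0-9_-]+")  # bare-key set from this thread

ENGLISH_KEYS = ["author", "translator", "street-address", "city", "postcode"]
GERMAN_KEYS = ["Autor", "Übersetzer", "Straße", "Ort", "Postleitzahl"]


def needs_quotes(key: str) -> bool:
    """True if the key cannot be written as a bare key."""
    return BARE_KEY.fullmatch(key) is None


def keys_needing_quotes(keys: list[str]) -> list[str]:
    """Return the subset of keys that would have to be quoted."""
    return [key for key in keys if needs_quotes(key)]


if __name__ == "__main__":
    print("English:", keys_needing_quotes(ENGLISH_KEYS))
    print("German: ", keys_needing_quotes(GERMAN_KEYS))
```

None of the English keys need quotes, while two of the five German keys do, which is the inconsistency a German-speaking user would run into.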

@mojombo
Member

mojombo commented Jan 25, 2016

Everything I said in #283 still holds. TOML 1.0 will have restricted bare keys, but if TOML adoption becomes significant and we can find a reasonable way to deal with undelimited Unicode, then I'd consider it for a future version of TOML.

mojombo closed this as completed Jan 25, 2016

@Hrxn

Hrxn commented Jan 26, 2016

#337 (comment)

> I reluctantly agreed when the restriction on bare keys was introduced, but I was never happy with it. The problem is that it introduces a strong bias in favor of English-only vocabularies which TOML didn't have before.

What bias are you talking about here? That every programming language under the sun is based on the English vocabulary? Well, yeah, true. But that ship sailed looong ago.

Don't bother with the past, because it can't be changed anyway...

@ChristianSi
Contributor

@Hrxn:

> What bias are you talking about here? That every programming language under the sun is based on the English vocabulary?

No, those are keywords, and TOML doesn't have any keywords. I'm talking about the bias regarding keys and table names, that is, identifiers. Now, practically all modern programming languages allow arbitrary Unicode letters and (except for the first letter) digits when naming identifiers.

I'd be happy if TOML said something such as "every sequence of Unicode characters that is a legal JavaScript [or Python, or whatever] identifier is also a valid (bare) key." That would remove the bias I have complained about.
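
As a rough illustration of the "legal Python identifier" variant of this idea, Python's built-in `str.isidentifier()` can stand in for the rule. This is only a sketch of one possible reading, not a proposal from the thread; `is_bare_key_identifier_rule` is a hypothetical name.

```python
def is_bare_key_identifier_rule(key: str) -> bool:
    """Accept a key as bare if it is a legal Python identifier (sketch)."""
    return key.isidentifier()


if __name__ == "__main__":
    for key in ["Übersetzer", "Straße", "street-address", "2fa"]:
        print(f"{key!r}: {is_bare_key_identifier_rule(key)}")
```

Note that an identifier-based rule taken on its own would reject keys that are valid bare keys today, such as those containing a hyphen or starting with a digit, so it would presumably have to be combined with the existing character set.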

@Hrxn

Hrxn commented Jan 31, 2016

> No, those are keywords, and TOML doesn't have any keywords. I'm talking about the bias regarding keys and table names, that is, identifiers.

Oh, really?
Thanks for the lecture, I guess, but this was definitely not my point.

I merely tried to say that there is no 'bias' in the first place. Nothing worth complaining about.

@TheElectronWill
Contributor

Here is a suggestion for bare keys:
Allow any character after the space character in the Unicode table (so no newlines, no spaces and no weird characters like NULL), except the following: periods, square brackets (open and closed), number signs (because they are used for comments), and equals signs.
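
Read literally, this suggestion can be sketched in a few lines of Python. The function name `is_bare_key_proposed` is hypothetical, and rejecting the empty key is an added assumption that the suggestion does not state.

```python
# Characters excluded by the suggestion: period, square brackets,
# number sign and equals sign.
EXCLUDED = set(".[]#=")


def is_bare_key_proposed(key: str) -> bool:
    """Literal reading: every character is above U+0020 and not excluded."""
    # Assumption: an empty key is not a valid bare key.
    return bool(key) and all(
        ord(ch) > 0x20 and ch not in EXCLUDED for ch in key
    )


if __name__ == "__main__":
    for key in ["Straße", "a.b", "a b", "a\u00a0b", 'a"b']:
        print(f"{key!r}: {is_bare_key_proposed(key)}")
```

Under this literal reading, characters such as the no-break space (U+00A0) and the double quote are still accepted, so the rule would probably need a few more exclusions before it matched the stated intent of "no spaces".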

arp242 added a commit to arp242/toml that referenced this issue Jun 2, 2023
This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other issues.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

All of this means we can push forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the response didn't seem
too hostile to the idea:
toml-lang#966 (comment)
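
The normalisation concern mentioned in the commit message can be illustrated with the "ä" key used earlier in this thread. The Python sketch below shows one way the problem might surface: the same visible key can be stored as two different code point sequences. `same_key` is a hypothetical helper, and whether TOML should normalise keys at all is exactly the open question, so this takes no position on it.

```python
import unicodedata

composed = "\u00e4"     # "ä" as one code point (NFC form)
decomposed = "a\u0308"  # "a" + COMBINING DIAERESIS (NFD form)


def same_key(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two keys after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(composed == decomposed)          # raw string comparison
    print(len(composed), len(decomposed))  # code point counts
    print(same_key(composed, decomposed))  # comparison after normalising
```

Without an agreed normalisation step, a parser comparing raw strings would treat these as two distinct keys even though they look identical on screen.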