With this amount of exceptions in the specification, it is no wonder that people...

jgraham · on Oct 7, 2020

The problem with XHTML is that the "abort on parse failure" behaviour simplifies a computer science problem at the expense of creating a business problem; now if you have an error somewhere in your content generation pipeline it means that the site goes down. That's a pretty difficult tradeoff given that most CMSes are absolutely not designed in a way that ensures well-formed markup.

Back in the dim and distant past when XHTML was a fashionable term to throw around, Evan Goer did a study of whether sites claiming to serve XHTML were actually doing so. The results were not pretty [1]. Some people took the results as a challenge and tried to ensure they were sending valid XHTML with the correct MIME type so that browsers would catch fire in the case of a parsing failure. In almost every case it turned out to be possible to get their sites to break with user generated content (e.g. searching for XML-invalid characters which were then echoed back onto the page).
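A rough sketch of that failure mode, using Python's standard library as a stand-in for the browser's XML and HTML parsers (an assumption for illustration; `render_results` is a hypothetical template, not anything from the study). Even with correct escaping of `<>&`, a control character in the echoed query makes the document ill-formed XML, while a lenient HTML tokenizer carries on:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape


def render_results(query):
    """Hypothetical search page that echoes the escaped query back."""
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        f"<p>Results for: {escape(query)}</p>"
        "</body></html>"
    )


class TextCollector(HTMLParser):
    """Lenient HTML tokenizer that only gathers text nodes."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def xml_parses(markup):
    """True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_text(markup):
    """Text content as seen by the forgiving HTML tokenizer."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.text)


good = render_results("cats & dogs")
# A form feed is not a legal XML 1.0 character, and escaping does not help.
bad = render_results("cats\x0cdogs")

print(xml_parses(good), xml_parses(bad))
print(repr(html_text(bad)))
```

With the strict parser, the second page is rejected outright; the HTML path still yields the text.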

So I contend we ran the XHTML experiment pretty thoroughly 15 years ago, and it turns out that it doesn't really work. Once you accept that parsing errors being fatal isn't viable for publishing, you have to have some kind of error recovery system. The one in HTML isn't ideal since it's basically just the codification of many years of improvisation and reverse engineering. Maybe something like XML5 [2] would be better. But figuring out how to move the world in that direction is an unsolved problem. Meanwhile HTML Just Works for most of the people most of the time.

[1] https://goer.org/Journal/2003/04/the_xhtml_100.html [2] https://github.com/annevk/xml5

pwdisswordfish4 · on Oct 7, 2020

The right solution is pervasive programming language support for interpolation inside XML/HTML code. Like there is in JavaScript now, known as JSX. Like HHVM has, in the form of XHP. (Interesting that both are Facebook innovations.)

The industry didn't defeat SQL injection through more permissive query parsers, but by generating queries via ORMs and parametrised templates instead of dumb string concatenation.

Also, have you noticed that JSON injection bugs are almost never heard of? That's because in many programming languages this very problem comes pretty much pre-solved before you even add any JSON support to them.
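A minimal sketch of both points, assuming Python with its bundled `sqlite3` and `json` modules (the table, the values and the `naive_json` helper are made up for illustration). The value travels separately from the syntax, so there is nothing to inject into:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

hostile = "x'); DROP TABLE users; --"

# Parametrised query: the driver binds the value, it is never spliced
# into the SQL text, so the quote characters are just data.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
rows = conn.execute("SELECT name FROM users").fetchall()

tricky = 'He said "hi"\n'


def naive_json(name):
    # Dumb string concatenation: quotes and newlines leak into the syntax.
    return '{"name": "' + name + '"}'


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The serializer escapes the value; the concatenated version is broken.
safe = json.dumps({"name": tricky})
print(rows == [(hostile,)])
print(is_valid_json(naive_json(tricky)), json.loads(safe)["name"] == tricky)
```

The analogue for markup would be building a tree (or using something like JSX/XHP) and letting the serializer do the escaping.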

sergeykish · on Oct 7, 2020

Error is still there, it is just a non breaking error.

Imagine a Word or Excel document that tries to silently recover. I do not see programming languages adopting a "not to fail" approach:

    1 + "2"
    // "12"
    1 - "2"
    // -1

There were PHP sites with MySQL connection errors all around. As an industry we've chosen the AirBrake approach: fail and notify developers. HTML makes it easier to edit plain text, but there is a price. What you load is not what you've stored; HTML is a lossy serialization [1] [2] [3].

When a program, not a human, produces the DOM, it should be safe to serialize and deserialize it. The format could be JSON, XML, or s-expressions. It is unsafe with HTML.

It is very easy to author XHTML. Start with the DOM (HTML here for brevity):

    data:text/html;charset=UTF-8,<p contenteditable>foo

Done, it automatically escapes <>&. Extend with some controls [4], store it as XHTML [5]. It is WYSIWYG, much easier than HTML authoring.

[1] http://sergeykish.com/script-style-is-cdata-in-html

[2] http://sergeykish.com/content-after-html-appended-to-body-in...

[3] http://sergeykish.com/pre-newline-ignored-in-html-test

[4] http://sergeykish.com/live-pages

[5] http://sergeykish.com/bookmarklet-put-xhtml

barrkel · on Oct 7, 2020

HTML mostly ends up rendering text and images, and you have to mess up really hard to lose both of those.

Typically what breaks is styling. It might not be pretty, but it may still be functional.

sergeykish · on Oct 7, 2020

I can't mess it up when I edit the DOM and the browser restores it as it was.

I may have <ul> in <p> (we had it in 1978), I may have <a> in <script> (and it works like a comment), I may have <pre>\n and not worry that it disappears each time I save the document. I may have nested <script type="foo"> tags [1].

DOM supports it. XHTML supports it. HTML breaks my content on save-load.

[1] https://stackoverflow.com/a/59548670/5554075

dataflow · on Oct 7, 2020

+1 for XHTML. I never understood why people think it's a good idea to avoid closing elements. It's like a dangling brace to me... it nags me and it just doesn't look right. How is it seen as acceptable practice?

vbezhenar · on Oct 7, 2020

It's not a good idea to close all elements in HTML, because the browser does not care about your closing tags: elements will be closed automatically regardless of whether you closed them or not. And the closing tags you wrote can cause elements to be opened automatically.

    <p>List: <ul><li>item1</li></ul> of items</p>

becomes

    <p>List: </p><ul><li>item1</li></ul> of items<p></p>

and that's probably not what you wanted. So you have to understand auto-closing behaviour either way. And if you understand it, you can just spare yourself from closing them.
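To make the rule concrete, here is a toy re-serializer, assuming Python's `html.parser` as the tokenizer. It is only a sketch of two behaviours (certain start tags close an open `<p>`, and a stray `</p>` produces an empty paragraph); the `CLOSES_P` set is an illustrative subset, attributes are dropped, and none of the real tree-construction scoping rules are implemented:

```python
from html.parser import HTMLParser

# Assumption: a small subset of the start tags that implicitly close an open <p>.
CLOSES_P = {"ul", "ol", "div", "table", "p", "pre", "blockquote"}


class ParagraphFixer(HTMLParser):
    """Toy model of how an open <p> is auto-closed; not a real HTML5 parser."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and self.p_open:
            self.out.append("</p>")  # implied end tag
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                self.out.append("<p>")  # stray </p> yields an empty paragraph
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def reserialize(markup):
    fixer = ParagraphFixer()
    fixer.feed(markup)
    fixer.close()
    return "".join(fixer.out)


print(reserialize("<p>List: <ul><li>item1</li></ul> of items</p>"))
```

On the example above this reproduces the transformed markup, including the trailing empty `<p></p>`.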

Now with XHTML, that's a different thing. But I think that ship sailed many years ago and HTML is a preferred way to go nowadays.

lucideer · on Oct 7, 2020

> you have to understand auto-closing behaviour either way

You do, if you use HTML, but my read of the above two comments was that they would've preferred if the world had stuck on the XHTML path.

> Now with XHTML, that's a different thing. But I think that ship sailed many years ago and HTML is a preferred way to go nowadays.

Actually, XHTML is a (little-known) part of the HTML5 spec.[0], so going the strict path is still an option. In the past, this would've required complex content-negotiation for media-type backward-compat but that's no longer an issue unless a non-negligible % of your visitors are using IE8.

The only remaining issue is that of draconian error handling, which is an issue browsers definitely would have fixed the UX of had the mainstream stayed on the XHTML track, but sadly that never happened. Still, good modern support for server-side validation of well-formed XML documents means this is also less of an issue than it once was (though tbh, still a significant issue imo).

W3 have also put together a more informal guide to modern XHTML considerations within the HTML5 spec. https://dev.w3.org/html5/html-polyglot/html-polyglot.html

[0] https://html.spec.whatwg.org/multipage/xhtml.html

Mikhail_Edoshin · on Oct 7, 2020

It's only because there's a known structure for HTML. A block-level element cannot be within another block, but you have to know that "p" and "ul" are block-level. Explicit closing does not require you to know the structure and thus makes the processing simpler.

It's as in JavaScript: you can omit the ";" at the end of a statement and the parser will figure it out, but it only makes the parser more complex and introduces subtle differences you have to learn. If JS parser were more strict, it would be simpler both internally and conceptually.

wayvey · on Oct 7, 2020

The second sentence is wrong: there's nothing wrong with nested block elements, but you never nest block elements inside inline elements. For example <div> elements are block-level and they are nested all the time. There are some exceptions though such as your example of an <ul> in a <p>. The <p> tag only permits "phrasing content" within itself, and <ul> isn't considered phrasing content.

MDN definition of phrasing content: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Cont...

Mikhail_Edoshin · on Oct 7, 2020

Thanks for the correction (I don't really know the HTML flow model and was guessing it.)

cookiengineer · on Oct 7, 2020

> they'll be closed automatically regardless of whether you closed them or not.

Actually that is not really true. The HTML5 spec defines which elements are closed automatically when another opening tag is parsed.

The <p> element always had a weird flow-root behaviour, which is why it is always closed automatically.

Other than that, iirc, mostly form-related elements are also closed automatically, like optgroup, option, select, input and such.

Additionally it's only table (same problem with flow root) and body, pretty much.

I can understand that a lot of people are confused why that is. But the reason is not the difference of XML vs SGML per se (xhtml will simply break if a p is within a p)... it's the flow root model and the difference in behaviours of layouting that is specified here, not the notation structure.

So SGML was the better choice, to allow all php crapsites that never test their html validity to still run rather than forcing the enduser to (fix?) the xml.

I mean, at some point opera didn't even work on bbcode based forum software(s), so people quickly started abandoning it.

tannhaeuser · on Oct 7, 2020

FWIW, there's a complete SGML DTD capturing these and all other HTML rules for tag omission/inference with extensive docs at my site [1].

[1]: http://sgmljs.net/docs/html52.html

cookiengineer · on Oct 7, 2020

Woah this is actually very nice, very nice indeed.

I'm currently building my own HTML5 parser for my browser stealth [1] and I aim to be spec-compliant with it, and this might come in very handy for testing against.

What I kind of miss with SGML as a feature is something similar to XSLT stylesheets that can transform chunks of a website into another chunk.

Currently I'm kind of reinventing the wheel here due to my optimizer having the idea to "upgrade" websites on the fly before they get to the client.

If all websites were XHTML1.1 strict based, that part would have been so much easier.

[1] https://github.com/tholian-network/stealth

tannhaeuser · on Oct 7, 2020

SGML itself has link process declarations, an additional type of declaration set that can appear in an SGML prolog next to DTDs and that can be used to remap elements (in SGML you can have multiple DTDs and LPDs, pipeline LPDs, and so on). sgmljs uses this and adds templating to capture attributes at call sites for passing these into templates as regular entities, allowing for parametric macro expansion. Basically, if you have eg

    <div bla=x>

in your main doc, you can make SGML expand it using

    <!DOCTYPE div SYSTEM [
      <!ENTITY bla SYSTEM>
    ]>
    <div>
      <p>Value of bla is &bla</p>
    </div>

honoring escaping/sanitizing etc. LPDs can apply rules in a context-dependent way using an automaton capturing much of core CSS.

Now, for arbitrary markup manipulation (XSLT is Turing-complete), don't tell the HN crowd that SGML has/had Scheme-based DSSSL (precursor of XSLT) ;) My opinion, having done large, nontrivial XSLT projects (including extracting the DTD grammar rules you see on the site from spec text) is that the more complex it gets, the more a general-purpose language with unit testing etc becomes a better choice over XSLT.

Edit: much luck with your browser project! Don't hesitate to use my code or ask questions (here or on StackOverflow tagged sgml)

dataflow · on Oct 7, 2020

Wow, today I learned! I didn't realize that transformation would take place. I wonder how many people writing HTML are aware of these... your comment basically made me realize I in fact simply don't know HTML. Thanks!

runarberg · on Oct 7, 2020

Your linter should warn you about these. E.g. if you use prettier the first `<p>` will be automatically closed on auto-format before the `<ul>`, and the second (closing) `</p>` will cause a parsing error.

sergeykish · on Oct 7, 2020

Sometimes we want <ul> inside <p> [1]. It works in DOM, it works in XHTML. And before the "you should not want to do it" argument: GML (SGML's predecessor) had a paragraph continuation without indentation, so text could continue after a list [2].

> The pc tag identifies a paragraph continuation -- that is, one or more sentences related by their subject matter to a paragraph which has been interrupted by an address, example, figure, list, or long quotation.

> Usage: The paragraph continuation can occur after the sequence consisting of a paragraph unit followed by an address, example, figure, list, or long quotation.

    :p.The subject of a paragraph might be continued through
    :sl
    :li.an address, a list,
    :li.an example or figure, or
    :li.a long quotation, :esl
    :pc.and continue to be discussed in flowing text.
    The discussion could continue indefinitely through

Too long to quote entirely [3]; unfortunately IBM destroyed the online version.

[1] https://www.google.com/search?q=list+inside+paragraph+site%3...

[2] SH20-9160-0 DCF Generalized Markup Language GML Users Guide, First Edition (July 1978)

[3] http://sergeykish.com/gml-element-pc