FAQ: Untrusted users and HTML
An input form that takes raw HTML. It’s a pretty common thing to see in web apps these days: many comment forms allow HTML, or some subset thereof; many social-network-style applications allow end-users to enter HTML in their profiles; etc. Unfortunately, allowing untrusted users to enter raw HTML is incredibly dangerous; read up on XSS if you don’t know why.
So a common question that comes up in web developer circles deals with how best to "escape" user-entered HTML so that it is safe for presentation. Though this seems easy, it’s actually incredibly difficult; see Whitelist, Don’t Blacklist for an introduction. I’ve literally seen hundreds of recipes for stripping unsafe HTML that are about as effective as a screen door on a submarine.
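To make the screen-door point concrete, here is a small sketch of the kind of blacklist recipe that tends to fail. The function name naive_strip_scripts and the two payloads are made up for illustration, and the filter is deliberately weak; it is not taken from any particular library.

```python
import re

def naive_strip_scripts(html_text: str) -> str:
    """Hypothetical blacklist filter: delete <script>...</script> blocks in one pass."""
    return re.sub(
        r"<script.*?>.*?</script>",
        "",
        html_text,
        flags=re.IGNORECASE | re.DOTALL,
    )

# Payload 1: the fragments left behind after stripping join up into a script tag.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"

# Payload 2: an event-handler attribute; there is no script tag to find at all.
handler = '<img src="x" onerror="alert(1)">'

print(naive_strip_scripts(nested))
print(naive_strip_scripts(handler))
```

Both payloads come out the other side still dangerous: the first collapses into a plain script element, and the second passes through untouched. Patching these two cases just invites the next variation, which is the arms race in miniature.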
I’d like to answer the question once and for all:
No method of displaying untrusted HTML is 100% safe.
Really. Given the bewildering array of browsers and their bugs, as soon as you open up HTML input you’ve exposed yourself to an arms race against XSS (and related) attacks.
Put another way, the only 100% safe form of HTML protection is abstinence: if you can avoid allowing raw HTML input, do so.
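In practice, abstinence usually means treating everything the user typed as plain text and escaping it on output. A minimal sketch using Python's standard library follows; render_comment is a hypothetical helper name, and wrapping the result in a paragraph tag is just an assumption about how the comment gets displayed.

```python
from html import escape

def render_comment(user_text: str) -> str:
    """Treat user input as plain text: escape &, <, >, and both quote characters."""
    # quote=True (the default) also escapes " and ', which matters if the
    # text ever ends up inside an attribute value.
    return "<p>" + escape(user_text, quote=True) + "</p>"

print(render_comment("<script>alert(1)</script>"))
```

The markup shows up on the page as literal text rather than being interpreted, so the browser never gets a chance to run it.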
One of the great features of alternative markup like Markdown, reStructuredText, bbCode, and their ilk is that they can be transformed into safe HTML. For example, python-markdown has a safe_mode argument that prevents anything dangerous from appearing in your output.
Now, I’ve always thought that abstinence-only education is a crock. I’ve always felt that consenting adults who know the risks and want to proceed anyway should be taught about the most effective forms of protection.
In this case, the most effective protection comes in the form of html5lib, specifically html5lib.filters.sanitizer. This uses a well-tested, centrally-maintained whitelist of safe HTML elements. Because of the quality of that whitelist, html5lib is the safest form of protection against malicious HTML.
Just remember that abstinence is the only 100% effective method of protection, and non-HTML markup is more fun, anyway.