Parse Me, Baby, One More Time: Bypassing HTML Sanitizer via Parsing Differentials
2024Conference / Journal
Authors
Research Hub
Research Hub C: Sichere Systeme
Research Challenges
RC 7: Building Secure Systems
RC 8: Security with Untrusted Components
Abstract
Websites rely on server-side HTML sanitization to defend against the ever-present threat of cross-site scripting attacks. Parsing arbitrary pieces of markup to assess whether they contain an exploit payload is far from trivial. This complexity leads to divergences between the parsing results of the sanitizer and the user’s browser. These so-called parsing differentials open the door for the unexplored category of mutation-based attacks. Here, an attacker abuses the sanitizer’s incorrect HTML parser to either directly bypass it or coerce it to transform benign markup into a dangerous exploit payload. In this work, we study the prevalence of such parsing differentials and their security impact. To this end, we built a generator for HTML fragments that are difficult to parse and evaluated how 11 sanitizers across five programming languages deal with such inputs. We found that parsing differentials are commonplace, as each assessed sanitizer has at least several functional deficiencies leading to overzealous removal of benign input. Even worse, we were able to automatically bypass all but two of the 11 sanitizers, painting a dire picture of the state of server-side HTML sanitization.