We were contacted by a bug hunter once claiming he had access to our database and asking for a bounty for his finding; he even provided a sample of the first 100 users from the users table.<p>After some investigating, I figured out how he had obtained the data.<p>He was one of the first 100 users: he had set one of his fields to an XSS Hunter payload and slept on it.<p>Two years later, a developer had a dump of data to test some things on. He loaded it into a SQL development tool on his Mac, and out of VS Code muscle memory he hit Command+Shift+P to open the command palette. In the SQL tool that shortcut opened "Print Preview" instead, and the software rendered the current table view into a webview to ease printing, where the XSS payload executed and the page content was sent to the researcher.<p>Escape input; you never know where it will be rendered.
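The fix for a stored payload like the one in this story is a single escaping pass at whatever boundary renders the text as HTML. A minimal sketch in Python (the URL and field are made-up stand-ins, not the actual payload):

```python
import html

# A harmless stand-in for the stored XSS Hunter payload from the story.
user_bio = '<script src="https://attacker.example/hook.js"></script>'

# One escaping pass turns the markup into inert text wherever it is rendered,
# whether in a web app today or a desktop tool's print preview two years later.
safe = html.escape(user_bio)
print(safe)
```

The same string is then safe to drop into any HTML context, because the angle brackets and quotes arrive as entities rather than markup.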
This is such an important lesson, but it's a difficult one to convince people of - telling people NOT to sanitize their input goes against so much existing thinking and teaching about web application security.<p>It's worth emphasizing that there's still plenty of scope for sensible input validation. If a field is a number, or one of a known list of items (US states, for example), then obviously you should reject invalid data.<p>But... most web apps end up with some level of free-form text. A comment on Hacker News. A user's bio field. A feedback form.<p>Filtering those is where things go wrong. You don't want to accidentally create a web development discussion forum where people can't talk about HTML because it gets stripped out of their comments!
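The distinction the comment above draws can be sketched in a few lines. This is illustrative Python with hypothetical field names; the point is that closed-domain fields get validated while free-form text passes through untouched:

```python
US_STATES = {"CA", "NY", "TX", "WA"}  # abbreviated for the example

def validate_age(value: str) -> int:
    """A numeric field: reject anything that isn't a plausible number."""
    age = int(value)  # raises ValueError on non-numeric input
    if not 0 <= age <= 150:
        raise ValueError(f"implausible age: {age}")
    return age

def validate_state(value: str) -> str:
    """A known-list field: reject anything outside the allowlist."""
    state = value.strip().upper()
    if state not in US_STATES:
        raise ValueError(f"unknown state: {value!r}")
    return state

def accept_comment(value: str) -> str:
    """Free-form text: store as-is; escaping happens at output time."""
    return value
```

A comment containing `<b>` or `DROP TABLE` sails through `accept_comment` unchanged, which is exactly what lets people discuss HTML and SQL on a forum.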
It's buried a bit in the article, but if you have to sanitize input to allow only some kinds of inputs (e.g., specific tags), you should really parse it fully to an AST and then act on that (or use a library that does the same); otherwise you're going to be subject to all sorts of pain.
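A minimal sketch of the parse-then-act approach, using Python's stdlib `HTMLParser` (the allowlist and the decision to drop all attributes are illustrative choices; a production system would use a maintained library):

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"b", "i", "em", "strong", "a"}  # illustrative allowlist

class Sanitizer(HTMLParser):
    """Parse untrusted HTML into events and re-emit only allowlisted tags.

    Because the input is actually parsed, tricks like nested or malformed
    tags can't smuggle markup past a naive regex filter."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            # Attributes are dropped entirely here, which also kills
            # href="javascript:..." vectors. Disallowed tags are removed.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))  # text content is always escaped

def sanitize(untrusted: str) -> str:
    p = Sanitizer()
    p.feed(untrusted)
    p.close()
    return "".join(p.out)
```

Note that a `<script>` tag's body survives as escaped text here; whether to drop it instead is one of the policy decisions a real library makes for you.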
I still wish that the Unicode folks had set up a bunch of duplicate code points which could have been used exclusively for processing marked-up text and that the folks making markup systems/languages had followed through.<p>Say one was updating TeX to take advantage of this --- all the normal Unicode character points would then have catcodes set to make them appropriate to process as text (or a matching special character), while "processing-marked-up" characters would then be set up so that for example:<p>- \ (processing-marked-up variant) would work to begin TeX commands<p>- # (processing-marked-up variant) would work to enumerate macro command arguments<p>- & (processing-marked-up variant) would work to delineate table columns<p>&c.<p>and the matching "normal" characters when encountered would simply be set.
Why not both? Escaping output should be a requirement, but it doesn't hurt to remove obvious garbage from the input too (including harmless stuff like pointless spaces)
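The "both" position reduces to two small, independent functions; a sketch in Python (function names are illustrative):

```python
import html

def normalize(raw: str) -> str:
    # Input-side cleanup of obvious garbage: collapse runs of
    # whitespace and trim the ends. Nothing security-critical here.
    return " ".join(raw.split())

def render(stored: str) -> str:
    # Output-side escaping stays mandatory regardless of any
    # cleanup done on the way in.
    return html.escape(stored)
```

The key property is that `render` never assumes `normalize` ran, so the security guarantee doesn't depend on the cosmetic step.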
I store the raw input in my database, but run it through bluemonday before rendering it. Simples.<p><a href="https://github.com/microcosm-cc/bluemonday">https://github.com/microcosm-cc/bluemonday</a>
This is another place where 80% of the time one way works but 20% of the time you need to go the other way.<p>Of course once the product is in production you can swim one direction but not fight the current going in the other. You can always move to escaping output, but retroactively sanitizing input is a giant pain in the ass.<p>But the problem comes in with your architecture, and whether you can discern data you generated from data the customers generated. Choose the wrong metaphors and you end up with partially formatted data existing halfway up your call stack instead of only at the view layer. And now you really are fucked.<p>Rails has a cheat for this. It sets a single boolean value on the strings which is meant to indicate the provenance of the string content. If it has already been escaped, it is not escaped again. If you are combining escaped and unescaped data, you have to write your own templating function that is responsible for escaping the unescaped data (or it can lie and create security vulnerabilities. "It's fine! This data will always be clean!" Oh foolish man.)<p>The better solution is to push the formatting down the stack. But this is a rule that Expediency is particularly fond of breaking.
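The Rails "cheat" described above (a provenance flag on strings, as in `html_safe`) can be sketched in a few lines of Python. This is a toy reconstruction of the idea, not Rails' actual implementation:

```python
import html

class SafeStr(str):
    """A str subclass whose type marks the content as already HTML-escaped.

    This is the single boolean of provenance: isinstance(s, SafeStr)."""

def escape_once(value: str) -> SafeStr:
    if isinstance(value, SafeStr):
        return value  # provenance flag says: already escaped, don't double-escape
    return SafeStr(html.escape(str(value)))

def render_greeting(name: str) -> SafeStr:
    # A templating helper combining trusted markup with untrusted data:
    # it is responsible for escaping the unescaped piece.
    return SafeStr("<p>Hello, " + escape_once(name) + "</p>")
```

The failure mode the comment warns about is visible here too: nothing stops a careless caller from wrapping raw user data in `SafeStr` directly ("It's fine! This data will always be clean!"), which is exactly the lie that creates the vulnerability.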
I've always been a big fan of <i>structuring</i> data on input, <i>escaping</i> it on output.<p>I think the big problem with just escaping output is that you can accidentally change what the output will actually be in ways that your users can't predict. If I am explaining some HTML in a field and drop `<i>...</i>` in there today, your escaper may escape this properly. But next month when you decide to change your output to actually allow an `<i>` tag, then all of a sudden my comment looks like some italicized dots, which broke it.<p>Instead if you structure it, and store it in your datastore as a tree of nodes and tags, then next month when you want to support `<i>` you update the input reader to generate the new structure, and the output writer to handle the new tags. You preserve old values while sanitizing or escaping things properly for each platform.
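The structure-on-input idea above can be sketched with a tiny node tree. This is illustrative Python: comments are stored as nested `(tag, children)` tuples rather than raw HTML, and the renderer decides which tags the current deployment supports:

```python
import html

# Which tags this deployment renders. Next month, adding "b" here makes
# old stored comments containing ("b", ...) nodes render correctly --
# no stored string ever changes meaning.
SUPPORTED = {"i"}

def render(node) -> str:
    if isinstance(node, str):
        return html.escape(node)          # leaf text is always escaped
    tag, children = node
    inner = "".join(render(c) for c in children)
    if tag in SUPPORTED:
        return f"<{tag}>{inner}</{tag}>"
    # Unsupported tag: show it literally, so the comment still reads
    # the way its author wrote it instead of silently changing.
    return html.escape(f"<{tag}>") + inner + html.escape(f"</{tag}>")

comment = ["explaining ", ("i", ["some HTML"]), " here"]
print("".join(render(c) for c in comment))
```

This is why the "italicized dots" problem disappears: the stored tree pins down what the author meant, and only the renderer's policy evolves.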
It is a reasonable idea, but there are other things that can be done too.<p>On the SQL side, you could use SQL host parameters (usually denoted by question marks) if the database system you use supports them, which avoids SQL injection problems.<p>If you deliberately allow the user to enter SQL queries, there are better ways to handle this. If you use a database system that allows restricting SQL queries (like the authorizer callback and several other functions in SQLite, which can be used for this purpose), then you might use that; I think it is better than trying to write a database-independent parser for the SQL code and expecting it to work. Another alternative is to allow the database (in CSV or SQLite format) to be downloaded (and if the MIME type is set correctly, a browser or browser extension may allow the user to do so using their own user interface if they wish; otherwise, an external program can be used).<p>Some of the other problems mentioned, and the complexity involved, are due to the messy complexity of HTML and the WWW in general.<p>For validation, you should of course validate on the back end, and you may do so on the front end too (especially if the data needed for validation is small and is intended to be publicly known). However, if JavaScript is disabled, the form should still be sent and the server should reply with an error message if validation fails; if JavaScript is enabled, the client can check for errors before sending it to the server. That way it works either way.
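The host-parameters point is easy to demonstrate with Python's stdlib `sqlite3`, which uses the question-mark (qmark) placeholder style:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# The ? placeholder passes the value out-of-band from the SQL text,
# so the quote and the trailing SQL fragment are stored as data,
# never parsed as part of the statement.
hostile = "Robert'); DROP TABLE users;--"
conn.execute("INSERT INTO users VALUES (?)", (hostile,))

row = conn.execute("SELECT name FROM users").fetchone()
print(row[0])
```

No escaping of the value was needed anywhere, because the value never travels through the SQL grammar at all.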
This has been the way for Drupal since ... 2005 at least. My memory becomes fuzzy before that. Since 2015 it's highly automated too thanks to Twig autoescape.
Of the “six famous bad ideas in computer security”, the first and second are “default permit” and “enumerating badness”.<p><a href="http://www.ranum.com/security/computer_security/editorials/dumb/" rel="nofollow">http://www.ranum.com/security/computer_security/editorials/d...</a>
Of course you should sanitize input, <i>and</i> escape everything properly in the context-specific way.<p>Defining what is valid for an input field and rejecting everything else helps the user catch mistakes. It's not just for security.<p>Some kinds of information are tricky to sanitize. Names, addresses and such. Especially in an application or site that has global users. Do the wrong thing and you end up aggravating users, who are not able to input something legitimate.<p>But maybe don't allow, say, a date field to be "la la la" or even "December 47, 2023".
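The date example above is the easy end of the spectrum, since stdlib parsers already enforce calendar validity. A sketch in Python (the accepted format is an assumption for illustration):

```python
from datetime import datetime

def parse_date(value: str):
    # strptime enforces both the format and calendar validity, so
    # "la la la" and "December 47, 2023" are both rejected with ValueError.
    return datetime.strptime(value.strip(), "%B %d, %Y").date()

print(parse_date("December 4, 2023"))
```

Names and addresses get none of this help: there is no `strptime` for "a legitimate human name", which is why over-validating those fields aggravates real users.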
Ehhh!? I don't get this at all. You <i>obviously</i> do both.<p>1) you get your input data into the form that is meaningful in the database by validating, sanitising and transforming it. Because you know what form that data should be in, and that's the <i>only</i> form that belongs in your database. Data isn't just <i>output</i>, sometimes it is processed, queried, joined upon.<p>2) you correctly format/transform it for output formats. Now you know what the normalised form is in the database, you likely have a simpler job to transform it for output.<p>It's not just lazy to suggest there's a choice here, it's wrong.
Disagree.<p>Escaping/sanitizing on output takes extra cycles/energy that could be spared if the same process were done once upon submission.<p>Think more sustainable.