Post-Mortem and Security Advisory: Data Exposure After travis-ci.com Outage

351 points by xtreak29 about 7 years ago

32 comments

cjbprime about 7 years ago

Kudos for a thorough and transparent writeup, and (by the looks of things) understanding that processes fail rather than individuals.

That said, I have to admit to having at least three eye-bulge WTF moments while reading this.

I'm also surprised that there isn't a Remediation step of "firewall the development machines away from the production database".

(And isn't the change to database_cleaner to make it throw when run against remote databases by default a serious break of API compatibility? What if someone's depending on that behavior?)

geofft about 7 years ago

So one of the ways of analyzing the root cause here is that the autoincrement index of the user table in their database is security-sensitive, and relatively normal DB operations like "Let's roll back the DB" have serious security implications involving ID reuse. What are some ways to make this less dangerous? (The rest of it was an operational failure, but it would have been less trouble if it weren't a security failure.)

I can think of the following:

- Don't use auth cookies that are signed messages consisting of a UID + expiration date and other data; use auth cookies that are opaque keys into some valid-auth database. This is significantly less efficient (every operation needs a lookup into the DB before you can do anything; if you move it into a cache you now risk the cache being out-of-date with your DB). AFAIK using signed UIDs has no security downside other than this, right?
- Identify users by usernames, not by UIDs. This makes renaming users (which GitHub allows, so Travis is forced to allow) difficult and security-risky.
- Use UIDs that are selected from a large random space to make collisions unlikely, e.g., UUIDs or preferably 256-bit random strings. This seems fine and probably preferable from a security point of view. Is this fine from the DB point of view?

Anything else? Maybe a DB restore-from-backup option that preserves autoincrement counters and nothing else - is that a standard tool?
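A minimal sketch of the first and third options, written in Python purely for illustration (it is not the stack discussed in the post-mortem, and every name below is hypothetical): user IDs come from a large random space, and the session token is an opaque key that only means something through a server-side lookup.

```python
import secrets
import uuid

# Hypothetical in-memory stand-ins for a users table and a sessions table.
users = {}
sessions = {}

def create_user(username):
    """Create a user keyed by a random UUID (122 random bits) instead of an autoincrement ID."""
    user_id = str(uuid.uuid4())
    users[user_id] = {"username": username}
    return user_id

def create_session(user_id):
    """Issue an opaque token; it carries no user ID and is only valid via lookup."""
    token = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits
    sessions[token] = user_id
    return token

def authenticate(token):
    """Return the user record for a token, or None if the token or user is unknown."""
    user_id = sessions.get(token)
    if user_id is None:
        return None
    return users.get(user_id)
```

With this shape, a token issued to an account created during the outage window would point at an ID that does not exist after a restore, so it would stop authenticating rather than land on someone else's account. The cost is the per-request lookup already mentioned above.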

drinchev about 7 years ago

> The shell the tests ran in unknowingly had a DATABASE_URL environment variable set as our production database. It was an old terminal window in a tmux session that had been used for inspecting production data many days before. The developer returned to this window and executed the test suite with the DATABASE_URL still set.

I was expecting something like this. I configured my terminal windows to change their background when I'm on production systems [2], around the time I read about the GitLab database incident [1].

[1]: https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

[2]: http://www.drinchev.com/blog/ssh-and-terminal-background/

xtreak29 about 7 years ago

I was surprised about a tmux session connected to the production DB for days. Though it was idle, there are a lot of things that can go wrong during window switching. My colleague also pointed out the subtle error of relying on whatever DATABASE_URL happens to be set in the environment instead of having the test script set it explicitly, which could have avoided this.

That being said, I am amazed at their transparency over the whole issue and the thorough write-up of the whole incident. It's something we can all learn from.
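A rough sketch of what "set explicitly by the test script" could look like, in Python for illustration. The variable name TEST_DATABASE_URL, the default URL, and the list of local hosts are all assumptions, not anything from the post-mortem. The test setup ignores an inherited DATABASE_URL and refuses any host that is not local.

```python
import os
from urllib.parse import urlparse

# Assumption: test databases only ever live on these hosts.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}

# Hypothetical default; the test script chooses this itself rather than inheriting it.
DEFAULT_TEST_DATABASE_URL = "postgres://localhost:5432/app_test"

def is_local_database(url):
    """True if the URL has no host (e.g. a Unix socket) or points at a loopback host."""
    host = urlparse(url).hostname
    return host is None or host in LOCAL_HOSTS

def resolve_test_database_url(environ=None):
    """Pick the URL for the test run, deliberately ignoring any inherited DATABASE_URL."""
    environ = os.environ if environ is None else environ
    url = environ.get("TEST_DATABASE_URL", DEFAULT_TEST_DATABASE_URL)
    if not is_local_database(url):
        host = urlparse(url).hostname
        raise RuntimeError(f"refusing to run tests against remote host {host!r}")
    return url
```

A stale DATABASE_URL left in an old terminal then has no effect on the test run, and pointing the tests at a remote host takes a deliberate override plus a code change.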

kgilpin about 7 years ago

If you are interested in a list of steps you can take to avoid this happening to your data, here are some suggestions. I don't believe that any single measure is sufficient. And I also believe that it's valid to balance the strictness of your controls against the amount of protection you really need.

1. Vault the passwords. People and machines should fetch passwords on-demand using identity credentials.
2. Create a read-only database account. In all cases, use the account that matches the need. Running reports? Use the read-only account.
3. Restrict access to read-only and read-write database accounts. Provide this account information to a limited set of people and tools.
4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).
5. Rotate the credentials of all the accounts regularly. This ensures that temporary elevated access will not become long-term access. It also greatly reduces the harm created when credentials are leaked, exposed, or forgotten about (e.g. in an environment variable in an old window).

Note that none of the steps above require a heavy investment in automation. You can start with basic (even fully manual) processes for key management and access management, and evolve to automation as you grow.

Finally, keep in mind that this type of accident is not just a "small company" problem. Recall this AWS ELB outage on Christmas Eve of 2012 - https://aws.amazon.com/message/680587/

> "The data was deleted by a maintenance process that was inadvertently run against the production ELB state data."

sudhirj about 7 years ago

They should really consider using a CI system to run their tests.

wiredfool about 7 years ago

I think the root issue here is that the production database "user" has too many privileges, and the reason for that is migrations. This is compounded by the test user essentially needing to be a db superuser to create and destroy test databases, as well as run the migrations for them. I've noticed this lately with Django, but I'm guessing that it's a general problem.

When I design a DB system, ideally the production 'user' can only do those things that we reasonably expect them to be able to do, and truncate isn't one of them. Neither is dropping tables, potentially deleting entries, or any maintenance task. DDL modifications are right out.

Those tasks can be run by a specific user, locked down to certain types of connections that aren't allowed from production.

jtmarmon about 7 years ago

Wow, the issue with the signed token is very interesting. Found it surprising the authentication method specifically wasn't mentioned in the remediation.

Food for thought: the security issue wouldn't have happened if (1) Travis used UUIDs instead of sequential IDs as a pkey, or (2) used a secret token for auth instead of a signed (presumably) JWT.

badmadrad about 7 years ago

Wondering why does any developer need update/delete/drop access to a prod database? Or why would ad hoc scripts have this ability?

plasma about 7 years ago

A suggestion for production database access: create a read-only login (in addition to a write one).

Log in using the read-only login the majority of the time, and only switch to the write login when required.

vijaybritto about 7 years ago

Finally I can show a solid example to my team mates who ridiculed me when I said we needed restrictions on access to prod servers. This is a great write-up!

bogomipz about 7 years ago

> "Using our API logs, and with information from our upstream provider about the IP address the query originated from, we were able to identify a truncate query run during tests using the Database Cleaner gem."

I'm assuming by "upstream provider" here they mean ISP/IaaS provider. Either way they didn't have enough information under their control to identify the source of the query. The reliance on a third party for accurate logging information seems like a big blind spot.

What if the upstream provider didn't have the logs? Or the request for access to those took an excessive amount of time? I didn't see anything in the remediation steps to address this.

jmiserez about 7 years ago

Interesting writeup. I loathe setting environment variables on long-running terminal sessions exactly because it's not obvious once they're set.

I prefer to use a subshell for the command and set the environment variable each time:

    $ ( export FOO=bar; my_cmd )
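The same idea carries over when the command is launched from a script instead of typed into a shell. A small sketch in Python, where the helper name run_with_env is made up for illustration: the variable is handed to the child process only and never lands in the parent's environment.

```python
import os
import subprocess
import sys

def run_with_env(cmd, **extra_env):
    """Run cmd with extra variables visible only to the child process."""
    env = {**os.environ, **extra_env}  # copy; os.environ itself is not modified
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)

# FOO is set for this one child process; nothing lingers in the calling session.
result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['FOO'])"],
    FOO="bar",
)
```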

kccqzy about 7 years ago

One thing that immediately caught my attention: the fact that it is possible for a single query/command/request to wipe everything.

To be frank, at a place I worked there had always been something like this too: if you were logged in as super admin, wiping all data was just one POST request away. That was super convenient when testing things, but having the same in production made me uneasy. Fortunately, before any incident happened, I added additional checks that required special command line flags to enable this API. Perhaps still not super foolproof, but I felt much better.
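For illustration, a check like the one described could be as small as the Python sketch below. The flag name --enable-wipe-api and both function names are invented here and are not the actual code from that job.

```python
import argparse

def build_parser():
    """Parser for a hypothetical admin server."""
    parser = argparse.ArgumentParser(description="hypothetical admin server")
    # The wipe endpoint stays disabled unless the operator opts in at startup.
    parser.add_argument("--enable-wipe-api", action="store_true")
    return parser

def handle_wipe_request(config, wipe):
    """Call wipe() only when the server was started with the explicit flag."""
    if not config.enable_wipe_api:
        raise PermissionError("wipe API is disabled; restart with --enable-wipe-api")
    wipe()
```

A production start script that never passes the flag leaves the endpoint inert, while a test environment can opt in on purpose.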

notimetorelax about 7 years ago

Shouldn't there be a remediation step of making it impossible to log in to another user's session? E.g. generate a random number for every provisioned user and add it to the token.
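One way that suggestion might look, as a hedged Python sketch: the token format, SERVER_KEY, and the function names are all assumptions, and this is not how Travis CI's tokens are actually built. Each user record gets a random nonce, the nonce is signed together with the ID, and verification compares it with the stored record.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical signing key

def new_user_record(user_id):
    """Store a random per-user nonce alongside the (possibly reused) numeric ID."""
    return {"id": user_id, "nonce": secrets.token_hex(16)}

def sign_token(user):
    """Sign the ID together with the nonce, so the ID alone does not identify the user."""
    payload = f"{user['id']}:{user['nonce']}"
    mac = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{mac}"

def verify_token(token, users_by_id):
    """Accept the token only if the signature is valid and the nonce matches the stored user."""
    try:
        user_id, nonce, mac = token.split(":")
    except ValueError:
        return None
    expected = hmac.new(SERVER_KEY, f"{user_id}:{nonce}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return None
    user = users_by_id.get(int(user_id))
    if user is None or not hmac.compare_digest(user["nonce"], nonce):
        return None
    return user
```

Note that the nonce comparison needs the user record, so this gives up the "no database lookup" property of a purely signed token. In exchange, a token minted for a reused ID fails once the original record is restored.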

thezilch about 7 years ago

All of my production terminals have a dark-red background, and my screen hardstatus is also red. This is my default in rc files, and I have to explicitly link rc files to get my dev-only black background with lime hardstatus.

joelhaasnoot about 7 years ago

Lots of focus in the comments on the database access issue, but trusting the user-specified (signed) token doesn't seem like a great idea. Not validating the token against the database seems like a painful shortcut.

philip1209 about 7 years ago

This makes me think of the Google SRE book. They advise that, if there is a problem this big, any SRE should have the power to turn off the production load balancers until the problem is fixed.

I don't think that TravisCI did anything wrong. However, if they had turned off the load balancers as soon as they realized that there was a huge issue, it might have protected customer data more. They optimized uptime over completely fixing the issue. Also, perhaps nobody felt that they had the authority to turn off the production service.

catfood about 7 years ago

So the session keys mapped to usernames, rather than IDs in the database? Otherwise, when the database is restored with the old user IDs, the session would become invalid instead of continuing to work. This is what I'm seeing:

1. Tables truncated.
2. In this window, someone creates an account with a username that existed in the dropped database.
3. They see a blank user page because a new user record was created.
4. Database restored.
5. It's as if you're logged into the original user's account.

fermigier about 7 years ago

Reminds me of this:

"Why Auto Increment Is A Terrible Idea" (2015) https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/

(update: link fixed).

pradeepchhetri about 7 years ago

Amazed to see how transparently they have written the post. I think we all can learn from such outages [0].

[0]: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

SoulMan about 7 years ago

Classic case of a developer returning to a window with the prod env set up. I am sure it was a "blameless post-mortem", i.e. the action items contain changes in tooling and processes rather than trying to change human behaviour.

Sembiance about 7 years ago

Aren’t these the folks that spammed every github repo with a spam pull request to integrate their system into your code? I kinda lost all respect for this project and their developers after that incident.

yazr about 7 years ago

Has anyone estimated whether SaaS in general reduces or improves uptime in aggregate over all users?

The obvious argument is that a specialized SaaS is more reliable, but the rare outages are horrific...

bananarepdev about 7 years ago

Why does an extremely dangerous tool, such as a database cleaning tool/library, rely on an environment variable to define the target?

stevekemp about 7 years ago

Site is down for me, but this mirror isn't:

http://archive.is/klfF5

nukeop about 7 years ago

What is a "read-only follower"? Is this a common term when handling databases? Is it different than a slave?

d6de964 about 7 years ago

I'm currently job-seeking and I've seen many job ads asking for CI experience. I'm not fond of using SaaS solutions and would like to fiddle with CI in private (e.g. using a private GitLab repo).

What would be the steps to set up my own private, open source CI solution for, say, a Go, PHP, or JavaScript project?

ghoshbishakh about 7 years ago

Amazed at their transparency!

wemdyjreichert about 7 years ago

Attack of Little Bobby Tables

0xFFFE about 7 years ago

Not being snarky. How hard is it to set up DB replication and do testing/QA on that DB? Isn't it the SOP?

Why doesn't the remediation list include it?

100k about 7 years ago

Great writeup.

This is the third case I'm aware of where CI deleted the production database. Others are GitHub (back in 2010: https://blog.github.com/2010-11-15-today-s-outage/) and LivingSocial.
