Discovering Azure's unannounced breaking change with Cosmos DB

188 points by jmartens over 2 years ago

19 comments

dagss over 2 years ago

This rhymes with my overall impression of Cosmos. It took us a while to see through the smokescreen because when talking to Microsoft support and representatives it is the Best Thing Ever and they sound so confident about it. But it really is a beta demo product sold with an alpha premium price tag.

If your traffic pattern is exactly right, and you always scale traffic up and never ever down and do not have spikes, I guess it is probably OK. The main problem is that the docs are (or at least were 2 years ago) not clear about all the caveats and restrictions but pretend it is a generic database that just works. So one has to discover all the caveats oneself.

Microsoft thinks the exact workings of the partitioning are something that should work so well you don't need to know them in detail. But if your use case is slightly off, you end up really needing to know. I know at least one team who routinely copy all their data from one Cosmos instance to another and switch traffic over to the copy just to get a partitioning reset. It is one thing to have to do it; it is another to discover in production, with no prior warning, that it has to be done.

Also: the ipython+portal+Cosmos security meltdown from 1 1/2 years ago alone should be reason to look elsewhere.

(No, not a competitor, just have spent way way way too much engineering time moving first on and then off Cosmos, and yes I am bitter.)

redditor98654 over 2 years ago

I am proud of my team at AWS that took backwards compatibility very seriously. Even introducing a temporary backwards incompatibility was a no-go in design reviews.

We had a service with a paginated list API. It returned a nextToken to specify the start of the next page of results.

Internally we were doing a database migration to a completely different system, migrating one customer at a time. The problem was that if a customer was in the middle of a list call and held a next token generated by the previous database system, then after migration to the new database the older token could not resume from exactly where it would have had the customer not been migrated. This was because the old token did not carry all the information the service needed to resume at the exact offset.

One option was to throw an error and let the customer retry the request; another was to return some possibly duplicate items in the next page. Neither was good enough for engineers or PMs, so we decided to take on a bunch of additional work so that no customer would be impacted. This was 10+ weeks of additional work for the whole team, but we did it because culturally it felt like the right thing to do for the customer.

Note that the impact would have been tiny, if any. A customer would have had to be in the middle of a paginated request, our migration system would have had to migrate that particular customer at that exact time, and the impact would have been a few possibly duplicate items. But we didn't know the actual impact of those temporary duplicates for a single call, and we all agreed that breaking changes like this are unexpected and cause customers to lose trust in us.
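
The token problem described above can be illustrated with a small sketch. This is not the team's actual design, which the comment does not describe. It assumes a hypothetical setup in which legacy tokens (version 1) store a count of items already returned, new tokens (version 2) store the last key returned, both stores list keys in the same sorted order, and the old ordering is retained during the migration window. All names (`encode_token`, `decode_token`, `list_page`, `legacy_order`) are made up for illustration.

```python
import base64
import bisect
import json


def encode_token(version, position):
    """Pack a version tag and a position into an opaque, URL-safe string."""
    raw = json.dumps({"v": version, "pos": position}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Return (version, position) from a token made by encode_token."""
    data = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return data["v"], data["pos"]


def list_page(items, token, page_size, legacy_order):
    """Return (page, next_token) served from the new store.

    items:        keys in the new store, sorted ascending.
    legacy_order: keys in the order the old store listed them
                  (assumption: kept around while migration is in progress).
    """
    if token is None:
        start = 0
    else:
        version, pos = decode_token(token)
        if version == 1:
            # Hypothetical legacy format: number of items already returned.
            offset = pos["offset"]
            if offset == 0:
                start = 0
            else:
                # Translate the offset into the last key the client saw,
                # then resume just after that key in the new store.
                last_key = legacy_order[offset - 1]
                start = bisect.bisect_right(items, last_key)
        elif version == 2:
            start = bisect.bisect_right(items, pos["last_key"])
        else:
            raise ValueError(f"unknown token version: {version}")

    page = items[start:start + page_size]
    has_more = start + len(page) < len(items)
    next_token = encode_token(2, {"last_key": page[-1]}) if page and has_more else None
    return page, next_token
```

The version tag is what lets the service keep honoring a token minted before the migration instead of failing the call or returning duplicates. The translation is only exact under the stated assumption that both stores order keys the same way.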

DishyDev over 2 years ago

As someone whose job involves maintaining uptime of a critical system that depends on Cosmos DB, this sort of thing is scary. When there have been other reliability issues with Cosmos before, we've not had an understanding customer base, and it feels very out of my control.

I'm finding that a lot of the reliability guarantees of Azure PaaS services are overblown or come with big caveats when you start to work with them in a serious way. For example, I've had some bad reliability issues with Azure Functions not firing, or their premium function runtimes becoming unresponsive. And it seems like that's just the start of the outstanding issues with them: https://github.com/Azure/azure-functions-host/issues

I think people need to look more carefully at these PaaS guarantees and at what the 99.999% reliability Microsoft are claiming actually means.

prepend over 2 years ago

This reinforces my idea that no one uses Cosmos because it is utter garbage.

It sounds cool, but I was surprised that after what I think should be the worst and dumbest security design flaw breach [0] there wasn't much uproar.

I thought maybe no one is using it, so there wasn't much impact.

Pushing out breaking changes without telling your customers is also explained by there not being any users (or many, since these folks found it).

Could you imagine how big of a deal it would be if a breaking change or an elevated-privileges bug were in products that are actually used?

[0] https://www.techtarget.com/searchsecurity/news/252505973/Researchers-discover-critical-flaw-in-Azure-Cosmos-DB

dharmab over 2 years ago

Back around 2017-2018 unannounced breaking changes in Azure services were so common, my team coined a term "Cloud Monday" (echoing Patch Tuesday) because usually our integration tests would break between 8-10AM Pacific Time on Mondays. (They did eventually become far less frequent.)

HorizonXP over 2 years ago

So, as someone who was in the midst of planning a migration of a platform with multi-billion-dollar revenue to CosmosDB...

Alternatives? LOL

Basically just looking for geo-redundancy and high read and write throughput. Our intention was to leverage Azure Event Grid/Kafka Connect so that event streaming coordinates writes between Redis (cache), Cosmos (transactional DB), and our systems of record (legacy). The majority of reads/writes would occur via our API, but some would occur via the systems of record, hence the use of a log-based architecture.

speedgoose over 2 years ago

I believe it is easy for well-made software to immediately detect and report what goes wrong, with Sentry, ELK, or whatever else.

So, let's say I'm woken up in the middle of the night because my black-box database as a service suddenly returns errors. If I'm not incompetent, I should have error messages and stack traces available in a few seconds. If I'm a rich cloud customer, I can call the premium cloud support and ask for an explanation. If not, I would probably have to debug it myself.

With your service, I understand that I can blame the cloud provider faster. Maybe it can make the debugging session slightly faster when your monitoring also returns errors. End users don't care whether it's my code or the cloud provider's code crashing, so it's a developer tool for emergencies. Did I understand correctly?

twodave over 2 years ago

Funny, I was just last week having an argument with one of our team leads. I'd told him to create a specific container without a partition key (which I wouldn't recommend except in certain circumstances), and he said he couldn't. I assumed he was just doing it wrong.

dec0dedab0de over 2 years ago

This seems like an accident. Microsoft should treat it as a bug, and set the default on their backend to fix it.

whalesalad over 2 years ago

This is very typical Microsoft behavior, unfortunately.

nobodyandproud over 2 years ago

For new projects, why wouldn’t anyone use postgres?

grogers over 2 years ago

The error messages from Azure are literally just "One of the specified inputs is invalid"? I get annoyed at AWS error messages because they aren't really machine readable (unless you're okay parsing a string that is subject to change), but at least they are almost always human readable with all the relevant details...

sublimefire over 2 years ago

Correct me if I'm wrong, but the article does not mention which "outdated SDK" version was used. In addition, every API call requires a version, which is not mentioned in the article [1].

It is not clear to me whether the issue was an old SDK using the newest API version in its calls, or something else.

[1] https://learn.microsoft.com/en-us/rest/api/cosmos-db-resource-provider/2021-04-01-preview/sql-resources/create-update-sql-container?tabs=HTTP

atraac over 2 years ago

I have a similar opinion to some of the other comments. Some Azure services, like Application Insights, I absolutely loved; some I hated, CosmosDB being one of the latter.

They needed years to finally introduce PATCH in CosmosDB. Request Units feel like they're obscure on purpose to hide the insane cost of using this storage. Stored Procedures can only be used on one Partition Key (while /id is the default...). Requests would often fail with 429 Too Many Requests when the container was set to Autoscale with obscene limits that were never hit.

Just set up Marten with Postgres and get it over with for a fraction of the cost.
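
For readers who run into the 429 responses mentioned above, a common client-side mitigation is to retry with exponential backoff. The sketch below is generic and not tied to any Azure SDK: `ThrottledError` and `call_with_backoff` are hypothetical names, and the `retry_after` field stands in for whatever wait hint the server may send.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the database."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            if err.retry_after is not None:
                # Prefer the server's hint when one is given.
                delay = err.retry_after
            else:
                # Exponential backoff with full jitter, capped at max_delay.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
```

Retrying does not address the underlying complaint (throttling below the configured limit), but it keeps transient 429s from surfacing as request failures.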

didip over 2 years ago

I feel like Azure should just give up on Cosmos DB and go all in on managed Citus DB.

CiTyBear over 2 years ago

Breaking changes are very common with the Python Azure SDK. The first versions were not PEP 8 compliant, so when they decided to respect it, everything broke. Azure servicebus in Python went from version 0.50.3 to 7.0.0 with almost everything renamed, classes moved, and so on.

yellow_lead over 2 years ago

I find it pretty interesting that this company/product (Metrist) has created a monitoring tool for different cloud products because the providers' own monitoring is so bad. Honestly a good idea, but a bit sad that these companies can't do this themselves.

jmartens over 2 years ago

We used our own product to learn about and debug the issue. It's rather wild that they'd roll out this change so incrementally, which my colleague outlines here.

hupt over 2 years ago

Cosmos was originally created for hosting massive datasets internally within Microsoft. For example, they use it for the OS telemetry sent in from customer machines and for raw threat-intelligence data. As part of Microsoft's move of everything hosted on-premises to their cloud, they decided to open up Cosmos to other users of Azure. But the primary customer is, and will likely always be, Microsoft themselves. That is probably why we see these breaking changes: they are most likely made in response to some internal ticket.
