Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can’t add instances to our k8s cluster either. Azure has told us they’ll have more capacity in April 2023(!), but we’ll have to stop accepting new users in ~35 days if we don’t get more before then. These problems seem limited to the German region, but setting up in a new region would be complicated for us.<p>We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.<p>Is anyone else experiencing these problems?
> We never thought our startup would be threatened by the unreliability of a company like Microsoft<p>You're new to Azure, I guess.<p>I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.<p>One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.
Oof, that sucks and I feel for you. That said...<p>> setting up in a new region would be complicated for us.<p>Sounds to me like you've got a few weeks to get this working. Deprioritize all other work and get everyone working on this little DevOps/infra project. You should've been multi-region from the outset, if not multi-cloud.<p>When using the public cloud, we tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, it can run out.<p>Anyway, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. You may also want to look into other cloud providers, depending on how practical or how overkill going multi-cloud may be for your needs.<p>I wish you luck.
This is nothing new; Azure has been having capacity problems for over a year now[1]. Germany is not the only region affected at all; it's the case for a number of instance types in some of their larger US regions as well. In the meantime you can still commit to reserved instances, there is just no guarantee of getting those instances when you need them.<p>The biggest advice I can give is: 1. Keep trying and grabbing capacity continuously, then run with more than what you need (a rough sketch follows below). 2. Explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.<p>1. <a href="https://www.zdnet.com/article/azures-capacity-limitations-are-continuing-what-can-customers-do/" rel="nofollow">https://www.zdnet.com/article/azures-capacity-limitations-ar...</a>
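For point 1, a minimal sketch of the "keep trying" loop using the Azure SDK for Python (azure-identity + azure-mgmt-compute); the subscription ID, resource group, scale set name, and retry interval are all placeholders, not a battle-tested recipe:

    import time
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient
    from azure.core.exceptions import HttpResponseError

    client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

    def grab_capacity(resource_group, scale_set, target, interval_s=300):
        """Retry growing a VM scale set until the target capacity sticks."""
        while True:
            vmss = client.virtual_machine_scale_sets.get(resource_group, scale_set)
            if vmss.sku.capacity >= target:
                return
            vmss.sku.capacity = target
            try:
                client.virtual_machine_scale_sets.begin_create_or_update(
                    resource_group, scale_set, vmss
                ).result()
            except HttpResponseError as err:
                # Usually an allocation failure; wait and try again.
                print(f"Allocation failed, retrying: {err.message}")
            time.sleep(interval_s)

    grab_capacity("my-rg", "my-vmss", target=20)  # hypothetical names

Once an allocation sticks, hold on to it; capacity you release may not come back.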
I worked briefly in an enterprise-facing sales organization that targeted multi-cloud deployments. Azure always had capacity problems.<p>As ridiculous as it sounds, having an enterprise's applications span multiple clouds isn't terrible if the application is mission-critical: not only does this get around Azure's constant provisioning issues, it protects the organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, though, especially when you consider the reason you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than on Azure for an Apache Spark cluster running for a few minutes or hours; mostly irrelevant if your data lives in Azure Data Lake Storage).
Every cloud provider will have these issues with specific instance types in specific regions, although the Azure Germany situation sounds perhaps a bit more dire. At my past (much larger) employers we always ran into hardware capacity issues with AWS too; we were just able to work around them.<p>Building on cloud requires a lot of trade-offs, one being the need for very robust cross-region capability and the ability to be flexible about which instance types your infrastructure requires.<p>I’d use this as a driver to either invest in making your software multi-regional or cloud-agnostic. Multi-regional will be easier. If you’re already on k8s, you have a head start here.
Yes, it’s weird that you have to ask them for instances, and some actual person looks at your request, thinks about it, and says yes or no.<p>Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then, within several hours, someone emails back to say yes or no.<p>If it’s no, you have to guess again where you might get the instance you want, and email them again to ask.<p>I envisage going to an old shop and asking the shopkeeper for a compute instance in a region. He hobbles out the back and, after a long delay, comes back and says, “Nope, don’t have no more of them. Anything else you might want?”<p>It’s surprising that this is how it works. Not the auto-scaling image cloud computing used to bring to mind.
I am sorry to say but at this point Azure is so f’ed up I think it should only be considered after AWS and GCP.<p>The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.
This is not as rare as public clouds may lead people to believe. I have had to move workloads around since AWS began (even between public clouds on occasion).<p>In particular, GPU availability has been a continuing problem. Unlike x64/arm64 instances, which are interchangeable with some adjustments for core and RAM counts, if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations, but the problem persists.<p>I recommend doing the work to make the business somewhat cloud-agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds, but you mentioned databases and k8s clusters, which are both supported elsewhere.
I used to be a technical seller for Azure. This situation is obviously not great for you as a customer but there are proactive steps you can take to prevent this going forward. Reach out to your sales team and work with them on your roadmap for compute requirements going forward. The sales team has a forecast tool that feeds back into the department that buys and racks the equipment. If you can provide enough lead time, they will make sure you have compute resources available in your subscriptions.
What VM sizes?<p>Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general-compute SKUs (D/Da/E/Ea/etc.) have plenty of capacity in many regions.<p>If changing regions sounds like a pain, consider gambling on other VM sizes; a sketch of how to check their availability follows below.<p>(azure employee)
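To see which SKUs are even allocatable for your subscription before gambling, something like this works; it's a sketch using azure-mgmt-compute, and the region name is just an example:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
    region = "germanywestcentral"  # example region

    for sku in client.resource_skus.list(filter=f"location eq '{region}'"):
        if sku.resource_type != "virtualMachines":
            continue
        # SKUs carrying restrictions (e.g. NotAvailableForSubscription)
        # can't be allocated by this subscription in this region.
        if not sku.restrictions:
            print(sku.name)

Note this tells you what your subscription is allowed to allocate, not whether the datacenter physically has capacity right now; for that you still have to attempt the allocation.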
> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.<p>Yikes, this is totally the first thing you need to come to expect when working with MSFT.
Most of Europe expects this winter to be quite painful from a power perspective. It would not be surprising if cloud providers (major power users) are being asked not to increase (or even to decrease) power usage.<p>The timeframe they gave would match that kind of ask.<p>I wonder whether you see the same behavior from other cloud providers there (i.e., if you ask them whether new capacity is available, what do they say?).
Message to cloud providers:<p>List what you do have available so we can choose.<p>Do not force users to randomly guess and be refused until eventually finding something available.
Interesting semi-confirmed anecdote: when lockdown hit, Azure began to refuse to allocate servers. One of the main reasons was they prioritised servers in this way:<p>1. Government/health/defence cloud customers<p>2. Teams, which was exploding in use and they wanted to capitalise on it<p>3. Regular cloud customers
Good news is that today is Black Friday, so the e-commerce industry is running at peak capacity. In 30 days it will be Christmas, and by then (the very latest!) everybody will scale back, so you have a good chance to gain access to more compute before you reach the end of your runway.
> We never thought our startup would be threatened by the unreliability of a company like Microsoft<p>You will be threatened by your own unreliability in building something that's dependent on one region or one cloud.
I've seen this before. I think it was in us-west1; we ran out of VMs of the size we used for CI and had to move to a different region. (Never moved back…)<p>It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning solved is part of what I am paying for in the higher price of the VM.<p>> <i>We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.</i><p>Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to <i>react</i> is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)
I'm baffled to read stories that suggest Azure is a viable competitor to GCP/AWS - they're an absolute nightmare on capacity.<p>It took me six months to get approved to start six instances! With multiple escalations, including being forcibly switched to invoice billing - for which they never match payments to invoices automatically, so every payment requires us to file a ticket.
Azure Germany is a separate partition from the rest of Azure - presumably for compliance reasons. This is distinct from AWS, where Frankfurt is just another region, albeit one with high demand.
We have had this issue since 2018: <a href="https://www.opencore.com/blog/2018/6/cloud-has-a-limit/" rel="nofollow">https://www.opencore.com/blog/2018/6/cloud-has-a-limit/</a><p>That said: we also had this issue on GCP last month.<p>We found that all three (AWS included) are unreliable in their own ways.
I’m sure Microsoft is just as surprised as you are. Almost every European facility I ever worked with was constrained by either space or power so you had to be really on top of your capacity management. Facilities in the US seem to have unlimited power and floor space so you never have to deal with either issue.
Who else has heard countless times something like "with company X's cloud platform you don't need to file a ticket and wait weeks for another team to provision a physical server, just spin some more up bro." The reality is you do, you've just outsourced the problem.
EC2 us-east-1 is chronically stocked out, too. Black Friday is the worst day of the year for this. At work, we pre-allocated tons of EC2 machines we don't really need, to hedge against EC2 stockout coinciding with some kind of incident. Yes, we are part of the problem.
This is a bit tangential, but now might be a good time to experiment with raising the price of your product. It might extend the time you have until you have to stop accepting new users entirely, in case your migration takes longer than expected.
Ran into a similar issue last year in the East US region. We contacted support and they gave a similar response. From my understanding, talking to people who use AWS and GCP, this isn't uncommon across cloud platforms.<p>While we could have just swapped a deployment parameter to deploy to another region, we opted to use a different SKU of VMs for a short period and switch back when the original VMs were available again.<p>We haven't seen issues since.
Get in touch with your CSAM (Customer Success Account Manager). They will be able to get you assigned a capacity manager, if you don't already have one assigned.<p>It is the function of the capacity manager to help you plan ahead based on what the data center capacities look like going into the future.<p>Meet monthly with your capacity manager. Get representation across different technology interests - database, compute, storage, event hubs, etc. Don't ever skip these meetings.
Ask your VCs/angels for help, this is the kind of thing they can definitely help with.<p>(Speaking from experience - one of our portfolio companies had a similar challenge and we used our network to get to one of the execs of the vendor involved)
One of the biggest benefits of k8s is that you can easily mix in pools of different hardware types without a “rebuild”.<p>Something to try in scenarios like this is to add the “weird and wonderful” VM SKUs that are less popular and may still have capacity remaining (see the sketch below).<p>For example, the HPC series like HBv2 or HBv3. Also try Lsv3 or Lasv3.<p>Sure, they’re a bit more expensive, but you only have to use them until April.
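As a sketch of what that looks like on AKS with azure-mgmt-containerservice (the cluster, resource group, and pool names are made up; the SKU is just one of the less-popular options above):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.containerservice import ContainerServiceClient
    from azure.mgmt.containerservice.models import AgentPool

    client = ContainerServiceClient(DefaultAzureCredential(), "<subscription-id>")

    # Add a second node pool on a less-contended SKU alongside the existing pools.
    poller = client.agent_pools.begin_create_or_update(
        resource_group_name="my-rg",
        resource_name="my-aks-cluster",
        agent_pool_name="hpcpool",
        parameters=AgentPool(
            count=3,
            vm_size="Standard_HB120rs_v3",  # HBv3: pricier, but less contended
            mode="User",                    # workload pool, not the system pool
        ),
    )
    print(poller.result().provisioning_state)

Then steer workloads onto the new pool with node selectors or taints/tolerations so the scheduler doesn't have to guess.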
There is no such thing as unlimited when it comes to resources and/or scalability in the hosting market. You might want to find a local colocation provider and buy a few network switches and servers as a secondary production and backup environment for your startup. Deploying your own infrastructure gives you full control. Yes, it will raise your overhead, and no, it's not cheap, but for a sustainable operation it's a requirement in my opinion. I currently use Azure, but I also have my own deployment with my own IP addresses and ASN, where I keep spare capacity and some important servers in case something happens with Azure. Definitely helps me sleep better at night.
I’ve heard from a friend who works at Microsoft that, due to the energy crisis in Europe plus its data-locality laws, Microsoft is indeed running short on datacenter capacity there and can’t do anything about it no matter how much they are willing to spend.
While some may immediately run a comparison between Azure, AWS, and GCP, let it be noted that any cloud platform facing this and making headlines is not good for the cloud industry overall.
I remember, in the early days of the pandemic, that Azure Australia ran out of compute too. It happens at the regional level.<p>Are you restricted to the German region, or can you go to other European regions?
>We never thought our startup would be threatened by the unreliability of a company like Microsoft<p>Had you never heard about (and this is unfortunately not a joke) Microsoft’s music service they once had, shut down after a few short years, leaving customers without the ability to listen to the music they had paid for?<p>Its DRM scheme was called, and this was the trademarked name, Microsoft “PlaysForSure.” You cannot make this stuff up.
Azure, despite being smaller than AWS, has more regions, I think. So each one must be smaller, which likely means less spare capacity.<p>I also suspect the spot market is less robust there. Lots of Azure usage is lift-and-shift on-premises workloads, and those aren't using spot. Without people using spot, it's even harder to have spare capacity...
We have the same issue and escalated it through multiple Azure teams.<p>Our quota was silently set to 0 while there were still instances running. This worked fine until auto-scale scaled the instances down to 1 during the night. At the start of the day, auto-scale was not able to scale back up to the initial amount, which led to heavy performance issues and outages. We needed to move the instances, as Azure support did not help us.
After many calls with Azure and multiple teams involved, we ultimately did not get the quota approved (even though we had it before and were not asking for "new" quota).<p>We also decided not to host in the German Azure region anymore. Even if we could get the quota back, not being able to scale for unexpected traffic is a business risk we don't want to bear anymore.<p>This is huge for us, as our application requires German servers. We are still researching where to host in the future.
Sad to hear that, but people have the wrong idea about the cloud: it's just other people's hardware, and like everything, there's a limit.<p>They cannot warn you because it's very hard to predict how many new customers will come or whether existing ones will create more instances.<p>I know of a bank with the same issue; basically, they've hogged all the resources in a specific region and yet they need more. Unfortunately these things take time; MS cannot set up a new datacenter in a couple of days.<p>>but setting up in a new region would be complicated for us.<p>Why? It's easy: <a href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/move-resources-overview" rel="nofollow">https://learn.microsoft.com/en-us/azure/azure-resource-manag...</a><p>Latency issues from app to DB?
> but setting up in a new region would be complicated for us.<p>I've never done k8s on Azure, but my understanding is that Azure is pretty good about coordinating between your own datacenter running Windows and Azure. Maybe you can spin up some Windows boxes in a cheap datacenter to make it work?
Yes, my company found this out trying to add both a database and a serverless app to our existing infrastructure in Germany West Central in July. They had no ETA for more GWC capacity back then and told us to move to the North and West Europe regions.
Just don't trust the marketing and save yourself a lot of money. On-prem for all baseline or long-term (6+ months) resources; cloud only for peaks. And never use features that depend on a single cloud provider. Then you will never have such troubles at all.
"There's no such thing as cloud - it is always someone's else computer". Although we may try to rely on the unwillingness of the cloud provider to lose revenue, probability of events like this can never be fully discounted.
My understanding is that the German region is not run by Microsoft, but a German company. This provides a legal shield required by Germany to try and prevent the US government from accessing data on those servers.
We've been having this problem in Singapore for a couple of years now. Can't add any VMs to our k8s cluster and can't provision a number of services which made our multi-region BCP more complicated.
>These problems seem only in the German region, but setting up in a new region would be complicated for us.<p>This seems like your fundamental problem. If you design an architecture that is limited to a single region of a single cloud provider, you are very likely to encounter issues at some point.<p>Luckily you have a full month to solve this problem before it will prevent you from accepting new users. My suggestion is to start making your app multi-regional or multi-provider ASAP.
I worked at a top-15 Azure customer. This is not unusual at all, especially in the newer regions. Talk to your TAM before you attempt major capacity changes in a region. They may have advice on specific SKUs to use or on which zones have capacity (e.g., when australiaeast was being built, 80%+ of the capacity was in one zone for many months).<p>If you aren't a big spender you may not have a TAM who can get this info for you. Welcome to Azure.
While you're at it making your infrastructure-as-code cloud-agnostic, perhaps take a look at tools like Terraform (the only one I'm familiar with). I've just started the work of defining whatever we need to provision in its notation, with the objective that it can be done with a single push of a button in the future.
In Norway East, Azure was incapable of provisioning new VMs for several (4-5) days, caused by some IP issue. The only solution was 'try to provision in the night, and don't turn it off if you get one'. Their status page showed green through the whole period, though, even though nothing needing compute worked. So that was cool...
Stockouts have happened on both AWS and GCP too. Most of the time the problem stops being a problem if you build your infrastructure not to rely on a single region or availability zone. On EC2 especially, even if you can't change to a different region, try changing to a different instance type; that might work.
Is this related to the hardware shortage during the pandemic? I'm assuming they couldn't scale at the rate they intended pre-pandemic.<p>This seems like a much larger issue than they're making it seem. The promise of the cloud was unlimited scalability; I never thought of cloud resources as finite.
I don't have much knowledge about Azure, but is it possible to add different instance types and/or sizes? E.g., in the EC2 world, if AWS was out of m5.xlarge I would try to add a worker group with m6i.xlarge or m4.xlarge (roughly like the sketch below). If that did not work, I might try to replace my xlarge with 2xlarge...
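Roughly this, as a boto3 sketch; the AMI ID, region, and fallback list are placeholders:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="eu-central-1")
    FALLBACKS = ["m5.xlarge", "m6i.xlarge", "m4.xlarge", "m5.2xlarge"]

    def launch_with_fallback(ami_id, count=1):
        """Walk the fallback list until one instance type allocates."""
        for itype in FALLBACKS:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id, InstanceType=itype,
                    MinCount=count, MaxCount=count,
                )
                return resp["Instances"]
            except ClientError as err:
                # Capacity shortages surface as InsufficientInstanceCapacity;
                # anything else is a real error and should propagate.
                if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise
        raise RuntimeError("no capacity for any acceptable instance type")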
Sort of. I have a Postgres Flexible Server database in the Germany West Central region that can no longer be scaled. It was only created for testing purposes, so no biggie. The backend is basically a managed compute resource.<p>If you need more reliability, I see only one way out: go multi-region or even multi-cloud.
Infinitely scaling clouds, they said. In AWS at work we spin up large numbers of EMR nodes, and every few days we get stuck waiting for availability of certain instance types in our region too. I guess we could reserve more, but that defeats a lot of the scale-up-and-down advantage.
Worth repeating: AWS, Azure, and GCP are all adding capacity and new datacenters as fast as they can. We have enough demand to drive the next two generations of leading-edge nodes, that is, TSMC N3 and N2. And I assume it will be similar at N1.4 or 14A.
I think there is a general rule in business that you should not <i>depend</i> on a provider for whom losing your business would be less than one percent of their revenue. Or be ready for when they drop the guillotine.
They basically have far too many small regions and are growing like crazy; multi-region deployments will be a must, unfortunately.<p>Maybe you can spin up the parts of the infrastructure that are not latency-sensitive in a nearby region?
"Infinite resources" is only marketing, and no hyperscaler on the market should ever promise that or give people that impression unless they have solved scaling throughout the entire supply chain.
Reading these comments, it looks like everyone runs into this all the time. As a counterpoint: we have never run into this on Azure, scaling 20-30 VMs up and down a day. Hope it stays that way...
As part of launching our global GPU edge network, we need to support low-volume regions, which means a small number of T4 GPUs in different time zones. Azure ran out last Christmas, or at least refused us capacity, and is only adding the next tier of A10s (~2x+ costlier?). We haven't had as much of a problem getting GPUs of different grades on GCP + AWS. I get a form email every two weeks from Azure IT saying they are working on it. Not as much of an issue for bigger GPUs.<p>(Also... if you're into k8s, Python, GPUs, graphs, viz, MLOps, working with sec/fraud/supply-chain/gov/etc. customers on cool deploys, and looking for a remote job, we are hiring for someone to take ownership here!)
Why not just create a bigger DB instance in another region for a few months? Sure, you'll take a performance hit, but 99% of users won't notice or care.
Maybe they are doing this to push people into regions with lower energy costs. Of course Northern Virginia or Canada is going to give you much higher ping times.
If this is a serious problem for your business, you use K8s and require assistance quickly moving your workloads, consider contacting:<p><a href="https://www.giantswarm.io/" rel="nofollow">https://www.giantswarm.io/</a><p>(I work at Giant Swarm.)
People saying "shame on you for not being multi-region" are missing the point: This is a German company with German customers subject to German data residency laws. For them to store German data in a region besides Germany requires getting informed consent from the "data subject", who must be "pre-informed about the potential risks involved in cross-border data transfer". [1] This is why Azure has a dedicated German partition, just as it has a dedicated Chinese partition.<p>Now, they could go the GDPR/Cookies route and prompt absolutely every user on pageload, but doing so would annihilate the purpose of the law into monotonous smithereens, just as it did with Cookies. Good on them for defaulting to the "more secure" mode, but yes this is a potential consequence.<p>Happy to hear from any German amigos present if I've got something wrong. (But watch out... you might be putting HN at risk - their servers aren't (likely) in Germany!)<p>[1]: <a href="https://incountry.com/blog/which-german-data-privacy-laws-you-need-to-comply-with/" rel="nofollow">https://incountry.com/blog/which-german-data-privacy-laws-yo...</a>
The Batch service schedule history monitor sucks. It is inaccurate and doesn't sync the job order correctly. You can call them; they will get on the phone and say they fixed it. Then you call them again because they didn't, and they give you the same answer. Can't blame them; most of them are on H-1Bs. Nobody wants to be the squeaky wheel in that position. So you just get the runaround all the time.
>We never thought our startup would be threatened by the unreliability<p>Daily reminder that cloud services are vastly less reliable than traditional hosting; it’s just that they manipulate the terminology to deflect from that, replacing reliability with availability, a.k.a. “giving the impression of working”.
I am so glad we made the decision to pull <a href="https://Bigger.Bio" rel="nofollow">https://Bigger.Bio</a> off azure a while ago. It was nothing but problems on their platform.
I'm having a tough time with Microsoft also.<p>They seem to ignore, then repent, and finally apologise. :(<p>I think you should switch to new compute. GCP?<p>When we were running our own compute back in '09 and resources ran out or were unreliable, we could shout at the server maintainer and/or install better hardware ourselves. Not the case anymore. :( :((<p>-Vip
Ha. I knew something like this would happen eventually. Isn't limitless scalability one of the biggest selling points of using "the cloud"? If you have to buy your own computers anyway, why even use the cloud? You could try using different cloud providers, but eventually the clouds run out.<p>Which brings me to another important point. If we run out of computers, meaning supply can't keep up with demand, then who are the winners? The people who own the computers: cloud providers and self-hosters. Because of the high demand, cloud providers can raise their prices, and that's directly converted to profit since expenses remain the same, i.e., price gouging. Good job, all you cloud loyalists who use the cloud for everything.