We have an application that requires a document store. It needs to hold up to 10 million documents and be accessible via an API so our application can retrieve and display them.<p>We are considering everything from Dropbox-type solutions to blob storage in GCP.<p>What kind of document storage solutions are people using in 2023 to meet this use case?
I use B2 and Wasabi because I don't like relying on a single cloud provider. Files are uploaded to both. OpenResty (Nginx+Lua) sits in front to provide caching and the logic for deciding which provider to pull from.<p>Wasabi gives you a free bandwidth allowance equal to the number of bytes stored per month. When I use up most of that, I start pulling from B2. And of course, if one of them is down, I pull from the other.<p>It takes more time up front to build than relying completely on GCP/Azure/AWS, but I don't have to worry as much about spontaneous account terminations destroying my business.
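For anyone curious what that selection logic might look like, here is a rough Python sketch (the real setup described above is Lua inside OpenResty, so this is not the commenter's code). The function name, the 0.9 cutoff standing in for "most of that", and the idea of passing health flags in as booleans are all assumptions made for illustration.

```python
from typing import Optional


def choose_provider(
    wasabi_stored_bytes: int,
    wasabi_egress_bytes: int,
    wasabi_up: bool = True,
    b2_up: bool = True,
    threshold: float = 0.9,  # assumed cutoff for "most of" the allowance
) -> Optional[str]:
    """Pick which provider to pull a file from: 'wasabi', 'b2', or None."""
    # Per the comment above: the monthly free egress equals the bytes stored.
    allowance = wasabi_stored_bytes
    wasabi_has_budget = wasabi_egress_bytes < threshold * allowance

    # Prefer Wasabi while it has free bandwidth left, or when B2 is down.
    if wasabi_up and (wasabi_has_budget or not b2_up):
        return "wasabi"
    # Otherwise fall back to B2 if it is reachable.
    if b2_up:
        return "b2"
    # Both providers are down; the caller has to serve from cache or fail.
    return None
```

In practice the egress counter would have to be reset monthly and the health flags fed by some kind of probe; both are left out here.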
The question is too generic to answer well. Do you need global availability? Do you need high-speed downloads? Are you worried about bandwidth costs? And so on.<p>We use S3 + CloudFront for documents that our customers need to access quickly. We use SFTP for internal docs where we don't care that much about availability and speed.
I would go with an S3-compatible object store by default.<p>Among open-source options, Ceph and MinIO are common. Garage is newer, has a simpler design, and shows good potential too.<p><a href="https://ceph.com/en/" rel="nofollow">https://ceph.com/en/</a>
<a href="https://min.io/" rel="nofollow">https://min.io/</a>
<a href="https://garagehq.deuxfleurs.fr/" rel="nofollow">https://garagehq.deuxfleurs.fr/</a>
The file system was designed to hold documents and does a pretty good job of it. There are several to choose from depending on which OS you run. Backing them up and restoring them is easy. An API to retrieve documents is trivial to write and customize, and there are a few tools and APIs already available.
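To give a sense of how small such an API can be, here is a minimal sketch using only the Python standard library. The document root `/srv/documents`, the URL scheme (request path maps directly to a file under the root), and the names `resolve_document`, `DocumentHandler`, and `serve` are all made up for the example. It leaves out authentication, content types, and streaming of large files.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import Optional
from urllib.parse import unquote

DOC_ROOT = Path("/srv/documents")  # hypothetical storage directory


def resolve_document(root: Path, doc_id: str) -> Optional[Path]:
    """Map a document id to a file under root, or None if it is not valid."""
    root = root.resolve()
    candidate = (root / doc_id).resolve()
    # Reject anything that escapes the root (e.g. "../secret") or is not a file.
    if root not in candidate.parents or not candidate.is_file():
        return None
    return candidate


class DocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Treat the URL path (minus the leading slash) as the document id.
        path = resolve_document(DOC_ROOT, unquote(self.path.lstrip("/")))
        if path is None:
            self.send_error(404, "Document not found")
            return
        data = path.read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Serve documents on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), DocumentHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/reports/q1.pdf` would return `/srv/documents/reports/q1.pdf` if that file exists. The path check in `resolve_document` is the part worth keeping in any real version.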
There are a number of fine options for blob storage (S3, R2, Ceph, Azure Storage, etc.), but with that many documents it's likely access control and audit logging will be important. If that's the case, something heavyweight like SharePoint may be a better choice.
One possibility is to use our open-core document management API, built to deploy in your AWS account: <a href="https://github.com/formkiq/formkiq-core">https://github.com/formkiq/formkiq-core</a><p>The files are stored in S3, with customizable metadata storage in DynamoDB. As the system is designed to run on AWS Serverless and Managed Services, the majority of the cost will come from S3 storage fees.