Dropbox is distributed. According to the article, it uses AWS, which is a Dynamo-based system. Among its other features, Dynamo lets you distribute data across many servers, using a hash of the data's key to look it up (each server gets some of the keyspace).<p>Riak is a similar type of system.<p>Dropbox is "centralized" in the sense that it is one service, but that's not the opposite of distributed, which would mean "running all on one computer."<p>Edit: I said "hash of the data's key" but really it's a hash of the key plus the bucket.
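To make the "each server gets some of the keyspace" idea concrete, here is a minimal sketch of a consistent-hash ring. It is only an illustration of the general technique, not Dynamo's or Riak's actual implementation: the `HashRing` class, the server names, the `vnodes` parameter, and the choice of SHA-1 are all assumptions made for the example. Following the edit above, the lookup hashes the bucket plus the key.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Map a string to a large integer position on the ring (SHA-1 is an assumption).
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical consistent-hash ring: each server owns slices of the keyspace."""

    def __init__(self, servers, vnodes=64):
        # Place several virtual points per server on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, bucket: str, key: str) -> str:
        # Hash bucket + key, then walk clockwise to the next point on the ring.
        h = _hash(f"{bucket}/{key}")
        idx = bisect.bisect_right(self._points, h) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.lookup("photos", "cat.jpg")  # always the same server for this bucket/key
```

Any client that knows the server list can compute the owner locally, so no central lookup table is needed.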
> The student persisted and kept repeating that "Dropbox has a bottleneck because it is a centralized storage solution, and the distributed solution doesn't have that bottleneck". I couldn't believe my ears.<p>The student is correct. Let's ignore the fact that Dropbox is actually distributed and say it is centralized because all nodes of the system belong to one provider. The only way Dropbox could have scaled to 200m users was with tons of cash. In a distributed solution where each node is a provider itself, each additional user could potentially increase the performance of the system. The distributed alternative scales much more gracefully without running into the bottleneck of needing more cash to buy more machines/storage/bandwidth. In this particular frame, distributed is most definitely always more scalable than centralized unless you have unlimited cash.
1) Dropbox is distributed.<p>2) This article doesn't actually make any argument about why a centralized system can scale as well as a distributed one.
> You can employ Paxos to replicate the centralized server. In contrast, it is often much harder to design and add fault-tolerance to a distributed system.<p>OK, am I missing something? So we are employing Paxos to replicate the centralized server. Are we replicating it to itself? Because if we are not, we've got ourselves a "distributed" system.
My hunch is that the student is frustrated because Dropbox sync speeds are sometimes less than the network line speed (maybe due to the agent having to scan the filesystem to look for changes, or because the agent is syncing many small files, or because Dropbox or the ISP or anyone in the middle is throttling the connection). This is particularly noticeable if you sync a new computer on a different network from the rest of your Dropbox machines (say, an EC2 VPS, or on a university network away from home) because when you're on the same network, LAN sync is often used for a large portion of the initial sync.<p>I suspect the student thinks that distributing his/her files among his/her friends and/or multiple services (bittorrent-style) will allow him/her to increase throughput -- however, I suspect it will merely increase complexity (and possibly also cost) without actually making syncing/back-up faster.
Dropbox is centralized at the organizational, jurisdictional and other levels, whilst technically it may employ distributed resources. It's not incorrect to point at this centralization as a risk, both in terms of availability and scalability.<p>This is really an industry-wide problem begging for a neat solution. Software eats middle management! (Devops => Devmangops? Mmm... mangoes...) Perhaps the world needs an open source tool in the organizational management/risk space that models business-level risk based upon commercial as well as technical infrastructure.<p>Perhaps the best model for developing such a capacity is a generic exchange protocol with plugins for risk management? My initial brainstorming is at <a href="http://www.ifex-project.org/our-proposals/ifex" rel="nofollow">http://www.ifex-project.org/our-proposals/ifex</a>
I think the actual confusion is about a centralized distributed system vs a peer-to-peer distributed system, which is probably what the (still totally wrong) PhD student meant.
It's not really clear to me what part of Dropbox isn't distributed (in the sense that it's hosted on multiple computers). The data is distributed and the processing is distributed.
Do they mean it has a central controller/router or something of that kind?
Chatty software able to synchronize state over the open internet using declarative concurrency is a distributed system. A high-performance cluster running something like message-passing concurrency in Erlang is a distributed system. A single program written with the complexity of shared-state concurrency executing over multiple cores is a distributed system. The concept of concurrency is vital for this, particularly what type of concurrency is used. When this person talks about distribution, what kind of concurrency is he referring to? I'd like to see this professor reimplement Dropbox for sequential execution on a single CPU to serve the world (you can only use shared state, or any other form of concurrency, if you do it on the same CPU). This centralized system should then be fault tolerant, which it absolutely will not be, as you need at least two machines for fault tolerance. This article was a waste of time.
Distributed is often much more difficult to scale than centralized, especially because you need on the order of n^2 messages for the system to reach consensus.<p>Distributed tends to produce higher availability than centralized systems, and often that is worth the cost.
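As a rough illustration of where the n^2 comes from, here is a small sketch that counts messages per round. It assumes a naive scheme in which every node messages every other node, and contrasts it with a single coordinator talking to each follower. Both function names are made up for the example, and real consensus protocols differ in how many rounds and messages they need.

```python
def all_to_all_messages(n: int) -> int:
    """Messages in one round where each of n nodes sends to every other node."""
    return n * (n - 1)


def coordinator_messages(n: int) -> int:
    """Messages in one round trip between one coordinator and n - 1 followers."""
    return 2 * (n - 1)


for n in (3, 10, 100):
    # Quadratic growth for all-to-all, linear growth for the coordinator scheme.
    print(n, all_to_all_messages(n), coordinator_messages(n))
```

The coordinator scheme is cheaper in messages, but it reintroduces a central point that has to be made fault tolerant itself.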
Yeah, AWS is not Dynamo-based. Dropbox uses a bunch of MySQL and S3. It is hugely distributed, and they have to spend a lot of human resources keeping it up.