TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Sharding Pinterest: How we scaled our MySQL fleet

144 pointsby jparisealmost 10 years ago

21 comments

devmachalmost 10 years ago
As I read on their blog, Instagram uses similar logic with PostgreSQL. If anyone interested, it&#x27;s accessible on <a href="http:&#x2F;&#x2F;instagram-engineering.tumblr.com&#x2F;post&#x2F;10853187575&#x2F;sharding-ids-at-instagram" rel="nofollow">http:&#x2F;&#x2F;instagram-engineering.tumblr.com&#x2F;post&#x2F;10853187575&#x2F;sha...</a>
评论 #10088711 未加载
boomzillaalmost 10 years ago
It seems mysql (and hopefully postgresql sometime soon) with custom sharding logic in the app layer still hits the sweet spot for scaling to the order of 100M users. With some thoughts going into designing an appropriate data model and sharding logic, certain join queries can be delegated to the databases too.
评论 #10087311 未加载
philsnowalmost 10 years ago
&gt; To edit a Pin, we read-modify-write the JSON under a MySQL transaction:<p>I&#x27;ve seen a different approach where you keep a version number on the row, do your read, modify in memory on the app server, then do your write like this<p><pre><code> UPDATE db03429.pins SET blob=’&lt;modified blob&gt;’ WHERE local_id=7075733 AND version=53 </code></pre> then look at the result and make sure that it modified one row. If it returned zero rows, you retry (or show a failure to the user, whichever is appropriate for your use case).<p>The reason you&#x27;d do this is so you can&#x27;t ever have the row locked for a long period of time. A lot of people don&#x27;t think about database scalability so _even if they know_ that the code they&#x27;re writing runs while a transaction is held open, they don&#x27;t care that that transaction is blocking anything else that is trying to read the row they&#x27;re working on.<p>This can lead to row lock bloat over time, which can cause scalability &#x2F; availability issues as app servers wait longer and longer to get read (or write) locks on all the rows they care about for their current request. This is mitigated a bit if you&#x27;re requiring &#x2F; encouraging people to read from slaves instead of master, though.
philsnowalmost 10 years ago
&gt; We only interact with the master in production. You never want to read&#x2F;write to a slave in production. Slaves lag, which causes strange bugs.<p>These can be worked through, with discipline. You probably only need a fully consistent few on a surprisingly small number of pages.<p>Generally you&#x27;ll want to read from the master right after a user changes something (if you have a model where user clicks a button -&gt; you go to a different page). User changes setting on a pin, clicks &#x27;save&#x27;, you render a new page showing their updated pin. This page view should probably come from the master, or else you risk the user&#x27;s change not showing up, causing confusion.<p>Reads from slaves are fine, as long as you&#x27;re not using something that was read out of a slave as an input to a database write somewhere else (which you shouldn&#x27;t be doing anyway!). If you render a page that lets a user change their profile (say site.com&#x2F;edit_profile), the user data can come from a slave, but if you take _all_ the field values and blindly write those into the master, that&#x27;s where you run into &quot;time travel&quot; bugs. You just need to find out what the user changed and only make those changes in the master.
misiti3780almost 10 years ago
Can someone explain to my why people are still recommending mysql over postgres? This is a serious question, it just seems that Postgres has more features and I cant think of any good reasons mysql would scale any differently other than it has been along a little longer (so there are more blog posts + experienced engineers) ?<p>&quot;MySQL is mature, stable and it just works. Not only do we use it, but it’s also used by plenty of other companies pushing even bigger scale. MySQL supports our need for ordering data requests, selecting certain ranges of data and row-level transactions. It has a hell of a lot more features, but we don’t need or use them. But, MySQL is a single box solution, hence the need to shard our data. Here’s our solution:&quot;<p>what about that paragraph is not true of postgres also ?<p>UPDATE:<p>This schemaless json reminds me of this friendfeed blog post from years ago:<p><a href="https:&#x2F;&#x2F;backchannel.org&#x2F;blog&#x2F;friendfeed-schemaless-mysql" rel="nofollow">https:&#x2F;&#x2F;backchannel.org&#x2F;blog&#x2F;friendfeed-schemaless-mysql</a>
评论 #10089075 未加载
pbreitalmost 10 years ago
Is it surprising that Pinterest is on MySQL? Has Postgres finally usurped the position of default DB for startups?
评论 #10087688 未加载
z3t4almost 10 years ago
Nice article and good explanations.<p>But I still wonder why some people store blobs in a database rather then simple files on a file-system!?
评论 #10087792 未加载
e40almost 10 years ago
Is this the guy that&#x27;s going to reddit?
评论 #10088029 未加载
评论 #10087403 未加载
评论 #10087418 未加载
fredliualmost 10 years ago
How are they going to solve the problem of querying into the data that&#x27;s stored as json? E.g. trying to find all pins whose &quot;link&quot; is from, say, reddit. Just pull out all data and filter them through in client side? That&#x27;s not gonna scale. Or having a sort of cron job that periodically picking out interested fields in new json data and store them in a separate table? &lt;-- this is essentially what we do in one of our projects, but curious to see how they do it, or alternatives.
评论 #10089091 未加载
评论 #10089007 未加载
ck2almost 10 years ago
PINTEREST: publish your public outgoing IPs so we can whitelist you and block the rest of amazon ec2<p>It is the responsible thing to do, otherwise other website bots can spoof you.
评论 #10096144 未加载
评论 #10088679 未加载
评论 #10088315 未加载
mark242almost 10 years ago
Am I missing the obvious? They&#x27;re using text columns as blob stores for JSON data? How on earth do you query that in MySQL? How do you run a secondary index, on say the user_id? Is it just one gigantic instance of Elasticsearch or Lucene?
评论 #10102999 未加载
applecorealmost 10 years ago
This reminds me of how FriendFeed was using MySQL to store unstructured data.<p><a href="http:&#x2F;&#x2F;backchannel.org&#x2F;blog&#x2F;friendfeed-schemaless-mysql" rel="nofollow">http:&#x2F;&#x2F;backchannel.org&#x2F;blog&#x2F;friendfeed-schemaless-mysql</a>
评论 #10088668 未加载
baghalialmost 10 years ago
I&#x27;m curious to know how Pinterest or Instagram decide about what goes inside each shard? Do they shard by user ids or something else? Secondly, would like to know if a shard gets more data than other shard, how do they load balance?
joantunealmost 10 years ago
a minor con of this approach is that you have to add an extra layer to your application to do these operations, i.e. to abstract them. But probably the speed &amp; other gains surpass the cons, and in such a big team, i&#x27;m sure they will easily handle such an abstraction layer.<p>One good idea is to open source it, so if other people can take advantage of it, they will also help you maintain it and find bugs for it.<p>A question: If you started this now, would you consider using Postgres-XL AFAIK it supports similar shardings, in a more transparent manner for the developers. Any thoughts on this?
uptownalmost 10 years ago
Can somebody help me understand this conversion?<p>Shard ID = (241294492511762325 &gt;&gt; 46) &amp; 0xFFFF = 3429<p>Type ID = (241294492511762325 &gt;&gt; 36) &amp; 0x3FF = 1<p>Local ID = (241294492511762325 &gt;&gt; 0) &amp; 0xFFFFFFFFF = 7075733
评论 #10087539 未加载
评论 #10087407 未加载
thewarrioralmost 10 years ago
I know nothing about scaling databases but this reminds of Amazon&#x27;s Dynamo DB object store but without the consistent hashing trick .
评论 #10088038 未加载
silverlakealmost 10 years ago
Isn&#x27;t it unsafe to expose internal database IDs to external clients? I would have generated a second GUID for public view.
评论 #10089212 未加载
评论 #10088381 未加载
tapirlalmost 10 years ago
Too many configs and maintenance.<p>NoSQL is really better than SQL to scale your databases. It is just sad there is no one open source NoSQL db as good as Google BitTable.
评论 #10091283 未加载
loco5nineralmost 10 years ago
&gt; Shard. Or shard not. There is no try.<p>sorry, that was bugging me...
pferrelalmost 10 years ago
How embarrassing for them, conquering yesterday&#x27;s tech today!
halaylialmost 10 years ago
This looks like a big hack to compensate for using the wrong tool. Cassandra would have been a better solution IMO.<p>With Cassandra, you can set replication factors, speed up the writes, and automatically shard the data without having to manage your own &quot;mapping tables&quot;.
评论 #10087399 未加载
评论 #10087815 未加载
评论 #10087804 未加载
评论 #10090210 未加载
评论 #10088247 未加载