Show HN: I scraped 25M Shopify products to build a search engine

317 points by pencildiver, over 1 year ago
Hi HN! I built Agora as a side-project leading up to the holiday season. I wanted to find an easier way to find Christmas gifts, without needing to go store-by-store.

My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible with a search interface. There's a few additional features to enhance the buying experience including saving products, filters, reviews, and popular products.

I've started with exclusively Shopify stores and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping the search speed and performance strong as the data set becomes larger. There's about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise the search speed and user experience.

I'd love any feedback!

68 comments

senecaso, over 1 year ago
I hope you have better luck than I did!

A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine on products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing.. over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.

I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.

Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.
screye, over 1 year ago
What was the process for scraping 25M products?

I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.

Could you talk about your process and key bottlenecks at that scale a little bit? Also, how much did it cost?

______________

A recommendation for how to improve search.

Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-captioning) and generate captions for all your images.

Then for search, a simple vector store index would be a great retrieval solution here. It is better to do search using those as well.

Both are pretty cheap and can be done reliably within 20-30 lines of code each in python. 3rd party tools for these are pretty stable.
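A minimal sketch of the vector-index retrieval suggested above, not anyone's actual implementation. It assumes each product caption has already been turned into an embedding by some model; the three-dimensional vectors, `INDEX`, `QUERY`, and `top_k` are all made up for illustration, and a real system would use a proper vector store rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k captions whose embeddings are closest to query_vec."""
    scored = [(cosine(query_vec, vec), caption) for caption, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]

# Hypothetical toy embeddings; real ones would come from a captioning/embedding model.
INDEX = {
    "red leather shoes": [0.9, 0.1, 0.0],
    "blue denim jacket": [0.0, 0.2, 0.9],
    "crimson running sneakers": [0.8, 0.3, 0.1],
}
QUERY = [1.0, 0.2, 0.0]  # stand-in for the embedding of "red shoes"

print(top_k(QUERY, INDEX, k=2))
```

The linear scan is fine for a toy example; at 25M products an approximate nearest-neighbour index would be the usual choice.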
joshuamcginnis, over 1 year ago
I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.

There's obviously some rough edges (multiple duplicate products, issues with product links linking to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.

Keep going! At the least, you'll come out of this with an excellent project in your portfolio.
pitched, over 1 year ago
Shopify has tried a few times to build a tool like this but hasn't ever managed to get any traction. I think that missing any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.
callmeed, over 1 year ago
I built this a couple years ago (now defunct) for the same reason :) The public JSON endpoints on shopify stores make it pretty easy to get the data. You mentioned using Mongo but it sounds expensive. I honestly think you could do this with just elastic or even postgres full text search and save money.

Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)

A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
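A small sketch of building the direct-to-cart link described above. The base form `/cart/<variant_id>` is taken from the example in the comment; the `:quantity` suffix and the `discount` query parameter are my understanding of Shopify's cart permalink format and should be checked against Shopify's documentation before relying on them. `cart_permalink` and the coupon code are hypothetical names.

```python
from urllib.parse import urlencode

def cart_permalink(shop_domain, variant_id, quantity=1, discount_code=None):
    """Build a direct-to-cart URL for one variant.

    Assumed format: https://<shop>/cart/<variant_id>:<quantity>[?discount=<code>]
    """
    url = f"https://{shop_domain}/cart/{variant_id}:{quantity}"
    if discount_code:
        # urlencode escapes any characters that are unsafe in a query string.
        url += "?" + urlencode({"discount": discount_code})
    return url

print(cart_permalink("hapaboardshop.com", 42165521907955, quantity=2))
print(cart_permalink("hapaboardshop.com", 42165521907955, discount_code="HOLIDAY10"))
```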
konschubert, over 1 year ago
Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?

https://shop.invisible-computers.com
jillesvangurp, over 1 year ago
There are a few conferences dedicated to ecommerce search. Mices is pretty good. I did not go there this year but I know some of the people behind it. Good community and lots of stuff happening.

Two points here.

- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you deal with it properly. And there are plenty of equally capable solutions. I have worked with logging clusters that process log entries by those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this. But a couple of simple servers with decent CPUs and memory and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.

- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query and your job is to pick the best 3, 5, 10 (whatever fits on your screen) ones. This is hard.

So, what makes for a good answer is the key question to answer. All the naive solutions for this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low quality search engine not quite solving the problem. The bar is high these days for a good search engine and most of the better ecommerce companies have highly skilled search teams working on this.
quaxar, over 1 year ago
Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.

Some observations:

- Don't use infinite scrolling, it's an outdated UI practice that leads to bad user experience. It also makes the footer entirely unviewable.

- Clicking on a product card image does not reliably open up the product. I have to randomly click on it a few times (Chrome, Brave)

- Clicking on product card image and title leads to different actions, this is a bit unexpected, should show some hint of the difference.

- The product page pop up will reset the search list when closed, this messes up my search navigation, breaks the flow of browsing.
twothamendment, over 1 year ago
Searching is slow (kinda expected that right now), but after clicking a product and then hitting back, I have to wait for the search again.

Not at computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load search results again.
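One way to let the client cache a search response briefly is a `Cache-Control` response header with a short `max-age`. The sketch below only builds the header; how it gets attached depends on the server framework, which the thread does not name. The function name and the 60-second default are assumptions for illustration.

```python
def search_cache_headers(max_age_seconds=60):
    """Headers telling the browser it may reuse a search response for a short time.

    'private' keeps shared caches (CDNs, proxies) from storing per-user results;
    'max-age' is the number of seconds the response stays fresh.
    """
    return {"Cache-Control": f"private, max-age={max_age_seconds}"}

print(search_cache_headers())
```

With a header like this, going back to the results page within the freshness window can be served from the browser cache instead of re-running the search.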
Redster, over 1 year ago
Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!
TekMol, over 1 year ago
The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".

So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?
muratsu, over 1 year ago
Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)

From a technical perspective, crawling 25M products is impressive but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure, I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.
virtuosarmo, over 1 year ago
I believe Shopify built their own app / website where you can search for products exclusively from Shopify merchants. https://shop.app/
xnx, over 1 year ago
Great project. If you continue to crawl the data, be sure to save it so you can detect price changes a la camelcamelcamel.
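A minimal sketch of the price-tracking idea: keep the previous crawl's prices and compare them with the current crawl. The snapshot shape (a dict from variant ID to price) and the function name are assumptions for illustration; a real crawler would persist every snapshot with a timestamp.

```python
from decimal import Decimal

def detect_price_changes(previous, current):
    """Return (variant_id, old_price, new_price) for every variant whose price changed.

    Both arguments map a variant ID to a Decimal price. Variants that only
    appear in one snapshot are ignored, since there is nothing to compare.
    """
    changes = []
    for variant_id, new_price in current.items():
        old_price = previous.get(variant_id)
        if old_price is not None and old_price != new_price:
            changes.append((variant_id, old_price, new_price))
    return sorted(changes)

# Hypothetical snapshots from two crawls.
yesterday = {1: Decimal("19.99"), 2: Decimal("5.00")}
today = {1: Decimal("14.99"), 2: Decimal("5.00"), 3: Decimal("9.00")}

print(detect_price_changes(yesterday, today))
```

`Decimal` is used instead of `float` so that equal prices compare equal without rounding surprises.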
yoru-sulfur, over 1 year ago
For those unaware, Shopify already has platform wide search. You can use https://shop.app/ (or the app), and it also has some chatbot thing that can offer suggestions
Asparagirl, over 1 year ago
Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?
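A rough sketch of turning one store's `/products.json` response into flat rows for indexing. It parses an inline sample rather than fetching anything, and the field names (`products`, `variants`, `title`, `handle`, `price` as a string) reflect my understanding of how these feeds are usually shaped; treat them as assumptions and check a real feed. `SAMPLE_FEED` and `flatten_products` are hypothetical.

```python
import json
from decimal import Decimal

# Hypothetical sample, shaped like what a store's /products.json is assumed to return.
SAMPLE_FEED = """
{"products": [
  {"id": 1, "title": "Red Shoes", "handle": "red-shoes",
   "variants": [
     {"id": 11, "title": "Size 8", "price": "59.00"},
     {"id": 12, "title": "Size 9", "price": "61.50"}
   ]}
]}
"""

def flatten_products(feed_text):
    """Return one row per variant with the fields a search index would need."""
    data = json.loads(feed_text)
    rows = []
    for product in data.get("products", []):
        for variant in product.get("variants", []):
            rows.append({
                "product_title": product.get("title", ""),
                "handle": product.get("handle", ""),
                "variant_id": variant.get("id"),
                # Prices are assumed to arrive as strings; Decimal avoids float rounding.
                "price": Decimal(variant.get("price", "0")),
            })
    return rows

for row in flatten_products(SAMPLE_FEED):
    print(row)
```

Because the feed is structured JSON, the price comes from a named field and no per-store HTML parsing is needed.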
cmcconomy, over 1 year ago
That's funny, I made a domain-specific version of this for Canadian coffee deals.

https://beangrid.mcconomy.org/
misterbwong, over 1 year ago
What technology did you use to build the scraper and how did you get around the usual challenges (anti bot, ip banning, etc) with scraping large amounts of data?
ashvardanian, over 1 year ago
Cool project!

As you scale, you may benefit from these two projects that I maintain and that Big Tech uses :)

https://github.com/unum-cloud/usearch - for faster search

https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings

Feel free to reach out with feedback and feature requests!
thih9, over 1 year ago
When I search for “op-1”, partial match like “Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op” gets ranked higher than “teenage engineering op-1”. I would expect the opposite.
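One simple way to get the ordering expected here is to add a bonus when the query tokens appear in the title as a contiguous phrase, on top of plain token overlap. This is a toy scorer under that assumption, not how Agora ranks results; `tokenize` and `score` are hypothetical names.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title):
    """Token overlap in [0, 1], plus 1.0 if the query occurs as a contiguous phrase."""
    q, t = tokenize(query), tokenize(title)
    if not q:
        return 0.0
    overlap = sum(1 for tok in q if tok in t) / len(q)
    phrase = any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))
    return overlap + (1.0 if phrase else 0.0)

titles = [
    "Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op",
    "teenage engineering op-1",
]
for title in sorted(titles, key=lambda t: score("op-1", t), reverse=True):
    print(score("op-1", title), title)
```

Both titles contain the tokens `op` and `1`, so overlap alone cannot separate them; only the second has them adjacent, which is what the phrase bonus rewards.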
rocauc, over 1 year ago
Really neat. I tried your search for red shoes, and I found some, er, unexpected imagery on page 1.

One thing you could do is add semantic search so when a user searches "red shoes," the index returns images that look like red shoes even if the metadata doesn't say anything about color or item types. To do this, I'd use a model like CLIP. Here's an example of using CLIP and Supabase to do semantic image search: https://blog.roboflow.com/how-to-use-semantic-search-supabase-openai-clip/
hipadev23, over 1 year ago
https://www.searchagora.com/search?query=red%20shoes

It doesn't work very well.
noduerme, over 1 year ago
This is great - just a couple UI things bugging me.

1. When clicking "Open" on a product, the user should be able to open that in a separate tab. Currently that's not possible; I'm sure because it's being delivered in a single page (can't check now because you're getting hugged to death by HN).

2. When the server's slow, as it just was, there should be some kind of waiter / loader to immediately show the user that the "Open" click was sent on a product. Otherwise people will keep clicking it (or worse, clicking other products) and there's no indication that it's loading.

3. Once a product is open, it's not clear how to get back from it. I see the "X" in the corner, but doing that seems to take me back to a blank search page, not to my search results. The back button also doesn't take me back to the search results...
pencildiver, over 1 year ago
HN, not sure if anyone will see this but I wanted to thank you all for the support. Although I haven't slept much since going live, it has been amazing getting early feedback from the community.

Agora is still in MVP stage but getting better by the day. Just pushed a big update: fixing an image shifting bug, a blur effect on loading, Redis for caching, brand pages, architecture fixes, and several other things. Currently working on improving the relevancy algorithm, adding all ~5 million Shopify stores, and then adding WooCommerce stores over the next few days.

If you have any suggestions or ideas, reach out to me at support @ searchagora .com :)
codetrotter, over 1 year ago
On the page where you show details about the product, I would like to have it include the same product from other Shopify stores by doing an image similarity comparison.

And then highlight how the price compares.

For example, here are some pretty crazy red shoes. But they are too expensive for me. Would be interesting to see if this is the only store selling these shoes, or if someone else has the same shoes much cheaper.

https://www.searchagora.com/products/vasco-4-47fb0f87-5b89-470c-b775-6da4da5d75e5
system2, over 1 year ago
How are you planning to monetize this? You mentioned you are spending around $2K just to run it. Is there a commission strategy or ads? Or populate with your products at one point so you sell your own thing?
ganesha727, over 1 year ago
Idea! Shopify has a ton of resellers that sell junk from China. If you figure out how to avoid them, your life would be 10x easier.
treesciencebot, over 1 year ago
This is amazing for finding cute collectibles from my favorite TV show that I would otherwise not have noticed among random t-shirts and other "slap the picture and call it co-branded" products! I'm not super sure how long it is going to be around, but I think I'm gonna keep playing with it for a while.
bsbechtel, over 1 year ago
I searched for 'pão francês' and my store was the #1 result. I think you're doing it right! :)
jonnycoder, over 1 year ago
Awesome! It would be good to listen to the enter key when typing in a search query. Your privacy and terms links point to what appears to be the SaaS code framework you used (just a guess). I was looking for your contact/email so I can ask you some questions.
qdequelen, over 1 year ago
Hey, I'm the CEO of Meilisearch. If your issue is performance, I would love for you to give Meilisearch a try. You'll be able to create an "as you type" experience with our engine that responds in less than 50ms!
mandeepj, over 1 year ago
Do you plan to add filters: price etc?

I was about to add 'reviews' as well to the above list but decided not to as they are not always trustworthy. Now AI is so advanced that it can be used to detect fake reviews and exclude them from sampling.
asdadsdad, over 1 year ago
Cool project. You might have noticed, but there's a non-trivial amount of fraud on Shopify (fake shops, info stealers, etc). Might be interesting to look at that dataset and explore a bit =) it's quite fascinating
nox100, over 1 year ago
I have no clue how to implement a search but maybe some words are more important than others.

I searched for "mens dress shirts button long sleeve" and after about 6 results it was all women's clothing.
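The intuition that some words matter more than others is what inverse document frequency (IDF) captures: a term that appears in few titles, like "mens" in a catalog dominated by women's items, gets more weight than one that appears everywhere. The sketch below is a toy illustration on a made-up four-title catalog; it is not how Agora ranks, and real engines combine IDF with term frequency, field boosts, and more.

```python
import math
from collections import Counter

def idf_weights(titles):
    """Weight each term by log(N / document_frequency): rarer terms score higher."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(title.lower().split()))
    n = len(titles)
    return {term: math.log(n / count) for term, count in doc_freq.items()}

def score(query, title, weights):
    """Sum the weights of the query terms that appear in the title."""
    title_terms = set(title.lower().split())
    return sum(weights.get(term, 0.0)
               for term in query.lower().split() if term in title_terms)

# Hypothetical catalog in which women's items dominate.
CATALOG = [
    "mens oxford shirt",
    "womens long sleeve dress",
    "womens long sleeve blouse",
    "womens long sleeve dress shirt",
]
WEIGHTS = idf_weights(CATALOG)
QUERY = "mens dress shirt button long sleeve"

ranked = sorted(CATALOG, key=lambda t: score(QUERY, t, WEIGHTS), reverse=True)
for title in ranked:
    print(round(score(QUERY, title, WEIGHTS), 3), title)
```

Counting matched words alone would rank "womens long sleeve dress shirt" first (four matches against two); with IDF the rare term "mens" outweighs the common "long" and "sleeve".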
alvarome, over 1 year ago
I'm a Shopify store owner myself. I saw there is a $99 per month fee to get your product verified. How would this compete in terms of CPC with a traditional channel such as Google Ads or Meta Ads?
difradev, over 1 year ago
Amazing job! I have one question: how did you find the price of every product? I mean, every product page has a different id or class that identifies a price. Do you use a regex?
EvanAnderson, over 1 year ago
Aside: The ending of the 1948 "The Red Shoes" was funny to me, but I think I was a little loopy after slogging thru it. I don't know if I recommend it or not.
quickthrower2, over 1 year ago
I like it.

I need to be able to filter search results by whether the store will deliver to my country.

It desperately needs some indication that your action is being processed, like a spinner, when you search.
krauses, over 1 year ago
What's your revenue model? I see you expanded on the details of your $1.5K monthly cost, but I'm failing to see how you make money. Affiliate fees?
glohbalrob, over 1 year ago
Wow! Nice work. I've been trying to build an index of Shopify stores. Did you search for all domains pointing to Shopify's name servers?
Canada, over 1 year ago
Worked well for me, great job. I searched for something I've been looking for and found some interesting options I haven't seen before.
jasonlbaptiste, over 1 year ago
You should def give Algolia and Typesense a try. You can get 10k in free Algolia credits for the first year too via Secret (startup deals site).
freefruit, over 1 year ago
Could you make it so that I can easily open a product in a new tab? I like to compare lots of products at the same time.
minastirith, over 1 year ago
Love it! Some improvements are needed on search but it is an amazing MVP. I'll use this for my late Christmas shopping
bomewish, over 1 year ago
Why not manticore as backend? Much better perf than ES, less memory intense, sql syntax. Just fantastic all round!!
sails, over 1 year ago
Clicking an item could show you similar items before it takes you to the item (or have capability for similar)
1vuio0pswjnm7, over 1 year ago
"There's about 25 million products on Agora right now."

How many stores are represented in the index?
connectingu, over 1 year ago
Incredible. Where can I connect with you? Want to pick your brain & swap some thoughts :)
jross225, over 1 year ago
heh, I used to work on the data team at Shopify. I built something similar to search internal dbs for secret santa gifts based on some weird criteria. Scraping might have a large margin of error because a lot of products tend to be ephemeral.

Neat project though!
kacyjames, over 1 year ago
Any Unicode input (Japanese or Greek text for example) currently causes a 500 error.
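The thread does not say what causes the 500, so the following is only one thing worth ruling out: non-ASCII queries need to be percent-encoded as UTF-8 on the way into the URL and decoded the same way on the server. The `query` parameter name is taken from the search URL quoted elsewhere in this thread; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote

def build_search_url(base_url, query):
    """Percent-encode the query as UTF-8 so the resulting URL is plain ASCII."""
    return f"{base_url}?query={quote(query, safe='')}"

url = build_search_url("https://www.searchagora.com/search", "赤い靴")
print(url)

# Decoding recovers the original text, which is what the server side should see.
decoded = unquote(url.split("query=", 1)[1])
print(decoded)
```

A regression test that sends a few Japanese and Greek queries through the search endpoint would catch this class of bug regardless of its actual cause.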
IcyHordr, over 1 year ago
So cool, good luck in the marriage, you made a very cool thing!!
joshdance, over 1 year ago
Amazing. Why doesn't Shopify build this natively?
sanketgoyal11, over 1 year ago
How did you find the list of shopify stores and names?
moneywoes, over 1 year ago
how did you avoid ip based blocking? rotating proxies?
usrme, over 1 year ago
Maybe I'm clearly ignorant, but how does this differ from Klarna (https://www.klarna.com)?
RagnarD, over 1 year ago
Is this really within the TOS of Shopify?
taimurayaz, over 1 year ago
Cool project but Shopify already has this. https://shop.app

https://shop.app/search/results?query=red%20shoes
bluepnume, over 1 year ago
Amazing! Does it have an API?
ttt3ts, over 1 year ago
Built the same thing a while back while collecting a lead list for sales. Haven't bothered to keep the data updated but it was a fun thing to build in a couple of days. (Disclaimer: the mobile experience is meh because it was a fun project.)

https://zensear.ch

How did you find the list of all Shopify stores? I ended up just checking every .com, .net, etc as I didn't find an easy way to figure it out directly from Shopify.
ctocoder, over 1 year ago
how did you get a list of the 25 million stores to crawl?
b2bsaas00, over 1 year ago
Basically it's Amazon
moneywoes, over 1 year ago
where did you find a list of shopify stores to scrape?
dangoodmanUT, over 1 year ago
Super cool!!!
Wajid2502, over 1 year ago
Great idea
ganesha727, over 1 year ago
Gg
connectingu, over 1 year ago
Incredible. Would love to connect with you. Where can I find you LOL
dns_snek, over 1 year ago
I'm sorry, but I have to question whether this heartfelt story about looking out for your wife is in any way real.

The website certainly doesn't look like a side project: it has a fully fledged system for merchants to advertise on Agora for a fee, an affiliate system offering $50 commissions to onboard merchants, and the ToS and Privacy Policy link to a website with the following mission statement:

> We buy, build, and invest in software companies with recurring revenue and product-led growth.
shadowbanned4, over 1 year ago
This isn't worth the cost or effort. Shopify already has an internal tool with this functionality that they are planning to publicize.