Show HN: I scraped 200M Shopify products to build a search engine

23 points by pencildiver over 1 year ago
Hi HN! In December I launched an MVP for Agora here: https://news.ycombinator.com/item?id=38635695

After posting, we got thousands of users and hundreds of comments with valuable feedback from the community. I spent a couple of sleepless nights frantically pacing around my room trying to keep the product live and, relatively, performant. After getting some sleep, I got back to work to make the product better.

A few updates:

1. We've grown from 25 million to 200 million products on Shopify and WooCommerce. The team at WooCommerce reached out after the HN launch to help us figure out how to index their stores. Similar to Shopify, we found that there's a public file available for all stores that use WordPress and WooCommerce at [Base URL]/wp-json/wc/store/v1/products. For example, the file for Good Works Tractors is available here: https://www.goodworkstractors.com/wp-json/wc/store/v1/products So I bought a list of 3.5 million active WooCommerce stores on a website called BuiltWith, adapted the product data model, and started the crawler to go down the list. We've indexed around 515k stores so far.

2. We improved the search experience. We're using Mongo to host the 200 million product records. First, we switched from Mongo Atlas Search to Typesense. After testing Typesense with our product records, we found most searches to be under 200ms. We're not storing the product images, which slows down the loading speed at times. This week, we set up a server using Paperspace to run SBERT embeddings on a GPU (new to the AI workflow so apologies if I get the lingo wrong). We quickly realized that the dimension size of the embeddings matters a lot here, given the size of the data set.
The GPU is still running to process all 200 million records and we're about a week away from releasing AI-powered search.

3. We localized the user experience. There's now frontend and backend IP detection to only show users products that are 'based in' or 'ship to' their specific country. This 'ships to' filter (the data is stored for every Shopify store at the /meta.json route, e.g. https://wildfox.com/meta.json) significantly slows down the search results, but we're trying to get creative on the loading process and animation. For example, we're using revalidation in Next.js to give several pages a 'hard coded' feel while the data refreshes every 60 seconds. https://nextjs.org/docs/app/building-your-application/data-fetching/fetching-caching-and-revalidating

4. We got our first few paying customers. Store owners can sign up for free to track their store's performance on Agora. We validate that they are the store owner by making sure the email address and store URL match on sign up, and then send them an email verification link. They can upgrade to a subscription tier to 'verify' their products to get better placement in relevant search results. Additionally, they can pay to 'boost' products and guarantee that they'll show up in the first row of results. Given the high purchase-intent searches on Agora, I'm finding this to be the right business model.

The next challenge to solve: We need to improve the quality of products on Agora. There are a lot of resellers, dropshipping stores, and low quality images. Now, just because a product is sold on a reseller or dropshipping website doesn't mean it's a bad product.
There are a lot of exceptions and edge cases to solve. One potential solution: we're considering coming up with an "Agora Score" that takes in several factors including the image quality, store name, brand name, website SEO, etc. to tell users how trustworthy we think the product is.

I'd love any feedback or advice. I did solve my original problem of finding 'red shoes' for my wife, but inadvertently created more problems for myself. I'm loving every minute of it though. My wife jokes that everything is now "Agora this...Agora that". Open to any advice on that as well.
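
To make the point in update 2 about embedding dimensions concrete, here is a rough back-of-the-envelope sketch of raw vector storage for 200 million records. It assumes float32 vectors and ignores index overhead. The two dimension sizes (384 and 768) are the output sizes of two common SBERT models, all-MiniLM-L6-v2 and all-mpnet-base-v2. The post does not say which model is actually used, so treat them as illustrative assumptions. The function name is made up for this example.

```python
# Back-of-the-envelope storage estimate for dense embeddings.
# Assumption: float32 vectors (4 bytes per dimension), no index overhead.

BYTES_PER_FLOAT32 = 4


def embedding_storage_gb(num_records: int, dimensions: int) -> float:
    """Raw size in gigabytes (10**9 bytes) of num_records float32 vectors."""
    return num_records * dimensions * BYTES_PER_FLOAT32 / 1e9


NUM_PRODUCTS = 200_000_000  # product count stated in the post

# 384 and 768 are assumed dimension sizes (all-MiniLM-L6-v2 and
# all-mpnet-base-v2); the post does not name the model in use.
for dims in (384, 768):
    size_gb = embedding_storage_gb(NUM_PRODUCTS, dims)
    print(f"{dims} dims: {size_gb:.1f} GB")
```

Under these assumptions, doubling the dimension doubles the raw storage, which is why the choice matters so much at this record count.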

15 comments

rivercraft over 1 year ago
Don't take this as harsh criticism, but I want to know: what problem are you solving? Is this just for fun?

This is a lose-lose game. You will never be able to catch up to the providers (Shopify, WooCommerce, and others).

What you are doing is not a search problem. It is a traffic problem, and you have little to no traffic. There is a reason why Instagram and FB work as drivers for ecommerce products. My suggestion is to test the market before you invest too much in this area.
lolpanda over 1 year ago
Oh is the site currently down? I tried a few queries, including the ones on the landing page. It gave me empty results.
lpellis over 1 year ago
Seems like a fun scraping project. I think you have to work on extracting more accurate categories though; for example, this link does not really include snowboards for me: https://www.searchagora.com/search?query=Snowboard And the first products I clicked have rather weird descriptions: https://www.searchagora.com/products/snowboard-bd2a90aa-6808-4b23-9b4e-9ff0f3504675-1706684246551-337

Maybe it's my location (South Africa), but I also cannot visit the product store when I click through.
piterrro about 1 year ago
How do you plan to drive customer traffic to this site? As others mentioned, it's a bare-bones, raw search engine. I think these days consumers need something more than just a bare choice, because it's too much. People get paralyzed when they are presented with multiple options. I think if you could develop something that works similarly to Pinterest or Instagram, that would be more interesting, especially for female consumers who love to spend time scrolling endless feeds with items to buy.
lobito14 about 1 year ago
How often and how do you plan to update images, prices and descriptions? Also, I noticed some "more from merchant" links don't work, for example: https://www.searchagora.com/buy-online/https:/www.bigbuy.eu/es/compresas-para-incontinencia-indasec_74300.html
plasma about 1 year ago
Unfortunately it seems the underlying search API is throwing '{ "message": "Not Ready or Lagging"}' for every search
alvarome about 1 year ago
Love to see that you have posted again, I commented on your post last time! I have two main questions here. Firstly, why would Shopify or WooCommerce not build this themselves? And secondly, how do you intend to drive traffic to the site? I can see how you will solve the search function at scale, but I see a bigger hurdle in driving initial traffic to the site.
quickthrower2 about 1 year ago
You should get metrics on searches that yield zero results and investigate why. Getting zero is a turn off! My example: timber
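
A minimal sketch of what such a zero-result metric could look like, assuming the search backend can be wrapped as a plain function that returns a list of hits. `ZeroResultTracker` and `fake_search` are hypothetical names for illustration, not part of Agora or Typesense.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


class ZeroResultTracker:
    """Wraps a search function and counts queries that return no hits."""

    def __init__(self, search: Callable[[str], Sequence[object]]) -> None:
        self._search = search
        self.zero_hits: Counter = Counter()

    def __call__(self, query: str) -> Sequence[object]:
        results = self._search(query)
        if not results:
            # Normalize so "Soap" and "soap " are counted together.
            self.zero_hits[query.strip().lower()] += 1
        return results

    def worst_queries(self, n: int = 10) -> List[Tuple[str, int]]:
        """Most frequent zero-result queries, highest count first."""
        return self.zero_hits.most_common(n)


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for the real search backend."""
    catalog = {"shoes": ["red shoes", "running shoes"]}
    return catalog.get(query.strip().lower(), [])


tracked = ZeroResultTracker(fake_search)
for q in ["timber", "Soap", "shoes", "timber", "paper"]:
    tracked(q)

print(tracked.worst_queries(3))
```

Reviewing the top of that list regularly would show whether empty results come from missing inventory, indexing lag, or query parsing.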
iamacyborg over 1 year ago
> The next challenge to solve: We need to improve the quality of products on Agora. There's a lot of resellers, dropshipping stores, and low quality images.

Glad to see you’re thinking about this. The sheer prevalence of dropshipped junk on Amazon is a huge problem and I’d happily shop elsewhere if I could find a good way to discover products.
LuigiElsa about 1 year ago
Well done! A lot of progress since last time. Have you guys considered using AI to categorise products (i.e., create labels using product images) instead of using the text to match the search? I say this because I sometimes see some irrelevant products and I can tell you guys are basing the search on text.
matterofBee about 1 year ago
Search is not working. Also, I seem to get shoes for anything I search. Did you hardcode it by any chance?

Are you open to collaborate with others? I might have an automated method of curating products. Please drop a line to comp [dot] turkey [at] gmail.com.
NachoElsa about 1 year ago
Hey! Cool project, my co-founder told me about this. I suppose you're getting initial traffic from search engines. Isn't this just adding an extra step for users, as most search engines already display products at the first level?
barbarbar about 1 year ago
It doesn't work. "Soap" returned 0 results.
gndk about 1 year ago
I don't get a single result for any searches.
tonylemesmer about 1 year ago
‘Paper’ zero results