
Show HN: I scraped 200M Shopify products to build a search engine

23 points by pencildiver, over 1 year ago
Hi HN! In December I launched an MVP for Agora here: https://news.ycombinator.com/item?id=38635695

After posting, we got thousands of users and hundreds of comments with valuable feedback from the community. I spent a couple of sleepless nights frantically pacing around my room trying to keep the product live and, relatively, performant. After getting some sleep, I got back to work to make the product better.

A few updates:

1. We've grown from 25 million to 200 million products on Shopify and WooCommerce. The team at WooCommerce reached out after the HN launch to help us figure out how to index their stores. Similar to Shopify, we found that there's a public file available for all stores that use WordPress and WooCommerce at [Base URL]/wp-json/wc/store/v1/products. For example, the file for Good Works Tractors is available here: https://www.goodworkstractors.com/wp-json/wc/store/v1/products So I bought a list of 3.5 million active WooCommerce stores on a website called BuiltWith, adapted the product data model, and started the crawler to go down the list. We've indexed around 515k stores so far.

2. We improved the search experience. We're using Mongo to host the 200 million product records. First, we switched from Mongo Atlas Search to Typesense. After testing Typesense with our product records, we found most searches to be under 200ms. We're not storing the product images, which slows down the loading speed at times. This week, we set up a server using Paperspace to run SBERT embeddings on a GPU (new to the AI workflow so apologies if I get the lingo wrong). We quickly realized that the dimension size of the embeddings matters a lot here, given the size of the data set. The GPU is still running to process all 200 million records and we're about a week away from releasing AI-powered search.

3. We localized the user experience. There's now frontend and backend IP detection to only show users products that are 'based in' or 'ship to' their specific country. This 'ships to' filter (the data is stored for all Shopify stores at the /meta.json route, e.g. https://wildfox.com/meta.json) significantly slows down the search results, but we're trying to get creative on the loading process and animation. For example, we're using revalidation in Next.js to give several pages a 'hard coded' feel while the data refreshes every 60 seconds. https://nextjs.org/docs/app/building-your-application/data-fetching/fetching-caching-and-revalidating

4. We got our first few paying customers. Store owners can sign up for free to track their store's performance on Agora. We validate that they are the store owner by making sure the email address and store URL match on sign up, and then send them an email verification link. They can upgrade to a subscription tier to 'verify' their products to get better placement in relevant search results. Additionally, they can pay to 'boost' products and guarantee that they'll show up in the first row of results. Given the high purchase-intent searches on Agora, I'm finding this to be the right business model.

The next challenge to solve: We need to improve the quality of products on Agora. There's a lot of resellers, dropshipping stores, and low quality images. Now, just because a product is sold on a reseller or dropshipping website doesn't mean it's a bad product. There's a lot of exceptions and edge cases to solve. One potential solution: we're considering coming up with an "Agora Score" that takes in several factors including the image quality, store name, brand name, website SEO, etc. to tell users how trustworthy we think the product is.

I'd love any feedback or advice. I did solve my original problem of finding 'red shoes' for my wife, but inadvertently created more problems for myself. I'm loving every minute of it though. My wife jokes that everything is now "Agora this...Agora that". Open to any advice on that as well.
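On the point in update 2 that embedding dimension size matters at this scale, a rough back-of-the-envelope calculation shows why. This is only a sketch: the post does not say which SBERT model or dimension count is used, so the 384 and 768 values below are assumptions (both are common SBERT output sizes), as is the choice of float32 storage. Index overhead is ignored.

```python
# Rough storage estimate for dense embeddings.
# Assumption: vectors are stored as float32 (4 bytes per dimension).
BYTES_PER_FLOAT32 = 4


def embedding_storage_gib(num_records: int, dimensions: int) -> float:
    """Return raw vector storage in GiB, ignoring any index overhead."""
    return num_records * dimensions * BYTES_PER_FLOAT32 / 2**30


NUM_PRODUCTS = 200_000_000  # record count quoted in the post

# Dimension counts are illustrative, not taken from the post.
for dims in (384, 768):
    size = embedding_storage_gib(NUM_PRODUCTS, dims)
    print(f"{dims} dims: {size:.0f} GiB")
```

Under these assumptions, doubling the dimension count doubles the raw footprint, from roughly 286 GiB to roughly 572 GiB for 200 million records, before any search index is built on top.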

15 comments

rivercraft, over 1 year ago
Don't take this as a harsh criticism but I want to know what problem are you solving? Is this just for fun?

This is a lose-lose game. You will never be able to catch up to the providers (shopify and woocommerce and others).

What you are doing is not a search problem. It is a traffic problem of which you have little to none. There is a reason why Instagram and FB work as a driver for ecommerce products. My suggestion is to test the market before you invest too much in this area.
lolpanda, over 1 year ago
Oh is the site currently down? I tried a few queries, including the ones on the landing page. It gave me empty results.
lpellis, over 1 year ago
Seems like a fun scraping project. I think you have to work on extracting more accurate categories though; for example, this link does not really include snowboards for me: https://www.searchagora.com/search?query=Snowboard And the first products I clicked have rather weird descriptions: https://www.searchagora.com/products/snowboard-bd2a90aa-6808-4b23-9b4e-9ff0f3504675-1706684246551-337

Maybe it's my location (South Africa) but I also cannot visit the product store when I click through.
piterrro, over 1 year ago
How do you plan to drive customer traffic to this site? As others mentioned, it's a bare-bones, raw search engine. I think these days, consumers need something more than just a bare choice because it's too much. People get paralyzed when they are presented with multiple options. I think if you could develop something that works similar to Pinterest or Instagram, that would be more interesting, especially for female consumers who love to spend time scrolling endless feeds of items to buy.
lobito14, over 1 year ago
How often and how do you plan to update images, prices and descriptions? Also, I noticed some "more from merchant" links don't work, for example: https://www.searchagora.com/buy-online/https:/www.bigbuy.eu/es/compresas-para-incontinencia-indasec_74300.html
plasma, over 1 year ago
Unfortunately it seems the underlying search API is throwing '{ "message": "Not Ready or Lagging"}' for every search.
alvarome, over 1 year ago
Love to see that you have posted again, I commented on your post last time! I have two main questions here. Firstly, why would Shopify or Woocommerce not build this themselves? And secondly, how do you intend to drive traffic to the web? I can see how you will solve the search function at scale, but I see a bigger hurdle in driving initial traffic to the site
quickthrower2, over 1 year ago
You should get metrics on searches that yield zero results and investigate why. Getting zero is a turn off! My example: timber
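A minimal sketch of the metric suggested here, assuming the search backend can emit a log of (query, hit count) pairs. The log format and the `zero_result_queries` helper are hypothetical and not part of Agora.

```python
from collections import Counter


def zero_result_queries(search_log):
    """Count queries that returned no hits, most frequent first.

    `search_log` is an iterable of (query, hit_count) pairs.
    Queries are normalized by trimming whitespace and lowercasing.
    """
    counts = Counter(
        query.strip().lower()
        for query, hit_count in search_log
        if hit_count == 0
    )
    return counts.most_common()


# Hypothetical log entries: (query, number of results returned).
sample_log = [("timber", 0), ("red shoes", 42), ("Timber", 0), ("soap", 0)]
print(zero_result_queries(sample_log))  # [('timber', 2), ('soap', 1)]
```

Sorting by frequency puts the most common empty searches first, which is usually where an investigation would start.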
iamacyborg, over 1 year ago
> The next challenge to solve: We need to improve the quality of products on Agora. There's a lot of resellers, dropshipping stores, and low quality images.

Glad to see you’re thinking about this. The sheer prevalence of dropshipped junk on Amazon is a huge problem and I’d happily shop elsewhere if I could find a good way to discover products.
LuigiElsa, over 1 year ago
Well done! A lot of progress since last time. Have you guys considered using AI to categorise products (i.e., create labels using product images), instead of using the text to match the search? I say this because I sometimes see some irrelevant products and I can tell you guys are basing the search on text.
matterofBee, over 1 year ago
Search is not working. Also, I seem to get shoes for anything I search. Did you hardcode it by any chance?

Are you open to collaborating with others? I might have an automated method of curating products. Please drop a line to comp [dot] turkey [at] gmail.com.
NachoElsa, over 1 year ago
Hey! Cool project, my co-founder told me about this. I suppose you're getting initial traffic from search engines. Isn't this just adding an extra step for users, as most search engines already display products at the first level?
barbarbar, over 1 year ago
It doesn't work. "Soap" returned 0 results.
gndk, over 1 year ago
I don't get a single result for any searches.
tonylemesmer, over 1 year ago
‘Paper’ zero results