Indexing a billion pages

122 points by daoudc, over 1 year ago

13 comments

xnx, over 1 year ago
How does the homepage of https://mwmbl.org/ not have a single sentence explaining what it is or even an "About" link?

From GitHub: "Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and speed."
jetrink, over 1 year ago
> We’ve indexed over 100 million pages

> [W]e’re crawling up to a million pages a day, as you can see on our stats page.

> Given that Mwmbl is still relatively unknown, it seems plausible that we can reach our target of crawling three billion pages a day, to refresh the entire index in one month.

I think this is supposed to read "it seems plausible that we can reach our target of crawling three *million* pages a day."
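A quick arithmetic check of that reading, as a minimal sketch: it assumes a 30-day month, and the helper name `pages_per_day` is made up for illustration. The 100 million figure comes from the quoted post and the one billion figure from the thread title.

```python
def pages_per_day(index_size: int, refresh_days: int) -> int:
    """Pages to crawl per day to revisit every page once within refresh_days.

    Uses integer ceiling division so the result is never rounded down.
    """
    return -(-index_size // refresh_days)


# Assumption: a "month" is 30 days.
current_index = 100_000_000    # "over 100 million pages" (quoted post)
target_index = 1_000_000_000   # "a billion pages" (thread title)

print(pages_per_day(current_index, 30))  # about 3.3 million pages a day
print(pages_per_day(target_index, 30))   # about 33 million pages a day
```

Under these assumptions, refreshing the current index monthly needs roughly three million pages a day, which fits the "three million" reading; three billion a day would be about a thousand times more than required.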
bdcravens, over 1 year ago
Most impressive part:

> Our estimated annual budget is $752.36 and we have spent $174.49.
mdaniel, over 1 year ago
I thought I recalled seeing this before due to its Welsh name. As is often the case, some submissions are from their domain and some are from the GitHub repo; the ones with over 100 comments are:

https://news.ycombinator.com/item?id=37561155

https://news.ycombinator.com/item?id=29690877
marginalia_nu, over 1 year ago
I'll race you there ;-)
Alifatisk, over 1 year ago
I remember reading about a project whose sole purpose is to provide a large index of the open web for free; anyone could download it. I forgot the name of the project.

Why can’t mwmbl download their index?

Also, is mwmbl planning on providing their crawled index for free? Like, can I also download it later?

If that is the case, I’d happily download their FF extension.
mdaniel, over 1 year ago
> The biggest expense was purchasing a PyCharm professional license at $116.58

I mean, it's awesome that they value good tooling enough to spend on it, but https://www.jetbrains.com/community/opensource/ almost certainly means they qualify for a complimentary license.
Alifatisk, over 1 year ago
How do I identify my hash among the users in the stats at https://mwmbl.org/stats?
Alifatisk, over 1 year ago
What's the consequence of installing the crawler in FF? Can the ISP / Cloudflare / any other party start blacklisting you?
hcfman, over 1 year ago
Quite curious. What indexing and retrieval software is this using? I couldn’t find a reference to it.

Does it index phrases?
jmclnx, over 1 year ago
Very interesting and was quick for me. Nice work!
foreigner, over 1 year ago
Do they use Common Crawl?
urbandw311er, over 1 year ago
I think the saddest part of this is that, owing to the total enshittification of the web due to SEO, at least 50% of what they index will be absolute garbage.