Architecture of Nautilus, the new Dropbox search engine

3 points by WalterSobchak over 6 years ago

1 comment

PokemonNoGo over 6 years ago

Very interesting!

> For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation, which then gets parsed in order to extract a list of “tokens” (i.e. words) and their “attributes” (i.e. formatting, position, etc…).

How good is Apache Tika really at this? I've messed about with it, but it's hard to find solutions that cover the base cases.

What are the recommendations for covering, let's say, PDF, OpenXML, and ODF?
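
For anyone trying to picture the second half of the quoted step (HTML in, tokens with attributes out), here is a minimal sketch using only Python's standard library. It does not call Tika; the input string is just a stand-in for whatever canonical HTML a converter would produce. The names `TokenExtractor`, `extract_tokens`, and `FORMAT_TAGS`, and the choice of which tags count as "formatting", are assumptions made for illustration and not anything described in the Dropbox article.

```python
import re
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to formatting attributes.
FORMAT_TAGS = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic",
    "h1": "heading", "h2": "heading",
}


class TokenExtractor(HTMLParser):
    """Collects word tokens with their position and active formatting."""

    def __init__(self):
        super().__init__()
        self.active = []  # formatting attributes of currently open tags
        self.tokens = []  # extracted token records, in document order

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self.active.append(FORMAT_TAGS[tag])

    def handle_endtag(self, tag):
        fmt = FORMAT_TAGS.get(tag)
        if fmt in self.active:
            self.active.remove(fmt)

    def handle_data(self, data):
        # Split text into lowercase word tokens; position is the word index.
        for word in re.findall(r"\w+", data):
            self.tokens.append({
                "text": word.lower(),
                "position": len(self.tokens),
                "formatting": sorted(set(self.active)),
            })


def extract_tokens(html):
    """Return a list of token dicts extracted from an HTML string."""
    parser = TokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    sample = "<html><body><h1>Quarterly Report</h1><p>Revenue <b>grew</b> fast</p></body></html>"
    for token in extract_tokens(sample):
        print(token)
```

This only covers the tokenizing side. How well the HTML reflects the original PDF, OpenXML, or ODF file depends entirely on the converter in front of it, which is the part the question above is really about.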