Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz

684 points by angrygoat over 4 years ago

31 comments

phiresky over 4 years ago
Developer of the tool here :) Glad to see it posted here, I still actively use it myself. Also check out the fzf integration in the README: https://github.com/phiresky/ripgrep-all/blob/master/doc/rga-fzf.gif

Currently the main branch is undergoing a refactor to add support for custom extractors (calling out to other tools) and more flexible chains of extractors.

Ripgrep itself has built-in functionality to call custom extractors with the `--pre` flag, but by adding it here we can retain the benefits of the rga wrapper (more accurate file type matchers, caching, recursion into archives, adapter chaining, no slow shell scripts in between, etc).

Sadly, while rewriting it to allow this, I kind of got hung up and couldn't manage to figure out how to cleanly design that in Rust. I'd be really glad if a Rust expert could help me out here:

In the currently stable version, the main interface of each "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter chaining I have to change it to `fn(Read) -> Read`, where each chained adapter wraps the read stream and converts it while reading. But then I get issues with how to handle threading etc., as well as a random deadlock that I haven't figured out how to solve so far :/

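The `fn(Read) -> Read` shape described above can be sketched roughly as follows. This is only an illustration of reader wrapping, not rga's actual code: the `Adapter` alias, the `Upper` and `Detab` adapters, and the `chain` and `run_chain` helpers are all hypothetical names, and the sketch is single-threaded, so it says nothing about the threading or deadlock issues mentioned.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical adapter signature: take ownership of a reader and return a
/// new reader that converts the bytes lazily, while they are being read.
type Adapter = fn(Box<dyn Read>) -> Box<dyn Read>;

/// Illustrative adapter: upper-cases ASCII bytes as they stream through.
struct Upper<R: Read>(R);

impl<R: Read> Read for Upper<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.0.read(buf)?;
        buf[..n].make_ascii_uppercase();
        Ok(n)
    }
}

/// Illustrative adapter: replaces tabs with spaces as they stream through.
struct Detab<R: Read>(R);

impl<R: Read> Read for Detab<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.0.read(buf)?;
        for b in &mut buf[..n] {
            if *b == b'\t' {
                *b = b' ';
            }
        }
        Ok(n)
    }
}

fn upper(r: Box<dyn Read>) -> Box<dyn Read> {
    Box::new(Upper(r))
}

fn detab(r: Box<dyn Read>) -> Box<dyn Read> {
    Box::new(Detab(r))
}

/// Wrap `input` in each adapter in order. No bytes are read here; the work
/// happens only when the caller pulls from the returned reader.
fn chain(input: Box<dyn Read>, adapters: &[Adapter]) -> Box<dyn Read> {
    adapters.iter().fold(input, |reader, adapt| adapt(reader))
}

/// Run a fixed two-step chain over an in-memory string.
fn run_chain(text: &str) -> io::Result<String> {
    let input: Box<dyn Read> = Box::new(Cursor::new(text.as_bytes().to_vec()));
    let adapters: [Adapter; 2] = [detab, upper];
    let mut out = String::new();
    chain(input, &adapters).read_to_string(&mut out)?;
    Ok(out)
}

fn main() -> io::Result<()> {
    println!("{}", run_chain("a\tb")?);
    Ok(())
}
```

Because every adapter both consumes and produces a `Box<dyn Read>`, the chain can be assembled at runtime from a list, which is the flexibility the `fn(Read, Write) -> ()` form lacks.
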
awinter-py over 4 years ago
thanks but it's way faster to have my stuff in G drive

that way I can open a browser tab, wait 5 seconds for it to load, locate the new screen location of the search bar, click it, wait for javascript to finish loading so I can click the search bar, click it for real this time, mistype because there's some kind of contenteditable event jank, wait 5 seconds for my results to come up, fix the typo, and just have my results waiting for me

I'm not going to learn a new tool when web is fine

ssivark over 4 years ago
I love that we’re seeing fast & flexible solutions for personal search.

I’ve recently been playing with Recoll for full-text search on content. Since it indexes content up front, the search is pretty fast. It can also easily accommodate tag metadata on files.

It would be interesting to consider how ripgrep-based tools can fit into generically broad “search your database of content” workflows (as opposed to remembering or going through your file system paths).

ghoomketu over 4 years ago
On a related note, there is one program that I absolutely miss on Linux called Everything (on Windows).

The closest I can find is mlocate, but it does not have a GUI and, more importantly, it does not index my Windows or NTFS drives.

Would appreciate any suggestions if someone knows something like 'Everything' for Ubuntu.

hobofan over 4 years ago
Big fan of rga! I use it almost every day for the academic part of my life, when I want to know the location of some specific keywords in my lecture slides, books or papers I've been reading. Even for single ebooks, it is often more useful than the search in Acrobat Reader.

durnygbur over 4 years ago
No ripgrep-all through the package manager:

    $ sudo dnf install -y ripgrep-all
    [...]
    No match for argument: ripgrep-all
    Error: Unable to find a match: ripgrep-all

Rust's package manager fails:

    $ cargo install ripgrep_all
    [...]
    failed to select a version for the requirement `cachedir = "^0.1.1"`
    candidate versions found which didn't match: 0.2.0
    location searched: crates.io index
    required by package `ripgrep_all v0.9.6`

Quick search on the web shows that more people have problems with cachedir version.

akavel over 4 years ago
The "Integration with fzf" example looks really cool:

https://github.com/phiresky/ripgrep-all#integration-with-fzf

alexruf over 4 years ago
The idea behind Rga is cool. Anyway, I tried it on Mac and installed it via Homebrew. The formula already says it depends on ripgrep (that's fine since I have ripgrep already installed and use it regularly). I was still surprised when I executed Rga for the first time and got an error message that 'pdftotext' was not found. Since pdftotext has been officially discontinued, I am not sure if I want to install an old version just to make Rga work on my machine. I don't think it's a good idea to rely on a project which is not actively maintained.

antegamisou over 4 years ago
I always found something along the lines of

    pdftotext -layout file.pdf - | grep -E ...

useful for PDFs (the trailing `-` makes pdftotext write to stdout instead of a .txt file). Good to see a Swiss Army knife utility for all sorts of files though!

lafrenierejm over 4 years ago
If anyone is interested in gron [0], I have an open PR [1] to add it as an adapter to ripgrep-all. The patch was based on the most recent release, since master is currently not functional.

0: https://github.com/TomNomNom/gron

1: https://github.com/phiresky/ripgrep-all/pull/77

faitswulff over 4 years ago
I noticed that you can use Tesseract as an OCR adapter for rga. Tesseract is written in python, IIRC, and in the OP it comes with a warning that it’s slow and not enabled by default. Are there any other fast, reliable OCR libs out there? Or any rust OCR backends?

soferio over 4 years ago
Can it (or any tool) perform proximity searches on scanned PDFs? E.g. word1 within 20 words of word2, on scanned PDFs? (I think this is non-trivial but very useful.)

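To make the "word1 within 20 words of word2" idea concrete, here is a minimal sketch of a proximity check over text that has already been extracted (for example by an OCR step). It is not a feature of rga or ripgrep as far as this thread shows; the function name `within_words` is hypothetical, matching is case-insensitive on whole words, and the two search words are assumed to be different.

```rust
/// Returns true if `a` and `b` both occur in `text` with at most `max_gap`
/// word positions between them. Words are runs of alphanumeric characters,
/// so line breaks and punctuation do not affect the distance.
fn within_words(text: &str, a: &str, b: &str, max_gap: usize) -> bool {
    let words: Vec<String> = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect();
    let (a, b) = (a.to_lowercase(), b.to_lowercase());

    // Track the most recent position of each word; checking the latest pair
    // at every step is enough to find any pair that is close enough.
    let mut last_a: Option<usize> = None;
    let mut last_b: Option<usize> = None;
    for (i, w) in words.iter().enumerate() {
        if *w == a {
            last_a = Some(i);
        }
        if *w == b {
            last_b = Some(i);
        }
        if let (Some(x), Some(y)) = (last_a, last_b) {
            if x != y && x.abs_diff(y) <= max_gap {
                return true;
            }
        }
    }
    false
}

fn main() {
    let text = "the quick brown\nfox jumps over the lazy dog";
    println!("{}", within_words(text, "quick", "jumps", 3));
}
```

The hard part on scanned PDFs is the quality of the OCR text, not the distance check itself, so a sketch like this only helps once the words have been recognised correctly.
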
supernova87a over 4 years ago
For PDFs, how does it (does it?) deal with for example, when phrases get ripped apart by the layout? Like if you search for a multiple word phrase, it's often foiled by word wrap or being in a table.

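One common workaround for the word-wrap case is to collapse all whitespace on both the extracted text and the query before comparing. The sketch below assumes the text has already been extracted from the PDF; `normalize` and `contains_phrase` are hypothetical helper names, not part of rga or ripgrep, and this does nothing for tables where the extraction interleaves cells in a different order.

```rust
/// Collapse every run of whitespace (spaces, tabs, newlines) into a single
/// space and lower-case the result.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Case-insensitive phrase search that ignores how the phrase was wrapped.
fn contains_phrase(text: &str, phrase: &str) -> bool {
    normalize(text).contains(&normalize(phrase))
}

fn main() {
    let extracted = "flexible chains of\n   extractors";
    println!("{}", contains_phrase(extracted, "chains of extractors"));
}
```
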
diimdeep over 4 years ago
Does anyone prefer a search tool other than Spotlight on macOS?

fock over 4 years ago
Can it produce links to open the file yet? (I don't know Rust, so I can't add a PR easily.) At least gnome-terminal supports that (and normally it should also support opening a specific PDF page)!

dang over 4 years ago
If curious see also

2019 https://news.ycombinator.com/item?id=20196982

maxioatic over 4 years ago
This is great. I have 100+ ebooks/PDFs of programming books and textbooks, from which I've been extracting the index pages. My intention was always to make some sort of search index out of them. I will definitely be trialing this (initial few searches seem promising!)

chris_st over 4 years ago
Curious why this isn't a pull request to ripgrep? Maybe it was, and rejected? It'd be nice to just have one tool, and this doesn't feel like it's a stretch to add to ripgrep.

SamuelAdams over 4 years ago
Any advantages to this over something like Agent Ransack?

https://www.mythicsoft.com/agentransack/

cb321 over 4 years ago
NOTE: ripgrep already has --pre. (No pre-built indexing, of course.)

hiq over 4 years ago
It would be nice to have a direct comparison with ugrep. In the case of rg the benchmarks are already enough to switch. Why should I use rga instead of ugrep?

patricktlo over 4 years ago
Thanks! This is a godsend for someone like me who needs to search through many PDFs/docx documents to find information for work!

edm0nd over 4 years ago
Big fan of ripgrep. Use it on Windows to search through 100s of GBs of data really quickly.

lenkite over 4 years ago
Is there a way to alias adapters, so that .jar and .esa can be handled by the adapter used for .zip?

nikisweeting over 4 years ago
Aww hell yeah we should definitely use this in place of ripgrep for the new ArchiveBox.io full-text search backend.

https://github.com/ArchiveBox/ArchiveBox/pull/543

root_axis over 4 years ago
This is really great.

kovek over 4 years ago
How could I use Rga to search my browsing history?

vmchale over 4 years ago
Wonderful! pdfgrep is good but slow.

0df8dkdf over 4 years ago
Great tool!!!

gopty over 4 years ago
Sounds like a poor man's version of recoll

https://www.lesbonscomptes.com/recoll/

A PDF in a Zip file, in an email attachment. recoll can index it and do OCR if you like

globular-toast over 4 years ago
I have mixed feelings about these kinds of tools.

I can understand it might be nice to have a personal library of PDF books and to search in them. I can't think of a time I've ever wished I could search my bookshelf in that way, but you never know.

Obviously I use tools like ripgrep for searching codebases and the like.

But the extreme flexibility of this one in particular (and others like macOS Spotlight) makes it seem more like a data recovery tool to me. If my directory structures and databases ever completely failed for some reason I might need to search through everything to find the data again. It's good to know such tools exist, I suppose.

But my fear is that tools like this teach people to not worry about organisation of data and to just fill up their disks with no structure at all. I think that unless something goes terribly wrong nobody should ever need a tool like this. Once you rely on it, you're out of luck if it ever fails you. What if you just can't remember a single searchable phrase from some document, but you just *know* it must exist somewhere?

It's similar to what Google has done to the web. When I was growing up it used to be a skill to use the web. People used tools like bookmarks and followed links from one place to another. Now it's just type it into Google and if Google doesn't know, it doesn't exist.
