
Ask HN: Scripts/commands for extracting URL article text? (links -dump but)

1 point by WCityMike, almost 6 years ago
I'd like to have a Unix script that basically generates a text file, named with the page title, containing the article text neatly formatted.

This seems to me to be something so commonly desired that it would've been done and done and done a hundred times over by now, but I haven't found the magic search terms to dig up people's creations.

I imagine it starts with "links -dump", but then there's using the title as the filename, removing the padded left margin, wrapping the text, and removing all the excess linkage.

I'm a beginner-amateur when it comes to shell scripting, Python, etc. I can Google well and usually understand script or program logic, but don't have the terms memorized.

Is this exotic enough that people haven't done it, or, as I suspect, does this already exist and I'm just not finding it? Much obliged for any help.

3 comments

westurner, almost 6 years ago
> *I imagine it starts with "links -dump", but then there's using the title as the filename,*

The title tag may exceed the filename length limit, be the same for nested pages, or contain newlines that must be escaped.

These might be helpful for your use case:

"Newspaper3k: Article scraping & curation" https://github.com/codelucas/newspaper

lazyNLP "Library to scrape and clean web pages to create massive datasets" https://github.com/chiphuyen/lazynlp/blob/master/README.md#step-4-clean-the-webpages

scrapinghub/extruct https://github.com/scrapinghub/extruct

> *extruct is a library for extracting embedded metadata from HTML markup.*

> *It also has a built-in HTTP server to test its output as JSON.*

> *Currently, extruct supports:*

> *- W3C's HTML Microdata*

> *- embedded JSON-LD*

> *- Microformat via mf2py*

> *- Facebook's Open Graph*

> *- (experimental) RDFa via rdflib*
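The filename caveats above (length limits, embedded newlines, punctuation) can be handled with a small standard-library-only sanitizer. This is just a sketch; `safe_filename` and its limits are illustrative, not from any library:

```python
import re

def safe_filename(title, max_len=100):
    """Turn an arbitrary page title into a filesystem-safe filename stem."""
    name = title.strip().lower()
    name = re.sub(r"[^\w\s-]", "", name)  # drop punctuation
    name = re.sub(r"\s+", "-", name)      # collapse spaces/newlines to dashes
    return name[:max_len] or "untitled"   # enforce a length limit, never empty
```

This sidesteps the newline and length problems before the title ever reaches the filesystem; duplicate titles for nested pages would still need a suffix (e.g. a counter or URL hash).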
WCityMike, almost 6 years ago
Just for the record in case anyone digs this up on a later Google search: install the newspaper and unidecode Python libraries (pip3 install; re is in the standard library), then:

```python
from sys import argv
import os
import re

from unidecode import unidecode
from newspaper import Article

script, arturl = argv
url = arturl
article = Article(url)
article.download()
article.parse()

# Build a filesystem-safe filename from the article title
title2 = unidecode(article.title)
fname2 = title2.lower()
fname2 = re.sub(r"[^\w\s]", "", fname2)
fname2 = re.sub(r"\s+", "-", fname2)

# Collapse runs of blank lines in the body text
text2 = unidecode(article.text)
text2 = re.sub(r"\n\s*\n", "\n\n", text2)

# open() does not expand "~", so expand it explicitly
path = os.path.expanduser("~/Desktop/" + fname2 + ".txt")
with open(path, "w") as f:
    f.write(title2 + "\n\n")
    f.write(text2 + "\n")
```

I execute it from the shell via:

```bash
#!/bin/bash
/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.7/bin/python3 ~/bin/url2txt.py "$1"
```

If I want to run it on all the URLs in a text file:

```bash
#!/bin/bash
while IFS='' read -r l || [ -n "$l" ]; do
    ~/bin/u2t "$l"
done < "$1"
```

I'm sure most of the coders here are wincing at one or more mistakes or badly formatted items I've done here, but I'm open to feedback ...
spaceprison, almost 6 years ago
I don't know of a specific script, but you might be able to make something with Python using the requests, beautifulsoup and markdownify modules.

requests to fetch the page, beautifulsoup to grab the tags you care about (title info), and then markdownify to take the raw HTML and turn it into markdown.