I'd like to have a Unix script that takes a web page and generates a text file named with the page title, containing the article text neatly formatted.

This seems to me to be something so commonly desired that it would've been done a hundred times over by now, but I haven't found the magic search terms to dig up people's creations.

I imagine it starts with "links -dump", but then there's using the title as the filename, removing the padded left margin, wrapping the text, and removing all the excess linkage.

I'm a beginner-amateur when it comes to shell scripting, Python, etc. - I can Google well and usually follow script or program logic, but I don't have the terms memorized.

Is this exotic enough that people haven't done it, or, as I suspect, does this already exist and I'm just not finding it? Much obliged for any help.
> I imagine it starts with "links -dump", but then there's using the title as the filename,

The title tag may exceed the filename length limit, be the same for nested pages, or contain newlines that must be escaped (see the sanitization sketch after these links).

These might be helpful for your use case:

Newspaper3k: "Article scraping & curation"
https://github.com/codelucas/newspaper

lazyNLP: "Library to scrape and clean web pages to create massive datasets"
https://github.com/chiphuyen/lazynlp/blob/master/README.md#step-4-clean-the-webpages

scrapinghub/extruct
https://github.com/scrapinghub/extruct

> extruct is a library for extracting embedded metadata from HTML markup.
> It also has a built-in HTTP server to test its output as JSON.
> Currently, extruct supports:
> - W3C's HTML Microdata
> - embedded JSON-LD
> - Microformat via mf2py
> - Facebook's Open Graph
> - (experimental) RDFa via rdflib
Just for the record, in case anyone digs this up in a later Google search: install the newspaper3k and unidecode Python libraries (pip3 install newspaper3k unidecode; re ships with the standard library), then:

    import os
    import re
    from sys import argv

    from unidecode import unidecode
    from newspaper import Article

    # Usage: python3 url2txt.py <url>
    script, url = argv

    # Fetch and parse the article.
    article = Article(url)
    article.download()
    article.parse()

    # Transliterate the title to ASCII and build a filename from it.
    title = unidecode(article.title)
    fname = title.lower()
    fname = re.sub(r"[^\w\s]", '', fname)   # drop punctuation
    fname = re.sub(r"\s+", '-', fname)      # whitespace -> hyphens

    # Clean up the body text: collapse runs of blank lines.
    text = unidecode(article.text)
    text = re.sub(r'\n\s*\n', '\n\n', text)

    # open() does not expand '~', so expand the home directory explicitly.
    outpath = os.path.expanduser('~/Desktop/' + fname + '.txt')
    with open(outpath, 'w') as f:
        f.write(title + '\n\n')
        f.write(text + '\n')
I execute it from a shell wrapper:

    #!/bin/bash
    /usr/local/opt/python3/Frameworks/Python.framework/Versions/3.7/bin/python3 ~/bin/url2txt.py "$1"
If I want to run it on all the URLs in a text file:

    #!/bin/bash
    while IFS='' read -r l || [ -n "$l" ]; do
        ~/bin/u2t "$l"
    done < "$1"
I'm sure most of the coders here are wincing at one or more mistakes or badly formatted bits, but I'm open to feedback ...
I don't know of a specific script, but you might be able to put something together in Python using the requests, beautifulsoup4, and markdownify modules.

requests to fetch the page, BeautifulSoup to grab the tags you care about (title info), and then markdownify to take the raw HTML and turn it into Markdown. A rough sketch is below.
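Something along these lines, untested, with the URL and the title cleanup as placeholder assumptions:

    import re

    import requests
    from bs4 import BeautifulSoup
    from markdownify import markdownify

    url = "https://example.com/some-article"  # placeholder URL
    html = requests.get(url, timeout=30).text

    # Grab the title from the parsed page and turn it into a filename.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else "untitled"
    fname = re.sub(r"[^\w\s-]", "", title)
    fname = re.sub(r"\s+", "-", fname).strip("-").lower()

    # Convert the page to Markdown. A real script would first narrow the soup
    # down to the article body so navigation and footer cruft don't come along.
    markdown = markdownify(html)

    with open(fname + ".md", "w") as f:
        f.write("# " + title + "\n\n" + markdown)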