It is making one mistake: it parses and scrapes in the same loop. You should pull the data, store it, and have another process access the data store and perform the parsing and understanding of the data. A "quick" parse can be done to pull the links and build your frontier, but the data should be pulled and stored for the main parsing.<p>This allows you to test your parsing routines independently of the target website, to later compare with previous versions, and to reparse everything in the future, even after the original website is long gone.<p>My recommendation is to use the WARC archive format to store the results; this way you are on the safe side (the storage is standardized), it compresses very well, and WARC files are easy to handle (they are an immutable store, nice for backups).
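For illustration, a minimal sketch of that fetch-now-parse-later split using the warcio library (the library choice, file name and URL are my assumptions, not part of the comment above):

  # capture_http records raw HTTP request/response pairs into a WARC file;
  # note: requests must be imported *after* capture_http for the hook to apply
  from warcio.capture_http import capture_http
  import requests

  with capture_http('pages.warc.gz'):              # gzipped WARC, append-friendly
      requests.get('https://example.com/some-page')

  # a separate process can later iterate the stored responses and do the real parsing
  from warcio.archiveiterator import ArchiveIterator

  with open('pages.warc.gz', 'rb') as stream:
      for record in ArchiveIterator(stream):
          if record.rec_type == 'response':
              html_bytes = record.content_stream().read()
              # ...run the heavy parsing here, even after the original site is long gone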
Yeah, this might be handy for small stuff, but it's way too naive for anything bigger than a couple of pages. I recently had to scrape some pictures and metadata from a website, and while scripts like these seemed cool, they really didn't scale up at all. Consider navigation, following URLs and downloading pictures, all while staying within the limits of what's considered non-intrusive.<p>My first attempt, similar to this, failed miserably as the site employed some kind of cookie check that immediately blocked my requests by returning 403.<p>As mentioned in the article, I then moved on to Scrapy <a href="https://scrapy.org/" rel="nofollow">https://scrapy.org/</a>. While it seems a bit overkill, once you create your scraper it's easy to expand and to reuse the same scaffold on other sites too. It also gives a lot more control over how gently you scrape, and it outputs the data you want nicely as json/jl/csv.<p>Most problems I had were with the Scrapy pipelines and getting them to properly output two JSON files plus the images. I could write a very short tutorial on my setup if I weren't at work and otherwise busy right now.<p>And yes, it's a bit of a grey area, but for my project (training a simple CNN based on the images) I think it was acceptable, considering that I could have done the same thing manually (and spent less time too).
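For a flavour of what that scaffold looks like, here is a rough Scrapy spider sketch; the URLs and selectors are placeholders, not the site I actually scraped:

  import scrapy

  class GallerySpider(scrapy.Spider):
      name = "gallery"
      start_urls = ["https://example.com/gallery"]        # placeholder
      custom_settings = {
          "DOWNLOAD_DELAY": 2.0,       # stay gentle / non-intrusive
          "ROBOTSTXT_OBEY": True,
      }

      def parse(self, response):
          for card in response.css("div.card"):           # placeholder selector
              yield {
                  "title": card.css("h2::text").get(),
                  "image_urls": card.css("img::attr(src)").getall(),
              }
          next_page = response.css("a.next::attr(href)").get()
          if next_page:
              yield response.follow(next_page, callback=self.parse)

Running it with "scrapy runspider gallery.py -o items.jl" gives the jl output mentioned above; the image downloads usually go through Scrapy's ImagesPipeline (an IMAGES_STORE setting plus the image_urls field), which is where my pipeline headaches came from.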
I love requests+lxml and use it fairly regularly; just a few quick notes:<p>1. lxml is <i>way</i> faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.<p>2. Don't forget to check the status code of r (r.status_code, or simply r.ok).<p>3. Those with a front-end background might prefer the .cssselect method available on the parsed document object. It's obviously a tad slower than find/findall/xpath, but it's often too convenient to pass up.<p>4. Kind of automatic, but I'll say it anyway - scraping is a gray area, always make sure that what you're doing is legitimate.
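To make points 2 and 3 concrete, a small sketch (the URL and selector are made up for the example):

  import requests
  from lxml import html

  r = requests.get("https://example.com/blog")
  if not r.ok:                              # point 2: always check the status
      raise RuntimeError("got HTTP %d" % r.status_code)

  doc = html.fromstring(r.content)
  # point 3: .cssselect() (needs the cssselect package) next to a rough xpath equivalent
  via_css = [a.text_content() for a in doc.cssselect("h2.entry-title a")]
  via_xpath = doc.xpath('//h2[@class="entry-title"]/a/text()')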
This is perhaps the fastest way to screen-scrape a dynamically executed website.<p>1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: <a href="https://github.com/prettydiff/getNodesByType/blob/master/getNodesByType.js" rel="nofollow">https://github.com/prettydiff/getNodesByType/blob/master/get...</a><p>2. Extract the text content from the text nodes and ignore nodes that contain only white space:<p>
  // gather every text node (nodeType 3) and keep the ones that aren't pure whitespace
  let text = document.getNodesByType(3),
      a = 0,
      b = text.length,
      output = [];
  while (a < b) {
      if ((/^(\s+)$/).test(text[a].textContent) === false) {
          output.push(text[a].textContent);
      }
      a = a + 1;
  }
  output;  // the final expression is what the console reports back<p>That will gather ALL text from the page. Since you are working from the DOM directly you can filter your results by various contextual and stylistic factors. And since this code is small and executes stupid fast, it can easily be run by bots.
I wonder how many folks using this will obey the robots.txt as explained nicely within the article:<p>"Robots<p>Web scraping is powerful, but with great power comes great responsibility. When you are scraping somebody’s website, you should be mindful of not sending too many requests. Most websites have a “robots.txt” which shows the rules that your web scraper should obey (which URLs are allowed to be scraped, which ones are not, the rate of requests you can send, etc.)."
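Checking it is cheap, too - Python even ships a parser in the standard library (the user agent and URLs below are just placeholders):

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()

  ua = "my-little-scraper/0.1"
  print(rp.can_fetch(ua, "https://example.com/some/page"))  # is this URL allowed for us?
  print(rp.crawl_delay(ua))    # suggested delay, or None if no Crawl-delay is set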
I've found a lot of use cases for web scraping are kind of ad hoc and usually occur as part of another task (e.g. a research project or enhancing a record). I ended up releasing a simple hosted API service called Page.REST (<a href="https://page.rest" rel="nofollow">https://page.rest</a>) for people who would like to save that extra dev effort and infrastructure cost.
With a headless browser the web scraping script can be even simpler. For example, have a look at the same scraper for datawhatnow.com at <a href="https://www.apify.com/jancurn/YP4Xg-api-datawhatnow-com" rel="nofollow">https://www.apify.com/jancurn/YP4Xg-api-datawhatnow-com</a>
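That example runs on Apify; for comparison only, a generic headless-Chrome sketch via Selenium (not the linked actor's code, and the selector is invented) would look roughly like this:

  from selenium import webdriver
  from selenium.webdriver.common.by import By

  options = webdriver.ChromeOptions()
  options.add_argument("--headless")
  driver = webdriver.Chrome(options=options)
  try:
      driver.get("https://datawhatnow.com")
      # the page's JavaScript has already run, so the rendered DOM is queryable directly
      titles = [a.text for a in driver.find_elements(By.CSS_SELECTOR, "h2.entry-title a")]
  finally:
      driver.quit()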
I've found .NET great for scraping - more so than Python, as I find LINQ can be really useful for weird cases.<p>My usual setup on OSX is .NET Core + HTMLAgilityPack + Selenium.
Could CSS selectors, with a few minor extensions, be just as good as XPath for this kind of thing?<p>I guess a lot of the reason I find XPath frustrating is that my usage frequency corresponds exactly to the time needed to forget the syntax and have to relearn/refresh it in my head.<p>If CSS selectors needed only a few enhancements to compete with XPath, it might be worth enhancing a selector library to enable a quicker ramp-up for web people.
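They already get surprisingly close: lxml's CSS support is a thin translation layer, so you can even inspect which XPath a given selector compiles to (a tiny sketch using the cssselect package):

  from cssselect import GenericTranslator

  # cssselect is what lxml's .cssselect() uses under the hood; it simply
  # compiles a CSS selector into an equivalent XPath expression
  print(GenericTranslator().css_to_xpath("div.post h2.entry-title > a"))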
As an alternative to lxml or BeautifulSoup, I've used a library called PyQuery (<a href="https://pythonhosted.org/pyquery/" rel="nofollow">https://pythonhosted.org/pyquery/</a>) with some success. It has a very similar API to jQuery.
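A quick taste of the API (the URL and selector are just examples):

  from pyquery import PyQuery as pq

  doc = pq(url="https://example.com")            # PyQuery can fetch the page itself
  for a in doc("h2.entry-title a").items():      # .items() yields PyQuery-wrapped nodes
      print(a.text(), a.attr("href"))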
I can't stress enough what a bad idea it usually is to copy XPath expressions generated by dev tools. They tend to be super inefficient for traversing the tree (e.g. beginning with "*" for no reason), and don't make good use of tag attributes.
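A hypothetical before/after to make that concrete (the file name and class names are invented):

  from lxml import html

  doc = html.fromstring(open("page.html").read())

  # what dev tools typically generate: purely positional, breaks on any layout change
  fragile = doc.xpath('/html/body/div[2]/div/div[1]/div[3]/ul/li[7]/a')

  # anchoring on attributes states the intent and survives markup reshuffles
  robust = doc.xpath('//article[contains(@class, "post")]//h2[@class="entry-title"]/a/@href')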
I wrote a Clojure library that facilitates writing this sort of script in a relatively robust way:<p><a href="https://github.com/nathell/skyscraper" rel="nofollow">https://github.com/nathell/skyscraper</a>
lxml is nice. I would, as suggested, parse and scrape in different threads so you can speed things up a bit, but it's not strictly required. If you can't get the data you see on the website using lxml, the content is probably loaded via AJAX or something similar; to capture those streams/data, use a headless browser like PhantomJS or so.
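A bare-bones version of that split, with one thread fetching and the main thread parsing (the URLs are placeholders):

  import queue, threading
  import requests
  from lxml import html

  urls = ["https://example.com/page/%d" % i for i in range(1, 4)]
  pages = queue.Queue()

  def fetch():
      for url in urls:
          pages.put((url, requests.get(url).content))
      pages.put(None)                            # sentinel: nothing more to parse

  threading.Thread(target=fetch, daemon=True).start()

  while True:
      item = pages.get()
      if item is None:
          break
      url, body = item
      doc = html.fromstring(body)
      print(url, doc.findtext(".//title"))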
The article looks good to me for 'simple' scraping and is a good base to start playing with the concepts.<p>The nice thing about making a scraper from scratch like this is that you get to decide its behaviour and fingerprint, and you won't get blocked as some known scraper. That being said, most people would appreciate it if you parse their robots.txt, though depending on your geographical location this might be an 'extra' step which isn't strictly needed... (I'd advise doing it anyway if you are friendly ;) and maybe put something like 'i don't bite' in the request user agent to let people know you are benign...) If you get blocked while trying to scrape, you can try to make the site think you are a browser just by setting the user agent and other headers appropriately (rough example at the end of this comment). If you don't know which headers those are, open nc -nlvp 80 on your local machine and point wget or Firefox at it to see what gets sent...<p>Deciding on good XPath expressions or 'markers' to scrape can be automated, but if you need accurate data from a single source, it's often a good idea to manually go through the HTML and pick solid markers yourself...<p>An alternate method of scraping is automating wget --recursive plus links -dump to render HTML pages to text output and then grep (or whatever) the result for the data you need... Tons of methods can be devised; depending on your needs, some will be more practical and stable than others.<p>Saving the raw files is only useful if you need assurance on data quality and want to be able to tweak the results without re-requesting the data from the server (just point at a local data directory instead...). This way you can set up a harvester and separate parsers for this data.<p>If you want to scrape or harvest LARGE data sets, consider a proxy network or something like a Tor-connection-juggling Docker instance to ensure rate limiting isn't killing your harvesters...<p>Good luck, have fun, and don't kill people's servers with your traffic spam, that's a dick move... (throttle/humanise your scraping...)
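For the headers/user-agent bit, a rough idea of what that looks like with requests (the header values and URLs are only an example):

  import time, random
  import requests

  headers = {
      # look like a browser if you must, but leaving a contact is the friendlier option
      "User-Agent": "i-dont-bite/0.1 (+mailto:you@example.com)",
      "Accept": "text/html,application/xhtml+xml",
      "Accept-Language": "en-US,en;q=0.9",
  }

  for url in ["https://example.com/a", "https://example.com/b"]:
      r = requests.get(url, headers=headers, timeout=10)
      # throttle/humanise: randomised pauses instead of hammering the server
      time.sleep(random.uniform(1.0, 3.0))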