Before jumping into frameworks: if you're lucky enough to have your data stored in an HTML table, pandas can handle it directly:<p><pre><code> import pandas as pd
dfs = pd.read_html(url)
</code></pre>
Here ‘dfs’ is a list of DataFrames, one for each HTML table on the page.<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html" rel="nofollow">https://pandas.pydata.org/pandas-docs/stable/reference/api/p...</a>
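If the page has several tables, read_html can also narrow things down for you, for example by matching text inside the table (the URL below is just a placeholder):<p><pre><code> import pandas as pd

url = "https://example.com/stats"            # placeholder
# keep only tables whose text matches "Population"
dfs = pd.read_html(url, match="Population")
df = dfs[0]
print(df.head())
</code></pre>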
I've been involved in many web scraper jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used Scrapy, with XML tooling for navigating the DOM.<p>It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an EC2 instance; it checks its tags to see if it should execute the script, and if so, it does. When it's done with its scraping for the day, it turns itself off.<p>This is a tiny snapshot of why it's been so difficult for me to go from Python 2 to Python 3. I'm strongly in the camp of "if it ain't broke, don't fix it".
One tip I would pass on when trying to scrape data from a website, start by using wget in mirror mode to download the useful pages. It's much faster to iterate on scraping the data once you have it locally. Also, less likely to accidentally kill the site or attract the attention of the host.
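Roughly what that workflow looks like, as a sketch (the wget flags are the standard mirroring ones; paths and the parsing step are made up):<p><pre><code> # one-time mirror, politely:
#   wget --mirror --convert-links --adjust-extension --wait=1 https://example.com/
# then iterate on the parser against the local copy:
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path("example.com").rglob("*.html"):
    soup = BeautifulSoup(path.read_text(errors="ignore"), "html.parser")
    title = soup.title.string if soup.title else None
    print(path, title)
</code></pre>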
My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.<p>I ended up building something like 8 versions of it, using just about every PHP and Python library and resource I could find.<p>I tried httrack, php-ultimate-web-scraper (from GitHub), headless Chromium, headless Selenium, and a few others.<p>By far the biggest problem was dealing with JS links... you wouldn't think from the start that it would be such a big deal, but it was.<p>Selenium with Python turned out to be the winning combination, and of course it was the last one I tried. This is also an ideal project for recursion, although you have to be careful about exit conditions.<p>One thing that was VERY important for performance was not visiting any page more than once, because certain links in headers and footers are duplicated sometimes hundreds of times.<p>JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.<p>It was a super fun project, and in the end, considering I only worked on it for 2 months, I shipped some decent code that was getting about 98.6% of the pages perfectly.<p>The final presentation was interesting... for some reason my client had gotten it into his head that I wasn't a very good programmer, and he ran through his list of sample sites expecting my program to error out or mirror them incorrectly. It handled all 10 sites just about perfectly, and he was rather flabbergasted: he told me it would have taken him a week of hand-clicking through a site to mirror it, but the program did them all in under an hour.
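The "never visit a page twice" part is essentially a visited set wrapped around the crawl. A stripped-down sketch with Selenium (the target URL is hypothetical, and real code needs far better exit conditions and error handling):<p><pre><code> from urllib.parse import urldefrag
from selenium import webdriver
from selenium.webdriver.common.by import By

START = "https://example.com/"      # hypothetical site to mirror
driver = webdriver.Chrome()         # assumes a local Chrome/chromedriver setup
visited = set()

def crawl(url, depth=0):
    url = urldefrag(url).url        # treat #fragment variants as the same page
    if url in visited or depth > 3: # the exit conditions are the important part
        return
    visited.add(url)
    driver.get(url)
    # ...save driver.page_source to disk here...
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    for href in links:
        if href and href.startswith(START):
            crawl(href, depth + 1)

crawl(START)
driver.quit()
</code></pre>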
I think this article does an OK job covering how to scrape websites rendered server-side, but I strongly discourage people from scraping SPAs with a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website since you don't need to request a bunch of other resources, reduces the overall bandwidth cost, and greatly speeds up the runtime of your script since you don't spend time running JavaScript in a headless browser. That can be especially slow if your script has to click or otherwise interact with elements on the page to get the results you need.<p>Other than that, I'd strongly caution anyone considering parallel requests. Always keep in mind the sysadmins and engineers behind the site you're targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can cause massive server load for the site you're targeting. If that isn't enough to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if too many bad actors hit it heavily.
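In practice that usually boils down to replaying one request you spotted in the network tab; the endpoint and parameters below are invented purely for illustration:<p><pre><code> import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-research-bot/0.1 (contact@example.com)"})

# hypothetical JSON endpoint discovered in the browser's network tab
resp = session.get(
    "https://example.com/api/v2/listings",
    params={"page": 1, "per_page": 50},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["results"]:   # response shape depends on the actual API
    print(item["id"], item.get("title"))
</code></pre>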
It’s fun to combine Jupyter notebooks and Python scraping. If you are working 15 pages/screens deep, you can “stay at the coal face” and not have to rerun the whole script after making a change to the latest step.
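The trick is simply keeping the expensive fetch in its own cell, so re-running the parsing cell costs nothing; a minimal sketch (URL and selector are made up):<p><pre><code> # cell 1: fetch once and keep the raw HTML in memory
import requests
html = requests.get("https://example.com/page15", timeout=30).text

# cell 2: iterate on the parsing as often as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.results tr")
len(rows)
</code></pre>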
I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools, including Scrapy), I was surprised how hard it was to do at scale. The open source options I could find were hard or impossible to get working, overly complex, badly documented, etc.<p>The hosted services I found were reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting this up yourself, especially when you need to run these jobs every month or so, and even if you have to write some code to make the open source solutions actually work.
One thing I notice with all blog articles, and HN comments, on scraping is that they always omit the actual use case, i.e., the specific website that someone is trying to scrape. Any examples tend to be so trivial as to be practically meaningless. They do not prove anything.<p>If authors did name websites they wanted to scrape, or show tests on actual websites, then we might see others come forward with different solutions. Some of them might beat the ones being put forward by the pre-packaged software libraries/frameworks and commercial scraping services built on them, e.g., less brittle, faster, less code, easier to repair.<p>We will never know.
In my career I found several reasons not to use regular expressions for parsing an HTML response, but the biggest was this: a regex may work for 'properly formed' documents, but you would be surprised how lax browsers are about requiring documents to be well-formed. Your regex, unless it is written to handle them specifically, will not cope with sites like that (and there are a lot of them, at least in my experience). You may be able to work 'edge cases' into your regex, but good luck finding anyone other than the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't working (and there will be a LOT of those cases with HTML/XML documents).<p>It is honestly almost never worth it unless you have constraints on what packages you can use and you MUST use regular expressions. Just do your future self a favor and use BeautifulSoup or some other package designed to parse the tree-like structure of these documents.<p>One way regex can be used appropriately is finding a pattern in the document without caring where it sits w.r.t. the rest of the document. But even then, do you really want to match: <!-- <div> --> ?
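The commented-out markup case is easy to demonstrate: a regex happily "finds" the div inside the comment, while a real parser does not treat it as an element.<p><pre><code> import re
from bs4 import BeautifulSoup

html = "<body><!-- <div>old layout</div> --><div>real content</div></body>"

print(re.findall(r"<div>(.*?)</div>", html))
# ['old layout', 'real content']   <- the regex can't tell the comment apart

soup = BeautifulSoup(html, "html.parser")
print([d.get_text() for d in soup.find_all("div")])
# ['real content']
</code></pre>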
I have been developing scrapers and crawlers and writing[1] about them for many years, and I have used many Python-based libs so far, including Selenium. I have written such scrapers for individuals and startups for several purposes. The biggest issues I faced were rendering dynamic sites and getting IPs blocked in the absence of proxies, which are not cheap at all, especially for individuals.<p>Services like Scrapingbee and ScraperAPI serve quite well for such problems. I personally liked ScraperAPI for rendering dynamic websites due to the better response time.<p>Shameless plug: in case anyone is interested, a long time back I wrote about it on my blog, which you can read here[2]. Now you do not need to set up a remote Chrome instance or anything; all that is required is to hit an API endpoint to fetch content from a dynamic, JS-rendered website.<p>[1] <a href="http://blog.adnansiddiqi.me/tag/scraping/" rel="nofollow">http://blog.adnansiddiqi.me/tag/scraping/</a><p>[2] <a href="http://blog.adnansiddiqi.me/scraping-dynamic-websites-using-scraper-api-and-python/" rel="nofollow">http://blog.adnansiddiqi.me/scraping-dynamic-websites-using-...</a>
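The pattern with these services is essentially a GET proxy. Roughly (parameter names are from memory, so check the provider's docs rather than trusting this sketch):<p><pre><code> import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com/some-js-heavy-page",
    "render": "true",   # ask the service to execute JS before returning the HTML
}
resp = requests.get("http://api.scraperapi.com", params=payload, timeout=60)
resp.raise_for_status()
html = resp.text
</code></pre>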
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
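For anyone who hasn't tried it, the lxml version is only a few lines; the XPath expressions here are obviously site-specific placeholders:<p><pre><code> import requests
from lxml import html

doc = html.fromstring(requests.get("https://example.com/products", timeout=30).content)

# made-up selectors: every product title and price inside a listing card
titles = doc.xpath('//div[@class="product-card"]//h2/text()')
prices = doc.xpath('//div[@class="product-card"]//span[@class="price"]/text()')
for title, price in zip(titles, prices):
    print(title.strip(), price.strip())
</code></pre>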
Fetching HTML and then navigating the parsed result (or using regexps) is what used to work 20 years ago.
These days, with all these reactive javascript frameworks you better skip to item number 5: headless browsing.
Also mind that Facebook, Instagram, ... will have anti-scraping measures in place. It's a race ;)
Is web scraping going to continue to be a viable thing, now that the web is mainly an app delivery platform rather than a content delivery platform?<p>Can you scrape a webasm site?
I have been web scraping for almost 4 years now. That is my entire niche.<p>The problem with web scraping is that you never really know where the ethical line is. These days I will reverse engineer a website to minimize the request load and only target specific API endpoints. But then again, I am bypassing some of their security measures while doing that.
I do web scraping for fun and profit, primarily using Python. Wrote a post some time back about it.<p><a href="https://www.kashifaziz.me/web-scraping-python-beautifulsoup.html/" rel="nofollow">https://www.kashifaziz.me/web-scraping-python-beautifulsoup....</a>
Is there a SOTA library for common web scraping issues at scale (especially distributed over a cluster of nodes): CAPTCHA detection, IP rotation, rate throttling, queue management, etc.?
Personally I have not needed BeautifulSoup a single time when web scraping. People say it is better for unclean HTML, which I cannot confirm, because I never needed it and was always able to get my results using LXML + etree with XPath and CSS selectors. Once I also used Scrapy, but still not BeautifulSoup. I am glad there is a guide that starts with LXML instead of immediately jumping to BeautifulSoup.
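For reference, the CSS selector route in lxml goes through the cssselect package; the selectors below are made-up examples:<p><pre><code> from lxml import html

tree = html.fromstring(open("page.html", "rb").read())

# CSS selectors (requires the cssselect package)
for link in tree.cssselect("div.article > h3 a"):
    print(link.text_content().strip(), link.get("href"))

# roughly the same thing as XPath
for link in tree.xpath('//div[contains(@class, "article")]/h3//a'):
    print(link.text_content().strip(), link.get("href"))
</code></pre>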
We created a fun side project to grab the index page of every domain - we downloaded a list of approx 200m domains. However, we ran into problems when our provider complained. It was something to do with the DNS side of things and we were told to run our own DNS server. If there is anyone on here with experience of crawling across this number of domain names it would be great to talk!
The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of computing power, which is really expensive on EC2.<p>A couple of years ago I discovered browserless.io, which does this job for you, and it's amazing. I really don't know how they made this, but it just scales without any limit.
I recently undertook my first scraping project, and after trying a number of things I landed on Scrapy.<p>It's been a blessing. Not only can it handle difficult sites, but it's also super quick to write another spider for the easy sites that provide the JSON blob in a handy single API call.<p>The only problem I had was getting around Cloudflare; I tried a few things like Puppeteer but had no luck.
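For the "JSON blob in a single API call" case, a whole Scrapy spider really can be just a few lines; the URL and field names below are placeholders:<p><pre><code> import scrapy

class ListingsSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://example.com/api/listings?page=1"]

    def parse(self, response):
        data = response.json()            # Scrapy >= 2.2
        for item in data["results"]:
            yield {"id": item["id"], "title": item.get("title")}
        # follow pagination if the API exposes it
        next_url = data.get("next")
        if next_url:
            yield response.follow(next_url, callback=self.parse)
</code></pre>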
For data extraction I highly recommend weboob. Despite the unfortunate name, it does some really cool stuff. Writing modules is quite straightforward and the structure they've chosen makes a lot of sense.<p>I do wish there was a Go version of it, mostly because I much prefer working with Go, but also because single binary is extremely useful.
I really appreciate the tips in the comments here.<p>As a beginner it makes a lot of sense to iterate on a local copy with jupyter rather than fetching resources over and over until you get it right. I wish more tutorials focused on this workflow.
I've always had pretty bad experiences with web scraping; it's such a pain in the ass and frequently breaks. I'm not sure if I'm doing it wrong or if that's just how it's supposed to be.
Aside from the Beautiful Soup library, is there something about Python that makes it a better choice for web scraping than languages such as Java, JavaScript, Go, Perl or even C#?
Does anyone know how could I script Save Page WE extension in Firefox? It does a really nice job of saving the page as it looks, including dynamic content.
PyPpeteer might be worth a look as well. It's basically a port of the JS Puppeteer project that drives headless Chrome via the DevTools API.<p>As mentioned elsewhere, anything other than a headless browser isn't useful beyond a fairly narrow scope these days.<p><a href="https://github.com/pyppeteer/pyppeteer" rel="nofollow">https://github.com/pyppeteer/pyppeteer</a>
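Basic usage looks roughly like this (written from memory and untested; the project README has the canonical example):<p><pre><code> import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto("https://example.com", {"waitUntil": "networkidle2"})
    html = await page.content()   # the fully rendered DOM
    print(len(html))
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
</code></pre>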