Python Headless Web Browser Scraping on Amazon Linux

102 points by steven5158, almost 12 years ago

12 comments

fauigerzigerk, almost 12 years ago
PhantomJS is brilliant, but Selenium is a questionable choice for this task. For some reason, the creators of Selenium have decided that passing HTTP status codes back through the API is and always will be outside the scope of their project. So if you request a page and it returns 404, you have no way to find out (other than using crude heuristics). This makes Selenium completely unusable for anything I would have used it for.

Fortunately you can do it by using phantomjs directly instead of going through the Selenium WebDriver API. Maybe one day the phantomjs WebDriver API implementation (ghostdriver) will extend the API to pass HTTP status information back to the caller. Until then, this API is unusable (at least for me).
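One possible workaround for the missing status code, sketched here as an assumption and not something the comment itself proposes, is to ask for the status with a plain HTTP request before handing the URL to the browser. The helper name `fetch_status` and its `opener` parameter are made up for illustration; only standard-library calls are used.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url`, including 4xx/5xx codes.

    `opener` is a hypothetical hook (defaulting to urlopen) so the function
    can be exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses;
        # the status code is still available on the exception.
        return err.code
```

A scraper could call `fetch_status(url)` and skip the `driver.get(url)` step when the result is 404. This costs one extra request per page, and the status seen by a bare HTTP client may differ from what the browser would receive (cookies, user agent, redirects), so treat it as a heuristic.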
slaxo, almost 12 years ago
For anyone using PhantomJS, I'd recommend checking out CasperJS (http://casperjs.org/). It adds some nice features to PhantomJS and takes out a lot of the pain points.
diminoten, almost 12 years ago
I find it preferable to determine the requests that jQuery is making and perform them myself to extract the necessary data, rather than load up a whole browser just to do the same thing.

Selenium is *terrible*, performance-wise, and requires a *significant* investment in environment in order to work reliably. I try to avoid it except when I absolutely cannot.
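A rough sketch of the "perform the requests myself" approach, assuming the page loads its data from a JSON endpoint that you have found in the browser's network inspector. The endpoint URL and parameter names used with it are hypothetical, and the function names are invented for this example.

```python
import json
import urllib.parse
import urllib.request


def build_xhr_request(endpoint, params):
    """Build a GET request that resembles a jQuery XHR call."""
    url = endpoint + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # jQuery adds this header to same-origin XHR requests;
            # some servers check for it.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def decode_json(body):
    """Decode a JSON response body (bytes) into Python objects."""
    return json.loads(body.decode("utf-8"))


def fetch_json(endpoint, params, timeout=10.0):
    """Perform the request and return the decoded JSON payload."""
    request = build_xhr_request(endpoint, params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_json(response.read())
```

For example, `fetch_json("https://example.invalid/api/items", {"page": 2})` would return the same data the page's JavaScript receives, without starting a browser. Whether this works depends on the site: endpoints that need session cookies or tokens require extra headers.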
brechin, almost 12 years ago
If you're writing Python and need to do something like this, you could try using Phantompy, a Python port of PhantomJS: https://github.com/niwibe/phantompy

It's still "in an early stage of development" but it's on my list of libraries to keep an eye on for when I have time to tackle the JS-heavy sites of the world.
spikels, almost 12 years ago
For scraping, phantomjs or casperjs is the best way to go, but you will have to use some JavaScript [1]. Both give you access to everything a WebKit browser user does, with either a Node-style callback syntax (phantomjs) or a procedural/promises-style syntax (casperjs). Easy to set up, simple to use and fast enough for scraping, but only WebKit (for now).

For testing on browsers other than WebKit (or vendor-specific WebKit edge cases), use Selenium. Harder to set up, more complex, probably faster (still slow for testing) but not limited to WebKit.

[1] Sorry folks, but some JavaScript is required to programmatically interact with the web; you also need some HTML and CSS.
xfour, almost 12 years ago
One more thing, has anyone used BeautifulSoup in forever? Is the project still active? I mean the website is cute and all, but I find pyquery (also based on lxml) so much easier for parsing the scraped data.
616c, almost 12 years ago
I recently tried to get back into Selenium for a work-related project and, despite its frustrations, it is one of my favorite open source gems I have found in the last several years. When I showed it to uninitiated web devs, their heads almost exploded from joy and amazement. Your setup with Selenium intrigued me, since the pain point for me has become how difficult it is to maneuver some browsers with Selenium IDE to throw together ideas, if that is even encouraged anymore.
phaer, almost 12 years ago
You are installing some devel packages, but I don't see anything compiling. Does the selenium installation build native extensions? Then the commands should probably be the other way round. Or is phantomjs compiling something on the first run?

Minor nitpick: I don't think it is a good idea to copy a binary directly to /usr/bin, without a package manager. You could just put it into /opt and symlink to /usr/(local/)bin.
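The /opt-plus-symlink layout suggested above could be scripted roughly as follows. This is a sketch under assumptions: the function name and the example paths are illustrative, and writing to /opt or /usr/local/bin on a real machine needs root privileges.

```python
import shutil
from pathlib import Path


def install_with_symlink(binary, opt_dir, bin_dir):
    """Copy `binary` into `opt_dir` and expose it through a symlink in `bin_dir`."""
    binary, opt_dir, bin_dir = Path(binary), Path(opt_dir), Path(bin_dir)
    opt_dir.mkdir(parents=True, exist_ok=True)
    bin_dir.mkdir(parents=True, exist_ok=True)

    # The real file lives under opt_dir, outside the package manager's paths.
    target = (opt_dir / binary.name).resolve()
    shutil.copy2(binary, target)
    target.chmod(0o755)

    # Replace any stale link, then point the new one at the copied binary.
    link = bin_dir / binary.name
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(target)
    return link
```

A call such as `install_with_symlink("phantomjs", "/opt/phantomjs/bin", "/usr/local/bin")` keeps the binary out of /usr/bin, and removing it later is just a matter of deleting the link and the /opt directory.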
j-kidd, almost 12 years ago
Off topic: it is perfectly fine to install things like PyQt / PySide on a headless server. I suppose the problem is that the distro doesn't provide these packages?

Also, PhantomJS works fine in this case because the binary in the tarball is statically compiled. You can find a whole lot of Qt stuff inside the PhantomJS source repository. There ain't no such thing as "truly headless".
techaddict009, almost 12 years ago
Wow, I was searching for something similar. Actually, I was trying to build an app which scrapes data from movie ticket booking sites and tells the user via SMS whether tickets are still available, because not everyone has access to the internet in India yet.

@Steven5158 thanks for the share.

If anyone here wants help in building SMS apps, do contact me.
keypusher, almost 12 years ago
We do quite a bit of web scraping / parsing on headless servers with Selenium. What we did was just install some X packages and run a VNC server on the headless clients with Firefox. The cool thing about that is you can then go watch the scripts executing if you connect to the VNC session, take a screenshot on failure, etc.
JimmaDaRustla, almost 12 years ago
I am under the assumption that python-requests would have the same issue: it does not render the page, it only retrieves the original page response.

Very, very good to know when diving into scraping.
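To illustrate the point about unrendered responses, here is a small sketch using only the standard library. The sample HTML is made up: it stands in for a page whose visible content is filled in by JavaScript, which is what a plain HTTP client would receive. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

# Made-up page: the <div> is empty until the script runs in a browser.
RAW_HTML = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").textContent = "42 tickets left";
</script>
</body></html>
"""


class TextInElement(HTMLParser):
    """Collect the text inside the element with a given id.

    Simplification: unclosed void tags such as <br> inside the element
    would confuse the depth counter.
    """

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, element_id):
    """Return the stripped text content of the element with `element_id`."""
    parser = TextInElement(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Calling `text_of(RAW_HTML, "results")` yields an empty string, because nothing executes the script. A headless browser would run it and see the filled-in text, which is the gap the article's setup is meant to close.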