Some months ago I found <a href="https://import.io/" rel="nofollow">https://import.io/</a> and it just blew my mind.<p>I remember the pain of writing custom scrapers every time (I used to do it with Perl, btw).<p>They have a custom browser with a nice interface, but the biggest thing is the so-called "Connectors": you teach the system how to query a site and parse the results, and Import.IO gives you an API endpoint for that query, now automated.<p>One can, say, create a "connector" that queries Airbnb and parses the results, then create another "connector" that queries booking.com. Now it is possible to use the API to make a query for Boa Vista, Roraima (my city) and get the dataset.<p>I am not affiliated with them in any way, just a very happy old-school scraper.<p>Nice walkthrough: <a href="http://www.youtube.com/watch?v=_16O10Wx2W4" rel="nofollow">http://www.youtube.com/watch?v=_16O10Wx2W4</a><p>UPDATE:<p>Unsurprisingly, import.io has come up on Hacker News before: <a href="https://news.ycombinator.com/item?id=7582858" rel="nofollow">https://news.ycombinator.com/item?id=7582858</a>
I can recommend Scrapy[0] if you are working on a slightly bigger problem. But even then, once you are familiar with Scrapy it's incredibly fast to write a simple scraper with your data neatly exported as JSON.<p>[0]: <a href="http://scrapy.org/" rel="nofollow">http://scrapy.org/</a>
Shameless plug: I work for an NYC-based startup - SeatGeek.com - that is basically this[1]. We used to do forecasting but found that it wasn't really useful or worth the time it took to maintain[3], so we nixed it.<p>- [1]: As an example, here is the Firefly event the OP was scraping: <a href="https://seatgeek.com/firefly-music-festival-tickets" rel="nofollow">https://seatgeek.com/firefly-music-festival-tickets</a><p>- [2]: We haven't included Craigslist because the data is much less structured and inexperienced users may have a Bad Time™. YMMV.<p>- [3]: It was also a royal pain in the ass to maintain. I know because I had to update the underlying data provided to the model, and also modify it whenever the available data changed :( . Here is a blog post on why we removed it from the product: <a href="http://chairnerd.seatgeek.com/removing-price-forecasts" rel="nofollow">http://chairnerd.seatgeek.com/removing-price-forecasts</a>
Combine this with Pushover[0] to get alerted whenever there is a new lowest price. I had to resort to scraping+pushover to snatch a garage parking spot in SF.<p>[0] <a href="https://pushover.net/" rel="nofollow">https://pushover.net/</a>
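The alert half can be sketched with nothing but the standard library, against Pushover's documented `messages.json` endpoint (the app token and user key below are placeholders you get from your Pushover account):

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def pushover_payload(app_token: str, user_key: str, message: str) -> bytes:
    """Build the form-encoded body Pushover's messages API expects."""
    return urllib.parse.urlencode({
        "token": app_token,   # application token (placeholder)
        "user": user_key,     # user key (placeholder)
        "message": message,
    }).encode()


def send_alert(app_token: str, user_key: str, message: str) -> None:
    """POST the alert to Pushover; raises urllib.error.HTTPError on failure."""
    req = urllib.request.Request(
        PUSHOVER_URL,
        data=pushover_payload(app_token, user_key, message),
    )
    urllib.request.urlopen(req)
```

Your scraper then just calls `send_alert(...)` whenever the cheapest price it sees drops below the previous minimum.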
Useful article; I use lxml myself. I find this a good resource: <a href="http://jakeaustwick.me/python-web-scraping-resource/" rel="nofollow">http://jakeaustwick.me/python-web-scraping-resource/</a>
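For what the lxml approach looks like, a tiny sketch (the sample markup and XPath are made up to resemble a Craigslist-style listing page):

```python
from lxml import html

SAMPLE = """
<p class="row"><a href="/tix/1">2x Firefly passes - $300</a></p>
<p class="row"><a href="/tix/2">Firefly weekend pass - $250</a></p>
"""


def extract_listings(page: str):
    """Return (title, href) pairs for every listing row via one XPath query."""
    tree = html.fromstring(page)
    return [
        (a.text_content(), a.get("href"))
        for a in tree.xpath('//p[@class="row"]/a')
    ]
```

Compared to BeautifulSoup, lxml tends to be faster and lets you do the whole extraction in a single XPath or CSS expression.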
I did this recently when trying to get tickets to a sold-out Cloud Nothings show. I'd scrape Craigslist for postings every 10 minutes, and then send myself a text if any of the posts were new. I ended up getting tickets the day before the show.<p>Since the show was at a very small venue (capacity of maybe 500), I didn't have to worry about a constant stream of false positives. I would have needed to handle those if I were searching for tickets to a sold-out <popular band> show, since ticket brokers constantly spam Craigslist with popular search terms.
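The poll-and-dedupe loop described above can be sketched like this; the actual fetching of posting IDs and the text-message sending are left as injected callables, since those details depend on the site and whatever SMS gateway you use:

```python
import time


def new_posts(current_ids, seen_ids):
    """Return only the posting IDs we haven't alerted on yet."""
    return [pid for pid in current_ids if pid not in seen_ids]


def poll(fetch_ids, notify, interval=600):
    """Every `interval` seconds (10 min), fetch posting IDs and notify on new ones.

    fetch_ids: callable returning the current list of posting IDs
    notify:    callable taking the list of fresh IDs (e.g. sends a text)
    """
    seen = set()
    while True:
        fresh = new_posts(fetch_ids(), seen)
        if fresh:
            notify(fresh)
            seen.update(fresh)  # never alert on the same post twice
        time.sleep(interval)
```

Keeping the `seen` set is what prevents the same posting from texting you every 10 minutes; the broker-spam problem mentioned above would need extra filtering on top of this.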
This reminds me of something I knocked up back in 2006. It's not a scraper, and it's not Python, but here you are:<p><a href="http://giggr.com/?q=klaxons" rel="nofollow">http://giggr.com/?q=klaxons</a><p>It searches multiple UK ticket sites and returns the artist page matching the query.<p>Clicking a header label (e.g. Ticketweb) switches to that provider.<p>Double-clicking the header re-runs the search based on the value of the search box.<p>I use it for the 9am scramble for newly released tickets.<p>Oh, it seems Ticketmaster has broken. Maybe I'll fix that one day... I haven't used it in a while.
Doesn't work for me. Which Python version is required?<p><pre><code> Traceback (most recent call last):
File "./tickets.py", line 20, in <module>
for listing in soup.findall('p', {'class': 'row'}):
TypeError: 'NoneType' object is not callable</code></pre>
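(That traceback is most likely not a Python version issue: BeautifulSoup has no `findall` method, so `soup.findall` falls back to a tag lookup, returns None, and the call then fails. Assuming BeautifulSoup 4, the spelling is `find_all`:)

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="row">listing</p>', "html.parser")

# soup.findall is treated as soup.find("findall") -> None,
# and calling None raises TypeError: 'NoneType' object is not callable.
# The actual method is find_all (findAll in old BeautifulSoup 3 code):
rows = soup.find_all("p", {"class": "row"})
```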
This is not very good code. Here's a slightly better refactor: <a href="https://github.com/realpython/interview-questions/blob/master/refactor_me/after1.py" rel="nofollow">https://github.com/realpython/interview-questions/blob/maste...</a>