Some months ago I found <a href="https://import.io/" rel="nofollow">https://import.io/</a> and it just blew my mind.<p>I remember the pain of writing custom scrapers every time (I used to do it with Perl, btw).<p>They have a custom browser with a nice interface, but the biggest thing is the so-called "Connectors": you teach the system how to query a site and parse the results, and Import.IO gives you an API endpoint for that query, now automated.<p>One can, say, create a "connector" that queries Airbnb and parses the results, then create another "connector" that queries booking.com. Now it is possible to use the API to make a query for Boa Vista, Roraima (my city) and get the dataset.<p>I am not affiliated with them in any way, just a very happy old-school scraper.<p>Nice walkthrough: <a href="http://www.youtube.com/watch?v=_16O10Wx2W4" rel="nofollow">http://www.youtube.com/watch?v=_16O10Wx2W4</a><p>UPDATE:<p>Unsurprisingly, import.io has come up on Hacker News before: <a href="https://news.ycombinator.com/item?id=7582858" rel="nofollow">https://news.ycombinator.com/item?id=7582858</a>
I can recommend Scrapy[0] if you are working on a slightly bigger problem. But even then, once you are familiar with Scrapy it's incredibly fast to write a simple scraper with your data neatly exported as JSON.<p>[0]: <a href="http://scrapy.org/" rel="nofollow">http://scrapy.org/</a>
Shameless plug: I work for an NYC-based startup - SeatGeek.com - that is basically this[1]. We used to do forecasting but found that it wasn't really useful or worth the time it took to maintain[3], so we nixed it.<p>- [1]: As an example, here is the Firefly event the OP was scraping: <a href="https://seatgeek.com/firefly-music-festival-tickets" rel="nofollow">https://seatgeek.com/firefly-music-festival-tickets</a><p>- [2]: We haven't included Craigslist because the data is much less structured and inexperienced users may have a Bad Time™. YMMV.<p>- [3]: It was also a royal pain in the ass to maintain. I know because I had to update the underlying data provided to the model, and also modify it whenever the available data changed :( . Here is a blog post on why we removed it from the product: <a href="http://chairnerd.seatgeek.com/removing-price-forecasts" rel="nofollow">http://chairnerd.seatgeek.com/removing-price-forecasts</a>
Combine this with Pushover[0] to get alerted whenever there is a new lowest price. I had to resort to scraping+pushover to snatch a garage parking spot in SF.<p>[0] <a href="https://pushover.net/" rel="nofollow">https://pushover.net/</a>
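The alert half can be sketched with nothing but the standard library, against Pushover's documented `messages.json` endpoint (the app token and user key below are placeholders you get from your Pushover account):

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def pushover_payload(app_token: str, user_key: str, message: str) -> bytes:
    """Build the form-encoded body Pushover's messages API expects."""
    return urllib.parse.urlencode({
        "token": app_token,   # application token (placeholder)
        "user": user_key,     # user key (placeholder)
        "message": message,
    }).encode()


def send_alert(app_token: str, user_key: str, message: str) -> None:
    """POST the alert to Pushover; raises urllib.error.HTTPError on failure."""
    req = urllib.request.Request(
        PUSHOVER_URL,
        data=pushover_payload(app_token, user_key, message),
    )
    urllib.request.urlopen(req)
```

Your scraper then just calls `send_alert(...)` whenever the cheapest price it sees drops below the previous minimum.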
Useful article; I use lxml myself. I find this a good resource: <a href="http://jakeaustwick.me/python-web-scraping-resource/" rel="nofollow">http://jakeaustwick.me/python-web-scraping-resource/</a>
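For what the lxml approach looks like, a tiny sketch (the sample markup and XPath are made up to resemble a Craigslist-style listing page):

```python
from lxml import html

SAMPLE = """
<p class="row"><a href="/tix/1">2x Firefly passes - $300</a></p>
<p class="row"><a href="/tix/2">Firefly weekend pass - $250</a></p>
"""


def extract_listings(page: str):
    """Return (title, href) pairs for every listing row via one XPath query."""
    tree = html.fromstring(page)
    return [
        (a.text_content(), a.get("href"))
        for a in tree.xpath('//p[@class="row"]/a')
    ]
```

Compared to BeautifulSoup, lxml tends to be faster and lets you do the whole extraction in a single XPath or CSS expression.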
I did this recently when trying to get tickets to a sold-out Cloud Nothings show. I'd scrape Craigslist for postings every 10 minutes, and then send myself a text if any of the posts were new. I ended up getting tickets the day before the show.<p>Since the show was at a very small venue (capacity of maybe 500), I didn't have to worry about a constant stream of false positives. I would have needed to handle those if I were searching for tickets to a sold-out <popular band> show, since ticket brokers constantly spam Craigslist with popular search terms.
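The poll-and-dedupe loop described above can be sketched like this; the actual fetching of posting IDs and the text-message sending are left as injected callables, since those details depend on the site and whatever SMS gateway you use:

```python
import time


def new_posts(current_ids, seen_ids):
    """Return only the posting IDs we haven't alerted on yet."""
    return [pid for pid in current_ids if pid not in seen_ids]


def poll(fetch_ids, notify, interval=600):
    """Every `interval` seconds (10 min), fetch posting IDs and notify on new ones.

    fetch_ids: callable returning the current list of posting IDs
    notify:    callable taking the list of fresh IDs (e.g. sends a text)
    """
    seen = set()
    while True:
        fresh = new_posts(fetch_ids(), seen)
        if fresh:
            notify(fresh)
            seen.update(fresh)  # never alert on the same post twice
        time.sleep(interval)
```

Keeping the `seen` set is what prevents the same posting from texting you every 10 minutes; the broker-spam problem mentioned above would need extra filtering on top of this.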
This reminds me of something I knocked up back in 2006. It's not a scraper, and it's not Python, but here you are:<p><a href="http://giggr.com/?q=klaxons" rel="nofollow">http://giggr.com/?q=klaxons</a><p>It searches multiple UK ticket sites and returns the artist page matching the query.<p>Clicking a header label (e.g. Ticketweb) switches to that provider.<p>Double-clicking the header re-runs the search based on the value of the search box.<p>I use it for the 9am scramble for newly released tickets.<p>Oh, it seems Ticketmaster has broken. Maybe I'll fix that one day... I haven't used it in a while.
Doesn't work for me. Which Python version is required?<p><pre><code> Traceback (most recent call last):
File "./tickets.py", line 20, in <module>
for listing in soup.findall('p', {'class': 'row'}):
TypeError: 'NoneType' object is not callable</code></pre>
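(That traceback is most likely not a Python version issue: BeautifulSoup has no `findall` method, so `soup.findall` falls back to a tag lookup, returns None, and the call then fails. Assuming BeautifulSoup 4, the spelling is `find_all`:)

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="row">listing</p>', "html.parser")

# soup.findall is treated as soup.find("findall") -> None,
# and calling None raises TypeError: 'NoneType' object is not callable.
# The actual method is find_all (findAll in old BeautifulSoup 3 code):
rows = soup.find_all("p", {"class": "row"})
```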
This is not very good code. Here's a slightly better refactor: <a href="https://github.com/realpython/interview-questions/blob/master/refactor_me/after1.py" rel="nofollow">https://github.com/realpython/interview-questions/blob/maste...</a>