TechEcho

Finding the best ticket price – Simple web scraping with Python

44 points | by danielforsyth | almost 11 years ago

13 comments

jknupp · almost 11 years ago

A shorter, more comprehensible version:

```python
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

URL = 'http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets'
BASE = 'http://philadelphia.craigslist.org/cpg/'

response = requests.get(URL)
soup = BeautifulSoup(response.content)

for listing in soup.find_all('p', {'class': 'row'}):
    if listing.find('span', {'class': 'price'}):
        price = int(listing.text[2:6])
        if 100 < price <= 250:
            print listing.text
            print urljoin(BASE, listing.a['href']) + '\n'
```
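The snippet above is Python 2 (`print` statements, the `urlparse` module). A rough Python 3 port is sketched below, with the budget check pulled into pure helpers so the logic can be exercised without hitting Craigslist; the helper names (`parse_price`, `in_budget`) and the regex-based price extraction are my additions, while the 100–250 window and the selectors come from the snippet. The live fetch needs `requests` and `beautifulsoup4` installed.

```python
import re


def parse_price(text):
    """Pull the leading dollar amount out of a listing's text, or None.

    Replaces the original's brittle `int(listing.text[2:6])` slice with a
    regex that tolerates prices of any number of digits.
    """
    m = re.search(r"\$(\d+)", text)
    return int(m.group(1)) if m else None


def in_budget(price, low=100, high=250):
    """Same window as the original: 100 < price <= 250."""
    return price is not None and low < price <= high


def main():
    # Imports kept local so the pure helpers above work without
    # third-party packages installed.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin  # urlparse.urljoin moved here in Python 3

    url = ("http://philadelphia.craigslist.org/search/sss"
           "?sort=date&query=firefly%20tickets")
    base = "http://philadelphia.craigslist.org/cpg/"

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for listing in soup.find_all("p", {"class": "row"}):
        if listing.find("span", {"class": "price"}):
            price = parse_price(listing.text)
            if in_budget(price):
                print(listing.text)
                print(urljoin(base, listing.a["href"]), end="\n\n")

# To run the live scrape: main()
```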
motoboi · almost 11 years ago

Some months ago I found https://import.io/ and it just blew my mind.

I remember the pain it was to write custom scrapers every time (I used to do it with Perl, btw).

They have a custom browser with a nice interface, but the biggest thing is the so-called "Connectors": you instruct the system in how to query and parse results, and Import.IO gives you an API endpoint for that query, now automated.

One can, say, create a "connector" which queries Airbnb and parses results, then create another "connector" which queries booking.com. Now it is possible to use the API to make a query for Boa Vista, Roraima (my city) and get the dataset.

I am not affiliated with them in any way, just a very happy old-school scraper.

Nice walkthrough: http://www.youtube.com/watch?v=_16O10Wx2W4

UPDATE: Unsurprisingly, import.io has appeared on Hacker News before: https://news.ycombinator.com/item?id=7582858
tst · almost 11 years ago

I can recommend Scrapy[0] if you're working on a somewhat bigger problem. But even then, once you're familiar with Scrapy it's incredibly fast to write a simple scraper, with your data neatly exported as .json.

[0]: http://scrapy.org/
dai_pole · almost 11 years ago

I had a go "just for fun" using curl, grep, sed, and tr. Probably too much regex?

```sh
#!/bin/sh
#
# tickets.sh - A "no BS" ticket price scraper. Output in CSV format.
# Uses standard issue Unix utilities only.
# No soup for you!

URL="http://philadelphia.craigslist.org"
QUERY="firefly+tickets"

RESULTS=`curl -s -m 10 "$URL/search/sss?sort=date&query=$QUERY" \
  | grep '<p class=\"row' \
  | sed 's!^[ \t]*!!; \
         s!>[ \t]*<!><!g; \
         s![,:]! !g; \
         s!<p class=\"row[^/]*\"\([^\"]*\)\" class=\"[^#]*\">&#x0024;\([0-9]\{1,\}\)</span>[^.]*>\([A-Z]\{1\}[a-z]\{2\} \{1,\}[0-9]\{1,2\}\)[^.]*<a h[^>]*\.html\">\([^<]*\)</a>\([^.]*</p>\)!\1,$\2,\3,\4:!g; \
         s!  *! !g; \
         s!, *!,!g' \
  | tr ':' '\n'`

echo "$RESULTS"
```
josegonzalez · almost 11 years ago

Shameless plug: I work for an NYC-based startup, SeatGeek.com, that is basically this[1]. We used to do forecasting but found that it wasn't really useful[3] or worth the time it took to maintain, so we nixed it.

- [1]: As an example, here is the Firefly event the OP was scraping: https://seatgeek.com/firefly-music-festival-tickets

- [2]: We haven't included Craigslist because the data is much less structured and inexperienced users may have a Bad Time™. YMMV.

- [3]: It was also a royal pain in the ass to maintain. I know because I had to update the underlying data provided to the model, and also modify it whenever the available data changed :(. Here is a blog post on why we removed it from the product in general: http://chairnerd.seatgeek.com/removing-price-forecasts
jtokoph · almost 11 years ago

Combine this with Pushover[0] to get alerted whenever there is a new lowest price. I had to resort to scraping + Pushover to snatch a garage parking spot in SF.

[0]: https://pushover.net/
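Pushover's documented message API is a plain HTTPS POST to `https://api.pushover.net/1/messages.json` with `token`, `user`, and `message` form fields, so the alerting side needs nothing beyond the standard library. A sketch, with the payload builder split out so it can be checked without a live account (the token and user key are placeholders):

```python
from urllib import parse, request

API_URL = "https://api.pushover.net/1/messages.json"


def build_payload(token, user, message):
    """URL-encode the three required Pushover fields."""
    return parse.urlencode(
        {"token": token, "user": user, "message": message}
    ).encode("ascii")


def notify(token, user, message):
    """POST the alert to Pushover; returns the HTTP status code."""
    req = request.Request(API_URL, data=build_payload(token, user, message))
    with request.urlopen(req) as resp:  # network call
        return resp.status

# e.g. notify("app-token", "user-key", "New lowest price: $150")
```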
tomaisthorpe · almost 11 years ago

Useful article. I use lxml myself, and find this a good resource: http://jakeaustwick.me/python-web-scraping-resource/
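For the lxml route this comment mentions, the same `p.row` extraction is a couple of XPath calls. A sketch, assuming `lxml` is installed and using a canned fragment rather than a live page (the markup mimics the Craigslist rows from the article; the function name is mine):

```python
from lxml import html

FRAGMENT = """
<html><body>
  <p class="row"><span class="price">$150</span>
     <a href="/cpg/123.html">firefly passes</a></p>
  <p class="row"><a href="/cpg/456.html">no price listed</a></p>
</body></html>
"""


def priced_rows(doc):
    """Yield (price, href) for every row that carries a price span."""
    for row in doc.xpath('//p[@class="row"]'):
        prices = row.xpath('.//span[@class="price"]/text()')
        if prices:
            yield prices[0], row.xpath(".//a/@href")[0]


doc = html.fromstring(FRAGMENT)
# list(priced_rows(doc)) -> [('$150', '/cpg/123.html')]
```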
gjreda · almost 11 years ago

I did this recently when trying to get tickets to a sold-out Cloud Nothings show. I'd scrape Craigslist for postings every 10 minutes, and then send myself a text if any of the posts were new. I ended up getting tickets the day before the show.

Since the show was at a very small venue (capacity of maybe 500), I didn't have to worry about a constant stream of false positives. I would have needed to handle those if I were searching for tickets to a sold-out <popular band> show, since ticket brokers spam Craigslist constantly with popular terms.
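The "text me only about new posts" part of the approach above reduces to remembering which listing IDs (or URLs) have been seen across polls. A minimal sketch of that bookkeeping, with the scraping and notification steps left as caller-supplied functions:

```python
import time


def new_listings(current_ids, seen):
    """Return IDs not seen in any earlier poll, and mark them as seen."""
    fresh = [i for i in current_ids if i not in seen]
    seen.update(fresh)
    return fresh


def poll_forever(fetch_ids, notify, interval=600):
    """Every `interval` seconds (10 minutes by default), alert on new posts.

    `fetch_ids` scrapes the search page and returns listing IDs;
    `notify` sends the text message. Both are left to the caller.
    """
    seen = set()
    while True:
        for listing_id in new_listings(fetch_ids(), seen):
            notify(listing_id)
        time.sleep(interval)
```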
buro9 · almost 11 years ago

This reminds me of something I knocked up back in 2006. It's not a scraper and it's not Python, but here you are: http://giggr.com/?q=klaxons

It searches multiple UK ticket sites and returns the artist page matching the query. Clicking a header label (i.e. Ticketweb) switches to that provider. Double-clicking the header re-searches based on the value of the search box.

I use it for the 9am scramble for newly released tickets.

Oh, it seems Ticketmaster has broken. Maybe I'll fix that one day... I haven't used it in a while.
nivertech · almost 11 years ago

Doesn't work for me. Which Python version is required?

```
Traceback (most recent call last):
  File "./tickets.py", line 20, in <module>
    for listing in soup.findall('p', {'class': 'row'}):
TypeError: 'NoneType' object is not callable
```
rakoo · almost 11 years ago

You should integrate this into weboob[0].

[0]: http://weboob.org/
mjhea0 · almost 11 years ago

This is not very good code. Here's a slightly better refactor: https://github.com/realpython/interview-questions/blob/master/refactor_me/after1.py
Hilyin · almost 11 years ago

There is also ifttt.com, which can poll a specific CL search and email you when something hits.