Interesting. So it seems like you aren't respecting robots.txt. I picked Old Navy, since it's on your supported stores page [0], and went to their robots.txt [1]:

```
User-agent: *
Disallow: /buy/
Disallow: /checkout/
```

So, do you have permission to violate robots.txt? I'm sure there is some automated interaction with the checkout/purchasing pages. Or am I missing something about how TwoTap works? Scraping is one thing, but accessing pages the site's operators explicitly prohibit seems like a big no-no.

[0]: https://twotap.com/supported-stores/

[1]: http://oldnavy.gap.com/robots.txt
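For reference, Python's standard library can answer the "is this path allowed?" question directly. A quick check against the rules quoted above (the specific product path is just illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://oldnavy.gap.com/robots.txt")
rp.read()

# Anything under /checkout/ is disallowed for every user-agent ("*").
print(rp.can_fetch("*", "http://oldnavy.gap.com/checkout/anything"))   # False
print(rp.can_fetch("*", "http://oldnavy.gap.com/products/some-item"))  # True
```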
Looks like I'm at one of the retailers you crawl. Recently our site was getting hit by a web crawler that was following links incorrectly. I blacklisted several IP addresses from accessing the site, and now I wonder if it was this!

Does your crawler obey robots.txt rules?
I'm confused about the legality of scraping. Is it completely open, or are there some restrictions on scraping any site without explicit permission?
I don't understand why you're pro-scraping. (I did write a blog post on this, and I believe I posted it to HN before: http://theexceptioncatcher.com/blog/2012/07/how-to-get-rid-of-screen-scrapers-from-your-website/)

But wouldn't it be more beneficial to get websites to open up an API to you, reach out and ask them to do so, or even offer consulting services to build one?

I know there are a few cart/store offerings out there, and it seems to me that they would have an API:

Magento: http://www.magentocommerce.com/api/soap/checkout/checkout.html

OpenCart proprietary API: http://opencart-api.com/

PrestaShop API: http://doc.prestashop.com/display/PS14/Using+the+REST+webservice
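For the Magento case, here's a rough sketch of what driving the linked checkout API could look like from Python, assuming a Magento 1.x store with the SOAP v1 endpoint enabled; the store URL, credentials, store ID, and product ID are placeholders, and argument serialization details vary by SOAP client:

```python
from zeep import Client  # pip install zeep

# Hypothetical Magento 1.x store; the v1 SOAP endpoint exposes login/call/endSession.
client = Client("https://example-store.com/api/soap/?wsdl")
session = client.service.login("api_user", "api_key")

# Create a quote (cart) and add a product to it, per the linked checkout API docs.
cart_id = client.service.call(session, "cart.create", ["default"])
client.service.call(session, "cart_product.add", [
    cart_id,
    [{"product_id": "123", "qty": 1}],
])

client.service.endSession(session)
```

The point being: where an API like this exists, the integration is explicit and sanctioned, rather than driving the HTML checkout flow.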
The hard part is not scraping, it's returns. For many kinds of online products, the return rate is over 40%. The shopper must be completely aware of how to contact the merchant of record and how to return the product.

Also, if you are scraping a large retailer, you are effectively required to be PCI DSS Level 1 compliant, which takes a bit of extra effort.
I've worked with two shopping search engines, and interestingly, scraping sites was one of the things they did to build up their inventory as well. The big difference being, they simply organized the products into a searchable format, then sent traffic to the ecommerce site and let it handle the checkout. What you're doing is arguably more complex.

(They also prioritized the feeds sent to them directly by retailers above the scraped item feeds - thus prioritizing paid listings, similar to the Google SERPs - so a different business model entirely.)

That being said, a very cool concept - and agreed that, given the relatively small number of ecommerce platforms out there, scraping and then serving them up seems pretty scalable. Interested to see how it goes.
I built a CJ scraper for a deals website that is now defunct. What a pain it was to maintain. All the different retailers dump their data into CJ in different ways. I might just put it on GitHub if anyone's interested. Python + chromedriver + BeautifulSoup + mechanize.
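In that spirit, here's a stripped-down sketch of the chromedriver + BeautifulSoup half of that stack; the retailer URL and CSS selectors are made up for illustration:

```python
from selenium import webdriver   # pip install selenium
from bs4 import BeautifulSoup    # pip install beautifulsoup4

driver = webdriver.Chrome()      # needs chromedriver on the PATH
try:
    driver.get("https://www.example-retailer.com/deals")  # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Selector names are hypothetical; every retailer's markup differs,
    # which is exactly why these scrapers are painful to maintain.
    for item in soup.select("div.product"):
        title = item.select_one(".product-title")
        price = item.select_one(".product-price")
        if title and price:
            print(title.get_text(strip=True), price.get_text(strip=True))
finally:
    driver.quit()
```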
I tried the demo with a Lego castle priced at 99€ and got a grand total of more than $10k...

FYI, Lego showed me the French version of their website, since that's where I live. You seem to only offer shipping in the US, though that's not clear from reading your website. Still very interesting.

Product URL: http://shop.lego.com/fr-FR/Le-ch%C3%A2teau-fort-70404?fromListing=listing

Screenshot: http://imgur.com/mlr8Q2e
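One plausible cause (purely a guess on my part): a price parser that strips the French decimal comma and reads "99,99 €" as 9999. A small sketch of a more locale-tolerant approach, treating the last separator followed by one or two digits as the decimal mark:

```python
from decimal import Decimal

def parse_price(raw):
    """Parse a localized price string such as "99,99 €" or "$1,299.00"."""
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ",.")
    for sep in (",", "."):
        idx = cleaned.rfind(sep)
        # A separator followed by one or two digits is taken as the decimal mark.
        if idx != -1 and len(cleaned) - idx - 1 in (1, 2):
            whole = cleaned[:idx].replace(",", "").replace(".", "")
            return Decimal(whole + "." + cleaned[idx + 1:])
    return Decimal(cleaned.replace(",", "").replace(".", "") or "0")

print(parse_price("99,99 €"))    # 99.99, not 9999
print(parse_price("$1,299.00"))  # 1299.00
```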
Can anyone go into a bit more detail about how the affiliate commissions work here? From what I have read, I would feed my affiliate link through TwoTap and you would then handle the cookie and the conversion and everything?

If I were using URLs gathered from a Commission Junction datafeed, is this basically a plug-and-play solution, or do I need to process those URLs?

Do you have a backend stats dashboard, or would I still rely on CJ for that data?
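On the "do I need to process those URLs" part: if the datafeed links carry the destination in a `url` query parameter, as CJ-style custom links often do (that's an assumption about your particular feed), pulling the product URL out is straightforward:

```python
from urllib.parse import urlparse, parse_qs

def destination_from_tracking_link(tracking_url):
    """Return the retailer URL embedded in a tracking link, or None.

    Assumes the destination travels in a 'url' query parameter; feeds that
    encode it differently need their own handling.
    """
    query = parse_qs(urlparse(tracking_url).query)  # values come back URL-decoded
    return query.get("url", [None])[0]

# Hypothetical tracking link, for illustration only.
link = ("http://www.example-network.com/click-1234-5678"
        "?url=http%3A%2F%2Fwww.retailer.com%2Fproduct%2F42")
print(destination_from_tracking_link(link))  # http://www.retailer.com/product/42
```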
So you guys are scraping all the product information for a retailer and keeping it up to date? Or is it all live, i.e. you fetch it when that particular URL is requested? Where do you get the list of retailers to scrape?
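For what it's worth, a minimal sketch of the middle ground between those two models (fetch live, but don't re-fetch the same page constantly); nothing here is TwoTap-specific, and the TTL is arbitrary:

```python
import time
import requests  # pip install requests

_cache = {}          # url -> (fetched_at, body)
TTL_SECONDS = 300    # reuse a fetched page for five minutes

def fetch_product_page(url):
    """Fetch a product page live, reusing a recent copy when one exists."""
    now = time.time()
    cached = _cache.get(url)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    _cache[url] = (now, response.text)
    return response.text
```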
I don't get it. Is this just a middleman between all the retail websites and the publishers? Sort of like what Google is doing with product search, and also giving commissions on the items sold?