Scraping Recipe Websites

463 points by benawad about 5 years ago

44 comments

selecsosi about 5 years ago

A highly useful tool in my household for dealing with the SEO/tracking scourge that recipe blogs have become is https://www.paprikaapp.com/.

Hoping someday to have some spare time to integrate this with https://grocy.info/ and have a pipeline for recipe -> preparation automation.

julianlam about 5 years ago

I'm surprised nobody has mentioned "Recipe Filter": https://addons.mozilla.org/en-CA/firefox/addon/recipe-filter/

Cuts the fluff and puts the recipe front and center. I wouldn't be able to find recipes online without this.

jedieaston about 5 years ago

Paprika 3 (I use the iOS version, but I believe the Mac version has the same function) has a fantastic web scraper for recipes. I've had to correct maybe 1-2 errors across 100 recipes I've brought in from a bunch of different sites. It's super helpful to look through them in a standardized way (and you can sort by ingredient/category) to figure out what to make.

kmbfjr about 5 years ago

But how will I read about "Dakota", an avid yoga enthusiast who just happens to be a mom, who enjoys making healthy and savory meals for her family while blogging?

Seriously, I hope this spells an end to the Google-ranking-imposed nonsense that makes the simple act of searching for a recipe so insufferable.

memset about 5 years ago

Interesting! I wrote https://plainoldrecipe.com (open source!) to solve this, and inadvertently discovered many of the metadata tags described here.

The irony is that the content is required for SEO purposes, but once you've landed on the page you don't want to see it. I wonder if there would be a way to write SEO that only the Google bot sees and hide it from humans...

m_ke about 5 years ago

Are there any legal issues with scraping recipe sites in a commercial app like that?

I'm assuming ingredients and directions are "facts" so can't be copyrighted, but what about the pictures?

logfromblammo about 5 years ago

The simple truth is that the core recipes are fact-based and non-copyrightable, and the 1000-word blogspam recipe header is both copyrightable and garners better search result rankings.

So the business model is to take facts from the public domain, wrap them in bullshit prose, and then SEO the bullshit to have higher ranking than the naked source facts, for more unique visitors and ad revenue.

Comments about "providing recipes for free" are exactly as useful as comments about "providing phone numbers for free" or "providing mailing addresses for free" or "providing the original text of 'Little Women' for free" or "providing the steps of the long division algorithm for free".

Obfuscating the public domain is not a valuable service. Automatically removing the obfuscation is valuable. A "Project Gutenberg" style repository of recipes would be recurringly donation-worthy.

stx about 5 years ago

This could also be useful for websites that do not print well. I have run into a few occasions where ads and other website elements printed with the actual recipe. The result was a small recipe spread over several pages, mostly covered with other content. There were pictures and text formatting that I could not copy out. Often for stuff like that I just pull the HTML and edit it until it prints well, but I would rather have an easier way.

WrtCdEvrydy about 5 years ago

Here's the question... why is it so difficult to do this in Android?

Seriously, AndroidDriver for Selenium was last updated in 2013... and importing it throws an HttpClient error now. Update that client and you get a class duplication hell that is impossible to exit.

All I needed was to interact with 2-3 fields on a webpage but it's been eight hours and now I hate my life.

zwieback about 5 years ago

Cool, now the next interesting step would be to categorize recipes, maybe some kind of clustering algorithm, to see how similar they are and whether they have a common ancestor.

When I look at a recipe and notice some unusual proportions I usually check against Joy of Cooking or some other standard book. I've noticed that often everything old is new again.
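
One simple starting point for the similarity idea above is Jaccard similarity over ingredient sets. The sketch below is in Python and assumes the ingredients have already been extracted and normalized; the `recipes` data and the function names are made up for illustration, and a real clustering step would sit on top of a score like this.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of distinct ingredients that two recipes have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical, hand-made ingredient sets; real data would come from a scraper.
recipes = {
    "pancakes": {"flour", "milk", "egg", "butter", "sugar", "baking powder"},
    "crepes": {"flour", "milk", "egg", "butter", "salt"},
    "omelette": {"egg", "butter", "salt", "cheese"},
}


def most_similar(name: str) -> tuple[str, float]:
    """Return the other recipe with the highest Jaccard score against `name`."""
    target = recipes[name]
    scored = (
        (other, jaccard(target, items))
        for other, items in recipes.items()
        if other != name
    )
    return max(scored, key=lambda pair: pair[1])
```

With this toy data, `most_similar("pancakes")` picks "crepes", since the two share four of seven distinct ingredients. Comparing proportions rather than just ingredient names would need quantities as well, which this sketch ignores.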

qrv3w about 5 years ago

This is great! It's a wonderful write-up.

I've also made something almost identical: Go libraries for scraping recipe ingredients [1] and instructions [2]. Instead of the LCA method here, in my version I try to find the longest sequence of highest scoring HTML tags and those are "ingredients" or "instructions". It works very well (although I think this one works better).

Like the article mentioned, I found that the heuristics for finding HTML elements with ingredients turn out to be surprisingly simple - they usually include just a number, a measurement, and a food! This simple heuristic worked better than other sophisticated things I tried.

[1]: https://github.com/schollz/ingredients

[2]: https://github.com/schollz/instructions
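
The "number, measurement, food" heuristic can be sketched roughly as follows. This is a Python illustration, not the code from the Go libraries linked above; the `UNITS` and `FOODS` vocabularies, the function names, and the threshold of 2 are all assumptions chosen for the example.

```python
import re

# Tiny illustrative vocabularies; a real scraper would need far larger lists.
UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb", "pinch", "cloves"}
FOODS = {"flour", "sugar", "butter", "milk", "egg", "eggs", "salt", "garlic", "chicken"}

# A token is a number such as 2, 1/2 or 1.5, a vulgar fraction, or a lowercase word.
TOKEN = re.compile(r"\d+(?:[./]\d+)?|[¼½¾]|[a-z]+")


def ingredient_score(text: str) -> int:
    """Score 0 to 3: one point each for a number, a unit, and a known food."""
    tokens = TOKEN.findall(text.lower())
    has_number = any(t[0].isdigit() or t in {"¼", "½", "¾"} for t in tokens)
    has_unit = any(t in UNITS for t in tokens)
    has_food = any(t in FOODS for t in tokens)
    return int(has_number) + int(has_unit) + int(has_food)


def ingredient_lines(lines: list[str], threshold: int = 2) -> list[str]:
    """Keep the lines that look like ingredients under the heuristic."""
    return [line for line in lines if ingredient_score(line) >= threshold]
```

Running `ingredient_lines` over the text of each candidate HTML element would keep lines like "2 cups flour" and drop instructions such as "Preheat the oven to 350 degrees.", which only scores a point for the number.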

fulldecent2 about 5 years ago

I saw all the terrible SEOd recipe websites and my first thought was: I should make a better recipe website that is simpler and is better SEOd.

---

FIRST EXAMPLE:

How to cook chicken on a skillet

Step 1 -- get this much chicken [picture]

Step 2 -- cook on skillet for 5 minutes

OPTIONAL -- here are seasonings you may add [pictures]

RELATED:

- How to cook a lot of chicken on a skillet [LINK]
- How to fry chicken breast [LINK]

---

But then I didn't understand how any of these websites are making money so I didn't do it.

nicbou about 5 years ago

I just started transcribing every recipe I make. Even if you can extract all the essential information from a recipe site, some changes are needed:

- I need to convert recipes to metric. I am neither equipped nor inclined to cook in freedom units.
- A "can" or a "packet" is not a standard unit of measurement.
- Package sizes vary between countries. I often adjust recipes to avoid wasting food.
- I cook by mass, not volume. I convert the units, then round them.
- Instructions are sometimes too verbose. I make them easier to follow while my hands are busy.
- I will make my own changes and I must write them down somewhere.

Besides, sites go down and links break. Food.com broke many of my bookmarks a few years ago. Other sites went dark. My recipes are plain text. They are editable, searchable, and available offline.
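
The convert-then-round step from the list above could look something like this in Python. It is a minimal sketch: the `TO_METRIC` table and `to_metric` function are names invented here, the factors are for US customary units, and rounding to the nearest 5 is just one plausible choice.

```python
# Conversion factors from US customary units to metric.
TO_METRIC = {
    "oz": (28.3495, "g"),
    "lb": (453.592, "g"),
    "cup": (236.588, "ml"),
    "tbsp": (14.7868, "ml"),
    "tsp": (4.92892, "ml"),
}


def to_metric(quantity: float, unit: str, round_to: int = 5) -> tuple[int, str]:
    """Convert a quantity to metric and round it to the nearest `round_to`."""
    factor, metric_unit = TO_METRIC[unit]
    value = quantity * factor
    return round_to * round(value / round_to), metric_unit
```

For example, `to_metric(8, "oz")` gives `(225, "g")`. Going from volume to mass, as in "I cook by mass, not volume", would also need a per-ingredient density table, which this sketch leaves out.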

mark_l_watson about 5 years ago

Hey Ben, thanks for that write-up! You may not have time for this, but your article and the intersection of food/recipes and computer science would make a good book; at least I would read it.

I wrote [1] about 12 years ago in Clojure because for health reasons I had to track my intake of vitamin K, then decided to track all nutrients in the USDA nutrition database. I am working on a semantic web product (with another semantic product in planning), but maybe by the end of this year I will get to rewriting my food web app in Common Lisp and as a macOS app. I am adding a link to your article and these comments to my notes for that project. Useful stuff.

[1] http://cookingspace.com

welanes about 5 years ago

Neat write-up, and thanks for putting me on to jsonld.js - looks useful.

I'm building https://simplescraper.io and we're trying to create heuristics to update CSS selectors whenever a website changes. People become unhappy when a scrape task that ran smoothly on Monday suddenly returns nothing on Tuesday, so while it's a tough nut to crack it's super important.

We use a combination of XPath, historical data and data type (the value may change but the type and length often remain the same or similar) to narrow down the options.

Of course there are more sophisticated methods using machine learning etc., but it's fun to try different approaches to solve this problem.
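
The "same type, similar length" idea can be illustrated with a small Python sketch. It is not how simplescraper.io works internally; the coarse type buckets, the equal weighting, and every name below are assumptions made for the example, and the XPath part of the approach is not covered.

```python
import re


def data_type(value: str) -> str:
    """Classify a scraped string into a coarse, hypothetical type bucket."""
    value = value.strip()
    if re.fullmatch(r"[$€£]\s?\d+(?:[.,]\d+)?", value):
        return "price"
    if re.fullmatch(r"\d+(?:[.,]\d+)?", value):
        return "number"
    if re.fullmatch(r"https?://\S+", value):
        return "url"
    return "text"


def similarity(previous: str, candidate: str) -> float:
    """Score 0..1: matching type counts for half, closeness in length for the rest."""
    type_score = 1.0 if data_type(previous) == data_type(candidate) else 0.0
    longest = max(len(previous), len(candidate), 1)
    length_score = 1.0 - abs(len(previous) - len(candidate)) / longest
    return 0.5 * type_score + 0.5 * length_score


def best_candidate(previous: str, candidates: dict[str, str]) -> str:
    """Return the selector whose current text most resembles the last good value."""
    return max(candidates, key=lambda sel: similarity(previous, candidates[sel]))
```

Given a last known value of "$19.99" and a changed page with a title, a price, and a link, `best_candidate` would pick the selector holding the new price, because it shares both the type and the length of the historical value.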

Cactus2018 about 5 years ago

In 2011, Google released "Google Recipe Search", with filtering based on ingredients, cook time, and calories.

https://www.wired.com/2011/02/google-recipe-semantic/

https://latimesblogs.latimes.com/technology/2011/02/google-debuts-recipe-view-search-function-for-cooks.html

kevindong about 5 years ago

I personally just find a recipe, make it as written on the website, and then (if I actually like it) convert it to be sane for actually following and put it into Apple Notes.

What I mean by that is most recipes call for using wwwaaayyy more intermediary bowls/plates than actually required (e.g. if spices, chopped veggies, and minced garlic are going into the pot at the same time, there's no point in using three bowls) or list ingredients out of order of how you'd actually use them.

peterwwillis about 5 years ago

So far the best way I've found to search for recipes is to search in a foreign language. Translate what you're looking for, then search and translate back to English. There are still recipe blogs, but 5 instead of 5,000, and usually an authentic dish, not what Michelle The Stir Fry Queen From Michigan thinks constitutes a "Moroccan" dish because it has cinnamon and tomatoes.

Would love to see someone put together a search engine that excludes recipe blogs and penalizes SEO.

jangstrom about 5 years ago

This is pretty interesting. I wonder how the recipe parsers from MyFitnessPal or Pinterest compare to this. Sometimes I think they do pretty well, but often they miss the mark. My guess is that Pinterest only treats something as a Recipe if it contains the metadata mentioned in the article, and does the easy parse if so. MFP seems to try something a bit more advanced, but I've never been super-impressed with its parsing abilities.

imgabe about 5 years ago

This is great. I made a similar product at No Nonsense Recipes (https://nononsense.recipes) because I was also tired of dealing with all the dreck on recipe sites. I did scrape some recipes to seed the site with but haven't integrated it as a feature yet.

I did ignore the photos though, since while recipes are not subject to copyright, photos are.

wantacker about 5 years ago

Off-topic, but I just wanted to mention that Ben's been one of my favorite 'teachers' on YouTube. He has some quality content on React and JS stuff. For those wanting to learn React (including some advanced stuff), check out his channel! And no, he didn't pay me to post this here. Hey, thanks Ben - I know a bit of React and have used it on a few projects thanks (also) to you.

thinkloop about 5 years ago

Any recommendations for a js lib that does all the "easy" scraping (microdata, og tags, jsonld, etc)?
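
For a sense of what the "easy" JSON-LD path involves, here is a minimal sketch. It is written in Python with only the standard library rather than as a JS library recommendation, it covers JSON-LD only (not microdata or og tags), and `JsonLdParser` and `find_recipes` are names invented for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipes(html: str) -> list[dict]:
    """Return every JSON-LD object in the page whose @type includes "Recipe"."""
    parser = JsonLdParser()
    parser.feed(html)
    recipes = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict):
            # Some pages nest several entities under "@graph".
            data = data.get("@graph", [data])
        if not isinstance(data, list):
            continue
        for item in data:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                recipes.append(item)
    return recipes
```

Each returned dict can then be read for schema.org Recipe properties such as `name`, `recipeIngredient`, and `recipeInstructions`, when the site provides them.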

linsomniac about 5 years ago

A surprisingly good UX for recipes is Google Home. Ask it for a recipe, and it will ask if you want directions or ingredients. If you ask for ingredients, it will say them one by one, and pause between them until you ask it for the next one. My son has used it to great effect to make pancakes.

aodj about 5 years ago

Really nice! I often copy and paste recipes into text files I have locally so this is a great alternative.

One feature request (if I may be so bold): it would be great to offer an imperial<->metric converter. This is predominantly one of the reasons I keep copies of recipes I find and use.

GrantSolar about 5 years ago

I've been working on something similar for the past couple of days, but the trouble comes with wanting static types. There are a few projects out there that offer either a microdata parser or types derived from schema.org, but nothing that combines the two as yet.

dsilver about 5 years ago

https://www.eater.com/2020/3/31/21201374/why-are-free-online-recipes-so-long-stop-shaming-food-bloggers

franciscop about 5 years ago

This is pretty interesting. I wonder if this metadata could be reused for tutorials of any kind (and not only of food, a.k.a. recipes). A tutorial normally has some requisites, then a step-by-step guide to achieving the goal, and then the final result.

ben_utzer about 5 years ago

I did something similar a while ago. I still have a DB with half a million recipes somewhere. I didn't continue it because I got stuck with the client side and I didn't find anyone interested in helping me.

chirau about 5 years ago

Is there any recipe tool out there that can do at least one of the following:

1) Scale the quantity of ingredients and cooking time as number of people to be served increases?
2) Tell me what dishes I can make with the ingredients I have?
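
The quantity half of the first item is straightforward to sketch once ingredients are parsed into (quantity, unit, name) form. The Python below uses exact fractions so that "1/2 tsp" scales cleanly; the `scale` function and the `pancakes` data are hypothetical, and cooking time is left out because it generally does not scale linearly with servings.

```python
from fractions import Fraction

Ingredient = tuple[Fraction, str, str]  # (quantity, unit, name)


def scale(ingredients: list[Ingredient], servings: int, wanted: int) -> list[Ingredient]:
    """Scale every quantity from `servings` portions to `wanted` portions."""
    factor = Fraction(wanted, servings)
    return [(qty * factor, unit, name) for qty, unit, name in ingredients]


# Hypothetical recipe for 4 servings.
pancakes: list[Ingredient] = [
    (Fraction(2), "cups", "flour"),
    (Fraction("1/2"), "tsp", "salt"),
    (Fraction(3), "", "eggs"),
]

for qty, unit, name in scale(pancakes, servings=4, wanted=6):
    print(qty, unit, name)
```

Scaling from 4 to 6 servings multiplies everything by 3/2, so the flour becomes 3 cups and the salt 3/4 tsp. A fractional result such as 9/2 eggs still needs a human decision about rounding.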

gklitt about 5 years ago

Pleasantly surprised to learn that most recipe sites include structured metadata. Makes sense given the combination of a relatively straightforward schema, and SEO incentive from Google.

vadansky about 5 years ago

I've been using Tasty. Quick videos showing all the steps and how it's supposed to look along the way. That's the only way I can accept recipes anymore.

monksy about 5 years ago

This is pretty awesome. I'm currently working on a data pipeline to demonstrate recipe scraping with Kafka Streams. This is going to be a big help in part of it.

brendanmcd about 5 years ago

Googling for recipes drove me to install an ad blocker. Have to say I never considered how Google created recipe card featured snippets -- cool stuff!

IncRnd about 5 years ago

Almost every comment on this page is helpful and from people's direct experiences. Wonderful :)

Thank you everyone for all of this information!

Mela1998 about 5 years ago

I wish I had this when I first started cooking! I love this concept, but wouldn't this also harm the creator's traffic???

ohhaimarc about 5 years ago

This case is a perfect 'recipe' for reinforcement learning. Let me know if you want help here.

sum2000 about 5 years ago

Neat! I am interested in developing a REST API around it to support more functionality. Wanna collaborate?

saadalem about 5 years ago

Next level: Shazam for cooking shows.

triyambakam about 5 years ago

Hey Ben, if you read this, thanks for your helpful and entertaining YouTube videos!

throwaway55554 about 5 years ago

There is markup specifically for recipes. I wonder why it isn't more often used.

EDIT: Yes, the article mentions it, but doesn't give a clue why it isn't more prevalent.

hamilyon2 about 5 years ago

So, Google actually encourages an open semantic web? That is news.

SeanDav about 5 years ago

Another tool for difficult-to-scrape sites is OCR. There are a few decent free/open-source options available:

https://source.opennews.org/articles/so-many-ocr-options/

partiallypro about 5 years ago

Be careful with this, some recipes are subject to copyright law. I think you can list ingredients of a recipe with no problem, but once you get to exact measurements and prep it somehow switches over to falling under copyright law. There used to be a bunch of open sourced recipe repos/databases...but almost all of them are gone.

papadoc about 5 years ago

Is this post illegal as it contains information on how to commit crimes such as copyright infringement?
