I’m biased since I own a web scraping agency (https://webscrapingsolutions.co.uk/). I was asking myself the same question in 2019.
You can use any programming language, but we have settled on this tech stack: Python, Scrapy (https://github.com/scrapy/scrapy), Redis, and PostgreSQL, for the following reasons:

[1] Scrapy is a well-documented framework, so any Python programmer can start using it after a month of training. There are a lot of guides for beginners (a minimal spider is sketched at the end of this post).

[2] Lots of features are already implemented and open source, so you won’t have to waste time and money on them.

[3] There is a strong community that can help with most questions (I don't think any alternative has that).

[4] Scrapy developers are cheap. You will only need junior to mid-level software engineers to pull off most projects. It’s not rocket science.

[5] Recruiting is easier:

- there are hundreds of freelancers with relevant expertise

- if you search on LinkedIn, there are hundreds of software developers who have worked with Scrapy in the past, and you don’t need that many

- you can grow expertise in your own team quickly

- developers are easily replaceable, even on larger projects

- you can use the same developers on backend tasks

[6] You don’t need DevOps expertise in your web scraping team because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good and cheap enough for 99% of projects.

[7] If you decide to run your own infrastructure, you can use https://github.com/scrapy/scrapyd.

[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs (see the proxy middleware sketch below).

[9] It’s easy to integrate your own AI/ML models into the scraping workflow (see the pipeline sketch below).

[10] With some work, you can use Scrapy for distributed projects that scrape thousands (or even millions) of domains. We are using https://github.com/rmax/scrapy-redis (see the settings sketch below).

[11] Commercial support is available. There are several companies that can build an entire project for you, or take over an existing one, if you don’t have the time or don’t want to do it yourself.

We have built dozens of projects in multiple industries:

- news monitoring

- job aggregators

- real estate aggregators

- ecommerce (anything from one website to monitoring prices on 100k+ domains)

- lead generation

- niche search engines (SEO, PDF files, ecommerce, chemical retail)

- macroeconomic research & indicators

- social media, NFT marketplaces, etc.

So, most projects can be finished using these tools.
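To back up [1] with something concrete: a complete Scrapy spider fits in a few lines. This is a minimal sketch against the public demo site quotes.toscrape.com (not one of our projects), roughly following the official tutorial:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Pull each quote's text and author with CSS selectors.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the pagination link and repeat.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and you get structured JSON; retries, throttling, and request deduplication come from the framework, not your code.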
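On [8], third-party proxies usually plug in through a downloader middleware. A minimal sketch, assuming a hypothetical list of proxy endpoints (Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]; "myproject" is a placeholder module path):

    import random

    # Hypothetical endpoints; a real pool would come from your provider's API.
    PROXY_LIST = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ]

    class RandomProxyMiddleware:
        def process_request(self, request, spider):
            # The built-in HttpProxyMiddleware picks this up downstream.
            request.meta["proxy"] = random.choice(PROXY_LIST)

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RandomProxyMiddleware": 350,
    }

Captcha solvers and headless browsers (e.g. scrapy-playwright) hook in at the same layer, which is why these integrations tend to take hours rather than weeks.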
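On [9], the natural place for a model is an item pipeline, which sees every scraped item before storage. A sketch, assuming a scikit-learn-style text classifier saved with joblib; the model file and the "category" field are made up for illustration:

    import joblib

    class CategoryPipeline:
        def open_spider(self, spider):
            # Load the model once per crawl, not once per item.
            self.model = joblib.load("category_classifier.joblib")

        def process_item(self, item, spider):
            # Enrich the item with a predicted label before it is stored.
            item["category"] = self.model.predict([item["title"]])[0]
            return item

Register it under ITEM_PIPELINES in settings.py and every item flows through the model on its way to PostgreSQL.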
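And on [10], scrapy-redis turns Redis into a shared scheduler and dupe filter, so identical spider processes on many machines consume one queue. The core of the setup is a few settings (the Redis URL is a placeholder for your own instance):

    # settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared queue
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup
    SCHEDULER_PERSIST = True               # keep the queue between runs
    REDIS_URL = "redis://localhost:6379"   # placeholder Redis instance

Spiders then inherit from scrapy_redis.spiders.RedisSpider and read their start URLs from a Redis key, so scaling out is mostly a matter of starting more workers.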