A raw dump of companies from all over the world by LinkedIn handle

197 points by mfrye0 about 2 years ago

28 comments

It's funny how OP does not address where this data comes from even though it's obviously from LinkedIn. I see many people in the comments asking questions so I will add my two cents as someone who is currently employed by LinkedIn and has an interest in web scraping.

This dataset was taken from scraping the company pages from LinkedIn. A company has to pay to have this page, so this certainly does not include all companies. If you have a premium account your search is not rate limited so you can iteratively scrape anything you want even though it's technically a violation of the terms of service.

There are many companies that sell data scraped from LinkedIn as a product. LinkedIn won a court case against hiQ Labs for scraping member data and other things[1]. I am not trying to compare this court case to the OP's website, just something worth mentioning.

In any case, web scraping is a sort of gray area of the law. In my opinion, this data set does not contain member data and is not being monetized so it feels kosher to me.

(Opinions expressed are solely my own and do not express the views or opinions of my employer.)

[1] <a href="https://www2.staffingindustry.com/Editorial/IT-Staffing-Report/Jan.-5-2023/LinkedIn-ends-legal-battle-in-data-scraping-case" rel="nofollow">https://www2.staffingindustry.com/Editorial/IT-Staffing-Repo...</a>

dang about 2 years ago

The submitted title was "World's largest open source company dataset", but (1) "world's largest" is linkbait and the article walks it back, (2) "open source" could be worded better per <a href="https://news.ycombinator.com/item?id=35979581" rel="nofollow">https://news.ycombinator.com/item?id=35979581</a>, and (3) the only thing left in the title after taking those out would be "company dataset", which is too generic to be a good title.

I've therefore replaced the title above with what appears to be an accurate description from <a href="https://news.ycombinator.com/item?id=35978156" rel="nofollow">https://news.ycombinator.com/item?id=35978156</a>.

mfrye0 about 2 years ago

Hey HN, we're thrilled to announce our latest project - the World's Largest Open Source Company Dataset. Our team has been working hard on this product for the past few months, and we're excited to finally share it with you all.

We started off years ago trying to build a B2B app, but getting basic company data at scale was a huge barrier for us. This 15M+ record dataset attempts to solve that and has all the key company fields like name, industry, size, location, LinkedIn handle, etc. We aim to update it quarterly to ensure that you always have the most up-to-date information.

Disclaimer: Okay, we have to admit, we didn't exactly comb through every dataset out there to verify that ours is the world's largest, but we did our research, and we're pretty sure it might be. Whether or not that's true, we believe this dataset is a robust and invaluable resource for anyone interested in company data.

paxys about 2 years ago

You use "open source" multiple times in the post, HN title, HN comments, but:

1. The source code for the project isn't shared anywhere.

2. The data isn't shared under any standard open source license.

3. The terms of your site explicitly prohibit commercial use of this data.

So what exactly makes this "open source in the broadest sense"?

simonw about 2 years ago

It's a 2.64GB CSV file with the following columns:

<pre><code> handle type name website founded industry specialties size city state country_code </code></pre>

15,263,246 rows.

I think the main listing for Google is this one (as an example):

10361050:company/google,Public Company,Google,goo.gle,,Software Development,"search, ads, mobile, android, online video, apps, machine learning, virtual reality, cloud, hardware, artificial intelligence, youtube, and software","10,001+",Mountain View,California,US
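For anyone who wants to load a row like that, here is a minimal Python sketch. It assumes the column order listed above matches the file, and that the leading `10361050:` in the sample is a line-number prefix from a grep-style search rather than part of the data (so it is dropped here). `COLUMNS`, `SAMPLE`, and `parse_rows` are illustrative names, not anything shipped with the dataset.

```python
import csv
import io

# Column order as listed in the comment above (assumed to match the file).
COLUMNS = [
    "handle", "type", "name", "website", "founded", "industry",
    "specialties", "size", "city", "state", "country_code",
]

# The sample Google row, without the assumed "10361050:" line-number prefix.
SAMPLE = (
    'company/google,Public Company,Google,goo.gle,,Software Development,'
    '"search, ads, mobile, android, online video, apps, machine learning, '
    'virtual reality, cloud, hardware, artificial intelligence, youtube, '
    'and software","10,001+",Mountain View,California,US\n'
)


def parse_rows(text):
    """Parse CSV text into dicts keyed by COLUMNS.

    The csv module handles the quoted fields, so the commas inside
    "specialties" and "size" do not split those values.
    """
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    return list(reader)


rows = parse_rows(SAMPLE)
google = rows[0]
print(google["name"], google["size"], google["country_code"])
```

A naive `line.split(",")` would break on this row because `specialties` and `size` contain commas inside quotes, which is why the sketch goes through the `csv` module. If the real file ships with a header line, passing `fieldnames` would make that header show up as the first data row, so drop the argument in that case.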

decide1000 about 2 years ago

Not sure why you need an account. Download it here: <a href="https://bigpicture-datasets-public.s3.us-west-2.amazonaws.com/companies-dataset-2023-02-ckgENv.csv.gz" rel="nofollow">https://bigpicture-datasets-public.s3.us-west-2.amazonaws.co...</a>

nologic01 about 2 years ago

Good luck with your launch! This reminded me of a similar project, the opencorporates database (<a href="https://opencorporates.com/" rel="nofollow">https://opencorporates.com/</a>), though the target use cases seem different.

photochemsyn about 2 years ago

"With over 15 million global companies included..."

What distinguishes a global from a non-global company? Also, how many of these are anonymous Delaware/Nevada/South Dakota/etc-based shell companies, or are those excluded from the dataset somehow?

gorbachev about 2 years ago

Is LinkedIn scraped data open source?

Murrawhip about 2 years ago

On your home page you list Microsoft as being one of your clients. I'm pretty impressed that you managed to sell them what appears to be (mostly) their own data.

1024core about 2 years ago

What would be really interesting is to turn this into a graph based on, say, past experience of CEOs/big dealings with each other/etc.

givemeethekeys about 2 years ago

Are the entries deduped? If one company owns another, is that represented as well?

ricardo81 about 2 years ago

'open source'

scrape crunchbase

scrape companies house

scrape wherever else

scrape linkedin

frontier company... or maybe not.

tuukkah about 2 years ago

A simple Wikidata query can return the same type of information in case you prefer open data: <a href="https://w.wiki/6ify" rel="nofollow">https://w.wiki/6ify</a>

data_maan about 2 years ago

I looked at the attributes they say the dataset has. Not too many (e.g. number of people, location). The really interesting ones, like who is doing business with whom, are missing.

tomalaci about 2 years ago

How hard is it to scrape LinkedIn for all its public profile data? Do you need special developer access? Do you need to sign some contract with MSFT for anything nontrivial?

borkborkimacat about 2 years ago

semi-direct link as there's some tomfoolery going on with this: <a href="https://wetransfer.com/downloads/b937345cd81d96654cb2d2bb43d4d97c20230518012146/45406d9077ab083a84bca5909fc425a720230518012156/cde513?trk=TRN_TDL_01&utm_campaign=TRN_TDL_01&utm_medium=email&utm_source=sendgrid" rel="nofollow">https://wetransfer.com/downloads/b937345cd81d96654cb2d2bb43d...</a>

Wronnay about 2 years ago

I always get "Oops! We ran into an error. Contact us at support@bigpicture.io" when I try to sign up

companydataguy about 2 years ago

This is Duedil + Company Check + Open Corporates.

Duedil and CC were (mostly) powered by Creditsafe data, which is much better than D&B, in Europe at least.

Open Corporates sold their data to Creditsafe for low 5 figures.

Interesting point re D&B: in the EU it’s mostly a license of the brand name, and it owns little of the data or the business.

r3trohack3r about 2 years ago

This is awesome and in the ballpark of something I'm working on right now.

I currently have a list of developer handles, their associated aliases, and their associated email addresses - trying to map that set to employment history.

Do folks know of any good data sets for this?

andylynch about 2 years ago

How do you plan to identify or differentiate between legal entities? E.g. a big company like your Uber example will often have numerous subsidiaries, in many jurisdictions. Do you plan on including well-known identifiers like LEIs in your model?

visarga about 2 years ago

I searched for a dataset like this for a long time, trying to use it to augment named entity recognition tasks for documents. But now that GPT is on the market, this works out of the box. It's still useful as reference for validation.

pimlottc about 2 years ago

I'm confused, most of what's in this dataset has nothing to do with RedHat.

speedgoose about 2 years ago

The blog picture looks to be generated with dalle2. The quality was mind blowing less than a year ago, while it is now a mess of artefacts compared to dalle2.5 (Bing), adobe firefly, stable diffusion, and of course MidJourney.

wg0 about 2 years ago

What I am curious to know about is - what company buys from whom and the whole dependency graph to visualise how complex our modern economy is.

But not sure that kind of information is in there.

Rastonbury about 2 years ago

15 million is small. For example, there are vendors who have indexed several hundred million, speaking size-wise and not about data quality.

brentis about 2 years ago

Can we rename to Glengarry leads?

whoomp12342 about 2 years ago

sounds fun until I have to make an account. then nope. too lazy.