A raw dump of companies from all over the world by LinkedIn handle

197 points by mfrye0 about 2 years ago

28 comments

babblingfish about 2 years ago
It's funny how OP does not address where this data comes from even though it's obviously from LinkedIn. I see many people in the comments asking questions so I will add my two cents as someone who is currently employed by LinkedIn and has an interest in web scraping.

This dataset was taken from scraping the company pages from LinkedIn. A company has to pay to have this page, so this certainly does not include all companies. If you have a premium account your search is not rate limited so you can iteratively scrape anything you want even though it's technically a violation of the terms of service.

There are many companies that sell data scraped from LinkedIn as a product. LinkedIn won a court case against hiQ Labs for scraping member data and other things[1]. I am not trying to compare this court case to the OP's website, just something worth mentioning.

In any case, web scraping is a sort of gray area of the law. In my opinion, this data set does not contain member data and is not being monetized so it feels kosher to me.

(Opinions expressed are solely my own and do not express the views or opinions of my employer.)

[1] https://www2.staffingindustry.com/Editorial/IT-Staffing-Report/Jan.-5-2023/LinkedIn-ends-legal-battle-in-data-scraping-case
dang about 2 years ago
The submitted title was "World's largest open source company dataset", but (1) "world's largest" is linkbait and the article walks it back, (2) "open source" could be worded better per https://news.ycombinator.com/item?id=35979581, and (3) the only thing left in the title after taking those out would be "company dataset", which is too generic to be a good title.

I've therefore replaced the title above with what appears to be an accurate description from https://news.ycombinator.com/item?id=35978156.
mfrye0 about 2 years ago
Hey HN, we're thrilled to announce our latest project - the World's Largest Open Source Company Dataset. Our team has been working hard on this product for the past few months, and we're excited to finally share it with you all.

We started off years ago trying to build a B2B app, but getting basic company data at scale was a huge barrier for us. This 15M+ record dataset attempts to solve that and has all the key company fields like name, industry, size, location, LinkedIn handle, etc. We aim to update it quarterly to ensure that you always have the most up-to-date information.

Disclaimer: Okay, we have to admit, we didn't exactly comb through every dataset out there to verify that ours is the world's largest, but we did our research, and we're pretty sure it might be. Whether or not that's true, we believe this dataset is a robust and invaluable resource for anyone interested in company data.
paxys about 2 years ago
You use "open source" multiple times in the post, HN title, HN comments, but:

1. The source code for the project isn't shared anywhere.

2. The data isn't shared under any standard open source license.

3. The terms of your site explicitly prohibit commercial use of this data.

So what exactly makes this "open source in the broadest sense"?
simonw about 2 years ago
It's a 2.64GB CSV file with the following columns:

    handle type name website founded industry specialties size city state country_code

15,263,246 rows.

I think the main listing for Google is this one (as an example):

    10361050:company/google,Public Company,Google,goo.gle,,Software Development,"search, ads, mobile, android, online video, apps, machine learning, virtual reality, cloud, hardware, artificial intelligence, youtube, and software","10,001+",Mountain View,California,US
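
For anyone who wants to poke at a row like the one above, here is a minimal sketch using Python's standard `csv` module. The column names are the ones listed in the comment; the helper name `parse_row` is hypothetical, and dropping the leading `10361050:` assumes it is a line-number prefix added by a search tool rather than part of the handle.

```python
import csv
import io

# Column names exactly as listed in the comment above.
COLUMNS = [
    "handle", "type", "name", "website", "founded", "industry",
    "specialties", "size", "city", "state", "country_code",
]

# The sample row from the comment. Assumption: the leading "10361050:"
# is a line-number prefix from a search tool, so it is omitted here.
SAMPLE_ROW = (
    'company/google,Public Company,Google,goo.gle,,Software Development,'
    '"search, ads, mobile, android, online video, apps, machine learning, '
    'virtual reality, cloud, hardware, artificial intelligence, youtube, '
    'and software","10,001+",Mountain View,California,US'
)


def parse_row(line: str) -> dict:
    """Parse a single CSV line into a dict keyed by COLUMNS (hypothetical helper)."""
    # Passing fieldnames makes DictReader treat the first line as data, not a header.
    reader = csv.DictReader(io.StringIO(line), fieldnames=COLUMNS)
    return next(reader)


record = parse_row(SAMPLE_ROW)
print(record["name"], record["size"], record["country_code"])
```

Note that the quoted `specialties` and `size` fields contain commas, so splitting on `,` by hand would misalign the columns; a real CSV parser handles the quoting. The empty `founded` field comes back as an empty string.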
decide1000 about 2 years ago
Not sure why you need an account. Download it here:

https://bigpicture-datasets-public.s3.us-west-2.amazonaws.com/companies-dataset-2023-02-ckgENv.csv.gz
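
Since the file is a gzipped CSV of several gigabytes, it is probably easier to stream it than to load it whole. The sketch below is one way to do that with the standard library. It assumes the first line of the file is a header row and that a `country_code` column exists (per the column list elsewhere in this thread); the function names and the small demo file are made up for illustration, so point `iter_companies` at wherever the real download was saved.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter


def iter_companies(path):
    """Yield one dict per row from a gzipped CSV without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        # Assumption: the first line of the file is a header row.
        yield from csv.DictReader(fh)


def count_by_country(path):
    """Count rows per country_code, treating a blank or missing value as 'unknown'."""
    counts = Counter()
    for row in iter_companies(path):
        counts[row.get("country_code") or "unknown"] += 1
    return counts


# Tiny stand-in file so the sketch runs without the real download.
demo_path = os.path.join(tempfile.mkdtemp(), "companies-demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["handle", "name", "country_code"])
    writer.writerow(["company/example-a", "Example A", "US"])
    writer.writerow(["company/example-b", "Example B", "DE"])
    writer.writerow(["company/example-c", "Example C", ""])

demo_counts = count_by_country(demo_path)
print(demo_counts)
```

Because `iter_companies` is a generator, memory use stays roughly constant regardless of file size; only the per-country counter grows.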
nologic01 about 2 years ago
Good luck with your launch! This reminded me of a similar project, the opencorporates database (https://opencorporates.com/), though the target use cases seem different.
photochemsyn about 2 years ago
"With over 15 million global companies included..."

What distinguishes a global from a non-global company? Also, how many of these are anonymous Delaware/Nevada/South Dakota/etc-based shell companies, or are those excluded from the dataset somehow?
gorbachev about 2 years ago
Is LinkedIn scraped data open source?
Murrawhip about 2 years ago
On your home page you list Microsoft as being one of your clients. I'm pretty impressed that you managed to sell them what appears to be (mostly) their own data.
1024core about 2 years ago
What would be really interesting is to turn this into a graph based on, say, past experience of CEOs/big dealings with each other/etc.
givemeethekeys about 2 years ago
Are the entries deduped? If one company owns another, is that represented as well?
ricardo81 about 2 years ago
'open source'

scrape crunchbase

scrape companies house

scrape wherever else

scrape linkedin

frontier company... or maybe not.
tuukkah about 2 years ago
A simple Wikidata query can return the same type of information, in case you prefer open data: https://w.wiki/6ify
data_maan about 2 years ago
I looked at the attributes they say the dataset has. Not too many (e.g. number of people, location). The really interesting ones, like who is doing business with whom, are missing.
tomalaci about 2 years ago
How hard is it to scrape LinkedIn for all its public profile data? Do you need special developer access? Do you need to sign some contract with MSFT for anything nontrivial?
borkborkimacat about 2 years ago
semi-direct link as there's some tomfoolery going on with this: https://wetransfer.com/downloads/b937345cd81d96654cb2d2bb43d4d97c20230518012146/45406d9077ab083a84bca5909fc425a720230518012156/cde513?trk=TRN_TDL_01&utm_campaign=TRN_TDL_01&utm_medium=email&utm_source=sendgrid
Wronnay about 2 years ago
I always get "Oops! We ran into an error. Contact us at support@bigpicture.io" when I try to sign up
companydataguy about 2 years ago
This is Duedil + Company Check + Open Corporates.

Duedil and CC were (mostly) powered by Creditsafe data, which is much better in Europe at least than D&B.

Open Corporates sold their data to Creditsafe for low 5 figures.

Interesting point re DnB: in the EU it's mostly a license of the brand name and owns little of the data or the business.
r3trohack3r about 2 years ago
This is awesome and in the ballpark of something I'm working on right now.

I currently have a list of developer handles, their associated aliases, and their associated email addresses - trying to map that set to employment history.

Do folks know of any good data sets for this?
andylynch about 2 years ago
How do you plan to identify or differentiate between legal entities? E.g. a big company like your Uber example will often have numerous subsidiaries, in many jurisdictions. Do you plan on including well-known identifiers like LEIs in your model?
visarga about 2 years ago
I searched for a dataset like this for a long time, trying to use it to augment named entity recognition tasks for documents. But now that GPT is on the market, this works out of the box. It's still useful as a reference for validation.
pimlottc about 2 years ago
I'm confused, most of what's in this dataset has nothing to do with RedHat.
speedgoose about 2 years ago
The blog picture looks to be generated with dalle2. The quality was mind blowing less than a year ago, while it is now a mess of artefacts compared to dalle2.5 (Bing), adobe firefly, stable diffusion, and of course MidJourney.
wg0 about 2 years ago
What I am curious to know about is - what company buys from whom and the whole dependency graph to visualise how complex our modern economy is.

But not sure that kind of information is in there.
Rastonbury about 2 years ago
15 million is small; for example, there are vendors who have indexed several hundred million. I'm speaking of size here, not data quality.
brentis about 2 years ago
Can we rename to Glengarry leads?
whoomp12342 about 2 years ago
sounds fun until I have to make an account. then nope. too lazy.