As is often true of GPT responses, there's some nonsense interspersed here, e.g. the claim that R has "a more mature package universe" than Python.<p>I think this is false, but if you're reading quickly, it sounds cogent enough. As Sarah Constantin observed about GPT2 [0]:<p>> if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot...The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)<p>[0] <a href="https://www.skynettoday.com/editorials/humans-not-concentrating" rel="nofollow">https://www.skynettoday.com/editorials/humans-not-concentrat...</a>
It seems to write in the generic "style" of GPT, instead of in the style I would recognise as a HN poster. Is that because of something baked into how the training process works? It lacks a sort of casualness or air of superiority ;)
Am I correct in understanding that this doesn't actually <i>generate</i> answers based on HN, but instead finds semantically-near comments and sends them verbatim to GPT to summarize? Seems like a good enough hack, though I'd love a detailed writeup of how to actually specialize an existing LLM with additional training data (like HN).
I'm quite pleasantly surprised by this response:<p>Question: what are the biggest fallacies of people that post on hackernews?<p>Answer: One of the biggest fallacies of people that post on Hacker News is the nirvana fallacy, which is when someone jumps from "thing x is imperfect" to "thing x is worthless" without considering any of the benefits. People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology. Another fallacy is the assumption that there is one best tool for doing things - the one that supports the largest amount of use cases. Additionally, people may use appeals to authority, invent their own solutions rather than use existing ones, and assume that the approaches they first learned are most correct. Finally, people may assume open source software is a gift and that dependencies don't need to be tested.
I'm a little surprised that Hacker News comments weren't already in the GPT-3 training set. I just assumed that OpenAI had vacuumed up most of the web already.
Nice! We built something very similar recently, it is more like "Ask your documentation" but very similar implementations otherwise.<p>See a demo on the huggingface transformers documentation: <a href="https://huggingface.co/spaces/jerpint/buster" rel="nofollow">https://huggingface.co/spaces/jerpint/buster</a><p>code: <a href="https://github.com/jerpint/buster">https://github.com/jerpint/buster</a>
For those who are wondering,<p>HN data is indexed with embeddings for semantic search. When queried, it finds the closest articles and top comments, and summarizes them with GPT-3.<p>GPT-3 serves as a rendering tool for the compressed comments.
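That retrieve-then-summarize loop can be sketched in a few lines. This is a guess at the shape of it, not the author's actual code: the retrieval is plain cosine similarity over precomputed embedding vectors, and the prompt format is invented for illustration.

```python
import numpy as np

def top_k_comments(query_vec, comment_vecs, k=3):
    """Return indices of the k comments whose embeddings are most
    cosine-similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = comment_vecs / np.linalg.norm(comment_vecs, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity of every comment to the query
    return np.argsort(sims)[::-1][:k]

def build_prompt(question, comments):
    """Stuff the retrieved comments into a summarization prompt; the
    completion endpoint then acts as the 'rendering tool' for them."""
    context = "\n---\n".join(comments)
    return (
        "Summarize what these Hacker News comments say in answer to the "
        f"question.\n\nQuestion: {question}\n\nComments:\n{context}"
    )
```

The prompt string is then sent to the completions API; the model never sees the whole corpus, only the handful of nearest comments.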
My own experiments made me think that the impact of finetuning is comparable to that of a molecule in a drop in a bucket.<p>> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.<p>I'm assuming you used the openai fine-tuning pathway to make a custom model?<p>Have you tested the responses on vanilla GPT3 vs your custom model?<p>I'd be curious to see the comparison.
This might be a dumb question, but is this based on the <i>collective</i> wisdom of HN? Because I would say that the collective wisdom is just as much in the interaction of the comments and the ranking of those comments as it is in the comments themselves. If you just ingest all the comments wholesale, aren't you rather getting the <i>average</i> wisdom of HN?
Let's admit that HN's culture is that many of us are confidently wrong, which we cover up with impressive technical jargon. As such, any wrong answer in this AI is in fact correct.
I love this! I used to append "reddit" to my Google search queries to get the best results, but the quality of dialog over there has really dropped in recent years. These days I've switched to appending "hackernews", but this is even better.
Nice work! Been playing with Langchain and was not aware of patterns.app.<p>This whole space is moving so fast it's hard to keep up for someone whose immediate day job doesn't revolve around it. Congrats.
Anyone here know how to generate subtitles automatically using AI when a video is playing on the web?<p>Was planning to see how I can build something like these sites but without the need to regularly update the subtitles catalog:
<a href="https://subscene.be" rel="nofollow">https://subscene.be</a>
<a href="https://subtitlecat.com" rel="nofollow">https://subtitlecat.com</a>
<a href="https://subtitletrans.com" rel="nofollow">https://subtitletrans.com</a>
<a href="https://my-subs.co" rel="nofollow">https://my-subs.co</a>
Is there any LLM that can be self-hosted and fed a corpus of data to ingest for question answering? The part I find difficult is how to feed (not train) the open LLM models with a dataset that isn't available to the public.
ChatGPT and friends always talk like those Microsoft and Apple forum responders with 100k reputation.<p>I see that you are asking about "How to get a job at Google". I will help you with "How to get a job at Google". In order to solve the problem of "How to get a job at Google" please follow the following steps first:<p>- rewrite your resume in Google Docs<p>- reinstall Chrome<p>- apply to the job<p>Let me know if I can help further with "How to get a job at Google". I like using it, but I have to tune my prompts to make sure that they don't bullshit me before getting to the point.
I like the project. Had been wanting to do this myself for a long time, because HN has become the first place I go to nowadays for answers, and I value the intelligence and experience distilled in the comments here.<p>I do not like that it seems to be effectively an ad.<p>> Embedding every single one of the 6.5 eligible comments was prohibitively time-consuming and expensive (12 hours and ~$2,000).<p>Does anybody understand what he’s talking about here? Assuming 6.5 million comments and an average token length of 70, we'd be looking at ~$180 ($0.0004 / 1K tokens).
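The back-of-the-envelope math behind that ~$180 figure is straightforward; note that the 70-token average comment length and the $0.0004/1K-token price are the commenter's assumptions, not published figures from the project:

```python
# Rough cost estimate for embedding the corpus at the quoted
# embedding price of $0.0004 per 1K tokens.
comments = 6_500_000        # size of the corpus
avg_tokens = 70             # assumed average comment length
price_per_1k_tokens = 0.0004

total_tokens = comments * avg_tokens          # 455,000,000 tokens
cost = total_tokens / 1000 * price_per_1k_tokens
print(round(cost))  # 182 -- roughly the ~$180 the commenter cites
```

An order of magnitude short of $2,000, which is what the commenter is questioning.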
Nice. I just sort of assumed early on my comments were training some future AI, and I hope that in some small way I have been able to moderate some of its stupider urges.<p>A version where you can turn knobs of flavored contributors would be pretty funny. I know my comment style is easily identifiable and reproducible, and it encodes a certain type of logical conjugation, albeit biased with some principles and trigger topics, and I think there is enough material on HN that there may be such a thing as a distinct, motohagiographic lens. :)
Some day I will sue people like OP (if they're monetizing it) and OpenAI for monetizing my public posts. You can use, reuse and alter public speech but when you earn ad dollars...yeah part of that is mine if your model used my public content. I probably won't actually sue but someone will.<p>I am not a lawyer but there has to be a jurisdiction where I can establish standing at least.
> 4. Index the embeddings in a database<p>If OP is reading: I'm curious about the database you are using to store the embeddings. Pinecone, Weaviate...?
Related question: I've written probably a million words over my lifetime.<p>Is there an easy way to load up GPT with my thoughts to have it be a fake me?
> The methodology I used here is a generic, scalable solution for distilling a knowledge corpus into an embodied intelligence<p>The methodology used here is a generic solution for distilling a non-generic corpus of utterances of HN commenters into a generic platitude machine.
I have an experiment that uses the embeddings to visualize clusterings of HN comments (using tsne). Not super useful but interesting to view the comments in 3D and seeing how similar ones cluster together into mostly relevant themes.
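For anyone curious, that projection step is only a few lines with scikit-learn. The perplexity, init, and seed here are illustrative choices, not the parent's actual settings:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_to_3d(embeddings, perplexity=5, seed=0):
    """Project high-dimensional comment embeddings down to 3 dimensions
    with t-SNE so nearby comments can be inspected as visual clusters."""
    tsne = TSNE(n_components=3, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(np.asarray(embeddings))
```

The 3D points can then be handed to any scatter-plot tool; t-SNE preserves local neighborhoods, so comments with similar embeddings land near each other, which is exactly the clustering effect described above.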
Hmm. I thought perhaps he was going to take the questions from Ask HN threads and the top upvoted comments and fine-tune a model with those as prompt/reply pairs.<p>Curious how that would differ; but it would be an expensive endeavour.
Seeing a ton of projects utilizing ChatGPT nowadays. Are the project owners basically paying the API costs out of pocket? I think it would add up pretty quickly, especially if you hit the front page of HN.
Just to be sure: this is NOT a fine-tuned GPT model, but rather the standard GPT-3 API, used to summarize search results from an HN comments DB based on user input. Right?
Did you also ingest dead comments to the corpus?<p>I would very much like to see the ghost of Terry pop up from time to time, to offer his wisdom and unique style of response.
> I trained on a corpus of over 6.5 million Hacker News comments<p>How long did it take to scrape them and train the "corpus" on this content?