Pretty cool, a js implementation of n-grams!<p>What is amazing to me is this: imagine that English only had 10,000 words, and that for each of those 10,000 words there were 100 valid subsequent words. That’s 1 million valid bigrams. Trigrams take you to 100 million, and 4-grams to 10 billion. Just to index those, you’d need about 14 bits per word and tens of gigabytes of storage (rough back-of-envelope below).<p>LLMs typically have context windows in the tens or hundreds of thousands of tokens. (<i>Back in my day</i> GPT2 had a context window of 1024 and we called that an LLM. And we liked it.) So it’s kind of amazing that a model that can fit on a flash drive can make reasonable next-token predictions on text from the whole internet, with a context size that can fit a whole book.
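<p>The back-of-envelope in JS, using the toy numbers above (10k vocabulary, ~100 plausible successors per word) rather than any real corpus statistics:<p><pre><code>// N-gram counts and naive flat-table storage under the toy assumptions above.
const vocab = 10_000;
const successors = 100;
const bitsPerWordId = Math.ceil(Math.log2(vocab)); // ~14 bits to index one word

for (const n of [2, 3, 4]) {
  const grams = vocab * successors ** (n - 1);            // 1e6, 1e8, 1e10
  const gigabytes = (grams * n * bitsPerWordId) / 8 / 1e9; // no compression at all
  console.log(`${n}-grams: ${grams.toExponential()} entries, ~${gigabytes.toPrecision(2)} GB`);
}
</code></pre>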
Nice work! I built something similar years ago: I compiled the probabilities from a corpus of text (public domain books) in an attempt to produce writing in the style of various authors. The results were actually quite similar to the output of nanoGPT[0]. It was very unoptimized and everything was kept in memory. I also knew nothing about embeddings at the time and only a little about NLP techniques that would certainly have helped. A graph database would probably have been better than the data structure I came up with at the time. You should look into stuff like Datalog, Tries[1], and N-Triples[2] for more inspiration (there's a rough sketch of the trie idea after the links).<p>Your idea of splitting the probabilities based on whether you're starting the sentence or finishing it is interesting, but you might benefit from an approach that creates a "window" of text you can use for lookup; an LCS[3] algorithm could do that. There's probably a lot of optimization you could do based on the probabilities of different sequences; I think that was the fundamental thing I was exploring in my project.<p>Seeing this has further inspired me to consider working on that project again at some point.<p>[0] <a href="https://github.com/karpathy/nanoGPT">https://github.com/karpathy/nanoGPT</a><p>[1] <a href="https://en.wikipedia.org/wiki/Trie" rel="nofollow">https://en.wikipedia.org/wiki/Trie</a><p>[2] <a href="https://en.wikipedia.org/wiki/N-Triples" rel="nofollow">https://en.wikipedia.org/wiki/N-Triples</a><p>[3] <a href="https://en.wikipedia.org/wiki/Longest_common_subsequence" rel="nofollow">https://en.wikipedia.org/wiki/Longest_common_subsequence</a>
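<p>To make the trie suggestion concrete, a minimal toy sketch in JS (not how the linked project or my old one actually works): map an (n-1)-word prefix to counts of the words that follow it, then rank continuations by count.<p><pre><code>// Toy n-gram trie: a prefix of (n-1) words maps to counts of the next word.
class NgramTrie {
  constructor() {
    this.root = { children: new Map(), counts: new Map() };
  }

  add(words, n = 3) {
    for (let i = 0; i + n <= words.length; i++) {
      let node = this.root;
      for (const w of words.slice(i, i + n - 1)) {
        if (!node.children.has(w)) {
          node.children.set(w, { children: new Map(), counts: new Map() });
        }
        node = node.children.get(w);
      }
      const next = words[i + n - 1];
      node.counts.set(next, (node.counts.get(next) || 0) + 1);
    }
  }

  predict(prefix) {
    let node = this.root;
    for (const w of prefix) {
      node = node.children.get(w);
      if (!node) return [];
    }
    // Most frequent continuations first.
    return [...node.counts.entries()].sort((a, b) => b[1] - a[1]).map(([w]) => w);
  }
}

// const t = new NgramTrie();
// t.add("the cat sat on the mat".split(" "));
// t.predict(["the", "cat"]); // ["sat"]
</code></pre>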
This is cool. I don't want a 1GB+ download and an entire LLM running on my machine (or worse, on someone else's machine) just to find words.<p>What I really want is a simple predictive text engine to make writing in English easier (because it's not my first language), helping me find more complex words that I don't use because I don't know them well enough.
Nice work!<p>> If it can do this in 13kb, it makes me wonder what it could do with more bytes.<p>Maybe I misunderstand, but isn't this just the first baby steps of an LLM written in JS? Surely "what it could do with more bytes" is "GPT2 in JavaScript"?
Another day, another story on HN that will send me down a rabbit hole (yesterday's was a three-hour tangent into ternary bit compression after the Microsoft paper) :D<p>Your project is delightful -- thank you for sharing. I have explored this realm a bit before [0] [1], but in Python. The tool I made was for personal use, but streaming every keystroke through a network connection added a lot of unnecessary latency.<p>I used word surprisal (negative log probability) to calculate the most likely candidates, and gave a boost to words from my own writing (thus, the predictive engine was "fine-tuned" on my writing); there's a rough sketch of that ranking idea below the links. The result is a dictionary of words with their probabilities of use. This can be applied to bigrams, too. Your project has me thinking: how could that be pruned, massively, to create the smallest possible structure? Your engine feels like the answer.<p>My use case is technical writing: you know what you want to say, including long words you have to repeat over and over, but you want a quicker way of typing.<p>[0]: <a href="https://jamesg.blog/2023/12/15/auto-write/" rel="nofollow">https://jamesg.blog/2023/12/15/auto-write/</a><p>[1]: <a href="https://github.com/capjamesg/autowrite">https://github.com/capjamesg/autowrite</a>
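<p>For the curious, roughly the shape of that ranking step, transposed to JS (my tool is in Python, and corpusCounts/personalCounts are placeholder names for word-frequency maps, so treat this as a sketch rather than the real code):<p><pre><code>// Rank completions for a typed prefix by surprisal (-log2 p), with a
// boost for words that appear in my own writing. Placeholder data shapes.
function rankCandidates(prefix, corpusCounts, personalCounts, boost = 2.0) {
  const total = [...corpusCounts.values()].reduce((a, b) => a + b, 0);
  return [...corpusCounts.keys()]
    .filter((w) => w.startsWith(prefix))
    .map((w) => {
      let surprisal = -Math.log2(corpusCounts.get(w) / total);  // rarer = higher
      if (personalCounts.has(w)) surprisal -= Math.log2(boost); // favour my vocabulary
      return [w, surprisal];
    })
    .sort((a, b) => a[1] - b[1]) // lowest surprisal (most expected) first
    .map(([w]) => w);
}

// rankCandidates("docu",
//   new Map([["document", 30], ["documentation", 20]]),
//   new Map([["documentation", 5]]));
// -> ["documentation", "document"] (the personal boost outweighs the corpus gap here)
</code></pre>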
What's the data corpus? I'm very surprised that the words "Be," "Of," and "And" are among the 25 most common first words of sentences.