Hello HN! We're Sam and Kanyes. We're building an extension to help you remember what you read online. We're calling it Ferret [1].

When you open Ferret on an HTML page, it uses NLP to generate recall-based questions + answers that reinforce key concepts. Consider the following toy example where we open Ferret on an explanation of Bayesian statistics [2]:

Q: What does the frequentist interpretation view probability as?
A: The limit of the relative frequency of an event after many trials

Q: What is often computed in Bayesian statistics using mathematical optimization methods?
A: The maximum a posteriori

We do this by (1) parsing the DOM tree of an HTML page for <p> tags on the client and segmenting these into preprocessed chunks, (2) generating questions from each chunk with a T5-base model fine-tuned on SQuAD, and (3) running extractive question-answering over the chunk and the generated question with RoBERTa, also fine-tuned on SQuAD.

No GPT-3 here - where's the fun in an API call when you can do it yourself? Ferret is built as a React.JS app deployed as a Chrome extension, with models hosted on AWS SageMaker.
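If you're curious what steps (2) and (3) look like on the model side, here's a rough sketch using Hugging Face transformers. The checkpoint names and the task prefix are public stand-ins for illustration, not necessarily the exact weights or prompt formats we deploy:

    # Sketch of the chunk -> questions -> answers flow (steps 2 and 3).
    # Checkpoints below are illustrative public stand-ins, not Ferret's exact weights.
    from transformers import pipeline

    qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")   # question generation
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2") # extractive QA

    def flashcards_for_chunk(chunk: str):
        """Generate questions from one preprocessed chunk, then answer each against the same chunk."""
        # Many QG checkpoints expect a task prefix; adjust to whatever the checkpoint was trained with.
        generated = qg("generate questions: " + chunk, max_length=128)[0]["generated_text"]
        questions = [q.strip() for q in generated.split("<sep>") if q.strip()]
        return [
            {"question": q, "answer": qa(question=q, context=chunk)["answer"]}
            for q in questions
        ]

    chunk = ("In the frequentist interpretation, probability is the limit of the relative "
             "frequency of an event after many trials. In Bayesian statistics, the maximum "
             "a posteriori estimate is often computed with mathematical optimization methods.")
    print(flashcards_for_chunk(chunk))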
Finally, why could this be helpful? Human memory is lossy. Psychologists have long shown that your memory can be modeled with a forgetting curve: if you don't attempt to retain knowledge, you'll likely lose it. But most of the content we read online (technical blog posts, documentation, course notes, articles) gets ingested and quickly forgotten. We're interested in low-friction approaches to helping people better remember this content, starting with fellow engineers who depend on their ability to remember key concepts to do their best work.

We've open-sourced the full repo and are actively responding to PRs + issues [3]. You can also read more about the technical + product challenges we faced [4].

We appreciate all feedback and suggestions!

[1] https://chrome.google.com/webstore/detail/ferret/mjnmolplinickaigofdpejfgfoehnlbh
[2] https://en.wikipedia.org/wiki/Bayesian_statistics

[3] https://github.com/kanyesthaker/qgqa-flashcards

[4] https://samgorman.notion.site/Ferret-c7508ec65df841859d1f84e518fcf21d
Hi, Kanyes here from Ferret. Starting the discussion by sharing an unsolved technical hurdle that may be of interest. Early in development we decided to perform all inference on CPU to avoid unfriendly production costs and the inefficiency of processing single inputs instead of batches.

Sequence-to-sequence models like T5 tend to be large (>300 MB), and we observed high latency of roughly 8 seconds per inference. We've masked this latency on the frontend, mainly by sending concurrent requests with async code (4 at a time) and preloading content early. However, this is kind of hacky, and we'd ideally want to reduce inference time itself.

To this end, we've demonstrated a roughly 1.7x speedup by converting our PyTorch model weights to a quantized ONNX graph. However, we've found a lot of friction in trying to deploy ONNX graphs to AWS. We understand there are a variety of potential solutions (training smaller distilled models, deploying ONNX, challenging our rationale for CPU inference, etc.), so we're looking for suggestions on the best way to make inference faster!
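For context, the quantization step itself isn't the hard part. Here's a rough sketch of one way to get a quantized ONNX graph (dynamic int8 quantization via onnxruntime) for the encoder-only RoBERTa QA model; the checkpoint name and file paths are illustrative, and the T5 encoder-decoder export is more involved than this:

    # Sketch: export a SQuAD-tuned RoBERTa QA model to ONNX, then apply dynamic
    # int8 quantization with onnxruntime. Checkpoint and file paths are illustrative.
    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer
    from onnxruntime.quantization import quantize_dynamic, QuantType

    name = "deepset/roberta-base-squad2"  # public stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForQuestionAnswering.from_pretrained(name)
    model.config.return_dict = False  # return plain tuples so the ONNX exporter can trace outputs
    model.eval()

    # Dummy batch to trace the graph; dynamic_axes keeps batch/sequence lengths flexible.
    dummy = tokenizer("What is MAP?", "MAP is often computed with optimization.",
                      return_tensors="pt")
    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        "qa_model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["start_logits", "end_logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"},
                      "start_logits": {0: "batch", 1: "seq"},
                      "end_logits": {0: "batch", 1: "seq"}},
        opset_version=13,
    )

    # Weight-only int8 quantization: this is where most of the CPU speedup comes from.
    quantize_dynamic("qa_model.onnx", "qa_model_quant.onnx", weight_type=QuantType.QInt8)

The friction for us is less this conversion and more getting the resulting graph served cleanly on AWS.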
Aside from the challenges around per-inference latency, were there any other unique challenges you guys faced when deploying NLP models to the web? It's pretty cool to see ML being applied more actively in day-to-day web browsing.