The most frustrating thing about the many, <i>many</i> clones of this exact type of idea is that pretty much <i>all</i> of them require OpenAI.<p>Stop doing that.<p>You will have way more users if you make OpenAI (or anything that requires the cloud) the 'technically possible, but you'll have to jump through some hoops' option, instead of the other way around.<p>The best way to build these apps IMO is to make them work <i>entirely</i> locally, with a model string that's easily swappable in a .toml file to any huggingface model. Then if you <i>really</i> want OpenAI crap, you can enable it with a docker secret or a `pass` entry or some other key mechanism, plus a config change.<p>The default should be local-first: do as much as possible on-device, and then, <i>if the user /really/ wants to</i>, have the collated prompt send only a small set of tokens to OpenAI.
Keep your data private and don't leak it to third parties. Use something like privateGPT (32k stars). Not your keys, not your data.<p>"Interact privately with your documents using the power of GPT, 100% privately, no data leaks"[0]<p>[0] <a href="https://github.com/imartinez/privateGPT">https://github.com/imartinez/privateGPT</a>
Is it going to send my personal data to OpenAI? Isn't that a serious problem? It doesn't sound like a wise thing to do, at least not without redacting all sensitive personal data first. Am I missing something?
This readme is very confusing. It says we're going to use the GPT-2 tokenizer and GPT-2 as an embedding model, but looking at the code, it seems to use LangChain's default OpenAIEmbeddings and OpenAI LLM. Aren't those text-embedding-ada-002 and text-davinci-003, respectively?<p>I don't see where GPT-2 enters into this at all.
I don't get it: GPT-2 is (one of the few) open models from OpenAI, and you can just run it locally, so why would you use their API for this?
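For reference, running GPT-2 locally is a few lines with the Hugging Face `transformers` library (a sketch; the only download is the model weights themselves, no API key or OpenAI call involved):

```python
# Sketch: run GPT-2 entirely locally via transformers.
# The "gpt2" checkpoint is downloaded once and cached on disk.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The quick brown fox", max_new_tokens=20)
print(out[0]["generated_text"])
```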
<a href="https://github.com/openai/gpt-2">https://github.com/openai/gpt-2</a>
Anyone know how milvus, quickwit, and pinecone compare?<p>I've been thinking about seeing if there are consulting opportunities for local businesses around LLMs, finetuning/vector search, and chat bots. Also making tools that make it easier to drag and drop files and get personalized inference. Recently I saw this one pop into my linkedin feed, <a href="https://gpt-trainer.com/" rel="nofollow noreferrer">https://gpt-trainer.com/</a> . There have been a few others for documents I've found<p><a href="https://www.explainpaper.com/" rel="nofollow noreferrer">https://www.explainpaper.com/</a><p><a href="https://www.konjer.xyz/" rel="nofollow noreferrer">https://www.konjer.xyz/</a><p>Nope nope, wouldn't want to compete with that on pricing. Local open source LLMs on a 3090 would also be a cool service, but wouldn't have any scalability.<p>Are there any other finetuning or vector-search-context startups you've seen?
I work for a company that acts as a security layer between any sensitive enterprise data and the LLMs. Regardless of the model (HF, ChatGPT, Bard), and regardless of the medium - conversational data, PDFs, knowledge bases like Notion, etc. It hides the sensitive data, preventing risky use while fact-checking at the same time. Happy to make an intro if that's what you're looking for! tothepoint.tech
Don't build a personal ChatGPT, and don't let OpenAI, Microsoft and their business partners (and probably the US government) have a bunch of your personal and private information.
Please provide this reference in your readme / blog as it is the original source for your work... and provides the background for the tradeoff between the 2 approaches: 1) fine-tuning vs 2) Search-ask<p><a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb">https://github.com/openai/openai-cookbook/blob/main/examples...</a>
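For anyone unfamiliar with the second approach, the "search-ask" pattern is: embed your documents, retrieve the chunk most similar to the question, and send only that chunk to the LLM along with the question. A toy sketch (this substitutes a trivial bag-of-words similarity for the real embedding model, purely to show the retrieval step):

```python
# Toy illustration of "search-ask" retrieval: find the document
# chunk closest to the question, which would then be sent to the
# LLM as context. A bag-of-words counter stands in for a real
# embedding model here.
from collections import Counter
import math

docs = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
]

def embed(text):
    # Stand-in "embedding": word-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

question = "Who created Python?"
q = embed(question)
best = max(docs, key=lambda d: cosine(q, embed(d)))
print(best)  # the retrieved chunk that would accompany the question
```

Fine-tuning, by contrast, bakes the knowledge into the weights up front; the cookbook notebook walks through why retrieval is usually the better fit for question answering over documents.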
The author has a demo of this here: <a href="https://www.swamisivananda.ai/" rel="nofollow noreferrer">https://www.swamisivananda.ai/</a>