Visual ChatGPT

698 points by debdut, about 2 years ago

32 comments

p-e-w, about 2 years ago

The "memory usage" section of the README highlights the surprising fact that image generation models need much less memory than text-based language models. ChatGPT itself is by far the most resource-hungry part of the system.

Why is that so? It seems counterintuitive. A single picture snapped with a phone takes more space to store than the text of all the books in a typical home library, yet Stable Diffusion runs with 5 GB of RAM while LLaMA needs 130 GB.

Can someone illuminate what's going on here?
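
One rough way to frame the gap described above: the memory a model needs is driven mostly by how many parameters it has, not by how large its inputs or outputs are. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes 2 bytes per parameter (fp16), about 65 billion parameters for the largest LLaMA, and about 1 billion for Stable Diffusion v1; the parameter counts are approximate assumptions, not figures from the thread.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed, approximate parameter counts (not taken from the thread).
LLAMA_65B_PARAMS = 65e9
STABLE_DIFFUSION_V1_PARAMS = 1e9  # UNet + text encoder + VAE, roughly

llama_gb = weight_memory_gb(LLAMA_65B_PARAMS)
sd_gb = weight_memory_gb(STABLE_DIFFUSION_V1_PARAMS)

print(f"LLaMA 65B weights at fp16: ~{llama_gb:.0f} GB")
print(f"Stable Diffusion v1 weights at fp16: ~{sd_gb:.0f} GB")
```

Under these assumptions the LLaMA figure comes to 130 GB, the number quoted in the comment. The Stable Diffusion weights come to a couple of GB, with the rest of the quoted 5 GB plausibly going to activations and framework overhead.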

harveywi, about 2 years ago

Meta will probably soon release a competing technology. It will be called "DALL-E LLaMA".

iandanforth, about 2 years ago

This feels like it owes more to LangChain than a link at the bottom of the page.

Compare their prompt:

https://github.com/microsoft/visual-chatgpt/blob/main/visual_chatgpt.py#L51

With that of the LangChain ReAct conversational agent:

https://github.com/hwchase17/langchain/blob/master/langchain/agents/conversational/prompt.py

Also, it seems appropriate to cite the original ReAct paper (mainly from Google):

https://arxiv.org/abs/2210.03629

spaceman_2020, about 2 years ago

Man, Microsoft is kicking ass at AI. Maybe the others have great AI models too, but I haven't seen any other large company release product after product with AI.

spagoop, about 2 years ago

Very cool. It's almost as if that chat session is a terminal, but instead of running commands you run prose. Very much a new HCI paradigm.

pedrovhb, about 2 years ago

That's neat, but it's not doing anything in the latent space of ChatGPT, is it? As I understand, it basically teaches the assistant to use SD for generating images/descriptions, but comes with all the limitations of the image model being used (as opposed to a leap in results quality such as GPT 3.5 itself was). Teaching it to use tools is of course an interesting concept itself, though.

swyx, about 2 years ago

i have been trying for an hour and am completely unable to run this project. currently facing a "Building wheel for numpy (pyproject.toml) did not run successfully." error.

the state of python dependency management and project distribution is just abjectly horrible.

---

update: perhaps spoke too soon. just made it work! https://github.com/microsoft/visual-chatgpt/issues/37

sharkjacobs, about 2 years ago

We're at the point where these generative AIs are good enough that they're doing things which are really surprising and unexpected and kind of exciting, but they're bad enough that almost everything they create falls somewhere between mediocre and dogshit.

I really hope, if this stuff is going to be ubiquitous, that there are big strides made in improving the quality of the output, very soon. The novelty of seeing fake screencaps of Disney's Beauty and the Beast directed by David Cronenberg is wearing off fast, and aside from some very niche use cases (write some boilerplate code for this common design pattern in this very popular language) I haven't found much it's actually useful for.

osigurdson, about 2 years ago

I think GPT is super useful, but I can't seem to eke any value out of DALL-E. Yes, it can draw a bear in a business suit on the beach well, which is impressive, but I can't think of how to utilize this.

As an example, I've tried to get it to draw architecture diagrams. It draws a few boxes but then places the strangest text on those boxes.

doctoboggan, about 2 years ago

Wow, this is very timely! I just finished up a script that uses ChatGPT (via the OpenAI APIs) to read my customer support messages on Etsy and generate a response. Since I often send and receive images via Etsy support (my customers can customize the product with images), I have been searching for a way to let ChatGPT "know" what the image is. Currently the script just inserts the text "<uploaded image>", but I was just hacking together something using stable-diffusion-webui's API (interrogate using CLIP), and was struggling with a few things. I took a break to browse HN and this pops up!

I will definitely be taking a look to see how this works and will try to get it integrated with my script.
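
The placeholder approach described above could be sketched roughly as follows. This is not doctoboggan's script: the function name `fill_image_placeholders` and the `describe` callable are hypothetical, and `fake_describe` is a stand-in for whatever captioning step is actually used (for example a CLIP interrogation call), whose real API is not shown here.

```python
from typing import Callable, List

PLACEHOLDER = "<uploaded image>"  # the marker text mentioned in the comment


def fill_image_placeholders(
    message: str, image_paths: List[str], describe: Callable[[str], str]
) -> str:
    """Replace each placeholder, in order, with a text description of one image."""
    for path in image_paths:
        message = message.replace(PLACEHOLDER, f"[image: {describe(path)}]", 1)
    return message


def fake_describe(path: str) -> str:
    """Stand-in captioner (hypothetical); a real one would call an image model."""
    return f"description of {path}"


text = "Can you print this? <uploaded image>"
result = fill_image_placeholders(text, ["cat.png"], fake_describe)
print(result)
```

Passing the captioner in as a callable keeps the text handling separate from the image model, so the model can be swapped without touching the rest of the script.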

tuanx5, about 2 years ago

This reminds me of Christina's workstation in Westworld Season 4

iamflimflam1, about 2 years ago

Linked paper is available here: https://arxiv.org/abs/2303.04671

mmq, about 2 years ago

I think the chat interface is a bit restrictive when it comes to multimodal models. A much cleaner interface would be an "AI notebook" where the user can move, compare, and rerun blocks. Sharing, versioning, and collaborating with others are also more straightforward with notebooks.

userbinator, about 2 years ago

"ChatGPT, I meant a desk with legs."

For a second, I thought this was a Visual Studio-related plugin.

aaronrobert, about 2 years ago

ChatGPT is now not just a standalone AI model but a powerful core AI engine, and more and more people and companies will build interesting things on top of it, like this awesome Visual ChatGPT.

est, about 2 years ago

Microsoft is releasing its second toy while Google has had trouble launching its first.

lwneal, about 2 years ago

The most incredible thing about this system is that it uses Stable Diffusion (the open source AI art generator), rather than DALL-E (the proprietary closed art generator owned by OpenAI).

The fact that even Microsoft, which partially owns OpenAI, is giving up on DALL-E shows the power of building an open-source community around models with published, downloadable weights.

gavi, about 2 years ago

If you are trying to run this on a single GPU, please be aware that the models take up a lot of memory. You can reduce the number of tools by modifying the self.tools portion of the Python script.
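
The idea of trimming the tool list to save memory can be sketched as below. This does not reproduce the project's actual `self.tools` code: the tool names, `make_loader`, and `build_tools` are all hypothetical, and the loaders are stand-ins that record what was loaded instead of loading real model weights.

```python
from typing import Callable, Dict, List

LOADED: List[str] = []  # records which (hypothetical) models were actually loaded


def make_loader(name: str) -> Callable[[], str]:
    """Return a stand-in loader; a real one would load model weights onto the GPU."""
    def load() -> str:
        LOADED.append(name)
        return f"<{name} model>"
    return load


# Hypothetical tool names, for illustration only.
TOOL_LOADERS: Dict[str, Callable[[], str]] = {
    name: make_loader(name)
    for name in ("text_to_image", "image_captioning", "depth_to_image", "edge_detection")
}


def build_tools(enabled: List[str]) -> Dict[str, str]:
    """Load only the tools named in `enabled`, leaving the rest untouched."""
    unknown = [name for name in enabled if name not in TOOL_LOADERS]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    return {name: TOOL_LOADERS[name]() for name in enabled}


tools = build_tools(["text_to_image", "image_captioning"])
print(sorted(tools))
print(LOADED)
```

The point of the sketch is that memory is saved only if the unused models are never loaded in the first place; removing a tool from a list after its model is already on the GPU would not help.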

tomohelix, about 2 years ago

I guess one of the advantages of being early is that Microsoft gets to pick all the low-hanging fruit first.

All of these products are very useful and interesting on their own, but it is still too early to know if MS can continue to refine them and maintain a competitive edge. DALL-E basically died in a few months, unable to compete. Hopefully these other products will have a better fate.

shp0ngle, about 2 years ago

I think they are using Stable Diffusion and not DALL-E? Which makes it kind of funny.

zhangyiwu, about 2 years ago

I tried to run it on my MacBook Pro (M2 Pro), but failed to download the massive file.

hackerlight, about 2 years ago

There are more examples in the paper:

https://arxiv.org/pdf/2303.04671.pdf

yazzku, about 2 years ago

The shit has an MIT license... then requires an API key. Open source all the way, guys! Microsoft loves Open Source!

razodactyl, about 2 years ago

The comprehension thrown around in this thread is beautiful. Love the passion.

totetsu, about 2 years ago

Are there any recommended resources for learning to design these kinds of system architectures?

qntmfred, about 2 years ago

hmmm can I use this to see how far away we are now?

https://karpathy.github.io/2012/10/22/state-of-computer-vision/

Havoc, about 2 years ago

Happy that this needs under 8 GB of VRAM. Neatly fits into medium/highish consumer GPUs.

golol, about 2 years ago

Future AI systems based on LLMs and other foundation models might think less like individuals and more like companies. Ironically, LLMs might finally make symbolic AI possible! The way I see it, symbolic AI was always missing a small sprinkle of "general intelligence" to smooth things out, to grease the gears and connect interfaces. I feel like LLMs have that little bit of magical "generality", so we can start building "symbolic" AI systems which produce work by managing a number of black-box models. It is like a company: protocols and management structures are a sort of symbolic AI that connects black-box humans to each other.

amccloud, about 2 years ago

I've created a little API to grab images from pages to embed in chats. It was surprisingly easy to control with natural language.

https://aimgsrc.com

pmarreck, about 2 years ago

the pace of all this is astonishing, this is amazing

trompetenaccoun, about 2 years ago

Endless new possibilities for online scammers. Bright times ahead.

kilgnad, about 2 years ago

Now is a really good time to make a startup called Skynet.