The "memory usage" section of the README highlights the surprising fact that image generation models need <i>much</i> less memory than text-based language models. ChatGPT itself is by far the most resource-hungry part of the system.<p>Why is that so? It seems counterintuitive. A single picture snapped with a phone takes more space to store than the text of all the books in a typical home library, yet Stable Diffusion runs with 5 GB of RAM while LLAMA needs 130 GB.<p>Can someone illuminate what's going on here?
This feels like it owes more to LangChain than a link at the bottom of the page.<p>Compare their prompt:<p><a href="https://github.com/microsoft/visual-chatgpt/blob/main/visual_chatgpt.py#L51">https://github.com/microsoft/visual-chatgpt/blob/main/visual...</a><p>With that of the LangChain ReAct conversational agent:<p><a href="https://github.com/hwchase17/langchain/blob/master/langchain/agents/conversational/prompt.py">https://github.com/hwchase17/langchain/blob/master/langchain...</a><p>Also, it seems appropriate to cite the original ReAct paper (mainly from Google):<p><a href="https://arxiv.org/abs/2210.03629" rel="nofollow">https://arxiv.org/abs/2210.03629</a>
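Both prompts drive the same ReAct loop, which looks roughly like this (a paraphrase of the shared structure, not the exact text of either file):

    Thought: Do I need to use a tool? Yes
    Action: [one of the tool names]
    Action Input: [input to the tool]
    Observation: [tool output]
    ... (repeat as needed) ...
    Thought: Do I need to use a tool? No
    AI: [final response to the human]

The LLM fills in the Thought/Action lines, the framework executes the tool and pastes the Observation back in, and the loop repeats until the model decides no tool is needed.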
Man, Microsoft is kicking ass at AI. Maybe the others have great AI models too, but I haven't seen any other large company release product after product with AI.
That's neat, but it's not doing anything in the latent space of ChatGPT, is it? As I understand it, this basically teaches the assistant to use SD for generating images/descriptions, but it comes with all the limitations of the image model being used (as opposed to a leap in results quality such as GPT 3.5 itself was). Teaching it to use tools is of course an interesting concept in itself, though.
I have been trying for an hour and am completely unable to run this project, currently facing a "Building wheel for numpy (pyproject.toml) did not run successfully." error.<p>The state of Python dependency management and project distribution is just abjectly horrible.<p>---<p>Update: perhaps I spoke too soon. Just made it work! <a href="https://github.com/microsoft/visual-chatgpt/issues/37">https://github.com/microsoft/visual-chatgpt/issues/37</a>
We're at the point where these generative AIs are good enough that they're doing things which are really surprising, unexpected, and kind of exciting, but they're bad enough that almost everything they create falls somewhere between mediocre and dogshit.<p>I really hope, if this stuff is going to be ubiquitous, that there are big strides made in improving the quality of the output, very soon. The novelty of seeing fake screencaps of Disney's Beauty and the Beast directed by David Cronenberg is wearing off fast, and aside from some very niche use cases (write some boilerplate code for this common design pattern in this very popular language) I haven't found much it's actually useful for.
I think GPT is super useful but can't seem to eke any value out of DALL-E. Yes, it can draw a bear in a business suit on the beach well, which is impressive, but I can't think of how to utilize this.<p>As an example, I've tried to get it to draw architecture diagrams; it draws a few boxes but then places the strangest text on those boxes.
Wow, this is very timely! I just finished up a script that uses ChatGPT (via the OpenAI APIs) to read my customer support messages on Etsy and generate a response. Since I often send and receive images via Etsy support (my customers can customize the product with images), I have been searching for a way to let ChatGPT "know" what the image is. Currently the script just inserts the text "<uploaded image>", but I was hacking together something using stable-diffusion-webui's API (interrogate using CLIP) and struggling with a few things. I took a break to browse HN and this pops up!<p>I will definitely be taking a look to see how this works and will try to get it integrated with my script.
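For anyone curious, the webui piece is roughly this (a sketch assuming AUTOMATIC1111's webui running locally with the --api flag; double-check the endpoint and payload against your version):

    # Caption an image via stable-diffusion-webui's CLIP interrogator,
    # then splice the caption into the text sent to the OpenAI API.
    import base64
    import requests

    def describe_image(path, url="http://127.0.0.1:7860"):
        with open(path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()
        r = requests.post(f"{url}/sdapi/v1/interrogate",
                          json={"image": img_b64, "model": "clip"})
        r.raise_for_status()
        return r.json()["caption"]

    # instead of the literal "<uploaded image>" placeholder:
    caption = describe_image("customer_upload.png")  # path is hypothetical
    placeholder = f"<uploaded image: {caption}>"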
Linked paper is available here: <a href="https://arxiv.org/abs/2303.04671" rel="nofollow">https://arxiv.org/abs/2303.04671</a>
I think the chat interface is a bit restrictive when it comes to multimodal models.<p>A much cleaner interface would be an "AI notebook" where the user can move, compare, and rerun blocks.<p>Sharing, versioning, and collaborating with others on notebooks would also be more straightforward.
ChatGPT is no longer just a standalone AI model but a powerful core engine, and more and more people and companies will build interesting things on top of it, like this awesome Visual ChatGPT.
The most incredible thing about this system is that it uses Stable Diffusion (the open source AI art generator), rather than DALL-E (the proprietary closed art generator owned by OpenAI).<p>The fact that even Microsoft, which partially owns OpenAI, is giving up on DALL-E shows the power of building an open-source community around models with published, downloadable weights.
If you are trying to run this on a single GPU, please be aware that the models take up a lot of memory. You can reduce the number of tools by modifying the self.tools portion of the Python script.
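Something like this, right after the tool list is built (a sketch; the tool names here are placeholders, so match whatever your copy of visual_chatgpt.py actually defines):

    # Inside the bot's __init__, after self.tools is populated:
    # keep only the tools you have VRAM for; every entry you drop
    # is a model that never gets loaded onto the GPU.
    keep = {"Get Photo Description", "Generate Image From User Input Text"}
    self.tools = [tool for tool in self.tools if tool.name in keep]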
I guess one of the advantages of being early is that Microsoft gets to pick all the low-hanging fruit first.<p>All of these products are very useful and interesting by themselves, but it is still too early to know if MS can continue to refine them and maintain a competitive edge. DALL-E basically died in a few months, unable to compete. Hopefully this other stuff will have a better fate.
There are more examples in the paper:<p><a href="https://arxiv.org/pdf/2303.04671.pdf" rel="nofollow">https://arxiv.org/pdf/2303.04671.pdf</a>
Hmmm, can I use this to see how far away we are now:<p><a href="https://karpathy.github.io/2012/10/22/state-of-computer-vision/" rel="nofollow">https://karpathy.github.io/2012/10/22/state-of-computer-visi...</a>
Future AI systems based on LLMs and other foundation models might think less like individuals and more like companies. Ironically, LLMs might finally make symbolic AI possible! The way I see it, symbolic AI was always missing a small sprinkle of "general intelligence" to smooth things out, to grease the gears and connect interfaces. I feel like LLMs have that little bit of magical "generality", so we can start building "symbolic" AI systems which produce work by managing a number of black-box models. It is like a company: protocols and management structures are a sort of symbolic AI that connects black-box humans to each other.
I've created a little API to grab images from pages to embed in chats. It was surprisingly easy to control with natural language.<p><a href="https://aimgsrc.com" rel="nofollow">https://aimgsrc.com</a>