
FunSearch: Making new discoveries in mathematical sciences using LLMs

388 points by reqo over 1 year ago

21 comments

jcgrillo over 1 year ago
To what extent is an LLM necessary here? As far as I can tell (and perhaps I haven't looked closely enough yet), the purpose of the LLM here is to generate things that look plausibly like Python functions conforming to a given type signature.

But it should be possible to generate random, correct Python functions conforming to a given type signature without an LLM. This would be an exercise like [1], just with a substantially more complex language. But might a restricted language be more ergonomic? Something like PushGP [2]?

So I guess my questions would be:

(1) What's the value add of the LLM here? Does it substantially reduce the number of evaluations necessary to converge? If so, how?

(2) Are other genetic programming techniques less competitive on the same problems? Do they produce less fit solutions?

(3) If a more "traditional" genetic programming approach can achieve similar fitness, is there a difference in compute cost, including the cost to train the LLM?

[1] http://www.davidmontana.net/papers/stgp.pdf
[2] https://faculty.hampshire.edu/lspector/push.html
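(To make the [1] reference concrete: strongly typed GP grows random expression trees where each node is chosen to satisfy a required type, so every generated function is type-correct by construction. A toy sketch, with a purely hypothetical primitive set, for illustration only:)

    # Toy illustration of strongly typed GP [1]: grow random expression
    # trees where each node must satisfy a required type, so generated
    # programs type-check by construction. Primitive set is hypothetical.
    import random

    # primitives: name -> (argument types, return type)
    PRIMITIVES = {
        'add': (('int', 'int'), 'int'),
        'mul': (('int', 'int'), 'int'),
        'max': (('int', 'int'), 'int'),
    }
    TERMINALS = {'int': ['a', 'b', '1', '2']}

    def grow(return_type: str, depth: int) -> str:
        """Return source for a random expression of the requested type."""
        if depth == 0 or random.random() < 0.3:
            return random.choice(TERMINALS[return_type])
        name = random.choice(
            [n for n, (_, ret) in PRIMITIVES.items() if ret == return_type])
        args = ', '.join(grow(t, depth - 1) for t in PRIMITIVES[name][0])
        return f'{name}({args})'

    # A random candidate conforming to foo(a: int, b: int) -> int:
    print(f'def foo(a, b): return {grow("int", depth=3)}')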
nybsjytm over 1 year ago
Some important context: the discovery is that a certain number in combinatorics is now known to be between 2.2202 and 2.756, not just between 2.218 and 2.756 as discovered last year. The improvement comes from finding particular sequences of numbers with certain special properties, not from a logic-heavy mathematical proof. (That doesn't mean it's unrigorous.)

But it's an interesting and possibly useful method of coming up with examples: pretty much a genetic algorithm with LLMs.
ignoramous over 1 year ago
Related commentary on "self-play" by Subbarao: https://twitter.com/rao2z/status/1728121216479949048

From the article:

> FunSearch uses an evolutionary method powered by LLMs, which promotes and develops the highest scoring ideas. These ideas are expressed as computer programs, so that they can be run and evaluated automatically. The user writes a description of the problem in the form of code. This description comprises a procedure to evaluate programs, and a seed program used to initialize a pool of programs. At each iteration, FunSearch selects some programs from the current pool. The LLM creatively builds upon these, and generates new programs, which are automatically evaluated. The best ones are added back to the pool of existing programs, creating a self-improving loop.

For web search, I (*evaluator*) use *pplx.ai* and *phind.com* in a similar manner. Ask a question (*seed*) and see what references it brings up (web links). Refine the question or ask follow-ups (*iterate*) so it pulls up different or more in-depth references (*improve*). This works better at unearthing gems than sifting through Reddit or Google.

Given that *Tech Twitter* has amazing content too, I'm looking forward to using Grok for research, now that it is open to all.
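(The quoted loop is compact enough to sketch. Below is a minimal, single-population Python rendering of that evolve-evaluate-select cycle; `llm_generate` and `evaluate` are hypothetical stand-ins for the model call and the user-supplied scorer, not DeepMind's actual code, which also distributes the pool across islands:)

    # Minimal sketch of the loop quoted above. Assumes two hypothetical
    # callables: llm_generate(prompt) -> program source, and
    # evaluate(program) -> numeric score (higher is better).
    import random

    def funsearch_loop(seed_program: str, evaluate, llm_generate,
                       iterations: int = 1000, pool_size: int = 50,
                       k: int = 2) -> str:
        pool = [(evaluate(seed_program), seed_program)]
        for _ in range(iterations):
            # Select a few existing programs and ask the LLM to build on them.
            parents = random.sample(pool, min(k, len(pool)))
            prompt = '\n\n'.join(prog for _, prog in parents)
            child = llm_generate(prompt)
            try:
                score = evaluate(child)      # run the program, score it
            except Exception:
                continue                     # discard programs that crash
            pool.append((score, child))
            # Keep only the highest-scoring programs in the pool.
            pool = sorted(pool, reverse=True)[:pool_size]
        return max(pool)[1]                  # best program found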
alphabetting over 1 year ago
https://twitter.com/gfodor/status/1735348301812383906

> If DeepMind just definitively proved neural networks can generate genuinely new knowledge then it's the most important discovery since fire.

If this were actually the case, why wouldn't everyone be talking about it? I'm impressed it was done with PaLM 2, given that it's less advanced than GPT-4 and Gemini. It will be wild to see what the next few generations of models can do with methods like this.
aconz2 over 1 year ago
Summary: given a program template/skeleton and a fitness function (number of correct results, shorter programs, etc.), generate a population of programs with an LLM, use a prompt that generates a new program from k other versions (they found k=2 is good; kinda biological, eh), run the programs on inputs and score them with the fitness function. It uses the island model for evolution.

I think the prompt looks in principle something like:

    def foo_v1(a, b):
        ...

    def foo_v2(a, b):
        ...

    # Generate me a new function using foo_v1 and foo_v2. You can only
    # change things inside double curly braces, like THIS in {{ THIS }}.
    # (idk, not a "prompt engineer")
    def foo(a, b):
        return a + {{}}

They achieved the new results with only ~1e6 LLM calls (I think I'm reading that right), which seems impressively low. They talk about evaluating/scoring taking minutes. It's interesting to think about the depth vs. breadth tradeoff here, which is tied to the latency vs. throughput of scoring an individual vs. a population. What if you memoized across all programs? Could you keep the loss function multidimensional (one dimension per input or input bucket), so that you might first find a population of programs that each do well in different areas and then work on combining them?

Did we have any prior on how rare the cap set thing is? Had there been previous computational efforts at this to no avail? Cool nonetheless.
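(A hedged sketch of that versioned-prompt idea: sample k parents, sort them worst-to-best, rename them into a version sequence, and ask for the next version, so the model is nudged to extrapolate an improvement. Function names and comment wording are guesses, not the paper's actual prompt:)

    def build_prompt(parents: list[tuple[float, str]]) -> str:
        """parents: (score, source) pairs for k sampled programs,
        each assumed to define a function named `foo`."""
        ordered = [src for _, src in sorted(parents)]  # best comes last
        parts = [src.replace('def foo', f'def foo_v{i}')
                 for i, src in enumerate(ordered)]
        parts.append(f'# Improved version of foo_v{len(ordered) - 1}.\n'
                     f'def foo_v{len(ordered)}(a, b):')
        return '\n\n'.join(parts)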
brotchie over 1 year ago
Paraphrasing a post on Twitter / X: things will only get better from here.

That is, AI capabilities are strictly monotonically increasing (as they have been for decades), and in this case the capabilities are recursively self-improving: I'm already seeing personal ~20-30% productivity gains in coding with AI auto-complete, AI-based refactoring, and AI auto-generated code review diffs from comments.

I feel like we've hit an Intel-in-the-90s era of AI. To make your code 2x as fast, you just had to wait for the next rev of Intel CPUs. Now it's AI models: once you have parts of a business flow hooked up to an LLM system (e.g. coding, customer support, bug triaging), "improving" the system amounts to swapping out the model name.

We can expect everything to kinda get magically better over the next few years, with minimal effort beyond the initial integration.
karencarits over 1 year ago
One of the problems they were approaching was the cap set problem:

https://en.m.wikipedia.org/wiki/Cap_set

> The problem consists of finding the largest set of points (called a cap set) in a high-dimensional grid, where no three points lie on a line. This problem is important because it serves as a model for other problems in extremal combinatorics - the study of how large or small a collection of numbers, graphs or other objects could be. Brute-force computing approaches to this problem don't work - the number of possibilities to consider quickly becomes greater than the number of atoms in the universe.

> FunSearch generated solutions - in the form of programs - that in some settings discovered the largest cap sets ever found. This represents the largest increase in the size of cap sets in the past 20 years. Moreover, FunSearch outperformed state-of-the-art computational solvers, as this problem scales well beyond their current capabilities.
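(To make the "no three points on a line" condition concrete: in F_3^n, three distinct points are collinear exactly when they sum to zero coordinate-wise mod 3, which gives a simple brute-force validity check. Fine for small n, hopeless at scale, which is the point of the quote above:)

    from itertools import combinations

    def is_cap_set(points: list[tuple[int, ...]]) -> bool:
        """True if no three distinct points of F_3^n lie on a line.

        Three distinct points x, y, z in F_3^n are collinear exactly
        when x + y + z == 0 coordinate-wise mod 3.
        """
        return not any(
            all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z))
            for x, y, z in combinations(points, 3)
        )

    # A maximum cap set in F_3^2 has 4 points, e.g.:
    print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)]))  # True
    print(is_cap_set([(0, 0), (1, 1), (2, 2), (0, 1)]))  # False: line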
mejutoco over 1 year ago
I wonder how someone will integrate symbolic reasoning with LLMs, or if it will be possible.
JoelJacobson over 1 year ago
The recent DeepMind paper on FunSearch highlighted their use of pre-trained large language models (LLMs) for generating code improvements. Interestingly, while the main LLM used was Codey, built on the PaLM 2 model family, they also referenced StarCoder, an open-source LLM, in their supplementary information.

However, the GitHub repository for FunSearch doesn't include implementations for these LLMs. For instance, in `sampler.py`:

    class LLM:
      """Language model that predicts continuation of provided source code."""

      def __init__(self, samples_per_prompt: int) -> None:
        self._samples_per_prompt = samples_per_prompt

      def _draw_sample(self, prompt: str) -> str:
        """Returns a predicted continuation of `prompt`."""
        raise NotImplementedError('Must provide a language model.')

This code suggests the need for an external LLM implementation. Given that they successfully used StarCoder, it's surprising that no integration guide or basic implementation for it (or any similar open-source LLM) is provided. Such an inclusion would have significantly enhanced the reproducibility and accessibility of their research.
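(For anyone wanting to fill that gap themselves, a subclass along the following lines might work. This is a minimal, untested sketch assuming the Hugging Face transformers library and the repo's `LLM` base class above; the model name and sampling parameters are placeholder choices, not anything from the paper or repo:)

    # Sketch of a StarCoder-backed implementation of the repo's abstract
    # LLM class, via the Hugging Face transformers text-generation
    # pipeline. Untested against the FunSearch codebase.
    from transformers import pipeline

    class StarCoderLLM(LLM):
        """Draws code continuations from an open-source StarCoder model."""

        def __init__(self, samples_per_prompt: int) -> None:
            super().__init__(samples_per_prompt)
            self._generator = pipeline('text-generation',
                                       model='bigcode/starcoder')

        def _draw_sample(self, prompt: str) -> str:
            out = self._generator(
                prompt,
                max_new_tokens=256,       # illustrative sampling params
                do_sample=True,
                temperature=0.8,
                return_full_text=False,   # return only the continuation
            )
            return out[0]['generated_text']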
mmaunder over 1 year ago
Regardless of whether this is verifiably new knowledge, it's an interesting case study when you consider limiting access to AI based on model size or some other regulatory measure, and the unfair advantage that confers on corporations that do discover new knowledge or laws of nature and can monetize them without sharing.
arkadiytehgraet over 1 year ago
I have been wondering recently about a good test for checking whether LLMs can only parrot stochastically or whether they manage to learn some higher-order concepts inside their weights.

What if we took a bare LLM, trained it only on the information (in maths, physics, etc.) that was available to, say, Newton, and then prompted it to solve the problems that Newton solved? Would it be able to derive the basics of calculus and so on, having never seen them before, maybe even with a little help from the prompts? Maybe it would come up with something completely different, yet also useful? Or nothing at all...
zackmorris over 1 year ago
FunSearch is more along the lines of how I wanted AI to evolve over the last 20 years or so, after reading Genetic Programming III by John Koza:

https://www.amazon.com/Genetic-Programming-III-Darwinian-Invention/dp/1558605436

I wanted to use genetic algorithms (GAs) to come up with random programs run against unit tests that specify expected behavior. It sounds like they are doing something similar, finding potential solutions with neural nets (NNs)/LLMs and grading them against an "evaluator" (I wish they had added more details about how it works).

What the article didn't mention is that above a certain level of complexity, this method begins to pull away from human supervisors, creating and verifying programs faster than we can review them. When they were playing with Lisp GAs back in the 1990s on Beowulf clusters, they found that the technique works extremely well, but that it's difficult to tune GA parameters to evolve the best solutions reliably in the fastest time. So volume III was about re-running those experiments multiple times on clusters about 1000 times faster in the 2000s, to find correlations between parameters and outcomes. Something similar is also needed to understand how tuning NN parameters affects outcomes, but I haven't seen a good paper on whether that relationship is understood any better today.

Also, GPU/SIMD hardware isn't good for GAs, since video cards are designed to run one wide algorithm instead of thousands or millions of narrow ones with subtle differences, as on a cluster of CPUs. So I feel that progress on that front has been hindered for about 25 years, since I first started looking at programming FPGAs to run thousands of MIPS cores (probably ARM or RISC-V today). In other words, the perpetual AI winter we've been in for 50 years is more about poor hardware decisions and socioeconomic factors than technical challenges with the algorithms.

So I'm certain now that some combination of these old approaches will deliver AGI within 10 years. I'm just frustrated with myself that I never got to participate, since I spent all of those years writing CRUD apps or otherwise hustling in the struggle to make rent, with nothing to show for it except a roof over my head. And I'm disappointed in the wealthy for hoarding their money and not seeing the potential of the countless millions of other people as smart as they are who are trapped in wage slavery. IMHO this is the great problem of our time (explained by the pumping-gas scene in Fight Club), although since AGI is the last problem in computer science, we might even see wealth inequality defeated sometime in the 2030s. Either that, or we become Borg!
lgessler over 1 year ago
Tl;dr in my words after a skim: this is a method for using LLMs to find new, SOTA (or at least very good?) heuristic algorithms for NP-hard combinatorial optimization problems. They achieve this not by making the LLM itself smarter, but by finding a way to keep only its best ideas. (I haven't had a chance to read carefully yet, so caveat lector.)

Pretty impressive, but don't panic: we're not at the singularity yet.
rhosseinzadeh over 1 year ago
I wonder what would happen if this were applied to AlphaCode. Instead of generating a million programs for each problem during inference, do this kind of iteration during training and then generate fewer programs per problem (or generate a million, but hopefully better ones?).
anonzzzies over 1 year ago
For me, this is the only right way for LLMs to go: use them to trim the search space for robustly designed *logical* programs/algorithms that we then use to do things. Part of the chain, not the end 'solution' of the chain.
knicholes over 1 year ago
By the universal approximation theorem, we can use artificial neural networks with ReLUs to approximate a function to arbitrary accuracy. But we just get weights out of it at the end. This approach feels similar, but produces code in the end.
Ericson2314 over 1 year ago
Spitting out some Python is weak sauce. The next version of this, I think, will spit out Lean, or at least the first great version will. Then we'll be on to something.
chengdujin over 1 year ago
Enough talk: has anybody replicated what DeepMind has done?
mnemotronic over 1 year ago
Garbage in, garbage out. It's useless unless it can provide a functional mathematical proof of correctness or, at the least, a process by which its fictional composition can be proven or disproven.
stevenhuang over 1 year ago
I wonder where the goalposts will be moved now by the "stochastic parrot" parroters.
wait_a_minute over 1 year ago
Is this the start of the Singularity?