
My finetuned models beat OpenAI's GPT-4

414 points | by majc2 | 11 months ago | 24 comments

kcorbitt · 11 months ago

(Disclaimer: I'm the founder of OpenPipe, one of the fine-tuning services OP tried and ultimately the one that produced the highest-performing model, it appears.)

Data extraction is a use case that fine-tuned models are *fantastic* at, so I'm not surprised that OP got good results. That said, I've also found it's pretty easy to beat GPT-4 across many task types if you have a way of getting strong training data. We published some research[1] a week ago where we found that across 4 example tasks spanning creative summarization, question answering, data extraction and classification, a fine-tuned Llama 3 8B was able to outperform GPT-4 on 3 of them. The key was to create a repeatable way of generating high-quality training data, which is also addressed in the post.

[1]: https://openpipe.ai/blog/mixture-of-agents
gillesjacobs · 11 months ago

This is entirely unsurprising and in line with the finding that even small specialized models do better at information extraction and text classification, so it's no wonder finetuned large LMs do well too.

Personally, my PhD work was on fine-grained ACE-like event and sentiment extraction, and "small" specialized finetuned transformers such as BERT and RoBERTa-large outperformed prompted LLMs. I would love to see small-model scores included alongside some SOTA pipelines.

This is great work anyway, even if it replicates known results!
dimask · 11 months ago

Thanks for putting in all this work and sharing it in such detail! Data extraction/structuring data is the only serious application of LLMs I have actually engaged in for real work and found useful. I had to extract data from experience sampling reports which I could not share online, so ChatGPT etc. was out of the question. There were sentences describing onsets and offsets of events and descriptions of what went on. I ran models through llama.cpp to turn these into CSV format with 4 columns (onset, offset, description, plus one for whether a specific condition was met in that event, which had to be interpreted from the description). Giving some examples in the prompt of how I wanted it all structured was enough for many different models to do it right. Mixtral 8x7B was my favourite because it ran the fastest at that quality level on my laptop.

I am pretty sure that a finetuned smaller model would be better and faster for this task. It would be great to start finetuning and sharing such smaller models: they do not really have to be better than the commercial LLMs that run online, as long as they are at least not worse. They are already much faster and cheaper, which is a big advantage for this purpose. There is already a need for these tasks to run offline when one cannot share the data with OpenAI and the like. Higher speed and lower cost also allow for more experimentation with finetuning and prompts, with less concern about prompt token lengths and cost. This is an application where smaller, locally run, finetunable models can shine.
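[Editor's note: a few-shot prompt of the kind described above can be sketched roughly as follows. The column names come from the comment, but the example rows and wording of the prompt are hypothetical reconstructions; the actual prompt and data were not shared.]

```python
def build_extraction_prompt(examples, report):
    """Build a few-shot prompt asking a local model to emit CSV rows.

    `examples` is a list of (report_text, csv_block) pairs demonstrating
    the desired output; `report` is the new text to structure.
    """
    header = "onset,offset,description,condition_met"
    parts = [
        "Extract every event from the report as CSV with these columns:",
        header,
        "Infer condition_met (yes/no) from the description.",
        "",
    ]
    for text, csv_block in examples:
        parts.append(f"Report:\n{text}\nCSV:\n{csv_block}\n")
    # End with the new report and an open "CSV:" cue for the model to complete.
    parts.append(f"Report:\n{report}\nCSV:")
    return "\n".join(parts)

# Hypothetical example pair; the real reports could not be shared.
demo = [(
    "Felt anxious from 09:10 until 09:25 while commuting.",
    "09:10,09:25,felt anxious while commuting,yes",
)]
prompt = build_extraction_prompt(demo, "Slept from 14:00 to 14:40, calm.")
```

A prompt built this way can be passed to any llama.cpp-served model; with one or two demonstration pairs most instruction-tuned models will continue the final "CSV:" line in the demonstrated format.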
scosman · 11 months ago

And that's the point of fine-tuning models.

Still, it's good to see someone walk through their fine-tuning process, with a mix of hosted and local options.
botro · 11 months ago

Thanks for sharing this; it's well written and informative. I noticed you used 'temperature=1' in the GPT test for the example in the post. Is this best practice for a task requiring structured output? Have you tested other temperature settings? My casual understanding was that a temperature of 0 is best for these types of workloads, while higher temperatures would be more effective for more 'creative' workloads.
mewpmewp2 · 11 months ago

1. It would be nice to see examples where GPT-4o was inaccurate but the best-performing models were accurate.

2. It would be nice to try again with temperature 0, as I do a lot of structured data extraction. In my experience temperature 0 should always be used, and it can make a huge difference. A temperature of 1 essentially means the model will start to pick lower-probability tokens...
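[Editor's note: the temperature setting is just a request parameter. A minimal sketch of a chat-completions-style payload for structured extraction might look like this; the model name and the JSON schema in the system message are placeholders, not taken from the post.]

```python
def extraction_request(document: str) -> dict:
    """Build a chat-completions-style payload with temperature=0.

    At temperature 0 the sampler greedily takes the highest-probability
    token, which is usually what you want for structured extraction.
    """
    return {
        "model": "gpt-4o",  # placeholder model name
        "temperature": 0,
        "messages": [
            {
                "role": "system",
                "content": 'Extract the event start date as JSON: {"start_date": "YYYY-MM-DD"}',
            },
            {"role": "user", "content": document},
        ],
    }

payload = extraction_request("KABUL, Afghanistan (Feb. 12) ...")
```

The same payload shape works with the OpenAI Python client's `chat.completions.create(**payload)` call.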
denhaus · 11 months ago

For anyone interested, we wrote a paper on a similar topic: https://www.nature.com/articles/s41467-024-45563-x
courseofaction · 11 months ago

Really interesting. Could the potentially controversial content of the target news article have an effect on ChatGPT's ability to summarize it?
jrm4 · 11 months ago

At the risk of sounding like an old head: it seems to me that priority one should be "free and open-source all the models as hard as possible, so that EVERYONE can fine-tune."

(This being a subset of the idea that free / open source is generally preferable for both freedom and quality.)
michaelortega01 · 11 months ago

At Predibase, we recently conducted 700+ fine-tuning experiments to benchmark the performance of popular open-source LLMs across 30 tasks and compared their results to GPT-4. 85% of the time they beat GPT-4.

You can see the results here: https://predibase.com/fine-tuning-index. The site has a series of interactive charts and a link to our arXiv paper.
mewpmewp2 · 11 months ago

I took a look at a random row to try to find why mistakes were happening. Why is this one labelled with start_date: 2011-02-07?

> Afghan, Coalition Forces Clear Northern Kandahar ISAF Joint Command - Afghanistan 2011-02-D-081 For Immediate Release KABUL, Afghanistan (Feb. 12) – Afghan and coalition forces set out to provide security and assist the local population during a clearing operation in a remote village in Shah Wali Kot district, Kandahar province, Feb. 8. District Chief of Police Bacha Khan, and his policemen; Afghan commandos from 2nd Company, 3rd Commando Kandak, along with U.S. service members from Special Operations Task Force – South, searched the village throughout the day and detained 20 suspected insurgents. Also found were 80 pounds (36 kilograms) of homemade explosives and various improvised explosive device-making materials. Leading a squad during the operation was Afghan commando Sgt. Hafiz Rahman, who said this operation has shown him progress. "The people are respecting us," Rahman said. "They ask us if we want tea, or 'do we want bread?' They are thankful for the security." Children during the operation brought commandos blankets in the evening and offered them food throughout the day.

Trying to find the source, I'm also not seeing any indication of Feb 7.

https://www.dvidshub.net/news/65238/afghan-police-commandos-us-special-forces-clear-northern-kandahar

---

And why is this one labelled as Mar 6? GPT-4o and I both find Mar 7 to be the logical answer.

> ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces targeted a Taliban district chief, killed one insurgent and detained several others during an operation in Burkah district, Baghlan province, yesterday. The Taliban district chief maintains ties to Taliban senior leadership throughout Kunduz, Baghlan, and Takhar provinces. He is involved in purchasing weapons and IEDs. Intelligence reports led the security force to the targeted compound in the city, where Afghan forces called for all occupants to exit the buildings peacefully before conducting a search. During that time, an armed individual threatened the security force and the force returned fire, killing him. Several suspected insurgents were detained after initial questioning at the scene.

But despite that, the "finetuned" model also gets Mar 6. How does the finetuned model get Mar 6?
toisanji · 11 months ago

I'm most excited about getting a faster model. A model like GPT-4 can be overkill because it's too slow. What are the smallest fine-tuned models that could beat GPT-4? Is it 7B, or could a 3B model like Phi-3 do well for tasks like classification and summarization?
soist · 11 months ago

Eventually people will realize any underdetermined system of equations has infinitely many solutions. Give me any open-source AI model and I will beat any SOTA benchmark. Why am I so confident? Because curve fitting can be applied to any data set to get as good a result as needed. Combine this approach with mixtures of "experts" and any predetermined set of benchmarks will fall to a curve fit to the benchmark.

The hype is really getting tiresome. There is no way to get from here to any intelligent system with the current techniques. New breakthroughs will require insights into discrete spaces which are not amenable to curve fitting with gradient descent.
simonw · 11 months ago

I'd be interested to see how well these fine-tuned models compare to Claude 3 Haiku (or one of the more expensive Claude models) with a larger set of examples.

The Claude models all have a 200,000-token limit and respond *really* well to examples: you can feed them in as chat JSON message pairs of user input / ideal assistant output.

Haiku is dirt cheap for this kind of thing, and with 200,000 tokens you can probably provide a dozen or so examples.
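[Editor's note: the example-pair approach described above amounts to building a list of alternating user/assistant turns. A minimal sketch, with hypothetical example pairs; see Anthropic's Messages API documentation for the current request parameters.]

```python
def few_shot_messages(pairs, new_input):
    """Turn (input, ideal_output) example pairs into a Messages-API-style
    list of alternating user/assistant turns, ending with the new input."""
    messages = []
    for user_text, ideal_output in pairs:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": ideal_output})
    # The final user turn is the input the model should actually answer.
    messages.append({"role": "user", "content": new_input})
    return messages

msgs = few_shot_messages(
    [("Summarize: cats sleep a lot.", "Cats sleep most of the day.")],
    "Summarize: dogs like walks.",
)
```

With a 200,000-token window you can pack in a dozen or more such pairs before the final input.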
Tiberium · 11 months ago
Did you release the dataset and the code for testing? It would be interesting to check how 3.5 Sonnet performs on this task.
w4nderlust · 11 months ago

We got very similar findings: we published a paper showing that smaller LLMs (3-7B), when finetuned with LoRA, can match or outperform GPT-4 on a variety of tasks (29 out of 31), including classification, summarization, info extraction, and "reasoning": https://arxiv.org/abs/2405.00732 (Predibase cofounder and coauthor of the paper)
blueboo · 11 months ago
Why would you set temperature=1 for this task?
visarga · 11 months ago
What is a good fine-tuning script for Mistral and LLaMA3 on an A100?
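[Editor's note: a single 80 GB A100 comfortably fits LoRA fine-tuning of a 7-8B model. As a starting point only (these values are common defaults, not from the post), the hyperparameters you would hand to a trainer such as Hugging Face peft/trl look roughly like this:]

```python
# Hypothetical starting-point hyperparameters for LoRA fine-tuning a
# 7-8B model (e.g. Mistral 7B or Llama 3 8B) on one 80 GB A100.
# Tune per task; these are illustrative defaults, not recommendations.
lora_hparams = {
    "r": 16,                            # LoRA rank
    "lora_alpha": 32,                   # scaling factor; often 2x the rank
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "learning_rate": 2e-4,
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 4,   # yields an effective batch of 16
    "num_epochs": 3,
    "bf16": True,                       # A100 supports bfloat16
}

def effective_batch(h: dict) -> int:
    """Effective global batch size on a single GPU."""
    return h["per_device_batch_size"] * h["gradient_accumulation_steps"]
```

These map directly onto `peft.LoraConfig` and the trainer's training arguments; frameworks like axolotl accept the same knobs from a YAML file.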
animanoir · 11 months ago

Anything beats GPT-4 nowadays, to be honest.
uptownfunk · 11 months ago
Remember folks there is no free lunch :)
pcwelder · 11 months ago

Here are some test data samples and the corresponding closest train data rows, to give you an idea of the task complexity.

---

Test 1: KABUL, Afghanistan (Jan. 25, 2013) During a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.

Train: KABUL, Afghanistan (Jan. 8, 2013) – During a security operation in Washer district, Helmand province, yesterday, an Afghan and coalition force killed the Taliban leader, Mohammad Sayed, and one other insurgent. Mohammad Sayed distributed weapons and ammunition to Taliban fighters. Prior to his death, Sayed was attempting to acquire rockets for attacks targeting Afghan government officials in the province.

---

Test 2: For Immediate Release

KABUL, Afghanistan (Aug. 6, 2012) Afghan and coalition forces conducted a security operation in search of a Haqqani leader in Tsamkani district, Paktiya province, yesterday. During the operation the security force engaged a group of insurgents with a precision airstrike. After the strike, the Afghan and coalition security force conducted a follow-on assessment and confirmed several insurgents had been killed in the strike. They also confirmed the strike had not injured any civilians or damaged any civilian property.

Train: For Immediate Release

KABUL, Afghanistan (July 22, 2012) — Afghan and coalition forces conducted a security operation in Muhammad Aghah district, Logar province, Saturday.

During the operation, a group of armed insurgents were engaged with a precision airstrike. After the strike, the Afghan and coalition force conducted a follow-on assessment and confirmed multiple insurgents had been killed.

The security force also confirmed the airstrike had not injured any civilians or damaged civilian property.

---

Test 3: ISAF Joint Command Morning Operational Update March 24, 2011 ISAF Joint Command - Afghanistan 2011-03-S-081 For Immediate Release KABUL, Afghanistan (March 24, 2011) A separate Afghan and coalition security force targeted a Taliban IED cell leader in Kandahar today. The leader is responsible for planning, preparing and executing explosive-device attacks on Afghan civilians, Afghan and coalition security forces. The joint security force targeted the leader's suspected compound in Kandahar City based on tips from citizens. The security team contained the area and detained several suspected insurgents. There were no shots fired and no damage done to the targeted compound.

Train: ISAF Joint Command Operational Update Dec. 22 ISAF Joint Command - Afghanistan 2010-12-S-267 2699, 2935, 3022, 3078 For Immediate Release Download PDF KABUL, Afghanistan (Dec. 22) – Several insurgents were killed by Afghan National Security and International Security Assistance Forces in separate clearing operations in southern Afghanistan over the last 24 hours. An Afghan Army and ISAF patrol spotted some insurgents emplacing an improvised explosive device in Sangin district, Helmand province today. After gaining positive identification, combined forces engaged the enemy position, killing two insurgents.
sva_ · 11 months ago
Clickbait headline
XiphiasX · 11 months ago

1) Beat at what? 2) Do they beat Claude 3.5 Sonnet?