
Try to guess if code is real or GPT2-generated

201 points by AlexDenisov over 4 years ago

34 comments

moyix over 4 years ago
Hi, author here! Some details on the model:

* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (basically a locality-sensitive hash to detect near-duplicates).

* I used the Megatron-LM (https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.

* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab here: https://moyix.net/~moyix/csrc_dataset_large.json.gz and https://moyix.net/~moyix/csrc_vocab_large.zip

Happy to answer any questions!
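The near-duplicate detection step described above (a locality-sensitive hash over source files, as in the OpenWebText preprocessing) can be sketched roughly as below. The shingle size, signature length, hash choice, and all names are illustrative assumptions, not details of the actual pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Split a source file into overlapping word k-grams."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, num_hashes=64):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_similarity(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity of the files."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = "int main ( void ) { puts ( hello ) ; return 0 ; }"
b = "int main ( void ) { puts ( hello ) ; return 1 ; }"   # near-duplicate of a
c = "static float dot ( float x , float y ) { return x * y ; }"

sim_ab = est_similarity(minhash(shingles(a)), minhash(shingles(b)))
sim_ac = est_similarity(minhash(shingles(a)), minhash(shingles(c)))
# Near-duplicates share most shingles while unrelated files share almost
# none, so thresholding the estimated similarity flags files to drop.
```

A real deduplication pass would bucket signatures with LSH bands rather than comparing all pairs, but the signature idea is the same.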
Felk over 4 years ago
I got a function that assigned the same expression to three variables. Then it declared a void function with documentation stating "returns true on success, false otherwise". Apparently that code was written by a human, which makes me doubt either the correctness of the website or the quality of the code it was fed.
et1337 over 4 years ago
This looks like overfitting to me. Some of the GPT samples were definitely real code, or largely real code. One looked like something from Xorg, another like it was straight from the COLLADA SDK. It's really hard to define what "truly new code" is if it's just the same code copy-pasted in a different order. Blah blah Ship of Theseus etc.
Aardwolf over 4 years ago
There was some code about TIFF headers, and it was apparently GPT2-generated.

TIFF is a real thing, so some human was involved in some part of that code; it has just been garbled up by GPT2... In other words, the training set shows quite visibly in the result.
ironmagma over 4 years ago
It would be nice if the back button worked so you could see what you guessed wrong. This is a good example of where POST is used unnecessarily and URLs should be idempotent.
klik99 over 4 years ago
For the ones that were just part of a header file, listing a bunch of instance variables and function names, it seems impossible. But for the actual code it is possible, though still quite difficult; I spent too long hunting for some logical inconsistency that would give it away.
damenut over 4 years ago
This was so much harder than I thought it was going to be. I would get a few right, then be absolutely sure of the next one and be wrong. After a while I felt like I was noticing aesthetic differences between the GPT and real code rather than distinguishing between the two based on their content. Very interesting...
efferifick over 4 years ago
I wonder how likely code invariants found in the training set are to be preserved in GPT-2/3. In other words, if I train GPT-2 on C source produced by Csmith (a program generator which I believe produces C programs without undefined behaviour), would programs produced by GPT-2/3 also be free of undefined behaviour?

I understand that GPT-2/3 is just a very smart parrot that has no semantic knowledge of what it is outputting. For example, take a very dumb Markov chain that was "trained" on the following input:

a.c

    int array[6];
    array[5] = 1;

b.c

    int array[4];
    array[3] = 2;

Such a Markov chain could theoretically produce the following code:

out.c

    int array[4];
    array[5] = 1;

which is undefined behaviour (an out-of-bounds write), even though it was produced from two programs with no undefined behaviour. A better question would be: how can we guarantee that certain program invariants (like lack of undefined behaviour) are preserved in the produced code? Or, if there are no guarantees, can we calculate a probability? (Sorry, not an expert on machine learning, just excited about a potentially new way to fuzz programs. Technically, one could instrument C programs with sanitizers and compute backward static slices from the sanitizer instrumentation to the beginning of the program, giving you a sample of a program without undefined behaviour... so there is already the potential for some training set to be expanded beyond what Csmith can provide.)
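The Markov-chain scenario above can be made concrete with a tiny bigram model. This is an illustrative sketch of the "very dumb" model the comment describes; the tokenization and names are invented for the example:

```python
def train(corpus):
    """Map each token to the list of tokens observed to follow it."""
    model = {}
    for tokens in corpus:
        for cur, nxt in zip(tokens, tokens[1:]):
            model.setdefault(cur, []).append(nxt)
    return model

# a.c: int array[6]; array[5] = 1;
a_c = ["int", "array[", "6", "];", "array[", "5", "]", "=", "1", ";"]
# b.c: int array[4]; array[3] = 2;
b_c = ["int", "array[", "4", "];", "array[", "3", "]", "=", "2", ";"]

model = train([a_c, b_c])

# After "array[" the chain may emit any of "3", "4", "5", or "6", so
# "int array[4]; ... array[5] = 1;" (an out-of-bounds write) is a
# reachable output even though neither training program contains
# undefined behaviour: the invariant is not preserved by sampling.
print(sorted(set(model["array["])))  # → ['3', '4', '5', '6']
```

A transformer conditions on far more context than one token, which makes such mix-ups rarer but gives no guarantee against them.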
technologia over 4 years ago
This was a fun exercise. I definitely think this could be difficult to suss out for greener devs, or even more experienced ones. It'd be hilarious to have this model power a live screensaver in lieu of actually being busy at times.
iconara over 4 years ago
I was confused by most examples I got because they started in the middle of a block comment. That's clearly wrong, but was it an artefact of the presentation rather than the generation?
reillyse over 4 years ago
It's easy after a while... all the terribly written code is human-made and the clean, tidy code is GPT2. I, for one, welcome our new programming overlords.
thewarrior over 4 years ago
This is actually quite impressive. Try reading the comments in the code: they often make perfect sense in the local context even if it's GPT-2 gibberish.

The real examples have worse comments at times.

The only flaw is that it shows fake code most of the time, so you can game it that way.
Gravityloss about 4 years ago
A lot of the real code seems superficially nonsensical as well! Functions with lots of arguments while the body consists of "return true;".

I guess it tells us what AI often tells us about ourselves: what we do makes much less sense than we think it does. It is thus easy to fake.

How is it possible to churn out so much music, so many books, or so much software? Well, because most creative works are either not very original or are quite procedural or random.

And this kind of work could indeed be automated (or examined to see whether it needs to be done in the first place).
ivraatiems over 4 years ago
I found this impressively hard at first glance. It just goes to show how difficult getting context is in an unfamiliar codebase. I think with any amount of knowledge of anything allegedly involved (or, you know, a compiler), these examples would fall apart, but it's still an achievement.

I'm also pretty sure there are formatting, commenting, and in-string-text "tells" that reliably indicate whether something is GPT2. Maybe I should try training an AI to figure that out...
cryptica over 4 years ago
I was always able to correctly identify GPT2, but on a few occasions I misidentified human-written code as being written by GPT2, usually when the code was poorly written or the comments were unclear.

GPT2's code looks correct at a glance, but when you try to understand what it's doing, that's when you realize it could not have been written by a human.

It's similar to the articles produced by GPT3: they have the right form but no substance.
blueblimp about 4 years ago
It's fascinating that the main reason this is hard is that some of the human code is bad enough that it's hard to believe it's not GPT-2 output. (The first time this happened to me, I had to look it up to convince myself it was really human code.)

It reminds me of how GPT-3 is good at producing a certain sort of bad writing.

My guess as to why this happens: we humans have abilities of logical reasoning, clarity, and purposefulness that GPT doesn't have. When we use those abilities, we produce writing and code that GPT can't match. When we don't, our output isn't much better than GPT's.
_coveredInBees over 4 years ago
I got 4/4 GPT-2 guesses right. It is impressive, but the "tell" I've found so far is poor structure in the logic of how something is arranged. For example: a bunch of `if` statements in sequence without any `else` clauses, some directly opposing prior clauses. Another example was repeating the same operation a few times in individual lines of code that most human programmers would write in a simpler way.

It's harder with some of the smaller excerpts, though, and I'm sure there are examples of terrible human programmers who write worse code than GPT-2.
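A constructed illustration of that structural tell (invented for this example, not an actual sample from the site): sequential conditionals with directly opposing conditions, a redundant re-test, and the same operation repeated line by line.

```python
from types import SimpleNamespace

# Generated-looking structure: opposing `if`s without `else`, a
# re-tested condition, and an operation repeated verbatim.
def update_state(flags, value):
    if flags.enabled:
        value = value + 1
    if not flags.enabled:          # directly opposes the previous branch
        value = value + 1
    if flags.enabled:              # re-tests a condition already handled
        return value
    result = value * 2
    result = value * 2             # same operation, repeated verbatim
    result = value * 2
    return result

print(update_state(SimpleNamespace(enabled=True), 1))   # → 2
print(update_state(SimpleNamespace(enabled=False), 1))  # → 4
```

The code runs and is even "correct" in a narrow sense; the giveaway is that no human would structure it this way when `if`/`else` and a single assignment suffice.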
TehCorwiz over 4 years ago
The two factors that seemed like dead giveaways were comments that didn't relate to the code, and sequences of repetition with minor or no variation.
nickysielicki over 4 years ago
This is difficult... because these models are just regurgitating after training on real code. Fun little site, but I hope nobody reads too much into this.
Aeronwen over 4 years ago
Got 40/50 just smashing the GPT2 button.
thebean11 over 4 years ago
6/6, quitting while I'm ahead.
jeff-davis over 4 years ago
Cool! Sadly, it's really hard to tell whether OOP boilerplate is real or generated.
xingped over 4 years ago
The giveaway seems to be the comments. Some comments a computer could obviously generate, and some are obviously written by humans. The code seemed irrelevant in making a determination, in my playing around with it.
mhh__ about 4 years ago
I may have delusions of grandeur because I mainly work on compilers and libraries rather than business code, but there is some diabolically awful code in this training set.
hertzrat over 4 years ago
The goal when writing code is to be pretty machine-like and to keep things extremely simple. People also write dead or off-topic comments. That's why this is so hard.
azhenley over 4 years ago
The snippet I'm looking at isn't code at all. It is Polish, and there aren't any comment tokens or anything.
IceWreck about 4 years ago
Did a few, got all of them right. GPT2-generated code always seems to follow a pattern.
The_rationalist over 4 years ago
How much of it is just regurgitating the training set, and therefore chunks of real code?
tpoacher over 4 years ago
There is a "codes" top-level domain? Codes? CODES??

What's next? Advices? Feedbacks? Rests?

I give ups.
neolog over 4 years ago
The black background on the white background makes it annoying to read.
theurbandragon over 4 years ago
How long before we can just write specs instead of code?
jmpeax over 4 years ago
An interesting approach to lossy compression.
AnssiH over 4 years ago
Ah, 0/5, I give up :)
avipars over 4 years ago
Why not use GPT-3 next time?