Reproducing Hacker News writing style fingerprinting

325 points by grep_it, 27 days ago

48 comments

mtlynch, 27 days ago
> Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

```sql
-- 100 most recent comments from 2025, newest first
SELECT
  id,
  text,
  `by` AS username,
  -- `time` is stored as Unix seconds; format it as an ISO 8601 UTC string
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY time DESC
LIMIT 100
```

[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbigquery-public-data!2shacker_news!3sfull
Frieren, 27 days ago
It works for me. The accounts I used a long time ago are there in high positions. I guess that my style is very distinctive.

But I have also seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them seem to also be from the EU). So, I guess, it is also grouping people by their native language other than English.

So, maybe, it is grouping many accounts by the shared bias of different native languages. Probably, we make the same type of mistakes while using English.

My guess is that the accounts of native Indian or Chinese speakers will also be grouped together, for the same reason. Even more so, as the language is more different from English and the bias probably stronger.

It would be cool if Australians, Britons, and Canadians tried the tool. My guess is that the probability of them finding alt-accounts is higher, as the population is smaller and the writing more distinctive than that of Americans.

Thanks for sharing the project. It is really interesting.

Also, do not trust the comments too much. There is an incentive to lie so as not to acknowledge alt-accounts if they were created to remain hidden.
hammock, 27 days ago
The "analyze" feature works pretty well.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.

They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")

My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.

In case anyone cares.
xnorswap, 27 days ago
I wonder how much accuracy would be improved by expanding from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding the frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.

(Noting that the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote.)
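A minimal sketch of the word-pair idea above, assuming a simple lowercase tokenizer and plain relative frequencies. The function names and the sample sentence are hypothetical, and this is not how the tool under discussion is implemented; it only shows what "top 50 pairs, top 20 triples" features might look like.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def top_ngram_frequencies(text: str, n: int, k: int) -> dict:
    """Relative frequency of the k most common n-grams in the text."""
    counts = ngram_counts(tokenize(text), n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(k)}


# Hypothetical sample; a real run would use all of a user's comments.
sample = "I think that I think too much, but I think that is fine"
pair_features = top_ngram_frequencies(sample, n=2, k=50)
triple_features = top_ngram_frequencies(sample, n=3, k=20)
```

The two dictionaries could then be appended to the single-word frequency vector before comparing users. Whether that improves accuracy is an open question that would need testing on known alt pairs.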
jedberg, 27 days ago
Maybe I talk too much on HN. :)

When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.

Probably would be good to exclude those most common words.
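One way to read this suggestion is to drop the words that are most frequent across the whole corpus before listing a user's top words. The sketch below assumes pre-tokenized input; `distinctive_words`, `skip_top`, and the toy data are hypothetical names and values, not part of the tool.

```python
from collections import Counter


def distinctive_words(user_tokens, corpus_tokens, skip_top=3, k=5):
    """Return the user's k most frequent words, ignoring the corpus-wide top words."""
    # Words everyone uses heavily carry little information about one user.
    common = {word for word, _ in Counter(corpus_tokens).most_common(skip_top)}
    user_counts = Counter(t for t in user_tokens if t not in common)
    return [word for word, _ in user_counts.most_common(k)]


# Toy data for illustration only.
corpus = "the the the the that that that but but redis vector".split()
user = "the redis that vector but redis the redis vector".split()
top_words = distinctive_words(user, corpus)
```

A real corpus would need a much larger `skip_top`, and the right cutoff is a judgment call: function words are often the strongest stylistic signal, so removing too many may hurt matching even if the word lists look more interesting.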
nomilk, 27 days ago
For visibility, here's the tool where you can enter your HN username:

https://antirez.com/hnstyle?username=pg&threshold=20&action=search
keepamovin, 27 days ago
This is a great example of what's possible and how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.

Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.
paxys, 27 days ago
It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.

Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the *topic* of discussion rather than style. If you are often in conversations about LLMs or Musk or self-driving cars then you will inevitably end up using many of the same words as others in the same discussions. There are only so many unique words you can use when talking about a technical topic.

I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
chrismorgan, 27 days ago
I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with *straight* quotes show up among the bottom ten in our analyses. (Also etc, because I like to write *&c.*)

I’m not going to try comparing it with normalising apostrophes, but I’d be interested how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
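For anyone who does want to run that comparison, normalising apostrophes before tokenizing is a small preprocessing step. The sketch below maps the four common typographic quote characters to their ASCII equivalents; `QUOTE_MAP` and `normalize_quotes` are hypothetical names, and whether the tool currently does anything like this is not stated in the thread.

```python
# Map typographic quotes to ASCII so "don’t" and "don't" count as one word.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight ones, leaving other text untouched."""
    return text.translate(QUOTE_MAP)
```

Running the similarity search once with and once without this step would show how much of a match comes from the quote style itself rather than from word choice.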
weinzierl, 27 days ago
How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?

https://antirez.com/hnstyle?username=dang&threshold=20&action=search
keepamovin, 27 days ago
We can improve this. antirez has made a highly compelling PoC, but it could be refined for authorship attribution, judging by the number of misses in the comments here and how this compares to the greater accuracy of the original post to which antirez refers. I’m no expert, but some ideas:

- remove super-high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc. (again, NLP standard)
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc.
- weight variations and misspellings higher as distinguishing signals

What are your ideas?
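Two of the ideas above (dropping stop words, and a similarity that gives the same score in both directions) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not antirez's implementation: the tiny `STOP_WORDS` set, `style_vector`, and `cosine_similarity` are hypothetical names chosen here. Cosine similarity is commutative by construction, but it does not provide transitivity.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real one would be far longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"})


def style_vector(tokens, stop_words=STOP_WORDS):
    """Relative frequency of each non-stop word in a list of tokens."""
    counts = Counter(t for t in tokens if t not in stop_words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors (dicts)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical users, for illustration only.
alice = style_vector("the parser is slow and the parser leaks memory".split())
bob = style_vector("a slow parser leaks memory in the loop".split())
carol = style_vector("of the and to".split())
```

Here `cosine_similarity(alice, bob)` and `cosine_similarity(bob, alice)` agree, while `carol`, whose text is all stop words, ends up with an empty vector and a score of 0.0 against everyone. That last case shows the cost of aggressive filtering: users who write mostly in function words lose their signal.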
declan_roberts, 27 days ago
This is exactly why HN needs to allow us to delete accounts.
qsort, 27 days ago
Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.
MivLives, 27 days ago
Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?
seabombs, 27 days ago
This is a bit tangential but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than other places I visit.

Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.
wild_egg, 27 days ago
Very cool. Also a bit surprising — two of my matches are people I know IRL.
LinuxBender, 27 days ago
I think it would be interesting to run this tool against Reddit, 4chan and Tweeter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?
ziddoap, 27 days ago
I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.

Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?

Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?
lnauta, 27 days ago
That makes me wonder two things. Firstly, if you can use this to find LLM-generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate, it would be quite different from a generic response.

Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?
giancarlostoro, 27 days ago
I tried my name, and I don't think a single "match" is any of my (very rarely used) throwaway alts ;) I guess I have a few people I talk like?
SnorkelTan, 27 days ago
I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original mentioned in the post, it detected one of my alts that I had forgotten about. OP's newer implementation using different methodologies did not detect the alt. For reference, the alt was created in 2010 and the last post was in 2012. Perhaps my writing style has changed?
wruza, 27 days ago
Dang's analysis was funny:

*don't site comment we here post that users against you're*

Quite a stance, man :)

And me, clearly inarticulate and less confident than some:

*it may but that because or not and even these*

I noticed that randomly remembered usernames tend to produce either *lots* of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.
Boogie_Man, 27 days ago
No matches higher than .7something and no mutual matches. Let's go boys, I'm a special unique snowflake
morkalork, 27 days ago
I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).
atiedebee, 27 days ago
It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely that there just isn't enough data on my account as I haven't been on HN for that long.
0xWTF, 27 days ago
There are some interesting similarities in o.g. accounts aaronsw, pg, and jedberg.

- aaronsw and jedberg share danielweber
- aaronsw and jedberg share wccrawford
- aaronsw and pg share Natsu
- aaronsw and pg share mcphage
byearthithatius, 27 days ago
This is so cool. The user who talks most like me, and I can confirm he does, is ajb257
nottorp, 27 days ago
Interesting, the top 3 similar accounts to me are two USers and an Australian. I'm Romanian (and living in Romania). I probably read too many books and news in English :)

Well, and I worked a lot with Americans over text-based communication...
jmward01, 27 days ago
I think an interesting use of this is potentially finding LLMs trained to have the style of a person. Unfortunately now, just because a post has my style it doesn't mean it was me. I promise I am not a bot. Honest.
formerly_proven, 27 days ago
I'm surprised no one has made this yet with a clustered visualization.
Lerc, 27 days ago
Used More Often by dang.

don't +0.9339
GenshoTikamura, 27 days ago
Such a nice scientific way to detect and mute those who go against the agenda's grain, oh I mean don't contribute anything meaningful to the community
Uptrenda, 27 days ago
I knew that this was possible but I always thought it took much more... effort? How do we mitigate this, then? Run our posts through an LLM?
throAwOfCou, 27 days ago
I rotate HN accounts every year or two. In my top 4, I found 3 old alts.

This is impressive and scary. Obviously I had to create a throwaway to say this.
alganet, 27 days ago
Cool tool. It's a shame I don't have other accounts to test it.

It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!
wizzwizz4, 27 days ago
PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?

tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the *degree* to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
tptacek, 27 days ago
This is an interesting and well-written post but the data in the app seems pretty much random.
rcpt, 27 days ago
Searched my nearest neighbor and found someone who agrees with my political views.
srhtftw, 27 days ago
Did not find any of the alt accounts I've used since 2007. Which is good.
LoganDark, 27 days ago
we have Dissociative Identity Disorder, I wonder if our different personalities would also have different fingerprints? we do have different writing styles
johnea, 27 days ago
I wonder if it could help improve my karma? 8-/
brap, 27 days ago
My highest match was ChatGPT. Oh well

Edit: ChatGTP, my bad
38, 27 days ago
this got two accounts that I used to use
konstantinua00, 27 days ago
so the website processes only comments older than 2023?

not very useful for newer users like me :/
gfd, 27 days ago
I don't mind revealing my alts since none of them seem to link back to my main. But the top 4 results were all correct for me:

https://antirez.com/hnstyle?username=gfd&threshold=20&action=search

zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)

I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self-conscious about the words I overuse...
tinix, 27 days ago
fun project! but it didn't get any of my alts.
andrewmcwatters, 27 days ago
Well, well, well, cocktailpeanuts. :spiderman_pointing:

I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.

cocktailpeanuts and I, for example, mutually share some words like:

because, people, you're, don't, they're, software, that, but, you, want

Unfortunately, this is a forum where people will use words like "because, people, and software."

Because, well, people here talk about software.

<=^)

Edit: Neat work, nonetheless.
scoresomefeed, 27 days ago
The original version nailed all of my accounts with terrifying accuracy. Since then I make a new account every few days or weeks. Against the rules, I know. And I’ve learned a lot about HN IP tracking and funny shadowbanning-like tricks they play but don't cop to. Like I get different error messages based on the different banned IPs I use. And I see different behavior and inconsistency with flagged messages (like one that got upvoted a day after it was flagged and not visible to other users).