Reproducing Hacker News writing style fingerprinting

325 points by grep_it 27 days ago

48 comments

mtlynch 27 days ago

>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

<pre><code>SELECT
  id,
  text,
  `by` AS username,
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
  `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
  time DESC
LIMIT
  100
</code></pre>

[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbigquery-public-data!2shacker_news!3sfull

Frieren 27 days ago

It works for me. The accounts I used a long time ago are there in high positions. I guess that my style is very distinctive.

But I have also seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them seem to also be from the EU). So, I guess that it is also grouping people by their native language other than English.

So, maybe, it is grouping many accounts by the shared bias of different native languages. Probably, we make the same types of mistakes while using English.

My guess is that accounts of native Indian or Chinese speakers will also be grouped together, for the same reason. Even more so, as those languages are more different from English and the bias is probably stronger.

It would be cool if Australians, British, and Canadians tried the tool. My guess is that the probability of them finding alt accounts is higher, as the population is smaller and the writing more distinctive than that of Americans.

Thanks for sharing the project. It is really interesting.

Also, do not trust the comments too much. There is an incentive to lie so as not to acknowledge alt accounts if they were created to remain hidden.

hammock 27 days ago

The "analyze" feature works pretty well.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.

They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to".)

My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.

In case anyone cares.

xnorswap 27 days ago

I wonder how much accuracy would improve if you expanded from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding the frequency of the top 50 word pairs and the top 20 most common 3-tuples would be a strong signal.

(Not that the accuracy isn't already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote.)
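
A minimal sketch of the word-pair idea above, in Python. This is not the tool's implementation: the tokenizer regex and the helper name `top_ngrams` are assumptions made for illustration. It counts adjacent word n-tuples in a text and returns the most frequent ones with their relative frequencies, which could then be appended to a single-word frequency vector.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (hypothetical, deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def top_ngrams(text, n=2, k=50):
    """Return the k most frequent word n-tuples with their relative frequencies."""
    words = tokenize(text)
    # Slide a window of n words over the token list.
    grams = list(zip(*(words[i:] for i in range(n))))
    total = len(grams)
    if total == 0:
        return []
    counts = Counter(grams)
    return [(gram, count / total) for gram, count in counts.most_common(k)]

sample = "I think that I think too much, and I think that is fine"
pairs = top_ngrams(sample, n=2, k=3)     # most frequent word pairs
triples = top_ngrams(sample, n=3, k=1)   # most frequent 3-tuple
```

Relative rather than absolute frequencies are used so that prolific and occasional commenters remain comparable.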

jedberg 27 days ago

Maybe I talk too much on HN. :)

When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.

It would probably be good to exclude those most common words.
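
One way to act on that suggestion, sketched in Python: drop a fixed list of very common English words before ranking. The `COMMON_WORDS` set below is a small hand-picked assumption (it includes the ten words quoted above), not the list any real tool uses, and `distinctive_words` is a made-up helper name.

```python
import re
from collections import Counter

# Hypothetical, hand-picked list of very common English words; a real
# stop-word list would be much longer.
COMMON_WORDS = {
    "they", "because", "then", "that", "but", "their", "the", "was",
    "them", "had", "a", "an", "and", "of", "to", "in", "is", "it", "i",
}

def distinctive_words(text, k=10):
    """Count words, ignoring COMMON_WORDS, and return the k most frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return [word for word, _ in counts.most_common(k)]

sample = "The cache was slow because the cache had to evict them, but then it was fine"
top = distinctive_words(sample, k=3)
```

Whether this helps is an open question: very common words are also where some stylometric methods look for an author's habits, so the cutoff is a design choice rather than an obvious win.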

nomilk 27 days ago

For visibility, here's the tool where you can enter your HN username:

https://antirez.com/hnstyle?username=pg&threshold=20&action=search

keepamovin 27 days ago

This is a great example of what's possible and of how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.

Instead of just HN, now do it with the whole internet and imagine what you'd find. Then imagine that it's not being done already.

paxys 27 days ago

It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.

Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self-driving cars, then you will inevitably end up using a lot of the same words as others in those discussions. There are only so many unique words you can use when talking about a technical topic.

I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.

chrismorgan 27 days ago

I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)

I’m not going to try comparing it with normalising apostrophes, but I’d be interested in how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
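
For anyone who does want to test that, here is a rough sketch in Python of normalising curly quotes to straight ones before tokenising. The mapping covers only the four common curly quote characters, and the function name `normalize_quotes` is made up for this example.

```python
# Map the four common curly quote characters to their straight equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight quotes (hypothetical helper)."""
    return text.translate(CURLY_TO_STRAIGHT)

curly = "I\u2019m not going to say \u201cdon\u2019t\u201d"
straight = normalize_quotes(curly)
```

Running a similarity search once with and once without this step would show how much of a match comes from the quote characters alone.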

weinzierl 27 days ago

How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?

https://antirez.com/hnstyle?username=dang&threshold=20&action=search

keepamovin 27 days ago

We can improve this. antirez has made a highly compelling PoC, but it could be refined for authorship attribution, judging by the number of misses in the comments here and by how this compares to the greater accuracy of the original post to which antirez refers. I’m no expert, but some ideas:

- remove super-high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, have less semantic value, and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc. (again, NLP standard)
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc.
- weight variations and misspellings higher as distinguishing signals

What are your ideas?

declan_roberts 27 days ago

This is exactly why HN needs to allow us to delete accounts.

qsort 27 days ago

Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.

MivLives 27 days ago

Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?

seabombs 27 days ago

This is a bit tangential, but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than in other places I visit.

Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.

wild_egg 27 days ago

Very cool. Also a bit surprising — two of my matches are people I know IRL.

LinuxBender 27 days ago

I think it would be interesting to run this tool against Reddit, 4chan and Twitter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?

ziddoap 27 days ago

I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.

Is there anything that can be inferred from that? Is my writing less unique, so it ends up being more similar to more people?

Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side effect of his prolific posting, so he matches better with a lot more people?

lnauta 27 days ago

That makes me wonder two things. Firstly, whether you can use this to find LLM-generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate; it would be quite different from a generic response.

Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?

giancarlostoro 27 days ago

I tried my name, and I don't think a single "match" is any of my (very rarely used) throw away alts ;) I guess I have a few people I talk like?

SnorkelTan 27 days ago

I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original mentioned in the post, it detected one of my alts that I had forgotten about. OP's newer implementation, using different methodologies, did not detect the alt. For reference, the alt was created in 2010 and the last post was in 2012. Perhaps my writing style has changed?

wruza 27 days ago

Dang's analysis was funny:

don't site comment we here post that users against you're

Quite a stance, man :)

And me, clearly inarticulate and less confident than some:

it may but that because or not and even these

I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression of them.

Boogie_Man 27 days ago

No matches higher than .7something and no mutual matches let's go boys I'm a special unique snowflake

morkalork 27 days ago

I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).

atiedebee 27 days ago

It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely that there just isn't enough data on my account as I haven't been on HN for that long.

0xWTF 27 days ago

There are some interesting similarities in o.g. accounts aaronsw, pg, and jedberg.

<pre><code>- aaronsw and jedberg share danielweber
- aaronsw and jedberg share wccrawford
- aaronsw and pg share Natsu
- aaronsw and pg share mcphage
</code></pre>

byearthithatius 27 days ago

This is so cool. The user who talks most like me, and I can confirm he does, is ajb257

nottorp 27 days ago

Interesting, the top 3 similar accounts to me are two USers and an Australian. I'm Romanian (and living in Romania). I probably read too many books and news in English :)

Well, and worked a lot with Americans over text-based communication...

jmward01 27 days ago

I think an interesting use of this is potentially finding LLMs trained to have the style of a person. Unfortunately now, just because a post has my style it doesn't mean it was me. I promise I am not a bot. Honest.

formerly_proven 27 days ago

I'm surprised no one has made this yet with a clustered visualization.

Lerc 27 days ago

Used More Often by dang.

don't +0.9339

GenshoTikamura 27 days ago

Such a nice scientific way to detect and mute those who go against the agenda's grain, oh I mean don't contribute anything meaningful to the community

Uptrenda 27 days ago

I knew that this was possible but I always thought it took much more... effort? How do we mitigate this, then? Run our posts through an LLM?

throAwOfCou 27 days ago

I rotate HN accounts every year or two. In my top 4, I found 3 old alts.

This is impressive and scary. Obviously I had to create a throwaway to say this.

alganet 27 days ago

Cool tool. It's a shame I don't have other accounts to test it.

It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!

wizzwizz4 27 days ago

PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?

tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
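
The "mutual match" property can be made concrete with a small sketch in Python. It assumes each user is already represented by a numeric vector (the vectors below are made up), uses cosine similarity as a stand-in for whatever score the tool actually computes, and reports the fraction of a user's top-k neighbours that also list that user in their own top-k. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(user, vectors, k):
    """Names of the k users most similar to `user`, excluding the user itself."""
    others = [name for name in vectors if name != user]
    others.sort(key=lambda name: cosine(vectors[user], vectors[name]), reverse=True)
    return others[:k]

def mutual_fraction(user, vectors, k):
    """Fraction of `user`'s top-k neighbours that have `user` in their own top-k."""
    neighbours = top_k(user, vectors, k)
    if not neighbours:
        return 0.0
    mutual = [n for n in neighbours if user in top_k(n, vectors, k)]
    return len(mutual) / len(neighbours)

# Made-up 2-D vectors purely for illustration.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
    "e": [0.6, 0.4],
}
score_a = mutual_fraction("a", vectors, k=1)  # a and b pick each other
score_e = mutual_fraction("e", vectors, k=1)  # e picks b, but b picks a
```

A fraction below 1.0 means some neighbours rank the user lower than the user ranks them, which gives a continuous measure instead of the fixed "20, 30, 50, 100" cutoffs.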

tptacek 27 days ago

This is an interesting and well-written post but the data in the app seems pretty much random.

rcpt 27 days ago

Searched my nearest neighbor and found someone who agrees with my political views.

srhtftw 27 days ago

Did not find any of the alt accounts I've used since 2007. Which is good.

LoganDark 27 days ago

we have Dissociative Identity Disorder, I wonder if our different personalities would also have different fingerprints? we do have different writing styles

johnea 27 days ago

I wonder if it could help improve my karma? 8-/

brap 27 days ago

My highest match was ChatGPT. Oh well

Edit: ChatGTP, my bad

38 27 days ago

this got two accounts that I used to use

konstantinua00 27 days ago

so the website processes only comments older than 2023?

not very useful for newer users like me :/

gfd 27 days ago

I don't mind revealing my alts since none of them seem to link back to my main. But the top 4 results were all correct for me:

https://antirez.com/hnstyle?username=gfd&threshold=20&action=search

zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)

I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self-conscious about the words I overuse...

tinix 27 days ago

fun project! but it didn't get any of my alts.

andrewmcwatters 27 days ago

Well, well, well, cocktailpeanuts. :spiderman_pointing:

I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.

cocktailpeanuts and I, for example, mutually share some words like:

because, people, you're, don't, they're, software, that, but, you, want

Unfortunately, this is a forum where people will use words like "because, people, and software."

Because, well, people here talk about software.

<=^)

Edit: Neat work, nonetheless.

scoresomefeed 27 days ago

The original version nailed all of my accounts with terrifying accuracy. Since then I make a new account every few days or weeks. Against the rules, I know. And I’ve learned a lot about HN IP tracking and the funny shadowbanning-like tricks they play but don't cop to. For example, I get different error messages depending on which banned IP I use. And I see different behavior and inconsistency with flagged messages (like one that got upvoted a day after it was flagged and not visible to other users).
