I think the focus on algorithms and data storage is totally misguided. You don't need to count every single word on the Internet. This is a question about statistical inference:

(1) Use Zipf's law (or a more suitable distribution) to estimate how many times we expect the 10,000th word to occur in a random sample of documents. That gives us the maximum interval at which the 10,000th word will repeat itself, at whatever level of statistical confidence we choose (a rough sketch of this estimate is at the end of this comment).

(2) Don't count every word. Only track the top 10,000 words plus a tail of whatever size is necessary to cover this maximum interval. With Zipf's law that means fewer than 20,000 words; the exact number is easy enough to calculate.

(3) Store words in a data structure that provides fast lookup by spelling (a tree) and fast access by frequency of occurrence (a list). Bump words up the list as we run into them. When you run into a word that isn't in your tree, grab the entry at the bottom of the frequency list and reuse it as the slot for the new word, repositioning it as needed. There is no need to traverse much data (this counter is also sketched below).

(4) Lower-frequency content naturally drops off the list, higher-frequency content naturally moves up, and we've structured the program so that we can make statistical inferences about the words that are left.

(5) Remember to normalize incoming data with the Porter stemmer or something similar (example below). Lots of small details like that, but it's really just icing on the cake.

Result? You get a statistically sound estimate that you can rerun at whatever level of accuracy you want. The software doesn't need huge computational muscle or massive amounts of memory, since we throw out the long tail of statistically irrelevant content on an ongoing basis. That, coupled with the data structure - we're mostly swapping pointers - keeps the software small and fast.

Is there a better approach?
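
For step (1), a minimal sketch of the estimate, assuming a plain Zipf exponent of s = 1 and a made-up vocabulary of one million word types; the function names and parameters here are mine, purely for illustration:

    def zipf_prob(rank, vocab_size, s=1.0):
        # Probability of the word at `rank` under a Zipf distribution with
        # exponent s over vocab_size word types (assumed parameters).
        harmonic = sum(1.0 / k**s for k in range(1, vocab_size + 1))
        return (1.0 / rank**s) / harmonic

    def expected_gap(rank, vocab_size, s=1.0):
        # Expected number of tokens between occurrences of the rank-th word,
        # i.e. the interval the tail buffer in step (2) has to cover.
        return 1.0 / zipf_prob(rank, vocab_size, s)

    print(round(expected_gap(10_000, 1_000_000)))  # roughly 144,000 tokens

Under those assumptions the 10,000th word shows up about once every 144,000 tokens on average; to hit a given confidence level you would pad that by taking a high quantile of the gap distribution rather than its mean, and that padded interval is what sets the tail size in step (2).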
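
For steps (2)-(4), a minimal sketch of the bounded counter, with a plain dict standing in for the tree and a linear scan standing in for the frequency list; the class and method names are made up for illustration:

    class TopWordCounter:
        # Tracks at most `capacity` words; everything else falls off the bottom.

        def __init__(self, capacity):
            self.capacity = capacity
            self.counts = {}  # word -> count

        def add(self, word):
            if word in self.counts:
                self.counts[word] += 1        # bump an existing word up
            elif len(self.counts) < self.capacity:
                self.counts[word] = 1         # still room in the tail
            else:
                # Evict the current lowest-frequency word and reuse its slot
                # for the new word, as in step (3).
                loser = min(self.counts, key=self.counts.get)
                del self.counts[loser]
                self.counts[word] = 1

        def top(self, k=10_000):
            return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]

The linear scan in the eviction branch is only there to keep the sketch short; the tree-plus-frequency-list structure from (3), or the bucket list used by the Space-Saving algorithm (which this closely resembles), makes both lookup and eviction cheap, so you really are mostly swapping pointers. Space-Saving also seeds the new word with the evicted count plus one, which is what gives it provable error bounds on the reported frequencies.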
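
And for step (5), a quick normalization example, assuming NLTK's implementation of the Porter stemmer is available; the regex tokenizer is just a placeholder and real text needs more care:

    import re
    from nltk.stem import PorterStemmer  # assumes the NLTK package is installed

    _stemmer = PorterStemmer()
    _token_re = re.compile(r"[a-z']+")

    def normalize(text):
        # Lowercase, tokenize, and stem incoming text before it hits the counter.
        return [_stemmer.stem(tok) for tok in _token_re.findall(text.lower())]

    print(normalize("Counting counted counts"))  # ['count', 'count', 'count']

Whether you actually want stemming depends on how the question defines a "word", but that's the icing-on-the-cake part.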