> I don’t mind getting the naive solution first, but I really want to see my candidate having that *aha moment* that O(n²) is probably never good in any problem. And I want that aha moment to come pretty quickly and without hints. No great engineer should ever settle for an O(n²) algorithm, unless bound by memory or some other unmovable constraint.

But that is purely a cultural convention. O(n²) is great for many important problems. For example, parsing a sentence in natural language takes O(n³) time with the standard algorithms; getting it in O(n²) would be evidence that the candidate was a deity.

Why select for familiarity with the details and conventions of the interviewing process? How is that supposed to be helpful?

> Candidates then switch it around to have CustomerId as the Key, and PageId as the Value of the Map. But that’s not particularly great either because it overlooks the fact that you can have many pages per customer, not just one. Some candidates have the intuition that they need a Collection of pages as the Value of the Map

But this is wrong. You're completely fine having a map from CustomerId to a single PageId, because the problem statement specifies that you're collecting customers who have visited more than one page. If you process a record that says CustomerId = X, PageId = Y, and you look up CustomerId in your map, these are the possibilities:

1. map[CustomerId] has no entry. You write Y as the new entry for that CustomerId.

2. map[CustomerId] is already Y. You do nothing.

3. map[CustomerId] has an entry that is not Y. You now know that CustomerId represents one of the customers you're trying to find.

At no point did you need the map values to represent more than one page.
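A minimal sketch of those three cases in Java (the class, method, and variable names are made up for illustration; each record is a (CustomerId, PageId) pair):

    import java.util.*;

    class MultiPageCustomers {
        // Returns the customers that visited more than one unique page.
        static Set<String> findMultiPageCustomers(List<String[]> records) {
            Map<String, String> firstPage = new HashMap<>(); // CustomerId -> one PageId
            Set<String> multiPage = new HashSet<>();
            for (String[] rec : records) {
                String customerId = rec[0], pageId = rec[1];
                String seen = firstPage.get(customerId);
                if (seen == null) {
                    firstPage.put(customerId, pageId); // case 1: no entry yet
                } else if (!seen.equals(pageId)) {
                    multiPage.add(customerId);         // case 3: a second unique page
                }                                      // case 2: same page, do nothing
            }
            return multiPage;
        }
    }

One pass, O(n) time, and the map values never hold more than one page.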
> The condition *“customers that visited at least 2 unique pages”* tends to be a little harder for candidates to get right, so if they’re stuck I throw a little hint: you have a Set of pages from Day1, and a single page from Day2… how can you determine that this is at least two unique pages?

> Poor candidates will loop through the elements in the Set to check if the page from Day2 is in there. This turns your O(n) algorithm into O(n²) again. The number of candidates who have done this is surprising.

> Better candidates will do a *.contains()* on the Set, which is an O(1) operation on a hash set. But there is a catch with the logic.

> The intuition to get this right is this: If you are inside that If loop and the customer visited at least two pages in Day1, and they visited any page in Day2, they’re loyal, regardless of which page they visit in Day2. Otherwise, they visited only one page in Day1, so the question is: is this a different page? If so, they’re loyal; else it’s a duplicate, so you don’t know and should keep going. So your If statement has an Or:

> [code sample involving the interviewer's terrible solution]

> There’s a need for attention to detail, like using “>” instead of “>=” or missing the “!” in the second statement. I saw these fairly often. I didn’t worry. Great candidates spotted them quickly as they double-checked the algorithm when they were done. Good candidates spotted them after a little bit of hinting. That gave me a good signal on *debugging skills*.

Why in the world is this presented as a desirable solution? You have a Set of visited pages from day 1 and a single visited page from day 2. You want to know whether the total number of visited pages is more than 1.

Add the page from day 2 to the Set [O(1)], and then count the Set [also O(1)] (see the first sketch below).

> *What if you pre-processed the files and sorted them by CustomerId, then by PageId?*

> If the files are sorted, then the problem is easier and it’s just a two-pointer algorithm that you can execute in O(n) with O(1) of memory.

> Since the second sort key is by PageId, you follow another two-pointer algorithm to determine that there are at least two unique pages. So it’s a 2-pointer algorithm within a 2-pointer algorithm. It’s kind of a fun problem! I’ll leave the actual implementation as an exercise for the viewer.

> If you want to make the problem even more interesting, you can add a third file. I will leave that as an exercise for the reader as well!

If you're preprocessing the files, why not concatenate them before (or while) sorting them? The asymptotic resource requirements are the same, and you end up with *one* file that can be processed in O(n) time and O(1) space. (Though the result of the algorithm necessarily takes up O(n) space, so I'm not sure how much this should count as an improvement in terms of space requirements...)

This additional preprocessing step makes the generalization to three files trivial. The algorithm is identical: concatenate the files, sort the monofile, and then walk through the sorted entries (see the second sketch below).
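First sketch: the add-then-count check, in Java. The names (`day1Pages`, `day2Page`) are mine; this replaces the interviewer's Or-of-conditions entirely:

    import java.util.Set;

    class LoyaltyCheck {
        // True if the customer saw at least two unique pages across both days.
        // Note: mutates day1Pages; copy it first if the set is needed again.
        static boolean atLeastTwoUniquePages(Set<String> day1Pages, String day2Page) {
            day1Pages.add(day2Page);      // O(1) on a hash set
            return day1Pages.size() >= 2; // O(1); no ">" vs ">=" trap, no "!" to forget
        }
    }

There is no branching to get subtly wrong, which also removes the "attention to detail" trap the interviewer was grading on.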
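Second sketch: concatenate, sort, scan. One assumption worth making explicit: each record has to keep a tag saying which file (day) it came from, or the "visited on both days" half of the loyalty condition is lost in the merge. For brevity this sketch sorts in memory; the O(1)-space version would sort the files externally and stream the scan.

    import java.util.*;

    class ConcatenateSortScan {
        record Visit(String customerId, String pageId, int day) {}

        static List<String> loyalCustomers(List<Visit> day1, List<Visit> day2) {
            // "Concatenate the files", then sort by CustomerId, then PageId.
            List<Visit> all = new ArrayList<>(day1);
            all.addAll(day2);
            all.sort(Comparator.comparing(Visit::customerId)
                               .thenComparing(Visit::pageId));

            List<String> loyal = new ArrayList<>();
            int i = 0;
            while (i < all.size()) {
                // One linear pass over this customer's run of sorted records.
                String customer = all.get(i).customerId();
                int uniquePages = 0;
                boolean sawDay1 = false, sawDay2 = false;
                int j = i;
                while (j < all.size() && all.get(j).customerId().equals(customer)) {
                    if (j == i || !all.get(j).pageId().equals(all.get(j - 1).pageId())) {
                        uniquePages++; // pageId changed within the sorted run
                    }
                    if (all.get(j).day() == 1) sawDay1 = true; else sawDay2 = true;
                    j++;
                }
                if (uniquePages >= 2 && sawDay1 && sawDay2) loyal.add(customer);
                i = j;
            }
            return loyal;
        }
    }

A third file only changes the bookkeeping from two booleans to a small set of days seen; the scan itself is identical.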