
Improving Stack Overflow jobs search with machine learning and R

154 points, by slashdotdash, almost 8 years ago

5 comments

rattray (almost 8 years ago)
I found it got off to a bit of a slow start, but it was a fun, rewarding, and very real-world read.

In particular, I appreciated admissions like this:

> Writing our own genetic algorithm in C# was a bad idea. It took us weeks to implement, test, and optimize. Not to mention all the time spent waiting for results. There was a better solution available all along (the optim function in R). Because we didn't do proper research, we overlooked it and lost time. Sigh.

... the kind of admission far too many eng blogs omit.
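For the curious, a minimal sketch of what the optim route might look like. The features, data, and loss below are invented for illustration; this is not Stack Overflow's actual code:

    # Toy ranking problem: find feature weights that best order results.
    set.seed(1)
    features <- data.frame(title_match = runif(500), geo_match = runif(500))
    clicks   <- 0.7 * features$title_match + 0.3 * features$geo_match + rnorm(500, sd = 0.1)

    loss <- function(w) {
      preds <- as.matrix(features) %*% w
      -cor(drop(preds), clicks, method = "spearman")  # maximize rank correlation
    }

    fit <- optim(par = c(1, 1), fn = loss)  # Nelder-Mead by default
    fit$par  # fitted weights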
gaius (almost 8 years ago)
R is a very well supported language on Windows (since MS bought Revolution Analytics), very well integrated with SQL Server and Azure, and R Server solves a pain point that a lot of real-world users have. And of course it's open source.
stevenwu (almost 8 years ago)
A question to spark discussion and to fill potential gaps in my own knowledge, not to criticize the article; I very much appreciate the transparency and the build-up from the simple naive initial approach to the final approach used in production:

Is anyone else bothered by the claim, shown via bootstrap, that there "is a 100% chance that the new version is better than the current one"? Maybe I've just never come across this use of the bootstrap in my encounters with statistics. I know it as a tool for resampling a dataset to estimate properties of your estimator (mean, variance, what have you) when all you have is the data and no clue about the actual distribution. When I saw the bootstrap paired with that probabilistic claim, I expected the author to compute a bootstrapped (100-x)% confidence interval for both the current and the new weights: if the intervals don't overlap, you can claim with (100-x)% certainty that one is better than the other. Instead, the author creates a new statistic that is a function of both datasets: Z_i = 1 if the new weights beat the current ones on iteration i (on a random subset of the data), else 0, and across all N = 10,000 iterations Z_i = 1. The probabilistic claim that new is better than current rests on the fact that no variation was observed in Z_i. (I'm also somewhat skeptical that, across so many iterations on random subsets, the new weights won every single time.) I think the most you can say is that you simulated subsets of the data and 100% of the time new > current; the stated claim implies inference that isn't there.

Maybe I should just ask one of my past stats profs. Open to someone enlightening me.
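To make the procedure concrete, here is how I read it, with made-up data and a made-up quality metric (nothing here is from the article):

    set.seed(42)
    d <- data.frame(x = rnorm(1000, mean = 1), y = rnorm(1000))   # x deliberately "better"
    metric <- function(w, d) mean(w * d$x + (1 - w) * d$y)        # stand-in quality score
    w_new <- 0.8; w_cur <- 0.2

    z <- replicate(10000, {
      idx <- sample(nrow(d), nrow(d) %/% 2)                       # random subset each iteration
      metric(w_new, d[idx, ]) > metric(w_cur, d[idx, ])
    })
    mean(z)  # fraction of subsets where new beats current; 1.0 yields the "100%" claim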
Comment #14450867 not loaded

Comment #14452126 not loaded
EternalData (almost 8 years ago)
Great read. I loved the honest balance between engineering features and how much they ultimately ended up mattering to users. Oftentimes, the features we select can be quite arbitrary -- it's good to do gut checks by running real-time validation of results as often as possible. Fortunately, at Stack, you've got the userbase to do just that :)
amenod (almost 8 years ago)
> Genetic algorithm running on 56-core machine

Wow, just... wow. I wonder why they didn't use (multiple) GPUs instead? I would guess it would be far more efficient in every respect, especially now that there is TensorFlow & co.
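To be fair, the CPU version is trivial to fan out across cores in R itself; a toy sketch (invented population and fitness function, not theirs):

    library(parallel)
    population <- replicate(200, runif(5), simplify = FALSE)  # 200 candidate weight vectors
    fitness <- function(w) sum(w^2)                           # stand-in fitness function
    # mclapply forks the evaluation across cores (not available on Windows)
    scores <- mclapply(population, fitness, mc.cores = detectCores())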
Comment #14451181 not loaded

Comment #14452405 not loaded

Comment #14450686 not loaded