Hey all,<p>I did a lot of the ML work for this. Let me know if you have any questions.<p>The title might be a little ambitious since we only have two embeddings right now, but it really is our goal to have embeddings for <i>everything</i>. You can see some of our upcoming embeddings at <a href="https://www.basilica.ai/available-embeddings/" rel="nofollow">https://www.basilica.ai/available-embeddings/</a>.<p>We basically want to do for these other datatypes what word2vec did for NLP. We want to turn getting good results with images, audio, etc. from a hard research problem into something you can do on your laptop with scikit.
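To make the "do it on your laptop" point concrete, here is a minimal sketch of the kind of workflow being described. The vectors are made up for illustration (real embeddings would be high-dimensional floats returned by the API): once your images are vectors, similarity search is a few lines of plain Python.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embedding vectors for three images (real ones are ~2048-dim).
embeddings = {
    "cat.jpg":  [0.9, 0.1, 0.0],
    "lion.jpg": [0.8, 0.2, 0.1],
    "car.jpg":  [0.0, 0.9, 0.4],
}

# Rank all images by similarity to the query image.
query = embeddings["cat.jpg"]
neighbors = sorted(embeddings, key=lambda k: -cosine(query, embeddings[k]))
print(neighbors)  # ['cat.jpg', 'lion.jpg', 'car.jpg']
```

The same vectors could just as easily be fed to a scikit-learn classifier or clusterer as ordinary feature matrices.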
Interesting idea, but it seems very much to fall into the category of something you would often want to build in-house. I always imagined the right level of abstraction was closer to spaCy's: a framework that lets you easily embed all the things.<p>If you are interested in how to build and use embeddings for search and classification yourself, I wrote a completely open-source tutorial here: <a href="https://blog.insightdatascience.com/the-unreasonable-effectiveness-of-deep-learning-representations-4ce83fc663cf" rel="nofollow">https://blog.insightdatascience.com/the-unreasonable-effecti...</a>
What is the use case for this? (And this is a general point for AI cloud APIs)<p>Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.<p>I can think of why it would be useful: the ML examples given, or perhaps a compression application.<p>However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.
How do you plan to counter the harmful societal biases that embeddings embody?<p>See Bolukbasi (<a href="https://arxiv.org/pdf/1607.06520.pdf" rel="nofollow">https://arxiv.org/pdf/1607.06520.pdf</a>)
and Caliskan (<a href="http://science.sciencemag.org/content/356/6334/183" rel="nofollow">http://science.sciencemag.org/content/356/6334/183</a>)<p>While these examples are solely language based, it is easy to imagine the transfer to other domains.
Aren't these embeddings task-specific? For example, a word2vec embedding is found by letting the embedder participate in a task to predict a word given the words around it, on a particular corpus of text.<p>Sentence embeddings are trained on translation tasks. An embedding that works for both images and sentences is found by training on an image-captioning task.<p>The point I'm asking about is that there may be many ways to embed a "data type", depending on what you want to use the embedding for. Someone brought up board game states. You could imagine embedding images of board games directly, but that embedding would only contain information about the game state if it was trained for the appropriate task.
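For concreteness, the skip-gram task that word2vec trains on can be sketched in a few lines (illustrative only: real training then optimizes embedding vectors so that targets predict their contexts, and that objective is exactly what makes the result task-specific):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for the skip-gram task:
    each word must predict the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```

Swap the corpus or the prediction task (captioning, translation, game outcomes) and you get a different embedding of the "same" data type.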
You quote a target of 200ms per embedding, not sure if that's for one type of embedding in particular. I am using InferSent (a sentence embedding from FAIR, <a href="https://github.com/facebookresearch/InferSent" rel="nofollow">https://github.com/facebookresearch/InferSent</a>) for filtering, and they quote a number of 1,000 sentences per second on a generic GPU. That's 200 times faster than your number, but it is a local API, so I am comparing apples to oranges. Yet it's hard to imagine you are spending 1ms embedding and 199ms on API overhead. I am sure I have missed a zero here or there, but I don't see where, other than that theirs is a batch number (batch size 128) and maybe yours is a single-embedding number. Can you please clarify? Thanks
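For what it's worth, the two numbers can be reconciled with some back-of-envelope arithmetic (assumptions, not measurements):

```python
# InferSent's quoted figure: ~1000 sentences/sec at batch size 128, local GPU.
batch_size = 128
throughput = 1000                             # sentences/sec, batched
amortized_ms = 1000 / throughput              # amortized cost per sentence: 1 ms
batch_latency_ms = batch_size * amortized_ms  # wall-clock time for one batch: 128 ms

api_latency_ms = 200                          # quoted per-request number, hosted API
print(amortized_ms, batch_latency_ms, api_latency_ms)  # 1.0 128.0 200
```

So a single unbatched request could plausibly cost close to one whole batch of GPU time plus network overhead, which would put 200ms/request and 1,000/sec batched in the same ballpark rather than 200x apart.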
How much does this depend on the data type? I.e. do you need people to specify: this is an image, this is a resume, this is an English resume, etc. Could you ever get to a point where you can just feed it general data, not knowing more than that it's 1s and 0s?
Slightly different topic, but what are some approaches to categorizing webpages? For example, I have thousands of web links I want to organize with tags. Is there a software technique to group them by related topics?
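One simple starting point (a stdlib-only sketch; real pipelines would usually use TF-IDF vectors or embeddings rather than raw word sets): represent each page by the set of words in its text or title, then greedily group pages whose Jaccard similarity to an existing group exceeds a threshold.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Hypothetical pages, keyed by URL, with their extracted text.
pages = {
    "a.com": "python machine learning tutorial",
    "b.com": "machine learning with python examples",
    "c.com": "best pizza recipes dough",
}

groups = []  # each group is a list of (url, word_set) pairs
for url, text in pages.items():
    words = set(text.split())
    for group in groups:
        # Compare against the group's first member; 0.3 is an arbitrary threshold.
        if jaccard(words, group[0][1]) > 0.3:
            group.append((url, words))
            break
    else:
        groups.append([(url, words)])

grouped = [[url for url, _ in g] for g in groups]
print(grouped)  # [['a.com', 'b.com'], ['c.com']]
```

From there, a tag for each group can be picked from its most frequent distinctive words.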
Is this actually 'for anything'? I see references to sentences and images. If I, for example, wanted to compare audio samples, how would it work?
>Job Candidate Clustering<p>>Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.<p>Wonderful! We were in dire need of yet another black-box criterion based on which employers can reject candidates.<p>“We’re sorry to inform you that we have chosen not to go forward with your application. You see, for this position we’re looking for someone with a different <i>embedding</i>.”
Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?<p>There are a number of off-the-shelf models that will give you image/sentence embeddings easily. Anyone with a sufficient understanding of embeddings/word2vec would have no trouble training an embedding catered to their specific application, with much better quality.<p>For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2vec trained on the Google News corpus is not going to be useful for a chatbot, for instance. Different models also give different embedding quality. As an example, if you use Google BERT (a bidirectional Transformer) you get world-class performance in many NLP applications.<p>Embeddings are so model/application-specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is easy. Calling the TensorFlow API is probably easier than calling the Basilica API 99% of the time.<p>I'd be curious whether the embeddings are "aligned", in the sense that the embedding of the word "cat" is close to the embedding of a picture of a cat. I think that would be interesting and useful. I don't see how Basilica solves that problem by taking the top layers off ResNet, though.<p>I appreciate the developer API etc., but as an ML practitioner this feels like a troll.
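To illustrate what "aligned" would mean here (all vectors below are made up for illustration, not real model output): independently trained text and image spaces put the same concept on unrelated axes, so raw cosine similarity is meaningless until you apply an explicit alignment map; in the toy case below the misalignment is a known 90-degree rotation, so undoing it lines the spaces up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

text_cat = [1.0, 0.0]   # "cat" in a text embedding space
image_cat = [0.0, 1.0]  # a cat photo in an independently trained image space

# Same concept, unrelated axes: cosine similarity is zero.
print(cosine(text_cat, image_cat))  # 0.0

# A learned alignment is typically a linear map; here we simply know the
# image space is the text space rotated 90 degrees, so we rotate it back.
aligned_cat = [image_cat[1], -image_cat[0]]
print(cosine(text_cat, aligned_cat))  # 1.0
```

In practice that linear map has to be learned from paired data (e.g. captioned images), which is exactly why chopping the top off a unimodal ResNet doesn't give you a joint text-image space for free.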