Hey all,<p>I did a lot of the ML work for this. Let me know if you have any questions.<p>The title might be a little ambitious since we only have two embeddings right now, but it really is our goal to have embeddings for <i>everything</i>. You can see some of our upcoming embeddings at <a href="https://www.basilica.ai/available-embeddings/" rel="nofollow">https://www.basilica.ai/available-embeddings/</a>.<p>We basically want to do for these other datatypes what word2vec did for NLP. We want to turn getting good results with images, audio, etc. from a hard research problem into something you can do on your laptop with scikit.
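To make the "do it on your laptop" point concrete, here is a minimal sketch of the kind of workflow being described. The vectors are made up for illustration (real embeddings would be high-dimensional floats returned by the API): once your images are vectors, similarity search is a few lines of plain Python.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embedding vectors for three images (real ones are ~2048-dim).
embeddings = {
    "cat.jpg":  [0.9, 0.1, 0.0],
    "lion.jpg": [0.8, 0.2, 0.1],
    "car.jpg":  [0.0, 0.9, 0.4],
}

# Rank all images by similarity to the query image.
query = embeddings["cat.jpg"]
neighbors = sorted(embeddings, key=lambda k: -cosine(query, embeddings[k]))
print(neighbors)  # ['cat.jpg', 'lion.jpg', 'car.jpg']
```

The same vectors could just as easily be fed to a scikit-learn classifier or clusterer as ordinary feature matrices.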
Interesting idea, but it seems very much to fall into the category of something you would often want to build in-house. I always imagined the right level of abstraction was closer to spaCy's: a framework that lets you easily embed all the things.<p>If you are interested in how to build and use embeddings for search and classification yourself, I wrote a completely open-source tutorial here: <a href="https://blog.insightdatascience.com/the-unreasonable-effectiveness-of-deep-learning-representations-4ce83fc663cf" rel="nofollow">https://blog.insightdatascience.com/the-unreasonable-effecti...</a>
What is the use case for this? (And this is a general point for AI cloud APIs)<p>Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.<p>I can think of why it would be useful: the ML examples given, or perhaps a compression application.<p>However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.
How do you plan to counter the harmful societal biases that embeddings embody?<p>See Bolukbasi (<a href="https://arxiv.org/pdf/1607.06520.pdf" rel="nofollow">https://arxiv.org/pdf/1607.06520.pdf</a>)
and Caliskan (<a href="http://science.sciencemag.org/content/356/6334/183" rel="nofollow">http://science.sciencemag.org/content/356/6334/183</a>)<p>While these examples are solely language based, it is easy to imagine the transfer to other domains.
Aren't these embeddings task-specific? For example, a word2vec embedding is found by letting the embedder participate in a task to predict a word given the words around it, on a particular corpus of text.<p>Sentence embeddings are trained on translation tasks. An embedding that works for both images and sentences is found by training on an image-captioning task.<p>The point I'm asking about is that there may be many ways to embed a "data type", depending on what you want to use the embedding for. Someone brought up board game states. You could imagine embedding images of board games directly, but that embedding would only contain information about the game state if it was trained for the appropriate task.
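For concreteness, the skip-gram task that word2vec trains on can be sketched in a few lines (illustrative only: real training then optimizes embedding vectors so that targets predict their contexts, and that objective is exactly what makes the result task-specific):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for the skip-gram task:
    each word must predict the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```

Swap the corpus or the prediction task (captioning, translation, game outcomes) and you get a different embedding of the "same" data type.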
You quote a target of 200ms per embedding, not sure if that's for one type of embedding in particular. I am using InferSent (a sentence embedding from FAIR, <a href="https://github.com/facebookresearch/InferSent" rel="nofollow">https://github.com/facebookresearch/InferSent</a>) for filtering, and they quote a number of 1,000 sentences per second on a generic GPU. That's 200 times faster than your number, but it is a local API, so I am comparing apples to oranges. Yet it's hard to imagine you are spending 1ms embedding and 199ms on API overhead. I am sure I have missed a zero here or there, but I don't see where, other than that theirs is a batch number (batch size 128) and maybe yours is a single-embedding number. Can you please clarify? Thanks
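For what it's worth, the two numbers can be reconciled with some back-of-envelope arithmetic (assumptions, not measurements):

```python
# InferSent's quoted figure: ~1000 sentences/sec at batch size 128, local GPU.
batch_size = 128
throughput = 1000                             # sentences/sec, batched
amortized_ms = 1000 / throughput              # amortized cost per sentence: 1 ms
batch_latency_ms = batch_size * amortized_ms  # wall-clock time for one batch: 128 ms

api_latency_ms = 200                          # quoted per-request number, hosted API
print(amortized_ms, batch_latency_ms, api_latency_ms)  # 1.0 128.0 200
```

So a single unbatched request could plausibly cost close to one whole batch of GPU time plus network overhead, which would put 200ms/request and 1,000/sec batched in the same ballpark rather than 200x apart.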
How much does this depend on the data type? I.e. do you need people to specify: this is an image, this is a resume, this is an English resume, etc. Could you ever get to a point where you can just feed it general data, not knowing more than that it's 1s and 0s?
Slightly different topic, but what are some approaches to categorizing webpages? For example, I have thousands of web links I want to organize with tags. Is there a software technique to group them by related topics?
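One simple starting point (a stdlib-only sketch; real pipelines would usually use TF-IDF vectors or embeddings rather than raw word sets): represent each page by the set of words in its text or title, then greedily group pages whose Jaccard similarity to an existing group exceeds a threshold.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Hypothetical pages, keyed by URL, with their extracted text.
pages = {
    "a.com": "python machine learning tutorial",
    "b.com": "machine learning with python examples",
    "c.com": "best pizza recipes dough",
}

groups = []  # each group is a list of (url, word_set) pairs
for url, text in pages.items():
    words = set(text.split())
    for group in groups:
        # Compare against the group's first member; 0.3 is an arbitrary threshold.
        if jaccard(words, group[0][1]) > 0.3:
            group.append((url, words))
            break
    else:
        groups.append([(url, words)])

grouped = [[url for url, _ in g] for g in groups]
print(grouped)  # [['a.com', 'b.com'], ['c.com']]
```

From there, a tag for each group can be picked from its most frequent distinctive words.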
Is this actually 'for anything'? I see references to sentences and images. If I, for example, wanted to compare audio samples, how would it work?
>Job Candidate Clustering<p>>Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.<p>Wonderful! We were in dire need of yet another black-box criterion based on which employers can reject candidates.<p>“We’re sorry to inform you that we have chosen not to go forward with your application. You see, for this position we’re looking for someone with a different <i>embedding</i>.”
Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?<p>There are a number of off-the-shelf models that will give you image/sentence embeddings easily. Anyone with a sufficient understanding of embeddings/word2vec would have no trouble training an embedding catered to their specific application, with much better quality.<p>For NLP applications, the corpus quality dictates the quality of the embedding if you use simple W2V. Word2vec trained on the Google News corpus is not going to be useful for a chatbot, for instance. Different models also give different embedding quality. As an example, if you use Google BERT (a bidirectional Transformer) you get world-class performance in many NLP applications.<p>Embeddings are so model/application-specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is easy. Calling the TensorFlow API is probably easier than calling the Basilica API 99% of the time.<p>I'd be curious whether the embeddings are "aligned", in the sense that the embedding of the word "cat" is close to the embedding of a picture of a cat. I think that would be interesting and useful. I don't see how Basilica solves that problem by taking the top layers off ResNet, though.<p>I appreciate the developer API etc., but as an ML practitioner this feels like a troll.
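To illustrate what "aligned" would mean here (all vectors below are made up for illustration, not real model output): independently trained text and image spaces put the same concept on unrelated axes, so raw cosine similarity is meaningless until you apply an explicit alignment map; in the toy case below the misalignment is a known 90-degree rotation, so undoing it lines the spaces up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

text_cat = [1.0, 0.0]   # "cat" in a text embedding space
image_cat = [0.0, 1.0]  # a cat photo in an independently trained image space

# Same concept, unrelated axes: cosine similarity is zero.
print(cosine(text_cat, image_cat))  # 0.0

# A learned alignment is typically a linear map; here we simply know the
# image space is the text space rotated 90 degrees, so we rotate it back.
aligned_cat = [image_cat[1], -image_cat[0]]
print(cosine(text_cat, aligned_cat))  # 1.0
```

In practice that linear map has to be learned from paired data (e.g. captioned images), which is exactly why chopping the top off a unimodal ResNet doesn't give you a joint text-image space for free.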