I've been writing a whole blog series on this as well: <a href="http://bitquabit.com/post/having-fun-python-and-elasticsearch-part-1/" rel="nofollow">http://bitquabit.com/post/having-fun-python-and-elasticsearc...</a>. It's interesting to see a different take on it.
This is a totally shameless plug but if you'd like to learn Elasticsearch from scratch, I've got an introductory course up on Pluralsight. <a href="http://www.pluralsight.com/courses/elasticsearch-for-dotnet-developers" rel="nofollow">http://www.pluralsight.com/courses/elasticsearch-for-dotnet-...</a>
This is the first time I've seen a GitHub README used as a blogging tool. Is this common? I've started linking to a Vagrant/Ansible repo for my setup- and code-intensive posts, but having the code and the text together in one repo is quite novel.
There are a couple of libraries listed below. Would using any of them make life easier with Elasticsearch + Python?<p>- <a href="https://github.com/elasticsearch/elasticsearch-py" rel="nofollow">https://github.com/elasticsearch/elasticsearch-py</a> (low-level lib, from ES)<p>- <a href="https://github.com/elasticsearch/elasticsearch-dsl-py" rel="nofollow">https://github.com/elasticsearch/elasticsearch-dsl-py</a> (high-level lib, from ES)<p>- <a href="https://github.com/mozilla/elasticutils" rel="nofollow">https://github.com/mozilla/elasticutils</a> (high-level lib, from Mozilla)<p>There are a few more, but they are either obsolete or don't have much traction. There's also django-haystack, but that's specific to Django.
I've been thinking about making my own email searchable with Elasticsearch. The main thing holding me back is security. With Elasticsearch listening on localhost:9200, anyone with local access can read all your mail. Even if you did this on a computer you fully control, even a tiny breach would leak all your mail.<p>I realize this tutorial is just meant to get started with Elasticsearch and not meant as a tool to make your email searchable. Still, it would be interesting to take this to the next level.
Not sure if people are still here. I tried working through this and it appears to be failing on the import. I am running a Vagrant box and everything installs just fine.<p>I don't know how to invoke the script properly.<p>I've tried so many ways. This seems like it should give results, but it does nothing much:<p><pre><code> python index_emails.py test.mbox</code></pre><p>Any help or tips are appreciated! This has been a fun project so far. Stumbling at the end. Thanks!
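One way to narrow this down, as a minimal sketch rather than a diagnosis: check whether Python's standard `mailbox` module finds any messages in the file at all before involving Elasticsearch. The helper name `count_messages` is made up for illustration, and this assumes the file is a plain mbox file like the `test.mbox` in the command above.

```python
import mailbox


def count_messages(path):
    """Return how many messages the stdlib mbox parser finds in `path`.

    create=False makes a missing file raise mailbox.NoSuchMailboxError
    instead of silently creating an empty mailbox.
    """
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

If `count_messages("test.mbox")` returns 0 or raises, the problem is probably the path or the file format rather than the indexing step.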
Just a word of caution: by default, Elasticsearch allows everyone access to the indexed data. If you're doing this on a world-reachable machine with sensitive data, you should lock it down or verify that it's already locked down.<p>There are a number of authentication solutions, and they will require additional configuration: plugins like jetty and elasticsearch-http-basic.
The whole point of GMail was supposed to be that it was searchable. Did Google break that, or what?<p>If there's a demand for this, it might be worthwhile to build IMAP servers with more indexing. It's easy to request searches with IMAP, but the performance can be a problem for IMAP servers that aren't real databases.
Very interesting. This is a very useful and practical way of learning new things, as opposed to just reading an article about them. I don't know Python, but I was able to understand each and every bit of it, and I will be coming back to this if I ever need to incorporate Elasticsearch.
The 'notmuch' mail indexing system uses Xapian. I can grep through my 200k messages in seconds.<p><a href="http://notmuchmail.org/" rel="nofollow">http://notmuchmail.org/</a><p>Since it's implemented as a "library" of sorts, there are interfaces for emacs, command line, GTK, mutt, ...
Analysing the "Turn mbox into JSON" section<p><a href="http://paste.lisp.org/display/145050" rel="nofollow">http://paste.lisp.org/display/145050</a>
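For anyone who wants to poke at that step without opening the paste, here is a rough standard-library sketch of turning an mbox file into JSON documents. It is not the article's code: the function names are hypothetical, repeated headers (such as Received) overwrite each other, and multipart bodies are skipped.

```python
import json
import mailbox


def message_to_dict(msg):
    """Flatten one message into a JSON-serialisable dict (simplified)."""
    # Header names are lower-cased; a repeated header keeps only its last value.
    doc = {name.lower(): str(value) for name, value in msg.items()}
    # This sketch only extracts the body of simple, non-multipart messages.
    if not msg.is_multipart():
        payload = msg.get_payload(decode=True) or b""
        doc["body"] = payload.decode("utf-8", errors="replace")
    return doc


def mbox_to_json_lines(path):
    """Yield one JSON string per message in the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            yield json.dumps(message_to_dict(msg))
    finally:
        box.close()
```

Each yielded string is one document, which is roughly the shape you would hand to an indexing call.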
Couldn't this be "Indexing your mbox files"? It seems applicable to any mailbox that is in, or can be converted to, that format. Except for the x-gmail-labels part, of course.<p>Anyway, if you do want to accomplish the stated purpose of finding which emails are taking up space, you can search in Gmail with the "larger" operator, as in "larger:20MB".
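If you already have the mbox export on disk, the same "what is taking up space" question can be answered locally. A minimal sketch using only the standard library; `largest_messages` is a made-up helper name, and size here means the raw byte length of each message as stored in the file, attachments included.

```python
import mailbox


def largest_messages(path, top=10):
    """Return up to `top` (size_in_bytes, subject) pairs, largest first."""
    box = mailbox.mbox(path, create=False)
    try:
        sizes = [
            (
                len(box.get_bytes(key)),  # raw size of the stored message
                str(box.get_message(key).get("subject", "")),
            )
            for key in box.keys()
        ]
    finally:
        box.close()
    # Sort on size only, so ties never fall back to comparing subjects.
    return sorted(sizes, key=lambda pair: pair[0], reverse=True)[:top]
```

Calling `largest_messages("test.mbox", top=5)` would list the five biggest messages in that file.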
So when should you use Elasticsearch? Can't you get away with doing<p><pre><code> SELECT id FROM pages WHERE title LIKE "%elastic"</code></pre>