I'm training a machine learning model using an SVM in Python and it took ages to run on my local machine (with only 10% of the data that I have).

I'm getting an 80-90% correct prediction score on the same subject's data, so now I want to add in the rest of the data (11 more subjects).

I thought of offloading it to my EC2 instance, but I'm on a budget, so I can't just spin up a 30-CPU instance.

On top of everything, the code only ever uses 1 CPU at 100%, so I'm not sure how effective that would be.

What do you all use to train these models?
Speed comes from two things: implementation and algorithm. Algorithmically, the way to learn quickly is to use some sort of stochastic gradient method, i.e. learn from examples one by one, as opposed to as a batch.

As far as implementation goes, you need dense arrays. A native Python implementation will usually be lists of Python objects, which is very slow.

If you just need an SVM implementation, libsvm is pretty good. I'm assuming you need a non-linear kernel; if you're using a linear kernel then there's not really a difference between SVM and MaxEnt (well, there is, but not much).

If your data is very sparse then there aren't many general-purpose implementations that are any good. The scipy.sparse module has some key pieces implemented in pure Python and doesn't interoperate properly with the rest of the PyData ecosystem. I had to implement my own sparse data structures, in Cython.
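To make the stochastic-vs-batch point concrete, here's a minimal sketch using scikit-learn (my choice of library, not something the OP mentioned): SGDClassifier with a hinge loss trains a linear SVM by stochastic gradient descent on a dense NumPy array, while SVC wraps libsvm for a batch-trained kernel SVM. The data below is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Synthetic dense feature matrix and labels -- stand-ins for your own data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 50))            # dense NumPy array, not lists of Python objects
y = (X[:, 0] + 0.1 * rng.standard_normal(5_000) > 0).astype(int)

# Stochastic approach: linear SVM trained with SGD, one example at a time.
sgd = SGDClassifier(loss="hinge", max_iter=5, tol=None)
sgd.fit(X, y)

# Batch approach: libsvm-backed kernel SVM; training cost grows much faster
# with the number of samples than the SGD version does.
svc = SVC(kernel="rbf")
svc.fit(X, y)
```

On a dataset that fits in memory, the SGD version typically finishes in seconds where the kernel SVM can take orders of magnitude longer, which is the trade-off being described above.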
One approach is to convert the code to use parallelism. For an example of how to do that in Python using joblib, see this article: http://blog.dominodatalab.com/simple-parallelization/

Even if you can't afford a 32-core instance, you can probably use the 4 cores in your laptop.
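SVM training itself won't parallelize this way, but independent runs (e.g. one model per parameter setting, or per subject) will. A rough sketch with joblib, where the data and parameter grid are made up for illustration:

```python
from joblib import Parallel, delayed
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Toy data standing in for the real subjects.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def evaluate(C):
    # Each worker trains and scores its own SVM independently.
    return C, cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=3).mean()

# n_jobs=-1 runs one job per available CPU core on the local machine.
results = Parallel(n_jobs=-1)(delayed(evaluate)(C) for C in [0.1, 1.0, 10.0])
print(results)
```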
How much data and how long are we talking about? If it fits in memory, then the slowness is likely due to a bottleneck elsewhere in your code, not the SVM training itself (unless you wrote the SVM implementation yourself).
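Before paying for hardware, it's worth profiling to confirm where the time actually goes. Something along these lines, where train() is a placeholder for your own data loading and fitting code:

```python
import cProfile
import pstats

def train():
    # Placeholder: put your data loading + SVM fitting here.
    pass

# Profile the run and show the 10 functions with the most cumulative time.
cProfile.run("train()", "train.prof")
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(10)
```

If the top entries are in your feature-extraction or I/O code rather than the SVM fit, more CPUs won't help much.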