Accelerating scikit-learn is a smart move. At the algorithmic level, for every ML use case there are probably ten non-ML data science projects. Also, it is good to have a true community framework that does not depend on the success of the metaverse for funding ;-)

The lock-in is an important consideration, but if the scikit-learn API is fully respected it seems less relevant. It also suggests a pattern for how other hardware vendors could accelerate scikit-learn as a genuine contribution?
Hi all,

Currently some work is being done to improve the computational primitives of scikit-learn, to enhance its overall performance natively.

You can have a look at this exploratory PR: https://github.com/scikit-learn/scikit-learn/pull/20254

This other PR is a cleaner revamp of the previous one: https://github.com/scikit-learn/scikit-learn/pull/21462

Cheers,
Julien.
Intel seems 6 years too late to the party CUDA started. That said, it could pick up traction: academics have increasingly been using PyTorch.

EDIT: Perhaps it's my inexperience, but is anyone else confused by the oneAPI rollout? There isn't exactly backwards compatibility with the Classic Intel compiler, and an embarrassing amount of time elapsed before I realized that "Data Parallel C++" doesn't refer to parallel programming in C++, but rather to an Intel-developed API built atop C++.
Just tried the patch in Google Colab, and the results for the example code were actually about 20% slower than without the patch.

https://imgur.com/a/7EmlYJy

What am I missing?

edit: it seems my instance was using an AMD EPYC.
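In case it helps anyone reproduce this, a minimal sketch of the kind of timing comparison involved (the estimator, data shape, and cluster count here are arbitrary choices, not the actual example code); the patch has to run before the sklearn import, and the unpatched baseline belongs in a fresh process:

    import time

    import numpy as np
    from sklearnex import patch_sklearn

    patch_sklearn()  # must run before importing estimators from sklearn

    from sklearn.cluster import KMeans

    X = np.random.rand(100_000, 50)

    t0 = time.perf_counter()
    KMeans(n_clusters=10, n_init=10).fit(X)
    print(f"patched fit: {time.perf_counter() - t0:.2f}s")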
The syntax and usability of

    from sklearnex import patch_sklearn
    # The names match scikit-learn estimators
    patch_sklearn("SVC")
seems quite clunky. I'd have preferred a syntax like

    from sklearnex import SVC
Then, maintenance would be substantially easier. If sklearnex had import-level compatibility with sklearn, it would be as simple as a few mechanical replacements,

    import sklearn                      --> import sklearnex as sklearn
    from sklearn.cluster import KMeans  --> from sklearnex.cluster import KMeans
which seems much easier / clearer.
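For what it's worth, here is how the full patch-based flow looks as I understand it (the dataset is synthetic, purely for illustration); the subtle part is that patch_sklearn has to run before the sklearn import, otherwise the name you imported still points at the stock implementation:

    from sklearnex import patch_sklearn

    patch_sklearn("SVC")  # must happen before the sklearn import below

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
    clf = SVC(kernel="rbf").fit(X, y)  # now backed by the patched implementation
    print(clf.score(X, y))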
A 5000x boost in KNN inference is not bad.

Generally speaking, the distribution-packaged versions of Python and all its scientific libraries and their support libraries are best ignored. That stuff should always be rebuilt to suit your actual production hardware, instead of a 2007-era Opteron.
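One quick diagnostic short of a full rebuild: check which BLAS/LAPACK your NumPy was linked against, since a generic reference BLAS usually means the linear algebra underneath scikit-learn is unoptimized:

    import numpy as np

    # Prints the BLAS/LAPACK build configuration this NumPy was compiled with
    np.show_config()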
As cool as this is, why would you lock yourself into Intel? Especially with cloud providers making Arm processors available at lower prices.

At the same time:

"Intel® Extension for Scikit-learn* is a free software AI accelerator that brings over 10-100X acceleration across a variety of applications."

Maybe their free software could be extended to all processors?
> oneAPI Data Analytics Library (oneDAL) is a powerful machine learning library that helps speed up big data analysis. oneDAL solvers are also used in Intel Distribution for Python for scikit-learn optimization.

> oneDAL is part of oneAPI.

So oneAPI is cross-industry, but this only works with Intel CPUs?

Hmm. Not sure I'm buying this, Intel. Sounds like you're claiming to be open but locking people into Intel-only libraries.
<a href="https://github.com/intel/scikit-learn-intelex" rel="nofollow">https://github.com/intel/scikit-learn-intelex</a><p>CuML is similar to Intel Extension for Scikit-Learn in function?
<a href="https://github.com/rapidsai/cuml" rel="nofollow">https://github.com/rapidsai/cuml</a><p>> <i>cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects. cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn. For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.</i>
Is there a specific "test" to run as a performance standard for scikit-learn? I noticed the other day that my Mac mini M1 absolutely blows away my MacBook Air 2020 with an i7. I was always curious if there was a good way to gauge performance.
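For context, my crude comparison was just timing a fixed synthetic workload on each machine, something along these lines (the estimator, shapes, and parameters are arbitrary choices):

    import time

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.random((50_000, 32), dtype=np.float32)
    y = rng.integers(0, 2, size=50_000)

    t0 = time.perf_counter()
    RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
    print(f"fit: {time.perf_counter() - t0:.1f}s")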