<p><pre><code>from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

# Train on 200 samples (20 per class), evaluate on all the rest
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    random_state=1729,
                                                    test_size=X.shape[0] - (10 * 20))
model = MLPClassifier(random_state=1729)
model.fit(X_train, y_train)
p = model.predict(X_test)
print(accuracy_score(y_test, p))

# Train on 2,000 samples (200 per class), evaluate on all the rest
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    random_state=1729,
                                                    test_size=X.shape[0] - (10 * 200))
model = MLPClassifier(random_state=1729)
model.fit(X_train, y_train)
p = model.predict(X_test)
print(accuracy_score(y_test, p))
</code></pre>
This gets you accuracy scores of 0.645 and 0.838, respectively (versus 62% and 76% in the paper). Sure, the validation differs: I evaluate on all the remaining data, while they do 20 repetitions of 70/30 splits on the 200 and 2,000 samples, which needlessly lowers the number of training samples (a fairer comparison is 0.819 with 1,400 training samples). Still, the scores seem at least comparable. Cool method though, I can dig this and look beyond benchmarks (though Iris and Wine are really toy datasets by now).
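For reference, here is a minimal sketch of that "fairer" comparison: draw a stratified 2,000-sample subset, then do a single 70/30 split within it, so the classifier trains on 1,400 samples and is scored on the remaining 600. This isn't the exact script behind the 0.819 number (they average 20 repeated splits, and the score will wobble a bit with the seed), just the setup.
<pre><code>from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

# Stratified 2,000-sample subset (200 per class), as in the second run above
X_sub, _, y_sub, _ = train_test_split(X, y,
                                      stratify=y,
                                      random_state=1729,
                                      train_size=10 * 200)

# Single 70/30 split within that subset: 1,400 training / 600 test samples
X_train, X_test, y_train, y_test = train_test_split(X_sub, y_sub,
                                                     stratify=y_sub,
                                                     random_state=1729,
                                                     test_size=0.3)

model = MLPClassifier(random_state=1729)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
</code></pre>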