The commoditization of data science
The Factory at Asnières, Van Gogh, 1887
One of my favorite subreddits is (sorry in advance) r/programmingcirclejerk, because it offers a place to call out some of the more ridiculous, lofty statements people make about programming languages. Sometimes it can be mean-spirited, but often it’s right on the mark. One recent post quoted a blog post that said:
I often think Python is too easy. Can you really call it "programming" if you can generate classification predictions with only 6 lines of code? Especially if 3 of those lines are a dependency and your training data, I would argue that someone else did the real programming.
The discussion centered on how ridiculous it is to rebuild programming APIs from scratch when a community that specializes in that particular problem has already built a full set of them.
This cut to the heart of a trend I’ve been thinking about recently: how the process of data science itself is becoming a commodity.
To be clear, not analysis. Data analysis will never be automated, because it involves too much business logic, trial and error, and human involvement. But the data science models and the underlying algorithms, the pieces of code that go something like this:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
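
That snippet is, give or take, the canonical linear regression example from the scikit-learn documentation. For context, here is a minimal sketch of how that example typically continues, assuming the diabetes data loaded above: split the targets, fit a model, score the predictions.

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Fit an ordinary least squares model on the training data
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)

# Generate predictions for the held-out test set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Score the predictions with the metrics imported above
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

Which is exactly the point: a trained, evaluated model in a handful of lines, almost all of them library calls.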