Coming from a programming background where you write code, test, deploy, and run, I'm trying to wrap my head around the concept of "training a model" (or a "trained model") in data science, and deploying that trained model.
I'm not really concerned with the deployment environment, automation, etc. I'm trying to understand the deployment unit: a trained model. What does a trained model look like on a file system, and what does it contain?
I understand the concept of training a model, and of splitting a data set into a training set and a test set. But let's say I have a notebook (Python/Jupyter), load in some data, split it between training and testing data, and run an algorithm to "train" my model. What is my deliverable under the hood? While I'm training a model, I'd think a certain amount of data is being stored in memory, so how does that become part of the trained model? It obviously can't contain all the data used for training. For instance, if I'm training a (retrieval-based) chatbot agent, what is actually happening during training after I add examples of user questions or "intents", and what is my deployable as far as a trained model goes? Does this trained model contain some sort of summation of the training data, or an array of terms? How large (in deployable size) can it get?
While the question may seem relatively simple ("what is a trained model?"), how would I explain it to a DevOps tech in simple terms? This is an "IT guy interested in data science trying to understand the tangible unit of a trained model in a discussion with a data science guy".
Thanks
It depends on the model. For linear regression, training gives you the coefficients: the slope and the intercept (generally). These are the "model parameters". When deployed, traditionally, these coefficients get fed into a different algorithm (literally y = mx + b), and when queried "what should y be when I have this x?", it responds with the appropriate value.
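To make the "deployment unit" concrete, here is a minimal sketch (file name `model.pkl` and the toy data are my own, and the fit uses NumPy's `polyfit` rather than any particular ML library): training produces just two numbers, and those two numbers, serialized to disk, are the entire deliverable.

```python
import pickle
import numpy as np

# Toy training data that happens to follow y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# "Training" a linear regression = solving for slope and intercept
slope, intercept = np.polyfit(x, y, 1)

# The trained model is just these two numbers; the training data
# itself is NOT part of the deliverable.
with open("model.pkl", "wb") as f:
    pickle.dump({"slope": slope, "intercept": intercept}, f)

# "Deployment": load the parameters and serve predictions
with open("model.pkl", "rb") as f:
    params = pickle.load(f)

def predict(x_new):
    # Literally y = mx + b, with the learned m and b
    return params["slope"] * x_new + params["intercept"]
```

Here `model.pkl` is the tangible unit you'd hand to a DevOps tech: a small file of parameters, not the data it was trained on.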
K-means clustering, on the other hand, has vectors (the cluster centers) as its "parameters". The predict algorithm calculates the distance from a given vector to each center and returns the closest cluster. Note that these clusters are often post-processed, so the predict algorithm will say "shoes" rather than "[1, 2, 3, 5]", which is again an example of how these things change in the wild.
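As a sketch of that idea (the centroid values and the "sneakers"/"boots" label map are made up for illustration), the deployed k-means model is just the stored centers plus the post-processing map:

```python
import numpy as np

# After training, a k-means "model" is just the cluster centers
centroids = np.array([
    [0.0, 0.0],    # center of cluster 0
    [10.0, 10.0],  # center of cluster 1
])

# Hypothetical post-processing: map cluster index to a human label
labels = {0: "sneakers", 1: "boots"}

def predict(point):
    # Distance from the query point to each stored center
    dists = np.linalg.norm(centroids - np.asarray(point), axis=1)
    # Return the label of the nearest cluster
    return labels[int(np.argmin(dists))]
```

Again, the training data is gone; all that ships is a small array of vectors and a lookup table.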
Deep learning returns a list of edge weights for a graph. Various parametric methods (as in maximum likelihood estimation) return the coefficients that describe a particular distribution: for a uniform distribution it's the number of buckets, for a Gaussian/normal distribution it's the mean and variance, and more complicated ones have even more, for example skew and conditional probabilities.
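The Gaussian case can be sketched in a few lines (the data here is invented for illustration): after "training", the entire model is two numbers, however large the original data set was.

```python
import numpy as np

# Toy observations; imagine millions of rows instead
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Maximum likelihood estimates for a Gaussian: the trained
# "model" is just (mean, variance)
mu = data.mean()
var = data.var()  # MLE variance divides by n, not n - 1

# To score a new point, plug mu and var into the normal density;
# the raw data is no longer needed.
def density(x_new):
    return np.exp(-((x_new - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
```

This is why trained models can be tiny relative to their training data: what gets deployed is a summary (parameters), not the data itself. Deep learning models follow the same principle, but the parameter list (the edge weights) can run to millions of numbers, which is why their files can reach gigabytes.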