QBoard » Advanced Visualizations » Viz - Python » Train Data & Test Data in Data science

Train Data & Test Data in Data science

  • I am relatively new Data science in python and was exploring some competition on data science, i am getting confused with "Training data Set" and "Test Data Set" . Some projects have merged both and some they have kept separate. What is the rationale behind having two data sets. Any advise will be helpful thanks

     
      September 11, 2021 1:50 PM IST
    0
  • The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to qualify performance. Perhaps traditionally the dataset used to evaluate the final model performance is called the “test set”.
      December 2, 2021 2:58 PM IST
    0
  • "Training data" and "testing data" refer to subsets of the data you wish to analyze. If a supervised machine learning algorithm is being used to do something to your data (ex. to classify data points into clusters), the algorithm needs to be "trained".

    Some examples of supervised machine learning algorithms are Support Vector Machines (SVM) and Linear Regression. They can be used to classify or cluster data that has many dimensions, allowing us to clump data points that are similar together.

    These algorithms need to be trained with a subset of the data (the "training set") being analyzed before they are used on the "test set". Essentially, the training provides an algorithm an opportunity to infer a general solution for some new data it gets presented, much in the same way we as humans train so we can handle new situations in the future.

    Hope this helps!

      September 13, 2021 12:19 PM IST
    0
  • A Dataset is a list of rows and can be split into training and test segments. The reason this is done is to keep a CLEAR separation between the rows of data that are used during the training process of the code (think of it like flashcards that you use to "train" a baby to learn objects) and the rows of data that are used (when you are testing the baby to learn objects). You want them to be separate in order to get an accurate score for how well the algorithm performed (e.g. the baby got 9/10 correct when tested). If you mixed the training rows and the testinrows you won't know if the baby just memorized the training results or actually knew how to recognize 9/10 new images.

    Generally, datasets are given as one set because during code execution it is good to randomly select training and test sets by selecting rows randomly. That way you can run the training a few times and the test various times and can take the average. For example, the baby might get 9/10 the first time,6/10 the next, and 7/10 the last. The average accuracy would then be 73.3%. This is a better representation than just trying it once (which as you can see is not completely accurate).

      October 20, 2021 1:06 PM IST
    0