
Data Science Life Cycle and Its Stages

The Data Science Life Cycle can be mapped out in the following stages:

1. Data Gathering:
This is the process of acquiring data. It can be gathered from various sources such as databases, the web, third-party channels, etc. The data can be structured or unstructured, for example text files, images, sensor data, CSV or Excel files, etc.
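
As a minimal sketch of this step, the snippet below loads data with pandas from a CSV file and from a database table; the file name, database, and table used here are hypothetical placeholders, not part of the original article.

```python
import sqlite3

import pandas as pd

# Load structured data from a CSV file (hypothetical file name)
sales_df = pd.read_csv("sales_2023.csv")

# Load data from a database table (hypothetical SQLite database and table)
conn = sqlite3.connect("company.db")
customers_df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(sales_df.shape, customers_df.shape)
```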



2. Data Pre-Processing:
In this process, the data is cleaned and transformed into a usable format. These first two steps are essential for the data science life cycle to function properly and accurately. To gain a deeper knowledge and understanding of the data, data scientists often work alongside people experienced in the domain, commonly called Subject Matter Experts (SMEs). This stage comprises the following steps:




a. Identify Data Elements:

This step deals with feature engineering of the data, where you have to answer questions such as:
* What do you want to predict?
* On what information does the prediction depend?
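
As a small illustration of separating what to predict from the information it depends on, here is a sketch assuming a hypothetical customer dataset with a "churned" column as the prediction target.

```python
import pandas as pd

# Hypothetical customer dataset: "churned" is the value we want to predict
df = pd.DataFrame({
    "tenure_months": [3, 24, 12, 1],
    "monthly_charges": [70.5, 29.9, 55.0, 99.0],
    "churned": [1, 0, 0, 1],
})

target = df["churned"]                     # what to predict
features = df.drop(columns=["churned"])    # information the prediction depends on
```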

b. Impute missing values:

There are two types of missing data:

MCAR - missing completely at random: This is the desirable scenario in the case of missing data. It is most likely not a serious process issue, although the root cause can be difficult to establish.

MNAR - missing not at random: This is a more serious issue. In this case, it may be wiser to examine the data-gathering process further and try to understand why the information is missing, rather than simply imputing the values. For instance, if most people in a survey did not answer a certain question, why did they skip it? Was the question unclear?

For a further explanation of imputation, read this article.
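
As a rough sketch of simple imputation (assuming numeric columns and scikit-learn's SimpleImputer), values that are plausibly MCAR can be filled with a column statistic such as the mean; the toy data here is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (NaN)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# Mean imputation; other strategies include "median" and "most_frequent"
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```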

c. Dimensionality Reduction:

The Three Secret Laws of Dimensionality

1. The Curse of Dimensionality
When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.

2. Principle (or Law) of Parsimony
Occam's razor, also known as the law of parsimony, is a problem-solving principle attributed to William of Ockham (c. 1287-1347), an English scholar. It states that simpler theories are preferable to more complex ones because they are more testable and interpretable.

3. Overfitting the Curve
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data.

The most commonly used methods to reduce dimensionality are:
* PCA (Principal Component Analysis)
* Factor Analysis (to learn more, see https://www.cluzters.ai/article/244/factor-analysis)
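
Below is a minimal sketch of PCA with scikit-learn on a small synthetic dataset; the choice of two retained components is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 observations, 4 features
X = np.random.RandomState(0).rand(6, 4)

# Standardize first, then project onto the top 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```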

d. Describe your data:

In this step, you should get to know your data before further analysis. This can be achieved by asking the following sets of questions:

Question set 1:

* What is the distribution of the data?
* Is the data accurate?
* Is there a trend that can be discovered?
* Are there any missing elements?

Question set 2:

* What is an outlier?
* What are the types of outliers?
* What are the causes of outliers?
* What is the impact of outliers on the dataset?
* How can outliers be detected?
* How can outliers be removed?
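
A brief sketch of how such questions can be explored with pandas: summary statistics for the distribution, a missing-value count, and the common 1.5 * IQR rule for flagging outliers. The DataFrame here is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 95]})  # 95 looks suspicious

print(df.describe())    # distribution summary
print(df.isna().sum())  # missing elements per column

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(outliers)
```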

e. Data Transformation:

Transformation of the data can be used to make highly skewed distributions less skewed, which can be valuable for making patterns in the data more interpretable.

Types of transformations:

Logarithmic Transformation
* Natural log: ln transformation (log of X to base e, where e = 2.71828182846)
* Common log: log of X to base 10
* Used to treat right-skewed distributions

Exponential Transformation
* Exponential transformation: e^X
* Square/cube transformation
* Used to treat left-skewed distributions
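
A small sketch with NumPy and SciPy showing how a log transformation reduces right skew (log1p is used so zeros are handled safely); the data is synthetic, drawn from an exponential distribution.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed data
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)

x_log = np.log1p(x)   # natural log of (1 + x)
x_sqrt = np.sqrt(x)   # a milder alternative transformation

print(f"skew before: {skew(x):.2f}, after log: {skew(x_log):.2f}, after sqrt: {skew(x_sqrt):.2f}")
```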

f. Partitioning of Data:

This step involves splitting the data into a training set and a test set in a specific ratio; most commonly, 70% of the data is used as the training set and 30% as the test set. The partitioning can vary depending on the size and type of data, the problem statement, the algorithms used, etc.
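
A minimal sketch of a 70/30 split with scikit-learn's train_test_split, using the built-in iris dataset as a stand-in for real project data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% training, 30% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```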

g. Data is ready for modeling:
After going through all of these steps, the data is ready for the next stage.


3. Modeling and Hypothesis:
This stage commonly consists of statistical modeling in data science; however, via machine learning it is applied to all types of data. It is at this stage of the process that training sets and models are created. Validation or test sets are also produced now, for checking the accuracy of the model in the next stage. Various algorithms and techniques are chosen at this stage by examining and understanding the data.
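
As an illustrative sketch only, and not a prescribed choice of algorithm, a simple logistic regression model can be trained on the training partition produced in the previous stage; the iris dataset again stands in for real data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Train a simple baseline model on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```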

4. Evaluation and Interpretation of Data:
Once the modeling has taken place, the model is repeatedly tested, re-evaluated against accuracy and other metrics to check whether it is working as intended, and reshaped. After the desired accuracy is reached, a usable model is created.
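
Continuing the hypothetical example from the previous stage, the held-out test set can be scored with standard scikit-learn metrics; the sketch below is self-contained so it can be run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the model on data it has never seen
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```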

5. Deployment:
Once a usable model has been created, it is deployed using the various techniques and tools available. Deployment is first done on a trial basis to check for any problems that may occur in the final phase of deployment. If any improvements are needed, they are made at this stage.
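
One common (though by no means the only) deployment path is to serialize the trained model and load it back later inside a serving application; the sketch below uses joblib and a hypothetical file name.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model to disk (hypothetical file name) ...
joblib.dump(model, "model.joblib")

# ... and later, inside the serving application, load it back for predictions
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))
```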

6. Operations:
After the model has passed the final phase and been optimized, it can be rolled out into larger operations. Even though the model is deployed, its performance is still monitored and evaluated.

7. Optimization:
While the model is operational, it is constantly improved. The more data a model works on, the more it learns and the more accurate it becomes over time, and the more accurate the model, the better its predictions.

