
Data Science - Building Blocks

Data science is a multidisciplinary field that requires knowledge of mathematics, technology, and the business domain.

Based on the business requirements, the types of analysis needed are:



  • Exploratory analysis is the process of analyzing a dataset to summarize it or get an overview of it. It is often done with visual methods, using libraries like Matplotlib and D3.js or applications like Tableau (see the sketch after this list).

  • Predictive analysis is the major branch of data science, in which models are built from existing data to make predictions on future or unknown data.

  • Prescriptive analysis is an extension of predictive analysis: it not only predicts what will happen but also suggests decision options to change the outcome.

  • Intelligent Process Automation (IPA) is the collection of technologies that come together to manage, automate, and integrate digital processes.
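
As a concrete example of exploratory analysis, the sketch below loads a dataset with pandas, prints summary statistics, and plots a histogram with Matplotlib. The file name and column names are hypothetical placeholders, not part of any particular project.

    # Minimal exploratory-analysis sketch: summarize a dataset and plot
    # the distribution of one column. File and column names are hypothetical.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv")   # hypothetical dataset

    print(df.describe())            # summary statistics for numeric columns
    print(df.isna().sum())          # missing values per column

    df["revenue"].hist(bins=30)     # distribution of a numeric column
    plt.xlabel("Revenue")
    plt.ylabel("Frequency")
    plt.show()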




Tools/Products



  • Visualization - For exploratory analysis, Tableau is a popular tool for creating interactive data visualizations. D3.js is an open-source library used to create visualizations inside web pages.

  • Programming Languages - Python and R are the languages most used by data scientists. Python is useful for building end-to-end products, as it can also be used to create websites. R is preferred for research purposes.

  • Big Data - For dealing with large amounts of data, open-source big data tools like Spark, Hive, and Hadoop are useful (a short PySpark sketch follows this list).
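
To illustrate the big data tooling, here is a minimal PySpark sketch that reads a large CSV file and computes counts per category. It assumes pyspark is installed; the file and column names are hypothetical.

    # Read a CSV with Spark and aggregate by a column.
    # File and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("building-blocks").getOrCreate()

    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.groupBy("event_type").count().show()   # rows per category

    spark.stop()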




Data Science Lifecycle


Business Requirement


The first step is to define the objective: discuss with customers or stakeholders to identify the business problems and define the target metric for the project.


Collecting the data


The next step is to acquire the relevant data, either from direct sources like analytics or from third-party sources if necessary. High-quality data is an essential requirement for a data science project.
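
For example, data from a third-party source is often fetched over a REST API. The sketch below does this with requests and pandas; the endpoint URL and response shape are hypothetical assumptions.

    # Fetch records from a (hypothetical) REST API and save them locally.
    import requests
    import pandas as pd

    response = requests.get("https://api.example.com/records")  # hypothetical endpoint
    response.raise_for_status()             # fail early on HTTP errors

    records = response.json()               # assumes a JSON list of objects
    df = pd.DataFrame(records)
    df.to_csv("raw_data.csv", index=False)  # persist for later steps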


Understanding the data


Before training a model, it is important to explore the data first. Most data in production has missing values and errors; these should be handled using domain knowledge and the available algorithms. The data may also be normalized and transformed for better model training.
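
A minimal cleaning-and-normalization sketch with pandas and scikit-learn is shown below; the column names are hypothetical, and median imputation is just one of several possible strategies.

    # Handle missing values and normalize numeric columns.
    # Column names and the imputation strategy are illustrative assumptions.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("raw_data.csv")

    df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
    df = df.drop_duplicates()                         # drop exact duplicate rows

    scaler = StandardScaler()                         # zero mean, unit variance
    df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])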


Creating a model


Out of all the columns available in the dataset, choosing the relevant ones is an important task called feature engineering. It requires exploration of the data and domain expertise to decide which features to use for training the model.
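
One simple, automated way to shortlist features is univariate selection. The sketch below uses scikit-learn's SelectKBest on a built-in dataset; treating k=10 as the right number of features is an assumption for illustration.

    # Score each feature against the target and keep the 10 most informative.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)   # 30 numeric features

    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)
    print(X_selected.shape)                      # (569, 10)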


Based on the problem statement of the project, there are different types of models to choose from. Candidate models can be compared with each other using metrics like accuracy.


Model creation includes the following steps:



  • Split the data randomly into training, validation, and test sets. A commonly used approach is 70% of the data for training, 20% for validation, and 10% for testing, but the proportions can vary based on the dataset (see the sketch after this list).

  • Build the model using the training set, use the validation set to fine-tune hyperparameters, and then retrain the model on the training data.

  • Evaluate the model - after the model is finalized using the training and validation sets, evaluate its accuracy on the test data.
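
The following sketch runs this split / tune / evaluate loop end to end with scikit-learn. It uses a built-in dataset and logistic regression purely for illustration; the 70/20/10 split mirrors the steps above.

    # Split 70/20/10, tune one hyperparameter on the validation set,
    # then evaluate the final model once on the held-out test set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)

    # Carve off 10% for the final test set, then split the rest 70/20.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2/9, random_state=0)

    # Try a few hyperparameter values and keep the best on validation data.
    best_c, best_acc = None, 0.0
    for c in [0.01, 0.1, 1.0, 10.0]:
        model = LogisticRegression(C=c, max_iter=5000).fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc > best_acc:
            best_c, best_acc = c, acc

    # Retrain on train + validation with the chosen value; evaluate on test.
    final = LogisticRegression(C=best_c, max_iter=5000).fit(X_rest, y_rest)
    print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))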




Deploying the model


Decide whether the accuracy of the model is sufficient for use in production. If not, try training different models and collect more data if necessary. Once the model is finalized, deploy it to the web so that users can get predictions on their own data. APIs can be used to serve predictions to other applications as well.
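
As one way to serve predictions over an API, the sketch below wraps a saved model in a small Flask app. Flask, the pickled model file, and the request format are assumptions for illustration, not part of the article.

    # Serve predictions over HTTP. The pickled model file and the
    # request format ({"features": [[...]]}) are illustrative assumptions.
    import pickle
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    with open("model.pkl", "rb") as f:   # hypothetical saved model
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]
        prediction = model.predict(features)
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run()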

