This is a continuation of the Chapter 1 summary of Python Data Analytics by Fabio Nelli. Click here for Part I.

The Data Analysis Process

Data analysis is nothing more than a sequence of steps:

Problem definition

Data extraction

Data preparation: Cleaning

Data preparation: Transformation

Data exploration and visualization

Predictive modeling

Model validation/test

Deployment: visualization and interpretation of results

Deployment: deployment of solutions

Problem Definition

“Data analysis always starts with a problem to be solved.” A study of the system is then designed and conducted so that informed predictions or choices can be made.

“Building a good team is certainly one of the key factors leading to success in data analysis.” Fabio recommends an effective cross-disciplinary team.

Data Extraction

As much as possible, sample data must reflect the real world. In addition to data selection, extracting and using the best data sources is another issue to keep in mind.

Data Preparation

Data preparation comprises obtaining, cleaning, normalizing, transforming, and optimizing a data set. Although data preparation may seem less problematic, it actually requires more resources and more time to complete. Potential problems include data values that are ambiguous, missing, replicated, or out of range.
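To make the cleaning step concrete, here is a minimal sketch that drops missing, replicated, and out-of-range values using only the standard library. The records, the `clean` helper, and the valid range are all hypothetical and made up for illustration; they are not taken from the book.

```python
# Hypothetical raw records: (sensor_id, temperature in degrees Celsius).
raw_records = [
    ("a", 21.5),
    ("b", None),    # missing value
    ("a", 21.5),    # replicated record
    ("c", 480.0),   # out of range for a room sensor
    ("d", 19.0),
]

# Assumed plausible range for this illustration only.
VALID_RANGE = (-40.0, 60.0)


def clean(records, valid_range):
    """Drop missing, out-of-range, and replicated records, preserving order."""
    low, high = valid_range
    seen = set()
    cleaned = []
    for record in records:
        _, value = record
        if value is None:
            continue  # missing
        if not low <= value <= high:
            continue  # out of range
        if record in seen:
            continue  # replicated
        seen.add(record)
        cleaned.append(record)
    return cleaned


cleaned_records = clean(raw_records, VALID_RANGE)
print(cleaned_records)
```

In a real project, deciding what counts as "out of range" or "ambiguous" usually takes domain knowledge, which is part of why this step takes so much time.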

Data Exploration/Visualization

Exploring data involves “searching the data in graphical or statistical presentation to find patterns, connections, and relationships. Data visualization is the best tool to highlight possible patterns.”

Summarization is the process where data are reduced without sacrificing important information. Clustering is used to find groups united by common attributes. Another step of analysis focuses on identification of relationships, trends, and anomalies in the data. Other methods of data mining automatically extract important facts or rules from the data.
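As a rough illustration of summarization and anomaly spotting, the sketch below reduces a small list of numbers to a few descriptive statistics and flags values far from the mean. The sales figures and the two-standard-deviation threshold are assumptions chosen for the example, not something the book prescribes.

```python
import statistics

# Hypothetical daily sales figures, made up for illustration.
daily_sales = [12, 15, 11, 14, 90, 13, 16]

# Summarization: reduce the data to a handful of descriptive numbers.
summary = {
    "count": len(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
}

# Anomaly check: flag values more than two standard deviations from the mean.
# The factor of two is an assumed rule of thumb.
anomalies = [
    value for value in daily_sales
    if abs(value - summary["mean"]) > 2 * summary["stdev"]
]

print(summary)
print(anomalies)
```

Note how far the mean sits from the median here: a single unusual value is enough to pull it away, which is exactly the kind of thing a quick plot would also reveal.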

Predictive Modeling

Predictive modeling is used to create or choose a statistical model that predicts the probability of a result. The purpose of these models is to make predictions about the data values and to classify new data products.

The models can be divided into three types:

Classification models: if the result is categorical

Regression models: if the result is numerical

Clustering models: if the result is descriptive

Some of the methods include linear regression, logistic regression, classification and regression trees, and k-nearest neighbors.
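The difference between a numerical result and a categorical one is easier to see in code. The following is a minimal sketch, written from scratch with the standard library, of a least-squares linear regression and a k-nearest neighbors classifier. The helper names `fit_line` and `knn_classify` and all the data points are hypothetical; in practice you would more likely reach for a dedicated library.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_classify(train, point, k=3):
    """Return the majority label among the k training points nearest to point."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Regression: the result is numerical (made-up points lying on y = 2x + 1).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Classification: the result is categorical (made-up 2-D points with labels).
training = [
    ((1, 1), "small"), ((1, 2), "small"), ((2, 1), "small"),
    ((8, 8), "large"), ((8, 9), "large"), ((9, 8), "large"),
]
label = knn_classify(training, (2, 2))

print(slope, intercept, label)
```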

Some models explain the characteristics of the system under study in a clear and simple way, while others have limited ability to explain those characteristics but still make good predictions.

Model Validation

Validation of the model is the test phase. Data is called the training set when it is used to build the model, and the validation set when it is used to validate the model.

Comparing the data produced by the model with the data produced by the system enables us to evaluate the error and estimate the limits of validity.

This process allows you to numerically evaluate the effectiveness of the model and compare it with other existing models.
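Here is a small sketch of that idea: build a model on a training set, then measure its error on a validation set it has never seen. The observations, the two-thirds split, the through-the-origin model, and the choice of mean absolute error are all assumptions made for this example.

```python
# Hypothetical observations of a system that roughly follows y = 3x.
observations = [(1, 3.1), (2, 5.9), (3, 9.2), (4, 11.8), (5, 15.1), (6, 18.0)]

# Hold out the last third as the validation set (a simple, assumed split).
split = len(observations) * 2 // 3
training_set = observations[:split]
validation_set = observations[split:]


def fit_through_origin(data):
    """Least-squares slope for the model y = slope * x."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def mean_absolute_error(data, slope):
    """Average gap between the system's values and the model's predictions."""
    return sum(abs(y - slope * x) for x, y in data) / len(data)


# Build the model on the training set only.
model_slope = fit_through_origin(training_set)

# Evaluate it on data the model has not seen.
validation_error = mean_absolute_error(validation_set, model_slope)

print(f"slope={model_slope:.3f}, validation MAE={validation_error:.3f}")
```

Because the error is a single number, it can be compared directly against the same measure computed for a competing model.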

Deployment

This is the final step of the analysis process, which aims to translate the results into a benefit. Normally, it consists of “writing a report for management or for the customer who requested the analysis.”

In the report, the following topics are discussed:

Analysis results

Decision deployment

Risk analysis

Measuring the business impact

We’ll conclude this summary by discussing quantitative/qualitative data analysis and open data sources in Part III.

According to Merriam-Webster, data is “factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.” I usually just think of it as anything that can be recorded or measured.

In the book, Fabio makes the distinction that “data actually are not information” and that “information is actually the result of processing.” He then proclaims that data analysis is the “process of extracting information from raw data.”

Data Analysis

“Data analysis allows you to forecast possible responses of systems and their evolution in time.” Its aim is not the mathematical models themselves but the quality of their predictive power.

The search for data, their extraction, and their preparation are also part of the data analysis process, because they play a critical role in the success of the results.

All stages of data analysis employ different techniques of data visualizations. It’s all about the charts!

Knowledge Domains of the Data Analyst

Fabio also points out that data analysis is a multi-disciplinary field and is “well suited to many professional activities.” He adds, “a good data analyst must be able to move and act in many different disciplinary areas.”

Not only is it necessary to know other disciplines, it is also imperative that a data analyst know “how to search not only for data, but also for information on how to treat that data.”

Computer Science

Knowledge of information technology is necessary to use the various tools, such as applications and programming languages, that are needed to perform data analysis and visualization.

Mathematics and Statistics

Data analysis requires a lot of complex math. Statistics provides the concepts that form the basis of data analysis. Bayesian methods, regression, and clustering are just some of the most commonly used techniques.

Machine Learning and Artificial Intelligence

Machine learning analyzes data in order to recognize patterns, clusters, or trends and then extracts useful information in an automated way.

Professional Fields of Application

A better understanding of where the data come from greatly improves their interpretation. It is good practice to find consultants to whom you can pose the right questions about your data.

Types of Data

Data is divided into two distinct categories:

Categorical (nominal and ordinal)

Numerical (discrete and continuous)

Categorical data are observations that can be divided into groups or categories. Nominal variables have no intrinsic order, while ordinal variables have a predetermined order.

Numerical data are measured observations. Discrete variables can be counted, while continuous variables can assume any value within a defined range.
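One way to picture the four kinds of data is to map them onto Python types. This is only a sketch of my own, not something from the book: the `Color` and `Size` enums and the sample values are hypothetical.

```python
from enum import Enum, IntEnum


class Color(Enum):
    """Nominal: categories with no intrinsic order."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


class Size(IntEnum):
    """Ordinal: categories with a predetermined order."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


visits: int = 7          # discrete: something you count
height_m: float = 1.73   # continuous: any value within a range

# Ordinal values can be sorted by their predetermined order.
ordered_sizes = sorted([Size.LARGE, Size.SMALL, Size.MEDIUM])

# Nominal values cannot be ranked; a plain Enum refuses the comparison.
try:
    Color.RED < Color.BLUE
    nominal_is_ordered = True
except TypeError:
    nominal_is_ordered = False

print(ordered_sizes, nominal_is_ordered)
```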

Next, in Part II, we will explore the process of data analysis in detail.