UNIT 16 DATA SCIENCE FOR SOFTWARE ENGINEERS
Structure
16.0 Introduction
16.1 Objectives
16.2 Applications for Data science
16.3 Background of Data Science
16.4 Data Science Tools
16.5 Data science and Big data
16.6 Phases involved in Data science process
16.6.1 Requirements Gathering and Data Discovery
16.6.2 Data Preparation stage
16.6.3 Data exploration stage
16.6.4 Model development and Prediction stage
16.6.5 Data visualization stage
16.6.6 A sample case study
16.7 Data science Methods
16.7.1 Clustering Method
16.7.2 Collaborative Filtering or Similarity Matching Method
16.7.3 Regression Methods
16.7.4 Classification Methods
16.7.5 Other Methods
16.8 Data science process for predicting the probability of product purchase
16.9 Summary
16.10 Solutions/Answers
16.11 Further Readings
16.0 INTRODUCTION
We are living in a data-driven world where making sense of raw data is an
information superpower. Data is growing exponentially with each passing day. With
every tweet, Facebook post, Instagram picture and YouTube video, users generate
massive amounts of data. In addition to social media, the sensors of IoT-enabled
devices and wearables also generate data at high velocity.
Data science is gaining popularity with the emergence of Big Data. Data analysis has
been around for many decades, and its methods have evolved over time. As data
scientists can crunch massive amounts of data to detect anomalies, patterns and
trends, organizations want to leverage data science to reduce cost, explore
cross-sell/upsell opportunities and new market opportunities, and forecast and
recommend in order to gain a competitive advantage. Data science is an
interdisciplinary field that draws on statistics, computer science, machine learning
and others.
Data science helps businesses make data-driven decisions. Organizations can apply
data collection, data preparation and data analysis methods to mine massive volumes
of data to understand customers’ behavior and explore business opportunities to
influence their customers.
Advanced Topics in Software Engineering
16.1 OBJECTIVES
As depicted in Figure 16.1, through exploratory data analysis we can describe
events, phenomena and scenarios. We can also look at the data dimensions and
diagnose why an event happened. After data preparation, data analysis and model
development, we can predict what will happen in the future and prescribe what can be
done to make it happen or to prevent it from happening.
Data science is closely related to the field of statistics and mainly uses statistical
methods for data analysis. Data analysis and its methods have been in use since the
1960s. The term “Data science” can be traced back to 1974, when it was coined by
Peter Naur; it was later popularized by C. F. Jeff Wu in 1985.
Let us understand how data analysis has evolved by looking at the recommendation
use case. In early 2010 we used formal and rigid business rules for product
recommendation; this was followed by linear regression models in 2011. In 2013 we
achieved greater accuracy using logistic regression. Decision trees were widely used
in 2014. Organizations like Amazon and Netflix leveraged collaborative filtering in
2015 for product recommendations. Bayesian networks were used for recommendation
in 2017. From 2019 onwards, organizations have relied heavily on machine learning
and deep learning methods for product recommendation to achieve better accuracy.
16.4 DEFINITION OF DATA SCIENCE
The proliferation of Big Data has propelled the increase in popularity of Data Science.
Big Data is characterized by its volume, velocity (massive rate of increase), veracity
and variety (structured, semi-structured and unstructured), all of which pose
challenges in organizing and analyzing it. Examples of Big Data are sensor data, user
click data and tweet data. We cannot easily manage Big Data through traditional
databases. Hence many Big Data systems do not strictly adhere to the ACID
(Atomicity, Consistency, Isolation and Durability) properties but aim for eventual
consistency, a trade-off described by the CAP (Consistency, Availability, Partition
tolerance) theorem. Some of the popular Big Data systems are NoSQL products such
as MongoDB, DynamoDB and CouchDB.
As part of data science development, we build, explore and fine-tune various models
to handle massive data sets.
Big Data provides the massive data management and data processing technologies,
such as Apache Hadoop and NoSQL databases, on which data science methods run.
Figure 16.3 depicts the various phases involved in the data science process.
In data science, the data is processed through various stages in a pipeline. During
each stage, the raw data undergoes a series of analyses and changes. The lifecycle is
cyclic, as we continuously feed back what we learn at each stage to its previous stage.
We shall look at each of the stages in the next sections.
16.6.1 Requirements Gathering and Data Discovery
We need to understand the business requirements, goals, focus areas and challenges.
We also need to understand the key data sources, data inventories, existing data
analysis tools and technologies, the subject matter experts (SMEs), the project
sponsor, the main stakeholders, and the time and budget for the program. It is
essential to understand any existing data analysis initiatives as well, so that we can
learn from past experience. Domain experts and business analysts provide the
heuristics related to the business domain, using which we can formulate the initial
hypothesis. Once we thoroughly understand the business domain, we can formulate
the problem for applying the data science methods.
Once we fully understand the requirements, we need to collect data from various
sources such as data lakes, relational databases, ERP systems, internal and external
social media systems and collaboration systems. For instance, for customer
behavioral analysis, we need to gather data from sources such as the CRM system
(user case details), user profile details, order history, the product review system
(product name, review details), the email system and social media (user’s social
media handle, user details).
Given below are some of the main activities of the data conditioning stage:
Removing duplicates
Bringing the data to a common or standard format
Inferring the missing values
Smoothing the data by removing the noise
Filtering and sampling the data
Making the data types consistent across the entire data set
Replacing missing values with the mean or median value
The conditioned data will be used for exploration and model development.
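A few of the conditioning activities listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the field layout (customer id, age, city) and the `condition` helper are hypothetical and do not come from the unit.

```python
from statistics import median

# Hypothetical raw records: (customer_id, age, city); None marks a missing age.
raw = [
    ("c1", "34", " Pune "),
    ("c2", None, "delhi"),
    ("c1", "34", " Pune "),   # exact duplicate of the first record
    ("c3", "41", "DELHI"),
    ("c4", "29", "Mumbai"),
]

def condition(records):
    """Deduplicate, standardize formats, fix types and impute missing ages."""
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(records))
    # 2. Bring text to a standard format and make the age type consistent (int).
    cleaned = [
        (cid, int(age) if age is not None else None, city.strip().title())
        for cid, age, city in unique
    ]
    # 3. Replace missing ages with the median of the known ages.
    known_ages = [age for _, age, _ in cleaned if age is not None]
    fill = median(known_ages)
    return [(cid, age if age is not None else fill, city) for cid, age, city in cleaned]

conditioned = condition(raw)
for row in conditioned:
    print(row)
```

Median imputation is used here because it is less sensitive to outliers than the mean; either choice fits the list above.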
During this stage, we extract the features from the data and process the data. Based on
the extracted features, we can perform a first-cut analysis and form the initial
hypothesis. We also classify and cluster the data into logical categories. We identify
the relationships between variables during this stage. We then select the key
predictors that are essential for recommendation and prediction.
For instance, in order to model the price of a used car, we need to examine a massive
amount of historical used car sales transactions consisting of various attributes.
However, key features such as car mileage, car age, kms driven, engine capacity and
purchase price have a direct correlation with the car’s market price. These key
features directly influence the final price of the used car. Data scientists look at the
data, try to correlate the useful features and form the hypothesis. In the used car price
prediction use case, the hypothesis could be “The car age is inversely correlated to the
final price (with the increase in car age, the final price decreases)”.
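A hypothesis of this kind can be checked with a correlation coefficient. The sketch below computes the Pearson correlation between car age and sale price; the six data points are made up purely for illustration, and `pearson` is a hypothetical helper written for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical observations: car age in years and sale price (arbitrary units).
car_age_years = [1, 2, 3, 5, 7, 9]
sale_price = [9.0, 8.2, 7.1, 5.5, 4.0, 2.9]

r = pearson(car_age_years, sale_price)
print(f"correlation(age, price) = {r:.3f}")
```

A value of r close to -1 would support the inverse relationship stated in the hypothesis; a value near 0 would suggest that car age alone is a poor predictor.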
16.6.4 Model development and Prediction stage
We design and develop the machine learning models in this stage. We train and
fine-tune the models and test their performance on an iterative basis. Data scientists
evaluate various models for the selected data, and the model that provides the
highest performance is selected for prediction.
We use many methods such as regression analysis, qualitative methods, decision tree
models and deep learning models for data analysis during this stage. We evaluate the
performance of each model using measures such as the accuracy of its predictions
and its performance on various training and test data sets.
We then use the selected models for prediction.
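The evaluate-and-select loop described above can be sketched as follows. The example compares two candidate models on held-out test data using mean absolute error and keeps the better one. The data points, the choice of candidates (a mean baseline and a one-variable least-squares line) and the error measure are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b * mean_x, b

def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical (car age in years, sale price) pairs, split into train and test sets.
train = [(1, 9.0), (2, 8.2), (3, 7.1), (5, 5.5), (7, 4.0), (9, 2.9)]
test = [(4, 6.3), (6, 4.8), (8, 3.4)]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Train both candidate models on the training set only.
mean_price = sum(train_y) / len(train_y)
a, b = fit_line(train_x, train_y)
candidates = {
    "mean baseline": lambda x: mean_price,
    "linear regression": lambda x: a + b * x,
}

# Score every candidate on the unseen test set and select the lowest error.
scores = {
    name: mean_absolute_error([model(x) for x in test_x], test_y)
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Scoring on data the models have not seen during training is what makes the comparison meaningful; scoring on the training set would favor models that merely memorize it.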
16.6.5 Data Visualization stage
In the final stage, we visually communicate the obtained insights using reports and
business intelligence (BI) tools to help the audience make informed decisions. We
also document the key findings, observations and the behavior of the model for
various test cases. We often use visual representations such as charts and
infographics to communicate the findings to the stakeholders.
Once the model is accepted by the stakeholders, we operationalize the model and use
it for business prediction. We need to continuously fine-tune the model based on the
feedback and learning from real-world data.
Top 5 in last month (Yes/No) | Top 5 in last year (Yes/No) | Brand campaign (Yes or No) | Buyer Age Group 10-18 yrs. (Yes/No) | Buyer Age Group 19-34 yrs. (Yes/No) | Buyer Age Group 30-50 yrs. (Yes/No) | Buyer Income (10K-30K) (Yes/No) | Product | Store | Sales
1 | 0 | 1 | 0 | 1 | Nil | 1 | ABC | Store1 | 500
1 | 1 | Yes | 0 | 0 | 1 | 1 | XYZ | Store1 | 1500
0 | 1 | | | | | | ABC | Store1 | 500
Likewise, we collect all the possible data from all stores for all combinations of the
identified features.
In the above example, one of the values is “Yes”, which needs to be coded as “1”.
Another value is “Nil”, which will be converted to “0”. The third row of data is empty
for many features; hence we will discard that row for training purposes. The
conditioned data is depicted in Table 16.2.
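The conditioning rules just described (code “Yes” as 1, convert “Nil” to 0, discard rows with empty features) could be sketched as below. The short column names are hypothetical labels introduced for this example, and the placement of the third row's two known flag values in the first two columns is an assumption, since the raw table does not make their position certain.

```python
# Hypothetical short names for the ten columns of the raw sales table.
COLUMNS = [
    "top5_month", "top5_year", "brand_campaign", "age_10_18", "age_19_34",
    "age_30_50", "income_10k_30k", "product", "store", "sales",
]
N_FLAGS = 7  # the first seven columns are Yes/No style flags

# The three sample rows; None marks an empty cell (third row layout is assumed).
raw_rows = [
    [1, 0, 1, 0, 1, "Nil", 1, "ABC", "Store1", 500],
    [1, 1, "Yes", 0, 0, 1, 1, "XYZ", "Store1", 1500],
    [0, 1, None, None, None, None, None, "ABC", "Store1", 500],
]

ENCODING = {"Yes": 1, "Nil": 0}

def condition_rows(rows):
    """Drop rows with empty cells and encode the flag columns as 0/1."""
    conditioned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # discard incomplete rows for training purposes
        flags = [ENCODING.get(value, value) for value in row[:N_FLAGS]]
        conditioned.append(flags + row[N_FLAGS:])
    return conditioned

for row in condition_rows(raw_rows):
    print(dict(zip(COLUMNS, row)))
```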
During this stage, we identify the prominent features that influence the product’s
overall popularity. In the above example, we identify that demographic data has the
greatest influence on the overall product sales.
We also identify which features have a positive/linear correlation with the product
sales and which have a non-linear correlation with the product sales.
Prediction Stage
In this stage, we predict the product sales for each store. For this, we develop
models that use the identified features as predictors of the overall product sales for
each store. We can explore various models against the test data and compare their
accuracy and performance.
Once we finalize the model, we can train it further and fine-tune it to achieve better
accuracy. We can then use the model to predict the top-selling products.
In the data visualization stage, we provide the top predicted products for each store so
that the store owner can replenish the inventory for the popular products, as depicted
in Figure 16.5.
In this section we shall look at the most popular data analysis methods we use in
data science projects.
Time series prediction methods analyze events occurring at regular time
intervals.
Text analysis and sentiment analysis use methods such as bag of words,
TF-IDF and topic modelling to gain insights from text.
Profiling involves characterizing an event, a user or an object.
Reduction involves reducing a large data set into a smaller data set that
represents the key samples from the large data set.
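As an illustration of the text analysis methods named above, the following sketch builds a bag of words for each document and then weights the terms with TF-IDF. The three review snippets are invented, and the weighting uses one common variant (term frequency times log(N / document frequency)); other variants with smoothing exist.

```python
from collections import Counter
from math import log

# Hypothetical product review snippets.
docs = [
    "great product fast delivery",
    "poor product slow delivery",
    "great service great price",
]

def bag_of_words(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(text.lower().split())

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total_terms = sum(bag.values())
        weights.append({
            term: (count / total_terms) * log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    top_term = max(w, key=w.get)
    print(f"{doc!r}: most distinctive term = {top_term!r}")
```

Terms that occur in many documents (such as "product" here) receive a lower weight than terms that occur in only one, which is what lets TF-IDF surface the distinctive words of each text.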
In this sample case study, let us look at the various steps involved in predicting the
probability of a product purchase and in understanding the key features.
Requirement gathering and Data collection stage
The sample case study is aimed at understanding the key features that influence the
probability of product purchase and at developing a prediction model given the key
feature values.
Table 16.3 provides sample data collected for eleven historical transactions, for
reference.
Table 16.3: Product purchase data
Age_LT30  Age_GT30  In_LT50       In_GT50  Loyal_CT  onSale  Prod_buy
0         1         0             1        1         1       1
0         1         0             1        0         1       1
1         0         0             1        1         1       1
0         1         0             1        1         1       1
1         0         Yes           0        1         1       0
0         1         Correct       0        1         1       1
1         0         Y             0        0         1       0
1         0         Less than 50  0        0         0       0
0         1         Confirmed     0        0         1       1
1         0         One           0        1         1       0
1         0         NA            0        1         1       0
In this stage, we identify the key features and their relation to the product purchase
decision. We apply statistical methods to understand the data distribution and the
feature relationships. We can infer the following based on our analysis:
Feature Age_GT30 is positively and linearly correlated to Prod_buy.
Conversely, Age_LT30 is inversely related to Prod_buy.
Feature In_GT50 is positively and linearly correlated to Prod_buy.
Conversely, In_LT50 is inversely related to Prod_buy.
Loyal_CT and onSale are positively correlated to Prod_buy.
We can use these features and insights for building and training models.
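These inferences can be checked directly against Table 16.3. The sketch below computes the Pearson correlation of each cleanly coded feature column with Prod_buy; In_LT50 is left out because its raw values are not yet coded as 0/1 in the table. The `pearson` helper is written for this example and is not part of the unit.

```python
from math import sqrt

# Columns transcribed from Table 16.3 (eleven transactions each).
TABLE = {
    "Age_LT30": [0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "Age_GT30": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    "In_GT50":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "Loyal_CT": [1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    "onSale":   [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    "Prod_buy": [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

target = TABLE["Prod_buy"]
corr = {
    name: pearson(values, target)
    for name, values in TABLE.items()
    if name != "Prod_buy"
}
for name, value in sorted(corr.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {value:+.3f}")
```

On these eleven rows the signs agree with the inferences listed above, although the correlation for Loyal_CT is only weakly positive; with so few transactions the values should be read as indicative rather than conclusive.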
Prediction Stage
In this stage we build and evaluate various models, such as linear regression and
logistic regression, and understand their performance. We have selected the decision
tree model as it factors in all the key features. With the decision tree model, we can
visualize the combination and impact of the various features.
We validate the decision tree model with the test data to ensure that it predicts
accurately.
Figure 16.6 provides a sample decision tree for our product purchase prediction.
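To make the idea concrete, here is a minimal sketch of how a decision tree could be learned from the rows of Table 16.3. It greedily picks the binary feature with the lowest weighted Gini impurity at each node. This is an illustrative implementation under that assumption, not necessarily the algorithm behind Figure 16.6, and the resulting tree may differ from the figure.

```python
FEATURES = ["Age_GT30", "In_GT50", "Loyal_CT", "onSale"]

# (Age_GT30, In_GT50, Loyal_CT, onSale, Prod_buy) for the rows of Table 16.3.
ROWS = [
    (1, 1, 1, 1, 1), (1, 1, 0, 1, 1), (0, 1, 1, 1, 1), (1, 1, 1, 1, 1),
    (0, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 0, 0, 1, 0), (0, 0, 0, 0, 0),
    (1, 0, 0, 1, 1), (0, 0, 1, 1, 0), (0, 0, 1, 1, 0),
]

def gini(rows):
    """Gini impurity of the Prod_buy label (last column) in rows."""
    if not rows:
        return 0.0
    p = sum(r[-1] for r in rows) / len(rows)
    return 2 * p * (1 - p)

def majority(rows):
    """Most common label in rows; ties resolve to 0."""
    return int(2 * sum(r[-1] for r in rows) > len(rows))

def build(rows, feature_ids):
    """Return a leaf label, or (feature_id, subtree_if_0, subtree_if_1)."""
    if len({r[-1] for r in rows}) == 1 or not feature_ids:
        return majority(rows)

    def split_cost(i):
        zeros = [r for r in rows if r[i] == 0]
        ones = [r for r in rows if r[i] == 1]
        return (len(zeros) * gini(zeros) + len(ones) * gini(ones)) / len(rows)

    best = min(feature_ids, key=split_cost)
    zeros = [r for r in rows if r[best] == 0]
    ones = [r for r in rows if r[best] == 1]
    if not zeros or not ones:
        return majority(rows)  # the best split does not separate anything
    rest = [i for i in feature_ids if i != best]
    return (best, build(zeros, rest), build(ones, rest))

def predict(tree, row):
    """Walk the tree using the feature values in row until a leaf is reached."""
    while isinstance(tree, tuple):
        feature_id, if_zero, if_one = tree
        tree = if_one if row[feature_id] else if_zero
    return tree

def show(tree, indent=""):
    if not isinstance(tree, tuple):
        print(f"{indent}Prod_buy = {tree}")
        return
    feature_id, if_zero, if_one = tree
    print(f"{indent}{FEATURES[feature_id]} = 0:")
    show(if_zero, indent + "  ")
    print(f"{indent}{FEATURES[feature_id]} = 1:")
    show(if_one, indent + "  ")

tree = build(ROWS, list(range(len(FEATURES))))
show(tree)
```

Here the tree is fitted on all eleven rows, so checking it against the same rows only confirms that it reproduces the training data. A proper validation, as described above, would hold back separate test transactions.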
16.9 SUMMARY
In this unit, we started by discussing the main applications of data science. We
discussed the interdisciplinary nature of data science, as it involves methods from
various fields such as computer science, big data, statistics, analytics, domain
knowledge, machine learning and others. We also looked at the key tools of data
science such as R, Python, Matlab and others. We discussed the relationship between
data science and Big Data. Next, we had a detailed discussion of the various stages of
the data science lifecycle. The requirements gathering and data collection stage
involves gathering all the data sources; the data preparation stage conditions the data
and fixes the errors. The data exploration stage involves close examination of the key
features and building the hypothesis. In the prediction stage we build models to
predict the event. Finally, in the data visualization stage we document and showcase
our findings. We also had a deep-dive discussion of various data science methods
related to clustering, collaborative filtering, regression and classification. Finally, we
looked at a detailed case study that uses data science methods to predict the
probability of a product purchase.
16.10 SOLUTIONS/ANSWERS
References
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about
data mining and data-analytic thinking. O'Reilly Media, Inc.
Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer,
Berlin, Heidelberg.
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., ... & Zimmermann, T. (2019,
May). Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st
International Conference on Software Engineering: Software Engineering in Practice
(ICSE-SEIP) (pp. 291-300). IEEE.
Agarwal, R., & Dhar, V. (2014). Big data, data science, and analytics: The opportunity and
challenge for IS research.