EBook - Data Science 4
Contact Us:
TEKS Academy
Flat No: 501, 5th Floor, Amsri Faust Building,
SD Road, near Reliance Digital Mall,
Regimental bazaar,
Shivaji Nagar,
Secunderabad,
Telangana - 500025
Support: 1800-120-4748
Email: [email protected]
Data Science is a field that uses mathematical and statistical techniques to extract valuable insights from data. It is an
interdisciplinary field that involves computer science, statistics, and domain
expertise. Data scientists use data to solve complex business problems,
make predictions, and inform decision-making. They use techniques such
as machine learning, data mining, and predictive analytics to analyze and
interpret the data.
1. Business Understanding:
Business understanding refers to the process of gaining a deep understanding of the business problem or
question that needs to be addressed through data analysis. This phase is a crucial step in any data science project, as
it lays the foundation for the entire analytical process.
Operational objectives: Operational objectives include making sure that the logistical elements of your
business are up to scratch. For instance, it might mean ensuring your
supplies will arrive from a manufacturer at the same time each
month. These objectives keep the company running smoothly.
Productivity and performance: Employees are the lifeblood of a
business. Making sure that employees remain productive drives revenue and improves customer satisfaction.
Measuring employee satisfaction and setting goals for each team ensures efficiency and productivity.
2. Data Understanding:
Data understanding is a crucial phase in the data science lifecycle, and it involves gaining insights into the
data that will be used for analysis. This phase typically follows the initial data collection and precedes the data
preparation or preprocessing steps. The primary goal of data understanding is to familiarize data scientists with the
characteristics, structure, and content of the dataset.
By completing the Data Understanding phase, the stakeholders gain a deeper understanding of the data, its
quality, and its potential usefulness for the project. This knowledge is used to guide the next phase, Data Preparation,
where the data is transformed and prepared for modeling and analysis.
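To make this concrete, here is a minimal pandas sketch of a first pass at data understanding; the file name data.csv is a placeholder for your own dataset.

import pandas as pd

# Load the dataset (the file name is a placeholder)
df = pd.read_csv("data.csv")

# Structure: column names, data types, and non-null counts
df.info()

# Content: summary statistics for numeric and categorical columns
print(df.describe(include="all"))

# Quality: missing values per column and duplicate rows
print(df.isna().sum())
print("Duplicate rows:", df.duplicated().sum())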
3. Data Preparation:
A common rule of thumb is that 80% of the project is data preparation. This phase, which is often referred to
as “data munging”, prepares the final data set(s) for modeling. It has five tasks:
Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage in, garbage out. A
common practice during this task is to correct, impute, or remove erroneous values.
Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index
(BMI) from height and weight fields, as in the sketch after this list.
Integrate data: Create new data sets by combining data from multiple sources.
Format data: Re-format data as necessary. For example, you might convert string values that store numbers
to numeric values so that you can perform mathematical operations.
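As a minimal illustration of these five tasks, here is a hedged pandas sketch; the files patients.csv and visits.csv and the column names height_m, weight_kg, and age are hypothetical.

import pandas as pd

# Select: keep only the columns needed for the analysis
df = pd.read_csv("patients.csv")[["height_m", "weight_kg", "age"]]

# Clean: remove rows with physically impossible heights
df = df[df["height_m"].between(0.5, 2.5)]

# Construct: derive body mass index from the height and weight fields
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Integrate: combine with another (hypothetical) source on a shared key, e.g.
# df = df.merge(pd.read_csv("visits.csv"), on="patient_id")

# Format: convert string values that store numbers to numeric values
df["age"] = pd.to_numeric(df["age"], errors="coerce")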
4. Data Cleansing:
Data cleansing, also known as data cleaning or data scrubbing, is the process of identifying and correcting
errors, inconsistencies, and inaccuracies in datasets. The goal of data cleansing is to improve the quality of data by
removing or correcting errors that could adversely affect the accuracy and reliability of analysis, reporting, and
decision-making processes.
Imputation: Imputation is the process of replacing missing data values with estimated values based on the other
available data in a dataset. This is commonly done in statistical analysis and
machine learning to ensure the completeness of the dataset.
Variants:
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
Techniques:
- Deletion methods: simple strategies such as case-wise deletion (also called list-wise deletion or complete case analysis) and pair-wise deletion (available case analysis).
- Single imputation methods, using from sklearn.impute import SimpleImputer (a runnable sketch follows this list):
  - Mean imputation: SimpleImputer(missing_values=np.nan, strategy='mean')
  - Median imputation: SimpleImputer(missing_values=np.nan, strategy='median')
  - Mode imputation: SimpleImputer(missing_values=np.nan, strategy='most_frequent')
  - Random imputation
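Here is a small runnable sketch of single imputation with scikit-learn's SimpleImputer; the sample array is made up for illustration.

import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with a missing value in the first column
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Replace NaNs with the column mean; swap in 'median' or 'most_frequent' as needed
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # the NaN becomes (1.0 + 7.0) / 2 = 4.0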
5. Model Building:
Model building in data science involves the process of creating statistical models that help analyze and
interpret data. These models can make predictions and help identify relationships among variables in the data.
Common types of models used in data science include regression models, classification models, and clustering
models.
Model Training: Using the selected modeling technique and the prepared data, a predictive model is
trained on a subset of the data known as the training set. The
model is optimized by adjusting hyperparameters to improve
model performance.
Model Evaluation: The performance of the model is evaluated
on a separate subset of the data known as the validation set. The model is
tested for accuracy, precision, recall, and other metrics to determine how well it performs on unseen
data.
Model Selection and Deployment: Based on the evaluation results, the best-performing model is
selected for deployment. The model is then deployed to a production environment where it can be used
to make predictions on new data.
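A minimal scikit-learn sketch of this train, evaluate, and select flow; the synthetic dataset and the two candidate models are stand-ins for your own data and modeling choices.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data stands in for the prepared dataset
X, y = make_classification(n_samples=1000, random_state=42)

# Hold out a validation set to measure performance on unseen data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train candidate models on the training set
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

# Select the best-performing model for deployment
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)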
6. Evaluation:
In data science, evaluation is the process of assessing the quality and performance of predictive models,
algorithms, and systems. It involves measuring various metrics, such as accuracy, precision, recall, F1-score, and area
under the curve, to determine how well a model performs on new, unseen data. Evaluation is an important step in
the data science process, as it helps researchers and practitioners select the best model for a particular task and
improve the performance of their models over time.
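The metrics named above can be computed directly with scikit-learn; in this hedged sketch, the labels and probabilities are made-up values for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical true labels, hard predictions, and positive-class probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))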
Ethical Evaluation: Evaluating models for ethical considerations, including bias and fairness assessments, to
ensure that models are not discriminatory and comply with ethical standards.
Business Impact Evaluation: Assessing the impact of model predictions on business objectives, such as
revenue, customer satisfaction, or cost savings.
Continuous Monitoring and Feedback Evaluation: Ongoing evaluation of models in production to ensure
they perform as expected and adapting models based on user feedback and changing data patterns.
7. Model Deployment:
Model deployment refers to the process of making a trained machine-learning model accessible and usable
in a production environment. This involves integrating the model into an existing software system, such as a website,
application, or service, so that it can be used to make predictions or decisions based on new data. There are various
tools and techniques available for model deployment, including containerization, cloud deployment, and on-premises
deployment.
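As one common pattern (not the only one), a trained model can be wrapped in a small web service; this sketch assumes Flask and a model saved with joblib, where model.joblib and the feature layout are placeholders.

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # placeholder path to the trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)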
Test the deployment: Before the model is deployed to a live production environment, we thoroughly test it
in a staging environment to ensure that it works as expected and can
handle the expected volume of traffic. We might use a testing
framework like pytest to automate the testing process and ensure
that the model meets our performance requirements (a small example follows these steps).
Deploy the model: Once the model has been tested and is ready for production, we deploy it to the live
environment. This might involve using a tool like Docker or Kubernetes to manage the deployment and
scaling of the application.
Monitor and maintain the model: Once the model is deployed, it is important to continuously monitor its
performance and update it as needed to ensure that it continues to meet the needs of the application and
its users. For example, we might use a logging and monitoring service like AWS CloudWatch or Datadog to
track the model's performance and detect any issues or errors.
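A small pytest sketch of such a check, assuming the Flask service above is saved as app.py; the endpoint and feature values are the hypothetical ones from that sketch.

# test_model.py
from app import app  # assumes the Flask sketch above lives in app.py

def test_predict_endpoint_returns_a_prediction():
    client = app.test_client()
    response = client.post("/predict", json={"features": [[5.1, 3.5, 1.4, 0.2]]})
    assert response.status_code == 200
    assert "prediction" in response.get_json()

Running pytest from the project directory discovers and executes this test before the model is promoted to production.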
Tableau
Tableau is a user-friendly platform for visualizing, analyzing, and sharing data. Its aim is to produce easy-to-understand solutions from raw data.
The platform can be easily used by individual analysts or scaled across a large organization. Tableau makes it
simple for data from different individuals or departments to be combined or shared in one place. There are multiple
paid products produced by Tableau, each with unique services and customer segments:
Tableau Prep allows users to combine, shape, and clean their data from a variety of sources.
Tableau Desktop is the main platform for visualizing and analyzing data and turning them into interactive dashboards and reports.
Tableau Online allows you to publish your work online and share it with anyone you give access to.
Tableau Server gives you more control over who can see your work, and it is usually used to share data within an organization.
Tableau Public is a free service that allows you to create data visualizations from limited sources that must be published to the public server.
It is important to ensure that data that is obtained from external sources is of good quality. If you choose to select
data from online sources, choose websites that are reputable. Sites such as Statistics Canada or the World Health
Organization are great resources. In the example throughout this tutorial we selected data from The Canadian
Institute for Health Information.
You are able to select from various file types when working with Tableau.
Examples of acceptable data files include Excel files, PDFs, or text files. Data can also be uploaded from servers
such as Google Sheets. The data needs to be in the correct format before uploading; the data you want to
use in Tableau should follow these guidelines:
The data should be as granular as possible. This means that your data is detailed rather than just average
values.
Ensure that there is no aggregated data (no total values).
All extra titles and notes should be removed. This excludes data headers.
Ensure that there are no blank cells or rows.
The data should follow database format, where it is row-oriented rather than column-oriented. Tableau is
optimized to work with row-oriented tables. This can be done either in Tableau or before you upload your data.
Here’s an example of how to reformat the original data: to optimize it for Tableau, the null data cells
need to be replaced with zeros, and the extra notes at the top must be deleted. Once all the data is in the correct
format, you can upload the file to Tableau. From the home screen, click on the corresponding file format in the blue
pane.
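The same reshaping can also be scripted before upload; this pandas sketch assumes a hypothetical wide table health_data.csv with a Region column and one column per year.

import pandas as pd

# Hypothetical wide table: one row per region, one column per year
df = pd.read_csv("health_data.csv")

# Replace null cells with zeros, as recommended above
df = df.fillna(0)

# Reshape from column-oriented to the row-oriented layout Tableau prefers
long_df = df.melt(id_vars=["Region"], var_name="Year", value_name="Cases")

long_df.to_csv("health_data_tableau.csv", index=False)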
Brief Overview of Tableau: Introduce Tableau as a powerful data visualization tool that helps users make
sense of their data through interactive and shareable dashboards.
Importance of Data Visualization: Discuss the significance of
visualizing data for better understanding, analysis, and decision-
making.
3. Data Preparation:
Importing Data into Tableau: Explore different methods of importing data into Tableau, including live
connections and data extracts.
Data Cleaning and Transformation: Demonstrate techniques for cleaning and transforming data within
Tableau to ensure accurate and meaningful visualizations.
Handling Null Values: Address strategies for handling null or missing values in datasets.
Best Practices for Layout and Formatting: Provide tips for arranging elements on a dashboard and formatting
them for clarity.
Incorporating Interactivity: Showcase how to add interactive
elements to dashboards for a dynamic user experience.
Joining the Tableau Community: Encourage readers to join the Tableau community for networking and
learning.
Online Forums and Knowledge Sharing: Provide information on
online forums and platforms where Tableau users share knowledge
and experiences.
Recommended Reading and Further Learning: Suggest books, blogs, and
other resources for readers looking to deepen their Tableau expertise.
POWER BI
Introduction:
Power BI is a suite of business analytics tools which connects to different data sources to analyze data and
share insights throughout your organization.
Power BI Desktop: It is a Windows desktop application (Report Authoring Tool) which lets you build queries, models,
and reports that visualize data.
Power BI Service: Power BI Service is a cloud-based Software as a Service (SaaS) application which allows us to create
dashboards, set up scheduled data refreshes, and share reports securely within the organization.
Power BI Mobile: It is an application for mobile devices which allows you to interact with reports and dashboards
from the Power BI service.
The flow of work in Power BI:
A common flow of work in Power BI begins in Power BI Desktop,
where a report is created. That report is then published to the Power BI
service, and then shared so users of Power BI Mobile apps can consume
the information.
It doesn’t always happen that way, and that’s okay, but we’ll use that flow to help you learn the various parts
of Power BI, and how they complement one another.
Power BI Desktop:
Power BI Desktop is a report authoring tool that allows you to create reports and queries, Extract,
Transform, and Load (ETL) data from data sources, and model the queries.
Ribbon: The Ribbon displays common tasks associated with reports and visualizations.
Pages: The Pages tab area along the bottom allows you to select or add a report page.
Visualizations: The Visualizations pane allows you to change visualizations, customize colors or axes, apply filters,
drag fields, and more.
Fields: The Fields pane allows you to drag and drop query elements and filters onto the Report view, or drag them to
the Filters area of the Visualizations pane.
Views Pane: There are three types of views in the Views pane:
Reports View – allows you to create any number of report pages with visualizations.
Data View – allows you to inspect, explore, and understand data in your Power BI Desktop model.
Relationship or Model view – allows you to show all of the tables, columns, and relationships in your model.
Power BI Desktop handles report authoring, while the Power BI service handles publishing and
sharing in the cloud. To kick-start your journey, download and install Power BI Desktop and create an account on the
Power BI Service to experience the seamless integration of these two environments.
Transforming Data:
Once data is imported, the Power Query Editor becomes your ally for data transformation. Cleanse and
shape your data using a range of transformation techniques. Handle missing data, remove duplicates, and perform
advanced manipulations to ensure your dataset is optimized for analysis.
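One way to script such steps (assuming Python scripting is enabled in your Power BI installation) is the Power Query Editor's Run Python script transform, where the current table is exposed as a pandas DataFrame named dataset; a minimal sketch:

# Inside Power Query's "Run Python script" step, the current table
# arrives as a pandas DataFrame named `dataset`
import pandas as pd

# Remove duplicate rows
dataset = dataset.drop_duplicates()

# Handle missing data: fill numeric gaps with each column's median
numeric_cols = dataset.select_dtypes(include="number").columns
dataset[numeric_cols] = dataset[numeric_cols].fillna(dataset[numeric_cols].median())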
Creating Visualizations:
With your data modelled, it's time to visualize. Power BI offers a rich array of visualization options, including
charts, graphs, and maps. Learn how to customize visuals, adjust formatting, and employ themes and templates for
consistent and visually appealing design. Unlock the potential of storytelling through compelling data visuals.
Building Dashboards:
Dashboards consolidate your visuals into a cohesive narrative. Explore the use of slicers and filters to add
interactivity, allowing users to focus on specific aspects of the data. Utilize drill-through options for a deeper, more
detailed analysis, enhancing the overall user experience.
Advanced features:
Take your Power BI skills to the next level by exploring advanced features. Power BI Premium offers
enhanced capabilities, while datasets and data flows provide more flexibility
in managing your data. Incorporate artificial intelligence features to derive
even deeper insights from your datasets.