
Data Science: A comprehensive e-book for your basic understanding


Course material: e-book
Copyright © 2023 TEKS Academy

Contact Us:
TEKS Academy
Flat No: 501, 5th Floor, Amsri Faust Building,
SD Road, near Reliance Digital Mall,
Regimental bazaar,
Shivaji Nagar,
Secunderabad,
Telangana - 500025

Support: 1800-120-4748
Email: [email protected]

Data Science is a field that uses mathematical and statistical techniques to extract valuable insights from data. It is an
interdisciplinary field that involves computer science, statistics, and domain
expertise. Data scientists use data to solve complex business problems,
make predictions, and inform decision-making. They use techniques such
as machine learning, data mining, and predictive analytics to analyze and
interpret the data.

1. Business Understanding:
Business understanding refers to the process of gaining a deep understanding of the business problem or
question that needs to be addressed through data analysis. This phase is a crucial step in any data science project, as
it lays the foundation for the entire analytical process.

The six main stages of the data science lifecycle, of which business understanding is the first, are:


 Business Understanding: In this stage, the business problem or opportunity is defined, and the goals and
objectives of the project are established.
 Data Understanding: This stage involves understanding the available data, including its quality,
completeness, and relevance to the project.
 Data Preparation: In this stage, the data is cleaned, transformed, and prepared for analysis. This may involve
data integration, data selection, and data transformation.
 Modelling: This stage involves selecting and applying appropriate analytical techniques to the data, such as
regression analysis, decision trees, or clustering.
 Evaluation: In this stage, the performance of the models is assessed using appropriate evaluation techniques,
such as cross-validation or holdout testing.
 Deployment: In this final stage, the insights and recommendations from the analysis are presented to
stakeholders and implemented in the business.

Some examples of business objectives:


 Revenue objectives: Maintaining consistent profitability is essential for any business, and revenue is what makes
that profitability possible. Measuring revenue is a great way to track the sustainability of a firm.
 Operational objectives: Operational objectives include making sure that the logistical elements of your
business are up to scratch. For instance, it might mean ensuring your
supplies will arrive from a manufacturer at the same time each
month. These objectives keep the company running smoothly.
 Productivity and performance: Employees are the lifeblood of a
business. Making sure that employees remain productive drives revenue and improves customer satisfaction.
Measuring employee satisfaction and setting goals for each team ensures efficiency and productivity.

2. Data Understanding:
Data understanding is a crucial phase in the data science lifecycle, and it involves gaining insights into the
data that will be used for analysis. This phase typically follows the initial data collection and precedes the data
preparation or preprocessing steps. The primary goal of data understanding is to familiarize data scientists with the
characteristics, structure, and content of the dataset.

Data understanding has four phases:


 Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
 Describe data: Examine the data and document its surface properties like data format, number of records, or
field identities.
 Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
 Verify data quality: How clean/dirty is the data? Document any quality issues.

By completing the Data Understanding phase, the stakeholders gain a deeper understanding of the data, its
quality, and its potential usefulness for the project. This knowledge is used to guide the next phase, Data Preparation,
where the data is transformed and prepared for modeling and analysis.
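
As a rough illustration of the describe, explore, and verify steps above, here is a minimal pandas sketch; the file name sales_data.csv is only a placeholder for whatever dataset the project uses.

import pandas as pd

# Collect initial data: load the dataset (placeholder file name).
df = pd.read_csv("sales_data.csv")

# Describe data: surface properties such as shape, field types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Explore data: relationships among the numeric fields.
print(df.select_dtypes("number").corr())

# Verify data quality: missing values per column and duplicate records.
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())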

3. Data Preparation:
A common rule of thumb is that 80% of the project is data preparation. This phase, which is often referred to
as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

 Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
 Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage in, garbage out. A
common practice during this task is to correct, impute, or remove erroneous values.
 Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index
from height and weight fields (a short sketch follows this list).
 Integrate data: Create new data sets by combining data from multiple sources.
 Format data: Re-format data as necessary. For example, you might convert string values that store numbers
to numeric values so that you can perform mathematical operations.
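
To illustrate the "Construct data" task mentioned above, here is a minimal pandas sketch; the column names and values are hypothetical.

import pandas as pd

# Hypothetical records with height in metres and weight in kilograms.
df = pd.DataFrame({"height_m": [1.65, 1.80, 1.72],
                   "weight_kg": [68.0, 85.0, 59.0]})

# Construct data: derive body mass index (BMI = weight / height^2) as a new attribute.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)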

4. Data Cleansing:
Data cleansing, also known as data cleaning or data scrubbing, is the process of identifying and correcting
errors, inconsistencies, and inaccuracies in datasets. The goal of data cleansing is to improve the quality of data by
removing or correcting errors that could adversely affect the accuracy and reliability of analysis, reporting, and
decision-making processes.

Imputation variants and techniques:

Imputation: Imputation is the process of replacing missing data values with estimated values based on the other
available data in a dataset. This is commonly done in statistical analysis and
machine learning to ensure the completeness of the dataset.

Variants
 Missing Completely At Random (MCAR).
 Missing at Random (MAR).
 Missing Not At Random (MNAR).

Techniques
o Deletion Methods (simple strategies):
 Case-wise deletion, also called list-wise deletion or complete case analysis.
 Pair-wise deletion, also called available case analysis.
o Single Imputation Methods (a runnable sketch follows this list):
 Mean imputation: SimpleImputer(missing_values=np.nan, strategy='mean') from sklearn.impute.
 Median imputation: SimpleImputer(missing_values=np.nan, strategy='median').
 Mode imputation: SimpleImputer(missing_values=np.nan, strategy='most_frequent').
 Random imputation.
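
A minimal sketch of the single-imputation strategies listed above, using scikit-learn's SimpleImputer; the feature matrix is hypothetical.

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix (age, salary) with missing values.
X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [35.0, np.nan],
              [40.0, 58000.0]])

# Mean imputation: replace each NaN with the column mean.
X_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(X)

# Median imputation: more robust to outliers than the mean.
X_median = SimpleImputer(missing_values=np.nan, strategy='median').fit_transform(X)

# Mode imputation: the most frequent value; also suitable for categorical columns.
X_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent').fit_transform(X)

print(X_mean)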

Effect Coding Scheme:


Effect coding, also known as deviation coding or contrast coding, is another technique for encoding
categorical variables in statistics and machine learning. Like dummy coding, it represents categorical variables with
indicator columns; however, in effect coding the reference category is coded as -1 in every column instead of 0,
making it particularly useful when you want to compare each category to the overall (grand) mean.
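
A minimal pandas sketch of effect coding; the "city" column and its values are hypothetical.

import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"city": ["Hyderabad", "Mumbai", "Delhi", "Mumbai", "Hyderabad"]})

# Start from dummy coding, dropping one reference category ("Delhi" here).
coded = pd.get_dummies(df["city"], prefix="city", drop_first=True).astype(int)

# Effect coding: rows belonging to the reference category are coded -1 in every column.
coded.loc[df["city"] == "Delhi", :] = -1

print(pd.concat([df, coded], axis=1))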

Discretization / Binning / Grouping:


Discretization, also known as binning or grouping, is a data preprocessing technique used to convert continuous
numerical data into discrete intervals or bins. This is done to simplify the data, make it more understandable, and
sometimes to deal with outliers or improve the performance of certain algorithms.
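
A minimal sketch of discretization with pandas; the age values are hypothetical.

import pandas as pd

# Hypothetical continuous variable.
ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70], name="age")

# Fixed-width bins with explicit edges and labels.
age_group = pd.cut(ages, bins=[0, 30, 50, 80], labels=["young", "middle", "senior"])

# Equal-frequency (quantile) bins: each bin holds roughly the same number of rows.
age_quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"age": ages, "group": age_group, "quartile": age_quartile}))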

5. Model Building:
Model building in data science involves the process of creating statistical models that help analyze and
interpret data. These models can make predictions and help identify relationships among variables in the data.
Common types of models used in data science include regression models, classification models, and clustering
models.

Some of the Key Tasks involved in Model Building:


 Model Selection: Based on the project goals and data characteristics identified in the Business
Understanding and Data Understanding phases, suitable modeling techniques are selected. For example,
linear regression, decision trees, or neural networks.
 Variable Selection: Based on the data exploration and feature engineering conducted in the Data
Understanding phase, the most relevant and informative variables are selected for inclusion in the model.

 Model Training: Using the selected modeling technique and the prepared data, a predictive model is
trained on a subset of the data known as the training set. The model is optimized by adjusting
hyperparameters to improve model performance (a minimal sketch follows this list).
 Model Evaluation: The performance of the model is evaluated on a separate subset of the data known as
the validation set. The model is tested for accuracy, precision, recall, and other metrics to determine how
well it performs on unseen data.
 Model Selection and Deployment: Based on the evaluation results, the best-performing model is
selected for deployment. The model is then deployed to a production environment where it can be used
to make predictions on new data.
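
A minimal scikit-learn sketch of these tasks, using a built-in dataset purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; in a real project this comes from the Data Preparation phase.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation set for later evaluation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection: a decision tree; max_depth is a hyperparameter to tune.
model = DecisionTreeClassifier(max_depth=4, random_state=42)

# Model training on the training set.
model.fit(X_train, y_train)

# Quick check of performance on unseen data.
print("Validation accuracy:", model.score(X_val, y_val))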

6. Evaluation:
In data science, evaluation is the process of assessing the quality and performance of predictive models,
algorithms, and systems. It involves measuring various metrics, such as accuracy, precision, recall, F1-score, and area
under the curve, to determine how well a model performs on new, unseen data. Evaluation is an important step in
the data science process, as it helps researchers and practitioners select the best model for a particular task and
improve the performance of their models over time.

 Ethical Evaluation: Evaluating models for ethical considerations, including bias and fairness assessments, to
ensure that models are not discriminatory and comply with ethical standards.
 Business Impact Evaluation: Assessing the impact of model predictions on business objectives, such as
revenue, customer satisfaction, or cost savings.
 Continuous Monitoring and Feedback Evaluation: Ongoing evaluation of models in production to ensure
they perform as expected and adapting models based on user feedback and changing data patterns.
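
A minimal sketch of computing the metrics named above with scikit-learn; the labels and predicted probabilities are hypothetical.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and model scores on unseen data.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.65, 0.8, 0.4, 0.3, 0.1, 0.55])
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))  # area under the curve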

7. Model Deployment:
Model deployment refers to the process of making a trained machine-learning model accessible and usable
in a production environment. This involves integrating the model into an existing software system, such as a website,
application, or service, so that it can be used to make predictions or decisions based on new data. There are various
tools and techniques available for model deployment, including containerization, cloud deployment, and on-premises
deployment.

List of steps to deploy the model:


 Prepare the model: We first save the trained model and its associated metadata in a format that can be
easily loaded into the production environment. This might involve using a library like joblib or pickle to
serialize the model and its parameters.
 Set up the production environment: We then set up the production environment, which might involve
configuring servers, databases, and other infrastructure components to support the model. For example, we
might use a cloud computing platform like AWS or GCP to host the model.
 Create an API: To make the model accessible from the production environment, we create an API that
defines the interface that other applications can use to interact with the model. For example, we might use a
web framework like Flask or Django to create a RESTful API that accepts input data (such as soil quality,
weather conditions, etc.) and returns predictions of crop yields (a minimal Flask sketch follows this list).

 Test the deployment: Before the model is deployed to a live production environment, we thoroughly test it
in a staging environment to ensure that it works as expected and can
handle the expected volume of traffic. We might use a testing
framework like pytest to automate the testing process and ensure
that the model meets our performance requirements.
 Deploy the model: Once the model has been tested and is ready for production, we deploy it to the live
environment. This might involve using a tool like Docker or Kubernetes to manage the deployment and
scaling of the application.
 Monitor and maintain the model: Once the model is deployed, it is important to continuously monitor its
performance and update it as needed to ensure that it continues to meet the needs of the application and
its users. For example, we might use a logging and monitoring service like AWS CloudWatch or Datadog to
track the model's performance and detect any issues or errors.
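
A minimal sketch of serving a serialized model behind a Flask API; the file name model.joblib and the JSON payload format are assumptions made for illustration.

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model serialized during training (hypothetical file name).
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)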

8. Monitoring and Maintenance:


Monitoring and maintenance in data science involve ongoing monitoring of data systems and infrastructure
to ensure their performance and reliability. This includes tasks such as data backups and restoration, system
upgrades, and software patching. Additionally, data scientists may monitor various metrics related to the
performance and health of their models to ensure they continue to meet the needs of the organization.

List of key activities involved in monitoring and maintenance:


 Performance monitoring: Regularly monitoring the performance of the model is essential to ensure that it is
functioning as expected. This might involve tracking key performance metrics such as accuracy, precision,
recall, and F1 score, and comparing these metrics against established thresholds or benchmarks.
 Data drift detection: Over time, the distribution of input data to the model may change, which can cause the
model's performance to degrade. To mitigate this risk, it is important to regularly monitor for data drift and
retrain or update the model as needed to ensure that it remains effective (see the drift-check sketch after this list).
 Model retraining: As new data becomes available or the underlying data distribution changes, it may be
necessary to retrain the model to maintain its accuracy and effectiveness. This might involve periodically
retraining the model using new data, or using techniques such as online learning to update the model in real
time as new data becomes available.
 Bug fixing and maintenance: Like any software application, deployed machine learning models may
experience bugs or require maintenance over time. Regularly testing and debugging the application, as well
as performing routine maintenance tasks such as software updates and database backups, can help ensure
that the application remains stable and reliable.
 Security monitoring: Deployed machine learning models may also be vulnerable to security threats such as
data breaches or attacks on the underlying infrastructure. Regularly monitoring the application for security
vulnerabilities and implementing appropriate security measures, such as encryption and access controls, can
help mitigate these risks.
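
A minimal sketch of a per-feature drift check using a two-sample Kolmogorov-Smirnov test; the training and production samples below are simulated purely for illustration.

import numpy as np
from scipy.stats import ks_2samp

# Simulated values of one feature at training time vs. in production.
rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=1000)
production_values = rng.normal(loc=0.4, scale=1.2, size=1000)

statistic, p_value = ks_2samp(training_values, production_values)
if p_value < 0.05:
    print(f"Possible data drift (KS statistic={statistic:.3f}, p-value={p_value:.3g})")
else:
    print("No significant drift detected for this feature")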

Tableau
Tableau is a user-friendly platform for visualizing, analyzing, and sharing data. Its aim is to turn raw data into
easy-to-understand insights.

The platform can be easily used by individual analysts or scaled across a large organization. Tableau makes it
simple for data from different individuals or departments to be combined or shared in one place. There are multiple
paid products produced by Tableau, each with unique services and customer segments:

Tableau Prep allows users to combine, shape, and clean their data from a variety of sources. Tableau Desktop
is the main platform for visualizing and analyzing data and turning them into interactive dashboards and reports.
Tableau Online allows you to publish your work online and share it with anyone you give access to. Tableau Server
gives you more control over who can see your work, and it is usually used to share data within an organization.
Tableau Public is a free service that allows you to create data visualizations from limited sources that must be
published to the public server.

Formatting and uploading data:

It is important to ensure that data that is obtained from external sources is of good quality. If you choose to select
data from online sources, choose websites that are reputable. Sites such as Statistics Canada or the World Health
Organization are great resources. In the example throughout this tutorial we selected data from The Canadian
Institute for Health Information.

You are able to select from various file types when working with Tableau.

Examples of acceptable data files include Excel files, PDFs, or text files. Data can also be uploaded from servers
such as Google Sheets. The data needs to be in the correct format before uploading to Tableau; the data you want to
use in Tableau should follow these guidelines.

 The data should be as granular as possible. This means that your data is detailed rather than just average
values.
 Ensure that there is no aggregated data (no total values).
 All extra titles and notes should be removed. This excludes data headers.
 Ensure that there are no blank cells or rows.

The data should follow database format, where it is row-oriented rather than column-oriented. Tableau is
optimized to work with row-oriented tables. This can be done either in Tableau or before you upload your data.
Here's an example of how to reformat the original data: in order to optimize the data for Tableau, the null data cells
need to be replaced with zeros, and the extra notes at the top must be deleted. Once all the data is in the correct
format, you can upload the file to Tableau. From the home screen, click on the corresponding file format in the blue
pane highlighted in the orange box.
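
If you prefer to reshape the data before uploading it, here is a minimal pandas sketch of the same steps; the column names and values are hypothetical.

import pandas as pd

# Hypothetical column-oriented table: one column per year.
wide = pd.DataFrame({"Region": ["East", "West"],
                     "2021": [120, None],
                     "2022": [150, 95]})

# Reshape to the row-oriented layout Tableau prefers (one measurement per row).
long = wide.melt(id_vars="Region", var_name="Year", value_name="Sales")

# Replace null cells with zeros before saving the file for Tableau.
long["Sales"] = long["Sales"].fillna(0)
long.to_csv("sales_for_tableau.csv", index=False)
print(long)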

A brief explanation of the topics in Tableau is listed below:

1. Introduction:
 Brief Overview of Tableau: Introduce Tableau as a powerful data visualization tool that helps users make
sense of their data through interactive and shareable dashboards.
 Importance of Data Visualization: Discuss the significance of
visualizing data for better understanding, analysis, and decision-
making.

2. Getting Started with Tableau:


 Download and Installation: Provide step-by-step instructions on how to download and install Tableau
Desktop, the primary authoring and publishing tool.
 Understanding the Tableau Interface: Explain the key components of the Tableau interface, including the
data pane, shelves, and cards.
 Connecting to Data Sources: Guide users on connecting Tableau to various data sources such as Excel,
databases, and cloud-based platforms.

3. Data Preparation:
 Importing Data into Tableau: Explore different methods of importing data into Tableau, including live
connections and data extracts.
 Data Cleaning and Transformation: Demonstrate techniques for cleaning and transforming data within
Tableau to ensure accurate and meaningful visualizations.
 Handling Null Values: Address strategies for handling null or missing values in datasets.

4. Exploring Basic Visualizations:


 Creating a Simple Bar Chart: Walk through the process of creating a basic bar chart using sample data.
 Line Charts for Time Series Data: Explore how to visualize time-based data using line charts.
 Pie Charts and Tree maps: Discuss the use cases and creation of pie charts and tree maps for categorical data
representation.

5. Advanced Chart Types:


 Heat Maps for Data Density: Explain the concept of heat maps and how to use them to represent data
density.
 Scatter Plots for Correlation Analysis: Demonstrate the creation of scatter plots to identify relationships
between two variables.
 Box Plots for Distribution Visualization: Introduce box plots for visualizing data distribution and identifying
outliers.
 Bullet Graphs for Performance Metrics: Discuss the use of bullet graphs for displaying key performance
indicators (KPIs).

6. Maps and Spatial Analysis:


 Geographic Mapping in Tableau: Cover the basics of creating geographical maps in Tableau using built-in
geocoding.
 Custom Geocoding and Map Layers: Explore advanced mapping features, including custom geocoding and
layering.
 Analyzing Spatial Data with Tableau: Discuss how Tableau can be used to analyze and visualize spatial data.

7. Dashboard Design Principles:


 Designing Effective Dashboards: Discuss principles of effective dashboard design, including layout, color
schemes, and typography.

 Best Practices for Layout and Formatting: Provide tips for arranging elements on a dashboard and formatting
them for clarity.
 Incorporating Interactivity: Showcase how to add interactive
elements to dashboards for a dynamic user experience.

8. Calculations and Expressions:


 Understanding Tableau Calculations: Explain the basics of calculations in Tableau and how they can be used
to create new fields.
 Writing Basic Formulas: Provide examples of basic calculations and formulas for common scenarios.
 Level of Detail (LOD) Expressions: Introduce LOD expressions for more advanced and flexible calculations.

9. Parameters and Filters:


 Utilizing Parameters for Dynamic Dashboards: Explore the use of parameters to create dynamic and user-
controlled dashboards.
 Filtering Data for Improved Analysis: Explain how to apply filters to focus on specific subsets of data.
 Combining Multiple Filters: Discuss strategies for combining multiple filters to refine data further.

10. Tableau Server and Tableau Online:


 Deploying Tableau Server: Provide an overview of deploying Tableau Server for collaborative work within
organizations.
 Collaborative Work with Tableau Online: Explore the features of Tableau Online for sharing and
collaborating on Tableau workbooks.
 Security and Access Control: Discuss best practices for securing Tableau Server and managing user access.

11. Integrating Tableau with Other Tools:


 Connecting Tableau with Excel: Explain the process of connecting Tableau with Excel for seamless data
integration.
 Using Tableau with R and Python: Introduce integration with statistical and programming languages for
advanced analytics.
 Embedding Tableau Visualizations in Web Pages: Discuss how Tableau visualizations can be embedded in
websites and applications.

12. Tips for Performance Optimization:


 Extracts vs. Live Connections: Compare the advantages and disadvantages of using data extracts versus live
connections.
 Optimizing Workbook Size: Provide tips for optimizing Tableau workbooks to improve performance.
 Efficiently Handling Large Datasets: Strategies for efficiently handling and visualizing large datasets in
Tableau.

13. Real-world Case Studies:


 Industry-specific Use Cases: Explore how Tableau is used in various industries, including finance, healthcare,
and marketing.
 Success Stories and Best Practices: Showcase success stories of organizations that have effectively utilized
Tableau for data-driven decision-making.
 Lessons Learned from Tableau Implementations: Highlight common challenges and lessons learned from
real-world Tableau implementations.

14. Community and Resources:

 Joining the Tableau Community: Encourage readers to join the Tableau community for networking and
learning.
 Online Forums and Knowledge Sharing: Provide information on
online forums and platforms where Tableau users share knowledge
and experiences.
 Recommended Reading and Further Learning: Suggest books, blogs, and
other resources for readers looking to deepen their Tableau expertise.

15. Future Trends in Tableau:


 Machine Learning Integration: Discuss emerging trends in integrating machine learning capabilities within
Tableau.
 Enhanced Collaboration Features: Explore future developments in collaborative features and sharing
capabilities.
 Cloud-based Innovations: Discuss how Tableau is evolving with cloud-based innovations for increased
flexibility and scalability.

POWER BI
Introduction:
Power BI is a suite of business analytics tools which connects to different data sources to analyze data and
share insights throughout your organization.

Parts of Power BI:

There are 3 parts of Power BI:


 Power BI Desktop.
 Power BI Service.
 Power BI Mobile.

Power BI Desktop: It is a Windows desktop application (report authoring tool) which lets you build queries, models,
and reports that visualize data.

Power BI Service: Power BI Service is a cloud-based Software as a Service (SaaS) application which allows us to create
dashboards, set up scheduled data refreshes, and share reports securely within the organization.

Power BI Mobile: It is an application for mobile devices which allows you to interact with the reports and dashboards
from the Power BI Service.

The flow of work in Power BI:
A common flow of work in Power BI begins in Power BI Desktop,
where a report is created. That report is then published to the Power BI
service, and then shared so users of Power BI Mobile apps can consume
the information.

It doesn’t always happen that way, and that’s okay, but we’ll use that flow to help you learn the various parts
of Power BI, and how they complement one another.

Power BI Desktop:
Power BI Desktop is a report authoring tool that allows you to create reports and queries, extract,
transform, and load (ETL) data from data sources, and model the queries.

Power BI Desktop Interface:


The Report view has five main areas:

Ribbon: The Ribbon displays common tasks associated with reports and visualizations;
Pages: The Pages tab area along the bottom allows you to select or add a report page;
Visualizations: The Visualizations pane allows you to change visualizations, customize colors or axes, apply filters,
drag fields, and more;
Fields: The Fields pane allows you to drag and drop query elements and filters onto the Report view, or drag them to the
Filters area of the Visualizations pane;
Views Pane: There are three types of views in the Views pane:
 Reports View – allows you to create any number of report pages with visualizations.
 Data View – allows you to inspect, explore, and understand data in your Power BI Desktop model.
 Relationship or Model view – allows you to show all of the tables, columns, and relationships in your model.

Getting Started with Power BI:


Power BI's ecosystem comprises two primary components: Power BI Desktop and Power BI Service. Power BI
Desktop serves as the desktop application for designing reports, while Power BI Service enables collaboration and
sharing in the cloud. To kick-start your journey, download and install Power BI Desktop and create an account on the
Power BI Service to experience the seamless integration of these two
environments.

Connecting to Data Sources:


Power BI offers versatile connectivity, allowing you to import data from various sources. From traditional
Excel and CSV files to databases like SQL Server and MySQL, Power BI seamlessly integrates with your preferred data
repositories. Additionally, harness the capabilities of cloud-based sources such as Azure and SharePoint for real-time
data access and analysis.

Transforming Data:

Once data is imported, the Power Query Editor becomes your ally for data transformation. Cleanse and
shape your data using a range of transformation techniques. Handle missing data, remove duplicates, and perform
advanced manipulations to ensure your dataset is optimized for analysis.

Data Modeling in Power BI:


Data modeling is the backbone of effective analysis in Power BI. Understand the relationships between
tables, create calculated columns, and dive into the world of DAX, the language that powers dynamic data
calculations. These foundational concepts are crucial for building robust, interactive reports.

Creating Visualizations:
With your data modelled, it's time to visualize. Power BI offers a rich array of visualization options, including
charts, graphs, and maps. Learn how to customize visuals, adjust formatting, and employ themes and templates for
consistent and visually appealing design. Unlock the potential of storytelling through compelling data visuals.

Building Dashboards:
Dashboards consolidate your visuals into a cohesive narrative. Explore the use of slicers and filters to add
interactivity, allowing users to focus on specific aspects of the data. Utilize drill-through options for a deeper, more
detailed analysis, enhancing the overall user experience.

Power BI publishing and sharing:


Effortlessly share your insights with colleagues by publishing reports to the Power BI Service. Configure
sharing settings and permissions to control access. Collaborate in real-time and embed reports on websites and
applications for wider dissemination.

Power BI Mobile app:


Stay connected to your data on the move with the Power BI mobile app. Access reports and dashboards
anytime, anywhere, ensuring a consistent and user-friendly experience across devices. Explore the app's
functionalities to optimize your mobile analytics experience.

Advanced features:

Take your Power BI skills to the next level by exploring advanced features. Power BI Premium offers
enhanced capabilities, while datasets and data flows provide more flexibility
in managing your data. Incorporate artificial intelligence features to derive
even deeper insights from your datasets.
