
A Beginners Guide To

DATA SCIENCE

How to dive into the data ocean without drowning

ENAMUL HAQUE
All rights reserved. This book or any portion thereof may not be reproduced or used in any
manner whatsoever without the publisher's express written permission except for the use of
brief quotations in a book review or scholarly journal.

COPYRIGHT © 2021 ENAMUL HAQUE


All rights reserved
Enel Publications
London, UK
Amazon Kindle Direct Publishing

First Printing Edition, April 2021 (Revision 1)


ISBN 9798731261074
CHAPTER TWO:
THERE IS NO BUSINESS LIKE DATA BUSINESS
“For me, the thing about data science that makes it so exciting to the modern world is its
unparalleled ubiquity — data science is everywhere. It is ultimately just a set of skills derived from
computer science and mathematics, and this set of skills can be universally applied to learn from the
past and improve future performances in any discipline you can think of. That’s what makes data
science so relevant: its enormous scope and potential to improve life across a wide variety of sectors. I’m
excited to think about a future where data-driven decisions become more and more commonplace all
around the world.” - Vishnu Subramanian, Founder @ Jarvislabs.ai, a 1-click GPU cloud platform
Data Science Business Strategy

It's essential to understand precisely how to manage your data life cycle and maintain the corporate data model. You need to understand what data the company has in order to develop the right strategy for working with it; only then will the data be valuable to the business. In parallel with the first pilot projects, a roadmap for a data science platform is developed and refined, covering the storage platform and approaches to working with machine learning models.

First pilot projects


At this stage, pilot machine learning models are built, including recommendation and evaluation models. During a pilot's implementation, recommendations are made on whether machine learning is appropriate for the task and on possible ways to improve the proposed models' quality. Pilots are assessed in terms of their potential economic impact and implementation complexity, which also determines how tasks are prioritised. Developing pilots clarifies the requirements for the future ecosystem of data and models: data and IT specialists become immersed in the specifics of production, and production staff are introduced to new tools.

Introducing data lake as one of the first steps


Cheaper and more flexible than traditional data warehouses, data lakes allow for rapid data processing through analytics and machine learning. The concept lets you start accumulating data before specific tasks are defined, which in turn allows historical data to be used for machine learning models. Deploying a storage layer and loading the available history leads to faster testing and introduction of new models.

Accumulation of data
Data should be accumulated in any case, even if it is not yet used in existing models. It is important to organise the storage space and apply at least minimal structuring; otherwise, the "lake" will turn into a useless "swamp". In addition, the data lake needs to be linked to the company's analytical ecosystem and to information security: it must not leak data or cause problems with regulators.
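To make "minimal structuring" concrete, here is a minimal sketch of landing raw event data in a lake as date-partitioned Parquet files with pandas; the file paths and column names are assumptions, not taken from the book.

```python
# A minimal sketch of landing raw data in a lake with light structure;
# paths and column names are hypothetical.
import pandas as pd

events = pd.read_json("raw_events.json")  # raw feed from a source system

# Partitioning by date is the light structuring that keeps the lake
# from turning into a swamp: files stay discoverable and queryable.
events.to_parquet(
    "datalake/events/",
    partition_cols=["event_date"],
    index=False,
)
```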
Figure 1 - A data lake is a centralised repository that allows you to store all your structured and unstructured data
at any scale.

Create a layer of models and put them into production


You don't have to wait for a long accumulation and time-consuming structuring of data to use the lake. Using a data lake and modern virtualisation technologies, the layer for models can be deployed quickly and models created in the target architecture. As technology and the composition of the data change over time, a model's quality can degrade, so it must be modified or replaced with a new one. Over time there may be several models, each more or less effective in different situations. Therefore, alongside the development of the models themselves, tools are created to manage their life cycle.

Figure 2 - Data modelling creating opportunities

Comprehensive data work, machine learning initiatives, and process digitisation enable any manufacturing company to become more efficient. They provide an opportunity to maximise profits by reducing production costs, making specialists' work easier and faster, improving production safety, reducing overspending on raw materials, and optimising equipment maintenance. In the long term, they pave the way for a transition to fully autonomous production.
Approaches and Methods of Social Media Data
Analysis

Social media is a rich source of data, and it is vital to be able to work with that data effectively. Let's look at a few features of social media data and approaches to working with it. It is worth noting that there is a separate field, social mining: the application of data mining methods and algorithms to find and detect dependencies and knowledge in social networks (or in any domain where data can be represented as networks/graphs). Its applications are quite broad.
In general, almost all practical tasks of analysing social media data reduce to the following basic ones:
• Analysis of social network information flows, structure and metrics
• Analysis of the tone of messages (emotional colouring)
• Analysis and extraction of topics (what is being written about on social networks)
• Image analysis
There are also combinations of these tasks.

Info stream analysis


This class of methods makes it possible to identify opinion leaders in social networks, manage media campaigns, and evaluate users' attitudes toward particular information. The typical tasks here are:
• Search for the objects that communicate the most.
• Search for the objects with the most connections.
• Search for the most "authoritative" objects.
• Search for objects that serve as bridges between communities.
The most commonly used tool for analysis and visualisation in this area is a graph whose nodes (actors) are people or groups and whose edges represent relationships (connections) or information flows between nodes.
One of the most essential tasks in social network analysis is finding the "important" (from different points of view) participants of the social graph. To do this, researchers calculate different types of metrics:
• Degree centrality: based on the number of connected nodes; highlights participants with many connections and is useful for identifying opinion leaders.
• Closeness centrality: how close a participant is to everyone else on the network; most often used when searching for influence groups and "grey cardinals".
• Betweenness centrality: the number of shortest paths passing through a participant; shows how often information passes through that person.
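To make these metrics concrete, here is a minimal sketch that computes all three with the NetworkX library; the graph itself, its node names, and its edges are invented for illustration.

```python
# A minimal sketch of the three centrality metrics using NetworkX;
# the graph is hypothetical.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("dave", "erin"), ("erin", "frank"),
])

degree = nx.degree_centrality(G)            # many direct connections
closeness = nx.closeness_centrality(G)      # close to everyone else
betweenness = nx.betweenness_centrality(G)  # sits on many shortest paths

for node in G:
    print(f"{node}: degree={degree[node]:.2f}, "
          f"closeness={closeness[node]:.2f}, "
          f"betweenness={betweenness[node]:.2f}")
```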
Tonality analysis
This class of methods allows you to evaluate users' attitudes toward certain information (an object, a person, an event, etc.). The tasks solved here are assessing the emotional colouring of messages, highlighting named entities, and assessing their emotional colouring.
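As a sketch of message-level tonality scoring, the example below uses NLTK's VADER analyser, one common off-the-shelf choice; the messages themselves are invented.

```python
# A minimal sketch of sentiment (tonality) scoring with NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for message in ["I love this new phone!", "Worst support I have ever had."]:
    scores = sia.polarity_scores(message)
    print(message, "->", scores["compound"])  # -1 (negative) to +1 (positive)
```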

Theme analysis
This class of methods identifies the most popular and most frequently discussed topics in a community (at a particular time). Tasks solved: highlighting topics (topic modelling), assessing emotional colouring by topic, and highlighting entities related to a topic.
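As an illustration of topic modelling, here is a minimal sketch using scikit-learn's latent Dirichlet allocation on a tiny invented corpus of posts.

```python
# A minimal sketch of topic modelling with scikit-learn's LDA;
# the corpus of posts is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "new phone camera is amazing",
    "election debate tonight on tv",
    "phone battery drains too fast",
    "candidates argue over the economy",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)          # word-count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:]]  # top words per topic
    print(f"Topic {i}:", ", ".join(top))
```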

Image analysis
This class of methods identifies what kinds of photo content different segments of users post. Tasks solved: recognising the kind of object in a photo, the type of location in an image, and people's emotions, as well as verification and identification (comparing a person found in a physical location with their profile on the social network).
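As an illustration of the first task, recognising the object in a photo, here is a minimal sketch using a pre-trained torchvision classifier; "photo.jpg" is a hypothetical input file.

```python
# A minimal sketch of classifying the object in a photo with a
# pre-trained torchvision model; "photo.jpg" is hypothetical.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()        # resizing/normalisation for the model

image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)   # add a batch dimension

with torch.no_grad():
    probs = model(batch).softmax(dim=1)
top = probs.argmax(dim=1).item()
print(weights.meta["categories"][top])   # predicted object class
```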
If the task is aimed at analysing a particular person, the main directions are:
• Personalisation of offers
• Analysis of the structure of the person's social network
• Analysis of the person's content on the social network
Personalisation of offers allows you to provide the user with the content most relevant to them. Tasks: collecting and enriching user information; clustering and segmenting users; classifying users based on the built model; personalised provision of information.
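As a sketch of the clustering-and-segmentation step, the example below groups users with k-means; the two behavioural features are invented.

```python
# A minimal sketch of user segmentation with k-means;
# the behavioural features are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

users = pd.DataFrame({
    "posts_per_week":  [1, 2, 30, 25, 0, 1],
    "avg_session_min": [5, 8, 60, 45, 2, 4],
})

X = StandardScaler().fit_transform(users)  # put features on one scale
users["segment"] = KMeans(n_clusters=2, n_init=10,
                          random_state=0).fit_predict(X)
print(users)  # e.g. casual vs heavy users
```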
What does Google know about you? You can find information about yourself here:
google.com/settings/ads
Current and promising research in the field of social media analysis
• Semi-supervised learning on social media
• Social media sustainability and design
• Predicting the spread of information on social networks
• A synergy of spatial data and social media data
Data Science Project Management

As the volume of data increases day by day across all areas and industries, it is essential for any company or domain to understand that data and use it appropriately in order to grow. No business wants its growth to slow while it does not know the root of a problem, how to solve it, or how to develop further. Yet when we talk about data science projects, it often seems that no one can clearly explain how the whole process works, from data collection through analysis to presentation of results. In the previous section, we saw the data science life cycle; now we will apply it to a data science project.

Problem statement
There are two kinds of outcome in a problem statement, and the first step is to dive into the problem and decide which one you are facing: you need to know whether the goal for this data is a categorical or a numerical decision. For example, your problem statement might be whether a drug has shown the desired results, whether customers are satisfied with a newly released product, or whether sales will rise or fall in the future. These have definite answers: simply yes or no, possible or not. If instead your job is to predict the future sales price, home prices, or the required dosage, the answers are numerical values based on the data provided. So, first, identify the type of problem, then find the best solution for it.

Understanding data or business


Problems arise in different areas and domains, and understanding the terminology and having experience in the relevant domain helps us find a better solution. Many insights follow from business understanding or domain knowledge alone.
Figure 3 - Data science project deliverables

Collecting data
Now data processing begins: data is collected from various sources and placed in a specific location (a database). All the data required to solve the problem is gathered.
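A minimal sketch of this step might pull two hypothetical sources into one SQLite database with pandas; the file names and the URL are assumptions.

```python
# A minimal sketch of gathering data from two hypothetical sources
# into one SQLite database.
import sqlite3
import pandas as pd

conn = sqlite3.connect("project.db")

sales = pd.read_csv("sales_export.csv")  # a local CSV export
sales.to_sql("sales", conn, if_exists="replace", index=False)

customers = pd.read_json("https://example.com/api/customers.json")  # a JSON feed
customers.to_sql("customers", conn, if_exists="replace", index=False)

conn.close()
```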

Data cleansing
The collected data is properly loaded and checked for missing values, anomalies, and the shape of its distribution. The data is then cleaned and fully processed.
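A minimal sketch of typical cleaning steps with pandas follows; the file and column names are hypothetical.

```python
# A minimal sketch of common cleaning steps with pandas;
# the file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales_export.csv")

df = df.drop_duplicates()                               # remove duplicate rows
df["price"] = df["price"].fillna(df["price"].median())  # impute missing prices

# Drop extreme outliers more than 3 standard deviations from the mean
z = (df["quantity"] - df["quantity"].mean()) / df["quantity"].std()
df = df[z.abs() <= 3]

print(df.isna().sum())  # verify what is still missing
df.to_csv("sales_clean.csv", index=False)
```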

Exploratory data analysis


Once all the data is cleaned and the unnecessary parts have been removed, leaving only what is needed, the data is analysed and studied along with all of its statistics.
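A first exploratory pass might look like this minimal pandas sketch; the file and column names are hypothetical.

```python
# A minimal sketch of a first exploratory pass over the cleaned data.
import pandas as pd

df = pd.read_csv("sales_clean.csv")
print(df.describe())                # summary statistics for numeric columns
print(df["region"].value_counts()) # distribution of a categorical column
print(df.corr(numeric_only=True))  # pairwise correlations
```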

Data visualisation
Since most of the collected data is now cleaned up, explored, and well understood, it can be presented visually with charts, built with libraries such as Matplotlib or Seaborn in Python, or created in Tableau or other visualisation software. In this way, insights are extracted as clear images that anyone can see and that can be explained well.
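Here is a minimal sketch of two standard charts with Matplotlib; the data file and column names are hypothetical.

```python
# A minimal sketch of two standard charts with Matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_clean.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["price"].hist(bins=30, ax=ax1)                 # distribution of prices
ax1.set_title("Price distribution")
df.plot.scatter(x="quantity", y="price", ax=ax2)  # relationship between columns
ax2.set_title("Quantity vs price")
plt.tight_layout()
plt.show()
```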

Feature design and selection


Apply statistical techniques, dimensionality reduction, or other methods as appropriate to derive useful new columns from existing ones, and provide only the data you need here and nothing else; otherwise, there is a risk of misinterpretation.
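A minimal sketch of deriving and selecting features with scikit-learn; all column names are assumptions.

```python
# A minimal sketch of feature engineering and selection with scikit-learn;
# all column names are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

df = pd.read_csv("sales_clean.csv")
X = df[["price", "quantity", "discount"]].assign(
    total=lambda d: d["price"] * d["quantity"])  # a new derived column
y = df["revenue"]

# Keep only the 2 features most related to the target...
X_selected = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)

# ...or compress correlated features into 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
```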
Model building
During the model-building phase, the data is split into two parts, one used for training and the other for validation, because if you evaluate on the same data you trained on, the model is likely to overfit (memorising the training data instead of learning the underlying patterns). Machine learning comes in different types used in different ways depending on the data and the requirements: supervised, unsupervised, and reinforcement learning. The required models are implemented, and the best model is selected.
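The split-train-validate pattern might look like this minimal scikit-learn sketch; the data file and columns are hypothetical.

```python
# A minimal sketch of the split-train-validate pattern with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("sales_clean.csv")
X = df[["price", "quantity", "discount"]]
y = df["revenue"]

# Hold out 20% of the rows so the model is judged on data it never saw
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)               # learn only from the training split

preds = model.predict(X_val)              # predict on unseen data
print(mean_absolute_error(y_val, preds))  # validation error
```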

Customisation and model selection


We do not know in advance which model is suitable and the right one to choose. So, after building the models, they are evaluated and further tuned with different parameters, and the model that has proven itself best is selected.
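Tuning can be sketched with scikit-learn's GridSearchCV; the parameter grid is an assumption, and X_train/y_train continue from the previous sketch.

```python
# A minimal sketch of comparing hyperparameter settings with GridSearchCV;
# the parameter grid is hypothetical.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

print(search.best_params_)              # the settings that performed best
best_model = search.best_estimator_
```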

Figure 4 - Data science technology stack

Deployment and feedback


The required machine learning model has been selected and is now deployed. This can be done in many different ways, and there are many tools for it, such as Flask, Django, AWS, and Google Cloud. Once deployed, it is used by the company or its customers and feedback is collected; if it works well, the problem is solved; otherwise, it goes back to the data science team for further improvement, and the cycle of rechecking repeats.
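Deployment with Flask, one of the tools named above, might look like this minimal sketch; the model file, route, and JSON field name are assumptions.

```python
# A minimal sketch of serving a trained model behind a Flask endpoint;
# the model file, route, and JSON field name are hypothetical.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("best_model.joblib")  # model saved after training

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[12.5, 3, 0.1]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```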
