
Hamza Zahoor (WhatsApp 0341-8377-917)

NOTES

SUBJECT: DATA SCIENCE
CLASS: BSCS 6th Semester
WRITTEN BY: (CR) KASHIF MALIK

OUTLINE: DATA SCIENCE

PAST PAPER DATA SCIENCE



INTRODUCTION TO DATA SCIENCE


What is Data Science?

Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data, processed using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is often called the future of artificial intelligence.

In short, we can say that data science is all about:

• Asking the correct questions and analyzing the raw data.


• Modeling the data using various complex and efficient
algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and derive the final results.

Example:
Suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will get us to the destination fastest, which route will have no traffic jam, and which will be cost-effective. All these decision factors act as input data, and we get an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.

Need for Data Science:

Some years ago, data was smaller in volume and mostly available in a structured form, which could be easily stored in Excel sheets and processed using BI tools.
But in today's world, data has become so vast, with approximately 2.5 quintillion bytes of data generated every day, that it has led to a data explosion. Researchers estimated that by 2020, 1.7 MB of data would be created every single second for every person on earth. Every company requires data to work, grow, and improve its business.
Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze this data, we need complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science. Following are some main reasons for using data science technology:
• With the help of data science technology, we can convert
the massive amount of raw and unstructured data into
meaningful insights.
• Data science technology is being adopted by various companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, are using data science algorithms for a better customer experience.
• Data science is helping to automate transportation, such as creating self-driving cars, which are the future of transportation.
• Data science can help with different kinds of predictions, such as surveys, elections, flight ticket confirmation, etc.

Types of Data Science Jobs


If you learn data science, you get the opportunity to pursue various exciting job roles in this domain. The main job roles are given below:
• Data Scientist
• Data Analyst
• Machine learning expert
• Data engineer
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager

Data Science Components:



1. Statistics:
Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise:
In data science, domain expertise binds data science together.
Domain expertise means specialized knowledge or skills of a
particular area. In data science, there are various areas for which
we need domain experts.
3. Data engineering:
Data engineering is a part of data science which involves acquiring, storing, retrieving, and transforming the data. Data engineering also includes adding metadata (data about data) to the data.
4. Visualization:
Data visualization means representing data in a visual context so that people can easily understand its significance. Data visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing:
Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.

6. Mathematics:
Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
7. Machine learning:
Machine learning is the backbone of data science. It is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.

Tools for Data Science


Following are some tools required for data science:
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R
Studio, MATLAB, Excel, RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop,
Informatica/Talend, AWS Redshift
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML studio.

• DATA SCIENCE LIFE CYCLE AND PROCESS

Data Science is the amalgamation of two fields – data and science. Data is any real or imaginary thing, and science is the systematic study of the world, both physical and natural. So data science is a systematic study of data and the derivation of knowledge, using testable methods, to make predictions about the universe. In simple words, it is applying science to data that may be of any size and from any source.
Data has become the new oil that drives businesses today. That's why understanding the data science project life cycle is crucial. As a Data Scientist, Machine Learning Engineer, or Project Manager, you must be aware of the important steps. A data science course will help you get a clear understanding of the entire data science lifecycle.

What is a Data Science Life Cycle?


A data science lifecycle indicates the iterative steps taken to build, deliver, and maintain any data science product. Not all data science projects are built the same, so their life cycles vary as well. Still, we can picture a general lifecycle that includes some of the most common data science steps. A general data science lifecycle process includes the use of machine learning algorithms and statistical practices that result in better prediction models. Some of the most common steps involved in the entire process are data extraction, preparation, cleansing, modeling, evaluation, etc. The world of data science refers to this general process as the "Cross-Industry Standard Process for Data Mining" (CRISP-DM).
We will go through these steps individually in the subsequent sections and understand how businesses execute them throughout data science projects. But before that, let us take a look at the data science professionals involved in any data science project.
Who Is Involved in the Projects?

• Domain Expert:
Data science projects are applied in different real-life domains or industries, such as banking, healthcare, the petroleum industry, etc. A domain expert is a person who has experience working in a particular domain and knows it inside and out.

• Business analyst:
A business analyst is required to understand the business needs
in the domain identified. The person can guide in devising the
right solution and timeline for the same.
• Data Scientist:
A data scientist is an expert in data science projects, has experience working with data, and can work out what data is needed to produce the required solution.
• Machine Learning Engineer:
A machine learning engineer can advise on which model should be applied to get the desired output and can devise a solution that produces the correct and required output.

• Data Engineer and Architect:
Data architects and data engineers are the experts in the modeling of data. Visualization of data for better understanding, as well as storage and efficient retrieval of data, are looked after by them.
The Lifecycle of Data Science
The major steps in the life cycle of a Data Science project are as
follows:
1. Problem identification
This is a crucial step in any data science project. The first thing is understanding in what way data science is useful in the domain under consideration and identifying appropriate tasks that are useful for the same. Domain experts and data scientists are the key persons in problem identification. The domain expert has in-depth knowledge of the application domain and knows exactly what problem is to be solved. The data scientist understands the domain and helps in identifying the problem and possible solutions.

2. Business Understanding
Understanding what the customer exactly wants from the business perspective is nothing but business understanding. Whether the customer wishes to make predictions, improve sales, minimize losses, or optimize a particular process forms the business goals. During business understanding, two important steps are followed:
KPI (Key Performance Indicator)
For any data science project, key performance indicators define the performance or success of the project. There needs to be an agreement between the customer and the data science project team on the business-related indicators and the related data science project goals. Depending on the business need, the business indicators are devised, and the data science project team then decides the goals and indicators accordingly. To better understand this, let us see an example. Suppose the business need is to optimize the overall spending of the company; then the data science goal may be to use the existing resources to manage double the clients. Defining the key performance indicators is crucial for any data science project, as the cost of the solution will differ for different goals.
SLA (Service Level Agreement)
Once the performance indicators are set, finalizing the service level agreement is important. The service level agreement terms are decided as per the business goals. For example, an airline reservation system may require simultaneous processing of, say, 1000 users; the requirement that the product must satisfy this service level is part of the service level agreement.
Once the performance indicators are agreed upon and the service level agreement is completed, the project proceeds to the next important step.
3. Collecting Data
Data collection is an important step, as it forms the base for achieving the targeted business goals. There are various ways the data can flow into the system, as shown in Figure 2.

The basic data collection can be done using surveys. Generally, the data collected through surveys provides important insights. Much of the data is collected from the various processes followed in the enterprise. At various steps, data is recorded in the various software systems used in the enterprise, which is important for understanding the process followed from product development to deployment and delivery. The historical data available through archives is also important for better understanding the business. Transactional data also plays a vital role, as it is collected on a daily basis. Many statistical methods are applied to the data to extract the important information related to the business. In a data science project the major role is played by data, so proper data collection methods are important.

4. Pre-processing data
Large amounts of data are collected from archives, daily transactions, and intermediate records. The data is available in various formats and forms. Some data may also be available in hard-copy formats. The data is scattered across various places and servers. All this data is extracted and converted into a single format and then processed. Typically, a data warehouse is constructed, where the Extract, Transform, and Load (ETL) operations are carried out. In a data science project, this ETL operation is vital. The data architect's role is important at this stage; they decide the structure of the data warehouse and perform the steps of the ETL operations.
5. Analyzing data
Now that the data is available and ready in the required format, the next important step is to understand the data in depth. This understanding comes from analyzing the data using the various statistical tools available. A data engineer plays a vital role in the analysis of data. This step is also called Exploratory Data Analysis (EDA). Here the data is examined by formulating various statistical functions, and dependent and independent variables or features are identified. Careful analysis of the data reveals which data or features are important and what the spread of the data is. Various plots are utilized to visualize the data for better understanding. Tools like Tableau and Power BI are popular for performing Exploratory Data Analysis and visualization. Knowledge of data science with Python and R is important for performing EDA on any type of data.
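To make this concrete, here is a minimal, hypothetical EDA sketch in Python with pandas; the file name sales.csv and the column revenue are assumptions for illustration, not part of these notes.

import pandas as pd

# Load a dataset (hypothetical file and column names)
df = pd.read_csv("sales.csv")

# Basic structure: rows/columns, types, and missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Summary statistics show the spread of each numeric feature
print(df.describe())

# Relationships between numeric variables
print(df.corr(numeric_only=True))

# Quick visual check of one feature's distribution (needs matplotlib)
df["revenue"].hist(bins=20)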
6. Data Modelling
Data modelling is the important next step once the data is analysed and visualized. The important components are retained in the dataset, and thus the data is further refined. Now the important decisions are how to model the data and which tasks are suitable for modelling. Which task is suitable, such as classification or regression, depends on what business value is required. Within these tasks, many ways of modelling are available. The machine learning engineer applies various algorithms to the data and generates the output. While modelling the data, the models are often first tested on dummy data similar to the actual data.
7. Model Evaluation / Monitoring
As there are various ways to model the data, it is important to decide which one is effective. For that, the model evaluation and monitoring phase is crucial. The model is now tested with actual data. The data may be limited, and in that case the output is monitored for improvement. The data may change while the model is being evaluated or tested, and the output can change drastically depending on the changes in the data. So, while evaluating the model, the following two phases are important:

Data Drift Analysis
Changes in the input data are called data drift. Data drift is a common phenomenon in data science, as the data changes depending on the situation. The analysis of this change is called data drift analysis. The accuracy of the model depends on how well it handles this data drift. The changes in data are mostly due to changes in the statistical properties of the data.
Model Drift Analysis
Machine learning techniques can be used to discover data drift. More sophisticated methods, such as Adaptive Windowing and Page-Hinkley, are also available. Model drift analysis is important because, as we all know, change is constant. Incremental learning can also be used effectively, where the model is exposed to new data incrementally.
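As a rough illustration of detecting data drift, the sketch below compares a feature's distribution at training time against newly arriving data using a two-sample Kolmogorov-Smirnov test from SciPy; the synthetic arrays and the 0.05 threshold are assumptions, and real drift analysis usually checks many features.

import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature values seen at training time vs. in production
train_values = np.random.normal(loc=50, scale=5, size=1000)
live_values = np.random.normal(loc=55, scale=5, size=1000)  # shifted mean simulates drift

# Two-sample KS test: a small p-value suggests the distributions differ
statistic, p_value = ks_2samp(train_values, live_values)
if p_value < 0.05:
    print("Possible data drift detected, p =", round(p_value, 4))
else:
    print("No significant drift detected")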
8. Model Training
Once the task and the model are finalized and the data drift analysis is done, the next important step is to train the model. The training can be done in phases, where the important parameters can be further fine-tuned to get the required accurate output. The model is then exposed to the actual data in the production phase, and its output is monitored.
9. Model Deployment
Once the model is trained with the actual data and the parameters are fine-tuned, the model is deployed. Now the model is exposed to real-time data flowing into the system, and output is generated. The model can be deployed as a web service or as an embedded application on an edge device or in a mobile application. This is a very important step, as the model is now exposed to the real world.
10. Deriving insights and generating BI reports
After model deployment in the real world, the next step is to find out how the model is behaving in a real-world scenario. The model is used to get insights that aid in strategic decisions related to the business. The business goals are bound to these insights. Various reports are generated to see how the business is doing. These reports help in finding out whether the key performance indicators are being achieved or not.
11. Taking a decision based on insight
For data science to do wonders, every step indicated above has to be done very carefully and accurately. When the steps are followed properly, the reports generated in the above step help in making key decisions for the organization. The insights generated help in taking strategic decisions; for example, the organization can predict in advance that there will be a need for raw materials. Data science can be of great help in making many important decisions related to business growth and better revenue generation.

• BUILDING DATA PRODUCTS


What are Data Products?
On the surface, the simple answer is "the use of data for decision making or problem solving". That answer, however, leaves us looking for the context, tangible examples, and implications that this section exists to address.

The First Data Product: Analytics


Reports and dashboards have long been essential to data teams.
With accurate data, decisions related to revenue can yield
significantly higher results compared to relying on gut feel
alone. Data teams, including data engineers, data scientists, and
data analysts, are frequently tasked with building reports and
dashboards to address common questions and enable data-driven
decision-making; an example of this would be product analytics
to decide upcoming roadmap work.

These are common questions either you or others have asked and
tasked data teams with building out. You’ll likely find that these
questions will come up again and again, increasing the
importance of these reports and dashboards created to address
them, and creating necessary maintenance work down the line.
Much like any other product, updates are needed to keep pace
with the market, potentially including:
• Adding or replacing data sources, such as new business
applications
• Creating filters for drilldowns that stakeholders are
interested in
• Deprecating or updating aspects of your dashboards to
reflect the changing nature of questions

Moving Beyond Analytics: Expanding Possibilities with Data Engineering
Advancements in data engineering have propelled data products
beyond traditional analytics, and changed the landscape of what
data platforms look like. Technologies such as Hadoop file
storage systems and scalable storage and compute in cloud
warehouses enable the storage and processing of vast datasets,
leading more and more data sources to be included as part of the
larger data strategy. The rise of data lakes, coupled with
frameworks like Spark, Beam, and Flink, has further expanded
the use cases, encouraging "automated data decision making",
and the proliferation of data science initiatives.

Automated Data Decision Making: Internal vs. End-Customer-Facing Use Cases
Automated data decision making can be categorized into two main use cases with different targets: internal-facing and end-user-facing. Internal-facing use cases, also known as operational analytics, involve transforming data, including cleaning and internal calculations for new metrics, and feeding it directly into business applications without manual intervention. This streamlines existing stakeholder workflows and enhances ease of use. In these scenarios, data quality must be extremely high, as there are fewer manual checkpoints in the lifecycle of the data flows before reaching downstream data consumers. The data architecture, including data pipelines extending beyond the data warehouse, needs to be monitored closely for smooth delivery.
Examples of Data Products
Internal Facing: A marketing dashboard monitoring campaign
spend or a simple report matching sales reps to quota
attainments can qualify.
External-facing products: One example would be a scoring system for how likely customers are to renew their contract next year, built from their interactions with your software and the marketing materials they may have read (as proxies for engagement with your overall brand), combined with values assigned to each action.

Implications of Data Products


When launching a product, various steps must be taken,
including market research, user testing, supporting go-to-market
efforts, and incorporating feedback and market changes into
future iterations. Those experienced in product development
understand the intricacies and investments involved. Therefore,
it is exciting to see the term "data product" gain prominence.
Promoting data to the level of a product offers several benefits:
Market Research
Formalized research with stakeholders enables the creation of a
highly useful "product" that goes beyond initial requirements,
solving a broader range of use cases.
User Acceptance Testing
Both data teams and stakeholders share responsibility for the
success of a product created to meet customer needs. User
acceptance testing becomes crucial in ensuring alignment and
effectiveness.

Go-to-Market Strategy
Data products represent a substantial investment, signaling an
internal commitment by a company to drive adoption. This shifts
the paradigm from neglected dashboards with minimal usage to
actively promoting data products to prevent wasted investments.
Iterative Updates
Similar to ad-hoc reports addressing new questions, existing
dashboards often become neglected. Treating data products as
products emphasizes the need for regular review and
improvement, ensuring their continued usefulness to the
business.

• INTRODUCTION TO DATA, DATA TYPES AND DATA SETS

What is data?

Data is a type of information (especially facts or numbers) that is


collected to be categorised, analysed, and/or used to help
decision-making.

Important data-related terms

Like many topics, data practice has its own language. Here are some of the terms it is useful to know:

• Dataset - A particular collection of data, gathered for a


purpose
• Re-use - Using data for a purpose different to the original
one
• Metadata - Data that describes and gives the context for the
data (allowing discovery and re-use)
• Discovery - Through good metadata, being able to find the
data you are looking for.
• Statistics - A type of result from analysing and interpreting
raw data.

Why is data important?

Data is important because it:

• supports good decision-making and problem-solving


• informs research and policy
• enables an organisation to measure performance and
success
• results in products and services more aligned with customer
needs
• supports better policies and strategies
• provides a record of business activity.

What is a data set?


A data set is an organized collection of data. They are generally
associated with a unique body of work and typically cover one
topic at a time. Information elements within a data set relate to
one another, and analysts often categorize types of data to create
relevant data sets that support important business processes, like
financial metrics or sales transactions.

In scientific and statistical professions, data sets can help


professionals like biologists analyze information about the
environment or climate of an area. In retail, a business may store
information related to their customers in a data set for analysis.
Researchers, scientists, mathematicians and analysts in finance,
economics, sales and marketing often use data sets regularly in
their jobs.
Difference between data set and database
Data sets are different from databases. Essentially, a database is
a collection of data sets. Therefore, databases are typically larger
and contain a lot more information than a data set. Databases
may cover a wider range of focus, whereas a data set typically
only stores information about one topic. To access and
manipulate databases, data scientists rely on sophisticated
computer systems.
What are the types of data sets?

There are several types of data sets. What determines the type of
the data set is the information within it. Below are the types of
data sets you may see:

Numerical
A numerical data set is one in which all the data are numbers.
You can also refer to this type as a quantitative data set, as the
numerical values can apply to mathematical calculations when
necessary. Many financial analysis processes also rely on
numerical data sets, as the values in the set can represent
numbers in dollar amounts. Examples of a numerical data set
may include:
• The number of cards in a deck.
• A person's height and weight measurements.
• The measurements of interior living spaces.
• The number of pages in a book

Categorical
Categorical data sets contain information relating to the
characteristics of a person or object. Data scientists also refer to
categorical data sets as qualitative data sets because they contain
information relating to the qualities of an object. There are two
types of categorical data sets: dichotomous and polytomous.

In a dichotomous data set, each variable can only have one of


two values. For example, a data set containing answers to true
and false questions is dichotomous because it only supplies one

result or the other. In a polytomous data set, there can be more


than two possible values for each variable. For example, a data
set containing a person's eye color can give you multiple results.
Bivariate
A data set with just two variables is a bivariate data set. In this
type of data set, data scientists look at the relationship between
the two variables. Therefore, these data sets typically have two
types of related data. For example, a data set containing the
weight and running speed of a track team represents two
separate variables, where you can look for a relationship
between the two.

Multivariate
Unlike a bivariate data set, a multivariate data set contains more
than two variables. For example, the height, width, length and
weight of a package you ship through the mail requires more
than two variable inputs to create a data set. Since each value is
unique, you can use different variables to represent each one.
For the dimensions of the example package, the values for each
measurement represent the variables.

Correlation
When there is a relationship between variables within a data set,
it becomes a correlation data set. This means that the values
depend on one another to exhibit change. For example, a
restaurant may find a correlation between the number of iced
teas customers purchase in a day and the high temperatures

outside. Correlation can either be positive, negative or zero. In


positive correlations, the related variables move in the same
direction, whereas a negative correlation shows variables
moving in opposing directions. A zero correlation shows no
relationship.
What techniques can be used to represent data sets?
Having information stored in a data set often makes it easier to perform math operations and analysis. Below are some common techniques you can use on data sets to learn more about the underlying data (a short code sketch follows the list):

• Mean: The mean of a data set is the average of all the


observations. It's a ratio of the sum of the observations to
the number of elements.
• Median: When you list data in ascending order, the median
is the number that falls directly in the middle of the data
set.
• Range: The range is the difference between the highest and
lowest value within a data set, which tells you more about
how far a data set extends.
• Unique value count: A unique value count tells you what a
data set contains by counting each unique item within
categorical columns.
• Frequency count: The frequency count totals the number
of observations for each category you list in the rows of a
data set.

• Histogram: A histogram is a graphical representation of a


data set that shows the frequency count throughout the
range of data.
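As a quick illustration, the sketch below computes several of these summaries with pandas on a small made-up data set; the column names and values are assumptions for illustration.

import pandas as pd

data = pd.DataFrame({
    "pages": [35, 23, 46, 12, 10],                                 # numerical column
    "category": ["safety", "safety", "setup", "setup", "setup"],   # categorical column
})

print(data["pages"].mean())                         # mean
print(data["pages"].median())                       # median
print(data["pages"].max() - data["pages"].min())    # range
print(data["category"].nunique())                   # unique value count
print(data["category"].value_counts())              # frequency count
data["pages"].hist()                                # histogram (needs matplotlib)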
Data set examples
Here are some examples you can review to help you better
understand what data sets are and how you might analyze
them:

Numerical data set example


Here’s an example of a company gathering a numerical data
set:

Tennent Industries wants to understand the average length of


pages of different instructional manuals that help machine
operators operate various equipment in the facility. This data
can help them improve their training programs and outline
their expectations for new hires. The company gathers the
following numerical data that represents the total length of the
company’s instructional manuals:

• Manual one: 35 pages


• Manual two: 23 pages
• Manual three: 46 pages
• Manual four: 12 pages
• Manual five: 10 pages
Analysts simplify this data set to just the numbers: 35, 23, 46, 12, and 10. They then add these items (35 + 23 + 46 + 12 + 10 = 126) and divide by the number of manuals (5) to find the mean of the data set, which equals 25.2. This means the average length of the company's instructional manuals is 25.2, or about 25 pages.

Categorical data set example


Here’s an example of a company using a categorical data set:

Crane and Jenkins Manufacturing wants to better understand


employee satisfaction and creates a survey with categorical
data to help them rate employee satisfaction. The survey asks
about their overall satisfaction followed with:

• Very poor
• Poor
• Neutral
• Good
• Very good


Employees can only choose one of these five options, which
makes the data categorical because there are only a select
number of options. After completing the survey, the company
analyzes the final results. Many employees rated their
satisfaction in the good or very good rating, with an average
rating in the good category.

• DATA QUALITY
What is data quality?

Data quality measures how well a dataset meets criteria for


accuracy, completeness, validity, consistency, uniqueness,
timeliness, and fitness for purpose, and it is critical to all data

governance initiatives within an organization. Data quality


standards ensure that companies are making data-driven decisions to meet their business goals. If data issues, such as duplicate data, missing values, and outliers, aren't properly addressed, businesses increase their risk of negative business outcomes. According to a Gartner report, poor data quality costs organizations an average of USD 12.9 million each year. As a result, data quality tools have emerged to mitigate the negative impact associated with poor data quality.
When data quality meets the standard for its intended use, data
consumers can trust the data and leverage it to improve
decision-making, leading to the development of new business
strategies or optimization of existing ones. However, when a
standard isn’t met, data quality tools provide value by helping
businesses to diagnose underlying data issues. A root cause
analysis enables teams to remedy data quality issues quickly and
effectively.

Data quality isn’t only a priority for day-to-day business


operations; as companies integrate artificial intelligence (AI) and
automation technologies into their workflows, high-quality data
will be crucial for the effective adoption of these tools. As the
old saying goes, “garbage in, garbage out”, and this holds true
for machine learning algorithms as well. If the algorithm is
learning to predict or classify on bad data, we can expect that it
will yield inaccurate results.

Data quality vs. data integrity vs. data profiling



Data quality, data integrity and data profiling are all interrelated
with one another. Data quality is a broader category of criteria
that organizations use to evaluate their data for accuracy,
completeness, validity, consistency, uniqueness, timeliness, and
fitness for purpose. Data integrity focuses on only a subset of
these attributes, specifically accuracy, consistency, and
completeness. It also approaches this more from the lens of data security, implementing safeguards to prevent data corruption by malicious actors.
Data profiling, on the other hand, focuses on the process of reviewing and cleansing data to maintain data quality standards within an organization. This can also encompass the technologies that support these processes.

Dimensions of data quality


Data quality is evaluated based on a number of dimensions,
which can differ based on the source of information. These
dimensions are used to categorize data quality metrics:
• Completeness: This represents the amount of data that is
usable or complete. If there is a high percentage of missing
values, it may lead to a biased or misleading analysis if the
data is not representative of a typical data sample.
• Uniqueness: This accounts for the amount of duplicate
data in a dataset. For example, when reviewing customer
data, you should expect that each customer has a unique
customer ID.
• Validity: This dimension measures how much data matches
the required format for any business rules. Formatting

usually includes metadata, such as valid data types, ranges,


patterns, and more.
• Timeliness: This dimension refers to the readiness of the
data within an expected time frame. For example,
customers expect to receive an order number immediately
after they have made a purchase, and that data needs to be
generated in real-time.
• Accuracy: This dimension refers to the correctness of the
data values based on the agreed upon “source of truth.”
Since there can be multiple sources which report on the
same metric, it’s important to designate a primary data
source; other data sources can be used to confirm the
accuracy of the primary one. For example, tools can check
to see that each data source is trending in the same
direction to bolster confidence in data accuracy.
• Consistency: This dimension evaluates data records from
two different datasets. As mentioned earlier, multiple
sources can be identified to report on a single metric. Using
different sources to check for consistent data trends and
behavior allows organizations to trust the any actionable
insights from their analyses. This logic can also be applied
around relationships between data. For example, the
number of employees in a department should not exceed
the total number of employees in a company.
• Fitness for purpose: Finally, fitness for purpose helps to ensure that the data asset meets a business need. This dimension can be difficult to evaluate, particularly with new, emerging datasets.
These metrics help teams conduct data quality assessments across their organizations to evaluate how informative and useful data is for a given purpose.
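A minimal sketch of how some of these dimensions might be checked with pandas is shown below; the customers.csv file, the customer_id and email columns, and the simple email pattern are assumptions for illustration.

import pandas as pd

df = pd.read_csv("customers.csv")

# Completeness: share of missing values per column
print(df.isna().mean())

# Uniqueness: count of duplicate customer IDs
print(df["customer_id"].duplicated().sum())

# Validity: rows whose email does not match a simple pattern
valid_email = df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False)
print((~valid_email).sum())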

Why is data quality important?


Over the last decade, developments within hybrid cloud,
artificial intelligence, the Internet of Things (IoT), and edge
computing have led to the exponential growth of big data. As a
result, the practice of master data management (MDM) has
become more complex, requiring more data stewards and
rigorous safeguards to ensure good data quality.
Businesses rely on data quality management to support their data
analytics initiatives, such as business intelligence dashboards.
Without this, there can be devastating consequences, even
ethical ones, depending on the industry (e.g.
healthcare). Data quality solutions exist to help companies
maximize the use of their data, and they have driven key
benefits, such as:
• Better business decisions: High-quality data allows organizations to identify key performance indicators (KPIs) to measure the performance of various programs, which allows teams to improve or grow them more effectively. Organizations that prioritize data quality will undoubtedly have an advantage over their competitors.
• Improved business processes: Good data also means that
teams can identify where there are breakdowns in
operational workflows. This is particularly true for the
supply chain industry, which relies on real-time data to

determine appropriate inventory and location of it after


shipment.
• Increased customer satisfaction: High data quality
provides organizations, particularly marketing and sales
teams, with incredible insight into their target buyers. They
are able to integrate different data across the sales and
marketing funnel, which enable them to sell their products
more effectively. For example, the combination of
demographic data and web behavior can inform how
organizations create their messaging, invest their marketing
budget, or staff their sales teams to service existing or
potential clients.

• DATA PRE-PROCESSING STAGES


What is data preprocessing?
Data preprocessing, a component of data preparation, describes
any type of processing performed on raw data to prepare it for
another data processing procedure. It has traditionally been an
important preliminary step for the data mining process. More
recently, data preprocessing techniques have been adapted for
training machine learning models and AI models and for running
inferences against them.

Data preprocessing transforms the data into a format that is more


easily and effectively processed in data mining, machine
learning and other data science tasks. The techniques are
generally used at the earliest stages of the machine learning and
AI development pipeline to ensure accurate results.

There are several different tools and methods used for


preprocessing data, including the following:

• sampling, which selects a representative subset from a


large population of data;
• transformation, which manipulates raw data to produce a
single input;
• denoising, which removes noise from data;
• imputation, which synthesizes statistically relevant data
for missing values;
• normalization, which organizes data for more efficient
access; and
• feature extraction, which pulls out a relevant feature
subset that is significant in a particular context.
These tools and methods can be used on a variety of data
sources, including data stored in files or databases and streaming
data.

Why is data preprocessing important?


Virtually any type of data analysis, data science or AI
development requires some type of data preprocessing to

provide reliable, precise and robust results for enterprise


applications.

Real-world data is messy and is often created, processed and


stored by a variety of humans, business processes and
applications. As a result, a data set may be missing individual
fields, contain manual input errors, or have duplicate data or
different names to describe the same thing. Humans can often
identify and rectify these problems in the data they use in the
line of business, but data used to train machine learning or deep
learning algorithms needs to be automatically preprocessed.
Machine learning and deep learning algorithms work best when
data is presented in a format that highlights the relevant aspects
required to solve a problem. Feature engineering practices that
involve data wrangling, data transformation, data reduction,
feature selection and feature scaling help restructure raw data
into a form suited for particular types of algorithms. This can
significantly reduce the processing power and time required to
train a new machine learning or AI algorithm or run an inference
against it.

One caution that should be observed in preprocessing data: the


potential for reencoding bias into the data set. Identifying and
correcting bias is critical for applications that help make
decisions that affect people, such as loan approvals. Although
data scientists may deliberately ignore variables like gender,
race or religion, these traits may be correlated with other
variables like zip codes or schools attended, generating biased
results.

Most modern data science packages and services now include


various preprocessing libraries that help to automate many of
these tasks.

What are the key steps in data preprocessing?


The steps used in data preprocessing include the following:

1. Data profiling. Data profiling is the process of examining,


analyzing and reviewing data to collect statistics about its
quality. It starts with a survey of existing data and its
characteristics. Data scientists identify data sets that are
pertinent to the problem at hand, inventory its significant

attributes, and form a hypothesis of features that might be


relevant for the proposed analytics or machine learning task.
They also relate data sources to the relevant business concepts
and consider which preprocessing libraries could be used.

2. Data cleansing. The aim here is to find the easiest way to


rectify quality issues, such as eliminating bad data, filling in
missing data or otherwise ensuring the raw data is suitable for
feature engineering.

3. Data reduction. Raw data sets often include redundant


data that arise from characterizing phenomena in different ways
or data that is not relevant to a particular ML, AI or analytics
task. Data reduction uses techniques like principal component
analysis to transform the raw data into a simpler form suitable
for particular use cases.

4. Data transformation. Here, data scientists think about


how different aspects of the data need to be organized to make
the most sense for the goal. This could include things like
structuring unstructured data, combining salient variables when
it makes sense or identifying important ranges to focus on.

5. Data enrichment. In this step, data scientists apply the


various feature engineering libraries to the data to effect the
desired transformations. The result should be a data set
organized to achieve the optimal balance between the training
time for a new model and the required compute.

6. Data validation. At this stage, the data is split into two


sets. The first set is used to train a machine learning or deep
learning model. The second set is the testing data that is used to
gauge the accuracy and robustness of the resulting model. This
second step helps identify any problems in the hypothesis used
in the cleaning and feature engineering of the data. If the data
scientists are satisfied with the results, they can push the
preprocessing task to a data engineer who figures out how to
scale it for production. If not, the data scientists can go back and
make changes to the way they implemented the data cleansing
and feature engineering steps.
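A minimal sketch of the train/test split described in this step, using scikit-learn; the toy data stands in for a cleaned, feature-engineered data set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for prepared features and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# First set trains the model, second set tests it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The held-out test set gauges the accuracy and robustness of the model
print(accuracy_score(y_test, model.predict(X_test)))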

Data preprocessing techniques


There are two main categories of preprocessing -- data cleansing
and feature engineering. Each includes a variety of techniques,
as detailed below.

Data cleansing
Techniques for cleaning up messy data include the following:

Identify and sort out missing data. There are a variety of


reasons a data set might be missing individual fields of data.
Data scientists need to decide whether it is better to discard
records with missing fields, ignore them or fill them in with a
probable value. For example, in an IoT application that records
temperature, adding in a missing average temperature between
the previous and subsequent record might be a safe fix.
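A small sketch of that temperature example, filling a missing reading from its neighbours with pandas interpolation; the readings themselves are made up.

import numpy as np
import pandas as pd

# IoT-style temperature readings with one missing value
temps = pd.Series([71.0, 72.5, np.nan, 75.0, 74.5])

# Option 1: fill the gap with a value between the previous and next readings
filled = temps.interpolate(method="linear")

# Option 2: discard records with missing fields instead
dropped = temps.dropna()

print(filled.tolist())
print(dropped.tolist())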

Reduce noisy data. Real-world data is often noisy, which can


distort an analytic or AI model. For example, a temperature
sensor that consistently reported a temperature of 75 degrees
Fahrenheit might erroneously report a temperature as 250
degrees. A variety of statistical approaches can be used to reduce
the noise, including binning, regression and clustering.

Identify and remove duplicates. When two records seem to


repeat, an algorithm needs to determine if the same
measurement was recorded twice, or the records represent
different events. In some cases, there may be slight differences
in a record because one field was recorded incorrectly. In other
cases, records that seem to be duplicates might indeed be
different, as in a father and son with the same name who are
living in the same house but should be represented as separate
individuals. Techniques for identifying and removing or joining
duplicates can help to automatically address these types of
problems.
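A minimal pandas sketch of flagging and removing exact duplicates; the names and columns are invented, and cases like the father-and-son example usually need fuzzier matching than this.

import pandas as pd

records = pd.DataFrame({
    "name":    ["John Smith", "John Smith", "Jane Doe"],
    "address": ["12 Elm St",  "12 Elm St",  "9 Oak Ave"],
    "reading": [100,          100,          85],
})

# Flag rows that exactly repeat an earlier row
print(records.duplicated())

# Keep only the first occurrence of each exact duplicate
deduplicated = records.drop_duplicates()
print(deduplicated)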

Feature engineering
Feature engineering, as noted, involves techniques used by data
scientists to organize the data in ways that make it more efficient
to train data models and run inferences against them. These
techniques include the following:

Feature scaling or normalization. Often, multiple variables


change over different scales, or one will change linearly while

another will change exponentially. For example, salary might be


measured in thousands of dollars, while age is represented in
double digits. Scaling helps to transform the data in a way that
makes it easier for algorithms to tease apart a meaningful
relationship between variables.
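A minimal sketch of feature scaling with scikit-learn, using the salary-and-age example above; the numbers are invented.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Each row: [salary in dollars, age in years] -- very different scales
X = np.array([[45000, 23],
              [72000, 35],
              [120000, 52]], dtype=float)

# Standardization: zero mean and unit variance per column
print(StandardScaler().fit_transform(X))

# Min-max scaling: squeeze each column into the range [0, 1]
print(MinMaxScaler().fit_transform(X))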

Data reduction. Data scientists often need to combine a variety


of data sources to create a new AI or analytics model. Some of
the variables may not be correlated with a given outcome and
can be safely discarded. Other variables might be relevant, but
only in terms of relationship -- such as the ratio of debt to credit
in the case of a model predicting the likelihood of a loan
repayment; they may be combined into a single variable.
Techniques like principal component analysis play a key role in
reducing the number of dimensions in the training data set into a
more efficient representation.
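A minimal sketch of dimensionality reduction with principal component analysis in scikit-learn; the synthetic data and the choice of two components are assumptions.

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 5 partly redundant features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(100, 2)),
               rng.normal(size=(100, 1))])

# Project onto the two directions that explain most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # share of variance kept per component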

Discretization. It's often useful to lump raw numbers into


discrete intervals. For example, income might be broken into
five ranges that are representative of people who typically apply
for a given type of loan. This can reduce the overhead of training
a model or running inferences against it.
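A minimal sketch of discretization with pandas, binning incomes into five labelled ranges; the cut points and labels are assumptions for illustration.

import pandas as pd

income = pd.Series([18000, 32000, 47000, 69000, 90000, 150000])

# Lump raw incomes into five discrete intervals
bins = [0, 25000, 50000, 75000, 100000, float("inf")]
labels = ["very low", "low", "middle", "high", "very high"]
income_band = pd.cut(income, bins=bins, labels=labels)

print(income_band)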

Feature encoding. Another aspect of feature engineering


involves organizing unstructured data into a structured format.
Unstructured data formats can include text, audio and video. For
example, the process of developing natural language processing
algorithms typically starts by using data transformation
algorithms like Word2vec to translate words into numerical

vectors. This makes it easy to represent to the algorithm that


words like "mail" and "parcel" are similar, while a word like
"house" is completely different. Similarly, a facial recognition
algorithm might re-encode raw pixel data into vectors representing the distances between parts of the face.
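Word embeddings such as Word2vec need a dedicated library, but the basic idea of turning categories into numeric columns can be shown with simple one-hot encoding in pandas; this is a much simpler technique than embeddings and is used here only for illustration.

import pandas as pd

shipments = pd.DataFrame({"item": ["mail", "parcel", "house", "mail"]})

# One-hot encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(shipments, columns=["item"])
print(encoded)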

Data aggregation
Data aggregation is the process where raw data is gathered and
expressed in a summary form for statistical analysis.
For example, raw data can be aggregated over a given time
period to provide statistics such as average, minimum,
maximum, sum, and count. After the data is aggregated and
written to a view or report, you can analyze the aggregated data
to gain insights about particular resources or resource groups.
There are two types of data aggregation:
Time aggregation
All data points for a single resource over a specified time period.
Spatial aggregation
All data points for a group of resources over a specified time
period.
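A minimal sketch of both kinds of aggregation with pandas; the sensor names, timestamps, and values are made up.

import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:10", "2024-01-01 00:40",
                                 "2024-01-01 01:05", "2024-01-01 01:50"]),
    "sensor":    ["A", "B", "A", "B"],
    "value":     [10.0, 12.0, 11.0, 15.0],
})

# Time aggregation: all data points for a single resource (sensor A), summarized per hour
by_hour = (readings[readings["sensor"] == "A"]
           .set_index("timestamp")["value"]
           .resample("1h")
           .agg(["mean", "min", "max", "count"]))
print(by_hour)

# Spatial aggregation: all data points for a group of resources over the whole period
by_sensor = readings.groupby("sensor")["value"].agg(["mean", "sum", "count"])
print(by_sensor)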

What is Sampling?

Let’s start by formally defining what sampling is.



Sampling is a method that allows us to get information about the


population based on the statistics from a subset of the population
(sample), without having to investigate every individual.

The above diagram perfectly illustrates what sampling is. Let’s


understand this at a more intuitive level through an example.

Why do we need Sampling?


I’m sure you have a solid intuition at this point regarding the
question.

Sampling is done to draw conclusions about populations from


samples, and it enables us to determine a population’s
characteristics by directly observing only a portion (or sample) of
the population.

• Selecting a sample requires less time than selecting every


item in a population

• Sample selection is a cost-efficient method

• Analysis of the sample is less cumbersome and more


practical than an analysis of the entire population

Steps involved in Sampling


I firmly believe visualizing a concept is a great way to ingrain it
in your mind. So here’s a step-by-step process of how sampling
is typically done, in flowchart form!
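As a small illustration of the idea (not the full flowchart), here is a sketch that draws a simple random sample from a synthetic population with pandas and compares the sample mean to the population mean; the population and sample size are assumptions.

import numpy as np
import pandas as pd

# Synthetic "population" of 100,000 individuals with an income attribute
rng = np.random.default_rng(1)
population = pd.DataFrame({"income": rng.normal(loc=50000, scale=12000, size=100_000)})

# Draw a simple random sample of 1,000 individuals
sample = population.sample(n=1000, random_state=1)

# The sample statistic approximates the population parameter
print(population["income"].mean())
print(sample["income"].mean())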

• INTRODUCTION TO PYTHON DATA SCIENCE STACK
Python
Python is a programming language widely used by Data
Scientists.

Python has in-built mathematical libraries and functions,


making it easier to calculate mathematical problems and to
perform data analysis.
There are 6 key libraries every Python analyst should be
aware of, and they are:

1 – NumPy
NumPy: Also known as Numerical Python, NumPy is an open-source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and matrices. This basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. This is all thanks to its similarities with the C language.

2 – SciPy
SciPy: Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It's an advanced form of NumPy and allows users to carry out functions such as differential equation solvers, special functions, optimizers, and integrations. SciPy can be viewed as a library that saves time and has predefined complex algorithms that are fast and efficient. However, there are a plethora of SciPy tools that might confuse users more than help them.

3 – Pandas
Pandas is a key data manipulation and analysis library in Python. Pandas' strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly combines the best features of R and Python programming all in one.

4 – Matplotlib
Matplotlib is a visualization powerhouse for Python
programming, and it offers a large library of customizable
tools to help visualize complex datasets. Providing
appealing visuals is vital in the fields of research and data
analysis. Python’s 2D plotting library is used to produce
plots and make them interactive with just a few lines of
code. The plotting library additionally offers a range of
graphs including histograms, bar charts, error charts,
scatter plots, and much more.

5 – scikit-learn
scikit-learn is Python’s most comprehensive machine
learning library and is built on top of NumPy and SciPy.
One of the advantages of scikit-learn is the all in one
resource approach it takes, which contains various tools to
carry out machine learning tasks, such as supervised and
unsupervised learning.

6 – IPython

IPython makes life easier for Python developers working


with data. It’s a great interactive web notebook that
provides an environment for exploration with prewritten
Python programs and equations. The ultimate goal behind
IPython is improved efficiency thanks to high
performance, by allowing scientific computation and data
analysis to happen concurrently using multiple third-party
libraries.
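A minimal sketch of how several of these libraries fit together in one small workflow; the data is synthetic and this is only an illustration of the stack, not a full analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# NumPy: fast arrays of synthetic measurements
x = np.linspace(0, 10, 50)
y = 3 * x + np.random.normal(scale=2, size=50)

# pandas: a tabular view of the same data
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# scikit-learn: fit a simple model
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)

# Matplotlib: visualize the data and the fitted line
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.show()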

• RELATIONAL ALGEBRA AND SQL


Relational Algebra is a procedural query language.
Relational algebra mainly provides a theoretical
foundation for relational databases and SQL. The main
purpose of using Relational Algebra is to define
operators that transform one or more input relations into
an output relation. Given that these operators accept
relations as input and produce relations as output, they
can be combined and used to express potentially
complex queries that transform potentially many input
relations (whose data are stored in the database) into a
single output relation (the query results). As it is pure
mathematics, there is no use of English keywords in relational algebra, and operators are represented using symbols.

Fundamental Operators
These are the basic/fundamental operators used in relational algebra (a short example follows the list):

• Selection(σ)
• Projection(π)
• Union(U)
• Set Difference(-)
• Set Intersection(∩)
• Rename(ρ)
• Cartesian Product(X)
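As a small illustration, assume a hypothetical relation Student(id, name, age). The first two operators then correspond to familiar SQL queries:

σ age > 20 (Student): selection keeps only the tuples (rows) where age > 20; SQL equivalent: SELECT * FROM Student WHERE age > 20;
π name, age (Student): projection keeps only the name and age attributes (columns), with duplicates removed; SQL equivalent: SELECT DISTINCT name, age FROM Student;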

• SQL
What is SQL
SQL is a standard database language used to communicate
with databases. It allows easy access to the database and
is used to manipulate database data.

SQL stands for Structured Query Language. It was


developed by IBM in the 1970s. By executing queries,
SQL can create, update, delete, and retrieve data in
databases like MySQL, Oracle, PostgreSQL, etc.

Need of SQL in Data Science


SQL is a fundamental tool in Data Science, essential for
storing and managing data, making it a foundational skill.
Proficiency in SQL is a prerequisite for any data science
project, as it is the backbone of data management and
analysis.

Reasons to Learn SQL for Data Science


SQL (Structured Query Language) is used to manipulate data by performing different operations on the data stored in databases, such as updating, removing, creating, and altering tables, views, etc.
It is standard for big data platforms and organisations to use SQL as the primary API for relational databases.
Data science is the study of data in its entirety. We must extract data from the database in order to work with it, and SQL helps us do that.

A key component of data science is relational database management. A data scientist can define, create, and query a database using SQL commands.
Many different industries and organisations have used NoSQL to manage their product data, yet SQL is still the best choice for many.

SQL Skills for Data Science
Following are the key topics and skills that you will learn in this tutorial on SQL for Data Science. We have studied industry trends and listed the most important skills you need to learn in SQL for Data Science (a short query example in Python follows the list):

• Relational Database Model


• SQL Query Commands
• Handling Null Values
• Joins
• Key Constraints
• Working with SubQuery
• Creating Tables and Databases
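To tie these skills together, here is a minimal sketch that creates a table, inserts rows, and retrieves data from Python using the built-in sqlite3 module; the students table and its rows are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Create a table and insert some rows
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
cur.executemany("INSERT INTO students (name, marks) VALUES (?, ?)",
                [("Ali", 78), ("Sara", 91), ("Bilal", 64)])

# Retrieve data: students with marks above 70, highest first
cur.execute("SELECT name, marks FROM students WHERE marks > 70 ORDER BY marks DESC")
print(cur.fetchall())

conn.close()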
• DATA WRANGLING
What is data wrangling?

Data wrangling is the process of converting raw data into a


usable form. It may also be called data munging or data
remediation.
Data wrangling describes a series of processes designed to
explore, transform, and validate raw datasets from their
messy and complex forms into high-quality data. You can
use your wrangled data to produce valuable insights and
guide business decisions.

Data wrangling steps


There are four broad steps in the munging process:

• Discovery
• Transformation
• Validation
• Publishing

1. Discovery
In the discovery stage, you'll essentially prepare yourself for the rest of the process. Here, you'll think about the questions you want to answer and the type of data you'll need in order to answer them. You'll also locate the data you plan to use and examine its current form in order to figure out how you'll clean, structure, and organize your data in the following stages.
2. Transformation
During the transformation stage, you'll act on the plan
you developed during the discovery stage. This piece of
the process can be broken down into four components:
structuring, normalizing and denormalizing, cleaning, and
enriching.

Data structuring
When you structure data, you make sure that your various
datasets are in compatible formats. This way, when you combine
or merge data, it's in a form that's appropriate for the analytical
model you want to use to interpret the data.

Normalizing and denormalizing data


Data normalization involves organizing your data into a
coherent database and getting rid of irrelevant or repetitive
data. Denormalization involves combining multiple tables
or relational databases, making the analysis process
quicker. Keep your analysis goal and business users in
mind as you think about normalization and
denormalization.
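
For instance, denormalization often amounts to joining related tables into one analysis-ready table. A small sketch using pandas, with made-up table and column names:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Lahore", "Karachi"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [250, 400, 150]})

# Denormalize: each order row now also carries the customer's details
flat = orders.merge(customers, on="customer_id", how="left")
print(flat)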

Data cleaning
During the cleaning process, you remove errors that might
distort or damage the accuracy of your analysis. This
includes tasks like standardizing inputs, deleting duplicate
values or empty cells, removing outliers, fixing
inaccuracies, and addressing biases. Ultimately, the goal is
to make sure the data is as error-free as possible.
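
A minimal cleaning sketch with pandas, assuming a small made-up DataFrame with a numeric price column:

import pandas as pd

df = pd.DataFrame({"price": [100, 102, None, 102, 5000],
                   "city": [" lahore", "Karachi", "Karachi", "Karachi", "Lahore"]})

df["city"] = df["city"].str.strip().str.title()    # standardize inputs
df = df.drop_duplicates()                          # delete duplicate rows
df = df.dropna(subset=["price"])                   # remove empty cells
df = df[df["price"] < df["price"].quantile(0.99)]  # crude outlier removal
print(df)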

Enriching data
Once you've transformed your data into a more usable
form, consider whether you have all the data you need for
your analysis. If you don't, you can enrich it by adding
values from other datasets. You also may want to add
metadata to your database at this point.

3. Validation
During the validation step, you essentially check the work
you did during the transformation stage, verifying that
your data is consistent, of sufficient quality, and secure.
This step may be completed using automated processes
and can require some programming skills.

4. Publishing
After you've finished validating your data, you're ready to
publish it. When you publish data, you'll put it into
whatever file format you prefer for sharing with other team
members for downstream analysis purposes.

Importance of data wrangling


Data wrangling prepares your data for the data mining
process, which is the stage of analysis when you look for
patterns or relationships in your dataset that can guide
actionable insights.

Your data analysis can only be as good as the data itself. If


you analyze bad data, it's likely that you'll draw
ill-informed conclusions and won't be able to make
reliable, data-informed decisions.

With wrangled data, you can feel more confident in the


conclusions you draw from your data. You'll get results
much faster, with less chance of errors or missed
opportunities.

• EXPLORATORY DATA ANALYSIS


What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It refers to the method of studying and exploring datasets to understand their key characteristics, uncover patterns, locate outliers, and identify relationships between variables, often with the help of visualizations. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of
data points to understand their range, central
tendencies (mean, median), and dispersion (variance,
standard deviation).
• Graphical Representations: Utilizing charts such as
histograms, box plots, scatter plots, and bar charts to
visualize relationships within the data and
distributions of variables.
• Outlier Detection: Identifying unusual values that
deviate from other data points. Outliers can influence

statistical analyses and might indicate data entry errors


or unique cases.
• Correlation Analysis: Checking the relationships
between variables to understand how they might affect
each other. This includes computing correlation
coefficients and creating correlation matrices.
• Handling Missing Values: Detecting and deciding
how to address missing data points, whether by
imputation or removal, depending on their impact and
the amount of missing data.
• Summary Statistics: Calculating key statistics that
provide insight into data trends and nuances.
• Testing Assumptions: Many statistical tests and
models assume the data meet certain conditions (like
normality or homoscedasticity). EDA helps verify
these assumptions.
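
A short sketch of several of these steps in Python, assuming a hypothetical CSV file sales.csv loaded with pandas:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")        # hypothetical dataset

print(df.describe())                 # summary statistics (mean, std, quartiles)
print(df.isnull().sum())             # missing values per column
print(df.corr(numeric_only=True))    # correlation matrix of numeric columns

df.hist(figsize=(10, 6))             # distributions of each numeric variable
plt.show()                           # box plots and scatter plots work similarly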
Why Exploratory Data Analysis is Important?
Exploratory Data Analysis (EDA) is important for several
reasons, especially in the context of data science and
statistical modeling. Here are some of the key reasons why
EDA is a critical step in the data analysis process:

1. Understanding Data Structures: EDA helps in


getting familiar with the dataset, understanding the
number of features, the type of data in each feature,
and the distribution of data points. This understanding
is crucial for selecting appropriate analysis or
prediction techniques.
2. Identifying Patterns and Relationships: Through
visualizations and statistical summaries, EDA can
reveal hidden patterns and intrinsic relationships
between variables. These insights can guide further
analysis and enable more effective feature engineering
and model building.
3. Detecting Anomalies and Outliers: EDA is essential
for identifying errors or unusual data points that may
adversely affect the results of your analysis. Detecting
these early can prevent costly mistakes in predictive
modeling and analysis.
4. Testing Assumptions: Many statistical models
assume that data follow a certain distribution or that
variables are independent. EDA involves checking
these assumptions. If the assumptions do not hold, the
conclusions drawn from the model could be invalid.

5. Informing Feature Selection and Engineering:


Insights gained from EDA can inform which features
are most relevant to include in a model and how to
transform them (scaling, encoding) to improve model
performance.
6. Optimizing Model Design: By understanding the
data’s characteristics, analysts can choose appropriate
modeling techniques, decide on the complexity of the
model, and better tune model parameters.
7. Facilitating Data Cleaning: EDA helps in spotting
missing values and errors in the data, which are
critical to address before further analysis to improve
data quality and integrity.
8. Enhancing Communication: Visual and statistical
summaries from EDA can make it easier to
communicate findings and convince others of the
validity of your conclusions, particularly when
explaining data-driven insights to stakeholders without
technical backgrounds.

• INTRODUCTION TO TEXT ANALYSIS


What is Text Analytics?
Text Analytics is a process of analyzing and understanding
written or spoken language. It employs computer
algorithms and techniques to extract valuable information,
patterns, and insights from extensive textual data. In
simpler terms, text analytics empowers computers to
understand and interpret human language.
Here's a real-world example to illustrate text analytics: Let's say a company
receives customer reviews for its products online. These
reviews can be a goldmine of information, but it’s not
feasible for humans to read and analyze thousands of
reviews manually. This is where text analytics comes in.
The text analytics system can automatically analyze the
reviews, looking for patterns and sentiments. It can
identify common words or phrases that customers use to
express satisfaction or dissatisfaction. For example, it
might recognize that words like love, great, and excellent
often appear in positive reviews, while words like
disappointed, issues, and poor may appear in negative
reviews.

Why is Text Analytics Important?


Text analytics has become a crucial tool in today’s
information age for two main reasons: the massive growth
of text data and its unique ability to extract valuable
insights hidden within that data.
1. Explosion of Text Data:
• We generate an immense amount of text data daily,
from emails and social media posts to customer
reviews, documents, and online articles.
• Traditional data analysis methods struggle with
this unstructured format. Text data lacks the neat
rows and columns of traditional databases.
• Text analytics bridges this gap, allowing us to unlock
meaning and value from this vast and ever-growing
resource.
2. Uncovering Hidden Gems:
• Text data is rich with information about people’s
opinions, experiences, and behaviors.
• By analyzing this data, we can discover hidden trends,
patterns, and emotions that wouldn’t be readily
apparent otherwise.

• This can lead to significant benefits in various fields:
o Businesses: Improved decision-making through understanding customer sentiment, identifying market trends, and optimizing marketing campaigns.
o Research: Faster scientific discoveries by
analyzing large volumes of research papers and
uncovering hidden connections.
o Society: Gaining a deeper understanding of social
issues by analyzing public discourse on social
media and identifying areas for improvement.
How Text Analytics Work?
Text Analytics process typically includes several key steps,
such as language identification, tokenization, sentence
breaking, part-of-speech tagging, chunking, syntax
parsing, and sentence chaining. Let’s briefly explore each
of these steps:

Steps of Text Analytics Process


Language Identification

• Objective: Determine the language in which the text


is written.
• How it works: Algorithms analyze patterns within the
text to identify the language. This is essential for
subsequent processing steps, as different languages
may have different rules and structures.
Tokenization
• Objective: Divide the text into individual units, often
words or sub-word units (tokens).

• How it works: Tokenization breaks down the text


into meaningful units, making it easier to analyze and
process. It involves identifying word boundaries and
handling punctuation.
Sentence Breaking
• Objective: Identify and separate individual
sentences in the text.
• How it works: Algorithms analyze the text to
determine where one sentence ends and another
begins. This is crucial for tasks that require
understanding the context of sentences.
Part of Speech Tagging
• Objective: Assign a grammatical category (part of
speech) to each token in a sentence.
• How it works: Machine learning models or
rule-based systems analyze the context and
relationships between words to assign appropriate
part-of-speech tags (e.g., noun, verb, adjective) to
each token.
Chunking

• Objective: Identify and group related words


(tokens) together, often based on the part-of-speech
tags.
• How it works: Chunking helps in identifying phrases
or meaningful chunks within a sentence. This step is
useful for extracting information about specific
entities or relationships between words.
Syntax Parsing
• Objective: Analyze the grammatical structure of
sentences to understand relationships between
words.
• How it works: Syntax parsing involves creating a
syntactic tree that represents the grammatical
structure of a sentence. This tree helps in
understanding the syntactic relationships and
dependencies between words.
Sentence Chaining
• Objective: Connect and understand the relationships
between multiple sentences.
• How it works: Algorithms analyze the content and
context of different sentences to establish

connections or dependencies between them. This


step is crucial for tasks that require a broader
understanding of the text, such as summarization or
document-level sentiment analysis.
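
A minimal sketch of a few of these steps using Python's NLTK library; the sample text is made up and the tokenizer/tagger resources must be downloaded once beforehand:

import nltk
nltk.download("punkt")                        # tokenizer models (one-time download)
nltk.download("averaged_perceptron_tagger")   # POS tagger model (one-time download)

text = "The product is excellent. Delivery was slow, though."

sentences = nltk.sent_tokenize(text)          # sentence breaking
tokens = nltk.word_tokenize(sentences[0])     # tokenization
tags = nltk.pos_tag(tokens)                   # part-of-speech tagging
print(sentences)
print(tags)                                   # e.g. [('The', 'DT'), ('product', 'NN'), ...]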
INTRODUCTION TO PREDICTION AND
INFERENCE (SUPERVISED AND
UNSUPERVISED ALGORITHMS)
In data science, prediction and inference are two
fundamental tasks, often accomplished using
supervised and unsupervised algorithms. Here’s a
detailed explanation of these concepts and how they
differ:

Prediction and Inference

1. Prediction:
• Definition: Prediction involves using existing data to forecast future data points. It focuses on building models that can make accurate predictions on new, unseen data.
• Goal: The primary goal is to minimize the difference between the predicted values and the actual values.
• Examples: Predicting house prices, forecasting weather, stock market predictions.

2. Inference:
• Definition: Inference is about understanding the relationships within the data and drawing conclusions about the underlying processes that generate the data. It emphasizes model interpretability and understanding.
• Goal: The goal is to derive insights and make conclusions about the population or process from which the data is drawn.
• Examples: Determining the impact of education on salary, understanding the effect of marketing campaigns on sales.

Supervised and Unsupervised Algorithms

Supervised Learning

Supervised learning algorithms are used when the data has labeled outcomes, meaning that each data point is associated with a known result.

Algorithms:
1. Linear Regression: Used for predicting a continuous outcome based on one or more predictor variables.
2. Logistic Regression: Used for binary classification problems.
3. Decision Trees: Used for both classification and regression tasks.
4. Random Forests: An ensemble method using multiple decision trees for better accuracy.
5. Support Vector Machines (SVM): Used for classification tasks.
6. Neural Networks: Used for complex patterns in data, applicable in both regression and classification.

Example: Predicting house prices (regression), where the training data includes features like size, location, and price.

Unsupervised Learning

Unsupervised learning algorithms are used when the data is not labeled. These algorithms try to identify underlying patterns or structures within the data.

Algorithms:
1. K-Means Clustering: Partitions the data into K distinct clusters based on feature similarity.
2. Hierarchical Clustering: Builds a tree of clusters.
3. Principal Component Analysis (PCA): Reduces the dimensionality of the data while retaining most of the variation in the data.
4. t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduces dimensionality for visualization purposes.
5. Anomaly Detection Algorithms: Identifies data points that deviate significantly from the majority of the data.

Example: Grouping customers into different segments based on purchasing behavior without prior labels.
Applying Supervised and Unsupervised Algorithms for Prediction and Inference

Supervised Learning for Prediction:
• Build a model using historical data with known outcomes (training data).
• Use the model to predict outcomes for new data.
• Example: Using a trained linear regression model to predict future house prices.

Supervised Learning for Inference:
• Analyze the model coefficients or structure to understand the relationship between predictors and outcomes.
• Example: Using logistic regression to infer the effect of various factors (e.g., age, income) on the likelihood of defaulting on a loan.

Unsupervised Learning for Prediction:
• Though less common, clustering can sometimes be used indirectly for predictive tasks.
• Example: Using customer segments identified by K-means clustering to predict the likelihood of purchasing a product.

Unsupervised Learning for Inference:
• Identify patterns, groups, or structures in the data.
• Example: Using PCA to determine the main factors that explain the variance in customer purchase data.
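
A small illustrative sketch in Python with scikit-learn, contrasting a supervised model (labeled outcomes) with an unsupervised one (no labels); the tiny datasets are made up:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: features X with known labels y (e.g. house size -> price)
X = np.array([[50], [80], [120], [200]])
y = np.array([100, 160, 230, 410])
reg = LinearRegression().fit(X, y)
print(reg.predict([[150]]))          # prediction for an unseen house

# Unsupervised: only features, no labels; discover structure (clusters)
customers = np.array([[1, 200], [2, 180], [9, 20], [10, 15]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)                    # cluster assignment for each customer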

• INTRODUCTION TO SCIKIT-LEARN


What Is Scikit-learn?
Scikit-learn is a popular and robust machine learning
library that has a vast assortment of algorithms, as well as
tools for ML visualizations, preprocessing, model fitting,
selection, and evaluation.

Building on NumPy, SciPy, and matplotlib, Scikit-learn


features a number of efficient algorithms for
classification, regression, and clustering. These include
support vector machines, random forests, gradient boosting,
k-means, and DBSCAN.
How Does Scikit-learn Work?
Scikit-learn is written primarily in Python and uses
NumPy for high-performance linear algebra, as well as
for array operations. Some core Scikit-learn algorithms
are written in Cython to boost overall performance.

As a higher-level library that includes several


implementations of various machine learning algorithms,
Scikit-learn lets users build, train, and evaluate a model in
a few lines of code.

Scikit-learn provides a uniform set of high-level APIs for


building ML pipelines or workflows.
You use a Scikit-learn ML Pipeline to pass the data
through transformers to extract the features and an
estimator to produce the model, and then evaluate
predictions to measure the accuracy of the model.
• Transformer: This is an algorithm that transforms or imputes the data for pre-processing.
• Estimator: This is a machine learning algorithm that
trains or fits the data to build a model, which can be
used for predictions.
• Pipeline: A pipeline chains Transformers and
Estimators together to specify an ML workflow.
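
A minimal sketch of this Transformer/Estimator/Pipeline idea; the tiny dataset is made up for illustration:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[25, 40000], [47, 85000], [52, 110000], [23, 32000]]
y = [0, 1, 1, 0]                                # labeled outcomes

pipe = Pipeline([
    ("scaler", StandardScaler()),               # transformer: pre-processes the features
    ("model", LogisticRegression()),            # estimator: fits/trains the model
])
pipe.fit(X, y)                                  # runs the whole workflow
print(pipe.predict([[30, 50000]]))              # prediction for new data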
• BIAS-VARIANCE TRADEOFF
What is bias?
Bias in machine learning refers to the difference between
a model’s predictions and the actual distribution of the
value it tries to predict. Models with high bias
oversimplify the data distribution rule/function, resulting
in high errors in both the training outcomes and test data
analysis results.

Bias is typically measured by evaluating the performance


of a model on a training dataset. One common way to
calculate bias is to use performance metrics such as mean
squared error (MSE) or mean absolute error (MAE),
which determine the difference between the predicted and
real values of the training data.
Bias is a systematic error that occurs due to incorrect
assumptions in the machine learning process, leading to
the misrepresentation of data distribution.
The level of bias in a model is heavily influenced by the
quality and quantity of training data involved. Using
insufficient data will result in flawed predictions. At the
same time, it can also result from the choice of an
inappropriate model.

High-bias model features


1. Underfitting. High-bias models often underfit the
data, meaning they oversimplify the solution based
on generalization. As a result, the proposed
distribution does not correspond to the actual
distribution.
2. Low training accuracy. The lack of proper
processing of training data results in high training
loss and low training accuracy.
3. Oversimplification. The oversimplified nature of
high-bias models limits their ability to identify
complex features in the training data, making them
inefficient for solving complicated problems.

What is variance?
Variance stands in contrast to bias; it measures how much
a distribution on several sets of data values differs from
each other. The most common approach to measuring
variance is by performing cross-validation experiments
and looking at how the model performs on different
random splits of your training data.
A model with a high level of variance depends heavily on
the training data and, consequently, has a limited ability to
generalize to new, unseen data. This can result in
excellent performance on training data but significantly
higher error rates during model verification on the test
data. Nonlinear machine learning algorithms often have
high variance due to their high flexibility.
A complex model can learn complicated functions, which
leads to higher variance. However, if the model becomes
too complex for the dataset, high variance can result in
overfitting. Low variance indicates a limited change in
the target function in response to changes in the training
data, while high variance means a significant difference.
High-variance model features
• Low testing accuracy. Despite high accuracy on
training data, high variance models tend to perform
poorly on test data.
• Overfitting. A high-variance model often leads to
overfitting as it becomes too complex.
• Overcomplexity. As researchers, we expect that
increasing the complexity of a model will result in
improved performance on both training and testing
data sets. However, when a model becomes too
complex and a simpler model may provide the same
level of accuracy, it’s better to choose the simpler
one.
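
The trade-off can be seen by comparing a too-simple and a too-complex model on the same data. A rough sketch with scikit-learn on synthetic data, illustrative only:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=40)

for degree in (1, 4, 15):    # degree 1: high bias, degree 15: high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(degree, -scores.mean())   # cross-validated MSE for each level of complexity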

• MODEL EVALUATION AND


PERFORMANCE METRICS
What is model evaluation?
Model evaluation is the process of using different
evaluation metrics to understand a machine learning
model’s performance, as well as its strengths and
weaknesses. Model evaluation is important to assess the
efficacy of a model during initial research phases, and it
also plays a role in model monitoring.

To understand if your model(s) is working well with new


data, you can leverage a number of evaluation metrics.
Classification
The most popular metrics for measuring classification
performance include accuracy, precision, confusion
matrix, log-loss, and AUC (area under the ROC curve).

Performance Metrics for Classification


In a classification problem, the category or classes of data
is identified based on training data. The model learns from
the given dataset and then classifies the new data into
classes or groups based on the training. It predicts class
labels as the output, such as Yes or No, 0 or 1, Spam or
Not Spam, etc. To evaluate the performance of a
classification model, different metrics are used, and some
of them are as follows:
Accuracy
The accuracy metric is one of the simplest classification metrics to implement, and it can be determined as the ratio of correct predictions to the total number of predictions.
It can be formulated as:
Accuracy = Number of correct predictions / Total number of predictions
To implement an accuracy metric, we can compare
ground truth and predicted values in a loop, or we can
also use the scikit-learn module for this.

Precision
The precision metric is used to overcome the limitation of Accuracy. Precision determines the proportion of positive predictions that were actually correct. It can be calculated as the True Positives, or predictions that are actually true, divided by the total positive predictions (True Positives and False Positives):
Precision = True Positive / (True Positive + False Positive)

Recall or Sensitivity


It is similar to the Precision metric; however, it aims to calculate the proportion of actual positives that were identified correctly. It can be calculated as the True Positives, or predictions that are actually true, divided by the total number of actual positives, whether correctly predicted as positive or incorrectly predicted as negative (True Positives and False Negatives).
The formula for calculating Recall is given below:
Recall = True Positive / (True Positive + False Negative)

When to use Precision and Recall?


From the above definitions of Precision and Recall, we
can say that recall determines the performance of a
classifier with respect to a false negative, whereas
precision gives information about the performance of a
classifier with respect to a false positive.

F-Scores
F-score or F1 Score is a metric to evaluate a binary
classification model on the basis of predictions that are
made for the positive class. It is calculated with the help
of Precision and Recall. It is a type of single score that
represents both Precision and Recall. So, the F1 Score
can be calculated as the harmonic mean of both
precision and Recall, assigning equal weight to each of
them.
The formula for calculating the F1 score is given below:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
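
These metrics can be computed with the scikit-learn module; a short sketch using made-up ground-truth and predicted labels:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]        # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]        # model predictions

print(accuracy_score(y_true, y_pred))    # correct predictions / total predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))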
• MAP REDUCE PARADIGM
What is MapReduce?
A MapReduce is a data processing tool which is used to
process the data parallelly in a distributed form. It was
developed in 2004, on the basis of a paper titled
"MapReduce: Simplified Data Processing on Large
Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the
mapper phase, and the reducer phase. In the Mapper, the
input is given in the form of a key-value pair. The output
of the Mapper is fed to the reducer as input. The reducer
runs only after the Mapper is over. The reducer too takes
input in key-value format, and the output of reducer is the
final output.
Steps in MapReduce
o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts on the list of <key, value> pairs and sends out each unique key together with the list of values associated with it, <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output <key, value> is stored/displayed.
Sort and Shuffle
The sort and shuffle occur on the output of Mapper and
before the reducer. When the Mapper task is complete, the
results are sorted by key, partitioned if there are multiple
reducers, and then written to disk. Using the input from
each Mapper <k2,v2>, we collect all the values for each
unique key k2. This output from the shuffle phase in the
form of <k2, list(v2)> is sent as input to reducer phase.
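
The classic example of the paradigm is word counting. Below is a small pure-Python simulation of the map, sort and shuffle, and reduce phases (not Hadoop code, just an illustration):

from collections import defaultdict

documents = ["big data is big", "map reduce processes big data"]

# Map phase: emit a (key, value) pair for every word
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Sort and shuffle phase: group all values belonging to the same key
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce phase: apply a function (here, sum) to each key's list of values
result = {key: sum(values) for key, values in grouped.items()}
print(result)    # e.g. {'big': 3, 'data': 2, ...}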
Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.

• INTRODUCTION TO R
What is R Programming Language?
R programming is a leading tool for machine learning,
statistics, and data analysis, allowing for the easy creation
of objects, functions, and packages. Designed by Ross
Ihaka and Robert Gentleman at the University of
Auckland and developed by the R Development Core
Team, R Language is platform-independent and
open-source, making it accessible for use across all
operating systems without licensing costs. Beyond its
capabilities as a statistical package, R integrates with
other languages like C and C++, facilitating interaction
with various data sources and statistical tools. With a
growing community of users and high demand in the Data
Science job market, R is one of the most sought-after
programming languages today. Originating as an
implementation of the S programming language with
influences from Scheme, R has evolved since its
conception in 1992, with its first stable beta version
released in 2000.
Why Use R Language?
The R Language is a powerful tool widely used for data
analysis, statistical computing, and machine learning.
Here are several reasons why professionals across various
fields prefer R:
1. Comprehensive Statistical Analysis:
• R language is specifically designed for statistical
analysis and provides a vast array of statistical
techniques and tests, making it ideal for data-driven
research.
2. Extensive Packages and Libraries:
• The R Language boasts a rich ecosystem of packages
and libraries that extend its capabilities, allowing
users to perform advanced data manipulation,
visualization, and machine learning tasks with ease.
3. Strong Data Visualization Capabilities:
• R language excels in data visualization, offering
powerful tools like ggplot2 and plotly, which enable
the creation of detailed and aesthetically pleasing
graphs and plots.
4. Open Source and Free:
• As an open-source language, R is free to use, which
makes it accessible to everyone, from individual
researchers to large organizations, without the need
for costly licenses.
5. Platform Independence:
• The R Language is platform-independent, meaning it
can run on various operating systems, including
Windows, macOS, and Linux, providing flexibility in
development environments.
6. Integration with Other Languages:
• R can easily integrate with other programming
languages such as C, C++, Python, and Java,
allowing for seamless interaction with different data
sources and statistical packages.
7. Growing Community and Support:
• R language has a large and active community of
users and developers who contribute to its continuous
improvement and provide extensive support through
forums, mailing lists, and online resources.
8. High Demand in Data Science:
• R is one of the most requested programming
languages in the Data Science job market, making it
a valuable skill for professionals looking to advance
their careers in this field.
Features of R Programming Language
The R Language is renowned for its extensive features
that make it a powerful tool for data analysis, statistical
computing, and visualization. Here are some of the key
features of R:
1. Comprehensive Statistical Analysis:
R language provides a wide array of statistical
techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis,
classification, and clustering.
2. Advanced Data Visualization:
• With packages like ggplot2, plotly, and lattice, R
excels at creating complex and aesthetically pleasing
data visualizations, including plots, graphs, and
charts.
3. Extensive Packages and Libraries:
• The Comprehensive R Archive Network (CRAN)
hosts thousands of packages that extend R’s
capabilities in areas such as machine learning, data
manipulation, bioinformatics, and more.
4. Open Source and Free:
• R is free to download and use, making it accessible to
everyone. Its open-source nature encourages
community contributions and continuous
improvement.
5. Platform Independence:
• R is platform-independent, running on various
operating systems, including Windows, macOS, and
Linux, which ensures flexibility and ease of use
across different environments.
6. Integration with Other Languages:
• R language can integrate with other programming
languages such as C, C++, Python, Java, and SQL,
allowing for seamless interaction with various data
sources and computational processes.
7. Powerful Data Handling and Storage:
• R efficiently handles and stores data, supporting
various data types and structures, including vectors,
matrices, data frames, and lists.
8. Robust Community and Support:
• R has a vibrant and active community that provides
extensive support through forums, mailing lists, and
online resources, contributing to its rich ecosystem of
packages and documentation.
9. Interactive Development Environment (IDE):
• RStudio, the most popular IDE for R, offers a
user-friendly interface with features like syntax
highlighting, code completion, and integrated tools
for plotting, history, and debugging.
10. Reproducible Research:
• R supports reproducible research practices with tools
like R Markdown and Knitr, enabling users to create
dynamic reports, presentations, and documents that
combine code, text, and visualizations.

File reading in R
One of the important formats to store a file is in a text
file. R provides various methods with which one can read data from a text file.

read.delim(): This method is used for reading


“tab-separated value” files (“.txt”). By default, point (“.”)
is used as decimal point.
Syntax: read.delim(file, header = TRUE, sep = “\t”, dec =
“.”, …)

Parameters:

file: the path to the file containing the data to be read into
R.
header: a logical value. If TRUE, read.delim() assumes
that your file has a header row, so row 1 is the name of
each column. If that’s not the case, you can add the
argument header = FALSE.
sep: the field separator character. “\t” is used for a tab-delimited file.
dec: the character used in the file for decimal points.
DATA FRAME
A data frame is a table or a two-dimensional array-like
structure in which each column contains values of one
variable and each row contains one set of values from
each column.
Following are the characteristics of a data frame.
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric,
factor or character type.
• Each column should contain the same number of data
items.

• BASIC AND ADVANCED PLOTS FOR DATA


VISUALIZATION
What is Data Visualization?
Data visualization uses charts and graphs to visualize large amounts of complex data. Visualization provides a quick, easy way to convey concepts and to summarize and present large data in easy-to-understand and straightforward displays, which enables readers to gain insightful information. With the help of its techniques, enterprises are able to see an overview of their unstructured enterprise data in a better way.
What are its key features?
• Identify areas that need attention or improvement.
• Clarify which factors influence customer behavior.
• Decision-making Ability.
• Integration Capability.
• Predict sales volumes.

Key Components of Data Visualization


Its components help to give more detail and alternative views with which to examine the data. Listed below are its main components.
Line Charts
Line Charts involves Creating a graph in which data is
represented as a line or a set of data points joined by a
line.
Area chart
Area chart structure is a filled-in area that requires at least
two groups of data along an axis.

Pie Charts
Pie charts represent a graph in the shape of a circle. The
whole chart is divided into subparts, which look like a

sliced pie.
Donut Chart
Doughnut Charts are pie charts that do not contain any
data inside the circle.
Drill Down Pie charts
Drill down Pie charts are used for representing detailed
descriptions for a particular category.

Bar Charts
A bar chart is the type of chart in which data is
represented in vertical series and used to compare trends
over time.

Stacked Bar
In a stacked bar chart, parts of the data are adjacent to
each bar and display a total amount, broken down into
sub-amounts.

Gauges
The gauge component renders graphical

representations of data.
Solid Gauge
Creates a gauge that indicates its metric value along a

180-degree arc.
Activity Gauge
Creates a gauge that shows the development of a task.
The inner rectangle shows the current level of a measure
against the ranges marked on an outer rectangle.
Heat and Treemaps
Heatmaps are useful for presenting variation across
different variables, revealing any patterns, displaying
whether any variables are related to each other, and
identifying if any associations exist in-between them.

Treemap with Levels


The treemap component displays quantitative hierarchical
data across two dimensions, represented visually by size
and color. Treemaps use a shape called a node to
reference the data in the hierarchy.
Scatter and Bubble Charts
Creates a chart in which the position and size of bubbles
represent data. Used to show similarities among types of values, mainly when you have multiple data objects and you need to see the general relations.
Combinations
Creates a graph that uses various kinds of data labels
(bars, lines, or areas) to represent different sets of data
items.

3D Charts
Creating a 3D chart lets you rotate and view a chart from different angles, which helps in representing data.
3D Column
A 3D chart of type columns will draw each column as a
cuboid and create a 3D effect.
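
Many of these basic plots can be produced in a few lines with Python's matplotlib; a small sketch with made-up numbers:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(months, sales, marker="o")                # line chart
axes[1].bar(months, sales)                             # bar chart
axes[2].pie(sales, labels=months, autopct="%1.0f%%")   # pie chart
plt.tight_layout()
plt.show()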

Major Process of Data Visualization


Every kind of data has its own requirements for how it is best illustrated. Listed below are the stages and process flow of data visualization.
Acquire
Obtaining the correct data type is a crucial part as the data
can be collected from various sources and can be
unstructured.
Parse
Provide some structure for the data's meaning by
restructuring the received data into different categories,
which helps better visualize and understand data.
Filter
Filtering out the data that cannot serve the purpose is
essential as filtering out will remove the unnecessary
data, further enhancing the chart visualization.
Mining
Building charts from the statistics in such a way that the scientific context becomes clear. It helps viewers seek insights that cannot be gained from raw data or statistics.
Represent
One of the most significant challenges for users is
deciding which chart suits them best and represents the
right information. The data exploration capability is
necessary for statisticians as this reduces the need for
duplicated sampling to determine which data is relevant
for each model.
Refine
Refining and Improving the essential representation helps
in user engagement.
Interact
Add methods for handling the data or managing what
features are visible.
What are the best tools?
Nowadays, there are many tools. Some of them are:
• Google Chart: Google Chart is one of the easiest
tools for visualization. With the help of Google
charts, you can analyze small datasets to complex
unstructured datasets.
We can implement simple charts as well as complex
tree diagrams. Google Chart is available
cross-platform as well.
• Tableau: Tableau Desktop is a very easy-to-use visualization tool. Two more versions of Tableau are available.
One is "Tableau Server," and the other is cloud-based
"Tableau Online." Here we can perform visualization
operations by applying drag and drop methods for
creating visual diagrams. In Tableau, we can create
dashboards very efficiently.
