
FDS Unit 1 - Unit 1 notes

Foundations of Data Science (Anna University)

by Ramya p
UNIT I
INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing –
Basic statistical descriptions of Data

1.1 Introduction to Data Science

➢ Data Science is a combination of multiple disciplines that uses statistics, data analysis,
and machine learning to analyze data and to extract knowledge and insights from it.

➢ Data Science is about data gathering, analysis and decision-making. It is also about
finding patterns in data through analysis and making future predictions.

➢ Data science and big data are used almost everywhere in both commercial and non-
commercial settings.

➢ By using Data Science, companies are able to:


• Make better decisions (should we choose A or B?)
• Perform predictive analysis (what will happen next?)
• Discover patterns (find patterns, or perhaps hidden information, in the data)
1.1.1 Where is Data Science Needed?
➢ Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.
Examples
• For route planning: To discover the best routes to ship
• To foresee delays for flight/ship/train etc. (through predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast the next year's revenue for a company

• To analyze the health benefits of training
• To predict who will win elections
➢ Data Science can be applied in nearly every part of a business where data is available.
Examples
• Consumer goods
• Stock markets

• Industry

• Politics

• Logistic companies

• E-commerce

1.1.2 How Does a Data Scientist Work?


➢ A Data Scientist requires expertise in several backgrounds:
• Machine Learning
• Statistics

• Programming (Python or R)

• Mathematics

• Databases
➢ A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.
How a Data Scientist works
1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From database, web logs, customer feedback, etc.

3. Extract the data - Transform the data to a standardized format.

4. Clean the data - Remove erroneous values from the data.

5. Find and replace missing values - Check for missing values and replace them
with a suitable value (e.g. an average value).

6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than
1.8 m, yet the number 140 is larger than 1.8, so scaling is important; a minimal scaling sketch follows this list).

7. Analyze data, find patterns and make future predictions.

8. Represent the result - Present the result with useful insights in a way the
"company" can understand.

1.2 Benefits and uses of Data Science


➢ The organizational importance of Data Science is continuously increasing. According to
one survey, the global Data Science market is expected to grow to $115 billion by 2023.
Some of the many Data Science benefits include the following:
1. Increases business predictability
➢ When a company invests in structuring its data, it can work with what we call
predictive analysis. With the help of the data scientist, it is possible to use technologies
such as Machine Learning and Artificial Intelligence to work with the data that the
company has and, in this way, carry out more precise analyses of what is to come.
➢ Thus, you increase the predictability of the business and can make decisions today that
will positively impact the future of your business.
2. Ensures real-time intelligence
➢ The data scientist can work with RPA (Robotic Process Automation) professionals to
identify the different data sources of their business and create automated dashboards,
which search all this data in real-time in an integrated manner.
➢ This intelligence is essential for the managers of your company to make more
accurate and faster decisions.
3. Favors the marketing and sales area
➢ Data-driven Marketing is a universal term nowadays. The reason is simple: only with
data can we offer solutions, communications, and products that are genuinely in line
with customer expectations.
➢ The data scientists can integrate data from different sources, bringing even more
accurate insights to their team. Can you imagine obtaining the entire customer journey
map considering all the touch points your customer had with your brand? This is
possible with Data Science.
4. Improves data security
➢ One of the benefits of Data Science is the work done in the area of data security. In
that sense, there is a world of possibilities. Data scientists work on fraud
prevention systems, for example, to keep your company's customers safer. They can
also study recurring patterns of behavior in a company's systems to identify possible
architectural flaws.

➢ Data Science is widely used in the banking and finance sectors for fraud detection and
personalized financial advice.
5. Helps interpret complex data
➢ Data Science is a great solution when we want to combine different data sources to
understand the business and the market better. Depending on the tools we use to
collect data, we can mix data from "physical" and virtual sources for better visualization.
6. Facilitates the decision-making process
➢ Data Science is improving the decision-making process. This is because we can create
tools to view data in real-time, allowing more agility for business managers. This is
done both by dashboards and by the projections that are possible with the data
scientist’s treatment of data.
➢ E.g., construction companies use Data Science for better decision-making by
tracking activities, including the average time for completing tasks, materials-based
expenses, and more.
7. Study purpose
➢ Universities use data science in their research but also to enhance the study
experience of their students. The rise of massive open online courses (MOOC)
produces a lot of data, which allows universities to study how this type of learning can
complement traditional classes.
1. Explain the facets of data. (NOV/DEC 2023, NOV/DEC 2022)

1.3 Facets of data


➢ In Data Science and Big Data we may use many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are
these:
• Structured data
• Unstructured data
• Natural Language
• Machine-generated
• Graph-based
• Audio, video and images
• Streaming
1.3.1 Structured data

➢ Structured data is data which conforms to a data model, has a well-defined
structure, follows a consistent order, and can be easily accessed and used by a person
or a computer program.
➢ Structured data is usually stored in well-defined schemas such as databases. It is
generally tabular, with columns and rows that clearly define its attributes.
➢ The SQL (Structured Query language) is often used to manage structured data stored
in databases.
Characteristics of Structured Data
• Data conforms to a data model and has easily identifiable structure.
• Data is stored in the form of rows and columns. Example: Database
• Data is well organised, so the definition, format and meaning of the data are explicitly
known.
• Data resides in fixed fields within a record or file.
• Similar entities are grouped together to form relations or classes.
• Entities in the same group have the same attributes.
• Easy to access and query, so the data can be easily used by other programs.
• Data elements are addressable, so it is efficient to analyse and process them.
Sources of Structured Data
• SQL Databases
• Spreadsheets such as Excel (Figure 1.1)
• OLTP Systems
• Online forms
• Sensors such as GPS or RFID tags
• Network and Web server logs
• Medical devices

Figure 1.1 An Excel table is an example of structured data.
1.3.2 Unstructured Data
➢ Unstructured data is data which does not conform to a data model and has no
easily identifiable structure, such that it cannot be used by a computer program
easily.
➢ Unstructured data is not organised in a pre-defined manner or does not have a pre-
defined data model, thus it is not a good fit for a mainstream relational database.
➢ One example of unstructured data is our regular email (figure 1.2). Although email
contains structured elements such as the sender, title, and body text, it’s a challenge to
find the number of people who have written an email complaint about a specific
employee because so many ways exist to refer to a person, for example. The
thousands of different languages and dialects out there further complicate this.

Figure 1.2 Email is simultaneously an example of unstructured data and natural language
data.
Characteristics of Unstructured Data
• Data neither conforms to a data model nor has any structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Data does not follow any semantics or rules.
• Data lacks any particular format or sequence.
• Data has no easily identifiable structure.
• Due to the lack of an identifiable structure, it cannot be used by computer programs easily.
Sources of Unstructured Data
• Web pages
• Images (JPEG, GIF, PNG, etc.)
• Videos
• Memos
• Reports
• Emails
• Surveys
1.3.3 Natural Language
➢ Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.

➢ Natural Language Processing or NLP is a branch that focuses on teaching computers
how to read and interpret the text in the same way as humans do. It is a field that is
developing methodologies for filling the gap between Data Science and human
languages.
➢ Many areas like Healthcare, Finance, Media, Human Resources, etc are using NLP for
utilizing the data available in the form of text and speech. Many text and speech
recognition applications are built using NLP.
➢ The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don’t generalize well to other domains.
➢ Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of
text. This shouldn’t be a surprise though: humans struggle with natural language as
well. It’s ambiguous by nature.
➢ A human-written email, as shown in figure 1.2, is also a perfect example of natural
language data.
1.3.4 Machine-generated data
➢ Machine data, also known as machine-generated data, is information that is created
without human interaction as a result of a computer process or application activity.
This means that data entered manually by an end-user is not considered to be
machine-generated.
➢ Machine data affects all industries that use computers in their daily operations, and
individuals are increasingly generating this data inadvertently or causing it to be
generated by machines.
➢ Application log files, call detail records, clickstream data associated with user web
activities, data files, system configuration files, alerts, and tickets are all examples of
machine data.
➢ Both machine-to-machine (M2M) and human-to-machine (H2M) interactions
generate machine data.
➢ Humans rarely alter machine data, although it can be collected and analysed. Machine
data is generated automatically, either on a regular basis or in reaction to a specific
occurrence.
Need of Machine Data
➢ Machine data can provide a wealth of useful information and commercial benefits. If
a company wants to stay ahead of the competition, it must first understand its

customers' aggregate behaviour. Using the right data products, businesses can obtain
insight.
➢ Machine data has enormous potential for enabling more precise models in a variety of
applications. These models have the potential to alter the way businesses are run.
➢ Machine data, in particular, allows us to hear the voice of each individual customer
rather than a group of customers. This provides a level of business information that
was previously unimaginable.
Types of Machine Data
➢ The most common types of machine data are as follows:
Sensor Data
➢ Sensors work together to continuously monitor, measure and gather Machine Data
(e.g., movements, temperatures, pressures, and rotational speeds). Further review and
analysis of this data are possible, allowing for the extraction of insights and the
implementation of action plans.
Computer or System Log Data
➢ Computers generate log files that include information about the system's operation. A
log file is made up of a series of log lines that show various system actions, such as
saving or deleting a file, connecting to a Wi-Fi network, installing new software,
opening an application, attaching a Bluetooth device, emptying a recycle bin, and
more.
➢ Some types of computer log data are shared with the manufacturers of computers,
operating systems, applications, and programs, while others are kept locally and
confidentially.
Geotag Data
➢ Geotagging is the process of adding geographical metadata to a media type based on
the location of the device that created it. Geotags, which can include timestamps and
other contextual information, can be generated automatically for photos, videos, text
messages, and other types of media.
Call Log Data
➢ The Machine Data connected with telephone calls is referred to as a call log or call
detail record. The automated process of gathering, recording, and evaluating data
regarding phone calls is known as call logging.
➢ The call duration, start and finish times of the call, the caller and recipient's locations,
as well as the network utilised, are all recorded in the logs.

Web Log Data
➢ A weblog is an automatic record of a user's online activity, as opposed to computer
log data, which records actions that occur during the functioning of a system.
Application Log Data
➢ An application log is a file that keeps track of the activities that occur within a
software application. Despite the fact that human users initiate the actions, the
Machine Data referred to here is generated automatically rather than being manually
entered.
➢ The application utilised, timestamps, problems, downtimes, access requests, user IDs,
file sizes uploaded or downloaded, and more are all included in this data. These
records can be used to assess and prevent recurrences of errors, as well as to follow
the activity of various people.
Benefits of Machine Data
• Business Intelligence and Data Analytics
• Predictive Maintenance
• Log Management and Analysis
• Customized Customer Experience
• Improving Cybersecurity
1.3.5 Graph-based or network data
➢ Graphs are data structures to describe relationships and interactions between entities
in complex systems. In general, a graph contains a collection of entities called nodes
and another collection of interactions between a pair of nodes called edges.
➢ Graph Theory can be used to represent and analyze a wide variety of network
information and has numerous modern applications within Data Science.
➢ Graph or network data is, in short, data that focuses on the relationship or adjacency
of objects. Graph-based data is a natural way to represent social networks, and its
structure allows us to calculate specific metrics such as the influence of a person and
the shortest path between two people.
➢ Examples of graph-based data can be found on many social media websites (figure
1.3). For instance, on LinkedIn you can see who you know at which company. Your
follower list on Twitter is another example of graph-based data. The power and
sophistication come from multiple, overlapping graphs of the same nodes. For
example, imagine the connecting edges here showing "friends" on Facebook, and
imagine another graph with the same people that connects business colleagues via LinkedIn.

Figure 1.3 Friends in a social network are an example of graph-based data.
➢ Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.
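➢ As an illustration (not part of the original notes), a minimal Python sketch using the networkx library on a small, made-up friendship graph shows the kind of metrics mentioned above, such as influence (approximated here by degree centrality) and the shortest path between two people:

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Carl"), ("Carl", "Dave"),
    ("Ann", "Eve"), ("Eve", "Dave"),
])

print(nx.degree_centrality(g))             # rough "influence" of each person
print(nx.shortest_path(g, "Ann", "Dave"))  # ['Ann', 'Eve', 'Dave']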
1.3.6 Audio, image, and video data
➢ Audio, image, and video are types of data that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
➢ For instance, high-speed cameras at stadiums capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to two baselines.
➢ Recently a company called DeepMind succeeded in creating an algorithm that is
capable of learning how to play video games. This algorithm takes the video screen as
input and learns to interpret everything via a complex process of deep learning.
1.3.7 Streaming data
➢ Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously, and in small sizes.
➢ Streaming data includes a wide variety of data such as log files generated by
customers using our mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors, or
geospatial services, and telemetry from connected devices or instrumentation in data
centers.
➢ This data needs to be processed sequentially and incrementally on a record-by-record
basis or over sliding time windows, and used for a wide variety of analytics including
correlations, aggregations, filtering, and sampling. Information derived from such

analysis gives companies visibility into many aspects of their business and customer
activity such as service usage, server activity, website clicks, and geo-location of
devices, people, and physical goods and enables them to respond promptly to
emerging situations.
➢ Some real-life examples of streaming data include use cases in every industry,
including real-time stock trades, up-to-the-minute retail inventory management, social
media feeds, multiplayer game interactions, and ride-sharing apps.

2. Describe the overview of the data science process. (APR/MAY 2023, NOV/DEC 2022)

1.4 Data Science process


1.4.1 Overview of Data Science process
➢ Data Science is all about a systematic process used by data scientists to analyze,
visualize and model large amounts of data. A data science process helps data scientists
use the right tools to find unseen patterns, extract data, and convert information into
actionable insights that can be meaningful to the company.
➢ This aids companies and businesses in making decisions that can help in customer
retention and profits. Further, a data science process helps in discovering hidden
patterns of structured and unstructured raw data. The process helps in turning a
problem into a solution by treating the business problem as a project.
➢ The typical data science process consists of six steps:

Figure 1.4 six steps of the data science process
1.4.2 Defining Research goals and Creating a project charter
➢ The first step of this process is setting a research goal. Here, the main purpose is
making sure all the stakeholders understand the what, how, and why of the project, so
that everybody knows what to do and can agree on the best course of action.
Questions to ask include:
• Who are the customers?
• How do we identify them?
• What does the sales process look like right now?
• Why are they interested in our products?
• Which products are they interested in?

➢ The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable. This information is
then best placed in a project charter.
➢ The length and formality may differ between projects and companies. At the end of
this step, we must have as much information at hand as possible.
1.4.2.1 Spend time understanding the goals and context of our research
➢ First of all, understanding the goals and context of the research is critical for project
success. Define the context based on the research objectives, because context can mean
different things, such as a particular team or group, an organisation, a community, a
society, a country, or a culture.
➢ Second, context is important because it gives meaning to the research; simply put, it
helps shape the research. Keep asking questions and devising examples until you
grasp the exact business expectations, identify how the project fits into the bigger
picture, appreciate how the research is going to change the business, and understand
how they'll use the results.

Figure 1.5 Setting the research goal


1.4.2.2 Create a project charter
➢ A project management charter states the scope and objectives of a project, as well as
the people who will participate in it.
➢ If we work in operations or project management, we have to work tirelessly to
establish the most efficient ways to accomplish tasks and maintain quality.
➢ However, before establishing a new process or making significant changes to a current
process, we need to get approval from stakeholders and get everyone else on board
with our vision.

➢ The project charter should cover the following items, particularly if the company
doesn't offer a standard form or template to fill out:
• A clear research goal
• The project mission and context
• A budget
• The scope and risks
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or a proof of concept
• Deliverables and a measure of success
• A timeline
1.4.3 Retrieving Data
➢ In databases, data retrieval is the process of identifying and extracting data from a
database, based on a query provided by the user or application. The next step of the
data science process is to retrieve the required data. Many companies will have already
collected and stored the data in a database.
➢ Most of the organizations are making even high-quality data freely available for
public and commercial use.
➢ Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is to acquire all the data we need. This may be
difficult, and even if we succeed, the data often arrives in a different format, so it has to
be polished before it is of any use to us.

Figure 1.6 Retrieving Data
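
➢ As a brief illustration (not part of the original notes), data stored in a text file or a database table can be retrieved with pandas; the file name, table name, and database below are made-up examples:

import pandas as pd
import sqlite3

customers = pd.read_csv("customers.csv")            # from a simple text file

conn = sqlite3.connect("company.db")                # from a table in a database
sales = pd.read_sql("SELECT * FROM sales", conn)
conn.close()

print(customers.head())
print(sales.head())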


1.4.3.1 Start with data stored within the company

➢ The first act should be to assess the relevance and quality of the data that’s readily
available within the company. Most companies have a program for maintaining key
data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses, and data
lakes maintained by a team of IT professionals.
➢ The primary goal of a database is data storage, while a data warehouse is designed for
reading and analyzing that data.
➢ A data mart is a subset of the data warehouse and is geared toward serving a specific
business unit. While data warehouses and data marts are home to pre-processed data,
a data lake contains data in its raw form.
➢ A data lake is a centralized repository designed to store, process, and secure large
amounts of structured, semi structured, and unstructured data. It can store data in its
native format and process any variety of it, ignoring size limits.
➢ Finding data even within our own company can sometimes be a challenge. As
companies grow, their data becomes scattered around many places. Knowledge of the
data may be dispersed as people change positions and leave the company.
➢ Getting access to data is another difficult task. The organizations should understand
the value and sensitivity of data and often have policies in place so everyone has
access to what they need. These policies translate into physical and digital barriers
called Chinese walls. These “walls” are mandatory and well-regulated for customer
data in most countries.
➢ A Chinese wall is a barrier that separates two or more groups, usually as a means of
restricting the flow of information. Typically, the wall is purely conceptual, although
groups may be divided by physical barriers as well as policies. Getting access to the
data may take time and involve company politics.
1.4.3.2 Don’t be afraid to shop around
➢ If data isn’t available inside our organization, look outside our organization’s walls.
Many companies specialize in collecting valuable information. Other companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is the
case with Twitter, LinkedIn, and Facebook.
➢ Although data is considered an asset by many companies, governments and other
organizations share their data for free with the world. This data can be of excellent
quality; it depends on the institution that creates and manages it.
➢ The information they share covers a broad range of topics such as the number of
accidents or amount of drug abuse in a certain region and its demographics. This data

is helpful when we want to enrich proprietary data but also convenient when training
our data science skills.
1.4.3.3 Do data quality checks now to prevent problems later
➢ Expect to spend a good portion of your project time doing data correction and cleansing,
sometimes up to 80%. The retrieval of data is the first time you will inspect the data in the
data science process.
➢ Most of the errors you will encounter during the data gathering phase are easy to spot, but
being too careless will make you spend many hours solving data issues that could have
been prevented during data import.
➢ Properly investigate the data during the import, data preparation, and exploratory
phases. The difference is in the goal and the depth of the investigation. During data
retrieval, check whether the data is equal to the data in the source document. This
shouldn't take too long; stop when you have enough evidence that the data is similar
to the data you find in the source document.
➢ Data preparation is the process of gathering, combining, structuring and organizing
data so it can be used in business intelligence (BI), analytics and data
visualization applications. The components of data preparation include data
preprocessing, profiling, cleansing, validation and transformation; it often also
involves pulling together data from different internal systems and external sources.
➢ Data preparation work is done by information technology (IT), BI and data
management teams as they integrate data sets to load into a data warehouse, NoSQL
database or data lake repository, and then when new analytics applications are
developed with those data sets.
➢ In addition, data scientists, data engineers, other data analysts and business users
increasingly use self-service data preparation tools to collect and prepare data
themselves.
➢ One of the primary purposes of data preparation is to ensure that raw data being
readied for processing and analysis is accurate and consistent
➢ During the exploratory phase our focus shifts to what we can learn from the data.
Now we assume the data to be clean and look at statistical properties such as
distributions, correlations, and outliers. We often iterate over these phases. For
instance, when we discover outliers in the exploratory phase, they can point to a data
entry error. This is how the quality of the data is improved during the process.

3. Briefly describe the steps involved in Data Preparation. (NOV/DEC 2023)

1.4.4 Data Preparation- Cleansing, integrating, and transforming data
➢ The data preparation phase covers all activities to construct the final dataset from the
initial raw data in order to prepare the data for further processing. Data preparation
tasks are likely to be performed multiple times, and not in any prescribed order.
Why Data Preparation?
➢ Data comes from a multitude of sources; it can be high in volume and have a variety of
attributes. Real-world data is generally noisy, incomplete and inconsistent. This implies
that raw data tends to be corrupt, or to have missing values or attributes, outliers or
conflicting values.
➢ Data preparation stage resolves such kinds of data issues to ensure the dataset used for
modeling stage is acceptable and of improved quality. Analytical models fed with poor
quality data can lead to misleading predictions.
Benefits of Data Preparation
➢ Good data preparation is crucial to producing valid and reliable models that have high
accuracy and efficiency. It is essential to spot data issues early to avoid getting
misleading predictions.
➢ Accuracy of any analytical model depends highly on the quality of data fed into it.
Excellent quality data leads to more useful insights which enhance organizational
decision making and improve overall operational efficiency.
➢ Figure 1.7 shows the most common actions to take during the data cleansing,
integration, and transformation phase.
1.4.4.1 Cleansing Data
➢ Data cleansing, also referred to as data cleaning or data scrubbing, is the process of
fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set.
➢ It involves identifying data errors and then changing, updating or removing data to
correct them. Data cleansing improves data quality and helps provide more accurate,
consistent and reliable information for decision-making in an organization.

Figure 1.7 Data Preparation
➢ Data cleansing is a key part of the overall data management process and one of the
core components of data preparation work that readies data sets for use in business
intelligence (BI) and data science applications.
➢ It's typically done by data quality analysts and engineers or other data management
professionals. But data scientists, BI analysts and business users may also clean data
or take part in the data cleansing process for their own applications.
Why is clean data important?
➢ Business operations and decision-making are increasingly data-driven, as
organizations look to use data analytics to help improve business performance and
gain competitive advantages over rivals.
➢ As a result, clean data is a must for BI and data science teams, business executives,
marketing managers, sales reps and operational workers. That's particularly true in
retail, financial services and other data-intensive industries, but it applies to
organizations across the board, both large and small.

➢ If data isn't properly cleansed, customer records and other business data may not be
accurate and analytics applications may provide faulty information. That can lead to
flawed business decisions, misguided strategies, missed opportunities and operational
problems, which ultimately may increase costs and reduce revenue and profits.
Types of data errors
➢ Data cleansing addresses a range of errors and issues in data sets, including
inaccurate, invalid, incompatible and corrupt data. Some of those problems are caused
by human error during the data entry process, while others result from the use of
different data structures, formats and terminology in separate systems throughout an
organization.
➢ For example, an interpretation error: a person's age is recorded as greater than 300 years.
An inconsistency error: "Female" is entered in one table and "F" in another, but both
represent the same thing, namely that the person is female.
Data Cleaning Techniques
➢ Before you start cleaning, think about the objectives and what you hope to gain from
cleaning and analyzing this data. This will help you establish what is relevant within
your data and what is not. Here are 8 effective data cleaning techniques:
1. Remove duplicates or Data entry errors
2. Remove irrelevant data
3. Standardize capitalization
4. Convert data type
5. Clear formatting
6. Fix errors
7. Language translation
8. Handle missing values
Remove Duplicates or Data entry errors
➢ When you collect your data from a range of different places, or scrape your data, it’s
likely that you will have duplicated entries. These duplicates could originate from
human error where the person inputting the data or filling out a form made a mistake.
➢ Duplicates will inevitably skew your data and/or confuse your results. They can also
just make the data hard to read when you want to visualize it, so it’s best to remove
them right away.

➢ Data collection and data entry are error-prone processes. They often require human
intervention, and because humans are only human, they make typos or lose their
concentration for a second and introduce an error into the chain.
➢ But data collected by machines or computers isn't free from errors either. Some errors
arise from human sloppiness, whereas others are due to machine or hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the
extract, transform, and load phase.
➢ For small data sets you can check every value by hand. Detecting data errors when the
variables you study don’t have many classes can be done by tabulating the data with
counts.
➢ When you have a variable that can take only two values: “Good” and “Bad”, you can
create a frequency table and see if those are truly the only two values present. In table
1.1, the values “Godo” and “Bade” point out something went wrong in at least 16
cases.
Table 1.1
Value Count
Good 1598647
Bad 1354468
Godo 15
Bade 1
➢ Most errors of this type are easy to fix with simple assignment statements and
if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
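➢ The same frequency-table check and fix can be sketched with pandas (this snippet is not part of the original notes; the column name "quality" and its values are made up):

import pandas as pd

df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]})

print(df["quality"].value_counts())          # reveals the misspelled categories
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["quality"].value_counts())          # only "Good" and "Bad" remain

df = df.drop_duplicates()                    # drop exact duplicate rows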
Remove Irrelevant Data
➢ Irrelevant data will slow down and confuse any analysis that you want to do. So,
deciphering what is relevant and what is not is necessary before you begin your data
cleaning.
➢ For instance, if you are analyzing the age range of your customers, you don’t need to
include their email addresses. Sanity checks are another valuable type of data check.
➢ Here you check the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 299 years. Sanity checks can be
directly expressed with rules:

check = 0 <= age <= 120
Standardize Capitalization
➢ Capital letter mismatches are common. Within your data, you need to make sure that
the text is consistent. If you have a mixture of capitalization, this could lead to
different erroneous categories being created.
➢ It could also cause problems when you need to translate before processing as
capitalization can change the meaning. For instance, Bill is a person's name whereas a
bill is something else entirely.
➢ Most programming languages make a distinction between "Brazil" and "brazil". In
this case you can solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() evaluates
to True.
Convert Data Types
➢ Numbers are the most common data type that you will need to convert when cleaning
your data. Often numbers are input as text; however, in order to be processed, they
need to appear as numerals.
➢ If they are appearing as text, they are classed as a string and your analysis algorithms
cannot perform mathematical equations on them.
➢ The same is true for dates that are stored as text. These should all be changed to
numerals. For example, if you have an entry that reads September 24th 2021, you’ll
need to change that to read 09/24/2021.
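➢ A minimal pandas sketch of these conversions (not part of the original notes; the column names and values are made-up examples):

import pandas as pd

df = pd.DataFrame({
    "amount": ["100", "250", "75"],                            # numbers stored as text
    "order_date": ["09/24/2021", "10/01/2021", "10/03/2021"],  # dates stored as text
})

df["amount"] = pd.to_numeric(df["amount"])                              # string -> number
df["order_date"] = pd.to_datetime(df["order_date"], format="%m/%d/%Y")  # string -> date
print(df.dtypes)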
Clear Formatting
➢ Machine learning models can’t process the information if it is heavily formatted. If
you are taking data from a range of sources, it’s likely that there are a number of
different document formats. This can make your data confusing and incorrect.
➢ You should remove any kind of formatting that has been applied to your documents,
so you can start from zero. This is normally not a difficult process, both excel and
google sheets, for example, have a simple standardization function to do this.
Fix Errors

➢ It probably goes without saying that you’ll need to carefully remove any errors from
your data. Errors as avoidable as typos could lead to you missing out on key findings
from your data. Some of these can be avoided with something as simple as a quick
spell-check.

➢ Spelling mistakes or extra punctuation in data like an email address could mean you
miss out on communicating with your customers. It could also lead to you sending
unwanted emails to people who didn’t sign up for them.

➢ Other errors can include inconsistencies in formatting. For example, if you have a
column of US dollar amounts, you’ll have to convert any other currency type into US
dollars so as to preserve a consistent standard currency. The same is true of any other
form of measurement such as grams, ounces, etc.

Language Translation
➢ To have consistent data, you'll want everything in the same language. The Natural
Language Processing (NLP) models behind software used to analyze data are also
predominantly monolingual, meaning they are not capable of processing multiple
languages. So, you need to translate everything into one language.
Handle Missing Values
➢ When it comes to missing values you have two options:
1. Remove the observations that have this missing value

2. Input the missing data

➢ Removing the missing value completely might remove useful insights from your data.
After all, there was a reason that you wanted to pull this information in the first
place.

➢ Therefore, it might be better to fill in the missing data by researching what should go
in that field. If you don't know what it is, you could replace it with the word "missing".
If it is numerical, you can place a zero in the missing field.

➢ However, if there are so many missing values that there isn’t enough data to use, then
you should remove the whole section.
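
➢ A minimal pandas sketch of the two options above (not part of the original notes; the columns and values are made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, np.nan, 41, 37],
    "city": ["Chennai", "Mumbai", None, "Delhi"],
})

dropped = df.dropna()                                        # option 1: remove the observations

filled = df.copy()                                           # option 2: fill in the missing data
filled["age"] = filled["age"].fillna(filled["age"].mean())   # e.g. replace with an average
filled["city"] = filled["city"].fillna("missing")            # e.g. replace with "missing"

print(dropped)
print(filled)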

Benefits of effective data cleansing

• Improved decision-making. With more accurate data, analytics applications can


produce better results. That enables organizations to make more informed decisions
on business strategies and operations, as well as things like patient care and
government programs.

• More effective marketing and sales. Customer data is often wrong, inconsistent or
out of date. Cleaning up the data in customer relationship management and sales
systems helps improve the effectiveness of marketing campaigns and sales efforts.

• Better operational performance. Clean, high-quality data helps organizations avoid


inventory shortages, delivery snafus and other business problems that can result in
higher costs, lower revenues and damaged relationships with customers.

• Increased use of data. Data has become a key corporate asset, but it can't generate
business value if it isn't used. By making data more trustworthy, data cleansing helps
convince business managers and workers to rely on it as part of their jobs.

• Reduced data costs. Data cleansing stops data errors and issues from further
propagating in systems and analytics applications. In the long term, that saves time
and money, because IT and data management teams don't have to continue fixing the
same errors in data sets.

1.4.4.2 Data integration or Data blending


➢ Data blending involves pulling data from different sources and creating a single,
unique, dataset for visualization and analysis.
➢ To illustrate, you may have data spread out across multiple spreadsheets like Excel,
business intelligence systems, IoT devices, cloud systems, and web applications.
➢ Using a data blending platform, you can quickly mash together data from all these
disparate sources in a way that's fast and easy.
➢ Data blending is typically used for ad hoc reporting and rapid analysis.
Different ways of combining data
➢ There are two operations to combine information from different data sets. The first
operation is joining: enriching an observation from one table with information from
another table.
➢ The second operation is appending or stacking: adding the observations of one table
to those of another table. When combining data, you have the option to create a new
physical table or a virtual table by creating a view. The advantage of a view is that it
doesn’t consume more disk space.
Joining tables
➢ Joining tables allows you to combine two analytics tables with different record structures
into a new third table.

➢ Select any combination of fields from the two original tables to be included in the
new table. Record structures are different if they have one or more fields (data
elements) that differ. Joining is a good choice for investigative work that requires a
permanently joined set of data as a starting point for analysis.

Figure 1.8 Join Tables

Example
➢ Let’s say you work for a company that sells T-shirts on a website. Consider a dataset
that tells the details about the website users. For simplicity’s sake, let’s say this table
contains just a unique client ID, client name, and country of origin:

Client_id Client_name Country


101 Jackson India
102 Bob USA
103 John Mexico
104 Jana Mexico
➢ Another dataset lists all of the purchases that those website users have made:

Client_id Sale
101 1000
102 1500
103 500
104 2500

➢ Consider the scenario: your boss walks into the room and demands that you tell her
which country has the most website sales.
➢ To answer her question, you need to bring the data that tells you where your clients are
located together with the data that tells you what they’ve purchased. In other words,
for each transaction, you need to figure out where the client was located.
➢ What makes this possible is the fact that there is some commonality between the two
datasets. In this case, it’s the client ID.
➢ To “join” is simply to combine data based on a common data point. Fittingly, that
common data point is called a “primary key.”
➢ For this example, we can join the user data to the purchase data with the SQL query
below, which uses “client ID” as the primary key:
select purchases.client_id, purchases.sale, users.country from purchases left join users
on purchases.client_id = users.client_id;

Client_id Sale Country


101 1000 India
102 1500 USA
103 500 Mexico
104 2500 Mexico
➢ Now we can see that Mexico has the most sales.
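➢ The same left join can be sketched in pandas (not part of the original notes; the DataFrames below mirror the tables above):

import pandas as pd

users = pd.DataFrame({
    "client_id": [101, 102, 103, 104],
    "client_name": ["Jackson", "Bob", "John", "Jana"],
    "country": ["India", "USA", "Mexico", "Mexico"],
})
purchases = pd.DataFrame({
    "client_id": [101, 102, 103, 104],
    "sale": [1000, 1500, 500, 2500],
})

joined = purchases.merge(users[["client_id", "country"]], on="client_id", how="left")
print(joined.groupby("country")["sale"].sum())   # Mexico has the most sales (3000)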
Appending tables
➢ Appending tables combines records from two or more analytics tables into a new
table. It is often necessary to append multiple tables into a single table before performing analysis.
➢ For example, you want to perform analysis on an entire year's worth of data but the
data is spread among twelve monthly Excel worksheets. After importing the
individual worksheets into Analytics, you can then append them to create a single
annual table for analysis.
Example
➢ Figure 1.9 shows an example of appending tables. One table contains the observations
from the month January and the second table contains observations from the month
February.
➢ The result of appending these tables is a larger one with the observations from
January as well as February. The equivalent operation in set theory would be the
union, and this is also the command in SQL, the common language of relational

databases. Other set operators are also used in data science, such as set difference and
intersection.

Figure 1.9 Appending tables
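
➢ Appending can be sketched in pandas with concat (not part of the original notes; the monthly tables below are made up, in the spirit of figure 1.9):

import pandas as pd

january = pd.DataFrame({"client_id": [101, 102], "sale": [1000, 1500]})
february = pd.DataFrame({"client_id": [103, 104], "sale": [500, 2500]})

full_period = pd.concat([january, february], ignore_index=True)
print(full_period)   # observations from January followed by those from February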


1.4.4.3 Transforming Data
➢ After the data has been successfully cleansed and integrated, the next phase is
transforming the data. Data transformation is the process of converting data from one
format, such as a database file, XML document or Excel spreadsheet, into another.
➢ The process of data transformation, as noted, involves identifying data sources and
types; determining the structure of transformations that need to occur; and defining
how fields will be changed or aggregated.
➢ It includes extracting data from its original source, transforming it and sending it to
the target destination, such as a database or data warehouse. Extractions can come
from many locations, including structured sources, streaming sources or log files from
web applications.
➢ Data analysts, data engineers and data scientists are typically in charge of data
transformation within an organization. They identify the source data, determine the
required data formats and perform data mapping, as well as execute the actual
transformation process before moving the data into appropriate databases for storage
and use.

➢ Their work involves five main steps:

• Data discovery, in which data professionals use data profiling tools or profiling scripts to
understand the structure and characteristics of the data and also to determine how it should be
transformed.

• Data mapping, during which data professionals connect, or match, data fields from one
source to data fields in another.

• Code generation, a part of the process where the software code required to transform the
data is created (either by data transformation tools or the data professionals themselves
writing script).

• Execution of the code, where the data undergoes the transformation.

• Review, during which data professionals or the business/end users confirm that the output
data meets the established transformation requirements and, if not, address and correct any
anomalies and errors.
Converting data from one form into another, for example from a linear form into a sequential or continuous form

Reducing the number of variables - Having too many variables in your model makes the
model difficult to handle, and certain techniques don’t perform well when you overload
them with too many input variables.


Turning variables into dummies - Dummy variables can take only two values, 1 (true) or 0 (false), and indicate the absence or presence of a category; see the sketch below.
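
➢ A minimal pandas sketch of dummy variables (not part of the original notes; the column and its categories are made up):

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Sun"]})
dummies = pd.get_dummies(df, columns=["weekday"])
print(dummies)   # one 0/1 column per value: weekday_Mon, weekday_Sun, weekday_Tue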

Benefits and challenges of data transformation


➢ Organizations across the board need to analyze their data for a host of business
operations, from customer service to supply chain management. They also need data
to feed the increasing number of automated and intelligent systems within their
enterprise.
➢ To gain insight into and improve these operations, organizations need high-quality
data in formats compatible with the systems consuming the data.
➢ Thus, data transformation is a critical component of an enterprise data program
because it delivers the following benefits:
• Higher data quality
• Reduced number of mistakes, such as missing values
• Faster queries and retrieval times
• Less resources needed to manipulate data
• Better data organization and management
• More usable data, especially for advanced business intelligence or analytics.
➢ The data transformation process, however, can be complex and complicated. The
challenges organizations face include the following:
• High cost of transformation tools and professional expertise

• Significant compute resources, with the intensity of some on-premises
transformation processes having the potential to slow down other operations.
• Difficulty recruiting and retaining the skilled data professionals required for
this work, with data professionals some of the most in-demand workers today.
• Difficulty of properly aligning data transformation activities to the business's
data-related priorities and requirements.
Examples of data transformation
➢ There are various data transformation methods, including the following:
• Aggregation, in which data is collected from multiple sources and stored in a
single format.
• Attribute construction, in which new attributes are added or created from
existing attributes.
• Discretization, which involves converting continuous data values into sets of
data intervals with specific values to make the data more manageable for
analysis.
• Generalization, where low-level data attributes are converted into high-level
data attributes (for example, converting data from multiple brackets broken up
by ages into the more general "young" and "old" attributes) to gain a more
comprehensive view of the data.
• Integration, a step that involves combining data from different sources into a
single view.
• Manipulation, where the data is changed or altered to make it more readable
and organized.
• Normalization, a process that converts source data into another format to
limit the occurrence of duplicated data.
• Smoothing, which uses algorithms to reduce "noise" in data sets, thereby
helping to more efficiently and effectively identify trends in the data.
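➢ Discretization and generalization, for example, can be sketched in pandas (this snippet is not part of the original notes; the ages and brackets are made up): continuous ages are converted into the more general "young" and "old" attributes.

import pandas as pd

ages = pd.Series([5, 17, 23, 41, 68, 80])
brackets = pd.cut(ages, bins=[0, 30, 120], labels=["young", "old"])
print(brackets.tolist())   # ['young', 'young', 'young', 'old', 'old', 'old']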

1.4.5 Exploratory Data analysis

➢ During exploratory data analysis you take a deep dive into the data (see figure 1.10).

Figure 1.10 Data exploration

➢ Exploratory Data Analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization
methods.
➢ It helps determine how best to manipulate data sources to get the answers you need,
making it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.

➢ EDA is primarily used to see what data can reveal beyond the formal modeling or
hypothesis testing task and provides a better understanding of data set variables and
the relationships between them.

➢ The goal isn't to cleanse the data, but it is common that you still discover anomalies you
missed before, forcing you to take a step back and fix them.
Importance of Exploratory Data Analysis in Data Science
➢ The main purpose of EDA is to help you look at data before making any assumptions. It
can help identify obvious errors, better understand patterns within the data,
detect outliers or anomalous events, and find interesting relations among the variables.
➢ Data scientists can use exploratory analysis to ensure the results they produce are
valid and applicable to any desired business outcomes and goals. EDA also helps
stakeholders by confirming they are asking the right questions.
➢ EDA can help answer questions about standard deviations, categorical variables, and
confidence intervals. Once EDA is complete and insights are drawn, its features can

then be used for more sophisticated data analysis or modeling, including machine
learning.
➢ The visualization techniques you use in this phase range from simple line graphs or
histograms, as shown in figure 1.11, to more complex diagrams such as Sankey and
network graphs. Sometimes it's useful to compose a composite graph from simple
graphs to get even more insight into the data.

Figure 1.11 From top to bottom, a bar chart, a line plot, and a distribution are some of the
graphs used in exploratory analysis.
➢ These plots can be combined to provide even more insight, as shown in figure 1.12.

Figure 1.12 Drawing multiple plots together can help you understand the structure of your
data over multiple variables.

➢ Overlaying several plots is common practice. In figure 1.13, simple graphs are combined
into a Pareto diagram, or 80-20 diagram.

Figure 1.13 A Pareto diagram is a combination of the values and a cumulative distribution.
➢ It's easy to see from this diagram that the first 50% of the countries contain slightly
less than 80% of the total amount. If this graph represented customer buying power
and we sold expensive products, we probably wouldn't need to spend our marketing
budget in every country; we could start with the first 50%.
➢ Figure 1.14 shows another technique: brushing and linking. With brushing and
linking, combine and link different graphs and tables (or views) so changes in one
graph are automatically transferred to the other graphs.
➢ Figure 1.14 shows the average score per country for questions. Not only does this
indicate a high correlation between the answers, but it’s easy to see that when you
select several points on a subplot, the points will correspond to similar points on the
other graphs.
➢ In this case the selected points on the left graph correspond to points on the middle
and right graphs, although they correspond better in the middle and right graphs.

Figure 1.14 Link and brush allows you to select observations in one plot and highlight the
same observations in the other plots.
➢ Two other important graphs are the histogram shown in figure 1.15 and the boxplot
shown in figure 1.16. In a histogram a variable is cut into discrete categories and the
number of occurrences in each category are summed up and shown in the graph.
➢ The boxplot, on the other hand, doesn’t show how many observations are present but
does offer an impression of the distribution within categories. It can show the
maximum, minimum, median, and other characterizing measures at the same time.
Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis.

Figure 1.15 Example histogram: the number of people in the age groups of 5-year intervals

Figure 1.16 Example boxplot: each user category has a distribution of the appreciation each
has for a certain picture on a photography website.
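➢ A minimal matplotlib sketch of these two plot types (not part of the original notes; the ages, user categories, and scores are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ages = rng.normal(loc=35, scale=12, size=500)               # made-up ages
scores = [rng.normal(m, 1.0, size=100) for m in (3, 4, 5)]  # made-up appreciation scores

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=range(0, 85, 5))                        # 5-year intervals, as in figure 1.15
ax1.set_title("Histogram of ages")
ax2.boxplot(scores)                                         # distribution per user category
ax2.set_xticklabels(["casual", "member", "pro"])
ax2.set_title("Appreciation per user category")
plt.show()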
1.4.6 Build the models
➢ With clean data and a good understanding of the content, you are ready to build
models with the goal of making better predictions, classifying objects, or gaining an
understanding of the system.

➢ This phase is much more focused than the exploratory analysis step, because you
know what you’re looking for and what you want the outcome to be. Figure 1.17
shows the components of model building.

Figure 1.17 Data modelling


➢ Building a model is an iterative process. The way you build the model depends on
whether you go with classic statistics or the somewhat more recent machine learning.
Most models consist of the following main steps:
• Selection of a modeling technique and variables to enter in the model
• Execution of the model
• Diagnosis and model comparison
Model and variable selection
➢ First, select the variables to include in your model and a modelling technique. Your
findings from the exploratory analysis should already give a fair idea of what
variables will help to construct a good model.
➢ Many modeling techniques are available, and choosing the right model for a problem
requires judgment on your part.
➢ Now consider the model performance and whether your project meets all the
requirements to use your model, as well as other factors:
• Must the model be moved to a production environment and, if so, would it be
easy to implement?
• How difficult is the maintenance on the model: how long will it remain
relevant if left untouched?
• Does the model need to be easy to explain?

Model execution

➢ Once you have finalized a model, it is time to implement it in code. Python is the
most prevalent coding language leveraged in the Data Science profession; however,
other programming languages such as R, Perl, C/C++, SQL, and Java are also used.
Data Scientists can use these programming languages to arrange Unstructured Data
Collections.

➢ Machine Learning is a must-have ability for any Data Scientist. Predictive Models are
created using Machine Learning. For example, if you want to forecast how many
clients you’ll have in the coming month based on the previous month’s Data, you’ll
need to employ Machine Learning techniques. Machine Learning and Deep Learning
techniques are the backbones of Data Science Modelling.
Incorporating Machine Learning Algorithms
➢ This is one of the most crucial processes in Data Science Modelling as the Machine
Learning Algorithm aids in creating a usable Data Model. There are a lot of
algorithms to pick from, the Model is selected based on the problem. There are three
types of Machine Learning methods that are incorporated:
1. Supervised Learning
It is based on the results of a previous operation that is related to the existing business
operation. Based on previous patterns, Supervised Learning aids in the prediction of an
outcome. Some of the Supervised Learning Algorithms are:
• Linear Regression
• Random Forest

• Support Vector Machines
2. Unsupervised Learning
➢ This form of learning has no pre-existing consequence or pattern. Instead, it
concentrates on examining the interactions and connections between the presently
available Data points. Some of the Unsupervised Learning Algorithms are:
• KNN (k-Nearest Neighbors)
• K-means Clustering

• Hierarchical Clustering

• Anomaly Detection

3. Reinforcement Learning
➢ It is a fascinating Machine Learning technique that uses a dynamic dataset that
interacts with the real world. In simple terms, it is a mechanism by which a system
learns from its mistakes and improves over time (a minimal sketch follows the list below).
Some of the Reinforcement Learning Algorithms are:
• Q-Learning
• State-Action-Reward-State-Action (SARSA)
• Deep Q Network
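➢ A minimal, hedged sketch of tabular Q-learning on a made-up five-state corridor; the environment, reward, and hyperparameters are assumptions for illustration only, not from the original text.

```python
# Q-learning sketch: the agent walks a 5-state corridor (actions: 0 = left,
# 1 = right) and earns a reward of 1 for reaching the rightmost state.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3           # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy: mostly exploit current Q-values, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # learn from the error between the current estimate and the observed target
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 2))   # the "move right" column ends up with the higher values
```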
Model diagnostics and model comparison
➢ You build multiple models and then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model.
➢ A holdout sample is a part of the data you leave out of the model building so it can be
used to evaluate the model afterward. The principle here is simple: the model should
work on unseen data.
➢ You use only a fraction of your data to estimate the model; the other part, the holdout
sample, is kept out of the equation. The model is then unleashed on the unseen data
and error measures are calculated to evaluate it.
➢ Multiple error measures are available; figure 1.18 shows the general idea of
comparing models. The error measure used in the example is the mean square error.
➢ Mean square error is a simple measure: check for every prediction how far it was
from the truth, square this error, and add up the error of every prediction. Figure 1.18
compares the performance of two models to predict the order size from the price.
➢ The first model is size = 3 * price and the second model is size = 10. To estimate the
models, we use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
➢ Once the model is trained, we predict values for the remaining 20% of the observations,
for which we already know the true values, and calculate the model error
with an error measure.
➢ We then choose the model with the lowest error. In this example, model 1 is chosen because
it has the lowest total error. Many models make strong assumptions, such as
independence of the inputs, and you have to verify that these assumptions are indeed met.
This is called model diagnostics.
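➢ A short, hedged Python sketch of this holdout comparison; the simulated price–size data is an assumption used only to mirror the two candidate models described above (size = 3 * price and size = 10).

```python
# Holdout sketch: keep 20% of 1,000 simulated observations unseen, then score
# model 1 (size = 3 * price) and model 2 (size = 10) with mean square error.
import numpy as np

rng = np.random.default_rng(42)
price = rng.uniform(1, 10, size=1000)
size = 3 * price + rng.normal(0, 1, size=1000)   # simulated "true" relationship

idx = rng.permutation(1000)
train_idx, holdout_idx = idx[:800], idx[800:]    # 80% to estimate, 20% held out
# (in a real project the models would be fitted on price[train_idx], size[train_idx])

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

pred_model1 = 3 * price[holdout_idx]             # model 1: size = 3 * price
pred_model2 = np.full(holdout_idx.size, 10.0)    # model 2: size = 10

print("Model 1 MSE on holdout:", round(mse(size[holdout_idx], pred_model1), 2))
print("Model 2 MSE on holdout:", round(mse(size[holdout_idx], pred_model2), 2))
```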
Figure 1.18 A holdout sample helps you compare models
1.4.7 Presenting findings and building applications
➢ After successfully analyzing the data and building a well-performing model, you are
ready to present your findings to the world (Figure 1.19).
Figure 1.19 Presentation and Automation
➢ The final step of the data analytics process is to share these insights with the wider
world (or at least with your organization’s stakeholders!). This is more complex than
simply sharing the raw results of your work.
➢ It involves interpreting the outcomes, and presenting them in a manner that’s
digestible for all types of audiences. Since you’ll often present information to
decision-makers, it’s very important that the insights you present are 100% clear and
unambiguous. For this reason, data analysts commonly use reports, dashboards, and
interactive visualizations to support their findings.
➢ Depending on what you share, your organization might decide to restructure, to
launch a high-risk product, or even to close an entire division. That’s why it’s very
important to provide all the evidence that you’ve gathered, and not to cherry-pick
data.
1.5 Data Mining
➢ Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data
mining techniques and tools enable enterprises to predict future trends and make
more-informed business decisions.
➢ Data mining is a key part of data analytics overall and one of the core disciplines
in data science, which uses advanced analytics techniques to find useful information
in data sets.
➢ At a more granular level, data mining is a step in the knowledge discovery in
databases (KDD) process, a data science methodology for gathering, processing and
analyzing data.
1.5.1 Importance of data mining
➢ Data mining is a crucial component of successful analytics initiatives in organizations.
The information it generates can be used in business intelligence (BI) and advanced
analytics applications that involve analysis of historical data, as well as real-time
analytics applications that examine streaming data as it's created or collected.
➢ Effective data mining aids in various aspects of planning business strategies and
managing operations. That includes customer-facing functions such as marketing,
advertising, sales and customer support, plus manufacturing, supply chain
management, finance and HR.
➢ Data mining supports fraud detection, risk management, cyber security planning and
many other critical business use cases. It also plays an important role in healthcare,
government, scientific research, mathematics, sports and more.
Figure 1.20 Importance of Data Mining
1.5.2 Data mining process
➢ Data mining is typically done by data scientists and other skilled BI and analytics
professionals. But it can also be performed by data-savvy business analysts,
executives and workers who function as citizen data scientists in an organization.
➢ Its core elements include machine learning and statistical analysis, along with data
management tasks done to prepare data for analysis.
➢ The use of machine learning algorithms and artificial intelligence (AI) tools has
automated more of the process and made it easier to mine massive data sets, such as
customer databases, transaction records and log files from web servers, mobile apps
and sensors.
➢ The data mining process can be broken down into these four primary stages:
1.5.2.1 Data gathering
➢ Relevant data for an analytics application is identified and assembled. The data may
be located in different source systems, a data warehouse or a data lake, an
increasingly common repository in big data environments that contains a mix of
structured and unstructured data.
➢ External data sources may also be used. Wherever the data comes from, a data
scientist often moves it to a data lake for the remaining steps in the process.
1.5.2.2 Data preparation
➢ This stage includes a set of steps to get the data ready to be mined. It starts with
data exploration, profiling and pre-processing, followed by data cleansing work to
fix errors and other data quality issues.
➢ Data transformation is also done to make data sets consistent, unless a data
scientist is looking to analyze unfiltered raw data for a particular application.
1.5.2.3 Mining the data
➢ Once the data is prepared, a data scientist chooses the appropriate data mining
technique and then implements one or more algorithms to do the mining.
➢ In machine learning applications, the algorithms typically must be trained on sample
data sets to look for the information being sought before they're run against the full set
of data.
1.5.2.4 Data analysis and interpretation
➢ The data mining results are used to create analytical models that can help drive
decision-making and other business actions. The data scientist or another member of a
data science team also must communicate the findings to business executives and
users, often through data visualization and the use of data storytelling techniques.
1.5.3 Types of data mining techniques
➢ Various techniques can be used to mine data for different data science applications.
Pattern recognition is a common data mining use case that's enabled by multiple
techniques, as is anomaly detection, which aims to identify outlier values in data sets.
Popular data mining techniques include the following types:
• Association rule mining. In data mining, association rules are if-then
statements that identify relationships between data elements. Support and
confidence criteria are used to assess the relationships: support measures
how frequently the related elements appear together in a data set, while confidence
reflects the proportion of cases in which the if-then statement holds
(a small worked sketch follows this list).
• Classification. This approach assigns the elements in data sets to different
categories defined as part of the data mining process. Decision trees, Naive
Bayes classifiers, k-nearest neighbor and logistic regression are some
examples of classification methods.
• Clustering. In this case, data elements that share particular characteristics are
grouped together into clusters as part of data mining applications. Examples
include k-means clustering, hierarchical clustering and Gaussian mixture
models.
• Regression. This is another way to find relationships in data sets, by
calculating predicted data values based on a set of variables. Linear regression
and multivariate regression are examples. Decision trees and some other
classification methods can be used to do regressions, too.
• Sequence and path analysis. Data can also be mined to look for patterns in
which a particular set of events or values leads to later ones.
• Neural networks. A neural network is a set of algorithms that simulates the
activity of the human brain. Neural networks are particularly useful in
complex pattern recognition applications involving deep learning, a more
advanced offshoot of machine learning.
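➢ As referenced in the association rule item above, here is a small, hedged worked sketch of support and confidence for the rule "if bread then milk"; the five transactions are made up for illustration.

```python
# Support and confidence for the rule {bread} -> {milk} over toy transactions.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "milk"} <= t)   # bread AND milk
with_bread = sum(1 for t in transactions if "bread" in t)       # bread at all

support = both / n               # items appear together in 3 of 5 transactions
confidence = both / with_bread   # the rule holds in 3 of the 4 "bread" transactions
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```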
1.5.4 Benefits of data mining
➢ In general, the business benefits of data mining come from the increased ability to
uncover hidden patterns, trends, correlations and anomalies in data sets. That
information can be used to improve business decision-making and strategic planning
through a combination of conventional data analysis and predictive analytics.
➢ Specific data mining benefits include the following:
• More effective marketing and sales. Data mining helps marketers better
understand customer behavior and preferences, which enables them to create
targeted marketing and advertising campaigns. Similarly, sales teams can use
data mining results to improve lead conversion rates and sell additional
products and services to existing customers.
• Better customer service. Thanks to data mining, companies can identify
potential customer service issues more promptly and give contact center
agents up-to-date information to use in calls and online chats with customers.
• Improved supply chain management. Organizations can spot market trends
and forecast product demand more accurately, enabling them to better manage
inventories of goods and supplies. Supply chain managers can also use
information from data mining to optimize warehousing, distribution and other
logistics operations.
• Increased production uptime. Mining operational data from sensors on
manufacturing machines and other industrial equipment supports predictive
maintenance applications to identify potential problems before they occur,
helping to avoid unscheduled downtime.
• Stronger risk management. Risk managers and business executives can better assess
financial, legal, cyber security and other risks to a company and develop plans for
managing them.
• Lower costs. Data mining helps drive cost savings through operational
efficiencies in business processes and reduced redundancy and waste in corporate
spending.
1.5.5 Industry examples of data mining
➢ Here's how organizations in some industries use data mining as part of analytics applications:
• Retail. Online retailers mine customer data and internet clickstream records to help them
target marketing campaigns, ads and promotional offers to individual shoppers.
• Financial services. Banks and credit card companies use data mining tools to
build financial risk models, detect fraudulent transactions and vet loan and credit
applications.
• Insurance. Insurers rely on data mining to aid in pricing insurance policies
and deciding whether to approve policy applications, including risk modeling and
management for prospective customers.
• Manufacturing. Data mining applications for manufacturers include efforts to
improve uptime and operational efficiency in production plants, supply chain performance
and product safety.
• Entertainment. Streaming services do data mining to analyze what users are
watching or listening to and to make personalized recommendations based on people's
viewing and listening habits.
• Healthcare. Data mining helps doctors diagnose medical conditions, treat
patients and analyze X-rays and other medical imaging results. Medical research also
depends heavily on data mining, machine learning and other forms of analytics.
Architecture of Data Mining
A typical data mining system may have the following major components.
Data Mining Functionalities:
We have observed various types of databases and information repositories on which
data mining can be performed. Let us now examine the kinds of data patterns that can
be mined. Data mining functionalities are used to specify the kind of patterns to be
found in data mining tasks. In general, data mining tasks can be classified into two
categories: descriptive and predictive. Descriptive mining tasks characterize the
general properties of the data in the database. Predictive mining tasks perform
inference on the current data in order to make predictions.
In some cases, users may have no idea regarding what kinds of patterns in their
data may be interesting, and hence may like to search for several different
kinds of patterns in parallel. Thus it is important to have a data mining system
that can mine multiple kinds of patterns to accommodate different user
expectations or applications. Furthermore, data mining systems should be able
to discover patterns at various granularities (i.e., different levels of abstraction).
Data mining systems should also allow users to specify hints to guide or focus
the search for interesting patterns. Because some patterns may not hold for all
of the data in the database, a measure of certainty or “trustworthiness” is
usually associated with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in
data. There are many kinds of frequent patterns, including item sets,
subsequences, and substructures.
A frequent item set typically refers to a set of items that frequently appear
together in a transactional data set, such as milk and bread. A frequently
occurring subsequence, such as the pattern that customers tend to purchase first
a PC, followed by a digital camera, and then a memory card, is a (frequent)
sequential pattern. A substructure can refer to different structural forms, such
as graphs, trees, or lattices, which may be combined with item sets or
subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern. Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
1.6 Data Warehousing
➢ A data warehouse is a central repository of information that can be analyzed to make
more informed decisions. Data flows into a data warehouse from transactional
systems, relational databases, and other sources, typically on a regular cadence.
➢ Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytics applications.
➢ Data and analytics have become indispensable to businesses to stay competitive.
Business users rely on reports, dashboards, and analytics tools to extract insights from
their data, monitor business performance, and support decision making.
➢ Data warehouses power these reports, dashboards, and analytics tools by storing data
efficiently to minimize the input and output (I/O) of data and deliver query results
quickly to hundreds and thousands of users concurrently.
➢ A typical data warehouse often includes the following elements:
• A relational database to store and manage data
• An extraction, loading, and transformation (ELT) solution for preparing the
data for analysis.
• Statistical analysis, reporting, and data mining capabilities.
• Client analysis tools for visualizing and presenting data to business users.
• Other, more sophisticated analytical applications that generate actionable
information by applying data science and artificial intelligence (AI)
algorithms, or graph and spatial features that enable more kinds of analysis of
data at scale
Benefits of Data Warehouse
➢ Informed decision making
➢ Consolidated data from many sources
➢ Historical data analysis
➢ Data quality, consistency, and accuracy
➢ Separation of analytics processing from transactional databases, which improves
performance of both systems
1.6.1 Data Warehouse Component
➢ A data warehouse is built with both software and hardware components.
Figure 1.21 Components of Data warehousing
➢ Figure 1.21 shows the essential elements of a typical warehouse. The Source
Data component is shown on the left. The Data Staging element serves as the next
building block.
➢ In the middle is the Data Storage component, which handles the data warehouse's data.
This element not only stores and manages the data; it also keeps track of the data using
the metadata repository.
➢ The Information Delivery component, shown on the right, consists of all the different
ways of making the information from the data warehouse available to the users.
1.6.1.1 Source Data Component
➢ Source data coming into the data warehouses may be grouped into four broad
categories:
➢ Production Data: This type of data comes from the different operating systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments
of the data from the various operational modes.
➢ Internal Data: In each organization, the client keeps their "private" spreadsheets,
reports, customer profiles, and sometimes even department databases. This is the
internal data, part of which could be useful in a data warehouse.
➢ Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
➢ External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external departments.
1.6.1.2 Data Staging Component
➢ After we have extracted data from various operational systems and external
sources, we have to prepare the files for storing in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and made
ready in a format that is suitable to be saved for querying and analysis.
Figure 1.22 Data staging
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even more significant challenges. We perform several individual
tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data components forms a large part of data transformation. Data
transformation contains many forms of combining pieces of data from different sources.
We combine data from a single source record or related data parts from many source
records.
On the other hand, data transformation also involves purging source data that is not
useful and separating out source records into new combinations. Sorting and merging of
data take place on a large scale in the data staging area. When the data transformation
function ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
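A brief, hedged pandas sketch of the staging work described above (merging sources, removing duplicates, supplying defaults for missing values, standardizing a field, and summarizing); the source tables and column names are invented for illustration.

```python
# Staging-area sketch: combine two sources, clean, standardize, summarize.
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2, 2],
                         "city": ["chicago", "Boston", "Boston"],
                         "sales": [100.0, None, None]})
source_b = pd.DataFrame({"cust_id": [3], "city": ["CHICAGO"], "sales": [250.0]})

staged = pd.concat([source_a, source_b], ignore_index=True)
staged = staged.drop_duplicates(subset="cust_id")      # eliminate duplicate records
staged["city"] = staged["city"].str.title()            # standardize a data component
staged["sales"] = staged["sales"].fillna(0.0)          # default value for missing data

summary = staged.groupby("city", as_index=False)["sales"].sum()  # summarized data
print(summary)
```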
3) Data Loading: Two distinct categories of tasks form data loading functions. When we
complete the structure and construction of the data warehouse and go live for the first
time, we do the initial loading of the information into the data warehouse storage. The
initial load moves high volumes of data using up a substantial amount of time.
1.6.1.3 Data Storage Components
➢ Data storage for the data warehouse is a separate repository. The data repositories for
the operational systems generally include only the current data. Also, these data
repositories contain data structured in a highly normalized form for fast and efficient
processing.
1.6.1.4 Information Delivery Component
➢ The information delivery element is used to enable the process of subscribing for data
warehouse files and having it transferred to one or more destinations according to
some customer-specified scheduling algorithm.
1.6.1.5 Metadata Component
➢ Metadata in a data warehouse is equal to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information about
the indexes, and so on.
1.6.1.6 Data Marts
➢ It includes a subset of corporate-wide data that is of value to a specific group of users.
The scope is confined to particular selected subjects. Data in a data warehouse should
be fairly current, but not necessarily up to the minute, although developments in the data
warehouse industry have made standard and incremental data dumps more achievable.
➢ Data marts are smaller than data warehouses and usually contain data relevant to a
specific part of the organization. The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for particular kinds of queries and
reports.
1.6.1.7 Management and Control Component
➢ The management and control elements coordinate the services and functions within
the data warehouse. These components control the data transformation and the data
transfer into the data warehouse storage.
➢ On the other hand, it moderates the data delivery to the clients. It works with the
database management systems and authorizes data to be correctly saved in the
repositories.
➢ It monitors the movement of information into the staging method and from there into
the data warehouses storage itself.
1.6.2 Difference between Database and Data Warehouse
Database vs. Data Warehouse:
• Usage: A database is used for Online Transactional Processing (OLTP), but can be used for other objectives such as data warehousing; it records the data from the clients for history. A data warehouse is used for Online Analytical Processing (OLAP); it reads historical information for business decisions.
• Tables and joins: In a database they are complicated, since they are normalized for the RDBMS to reduce redundant data and save storage space. In a data warehouse they are simpler, since they are de-normalized to minimize the response time for analytical queries.
• Data: Database data is dynamic; data warehouse data is largely static.
• Modeling: Entity-relationship modeling procedures are used for RDBMS database design; data modeling approaches are used for data warehouse design.
• Optimization: A database is optimized for write operations; a data warehouse is optimized for read operations.
• Query performance: A database gives low performance for analytical queries; a data warehouse gives high performance for analytical queries.
• Role: The database is where data is captured and managed to provide fast and efficient access; the data warehouse is where application data is handled for analysis and reporting objectives.
1.6.3 Data Warehouse Architecture
➢ A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-client computing
within the enterprise. Each data warehouse is different, but all are characterized by
standard vital components.
➢ Production applications such as payroll, accounts payable, product purchasing and
inventory control are designed for online transaction processing (OLTP). Such
applications gather detailed data from day-to-day operations.
➢ Data Warehouse applications are designed to support the user ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP). These
include applications such as forecasting, profiling, summary reporting, and trend
analysis.
➢ Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational systems
periodically, usually during off-hours.
➢ As OLTP data accumulates in production databases, it is regularly extracted, filtered,
and then loaded into a dedicated warehouse server that is accessible to users.
➢ As the warehouse is populated, it must be restructured: tables are de-normalized, data is
cleansed of errors and redundancies, and new fields and keys are added to reflect the
needs of the users for sorting, combining, and summarizing data.
➢ Data warehouses and their architectures vary depending upon the elements of an
organization's situation (Figure 1.23).
Figure 1.23 Data Warehouse Architecture
Operational System
➢ An operational system is a method used in data warehousing to refer to a system that
is used to process the day-to-day transactions of an organization.
Flat Files
➢ A Flat file system is a system of files in which transactional data is stored, and every
file in the system must have a different name.
Meta Data
➢ A set of data that defines and gives information about other data. Meta Data used in
Data Warehouse for a variety of purpose, including:
o Meta Data summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For
example, author, date created, date modified, and file size are examples of
very basic document metadata.
o Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
➢ The area of the data warehouse saves all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
➢ The goals of the summarized information are to speed up query performance. The
summarized record is updated continuously as new information is loaded into the
warehouse.
End-User access Tools
➢ The principal purpose of a data warehouse is to provide information to the business
managers for strategic decision-making. These customers interact with the warehouse
using end-client access tools.
➢ The examples of some of the end-user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
EXPLAIN THE ARCHITECTURE OF DATA WAREHOUSE? APR/MAY 2023
Three-Tier Data Warehouse Architecture
Tier-1:
The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data
into the bottom tier from operational databases or other external sources (such
as customer profile information provided by external consultants). These tools
and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load
and refresh functions to update the data warehouse.
The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server. Examples of
gateways include ODBC (Open Database Connection) and OLEDB (Open
Linking and Embedding for Databases) by Microsoft and JDBC (Java Database
Connection). This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented
using either a relational OLAP (ROLAP) model or a multidimensional
OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so
on).
Data Warehouse Models:
There are three data warehouse models.
➢ Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the
entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers, and is cross-functional in
scope. It typically contains detailed data as well as summarized data, and can range in
size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise
data warehouse may be implemented on traditional mainframes, computer super servers,
or parallel architecture platforms. It requires extensive business modeling and may take
years to design and build.
➢ Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group
of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained in
data marts tend to be summarized. Data marts are usually implemented on low-cost
departmental servers that are UNIX/LINUX- or Windows-based. The implementation
cycle of a data mart is more likely to be measured in weeks rather than months or years.
However, it may involve complex integration in the long run if its design and planning
were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.
➢ Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database servers.
1.7 Basic statistical descriptions of Data
➢ For data preprocessing to be successful, it is essential to have an overall picture of our
data. Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
1.7.1 Measuring the Central Tendency: Mean, Median, Mode
➢ A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data.
➢ As such, measures of central tendency are sometimes called measures of central
location. They are also classed as summary statistics. The mean (often called the
average) is most likely the measure of central tendency that you are most familiar
with, but there are others, such as the median and the mode.
➢ The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to
use than others.
1.7.1.1 Mean (Arithmetic)
➢ The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use is
most often with continuous data.
➢ The mean is equal to the sum of all the values in the data set divided by the number of
values in the data set.
➢ The most common and effective numeric measure of the “center” of a set of data is
the (arithmetic) mean. Let x1, x2, . . . , xN be a set of N values or observations, such
as for some numeric attribute X, like salary.
➢ The mean of this set of values is
x̄ = (x1 + x2 + . . . + xN) / N
Example
➢ Suppose we have the following values for salary (in thousands of dollars), shown in
increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70 and 110. Using the above
equation, we have
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58
Thus, the mean salary is $58K.
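A quick check of this calculation in Python, using the salary values listed above:

```python
# Verify the arithmetic mean of the salary example (values in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(sum(salaries) / len(salaries))   # 58.0, i.e. a mean salary of $58K
```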
1.7.1.2 Median
➢ The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to
calculate the median, suppose we have the data below:
[Figure: a set of eleven marks in their original, unsorted order]
➢ We first need to rearrange that data into order of magnitude (smallest first):
[Figure: the marks rearranged in ascending order, with the middle value 56 in bold]
➢ Our median mark is the middle mark - in this case, 56 (highlighted in bold). It is the
middle mark because there are 5 scores before it and 5 scores after it. This works fine
when you have an odd number of scores, but what happens when you have an even
number of scores? What if you had only 10 scores? You simply take the middle
two scores and average the result. So, if we look at the example below:
[Figure: a set of ten marks in their original, unsorted order]
➢ We again rearrange that data into order of magnitude (smallest first):
[Figure: the ten marks rearranged in ascending order]
➢ Only now we have to take the 5th and 6th score in our data set and average them to
get a median of 55.5.
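➢ A short, hedged Python illustration of both cases; the marks used here are illustrative stand-ins, chosen so that the odd-length median is 56 and the even-length median is 55.5, matching the description above.

```python
# Median for an odd and an even number of scores (illustrative marks).
import statistics

marks_11 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 unsorted marks
marks_10 = marks_11[:-1]                                   # drop one to get 10 marks

print(statistics.median(marks_11))   # 56   (the single middle value when sorted)
print(statistics.median(marks_10))   # 55.5 (average of the two middle values)
```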
1.7.1.3 Mode
➢ The mode is the most frequent score in our data set. On a chart it is the
highest bar in a bar chart or histogram. You can sometimes consider the mode as being the
most popular option. An example of a mode is presented below:
[Figure: bar chart in which the mode is the tallest bar]
➢ Normally, the mode is used for categorical data where we wish to know which is the
most common category, as illustrated below:
[Figure: bar chart of transport categories, with bus as the most frequent category]
➢ From the above bar chart we can see that the most common form of transport, in this
particular data set, is the bus. However, one of the problems with the mode is that it is not
unique, so it leaves us with problems when we have two or more values that share the
highest frequency, such as below:
[Figure: bar chart in which two values share the highest frequency]
➢ We are now stuck as to which mode best describes the central tendency of the data.
This is particularly problematic when we have continuous data because we are more
likely not to have any one value that is more frequent than the other.
➢ For example, consider measuring 30 people's weights (to the nearest 0.1 kg). How
likely is it that we will find two or more people with exactly the same weight (e.g.,
67.4 kg)? The answer is: probably very unlikely - many people might be close, but
with such a small sample (30 people) and a large range of possible weights, you are
unlikely to find two people with exactly the same weight; that is, to the nearest 0.1 kg.
This is why the mode is very rarely used with continuous data.
➢ Another problem with the mode is that it will not provide us with a very good
measure of central tendency when the most common mark is far away from the rest of
the data in the data set, as depicted in the diagram below:
[Figure: histogram whose mode (2) lies far from the bulk of the data around 20 to 30]
➢ In the above diagram the mode has a value of 2. We can clearly see, however, that the
mode is not representative of the data, which is mostly concentrated around the 20 to
30 value range. To use the mode to describe the central tendency of this data set
would be misleading.
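➢ A small, hedged Python illustration of the mode for categorical data and of the tied (multi-modal) case discussed above; the values are made up for demonstration.

```python
# Mode of a categorical data set, plus the case where two values tie.
import statistics

transport = ["bus", "car", "bus", "walk", "bus", "train", "car"]
print(statistics.mode(transport))             # 'bus' — the most frequent category

# multimode (Python 3.8+) returns every value sharing the highest frequency
print(statistics.multimode([2, 2, 9, 9, 5]))  # [2, 9]
```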
PART A
1. What is Data Science?
➢ Data Science is a combination of multiple disciplines that uses statistics, data analysis,
and machine learning to analyze data and to extract knowledge and insights from it.
➢ Data Science is about data gathering, analysis and decision-making. Also, it is about
finding patterns in data, through analysis, and make future predictions.
➢ Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
➢ By using Data Science, companies are able to make:
• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find pattern, or maybe hidden information in the data)
2. What is big data?
➢ Big Data is a collection of data that is huge in volume, yet growing exponentially
with time. It is a data with so large size and complexity that none of traditional data
management tools can store it or process it efficiently. Big data is also a data but
with huge size.
➢ The characteristics of big data are often referred to as the three Vs:
➢ Volume—How much data is there?
➢ Variety—How diverse are different types of data?
➢ Velocity—At what speed is new data generated?
3. Applications of data Science.
➢ Fraud and Risk Detection
➢ Healthcare
➢ Internet Search
➢ Targeted Advertising
➢ Website Recommendations
➢ Advanced Image Recognition
➢ Speech Recognition
➢ Airline Route Planning
➢ Gaming
➢ Augmented Reality
4. List the benefits and uses of data Science?
➢ Increases business predictability
➢ Ensures real-time intelligence
➢ Favors the marketing and sales area
➢ Improves data security
➢ Helps interpret complex data
➢ Facilitates the decision-making process
➢ Study purpose
5. List the facets of data.
➢ Structured data
➢ Unstructured data
➢ Natural Language
➢ Machine-generated
➢ Graph-based
➢ Audio, video and images
➢ Streaming
6. Difference between structured and unstructured data.
• Technology: Structured data is based on a relational database; unstructured data is based on character and binary data.
• Flexibility: Structured data is less flexible and schema-dependent; unstructured data has no schema, so it is more flexible.
• Scalability: It is hard to scale a structured database schema; unstructured data is more scalable.
• Robustness: Structured data is very robust; unstructured data is less robust.
7. What are all difference sources of unstructured data?
➢ Web pages
➢ Images (JPEG, GIF, PNG, etc.)
➢ Videos
➢ Memos
➢ Reports
➢ Emails
➢ Surveys
8. What is NLP?
➢ Natural Language Processing or NLP is a branch that focuses on teaching computers
how to read and interpret the text in the same way as humans do. It is a field that is
developing methodologies for filling the gap between Data Science and human
languages.
➢ Many areas like Healthcare, Finance, Media, Human Resources, etc are using NLP for
utilizing the data available in the form of text and speech. Many text and speech
recognition applications are built using NLP.
➢ The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don’t generalize well to other domains.
9. What is Machine data? List the different types of Machine data.
➢ Machine data, also known as machine-generated data, is information that is created
without human interaction as a result of a computer process or application activity.
This means that data entered manually by an end-user is not recognized to be
machine-generated.
➢ These data affect all industries that use computers in their daily operations, and
individuals are increasingly generating this data inadvertently or causing it to be
generated by the machine.
➢ The different types of machine data are,
• Sensor Data
• Computer or System Log Data
• Geotag Data
• Call Log Data
• Web Log Data
10. What are the different steps involved in data science process?
➢ Setting the research goal
➢ Retrieving data
➢ Data preparation
➢ Data exploration
➢ Data modelling
➢ Presentation and automation
11. List out the contents of project charter.
➢ A clear research goal
➢ The project mission and context
➢ Set a Budget
➢ Assess scope and Risks.
➢ How you’re going to perform your analysis
➢ What resources you expect to use
➢ Proof that it’s an achievable project, or proof of concepts
➢ Deliverables and a measure of success
➢ A timeline
12. What are all the steps involved in data preparation?
➢ Cleansing Data - Data cleansing, also referred to as data cleaning or data scrubbing, is
the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a
data set
➢ Data integrating or Data blending- Data blending involves pulling data from different
sources and creating a single, unique, dataset for visualization and analysis.
➢ Transforming Data- After Successful completion of cleansed and integrated data, the
next phase is transforming data. Data transformation is the process of converting data
from one format, such as a database file, XML document or Excel spreadsheet, into
another.
13. List the different types of data cleaning techniques.
➢ Remove duplicates or Data entry errors
➢ Remove irrelevant data
➢ Standardize capitalization
➢ Convert data type
➢ Clear formatting
➢ Fix errors
➢ Language translation
➢ Handle missing values
14. What is Exploratory Data Analysis (EDA)?
➢ Exploratory Data Analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization
methods.
➢ It helps determine how best to manipulate data sources to get the answers you need,
making it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.
➢ EDA is primarily used to see what data can reveal beyond the formal modeling or
hypothesis testing task and provides a better understanding of data set variables and
the relationships between them.
➢ The goal isn't to cleanse the data, but it is common that you still discover anomalies you
missed before, forcing you to take a step back and fix them.
15. What are the steps involved in building a data model?
Building a model is an iterative process. The way you build the model depends on
whether you go with classic statistics or the somewhat more recent machine learning.
Most models consist of the following main steps:
• Selection of a modeling technique and variables to enter in the model
• Execution of the model
• Diagnosis and model comparison
16. What is Machine Learning (ML)? Mention their types.
Machine Learning is a subset of Artificial Intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past
experiences on their own. The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:
“Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.”
The different types of ML are supervised, unsupervised and reinforcement learning.

17. Difference between supervised and unsupervised learning.
Supervised Learning vs. Unsupervised Learning:
• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
18. What is data mining?
➢ Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data
mining techniques and tools enable enterprises to predict future trends and make
more-informed business decisions.
➢ Data mining is a key part of data analytics overall and one of the core disciplines
in data science, which uses advanced analytics techniques to find useful information in
data sets.
19. List out the steps involved in data mining process.
➢ Data gathering
➢ Data preparation
➢ Mining the data
➢ Data analysis and interpretation
20. Mention the different types of data mining techniques.
➢ Association rule mining.
➢ Classification.
➢ Clustering.
➢ Regression
➢ Sequence and path analysis
➢ Neural networks.
21. List out the benefits of data mining.
➢ More effective marketing and sales.
➢ Better customer service.
➢ Improved supply chain management.
➢ Increased production uptime
➢ Stronger risk management
➢ Lower costs
➢ Industry examples of data mining include the following,
• Retail.
• Financial services.
• Insurance.
• Manufacturing.
• Entertainment.
• Healthcare.
22. What is data warehousing?
➢ A data warehouse is a central repository of information that can be analyzed to make
more informed decisions. Data flows into a data warehouse from transaction
systems, relational databases, and other sources, typically on a regular cadence.
➢ Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytic applications.
➢ Data and analytics have become indispensable to businesses to stay competitive.
Business users rely on reports, dashboards, and analytics tools to extract insights from
their data, monitor business performance, and support decision making.
➢ Data warehouses power these reports, dashboards, and analytic tools by storing data
efficiently to minimize the input and output (I/O) of data and deliver query results
quickly to hundreds and thousands of users concurrently.
23. List out the different sources of data components.
Source data coming into the data warehouses may be grouped into four broad
categories:

➢ Production Data: This type of data comes from the different operating systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments
of the data from the various operational modes.
➢ Internal Data: In each organization, the client keeps their "private" spreadsheets,
reports, customer profiles, and sometimes even department databases. This is the
internal data, part of which could be useful in a data warehouse.
➢ Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
➢ External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external departments.
24. Difference between database and data warehousing.
Database vs. Data Warehouse:
• Usage: A database is used for Online Transaction Processing (OLTP), but can be used for other objectives such as data warehousing; it records the data from the clients for history. A data warehouse is used for Online Analytical Processing (OLAP); it reads historical information for business decisions.
• Tables and joins: In a database they are complicated, since they are normalized for the RDBMS to reduce redundant data and save storage space. In a data warehouse they are simpler, since they are de-normalized to minimize the response time for analytical queries.
• Data: Database data is dynamic; data warehouse data is largely static.
• Modeling: Entity-relationship modeling procedures are used for RDBMS database design; data modeling approaches are used for data warehouse design.
25. What is statistical distribution of data? Mention the different types of distribution.
➢ The distribution provides a parameterized mathematical function that can be used to
calculate the probability for any individual observation from the sample space. This
distribution describes the grouping or the density of the observations, called the
probability density function.
➢ We can also calculate the likelihood of an observation having a value equal to or less
than a given value. A summary of these relationships between observations is called a
cumulative density function. The different types of distribution include:
• Gaussian Distribution
• Student’s t-Distribution
• Chi-Squared Distribution
PART B
1. Explain how to build a model. Also mention the importance of machine learning in
building a model.
2. What is data mining? Explain the different data mining techniques.
3. Explain the different components of data warehousing with a diagram.
4. Explain three different statistical descriptions of data.