
UNIT 1 INTRODUCTION TO DATA SCIENCE 9

Need for Data Science – Benefits and uses – Facets of data – Data Science Process: Setting The
Research Goal – Retrieving Data – Cleansing, Integrating and Transforming Data – Exploratory
Data Analysis – Build the Models – Presenting And Building Applications.
INTRODUCTION
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains. Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines. Data science is an evolutionary
extension of statistics capable of dealing with the massive amounts of data produced today.
The characteristics of big data are often referred to as the three Vs:
Volume—How much data is there?
Variety—How diverse are the different types of data?
Velocity—At what speed is new data generated?
DATA SCIENCE
Data science is an interdisciplinary field that utilizes scientific methods, algorithms, processes,
and systems to extract insights and knowledge from structured and unstructured data. It
combines elements of mathematics, statistics, computer science, and domain expertise to analyze
complex data sets and derive valuable insights
NEED FOR DATA SCIENCE
From business to healthcare, science to everyday life, and marketing to research, virtually every
field depends on data to move forward. Computer science and information technology now pervade
our lives and advance with such velocity and variety that operational techniques from only a few
years ago have become obsolete. The same is true of the problems themselves: the challenges of
the past for a given domain, illness, or shortfall have grown in complexity. Every field of
science and study, and every organization, therefore needs an updated set of operational systems
and technologies to keep up with the challenges of today and tomorrow and to derive solutions
for unanswered questions. Areas that rely on data science include:
 Better Decision Making
 Predictive Analysis
 Pattern Discovery
 Healthcare industry
 Retailers
 Financial sectors
 Transportation
 Government sectors
 Universities
BENEFITS AND USES

1. Improved Decision-Making
By using data to address problems and inform viewpoints, data scientists play a critical role in
allowing better decision-making. To analyze and process massive datasets and to extract
insightful data, they use a variety of methodologies. Data scientists' work offers data-driven
insights that can enable companies and organizations to make wise decisions. A data scientist
might examine patient data in a healthcare organization, for instance, to find trends and patterns
that can improve patient outcomes. In the retail sector, data analysis may be used to develop new
goods and services and to have a better understanding of consumer behavior.
2. Increased Efficiency
Business operations can be made more efficient and costs can be cut with the use of data science.
Businesses can spot inefficiencies and potential improvement areas by analyzing data. To
analyze its supply chain and locate bottlenecks that are creating delays, for instance, a
corporation could use data science. The organization can shorten delivery times and boost overall
efficiency by altering their supply chain in response to this information.
3. Enhanced Customer Experience
Discovering customer preferences and behavior can be accomplished through data analysis. The
customer experience can be improved by using this information to create goods and services that
are catered to the needs of the user. Using data science, a business may, for example, analyze
prior customer purchases and make customized product recommendations. The probability of
repeat business might rise as a result of this.
4. Competitive Advantage
By empowering them to make better decisions and discover new opportunities, data science may
provide firms a competitive edge. Businesses may remain competitive by utilizing data to obtain
insights into their processes and customers. A store, for instance, could use data science to
examine sales data and spot fresh trends. Based on this knowledge, the merchant can create new
products or change their marketing plan to benefit from these trends before their rivals.
5. Predictive Analytics
Based on past data, data science can be used to forecast future results. Businesses can find trends
and forecast future occurrences by using machine learning algorithms to analyze massive
datasets. A healthcare professional could, for instance, use data science to identify the
individuals most at risk of contracting a specific disease and provide preventive care due to this
predictive analysis.
6. Personalized Marketing and Customer Segmentation
Organizations can segment their consumer bases and develop individualized marketing efforts
using data science. Businesses may send tailored and relevant communications that increase
customer engagement and conversion rates by analyzing consumer data and behavior. This
allows them to better understand individual preferences and needs. For instance, a retail business
can utilize data science approaches to recognize high-value clients and develop tailored
marketing campaigns or loyalty schemes to improve client retention. Similar to this, an e-
commerce platform can make pertinent product recommendations based on a user's browsing
history and buying habits by using customer segmentation.
7. Better Healthcare Outcomes
The healthcare sector could undergo a transformation because of data science. Data scientists can
gain insights to increase diagnosis precision, optimize treatment strategies, and improve patient
care, eventually resulting in better healthcare outcomes, by analyzing patient data, medical
records, and clinical studies. Additionally, by taking into account a patient's unique traits, such as
genetics, lifestyle, and previous treatment outcomes, data science enables the optimisation of
treatment programmes. Data scientists can find patterns and connections in large-scale clinical
data that help them choose the best treatments for certain patient profiles.
8. Efficient Resource Allocation
Utilizing data on resource utilization, demand trends, and supply chain dynamics, data science
aids organizations in maximizing resource allocation. As a result, waste is reduced and
operational efficiency is increased while resources like inventory, people, and equipment are
appropriately allocated.
9. Continuous Improvement
Organizations with a culture of continual development benefit from data science. Organizations
can assess performance, monitor advancement, and pinpoint areas for development by analyzing
data. This data-driven strategy encourages an attitude of constant improvement and innovation.
10. Innovation and New Opportunities
Last but not least, data science may help companies innovate and spot new opportunities. Data
science is becoming a driving force behind innovation, allowing companies to find fresh
perspectives and untapped potential. Additionally, data science can find new business prospects
by examining competition data, market dynamics, and consumer behavior.
FACETS OF DATA
The main categories of data are these:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming
1. Structured data
Structured data is arranged in a row-and-column format, which makes it easy for applications to
retrieve and process. A database management system (DBMS) is typically used to store structured
data.
• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data is a database, where specific information is
stored based on a methodology of columns and rows.
• Structured data is also searchable by data type within content. It is understood by computers
and is efficiently organized for human readers.
• An Excel table is an example of structured data.
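The row-and-column idea can be sketched in a few lines of Python. This is a minimal illustration with a made-up dataset: because every record shares the same columns, an application can query it by field name.

```python
import csv
import io

# A hypothetical structured dataset: every record shares the same columns.
raw = """id,name,department,salary
1,Asha,Engineering,75000
2,Ravi,Marketing,52000
3,Meena,Engineering,81000
"""

# Each row becomes a dict keyed by the column headers.
rows = list(csv.DictReader(io.StringIO(raw)))

# Because the structure is fixed, applications can filter by column.
engineers = [r["name"] for r in rows if r["department"] == "Engineering"]
print(engineers)  # ['Asha', 'Meena']
```

This is exactly the kind of query a DBMS performs at scale; unstructured data (discussed next) offers no such fixed columns to filter on.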

2. Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not
used to organize it, so retrieving specific information is difficult. Unstructured data has no
identifiable structure.
• Unstructured data can take the form of text (documents, email messages, customer feedback),
audio, video, or images. Email is an example of unstructured data.
• Even today, more than 80% of the data in most organizations is unstructured. It carries a
great deal of information, but extracting that information from so many varied sources is a
major challenge.
Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
3. Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences,
then apply meaning and understanding to that information. This helps machines to understand
language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern
real-world applications. The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, text must pass
through stages such as speech recognition, natural language understanding, and machine
translation. It is an iterative process composed of several layers of text analysis.
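Two of the simplest layers of text analysis mentioned above, tokenization and sentiment scoring, can be sketched with the standard library alone. This is a toy illustration: the word lists are invented for the example, and a real sentiment system would use a trained model rather than hand-picked lexicons.

```python
import re

# Tiny illustrative lexicons; a real system would learn these from data.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    """Crude sentiment score: positive word count minus negative word count."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(tokenize("The service was great!"))           # ['the', 'service', 'was', 'great']
print(sentiment("Great product, terrible support"))  # 0
```

Even this naive sketch shows why NLP is iterative: the mixed review scores zero, and resolving it correctly requires deeper layers (phrase structure, context, negation handling) on top of tokenization.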
4. Machine-Generated Data
• Machine-generated data is information created without human interaction, as a result of a
computer process or application activity. Data entered manually by an end user is therefore not
considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.
• It's configuration data, data from APIs and message queues, change events, the output of
diagnostic commands and call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and
telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based system, as well
as many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, the volume of machine data has
surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based
services and RFID technologies, is making IT infrastructures more complex.
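Web server logs, listed above as a typical example of machine data, show why such data needs parsing before analysis. Below is a minimal sketch that extracts fields from one line in the common Apache-style log format; the log line itself is fabricated for illustration.

```python
import re

# One line of (hypothetical) Apache-style access-log data.
line = '192.168.1.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Regex for the common log format: host, timestamp, request, status, bytes.
pattern = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

m = pattern.match(line)
record = m.groupdict()
print(record["host"], record["method"], record["path"], record["status"])
# 192.168.1.7 GET /index.html 200
```

Once parsed into named fields, millions of such lines become structured records that can be aggregated (requests per host, error rates, traffic over time), which is how log analysis platforms work in practice.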
5. Graph-based or Network Data
•Graphs are data structures to describe relationships and interactions between entities in complex
systems. In general, a graph contains a collection of entities called nodes and another collection
of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem
domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents. Data is stored
just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a
predefined model, allowing a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can
use relationships to process financial and purchase transactions in near-real time. With fast graph
queries, we are able to detect that, for example, a potential purchaser is using the same email
address and credit card as included in a known fraud case.
• Graph databases can also help users easily detect relationship patterns, such as multiple
people associated with one personal email address, or multiple people sharing the same IP
address while residing at different physical addresses.
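The fraud-detection pattern described above, spotting an email address or card shared by several people, is at heart a graph query. Here is a minimal sketch using plain Python dictionaries in place of a graph database; the names, email, and card number are invented for the example.

```python
from collections import defaultdict

# Edges link a person node to an attribute node (email or card) they used.
edges = [
    ("alice", "email:a@x.com"),
    ("bob",   "email:a@x.com"),   # bob shares alice's email address
    ("carol", "card:4111"),
    ("bob",   "card:4111"),       # bob also shares carol's credit card
]

# Build an adjacency map from each attribute node to the people connected to it.
neighbours = defaultdict(set)
for person, attribute in edges:
    neighbours[attribute].add(person)

# Flag attributes shared by more than one person - a classic fraud signal.
shared = {attr: sorted(people) for attr, people in neighbours.items() if len(people) > 1}
print(shared)
# {'email:a@x.com': ['alice', 'bob'], 'card:4111': ['bob', 'carol']}
```

A real graph database expresses the same check as a declarative traversal query and evaluates it in near-real time over millions of nodes, which is what makes the approach practical for live transaction screening.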

6. Audio, Image and Video


• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers.
• The terms audio and video commonly refer to the time-based media storage formats for
sound/music and moving-picture information. Digital audio and video recordings, encoded with
audio and video codecs, can be uncompressed, losslessly compressed, or lossily compressed
depending on the desired quality and use case.
• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia data
bring significant challenges in data management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia data.
Multimedia data usually contains various forms of media, such as text, image, video, geographic
coordinates and even pulse waveforms, which come from multiple sources. Data Science can be
a key instrument covering big data, machine learning and data mining solutions to store, handle
and analyze such heterogeneous data.
7. Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which
typically send the data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using
your mobile or web applications, ecommerce purchases, in-game player activity, information
from social networks, financial trading floors or geospatial services and telemetry from
connected devices or instrumentation in data centers.
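The defining trait of streaming data is that records are processed as they arrive rather than loaded all at once. This can be sketched with a Python generator standing in for a live source; the sensor records are simulated for the example, and real systems would read from a message queue or socket instead.

```python
import json

def record_stream():
    """Simulate a source that emits small JSON records one at a time."""
    for i in range(5):
        yield json.dumps({"sensor": "s1", "reading": 20 + i})

# Process each record incrementally instead of loading everything into memory.
running_total, count = 0, 0
for raw in record_stream():
    event = json.loads(raw)
    running_total += event["reading"]
    count += 1
    print(f"after {count} records, running mean = {running_total / count}")
```

The key design point is that the consumer keeps only a small running state (here, a total and a count), so memory use stays constant no matter how long the stream runs.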
Difference between Structured and Unstructured Data
• Structured data is organized in rows and columns; unstructured data has no predefined format
or structure.
• Structured data is easy to search and process by data type; retrieving specific information
from unstructured data is difficult.
• Structured data is typically stored in databases and spreadsheets; unstructured data takes
forms such as text, email, audio, video and images.
• In most organizations, more than 80% of data is unstructured.
OVERVIEW OF THE DATA SCIENCE PROCESS
Following a structured approach to data science helps to maximize the chances of success in a
data science project at the lowest cost. The typical data science process consists of six steps
1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every project this
will result in a project charter.
2. The second phase is data retrieval. The data should be available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. Now, it’s time to prepare the raw data. This includes transforming the data from a raw form
into data that’s directly usable in your models. To achieve this, detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If this
step is successfully completed, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of
the data. Look for patterns, correlations, and deviations based on visual and descriptive
techniques.
5. Finally, we get to model building, often referred to as “data modeling”. It is now that you
attempt to gain the insights or make the predictions stated in your project charter.
6. The last step of the data science model is presenting the results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions.
Step 1: Setting the research goal
Data science is mostly applied in the context of an organization. When the business asks for a
data science project, the analyst first prepares a project charter. This charter contains
information such as what you’re going to research, how the company benefits from it, what data
and resources you need, a timetable, and deliverables.
Step 2: Retrieving data
The second step is to collect data. The required data and where to find it are specified in the
project charter. In this step, ensure that you can use the data in your program, which means
checking its existence, quality, and accessibility. Data can also be delivered by third-party
companies and takes many forms, ranging from Excel spreadsheets to different types of
databases.
Step 3: Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases:
Data cleansing removes false values from a data source and inconsistencies across data sources,
Data integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
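The three subphases, cleansing, integration, and transformation, can be sketched on a toy dataset. The records, manager lookup, and dummy-variable choice below are all invented for illustration; real projects apply the same steps with dedicated tooling.

```python
# Cleansing: drop records with missing or obviously false values.
sales = [
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "north", "amount": None},    # missing value
    {"id": 3, "region": "south", "amount": -50.0},   # impossible negative sale
    {"id": 4, "region": "south", "amount": 80.0},
]
clean = [r for r in sales if r["amount"] is not None and r["amount"] >= 0]

# Integration: enrich each record by joining a second source on "region".
managers = {"north": "Priya", "south": "Dev"}
for r in clean:
    r["manager"] = managers[r["region"]]

# Transformation: encode the categorical column as a dummy (0/1) variable,
# the format many models require.
for r in clean:
    r["is_north"] = 1 if r["region"] == "north" else 0

print(clean)
```

Note the ordering: cleansing comes first so that errors are not propagated into the joined and transformed data.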
Step 4: Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, how the data is distributed, and whether
there are outliers. To achieve this, you use descriptive statistics, visual techniques, and
simple modeling.
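Descriptive statistics and outlier checks, two of the techniques just mentioned, can be sketched with Python's standard library. The ages below are a made-up sample chosen so that one value stands out.

```python
import statistics

ages = [23, 25, 25, 27, 29, 31, 33, 35, 90]  # 90 is a suspicious outlier

# Descriptive statistics summarize the distribution of the data.
mean = statistics.mean(ages)
median = statistics.median(ages)
stdev = statistics.stdev(ages)
print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")

# A common rule of thumb: flag points more than 2 standard deviations
# from the mean as candidate outliers worth investigating.
outliers = [a for a in ages if abs(a - mean) > 2 * stdev]
print(outliers)  # [90]
```

Notice how the mean is pulled well above the median by the single extreme value; that gap between the two statistics is itself a classic exploration signal that outliers are present.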
Step 5: Data modeling or model building
In this phase you use models, domain knowledge, and the insights about the data gained in the
previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative
process that involves selecting the variables for the model, executing the model, and running
model diagnostics.
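The iterative loop of selecting variables, executing the model, and running diagnostics can be illustrated with the simplest statistical model there is: a straight line fitted by ordinary least squares. The data points below are invented for the example.

```python
# Fit y = slope * x + intercept by ordinary least squares, by hand.
x = [1, 2, 3, 4, 5]             # e.g., advertising spend (variable selection)
y = [2.1, 4.0, 6.2, 7.9, 10.1]  # e.g., observed sales

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Executing the model: compute the least-squares coefficients.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Diagnostics: inspect residuals to judge whether the model fits.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

print(f"slope={slope:.2f} intercept={intercept:.2f}")
prediction = slope * 6 + intercept   # predict for an unseen x
print(f"predicted y at x=6: {prediction:.2f}")
```

If the residuals showed a systematic pattern, the iteration would continue: choose different variables or a different technique and refit, which is exactly the loop described above.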
Step 6: Presentation and automation
Finally, present the results to the business people. These results can take many forms, ranging
from presentations to research reports. Sometimes the analyst will automate the execution of the
process because the business will want to use the insights you gained in another project or enable
an operational process to use the outcome from the model.
DEFINING RESEARCH GOALS AND CREATING A PROJECT CHARTER:
Spend time understanding the goals and context of your research. Continuously ask questions
and devise examples until the business expectations are clear.
Create a project charter outlining:
Clear research goals
Project mission and context
Approach for analysis
Expected resources
Proof of project feasibility
Deliverables and success metrics
Timeline
Retrieving Data:
Start with data stored within the company.
Data may be stored in databases, data marts, data warehouses, or data lakes.
Accessing data may require time and adherence to company policies.
Cleansing, Integrating, and Transforming Data:
Cleaning: Remove errors in data to ensure consistency and accuracy.
Integrating: Combine data from different sources through joining and appending operations.
Transforming: Restructure data to meet model requirements, including reducing variables and
using dummy variables.
Exploratory Data Analysis:
Take a deep dive into the data to understand its characteristics.
Utilise graphical techniques such as bar plots, line plots, scatter plots, histograms, etc., to
visualise data and identify patterns.
Building Models:
Develop models aimed at making predictions, classifying objects, or understanding underlying
systems.
Presenting Findings and Building Applications:
Use soft skills to present results to stakeholders effectively.
Industrialise the analysis process for repetitive use and integration with other tools.
