Unit 1 Dsa
Need for Data Science – Benefits and uses – Facets of data – Data Science Process: Setting The
Research Goal – Retrieving Data – Cleansing, Integrating and Transforming Data – Exploratory
Data Analysis – Build the Models – Presenting And Building Applications.
INTRODUCTION
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains. Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines. Data science is an evolutionary
extension of statistics capable of dealing with the massive amounts of data produced today.
The characteristics of big data are often referred to as the three Vs:
Volume: How much data is there?
Variety: How diverse are the different types of data?
Velocity: At what speed is new data generated?
DATA SCIENCE
Data science is an interdisciplinary field that utilizes scientific methods, algorithms, processes,
and systems to extract insights and knowledge from structured and unstructured data. It
combines elements of mathematics, statistics, computer science, and domain expertise to analyze
complex data sets and derive valuable insights.
NEED FOR DATA SCIENCE
From business to healthcare, from science to everyday life, and from marketing to research,
every field relies on data to move forward. Computer science and information technology now
pervade our lives and advance so quickly, and with such variety, that operational techniques
used only a few years ago have become obsolete. The same is true of challenges and problems:
the concerns of the past for a specific topic, illness, or shortfall may differ from today's,
because they have grown in complexity. Every field of science and study, and every organization,
therefore needs an updated set of operational systems and technology to keep up with the
challenges of today and tomorrow and to find solutions to unanswered questions.
Data science is needed for:
Better Decision Making
Predictive Analysis
Pattern Discovery
It is applied in sectors such as:
Healthcare industry
Retailers
Financial sectors
Transportation
Government sectors
Universities
BENEFITS AND USES
1. Improved Decision-Making
By using data to address problems and inform viewpoints, data scientists play a critical role in
enabling better decision-making. They use a variety of methodologies to analyze and process
massive datasets and extract useful information. Their work offers data-driven
insights that help companies and organizations make informed decisions. A data scientist
might examine patient data in a healthcare organization, for instance, to find trends and patterns
that can improve patient outcomes. In the retail sector, data analysis may be used to develop new
goods and services and to have a better understanding of consumer behavior.
2. Increased Efficiency
Business operations can be made more efficient and costs can be cut with the use of data science.
Businesses can spot inefficiencies and potential improvement areas by analyzing data. To
analyze its supply chain and locate bottlenecks that are creating delays, for instance, a
corporation could use data science. The organization can shorten delivery times and boost overall
efficiency by altering its supply chain in response to this information.
3. Enhanced Customer Experience
Data analysis can reveal customer preferences and behavior. This information can be used to
improve the customer experience by creating goods and services that are tailored to the needs of
the user. Using data science, a business may, for example, analyze prior customer purchases and
make customized product recommendations, which can increase the probability of repeat
business.
4. Competitive Advantage
By empowering firms to make better decisions and discover new opportunities, data science can
give them a competitive edge. Businesses may remain competitive by utilizing data to obtain
insights into their processes and customers. A store, for instance, could use data science to
examine sales data and spot fresh trends. Based on this knowledge, the merchant can create new
products or change their marketing plan to benefit from these trends before their rivals.
5. Predictive Analytics
Based on past data, data science can be used to forecast future results. Businesses can find trends
and forecast future occurrences by using machine learning algorithms to analyze massive
datasets. A healthcare provider could, for instance, use data science to identify the
individuals most at risk of contracting a specific disease and offer them preventive care based
on this predictive analysis.
6. Personalized Marketing and Customer Segmentation
Organizations can segment their consumer bases and develop individualized marketing efforts
using data science. Businesses may send tailored and relevant communications that increase
customer engagement and conversion rates by analyzing consumer data and behavior. This
allows them to better understand individual preferences and needs. For instance, a retail business
can utilize data science approaches to recognize high-value clients and develop tailored
marketing campaigns or loyalty schemes to improve client retention. Similarly, an e-commerce
platform can use customer segmentation to make pertinent product recommendations based on a
user's browsing history and buying habits.
7. Better Healthcare Outcomes
The healthcare sector could undergo a transformation because of data science. Data scientists can
gain insights to increase diagnosis precision, optimize treatment strategies, and improve patient
care, eventually resulting in better healthcare outcomes, by analyzing patient data, medical
records, and clinical studies. Additionally, by taking into account a patient's unique traits, such as
genetics, lifestyle, and previous treatment outcomes, data science enables the optimization of
treatment programs. Data scientists can find patterns and connections in large-scale clinical
data that help them choose the best treatments for certain patient profiles.
8. Efficient Resource Allocation
Using data on resource utilization, demand trends, and supply chain dynamics, data science
helps organizations optimize resource allocation. As a result, resources such as inventory,
people, and equipment are allocated appropriately, waste is reduced, and operational efficiency
increases.
9. Continuous Improvement
Organizations with a culture of continual development benefit from data science. Organizations
can assess performance, monitor advancement, and pinpoint areas for development by analyzing
data. This data-driven strategy encourages an attitude of constant improvement and innovation.
10. Innovation and New Opportunities
Last but not least, data science may help companies innovate and spot new opportunities. Data
science is becoming a driving force behind innovation, allowing companies to find fresh
perspectives and untapped potential. Additionally, data science can find new business prospects
by examining competition data, market dynamics, and consumer behavior.
FACETS OF DATA
The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
1. Structured data
Structured data is arranged in rows and columns, which makes it easy for applications to retrieve
and process. A database management system is typically used to store structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure.
The most common form of structured data or records is a database where specific information is
stored based on a methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is understood by
computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
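The row-and-column idea above can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The table name purchases, its columns, and the sample rows are hypothetical and chosen only for illustration; the point is that each value sits in a named, typed column, so it can be filtered and aggregated directly.

```python
import sqlite3

def high_value_customers(min_total):
    """Return (customer, total) pairs whose total spend is at least min_total."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Hypothetical schema: every row has the same typed columns.
    conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
    rows = [
        ("Asha", "laptop", 900.0),
        ("Ravi", "mouse", 20.0),
        ("Asha", "monitor", 250.0),
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM purchases "
        "GROUP BY customer HAVING SUM(amount) >= ? ORDER BY customer",
        (min_total,),
    )
    result = cur.fetchall()
    conn.close()
    return result

print(high_value_customers(100))
```

Because the structure is known in advance, the query needs no text parsing: it refers to columns by name and lets the database do the grouping.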
2. Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not
used for unstructured data, so it is difficult to retrieve the required information. Unstructured
data has no identifiable structure.
• Unstructured data can take the form of text (documents, email messages, customer
feedback), audio, video, or images. Email is an example of unstructured data.
• Even today, in most organizations, more than 80% of the data is commonly estimated to be in
unstructured form. This data carries a great deal of information, but extracting it from such
varied sources is a major challenge.
Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions, or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
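To see why extraction from unstructured data is harder, consider the following rough sketch, which pulls a few fields out of a free-text email using regular expressions. The sample message, the function name extract_fields, and the patterns are assumptions made for illustration; real messages vary far more, and patterns like these would miss many cases.

```python
import re

# Hypothetical email body: free text with no rows or columns.
EMAIL_TEXT = (
    "Hi team,\n"
    "Order 4521 arrived damaged on 2024-03-02. "
    "Please reply to priya@example.com.\n"
    "Thanks"
)

def extract_fields(text):
    """Pull order numbers, ISO dates, and email addresses out of free text."""
    return {
        "orders": re.findall(r"\bOrder\s+(\d+)", text),
        "dates": re.findall(r"\d{4}-\d{2}-\d{2}", text),
        "emails": re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text),
    }

print(extract_fields(EMAIL_TEXT))
```

Each field needs its own hand-written pattern because nothing in the text itself says where an order number or a date begins.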
3. Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences,
then apply meaning and understanding to that information. This helps machines to understand
language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern
real-world applications. The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go
through speech recognition, natural language understanding and machine translation. It is an
iterative process composed of several layers of text analysis.
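The layered idea described above (recognize sentences, then words, then attach meaning) can be sketched with a toy example. This is not how production natural language processing systems work: the word lists POSITIVE and NEGATIVE are a tiny hypothetical lexicon, and the sentence splitter is a simple rule that breaks on abbreviations and other edge cases.

```python
import re

# Tiny hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def split_sentences(text):
    """Layer 1: split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Layer 2: lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def sentiment(sentence):
    """Layer 3: attach a crude meaning by counting lexicon hits."""
    words = tokenize(sentence)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "The delivery was great. The packaging was terrible! It arrived on Monday."
for s in split_sentences(review):
    print(sentiment(s), "|", s)
```

Each layer consumes the output of the previous one, which is the sense in which the analysis is built from several passes over the text.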
4. Machine-Generated Data
• Machine-generated data is information that is created without human interaction as a result
of a computer process or application activity. This means that data entered manually by an
end user is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.
• It includes configuration data, data from APIs and message queues, change events, the output
of diagnostic commands, call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and
telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based system, as well
as many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, the volume of machine data has
surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based
services and RFID technologies, is making IT infrastructures more complex.
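As an illustration of working with machine data such as web server logs, the sketch below parses a few lines and counts HTTP status codes. The sample lines are invented, and the pattern assumes they follow the widely used Common Log Format; real servers are often configured with different or extended formats, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Invented sample lines, assumed to be in Common Log Format.
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
]

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

def status_counts(lines):
    """Count how often each HTTP status code appears, skipping bad lines."""
    records = (parse_line(line) for line in lines)
    return Counter(r["status"] for r in records if r is not None)

print(status_counts(LOG_LINES))
```

Once each line is turned into a record with named fields, the log can be analyzed like structured data, which is why machine data is described as being either structured or unstructured.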
5. Graph-based or Network Data
• Graphs are data structures that describe relationships and interactions between entities in
complex systems. In general, a graph contains a collection of entities, called nodes, and a
collection of interactions between pairs of nodes, called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem
domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents. Data is stored
just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a
predefined model, allowing a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can
use relationships to process financial and purchase transactions in near-real time. With fast graph
queries, we are able to detect that, for example, a potential purchaser is using the same email
address and credit card as included in a known fraud case.
• Graph databases can also help users easily detect relationship patterns such as multiple people
associated with a personal email address or multiple people sharing the same IP address but
residing at different physical addresses.
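The node-and-edge model and the fraud example above can be sketched in plain Python, without a graph database. The people, email addresses, and card numbers below are hypothetical, and the check is deliberately simplified: it flags anyone who shares at least one attribute (email or card) with a known fraud case, whereas a real system would weigh many more signals.

```python
from collections import defaultdict

# Hypothetical edges: each pair links a person node to an attribute node.
EDGES = [
    ("alice", "email:a@example.com"),
    ("alice", "card:1111"),
    ("bob", "email:b@example.com"),
    ("bob", "card:1111"),
    ("carol", "email:c@example.com"),
    ("carol", "card:2222"),
]

def build_graph(edges):
    """Store the graph as an adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def linked_to_fraud(graph, fraud_person):
    """Return people who share at least one attribute node with fraud_person."""
    linked = set()
    for attribute in graph[fraud_person]:
        for person in graph[attribute]:
            if person != fraud_person:
                linked.add(person)
    return linked

graph = build_graph(EDGES)
print(linked_to_fraud(graph, "alice"))
```

The lookup only follows two hops from the known case (person to attribute, attribute to person), which is the kind of relationship traversal that graph databases are built to run quickly.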