Data Sciences : Unit-I 7th Semester

UNIT-I
INTRODUCTION TO DATA SCIENCE
I. WHAT IS DATA SCIENCE:

Data science is an interdisciplinary field that utilizes scientific methods, algorithms, processes, and
systems to extract insights and knowledge from structured and unstructured data. It combines various
domains such as statistics, machine learning, computer science, and domain-specific knowledge to
analyse complex datasets.
The accelerating volume of data sources, and consequently of data itself, has made data science one of
the fastest-growing fields across every industry.
Data science combines math and statistics, specialized programming, advanced analytics, artificial
intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable
insights hidden in an organization’s data. These insights can be used to guide decision making and
strategic planning.

 Real-Time Examples:

Predictive Maintenance in Manufacturing:

Problem: Manufacturing companies often face unexpected equipment failures, leading to costly
downtime.
Solution: Data science techniques can be applied to sensor data collected from machinery to predict
when maintenance is required before a breakdown occurs.
Example: General Electric (GE) utilizes data from sensors embedded in aircraft engines to predict
potential failures, enabling proactive maintenance and reducing unplanned downtime.

Personalized Recommendation Systems in E-commerce:

Problem: E-commerce platforms struggle to engage users and improve sales due to the vast array of
products available.
Solution: Data science algorithms analyse users' historical behaviours, preferences, and interactions to
provide personalized product recommendations.
Example: Amazon uses data science to analyse customers' browsing and purchasing history to suggest
products they are likely to be interested in, thereby enhancing user experience and driving sales.

Healthcare Analytics for Disease Diagnosis:

Problem: Healthcare providers face challenges in accurately diagnosing diseases and predicting patient
outcomes.
Solution: Data science techniques are applied to patient data including medical records, lab results, and
imaging scans to aid in disease diagnosis and prognosis.
Example: IBM's Watson for Oncology analyses vast amounts of medical literature and patient data to
assist oncologists in developing personalized treatment plans for cancer patients, improving treatment
outcomes and reducing errors.

II. DATA SCIENCE LIFE CYCLE:

The Data Science Lifecycle is an extensive step-by-step guide that illustrates how machine learning and
other analytical techniques can be used to generate insights and predictions from data to accomplish a
business objective.

Several steps are carried out across the lifecycle, including data preparation, cleaning, modelling,
and model evaluation.

[Figure: the stages of the data science lifecycle, beginning with data ingestion]

1. BUSINESS UNDERSTANDING OR UNDERSTANDING THE PROBLEM STATEMENT: The
business understanding stage of the data science lifecycle is about
grasping the enterprise objectives to guide data analysis effectively. Without a clear grasp of
the specific problem or goal, data efforts may lack direction. It's vital to pinpoint the core
business objectives, whether it's minimizing costs or forecasting trends. For instance, a retail
company aiming to increase sales would direct data analysis towards customer segmentation or
pricing strategies. Engaging with stakeholders and domain experts is crucial to ensure
alignment with business needs. Ultimately, this stage sets the tone for the entire data science
process, ensuring that insights generated contribute to tangible business outcomes.

2. DATA COLLECTION OR DATA INGESTION: Data collection is a pivotal phase in the data
science lifecycle, involving the acquisition of both structured and unstructured data from
various sources. Structured data, such as customer records, and unstructured data, like log files
and multimedia, are gathered using methods such as manual entry, web scraping, and real-time
streaming. Manual entry entails inputting data manually into systems, while web scraping
involves extracting information from websites. Real-time streaming captures data as it is
generated by systems and devices. These sources span customer databases, sales records, social
media platforms, IoT devices, and more. The collected data serves as the raw material for
subsequent analysis, enabling organizations to derive valuable insights and make data-driven
decisions. Effective data collection ensures the quality and breadth of the dataset, laying a solid
foundation for the data analysis and interpretation stages. As such, meticulous attention to data
collection methods and sources is essential for the success of data science projects, ensuring
that the insights generated are accurate, relevant, and actionable.

3. DATA STORAGE AND DATA PREPROCESSING: Data storage and processing play a
critical role in the data science lifecycle, as they determine how data is managed and accessed
for analysis. Given the diverse formats and structures of data, companies must select
appropriate storage systems tailored to their specific needs. This decision is influenced by
factors such as data volume, velocity, variety, and veracity. Data management teams are
instrumental in this stage, as they establish standards and protocols for data storage and
structure. These standards ensure consistency and compatibility across different datasets,
facilitating seamless workflows around analytics, machine learning, and deep learning models.
By adhering to standardized storage practices, organizations can streamline data access and
analysis processes, enhancing efficiency and accuracy.

Additionally, data preparation is a crucial component of this stage, involving tasks such as
cleaning, de-duplicating, transforming, and combining data. Techniques like ETL (extract,
transform, load) jobs and other data integration technologies are commonly used for these
purposes. Data preparation serves to promote data quality by removing inconsistencies, errors,
and redundancies, thus ensuring that the data is accurate and reliable for analysis. Once the data
is prepared, it is loaded into storage systems such as data warehouses, data lakes, or other
repositories. These storage platforms provide centralized repositories for storing and accessing
data, enabling easy retrieval and analysis by data scientists and analysts. Ultimately, effective
data storage and processing are essential for promoting data quality, enabling efficient analysis,
and deriving actionable insights that drive business value.
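
The data preparation tasks described above (cleaning, de-duplicating, and transforming before loading)
can be sketched with pandas. The following is a minimal, illustrative example; the column names, data
values, and the CSV output target are hypothetical stand-ins for a real warehouse or data lake load step.

    import pandas as pd

    # Hypothetical raw customer records; in practice these would come from
    # pd.read_csv(), a database query, or the extract step of an ETL job.
    raw = pd.DataFrame({
        "customer_id": [101, 102, 102, 103, 104],
        "name": [" Alice ", "Bob", "Bob", "Carol", None],
        "purchase_amount": ["250", "120", "120", "N/A", "310"],  # mixed quality
    })

    # Transform: trim whitespace and coerce numeric fields (bad values become NaN).
    clean = raw.copy()
    clean["name"] = clean["name"].str.strip()
    clean["purchase_amount"] = pd.to_numeric(clean["purchase_amount"], errors="coerce")

    # Clean: drop duplicate customers and rows with missing critical fields.
    clean = clean.drop_duplicates(subset="customer_id")
    clean = clean.dropna(subset=["name", "purchase_amount"])

    # Load: write the prepared data to a storage target (a CSV file here,
    # standing in for a data warehouse or data lake).
    clean.to_csv("prepared_customers.csv", index=False)
    print(clean)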

4. EXPLORATORY DATA ANALYSIS: This step involves exploring the data before constructing
the actual model, ensuring a thorough understanding of the variables and their
relationships. Graphical exploration techniques, such as bar graphs for categorical variables and
scatter plots or heat maps for numerical variables, are employed to visualize the distribution
and relationships within the dataset.
Bar graphs are useful for representing the distribution of categorical variables, showing the
frequency or proportion of each category. This allows analysts to identify patterns or
imbalances in the data. Scatter plots are effective for visualizing relationships between
numerical variables, displaying the distribution of data points and indicating any correlation or
trend between them. Heat maps provide a visual representation of the relationship between two
numerical variables, with colour intensity indicating the strength of the relationship.
Data visualization techniques play a crucial role in exploring each feature individually and
understanding how they interact with one another. By visually inspecting the data, analysts can
identify potential outliers, anomalies, or patterns that may influence model construction and
interpretation. In addition to the mentioned techniques, other visualization methods such as
histograms, box plots, and pair plots may also be utilized depending on the characteristics of
the data and the specific questions being addressed. These exploratory visualization techniques
help ensure a comprehensive understanding of the data before proceeding to model building,
thereby improving the quality and reliability of the subsequent analysis and insights derived
from it.
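
The plots discussed above can be produced with a few lines of pandas, Matplotlib, and Seaborn. The
dataset and column names below are invented purely for illustration.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Hypothetical retail dataset, for illustration only.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "segment": rng.choice(["New", "Returning", "VIP"], size=200),
        "age": rng.integers(18, 70, size=200),
        "monthly_spend": rng.normal(100, 30, size=200).round(2),
    })

    # Bar graph: distribution of a categorical variable.
    df["segment"].value_counts().plot(kind="bar", title="Customers per segment")
    plt.show()

    # Scatter plot: relationship between two numerical variables.
    df.plot(kind="scatter", x="age", y="monthly_spend", title="Age vs. monthly spend")
    plt.show()

    # Heat map: pairwise correlations between numerical variables.
    sns.heatmap(df[["age", "monthly_spend"]].corr(), annot=True, cmap="coolwarm")
    plt.show()

    # Histogram for an individual numerical feature.
    df["monthly_spend"].plot(kind="hist", bins=20, title="Monthly spend distribution")
    plt.show()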

5. DATA MODELLING: This pivotal stage involves selecting the appropriate model type based
on the problem at hand—whether it's classification, regression, clustering, or another type of
analysis. Once the model type is determined, the next step is to choose the specific algorithm
within that model family, considering factors such as the complexity of the problem and the
characteristics of the dataset. After selecting the algorithm, fine-tuning its hyperparameters is
essential to optimize its performance, balancing factors like accuracy and computational
efficiency. However, it's crucial to strike the right balance between performance and
generalizability, ensuring that the model not only performs well on the training data but also
can effectively generalize to unseen data. Overfitting, where the model learns noise rather than
underlying patterns, is a common challenge that must be addressed through techniques like
regularization and cross-validation. The ultimate goal of data modelling is to develop a robust
and accurate model capable of capturing underlying patterns and making reliable predictions
or classifications on new, unseen data. This requires careful consideration at every step—from
selecting the model type and algorithm to tuning hyperparameters and validating the model's
performance—to ensure that it effectively serves its intended purpose in real-world
applications. These models are trained on historical data, also known as the training dataset, to
learn patterns, relationships, and trends within the data. The aim of model building is to create
a robust and accurate model that can make predictions, classify data into categories, or uncover
valuable insights.
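
A minimal scikit-learn sketch of these ideas (choosing an algorithm, tuning a hyperparameter by
cross-validation, and guarding against overfitting) is shown below; the dataset is a toy one bundled
with the library, and the parameter grid is only an example.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Toy classification dataset bundled with scikit-learn.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model family: logistic regression (a classification algorithm). The
    # regularization strength C is a hyperparameter tuned by 5-fold
    # cross-validation, which helps control overfitting.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(X_train, y_train)

    print("Best hyperparameters:", search.best_params_)
    print("Cross-validated accuracy:", round(search.best_score_, 3))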

6. MODEL EVALUATION: In the model evaluation phase, the readiness of the model for
deployment is assessed by testing it on unseen data and evaluating its performance against
carefully chosen assessment metrics. This process ensures that the model is capable of making
accurate predictions or classifications in real-world scenarios and conforms to reality. If the
evaluation does not yield satisfactory results, the modelling process is iterated until the desired
level of performance is achieved. Like a human, a data science solution, such as a machine
learning model, must evolve and improve with new data and adapt to new evaluation metrics.
While multiple models can be constructed for a particular phenomenon, many may be
imperfect. Model evaluation allows us to select and build the most suitable model for the given
problem, ensuring that it meets the desired criteria for performance and reliability. This iterative
approach to model evaluation and refinement is essential for developing robust and effective
data science solutions that can deliver actionable insights and drive meaningful business
outcomes.
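
As a small, self-contained illustration, the sketch below evaluates a trained classifier on a held-out
test set using common classification metrics; the dataset, model, and metric choices are illustrative
rather than prescriptive.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    # Hold out a test set that the model never sees during training.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # Assess readiness for deployment on unseen data.
    y_pred = model.predict(X_test)
    print("Test accuracy:", round(accuracy_score(y_test, y_pred), 3))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))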

7. MODEL DEPLOYMENT / INTERPRETATION: This deployment phase involves integrating
the model into the existing systems or applications where it will be utilized to make
predictions, classifications, or generate insights. Depending on the specific requirements and
infrastructure of the organization, the deployment may involve embedding the model within
software applications, deploying it as a web service or API, or incorporating it into automated
decision-making processes. The deployment process also includes considerations for
scalability, reliability, and security to ensure smooth and efficient operation of the model in
production environments. Once deployed, the model begins to deliver value by providing
actionable insights, optimizing processes, or enabling data-driven decision-making across the
organization. Ongoing monitoring and maintenance are essential to ensure that the deployed
model continues to perform optimally and remains aligned with evolving business needs and
data dynamics.
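
One common deployment pattern mentioned above is exposing the model as a web service or API. The
sketch below is a minimal, assumed example using Flask; in a real system the model would be trained
offline, versioned, and loaded from storage rather than fitted at startup, and the endpoint name and
payload format are illustrative choices, not a standard.

    from flask import Flask, jsonify, request
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    app = Flask(__name__)

    # A toy model is fitted here so the sketch runs on its own; a production
    # service would load a pre-trained model instead.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects JSON like {"features": [5.1, 3.5, 1.4, 0.2]}.
        payload = request.get_json()
        prediction = model.predict([payload["features"]])[0]
        return jsonify({"predicted_class": int(prediction)})

    if __name__ == "__main__":
        app.run(port=5000)  # other systems call POST /predict to get predictions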

III. DATA SCIENCE TOOLKIT:

The data science toolkit encompasses a range of programming languages, libraries, tools,
technologies, and database management systems essential for conducting data analysis, machine
learning, and big data processing tasks.

1. Programming Languages:

 Python: Renowned for its simplicity, versatility, and extensive ecosystem of libraries,
Python is widely used in data science for data manipulation, machine learning, and data
visualization tasks.
 R: Noted for its statistical computing capabilities, R is preferred by many statisticians and
data analysts for its robust libraries and packages tailored for data analysis and
visualization.
 SQL: Essential for querying and manipulating structured data stored in relational database
management systems (RDBMS), SQL is used to extract insights from databases through
querying and manipulation operations.

2. Libraries and Frameworks:

 NumPy and pandas: Fundamental libraries in Python for numerical computing and data
manipulation, respectively.

 scikit-learn: A comprehensive machine learning library in Python, providing tools for
classification, regression, clustering, and dimensionality reduction.
 TensorFlow and PyTorch: Leading deep learning frameworks used for building,
training, and deploying neural networks and other deep learning models.

3. Data Visualization Tools:

 Matplotlib: A versatile plotting library in Python, Matplotlib offers a wide range of static
visualization capabilities for creating plots, charts, and graphs.
 Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for creating
attractive statistical graphics.
 Plotly: Known for its interactive and dynamic visualization capabilities, Plotly is often
used for creating web-based dashboards and visualizations.

Some of the popular data visualization tools:

 Tableau: A powerful and intuitive data visualization tool that allows users to create
interactive and visually appealing dashboards, reports, and charts without requiring
extensive programming skills.

 Power BI: Developed by Microsoft, Power BI is a business analytics tool that enables
users to visualize and share insights from their data through interactive dashboards
and reports.
 QlikView/Qlik Sense: QlikView and Qlik Sense are business intelligence platforms
that offer powerful data visualization capabilities, allowing users to explore and
analyze data through dynamic and interactive visualizations.

 D3.js: A JavaScript library for creating dynamic, interactive, and custom data
visualizations on the web. D3.js provides extensive capabilities for building highly
customized visualizations using HTML, SVG, and CSS.

 Google Data Studio: A free data visualization tool offered by Google that enables
users to create interactive dashboards and reports using data from various sources,
including Google Analytics, Google Sheets, and Google BigQuery.

 Highcharts: A JavaScript library for creating interactive and visually appealing charts and
graphs. Highcharts offers a wide range of chart types and customization options for building
dynamic visualizations.

4. Data Technologies:

 Hadoop: A distributed storage and processing framework, Hadoop is used for storing
and analyzing large volumes of data across clusters of commodity hardware.
 Spark: Offering fast in-memory data processing capabilities, Apache Spark is widely
used for big data analytics and machine learning tasks.
 Kafka: A distributed streaming platform, Kafka is used for building real-time data
pipelines and streaming applications.

5. Database Management Systems:

 MySQL: A popular open-source relational database management system (RDBMS),
commonly used for transactional and analytical applications.
 MongoDB: A leading NoSQL database, MongoDB is favored for its flexibility and
scalability in handling unstructured data and document-oriented databases.

Together, these components form a comprehensive toolkit for data scientists and analysts,
enabling them to perform a wide range of data processing, analysis, visualization, and modelling
tasks across various domains and industries.

IV. TYPES OF DATA:

In data science, "data" refers to the raw information collected and stored for analysis. This
information can come in various forms, including structured data, unstructured data, and semi-
structured data.

1. STRUCTURED DATA:

Structured data refers to information that is organized and stored in a predefined format,
usually in rows and columns within a relational database or spreadsheet. Each piece of data
has a specific data type and is stored consistently, making it easily searchable and
analyzable. Examples of structured data include tables of customer information, transaction
records, and sensor data with clearly defined fields. Structured data is commonly used in
business applications, financial systems, and data warehouses due to its organization and
ease of access.

Examples of storage locations for structured data:

 Relational Databases: MySQL, PostgreSQL, Oracle, SQL Server
 Spreadsheets: Microsoft Excel, Google Sheets
 Data Warehouses: Amazon Redshift, Google BigQuery, Snowflake

2. UNSTRUCTURED DATA:

Unstructured data, on the other hand, lacks a predefined format and is not organized in a
tabular structure like structured data. Instead, it can encompass a wide variety of data types
such as text documents, emails, social media posts, images, videos, and audio recordings.
Unstructured data poses challenges for analysis because of its lack of organization, but
advanced techniques such as natural language processing (NLP) and computer vision can
be employed to extract insights from unstructured data sources. Examples of unstructured
data analysis include sentiment analysis of customer reviews, image recognition in social
media posts, and speech-to-text conversion in audio recordings.

Examples of storage locations for unstructured data:

 File Systems: Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud
Storage
 NoSQL Databases: MongoDB, Cassandra, Couchbase
 Content Management Systems (CMS): WordPress, Drupal
 Social Media Platforms: Facebook, Twitter, Instagram

3. SEMI-STRUCTURED DATA:

Semi-structured data falls somewhere between structured and unstructured data. It exhibits
some organizational structure, but it may not adhere to a strict schema like structured data.
Examples of semi-structured data formats include JSON files, XML documents, and log
files. While semi-structured data may have some predefined elements or tags, it allows for
flexibility and variation in data representation. Semi-structured data is commonly
encountered in web applications, IoT devices, and data interchange formats where there is
a need for both structure and flexibility in data storage and processing.

Examples of storage locations for semi-structured data:

 Document Databases: MongoDB, Couchbase, Elasticsearch
 Data Interchange Formats: JSON files, XML documents
 Log Files: Apache log files, system logs.
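
To make the contrast with structured data concrete, the sketch below parses a small, hypothetical
JSON record (a typical semi-structured format) and flattens its nested part into a structured table
with pandas.

    import json
    import pandas as pd

    # Hypothetical semi-structured record: some fields are predictable, but
    # "orders" can vary in length and content from record to record.
    raw = """
    {
      "customer_id": 101,
      "name": "Alice",
      "orders": [
        {"order_id": "A-1", "amount": 250.0},
        {"order_id": "A-2", "amount": 80.5}
      ]
    }
    """

    record = json.loads(raw)

    # Flatten the nested part into a structured (tabular) form for analysis.
    orders = pd.json_normalize(record, record_path="orders",
                               meta=["customer_id", "name"])
    print(orders)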

In summary, structured data is organized and stored in a predefined format, unstructured data lacks
organization and structure, and semi-structured data exhibits some level of organization but allows for
flexibility in data representation. Understanding the characteristics of each data type is crucial for
effective data management, analysis, and decision-making in various domains.

V. APPLICATIONS:

Applications of data science in various domains:

1. Finance:

 Fraud Detection: Data science techniques, such as machine learning algorithms, are used to
detect fraudulent activities in financial transactions by analyzing patterns and anomalies in
transaction data.

 Risk Assessment: Data science models help financial institutions assess the credit risk
associated with borrowers by analyzing historical data and predicting the likelihood of default.
 Algorithmic Trading: Data-driven algorithms analyze market data and make high-speed
trading decisions to optimize trading strategies and generate profits.

2. Healthcare:

 Disease Prediction: Data science models analyze patient data, including medical history and
genetic information, to predict the likelihood of developing certain diseases and enable early
intervention.
 Personalized Medicine: Data-driven approaches tailor medical treatments and interventions
to individual patients based on their genetic makeup, medical history, and other relevant factors.
 Medical Image Analysis: Machine learning algorithms analyze medical images, such as X-
rays and MRIs, to assist in diagnosing diseases and identifying abnormalities.

3. E-commerce:

 Recommendation Systems: Data science algorithms analyze customer behavior and
preferences to recommend personalized products and services, enhancing the shopping
experience and increasing sales.
 Customer Segmentation: By clustering customers based on their demographic and behavioral
data, e-commerce companies can target specific customer segments with tailored marketing
campaigns and product offerings.
 Demand Forecasting: Data science models predict future demand for products and services,
enabling e-commerce companies to optimize inventory management and supply chain
operations.

4. Marketing:

 Targeted Advertising: Data science techniques analyze customer data to identify target
audiences and deliver personalized advertisements that are more likely to resonate with
individual customers.
 Customer Churn Prediction: By analyzing customer behavior and engagement metrics, data
science models predict which customers are at risk of churning, allowing marketers to
implement retention strategies.
 Sentiment Analysis: Natural language processing algorithms analyze customer reviews, social
media posts, and other textual data to gauge customer sentiment and inform marketing
strategies and brand perception.

5. Manufacturing:

 Predictive Maintenance: Data science models analyze sensor data from manufacturing
equipment to predict equipment failures and schedule maintenance proactively, reducing
downtime and maintenance costs.
 Supply Chain Optimization: Data-driven optimization techniques optimize supply chain
operations, including inventory management, logistics, and distribution, to improve efficiency
and reduce costs.
 Quality Control: Machine learning algorithms analyze production data to identify defects and
anomalies in manufactured products, ensuring product quality and minimizing defects.

6. Transportation:

 Route Optimization: Data science models optimize transportation routes and schedules to
minimize travel time, fuel consumption, and transportation costs for logistics companies and
transportation providers.
 Demand Forecasting: By analyzing historical transportation data, data science models predict
future demand for transportation services, helping companies allocate resources efficiently and
improve service quality.
 Autonomous Vehicles: Data science techniques, including machine learning and computer
vision, are used to develop self-driving cars and autonomous vehicles that can navigate and
operate safely on roads.

7. Telecommunications:

 Network Optimization: Data science techniques optimize telecommunications networks by
analyzing network traffic patterns, identifying bottlenecks, and optimizing network
configurations to improve performance and reliability.
 Customer Churn Prediction: By analyzing customer usage patterns and behavior, data
science models predict which customers are likely to churn, enabling telecommunications
companies to implement targeted retention strategies.
 Fraud Detection: Data science algorithms analyze call detail records and other
telecommunications data to detect fraudulent activities, such as unauthorized usage or
subscription fraud, helping companies minimize revenue losses and protect against fraud.

8. Government:

 Crime Prediction: Data science models analyze historical crime data and other relevant factors
to predict crime hotspots and trends, enabling law enforcement agencies to allocate resources
effectively and prevent crime.
 Resource Allocation: By analyzing demographic, socioeconomic, and geographic data,
government agencies can optimize resource allocation for public services, infrastructure
development, and disaster response.
 Policy Making: Data-driven insights inform evidence-based policy-making decisions by
providing policymakers with actionable insights and understanding the impact of policy
changes on various stakeholders and outcomes.

VI. DATA COLLECTION AND MANAGEMENT:

Data collection and management involve gathering, organizing, and storing information from various
sources to derive insights and support decision-making.

Based on how data is collected, it can be divided into two categories - Primary and Secondary data.
a) Primary Data: Primary data refers to data that is collected first hand by the researcher for a specific
research purpose.

Characteristics:
 Original and direct: Primary data is collected directly from the source, without any
intermediaries.
 Specific to research objectives: The data is collected with a specific research question or
objective in mind.

 Time-consuming and costly: Collecting primary data can be time-consuming and expensive,
as it often involves conducting surveys, interviews, experiments, or observations.

Examples:
 Surveys: Questionnaires or interviews administered to gather information directly from
respondents.
 Experiments: Controlled studies conducted to test hypotheses or investigate causal
relationships.
 Observations: Systematic observations of behavior or phenomena in natural settings.
 Focus groups: Discussions with a small group of participants to explore opinions,
attitudes, or preferences.

b) Secondary Data: Secondary data refers to data that is collected by someone else for a purpose
other than the researcher's current study.

Characteristics:
 Already available: Secondary data is readily available from existing sources, such as
published literature, databases, reports, or websites.
 Collected for other purposes: The data was originally collected for purposes other than
the current research project.
 Less costly and time-consuming: Using secondary data is often more cost-effective and
less time-consuming compared to collecting primary data.

Examples:
 Published literature: Books, academic journals, articles, and reports containing research
findings or statistical data.

 Government sources: Data collected and published by government agencies, such as census
data, economic indicators, and health statistics.
 Commercial sources: Market research reports, sales data, and industry surveys published
by market research firms or industry associations.
 Online databases: Digital repositories containing datasets, statistics, and other research
materials, such as PubMed, Google Scholar.

1. TYPES OF DATA BASED ON SOURCES:

In the contemporary era, where data holds immense value, organizations leverage a plethora of data
sources to gather information and facilitate decision-making processes in the realm of Big Data
Analytics.

a) Internal Data: Internal data is the data captured and collected by an organization’s internal
processes and systems. A few of the most common examples of internal data include -
 Transactional Data (customer purchase, equipment procurement, employee payroll, etc.)
 Sales and Marketing Data (Email opens, click rates, marketing campaigns, etc.)
 Consumer Data (Customer profiles, names, addresses, etc.)
 Customer Service and Support Data (customer calls, tickets, etc.)
 Online Activity/Browsing Data

b) Third-Party Analytics: In some cases, when an organization does not have the capacity or
resources to collect internal data for analysis, it relies on third-party analytics tools and
services to close internal gaps, collect the required data, and analyze it to its requirements.
For example, Google Analytics is a popular third-party analytics tool that gives
organizations insights into how consumers use their websites.

c) External Data: As the name suggests, External Data is information that originates outside the
organization and is available in the public domain. It can include social media posts, weather
data, market prices, historical demographic data, etc. For example, organizations use social
media posts from Twitter or Facebook to analyze consumer sentiment for their products.

d) Open Data: Open Data is accessible to everyone, and it is free to use. It comes with its own
challenges, such as it can be highly aggregated, it might not be in the required format, etc. A
few common examples of open data include - government data, health and science data, etc.

2. IDENTIFYING AND GATHERING DATA:

In the process of Data Extraction, the initial phase involves identifying the necessary data required
to address the specified problem statement and objectives. This data may originate from various
sources, necessitating the formulation of a data collection strategy. This strategy entails
determining the means of accessing the relevant sources, as well as specifying the duration of data
required. Let's explore several approaches to collecting data –

a) DATABASES: A database is a structured collection of data designed to be easily accessed,
managed, and updated. It serves as a centralized repository for storing and organizing data,
facilitating efficient data management within organizations. Typically, a database is managed
by a Database Management System (DBMS), which provides tools and interfaces for
interacting with the data.
Several popular databases are commonly used in organizations, including MySQL, SQL Server,
MongoDB, PostgreSQL, and Oracle DB. These databases offer different features and
capabilities to suit various business needs.

Databases can be categorized in various ways, but one of the most common distinctions is
between Relational Databases (RDBMS) and non-relational databases. In an RDBMS, data is
organized into tables with rows and columns, and the schema for each table is predefined.
Examples of RDBMS include MySQL and Oracle. SQL (Structured Query Language) is the
standard language used to interact with relational databases, providing a simple and
standardized syntax for querying and managing data.
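
A minimal illustration of querying a relational database with SQL from Python is shown below; SQLite
is used only because it ships with Python, and the table and column names are invented for the example.
The same SQL would apply, with a different driver, to MySQL, PostgreSQL, or Oracle.

    import sqlite3

    # An in-memory SQLite database; no server setup is needed.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Define a structured table and insert a few illustrative rows.
    cur.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
    cur.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Alice", "Laptop", 1200.0), ("Bob", "Phone", 650.0), ("Alice", "Mouse", 25.0)],
    )

    # SQL query: total spend per customer, highest first.
    cur.execute(
        "SELECT customer, SUM(amount) AS total FROM sales "
        "GROUP BY customer ORDER BY total DESC"
    )
    for customer, total in cur.fetchall():
        print(customer, total)

    conn.close()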

b) DATA-STREAMING: Data streams refer to continuously generated data, also known as streaming
data, which originates from a multitude of sources such as IoT devices, sensors,
social media platforms, and logs. This data is transmitted in real-time, arriving in small,
incremental chunks. Streaming data is utilized for various purposes, including real-time data
extraction, aggregation, and filtering. Data scientists and analysts leverage streamed data to
access information instantly and derive actionable insights on the fly, in real-time. By
processing data streams in real-time, organizations can react promptly to changing conditions,
identify emerging trends, and make data-driven decisions without delay. Additionally, real-time
data analysis enables businesses to detect anomalies, monitor performance, and optimize
processes in near real-time, enhancing operational efficiency and agility.
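
The sketch below simulates a small data stream and processes each record incrementally as it arrives
(a running average with a simple anomaly check); in practice the records would come from a streaming
platform such as Kafka or from IoT gateways rather than from a local generator.

    import random
    import time

    def sensor_stream(n_readings=10):
        """Simulate a stream of temperature readings arriving one at a time."""
        for _ in range(n_readings):
            yield round(random.gauss(70.0, 2.0), 2)  # hypothetical sensor values
            time.sleep(0.1)  # stand-in for real-time arrival

    count, running_sum = 0, 0.0
    for reading in sensor_stream():
        # Process each record as it arrives instead of waiting for a full batch.
        count += 1
        running_sum += reading
        running_avg = running_sum / count
        if abs(reading - running_avg) > 5:  # simple real-time anomaly check
            print(f"Anomaly detected: {reading} (running average {running_avg:.2f})")
        print(f"Reading {count}: {reading}, running average {running_avg:.2f}")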

c) API: An API, or Application Programming Interface, acts as a bridge that enables different
software components to communicate with each other by following a set of predefined
protocols. Many websites and service providers offer APIs that allow users to access and extract
data for further processing and analysis. When a user or application calls an API, it sends an
HTTP/web request to the API provider's server, which then responds with the requested data in
a specified format. APIs can return data in various formats, including text, JSON (JavaScript
Object Notation), XML (Extensible Markup Language), HTML (Hypertext Markup
Language), and more. For instance, Google provides a range of APIs that enable developers to
retrieve information from its search engine, maps, and other services. Similarly, popular social
media platforms like Twitter and Facebook offer APIs that allow users to access and extract
relevant data for analysis purposes. For example, researchers can use Twitter's API to download
tweets for tasks such as sports analytics, sentiment analysis, tracking consumer trends, and
more. These APIs empower developers and analysts to leverage valuable data from diverse
sources, facilitating insights generation and informed decision-making.
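
A typical API call from Python might look like the sketch below. The endpoint, parameters,
authentication header, and response structure are hypothetical; each real provider documents its own
URLs, parameters, authentication scheme, and rate limits.

    import requests

    # Hypothetical REST endpoint and API key, for illustration only.
    url = "https://api.example.com/v1/tweets"
    params = {"query": "data science", "limit": 50}
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()           # fail loudly on HTTP errors

    data = response.json()                # most APIs respond with JSON
    for item in data.get("results", []):  # assumed response structure
        print(item)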

d) WEB SCRAPING: Web scraping is the technique of extracting content and data from
websites or the internet. It offers automated methods for efficiently gathering large volumes of
data from various online sources. Typically, the data obtained through web scraping is in
unstructured or HTML format, which requires further processing to convert it into a structured
format suitable for analysis.
Common types of data extracted through web scraping include text, images, videos, pricing
information, reviews, and product details, among others.

There are multiple approaches to perform web scraping and obtain valuable data from websites.
One option is to utilize online web scraping services that offer pre-built tools and platforms for
scraping data from specific websites or domains. Alternatively, developers can create custom-
built scraping scripts or programs using programming languages like Python.
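
A minimal custom scraping sketch in Python using requests and BeautifulSoup is shown below; the URL,
tag names, and class names are assumptions about the target page's HTML, and a site's robots.txt and
terms of service should always be checked before scraping it.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical product listing page.
    url = "https://example.com/products"

    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # The tag and class names below are assumptions; a real scraper is written
    # after inspecting the target site's markup.
    for product in soup.find_all("div", class_="product"):
        name = product.find("h2").get_text(strip=True)
        price = product.find("span", class_="price").get_text(strip=True)
        print(name, price)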

e) RPA: Robotic process automation (RPA) refers to software designed to automate repetitive
and mundane tasks typically performed by humans. RPA bots can handle various activities
related to data collection, such as opening emails and attachments, gathering social media
metrics, extracting information from specific fields in documents, and accessing data from
databases and spreadsheets. However, traditional RPA tools are primarily suited for working
with structured and semi-structured data formats. When dealing with unstructured data, which
comprises a significant portion of potentially valuable content, more advanced solutions
powered by artificial intelligence (AI) are required. These AI-driven solutions are capable of
processing and extracting insights from unstructured data, enabling organizations to leverage a
broader range of information for decision-making and analysis.

f) IDP: Intelligent document processing (IDP) integrates several technologies to streamline the
handling of documents:

 Optical Character Recognition (OCR): OCR is used to extract text from scanned
documents, enabling the digitization of paper-based content.

 Robotic Process Automation (RPA): RPA automates routine tasks involving structured
and semi-structured data, such as data entry or data manipulation.

 Machine Learning (ML) Techniques: ML techniques, including computer vision and


natural language processing (NLP), are employed to classify documents based on their
content, extract relevant information, and organize unstructured data for further analysis.
This involves identifying patterns in text, images, or visual structure to enhance document
understanding.

IDP is particularly useful in scenarios involving the collection and processing of data from various
types of documents, such as insurance claims, medical forms, invoices, contracts, and agreements.
By minimizing the need for manual intervention, IDP enhances efficiency and accuracy in
document processing tasks, ultimately improving overall productivity and decision-making
processes within organizations.

3. DATA REPOSITORIES: WHERE TO STORE COLLECTED DATA:

Data repositories play a critical role in the data management lifecycle, serving as centralized storage
locations where collected data is stored, managed, and accessed. The choice of data repository
depends on various factors such as data volume, structure, access patterns, scalability requirements,
and budget constraints. Let's elaborate on the different types of data repositories:

a. Relational Databases:

Relational databases store data in tables with rows and columns, following a predefined schema.
They are ideal for structured data storage and are commonly used for transactional and analytical
purposes. Relational databases provide robust mechanisms for data organization, querying, and
management. They ensure data integrity through features like transactions, constraints, and
indexing.

Examples: MySQL, PostgreSQL, Microsoft SQL Server.

b. NoSQL Databases:

NoSQL databases are designed for handling unstructured and semi-structured data at scale. They
offer flexibility in data modeling and can accommodate various types of data. NoSQL databases
are highly scalable, making them suitable for use cases with rapidly changing data formats or high
volumes of data.

Examples: MongoDB, Cassandra, Redis.

c. Data Warehouses:

Data warehouses are optimized for storing and analyzing large volumes of structured data. They
support complex queries and are commonly used for business intelligence and analytics
applications. Data warehouses provide a centralized repository for historical and aggregated data,
enabling advanced analytics and reporting.

Examples: Amazon Redshift, Google BigQuery, Snowflake.

d. Data Lakes:

Data lakes provide scalable storage for structured, semi-structured, and unstructured data. They
offer flexibility in data ingestion and support a wide range of analytics and machine learning
workflows. Data lakes enable organizations to store diverse data types in their raw format,
facilitating exploratory analysis and data discovery.

Examples: Amazon S3, Azure Data Lake Storage, Google Cloud Storage.

e. File Systems:

Traditional file systems store files and documents in a hierarchical structure. They are suitable for
storing unstructured data. File systems are easy to set up and manage, making them suitable for
simpler data storage needs.

Examples: Network Attached Storage (NAS), Distributed File Systems (DFS).

f. Cloud Storage:

Cloud storage services offer scalable and durable storage for various types of data. They provide
easy integration with other cloud services and support data access from anywhere with an internet
connection. Cloud storage eliminates the need for managing physical infrastructure and provides
high availability and reliability.

Examples: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage.

VII. USING MULTIPLE DATA SOURCES:

Integrating data from multiple sources is a crucial aspect of modern data analysis and decision-making
processes. By combining data from disparate sources, organizations can gain deeper insights, uncover
correlations, and make more informed decisions. Let's explore this concept further with an example:

 Scenario: A retail company wants to launch targeted marketing campaigns to increase sales.
They decide to integrate sales data from their transactional database with customer
demographic data obtained from a third-party source.

 Process:

Data Collection: The company collects sales data from its internal transactional database,
which includes information such as product sales, transaction dates, and customer IDs.
Additionally, they acquire demographic data from a third-party source, containing details like
age, gender, income level, and location.

Data Integration: The collected data from both sources are integrated using data integration
tools or platforms. This process involves aligning and merging the datasets based on common
identifiers, such as customer IDs or demographic attributes.
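
A minimal pandas sketch of this integration step, merging hypothetical sales and demographic tables on
a shared customer_id, is shown below.

    import pandas as pd

    # Hypothetical extracts from the two sources described above.
    sales = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "amount": [250.0, 90.0, 410.0, 130.0],
        "transaction_date": ["2024-01-05", "2024-01-07", "2024-01-09", "2024-01-12"],
    })
    demographics = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "age_group": ["18-25", "26-40", "41-60", "18-25"],
        "income_level": ["low", "medium", "high", "medium"],
    })

    # Integrate the two datasets on the common identifier.
    combined = sales.merge(demographics, on="customer_id", how="inner")

    # Example analysis on the integrated data: revenue by demographic segment.
    print(combined.groupby("age_group")["amount"].sum())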

Data Analysis: With the integrated dataset in place, the company conducts various analyses to
gain insights. For instance, they may analyse sales trends based on demographic segments to
identify which customer demographics contribute most to sales revenue.

Decision Making: Based on the insights gained from the analysis, the company makes
informed decisions about their marketing strategies. For example, they may decide to tailor
their marketing campaigns to specific demographic groups, such as targeting younger
customers with social media ads or offering discounts to high-income earners.

 Benefits:

Comprehensive Insights: Integrating data from multiple sources provides a holistic view of
the business environment, enabling deeper insights into customer behaviour, market trends, and
sales performance.

Targeted Marketing: By combining sales data with demographic information, organizations can
create targeted marketing campaigns tailored to specific customer segments, leading to higher
engagement and conversion rates.

Improved Decision Making: Data integration facilitates data-driven decision-making by providing
accurate and timely information to stakeholders, helping them make informed choices that drive
business growth and profitability.

In conclusion, integrating data from multiple sources is essential for organizations seeking to gain a
competitive edge in today's data-driven world. By combining diverse datasets, businesses can unlock
valuable insights and opportunities, ultimately leading to better decision-making and improved
outcomes.
