
Unit-I:

FUNDAMENTALS OF DATA SCIENCE


INTRODUCTION TO CORE CONCEPTS AND TECHNOLOGIES:

Introduction:
Data science combines math and statistics, specialized
programming, advanced analytics, artificial intelligence (AI), and
machine learning with specific subject matter expertise to uncover
actionable insights hidden in an organization’s data. These insights
can be used to guide decision making and strategic planning.

Terminology:

Artificial Intelligence:

A technique that enables machines to mimic human intelligence using logic, if-then rules, decision trees, and machine learning that includes deep learning.

Machine Learning:

A subset of Artificial Intelligence that makes the machine learn without explicitly programming it.

Deep Learning:
The subset of machine learning composed of algorithms that permit software to train itself to perform tasks like speech recognition and image recognition by exploring multiple layered neural networks.

Natural language processing:

Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language. NLP makes it possible to interrogate data using natural language text or voice.

Data Science:

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary
approach that combines principles and practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyse large amounts of data. This analysis helps data scientists
to ask and answer questions like what happened, why it happened, what will happen, and what can be done
with the results.
Algorithm:

Algorithms are repeatable sets of instructions, usually expressed mathematically, that humans or machines can use to process given data. Typically, algorithms are constructed by feeding them data and adjusting variables until the desired result is achieved. Machines now typically perform this task, as they can do it much faster than a human.

Python:

Python is one of the most popular programming languages today and is best known as a versatile language, which makes it very useful for analysing data. The language's creators focused on making it easy to learn and user-friendly, so it is also a very common first programming language to learn.

Furthermore, the easily understandable syntax of Python allows for quick, compact, and readable
implementation of scripts or programs, in comparison with other programming languages.

For many reasons, Python is one of the fastest-growing programming languages globally: its ease of learning, the recent explosion of the Data Science field, and the rise of Machine Learning. Python also supports Object-Oriented
and Functional Programming styles, which facilitate building automated tasks and deployable systems.
There are plenty of Python scientific packages for Data Visualization, Machine Learning, Natural Language
Processing, and more.

NumPy & Pandas

NumPy is the fundamental package for Scientific Computing with Python, adding support for large, multi-
dimensional arrays, along with an extensive library of high-level mathematical functions. Pandas is a library
built on top of NumPy for data manipulation and analysis. The library provides data structures and a rich set
of operations for manipulating numerical tables and time series.
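For illustration, here is a minimal sketch (not from the original text) of how the two libraries are typically used together; the column names are hypothetical:

import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])                 # 2-dimensional NumPy array
print(arr.mean(axis=0))                                  # column-wise means: [2. 3.]

df = pd.DataFrame(arr, columns=["height", "weight"])     # labelled table built on the array
print(df.describe())                                     # summary statistics for each column
print(df["height"])                                      # a single column is a Series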

Big Data:

The term "Big Data" has emerged as an ever-increasing amount of data has become available. Today's data
differs from that of the past not only in the amount but also in the speed at which it is available. It is data
with such large size and complexity that none of the traditional data management tools can store it or
process it efficiently.

Big data benefits:

 Big Data can produce more complete answers, because you have more information
 More precisely defined answers through confirmation of multiple data sources

API

APIs provide users with a set of functions used to interact with the features of a specific service or
application. Facebook, for example, provides developers of software applications with access to Facebook
features through its API. By hooking into the Facebook API, developers can allow users of their applications
to log in using Facebook, or they can access personal information stored in their databases.

Why is data science important?

Data science is important because it combines tools, methods, and technology to generate meaning from
data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically
collect and store information. Online systems and payment portals capture more data in the fields of e-
commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image
data available in vast quantities.

History of data science:

While the term data science is not new, the meanings and connotations have changed over time. The word
first appeared in the ’60s as an alternative name for statistics. In the late ’90s, computer science
professionals formalized the term. A proposed definition for data science saw it as a separate field with three
aspects: data design, collection, and analysis. It still took another decade for the term to be used outside of
academia.

Future of data science:

Artificial intelligence and machine learning innovations have made data processing faster and more
efficient. Industry demand has created an ecosystem of courses, degrees, and job positions within the field of
data science. Because of the cross-functional skillset and expertise required, data science shows strong
projected growth over the coming decades.

Data Science Process:

The lifecycle stages are: Case Study (Business Requirement) → Data Preparation → Model Planning using EDA → Model Building → Communicate Results → Model Deployment (Operationalize).

The Data Science Lifecycle revolves around the use of machine learning and different analytical strategies to produce insights and predictions from information in order to achieve a business objective. The complete method includes a number of steps like data cleaning, preparation, modelling, model evaluation, etc. It is a lengthy procedure and may take several months to complete, so it is essential to have a generic structure to follow for each and every problem at hand. The globally recognised structure for solving any analytical problem is referred to as the Cross Industry Standard Process for Data Mining, or CRISP-DM, framework.

People involved in Data Science Project:

1. Data Scientist
2. Data Analyst
3. Data Architect
4. Data Engineer
5. Machine Learning Engineer
6. Statistician
7. Database Administrator
8. Business Analyst
9. Data & Analytics Manager

Data Scientist:
Data scientists have to understand the challenges of business and offer the best solutions using data analysis
and data processing. A data scientist is an expert in R, MatLab, SQL, Python, and other complementary
technologies.

Few Important Roles and Responsibilities of a Data Scientist include:

 Identifying data collection sources for business needs


 Processing, cleansing, and integrating data
 Automating the data collection and management process
 Using Data Science techniques/tools to improve processes
 Analysing large amounts of data to forecast trends and provide reports with recommendations
 Collaborating with business, engineering, and product teams

Data Analyst:

Data analysts are responsible for a variety of tasks, including visualisation. SQL, R, SAS, and Python are some of the sought-after technologies for data analysis. A data analyst should also have good problem-solving skills.

Few Important Roles and Responsibilities of a Data Analyst include:

 Extracting data from primary and secondary sources using automated tools
 Developing and maintaining databases
 Performing data analysis and making reports with recommendations
 Analysing data and forecasting trends that impact the organization/project
 Working with other team members to improve data collection and quality processes

Data Architect:

A data architect creates the blueprints for data management. A data architect requires expertise in data warehousing, data modelling, extraction, transformation and loading (ETL), etc., and must also be well versed in Hive, Pig, Spark, etc.

Few Important Roles and Responsibilities of a Data Architect include:

 Developing and implementing overall data strategy in line with business/organization


 Identifying data collection sources in line with data strategy
 Collaborating with cross-functional teams and stakeholders for smooth functioning of database
systems
 Planning and managing end-to-end data architecture
 Maintaining database systems/architecture considering efficiency and security
 Regular auditing of data management system performance and making changes to improve
systems accordingly.

Data Engineer:

Data engineers build and test scalable Big Data ecosystems for businesses. A data engineer should have hands-on experience with technologies such as Hive, NoSQL, R, Ruby, Java, C++, and MATLAB. It also helps to be able to work with popular data APIs, ETL tools, etc.

Few Important Roles and Responsibilities of a Data Engineer include:

 Design and maintain data management systems


 Data collection/acquisition and management
 Conducting primary and secondary research
 Finding hidden patterns and forecasting trends using data
 Collaborating with other teams to achieve organizational goals
 Making reports and updating stakeholders based on analytics

Machine Learning Engineer:

Machine learning engineers are in high demand today. However, the job profile comes with its
challenges. Machine Learning Engineer should have sound knowledge of some of the technologies like
Java, Python, JS, etc. Also, should have a strong grasp of statistics and mathematics.

Few Important Roles and Responsibilities of a Machine Learning Engineer include:

 Designing and developing Machine Learning systems


 Researching Machine Learning Algorithms
 Testing Machine Learning systems
 Developing apps/products based on client requirements
 Extending existing Machine Learning frameworks and libraries
 Exploring and visualizing data for a better understanding
 Training and retraining systems
 Understanding the importance of statistics in machine learning

Statistician:

A statistician, as the name suggests, has a sound understanding of statistical theories and data
organization. A statistician has to have a passion for logic. They should also be comfortable with database systems such as SQL, as well as with data mining and the various machine learning technologies.

Few Important Roles and Responsibilities of a Statistician include:

 Collecting, analysing, and interpreting data


 Analysing data, assessing results, and predicting trends/relationships using statistical
methodologies/tools
 Designing data collection processes
 Communicating findings to stakeholders
 Advising/consulting on organizational and business strategy based on data
 Coordinating with cross-functional teams

Database Administrator:

Database administrators are responsible for the proper functioning of all the databases of an enterprise and grant or revoke access to them for the employees of the company depending on their requirements. Essential skills of a database administrator include database backup and recovery, data security, data modelling and design, etc.

Few Important Roles and Responsibilities of a Database Administrator include:

 Working on database software to store and manage data


 Working on database design and development
 Implementing security measures for database
 Preparing reports, documentation, and operating manuals
 Data archiving
 Working closely with programmers, project managers, and other team members

Business Analyst:

The role of business analysts is slightly different than other data science jobs. They identify how the Big
Data can be linked to actionable business insights for business growth. Business analysts should have an understanding of business finances and business intelligence, as well as IT technologies like data
modelling, data visualization tools, etc.

Few Important Roles and Responsibilities of a Business Analyst include:

 Understanding the business of the organization


 Conducting detailed business analysis – outlining problems, opportunities, and solutions
 Working on improving existing business processes
 Analysing, designing, and implementing new technology and systems
 Budgeting and forecasting
 Pricing analysis

Data & Analytics Manager:

A data and analytics manager oversees the data science operations and assigns the duties to their team
according to skills and expertise. Their strengths should include technologies like SAS, Python, R, SQL,
Java, etc., and, of course, management skills.

Few Important Roles and Responsibilities of a Data and Analytics Manager include:

 Developing data analysis strategies


 Researching and implementing analytics solutions
 Leading and managing a team of data analysts
 Overseeing all data analytics operations to ensure quality
 Building systems and processes to transform raw data into actionable business insights
 Staying up to date on industry news and trends

Data Science Process:

(Data Science Life Cycle or Steps involved in a Data Science Project)

1. Case Study (Business Requirement)

A problem cannot be solved if you don’t know what the problem is. Plenty of executives will go to their data
science team claiming that there’s a problem and that the data science team needs to solve it, yet will have
no idea how to articulate the problem, why it needs to be solved, and what the connection is between the
business case and the technical case.

 The first step is to produce a clear definition and understanding of the problem or business case and
then translate that into a data science problem with actionable steps and goals.
 This involves clear, concise communication with the business executives and asking enough
questions that no contradictory results can be produced.
 Solving a problem with data takes a lot of work, so you might as well do it right the first time.
 One of the key questions that executives should be asked is how solving the problem will benefit the
company (or its customers) and how the problem fits into the other processes of the company.
 Not only does this help you and your team determine which data sets will be pulled, but also the
types of analyses you’ll run and the answers you’ll be looking for.
2. Data Preparation:

Data collection

Data collection is the process of collecting, measuring and analysing different types of information using a
set of standard validated techniques. The main objective of data collection is to gather information-rich and
reliable data, and analyse them to make critical business decisions. Once the data is collected, it goes
through a rigorous process of data cleaning and data processing to make this data truly useful for businesses.

There are two main methods of data collection in research based on the information that is required, namely:

1. Primary Data Collection


2. Secondary Data Collection

Primary Data Collection Methods

Primary data refers to data collected from first-hand experience directly from the main source. It refers to
data that has never been used in the past.

The methods of collecting primary data can be further divided into quantitative data collection methods
(deals with factors that can be counted) and qualitative data collection methods (deals with factors that are
not necessarily numerical in nature).

Here are some of the most common primary data collection methods:

 Interviews
 Observations
 Surveys and Questionnaires
 Focus Groups
 Oral Histories

Secondary Data Collection Methods

Secondary data refers to data that has already been collected by someone else. It is much more
inexpensive and easier to collect than primary data. While primary data collection provides more
authentic and original data, there are numerous instances where secondary data collection provides great
value to organizations.

Here are some of the most common secondary data collection methods:

 Internet
 Government Archives
 Libraries

Data cleaning and preparation (Data pre-processing)

Data pre-processing is a technique that transforms raw data into a more understandable, useful and efficient format. The raw data that is available may not be usable in its current format, which is why pre-processing is needed.

Why is data pre-processing required?


The raw data that is available may not be usable in its current format. In the real world, data is generally:

 Incomplete: Certain attributes or values or both are missing or only aggregate data is available.
 Noisy: Data contains errors or outliers
 Inconsistent: Data contains differences in codes or names etc.

Tasks in Data Pre-processing

1. Data Cleaning: The data cleaning process detects and removes the errors and inconsistencies present
in the data and improves its quality. Data quality problems occur due to misspellings during data
entry, missing values or any other invalid data. Basically, “dirty” data is transformed into clean data.
“Dirty” data does not produce accurate and good results; garbage in gives garbage out. So it
becomes very important to handle this data.
What to do to clean data?
o Handle Missing Values: In the case of Numerical data, we can compute its mean or median
and use the result to replace missing values. When two records seem to repeat, an algorithm
needs to determine if the same measurement was recorded twice, or the records represent
different events.
Missing values can be handled in two ways:
 When the data set is huge we can simply remove rows with missing data.
 We can substitute missing values with mean of the rest of the data using pandas
dataframe in python.
Ex: df = df.fillna(df.mean(numeric_only=True))
The different ways to handle missing data are:
 Ignore the data row
 Fill in the missing values manually
 Use a global constant to fill in for missing values
 Use the attribute mean or median
 Use the forward fill or backward fill method

o Handle Noise and Outliers


o Remove unwanted data
2. Data Integration: This task involves integrating (combining) data from multiple sources such as
databases (relational and non-relational), data cubes, files, etc. The data sources can be homogeneous
or heterogeneous. The data obtained from the sources can be structured, unstructured or semi-
structured in format.
3. Data Transformation: This involves applying normalisation and aggregation functions to the data according to the needs of the data set. Using these functions and formulas we transform the data into a usable format.

4. Data Reduction: During this step data is reduced. The number of records or the number of attributes
or dimensions can be reduced. Reduction is performed by keeping in mind that reduced data should
produce the same results as original data.
5. Data Discretization: It is considered as a part of data reduction. The numerical attributes are replaced
with nominal ones. Here we specify the type of data on which we are performing discretization like:

 Numeric data
 Alphabetical data
 Alpha-numeric data
6. Data Validation: Audit the cleaned data to ensure it is consistent and meets your standards.
7. Encoding Categorical data: converting categorical data to numerical data.
8. Feature extraction: Select more relevant features and columns that are most important for analysis
9. Redundancy removal: removing repetition. These can arise due to errors or system issues.
10. Standardization and Normalization: Standardization scales the numerical features of the dataset to a common scale (zero mean, unit variance), while normalization scales the values to the range between 0 and 1 (see the sketch after this list).
11. Feature Selection: Select the most relevant feature columns, i.e. those most important for the analysis.
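The following is a minimal pre-processing sketch in Python using pandas and scikit-learn, assuming a small DataFrame with a numeric column "age" and a categorical column "city" (both hypothetical); it illustrates missing-value handling, encoding of categorical data, and standardization/normalization:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "city": ["Delhi", "Mumbai", None, "Delhi"]})   # hypothetical raw data

# Handle missing values
df["age"] = df["age"].fillna(df["age"].mean())            # numeric column: fill with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])      # categorical column: fill with the mode

# Encode categorical data as numbers (one-hot encoding)
df = pd.get_dummies(df, columns=["city"])

# Standardize (zero mean, unit variance) or normalize (scale to the 0-1 range)
df[["age"]] = StandardScaler().fit_transform(df[["age"]])
# df[["age"]] = MinMaxScaler().fit_transform(df[["age"]])  # alternative: values between 0 and 1
print(df)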
3. Model Planning using Exploratory Data Analysis (EDA)
 The main purpose of EDA is to help look at data before making any assumptions.
 It can help identify obvious errors, as well as better understand patterns within the data, detect
outliers or anomalous events, find interesting relations among the variables.
 Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals.
 There are four primary types of EDA:
o Uni-variate non-graphical: In this the data being analysed consists of just one variable. Since
it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of uni-
variate analysis is to describe the data and find patterns that exist within it.
o Uni-variate graphical: Non-graphical methods don’t provide a full picture of the data.
Graphical methods are therefore required. Common types of uni-variate graphics include:
 Stem-and-leaf plots, which show all data values and the shape of the distribution.

 Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.


 Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.


o Multivariate non-graphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship between two or
more variables of the data through cross-tabulation or statistics.
o Multivariate graphical: Multivariate data uses graphics to display relationships between two
or more sets of data. The most used graphic is a grouped bar chart with each group
representing one level of one of the variables and each bar within a group representing the
levels of the other variable.

 An analysis of any situation can be done in 2 ways (a short EDA sketch in Python follows the outline below):
o Statistical: It is the science of collecting, exploring and presenting large amounts of data to identify patterns and trends. It is also called quantitative analysis.
Ex: Done on a numeric data set using mean, median & mode and also many other techniques.
o Non-statistical: It provides generic information and includes text, sound, still images and moving images, etc. It is also called qualitative analysis.
Exploratory Data Analysis
o Statistical
 Descriptive Statistics: measures of central tendency, measures of variability, frequency distribution, measures of shape
 Inferential Statistics
o Non-Statistical
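As a minimal EDA sketch in Python (the file and column names are hypothetical), the uni-variate and multivariate techniques above can be produced with pandas and Matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                   # hypothetical prepared dataset

print(df.describe())                           # uni-variate non-graphical summary statistics

df["age"].plot(kind="hist")                    # histogram of a single variable
plt.show()

df.boxplot(column="age")                       # box plot: the five-number summary
plt.show()

df.plot(kind="scatter", x="age", y="income")   # multivariate graphical view of two variables
plt.show()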

4. Model Building

 Now is the time to split your data set into train and test sets that will be used to develop your
machine learning models.
 The split is mostly done as 80% training and 20% testing data. The ratio may vary, but the training data will be larger than the testing data (a minimal sketch follows this list).
 Here is where you’ll determine whether you need to create a Supervised or Unsupervised
machine learning model.
 Supervised models are used to classify unseen data and forecast future trends and outcomes by
“learning” patterns in the training data.
 Unsupervised models are used to find similarities within data, understand relationships between
different data points within a set, and perform additional data analyses.
 For example:
o supervised models may be used to protect a company from spam, or to forecast changes
in markets.
o Unsupervised models may be used to segment clients into marketing environments or
recommend products and services to customers based on their previous purchases.
 Your models may need a few tweaks here and there, but if you’ve done all of the previous steps
correctly, there shouldn’t be any major changes necessary.
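A minimal sketch of the train/test split and a supervised model with scikit-learn, assuming a prepared dataset with a "target" column (the file and column names are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("data.csv")                       # hypothetical prepared dataset
X = df.drop(columns=["target"])                    # features
y = df["target"]                                   # label to be predicted

# 80% training data, 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)          # one possible supervised model
model.fit(X_train, y_train)                        # "learn" patterns in the training data
print(accuracy_score(y_test, model.predict(X_test)))   # evaluate on unseen test data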
5. Communicate Results
 The results of the model will be shared with the client using data visualization.
 Visualizing the tabular data makes it easy for a common person to understand.
6. Operationalize
 Once you’ve built a model that you’re satisfied with, it is deployed into the pre-production or test environment. If any issue is found, we try to solve it; if it cannot be solved, we have to repeat all the steps from the beginning.
 It is likely that the company’s software development team will handle and monitor the majority of the deployment and its performance.
 Once everything is ready, the model is put into use by the stakeholders.
 If there are any exceptions, they can be handled and the model can be further developed in future.

Data Science toolkit:


Data scientists rely on popular programming languages to conduct exploratory data analysis and statistical
regression. These open source tools support pre-built statistical modelling, machine learning, and graphics
capabilities. These languages include the following (read more at "Python vs. R: What's the Difference?"):

 R / RStudio: R is an open source programming language for statistical computing and graphics, and RStudio is a popular environment for developing in it.
 Python: A dynamic and flexible programming language with numerous libraries, such as NumPy, Pandas, and Matplotlib, for analysing data quickly.

To facilitate sharing code and other information, data scientists may use GitHub and Jupyter notebooks.

Some data scientists may prefer a user interface, and two common enterprise tools for statistical analysis
include:

 SAS: A comprehensive tool suite, including visualizations and interactive dashboards, for analysing,
reporting, data mining, and predictive modelling.
 IBM SPSS: Offers advanced statistical analysis, a large library of machine learning algorithms, text
analysis, open source extensibility, integration with big data, and seamless deployment into
applications.
Data scientists also gain proficiency in using big data processing platforms, such as Apache Spark,
the open source framework Apache Hadoop, and NoSQL databases.

Python for data science has gradually grown in popularity over the last ten years and is now by far the most
popular programming language for practitioners in the field. Below is an overview of the core tools used by data scientists, largely focused on Python-based tools:

o NumPy
NumPy is a powerful library for performing mathematical and scientific computations with
python. You will find that many other data science libraries require it as a dependency to run as it
is one of the fundamental scientific packages.
This tool interacts with data as an N-dimensional array object. It provides tools for manipulating
arrays, performing array operations, basic statistics and common linear algebra calculations such
as cross and dot product operations.
o Pandas
The Pandas library simplifies the manipulation and analysis of data in python. Pandas works with
two fundamental data structures. They are Series, which is a one-dimensional labelled array, and
a DataFrame, which is a two-dimensional labelled data structure. The Pandas package has a
multitude of tools for reading data from various sources, including CSV files and relational
databases.
Once data has been made available as one of these data structures pandas has a wide range of
very simple functions provided for cleaning, transforming and analysing data. These include
built-in tools to handle missing data, simple plotting functionality and excel-like pivot tables.
o SciPy
SciPy is another core scientific computational python library. This library is built to interact with
NumPy arrays and depends on much of the functionality made available through NumPy.
However, although to use this package you need to have NumPy both installed and imported,
there is no need to directly import the functionality as this is automatically made available.
Scipy effectively builds on the mathematical functionality available in NumPy. Where NumPy
provides very fast array manipulation, SciPy works with these arrays and enables the application
of advanced mathematical and scientific computations.
o Scikit-learn
Scikit-learn is a user friendly, comprehensive and powerful library for machine learning. It
contains functions to apply most machine learning techniques to data and has a consistent user
interface for each.
This library also provides tools for data cleaning, data pre-processing and model validation. One of its most powerful features is the concept of machine learning pipelines. These pipelines enable the various steps in machine learning (e.g. preprocessing, training and so on) to be chained together into one object (see the sketch after this list of tools).

o Keras
Keras is a python API which aims to provide a simple interface for working with neural
networks. Popular deep learning libraries such as Tensorflow are notorious for not being very
user-friendly. Keras sits on top of these frameworks to provide a friendly way to interact with
them.
Keras supports both convolutional and recurrent networks, provides support for multi-backends
and runs on both CPU and GPU.
o Matplotlib
Matplotlib is one of the fundamental plotting libraries in python. Many other popular plotting
libraries depend on the matplotlib API including the pandas plotting functionality and Seaborn.
Matplotlib is a very rich plotting library and contains functionality to create a wide range of charts
and visualisations. Additionally, it contains functions to create animated and interactive charts.
o Jupyter notebooks
Jupyter notebooks are an interactive python programming interface. The benefit of writing
python in a notebook environment is that it allows you to easily render visualisations, datasets
and data summaries directly within the program.
These notebooks are also ideal for sharing data science work as they can be highly annotated by
including markdown text directly in line with the code and visualisations.
o Python IDE
Jupyter notebooks are a useful place to write code for data science. However, there will be many
instances when writing code into reusable modules will be needed. This will particularly be the
case if you are writing code to put a machine learning model into production.
In these instances, an IDE (Integrated Development Environment) is useful as it provides
lots of useful features such as integrated python style guides, unit testing and version control.
o Github
Github is a very popular version control platform. One of the fundamental principles of data
science is that code and results should be reproducible either by you at a future point in time or
by others. Version control provides a mechanism to track and record changes to your work
online.
Additionally, Github enables a safe form of collaboration on a project. This is achieved by a
person cloning a branch (effectively a copy of your project), making changes locally and then
uploading these for review before they are integrated into the project.
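As a small illustration of the scikit-learn pipeline idea mentioned above, here is a hedged sketch using the library's built-in iris dataset; the step names are arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                  # small built-in example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain pre-processing and training into one object
pipe = Pipeline([
    ("scale", StandardScaler()),                   # pre-processing step
    ("model", LogisticRegression(max_iter=200)),   # training step
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))                  # evaluate the whole pipeline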

Types of data:
Based on structure, data is of 3 types:

 Structured data, is data that has been predefined and formatted to a set structure before being placed
in data storage, which is often referred to as schema-on-write. The best example of structured data is
the relational database: the data has been formatted into precisely defined fields, such as credit card
numbers or address, in order to be easily queried with SQL.
o There are three key benefits of structured data:
 Easily used by machine learning algorithms
 Easily used by business users
 Increased access to more tools
o The cons of structured data are centred in a lack of data flexibility. Here are some potential
drawbacks to structured data’s use:
 predefined purpose limits use
 Limited storage options
 Unstructured data, is data stored in its native format and not processed until it is used, which is
known as schema-on-read. It comes in a myriad of file formats, including email, social media posts,
presentations, chats, IoT sensor data, and satellite imagery.
o As there are pros and cons of structured data, unstructured data also has strengths and
weaknesses for specific business needs. Some of its benefits include:
 Freedom of the native format
 Faster accumulation rates:
 Data lake storage
o There are also cons to using unstructured data. It requires specific expertise and specialized
tools in order to be used to its fullest potential.
 Requires data science expertise
 Specialized tools

 Semi-structured data refers to what would normally be considered unstructured data but that also has metadata identifying certain characteristics. The metadata contains enough information to enable the data to be more efficiently catalogued, searched, and analysed than strictly unstructured data.

Based on nature, data is of 2 types:

 Qualitative Data are measures of 'types' and may be represented by a name, symbol, or a number
code. This is data about categorical variables (e.g. what type).
o further divided into 2 types
 Nominal Data, data with no inherent order, like gender of a person – male, female.
 Ordinal Data, data with ordered series like rating a hotel – good, average, bad…….
 Quantitative data are measures of values or counts and are expressed as numbers. This type of data
are data about numeric variables (e.g. how many; how much; or how often).
o further divided into 2 types
 Discrete Data, holds finite No. of possible values, like No. of Students in a Class.
 Continuous Data, holds infinite No. of possible values, like weight of a person..

Example Applications:
Data science has found its applications in almost every industry.

1. Healthcare

Healthcare companies are using data science to build sophisticated medical instruments to detect and cure
diseases.

2. Gaming

Video and computer games are now being created with the help of data science and that has taken the
gaming experience to the next level.

3. Image Recognition
Identifying patterns in images and detecting objects in an image is one of the most popular data science
applications.

4. Recommendation Systems

Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase,
or browse on their platforms.

5. Logistics

Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and
increase operational efficiency.

6. Fraud Detection

Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.

7. Internet Search

When we think of search, we immediately think of Google. Right? However, there are other search engines,
such as Yahoo, Duckduckgo, Bing, AOL, Ask, and others, that employ data science algorithms to offer the
best results for our searched query in a matter of seconds. Google alone handles more than 20 petabytes of data per day; Google would not be the 'Google' we know today if data science did not exist.

8. Speech recognition

Speech recognition is dominated by data science techniques. We may see the excellent work of these
algorithms in our daily lives. Have you ever needed the help of a virtual speech assistant like Google
Assistant, Alexa, or Siri? Voice recognition technology operates behind the scenes, attempting to interpret and evaluate your words and deliver useful results. Image recognition may also
be seen on social media platforms such as Facebook, Instagram, and Twitter. When you submit a picture of
yourself with someone on your list, these applications will recognise them and tag them.

9. Targeted Advertising

If you thought Search was the most essential data science use, consider this: the whole digital marketing
spectrum. From display banners on various websites to digital billboards at airports, data science algorithms
are utilised to target almost anything. This is why digital advertisements have a far higher CTR (Click-Through Rate) than traditional marketing; they can be customised based on a user's prior behaviour. That is
why you may see adverts for Data Science Training Programs while another person sees an advertisement
for clothes in the same region at the same time.

10. Airline Route Planning

As a result of data science, it is easier to predict flight delays, which is helping the airline industry grow. It also helps to determine whether to fly directly to the destination or to make a stop in between, for example on a flight from Delhi to the United States of America.

11. Augmented Reality

Last but not least, this final data science application appears to be the most fascinating one for the future. Yes, we are discussing augmented reality. Do you realise there is a fascinating relationship between data science and virtual reality? A virtual reality headset incorporates computer expertise, algorithms, and data to create the greatest viewing experience possible. The popular game Pokemon GO is a small step in that direction: the ability to wander about and look at Pokemon on walls, streets, and other surfaces where they do not really exist. The makers of this game chose the locations of the Pokemon and gyms using data from Ingress, the previous app from the same company.

DATA COLLECTION AND MANAGEMENT:


Introduction:

Data collection is the methodological process of gathering information about a specific subject. It’s crucial
to ensure your data is complete during the collection phase and that it’s collected legally and ethically. If
not, your analysis won’t be accurate and could have far-reaching consequences.

Sources of data:

There are two main methods of data collection in research based on the information that is required, namely:

1. Primary Data Collection


2. Secondary Data Collection

Primary Data Collection Methods

Primary data refers to data collected from first-hand experience directly from the main source. It refers to
data that has never been used in the past.

The methods of collecting primary data can be further divided into quantitative data collection methods
(deals with factors that can be counted) and qualitative data collection methods (deals with factors that are
not necessarily numerical in nature).

Here are some of the most common primary data collection methods:

1. Interviews: Interviews are a direct method of data collection. It is simply a process in which the
interviewer asks questions and the interviewee responds to them.
 Interviews are of two types: structured and unstructured.
 A structured interview is pre-planned, with the queries prepared earlier.
 An unstructured interview is not pre-planned; the queries are asked spontaneously.
2. Observations: In this method, researchers observe a situation around them and record the findings. It
can be used to evaluate the behaviour of different people in controlled (everyone knows they are
being observed) and uncontrolled (no one knows they are being observed) situations.
3. Surveys and Questionnaires: Surveys and questionnaires provide a broad perspective from large
groups of people. They can be conducted face-to-face, mailed, or even posted on the Internet to get
respondents from anywhere in the world. The answers can be yes or no, true or false, multiple
choice, and even open-ended questions.
4. Focus Groups: A focus group is similar to an interview, but it is conducted with a group of people
who all have something in common.
5. Oral Histories: Oral histories also involve asking questions like interviews and focus groups.
However, it is defined more precisely and the data collected is linked to a single phenomenon.

Secondary Data Collection Methods

Secondary data refers to data that has already been collected by someone else. It is much more
inexpensive and easier to collect than primary data. While primary data collection provides more
authentic and original data, there are numerous instances where secondary data collection provides great
value to organizations.

Here are some of the most common secondary data collection methods:

1. Internet: The use of the Internet has become one of the most popular secondary data collection
methods in recent times. There is a large pool of free and paid research resources that can be easily
accessed on the Internet.

2. Government Archives: There is lots of data available from government archives that you can make
use of. The challenge, however, is that data is not always readily available. For example, criminal records can come under classified information and are difficult for anyone to access.

3. Libraries: Most researchers donate several copies of their academic research to libraries. You can
collect important and authentic information based on different research contexts.

Data collection and APIs:


 An API is a (hypothetical) contract between two pieces of software: if the user software provides input in a pre-defined format, the latter will extend its functionality and provide the outcome to the user software.
 Think of it like this: a graphical user interface (GUI) or command line interface (CLI) allows humans to interact with code, whereas an Application Programming Interface (API) allows one piece of code to interact with other code.
 One of the most common use cases for APIs is on the web. If you have spent a few hours on the internet, you have certainly used APIs. Sharing things on social media, making payments over the web, displaying lists of tweets through a social handle – all of these services use APIs in the background.
 Let’s try and understand it better with the help of an example:
o Pokemon Go has been one of the most popular smartphone games, but in order to build such a game, taking into account the large ecosystem, one requires complete information about routes and roads across the globe.
o The developers of Pokemon Go surely faced a dilemma: should they code the maps of the entire world themselves, or use the existing Google Maps and build their application on top of it?
o They chose the latter, simply because it is practically not possible to create something similar to Google Maps in a short span of time.

 APIs provide a very convenient way of making code reusable.


Basic elements of an API:
An API has three primary elements (illustrated in the sketch after this list):
 Access: the user, or who is allowed to ask for data or services?
 Request: is the actual data or service being asked for (e.g., if I give you current location from my
game (Pokemon Go), tell me the map around that place). A Request has two main parts:
o Methods: i.e. the questions you can ask, assuming you have access (it also defines the type of
responses available).
o Parameters: additional details you can include in the question or response.
 Response: the data or service as a result of your request.
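A minimal sketch of these three elements using Python's requests library; the URL, parameters and key below are hypothetical placeholders, not a real service:

import requests

url = "https://api.example.com/maps/nearby"             # the service being asked (hypothetical)
headers = {"Authorization": "Bearer <API_KEY>"}         # Access: proves who is allowed to ask
params = {"lat": 17.38, "lon": 78.48, "radius": 500}    # Parameters: extra details of the question

response = requests.get(url, params=params, headers=headers)   # Method: an HTTP GET request
print(response.status_code)                             # Response: status of the request
print(response.json())                                  # Response: the returned data as JSON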

Categories of API

 Web-based system
o A web API is an interface to either a web server or a web browser. These APIs are used
extensively for the development of web applications.
o These APIs work at either the server end or the client end. Companies like Google, Amazon,
eBay all provide web-based API.
o Some popular examples of web based API are Twitter REST API, Facebook Graph API,
Amazon S3 REST API, etc.
 Operating system
o There are multiple OS based API that offers the functionality of various OS features that can
be incorporated in creating windows or mac applications.
o Some of the examples of OS based API are Cocoa, Carbon, WinAPI, etc.
 Database system
o Interaction with most databases is done using API calls to the database. These APIs are defined in a manner that passes out the requested data in a predefined format understandable by the requesting client.
o This makes the process of interacting with databases generalised, thereby enhancing the compatibility of applications with various databases. They are very robust and provide a structured interface to the database.
o Some popular examples are Drupal 7 Database API, Drupal 8 Database API, Django API.
 Hardware System
o These APIs allow access to the various hardware components of a system. They are extremely crucial for establishing communication with the hardware, making possible a range of functions, from the collection of sensor data to display on your screens.
o For example, the Google PowerMeter API will allow device manufacturers to build home
energy monitoring devices that work with Google PowerMeter.
o Some other examples of Hardware APIs are: QUANT Electronic, WareNet CheckWare,
OpenVX Hardware Acceleration, CubeSensore, etc.

5 APIs every Data Scientists should know

1. Facebook API

The Facebook API provides an interface to the large amount of data generated every day. The innumerable posts, comments and shares in various groups & pages produce massive data. And this massive public data
provides a large number of opportunities for analyzing the crowd. It is also incredibly convenient to use
Facebook Graph API with both R and python to extract data. To read more about the Facebook API, click
here.

2. Google Map API

The Google Maps API is one of the most commonly used APIs. Its applications vary from integration in a cab service
application to the popular Pokemon Go. You can retrieve all the information like location coordinates,
distances between locations, routes etc. The fun part is that you can also use this API for creating the
distance feature in your datasets as well. Read here to find out its complete implementation.

3. Twitter API

Just like Facebook Graph API, Twitter data can be accessed using the Twitter API as well. You can access
all the data like tweets made by any user, the tweets containing a particular term or even a combination of
terms, tweets done on the topic in a particular date range, etc. Twitter data is a great resource for performing
the tasks like opinion mining, sentiment analysis. For detailed usage of twitter API, read here.

4. IBM Watson API

IBM Watson offers a set of APIs for performing a host of complex tasks such as Tone analyzer, document
conversion, personality insights, visual recognition, text to speech, speech to text, etc. by using just few lines
of code. This set of APIs differs from the other APIs discussed so far, as they provide services for
manipulating and deriving insights from the data. To know in depth details about this API, read here.

5. Quandl API

Quandl lets you invoke the time series information of a large number of stocks for the specified date range.
The setting up of Quandl API is very easy and provides a great resource for projects like Stock price
prediction, stock profiling, etc. Click here, to read more details about Quandl API.

List of 5 cool data science projects using API:

Here’s the list of ideas you can start with. You can either use these APIs to retrieve data & manipulate it to
extract insights from it or pass the data to these APIs & perform complex functions.

 Social Media Sentiment Analysis : By using data from Twitter and Facebook API.
 Opinion Mining : By using data from Twitter and Facebook API.
 Stock Prediction : By using data from Yahoo Stock API and Quandl API.
 Most Popular languages on Github : By using data from Github API.
 Microsoft Face Sentiment Recognition : By using Microsoft face API.

Exploring and Fixing Data


Data exploration is the first step of data analysis used to explore and visualize data to uncover insights from
the start or identify areas or patterns to dig into more. Using interactive dashboards and point-and-click data
exploration, users can better understand the bigger picture and get to insights faster.

Why is Data Exploration Important?

Starting with data exploration helps users to make better decisions on where to dig deeper into the data and
to take a broad understanding of the business when asking more detailed questions later. With a user-
friendly interface, anyone across an organization can familiarize themselves with the data, discover patterns,
and generate thoughtful questions that may spur on deeper, valuable analysis. Data exploration and visual
analytics tools build understanding, empowering users to explore data in any visualization.

What are the Main Use Cases for Data Exploration?

Data exploration can help businesses explore large amounts of data quickly to better understand next steps in
terms of further analysis. This gives the business a more manageable starting point and a way to target areas
of interest. In most cases, data exploration involves using data visualizations to examine the data at a high
level. By taking this high-level approach, businesses can determine which data is most important and which
may distort the analysis and therefore should be removed. Data exploration can also be helpful in decreasing
time spent on less valuable analysis by selecting the right path forward from the start.

Data exploration and fixing are essential steps in the data science workflow. They involve understanding the
characteristics of the data, identifying issues or anomalies, and taking corrective actions to prepare the data
for analysis. Here are the key steps involved in data exploration and fixing:

1. Data Understanding: Begin by familiarizing yourself with the dataset. Understand the data sources,
variables, and their meanings. Identify the data types (numerical, categorical, text) and the structure of the
dataset (tabular, hierarchical). This initial understanding helps you plan the exploration process effectively.

2. Descriptive Statistics: Calculate descriptive statistics such as mean, median, mode, standard deviation,
and quartiles to gain insights into the central tendencies, spread, and distribution of numerical variables. For
categorical variables, you can determine the frequency counts or proportions of different categories.

3. Data Visualization: Visualize the data using various charts and plots. Histograms, box plots, scatter plots,
bar charts, and heatmaps are some common visualization techniques. Visual exploration helps identify
patterns, outliers, correlations, and potential data issues visually.

4. Handling Missing Values: Identify missing values in the dataset and decide how to handle them. You can
choose to remove records with missing values, fill in missing values with imputation techniques (mean,
median, mode), or use advanced imputation methods such as regression-based imputation or multiple
imputations.

5. Outlier Detection: Identify outliers in the data—data points that significantly deviate from the expected
patterns. Outliers can affect the analysis and modeling results. Techniques such as box plots, scatter plots, or
statistical methods like z-scores or Tukey's fences can help detect and handle outliers appropriately (e.g.,
removing outliers or applying transformations).
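A minimal sketch of both outlier tests on a small hypothetical series (the threshold values are common choices, not fixed rules):

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 95])   # hypothetical values; 95 is an outlier

# z-score method: flag points far from the mean (2.5 to 3 standard deviations is typical)
z = (s - s.mean()) / s.std()
print(s[np.abs(z) > 2.5])

# Tukey's fences (IQR method): flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])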

6. Data Transformation: Depending on the data characteristics and the analysis objectives, you might need
to transform variables. Common transformations include logarithmic, square root, or power transformations
for skewed distributions. Feature scaling, such as standardization (z-score normalization) or normalization
(min-max scaling), may be necessary for some machine learning algorithms.

7. Handling Inconsistent or Incorrect Data: Identify and resolve inconsistencies or errors in the dataset.
This could involve correcting typographical errors, resolving conflicting values, or reconciling discrepancies
between related variables. Domain knowledge, data validation rules, or external references might be helpful
in this process.

8. Handling Duplicates: Check for duplicate records or observations in the dataset. Duplicates can distort
analysis results and introduce bias. Remove or handle duplicates based on the context and nature of the data.
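For example, with pandas, exact or key-based duplicates can be dropped as follows (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10, 20, 20, 30]})   # hypothetical data
df = df.drop_duplicates()                                  # remove exact duplicate rows
# df = df.drop_duplicates(subset=["id"], keep="first")     # or de-duplicate on a key column
print(df)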
9. Feature Engineering: Explore and create new features from existing ones to improve the predictive
power of the data. Feature engineering involves techniques like one-hot encoding, creating interaction terms,
binning continuous variables, or deriving new variables based on domain knowledge or data patterns.
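A short sketch of these feature engineering ideas with pandas (all names and bin edges are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 61],
                   "income": [20, 40, 55, 30],
                   "city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})   # hypothetical data

df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])   # binning a continuous variable
df["age_x_income"] = df["age"] * df["income"]                    # interaction term
df = pd.get_dummies(df, columns=["city"])                        # one-hot encoding
print(df.head())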

10. Data Sampling: Consider data sampling techniques if the dataset is too large or imbalanced. Sampling
can help reduce computational complexity, balance class distributions, or create representative subsets for
exploratory analysis or model development.

11. Data Validation and Cross-Checking: Validate the data against external sources, reference datasets, or
business rules to ensure accuracy and consistency. Cross-checking data with other trusted sources can help
identify discrepancies and anomalies.

12. Iterative Process: Data exploration and fixing are iterative processes. As you uncover insights or
identify issues, refine your understanding, adjust your approaches, and revisit previous steps if necessary.

Data exploration and fixing are critical for ensuring the quality, reliability, and suitability of data for
analysis. By thoroughly understanding and addressing data issues, you can enhance the effectiveness and
reliability of your data science projects.

Data Storage and Management In Data Science


 Data is everywhere, and with reference to computer science, data is information in binary form, i.e. all data is represented in different patterns of 0’s and 1’s.
 Every system has some storage space; for example, our mobiles have 32 GB to 128 GB of storage space to store apps, music, images, etc., and our computers have 500 GB to 1 TB of storage space.
 These are all small amounts of storage, but with reference to data science we talk about Big Data, where the storage is on the scale of petabytes, i.e. one million GB or more.

When discussing big data, there are typically five commonly recognized characteristics, often referred to as
the "V's" of big data. These characteristics help describe the unique nature of big data and the challenges
associated with its management and analysis. The five V's of big data are:

1. Volume: Volume refers to the vast amount of data generated from various sources such as social media,
sensors, online transactions, and more. Big data often involves handling and analyzing data at a scale that
surpasses the capabilities of traditional data management systems.

2. Velocity: Velocity refers to the speed at which data is generated, collected, and processed. With the
advent of real-time data sources like social media feeds, sensors, and financial transactions, data is often
produced at an incredibly high velocity. The challenge lies in capturing, storing, and analyzing data in near
real-time to derive timely insights.

3. Variety: Variety refers to the diverse types and formats of data that exist in big data. It includes
structured data (e.g., traditional databases), semi-structured data (e.g., XML, JSON), and unstructured data
(e.g., emails, social media posts, videos). Big data often involves dealing with a mix of data types,
requiring different processing and analysis techniques.

4. Veracity: Veracity refers to the reliability, accuracy, and trustworthiness of the data. Big data is often
characterized by data uncertainty, inconsistency, and noise. Dealing with data quality issues and ensuring
the veracity of the data is crucial for making accurate and reliable decisions based on big data.

5. Value: Value refers to the potential insights, knowledge, and business value that can be derived from big
data through analysis and interpretation. The goal of big data initiatives is to extract meaningful
information and actionable insights that can lead to improved decision-making, operational efficiency,
innovation, and competitive advantage.

 Put simply, big data is larger, more complex data sets, especially from new data sources. These data
sets are so voluminous that traditional data processing software just can’t manage them. But these
massive volumes of data can be used to address business problems you wouldn’t have been able to
tackle before.
 There has been an explosion of data in the last few decades with the development of the web, mobile devices and IoT (Internet of Things).
 Management and quality assurance of data have become priorities for businesses.

Figure 10

 Effective data storage management is more important than ever, as security and regulatory compliance
have become even more challenging and complex over time.
 Enterprise data volumes continue to grow exponentially. So how can organizations effectively store it
all? That's where data storage management comes in.
 Storage management ensures data is available to users when they need it.

Data storage and management play a crucial role in data science. As a data scientist, you need to
effectively store, organize, and manage your data to extract meaningful insights and build accurate models.
Here are some key considerations and approaches for data storage and management in data science:

1. Data Collection: Data collection involves acquiring and gathering relevant data from various sources such
as databases, APIs, files, or web scraping. The collected data should be stored in a structured and organized
manner to facilitate easy retrieval and analysis.

2. Data Formats: Choose appropriate data formats based on your requirements. Common file formats include
CSV (comma-separated values), JSON (JavaScript Object Notation), Parquet, and Avro; data can also be stored
directly in SQL or NoSQL databases. Consider the nature of your data, its size, and the tools you'll be using for
analysis when selecting the format.
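
As an illustration, here is a minimal sketch using pandas that reads a hypothetical sales.csv file and re-saves it
in other common formats; it assumes pandas is installed (and pyarrow or fastparquet for Parquet support):

import pandas as pd

# Load a small, hypothetical dataset from CSV (comma-separated values).
df = pd.read_csv("sales.csv")

# Re-save the same data in other common formats.
df.to_json("sales.json", orient="records")   # JSON: text-based, convenient for APIs
df.to_parquet("sales.parquet")               # Parquet: columnar and compressed (needs pyarrow or fastparquet)

# Reading the data back works much the same way regardless of format.
df_json = pd.read_json("sales.json", orient="records")
df_parquet = pd.read_parquet("sales.parquet")
print(df_parquet.head())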

3. Data Warehousing: Data warehousing involves storing large volumes of structured and pre-processed
data for analysis. Data warehouses are designed to support complex queries and provide fast access to data.
They often use techniques like indexing, partitioning, and data denormalization to optimize query
performance.

4. Relational Databases: Relational databases, such as MySQL, PostgreSQL, or Oracle, are commonly used
for structured data storage. They offer a robust data schema, ACID (Atomicity, Consistency, Isolation,
Durability) compliance, and support for complex queries. Relational databases are suitable for scenarios
where data relationships and integrity are critical.
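
To make this concrete, the following sketch uses Python's built-in sqlite3 module as a stand-in for a production
RDBMS such as MySQL or PostgreSQL; the database file, table, and rows are purely illustrative:

import sqlite3

# SQLite stands in here for a server-based RDBMS such as MySQL, PostgreSQL, or Oracle.
conn = sqlite3.connect("analytics.db")       # hypothetical database file
cur = conn.cursor()

# Define a structured schema and insert a couple of rows.
cur.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                [("Asha", "Pune"), ("Ravi", "Delhi")])
conn.commit()                                # ACID: changes become durable only on commit

# Run a standard SQL query against the structured data.
for row in cur.execute("SELECT name, city FROM customers WHERE city = ?", ("Pune",)):
    print(row)

conn.close()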

5. NoSQL Databases: NoSQL (Not Only SQL) databases, like MongoDB, Cassandra, or Redis, are designed
to handle large volumes of unstructured or semi-structured data. They provide flexible schemas, horizontal
scalability, and high availability. NoSQL databases are suitable for scenarios where agility and scalability
are important, but data relationships and complex queries are not a priority.
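
As an illustration, the sketch below uses the pymongo driver; it assumes a MongoDB server is running locally
on the default port, and the database and collection names are hypothetical:

from pymongo import MongoClient

# Assumes MongoDB is running locally and the pymongo driver is installed.
client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]                       # hypothetical database name
events = db["user_events"]                   # hypothetical collection name

# Documents are schema-flexible: each one can carry different fields.
events.insert_one({"user": "u101", "action": "login", "device": "mobile"})
events.insert_one({"user": "u102", "action": "purchase", "amount": 499.0})

# Query by field value; only documents containing the field can match.
for doc in events.find({"action": "purchase"}):
    print(doc)

client.close()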

6. Distributed File Systems: Distributed file systems, such as Hadoop Distributed File System (HDFS) or
Amazon S3, are used for storing and processing large-scale data. They offer fault tolerance, scalability, and
support for parallel processing. Distributed file systems are commonly used in big data analytics and
machine learning workflows.

7. Data Lakes: Data lakes are repositories that store large amounts of raw and unprocessed data. They allow
storing diverse data types and formats, providing a central location for data exploration and analysis. Data
lakes often utilize technologies like Hadoop, Apache Spark, or cloud-based services like Amazon S3 and
Azure Data Lake Storage.

8. Cloud Storage: Cloud storage services like Amazon S3, Google Cloud Storage, or Azure Blob Storage
offer scalable, durable, and cost-effective storage solutions for data science. They provide features like data
versioning, access control, and seamless integration with other cloud-based services.
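
For example, the following sketch uses the boto3 SDK for Amazon S3; it assumes AWS credentials are already
configured, and the bucket name and object keys are hypothetical:

import boto3

s3 = boto3.client("s3")
bucket = "my-data-science-bucket"            # hypothetical bucket name

# Upload a local file as an object, then download it again.
s3.upload_file("sales.parquet", bucket, "raw/sales.parquet")
s3.download_file(bucket, "raw/sales.parquet", "sales_copy.parquet")

# List the objects stored under a prefix.
response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])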

9. Data Governance: Establishing data governance practices ensures data quality, privacy, security, and
compliance with regulations. Implement data governance policies to manage data access, data retention, data
anonymization, and data lineage.

10. Metadata Management: Metadata, which provides information about the data, is crucial for data
discovery, understanding, and data lineage. Maintain metadata catalogs or use metadata management tools
to document data sources, attributes, relationships, and transformations.

11. Data Backup and Recovery: Implement regular data backup and recovery mechanisms to prevent data
loss or corruption. Backup strategies may involve techniques such as periodic snapshots, replication, or
incremental backups.

12. Data Cleaning and Pre-processing: Before analysis, data often requires cleaning and pre-processing to
handle missing values, outliers, or inconsistencies. Perform data cleaning steps while ensuring the original
data is preserved in case of mistakes.
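
A minimal pandas sketch of this idea, using a small hypothetical dataset, is shown below; the raw data frame is
kept untouched and all cleaning is done on a copy:

import numpy as np
import pandas as pd

# Hypothetical raw data with missing values and an implausible age.
raw = pd.DataFrame({
    "age":    [25, np.nan, 41, 39, 250],
    "income": [32000, 41000, np.nan, 52000, 48000],
})

# Work on a copy so the original data is preserved in case of mistakes.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

# Remove rows with implausible values using a simple domain rule.
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)]

print(raw.shape, clean.shape)   # the raw data is unchanged; the copy is cleaned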

13. Data Versioning: Maintain a versioning system to track changes and revisions to your datasets. Version
control allows you to keep track of different iterations and revert to previous versions if needed.
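
Dedicated tools exist for dataset version control, but the core idea can be sketched with a content hash that
identifies each revision of a file; the file and log names below are hypothetical:

import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(path):
    """Return a SHA-256 hash of the file's contents, used as a version identifier."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Append the current version of a hypothetical dataset file to a small JSON log.
entry = {
    "file": "sales.parquet",
    "sha256": dataset_fingerprint("sales.parquet"),
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}
with open("dataset_versions.jsonl", "a") as log:
    log.write(json.dumps(entry) + "\n")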

14. Data Security: Ensure data security by implementing appropriate access controls, encryption techniques,
and monitoring mechanisms to protect sensitive data. Adhere to data privacy regulations like GDPR
(General Data Protection Regulation) or CCPA (California Consumer Privacy Act).

15. Data Integration: Data integration involves combining data from multiple sources to create a unified
view. It can include tasks like data consolidation, data transformation, and data harmonization to ensure data
consistency and integrity.
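
A small pandas sketch of consolidation on a shared identifier follows; the two sources and their columns are
hypothetical:

import pandas as pd

# Two hypothetical sources: customer records from a CRM export and orders from a database.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meena"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [250.0, 120.0, 640.0]})

# Consolidate the sources on the shared identifier to build a unified view.
unified = customers.merge(orders, on="customer_id", how="left")

# Harmonize: a missing order amount means "no purchases", so fill with 0.
unified["amount"] = unified["amount"].fillna(0)
print(unified)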

16. Data Cataloging and Metadata Management: Implement data cataloging tools or metadata management
systems to organize and document datasets, data sources, and their characteristics. This helps with data
discovery, data lineage tracking, and promoting data reuse.

Effective data storage and management practices are vital for successful data science projects. Consider your
specific requirements, data types, volume, and the tools and technologies you'll be using to determine the
most suitable approaches for your data storage and management needs.

 Here are some general methods and services for data storage management:
o storage resource management software
o consolidation of systems
o multiprotocol storage arrays
o storage tiers
o strategic SSD (solid-state drive) deployment
o hybrid cloud
o archive storage of infrequently accessed data
o elimination of inactive virtual machines
o disaster recovery as a service
o object storage

Using Multiple Data Sources


 If you don't account for and carefully manage data sources during application design, there's a real risk
that the application will fail to meet performance, resilience and elasticity expectations.
 This risk is especially acute in analytics applications that draw from multiple data sources.
 However, there are five ways to address the problem of multiple data sources in application architecture:
o Know what data you need to combine
o Use data visualization
o Add data blending tools
o Create abstracted virtual database services and
o Determine where to host data sources.

1. Know what data you need to combine


 The first thing to understand is what you need to combine. When it comes to multiple data
sources, the right management tool depends on the storage formats involved and the goal you
have in mind for the data.
 For example, a relational database stores most business data and enables the application to
perform standard functions, such as queries that draw from multiple physical databases.
Whatever the data format, your queries need to align with the format of the data they address.
 In terms of mission, narrow the focus according to whether the application uses:
o real-time analytics inquiries that delve into a broad range of data;
o structured queries using a query language; or
o direct application access via APIs.
 The application's query approach determines whether you need to use a visualization tool, like
Google Data Studio, to specifically manage your relational database management system
(RDBMS) or program specific design patterns. For instance, if you have real-time analytics,
you're best off using something like Data Studio.
2. Use data visualization
 Software managers trying to unify multiple data sources should use a data visualization
dashboard to map out their strategy.
 Data visualization provides value when the user plans to interactively analyse and query
information.
 The approach helps architects get a clear view of their data, and it also helps them manage the
relationship between data sources.
 There are a variety of visualization tools available from vendors such as Google, IBM and
Oracle.
3. Turn to data blending tools
 Data blending tools are useful for analytics applications.
 These tools turn multiple data sources into a unified data source through a join clause that lets
you define multisource data relationships and reuse them as necessary.
 Data blending is an increasingly common feature in visualization tools.
4. Create virtual database services through abstraction
 Many companies use data studios for interactive analytics and then make discrete API calls for
the same sort of data when they write applications. Don't fall into this trap.
 Forcing an application to process too many database formats can cause performance to suffer.
 This practice can also create a scenario where data source correlation isn't consistent for every
user.
 Abstract database services facilitate application access to multiple data sources. Because these
services define -- and hide -- the complex way that information can connect across sources, they
encourage standardized information use and reduce development complexity (a minimal Python
sketch of this idea follows after this list).
 They also create a small number of services you can use to identify the specific data sources and
determine what users are doing with them. For compliance purposes, this data abstraction is a
critical function.
5. Decide where to host data sources
 Finally, consider where you will host data sources and whether the network connections are
sufficient.
 Data sources are almost abstract in nature, because you access them through a logical name
rather than a network or data centre address. Because the information's location is typically
hidden, it may not be obvious how accessing it from a specific data store will affect application
performance.
 Public cloud access to data sources is a prime example of the difficulties related to application
performance optimization.
 When cloud applications access data in the data centre or another cloud, traffic charges and
network transit delays can mount up.
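
As noted under point 4 above, here is a minimal, illustrative Python sketch of an abstracted data service:
application code asks one service object for customer data and never needs to know whether the answer came
from a CSV export, a SQLite database, or some other backend (the file, table, and column names are
hypothetical):

import sqlite3
import pandas as pd

class CustomerDataService:
    """A tiny abstraction layer: callers ask for customers, not for a specific backend."""

    def __init__(self, csv_path, db_path):
        self._csv_path = csv_path        # hypothetical CSV export
        self._db_path = db_path          # hypothetical SQLite database

    def _customers_from_file(self):
        return pd.read_csv(self._csv_path)

    def _customers_from_db(self):
        conn = sqlite3.connect(self._db_path)
        try:
            return pd.read_sql_query("SELECT id, name, city FROM customers", conn)
        finally:
            conn.close()

    def customers(self):
        # The service decides how sources are combined; callers never do.
        combined = pd.concat([self._customers_from_file(), self._customers_from_db()],
                             ignore_index=True)
        return combined.drop_duplicates()

# Application code depends only on the service interface, not on the underlying data sources.
service = CustomerDataService("customers.csv", "analytics.db")
all_customers = service.customers()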

Using multiple data sources in data science can significantly enrich your analysis and provide more
comprehensive insights. Incorporating diverse data sources allows you to leverage different perspectives,
validate findings, and uncover hidden patterns. Here are some key considerations for using multiple data
sources in data science:

1. Data Source Identification: Identify the relevant data sources that align with your analysis goals.
Consider both internal and external sources such as databases, APIs, publicly available datasets, text
corpora, sensor data, social media, or third-party data providers. Determine the suitability, accessibility, and
quality of each data source.

2. Data Integration: Integrate data from multiple sources into a unified dataset. This involves harmonizing
data formats, resolving inconsistencies, and aligning data structures. Data integration techniques may
include joining tables, merging datasets based on common identifiers, or using data fusion methods.

3. Data Quality Assessment: Assess the quality of each data source individually and collectively. Examine
factors such as data completeness, accuracy, consistency, and relevance. Identify any potential biases or
limitations associated with each source and consider their impact on your analysis.
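
The sketch below shows a few simple quality checks with pandas on a hypothetical source; a real assessment
would cover each source in turn and document the findings:

import pandas as pd

# Hypothetical source, e.g. loaded earlier with pd.read_csv("survey_responses.csv").
survey = pd.DataFrame({
    "respondent_id": [1, 2, 2, 4],
    "age": [34, None, None, 29],
    "country": ["IN", "IN", "IN", "US"],
})

# Completeness: share of missing values per column.
print(survey.isna().mean())

# Consistency: identifiers that should be unique but are duplicated.
print(survey["respondent_id"].duplicated().sum(), "duplicate ids")

# Validity: a simple range check on a numeric field.
print(((survey["age"] < 0) | (survey["age"] > 120)).sum(), "implausible ages")
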
4. Data Linkage: Establish connections or relationships between data points across different sources. This
involves identifying common attributes or unique identifiers that can be used to link and merge data from
different sources. Techniques like record linkage or entity resolution can be employed to match and
consolidate related records.
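
Specialized record-linkage libraries exist; purely as an illustration of the idea, the sketch below uses the
standard library's difflib to link near-identical names across two hypothetical sources:

from difflib import SequenceMatcher
import pandas as pd

# Two hypothetical sources that describe the same people with slightly different names.
crm = pd.DataFrame({"crm_id": [1, 2], "name": ["Aisha Khan", "R. Sharma"]})
billing = pd.DataFrame({"bill_id": ["A9", "B7"], "name": ["Aysha Khan", "Ravi Sharma"]})

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Naive pairwise comparison; real record linkage uses blocking to avoid comparing every pair.
links = []
for _, c in crm.iterrows():
    for _, b in billing.iterrows():
        score = similarity(c["name"], b["name"])
        if score >= 0.8:                     # threshold chosen for illustration only
            links.append((c["crm_id"], b["bill_id"], round(score, 2)))

print(links)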

5. Data Transformation: Transform the data from various sources into a common representation that
facilitates analysis. Standardize variables, units of measurement, or categorical values to ensure consistency.
Normalize or scale the data if necessary for fair comparisons.
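
For instance, the pandas sketch below standardizes a measurement to a common unit and then scales it; the
sources and column names are hypothetical:

import pandas as pd

# Hypothetical: one source reports height in centimetres, another in inches.
source_a = pd.DataFrame({"person": ["p1", "p2"], "height_cm": [172.0, 165.0]})
source_b = pd.DataFrame({"person": ["p3", "p4"], "height_in": [70.0, 64.0]})

# Standardize to a common unit before combining the sources.
source_b["height_cm"] = source_b["height_in"] * 2.54
combined = pd.concat([source_a[["person", "height_cm"]],
                      source_b[["person", "height_cm"]]], ignore_index=True)

# Scale to zero mean and unit variance for fair comparison with other variables.
combined["height_z"] = (combined["height_cm"] - combined["height_cm"].mean()) / combined["height_cm"].std()
print(combined)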

6. Data Pre-processing: Perform pre-processing steps on each data source individually before integration.
This includes handling missing values, outlier detection, feature engineering, and other data cleaning
techniques. The pre-processing steps may differ based on the characteristics of each data source.

7. Data Governance and Ethics: Ensure compliance with data governance policies and ethical
considerations when integrating and using multiple data sources. Respect privacy regulations, confidentiality
agreements, and intellectual property rights. Anonymize or aggregate data when necessary to protect
individuals' identities or sensitive information.

8. Cross-Validation and Validation: Use multiple data sources to validate and cross-check findings.
Compare results obtained from different sources to assess the consistency and robustness of your analysis.
Cross-validation can help mitigate biases or uncertainties associated with individual sources.

9. Dimensionality and Complexity Management: Analyzing multiple data sources can introduce increased
dimensionality and complexity. Consider dimensionality reduction techniques like principal component
analysis (PCA) or feature selection methods to focus on the most informative variables. Apply appropriate
visualization techniques to comprehend and communicate complex relationships.
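
As a brief illustration, the sketch below applies scikit-learn's PCA to a synthetic numeric matrix standing in for
an integrated, high-dimensional dataset:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic feature matrix standing in for data integrated from several sources.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
X[:, 5] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)   # introduce some redundancy

# Project onto the directions that explain most of the variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, 3)
print(pca.explained_variance_ratio_)         # share of variance captured per component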

10. Data Fusion and Enrichment: Combine data from different sources to create new variables or derived
features that provide additional insights. This may involve statistical aggregation, enrichment with external
data, or applying machine learning algorithms to create composite variables.

11. Real-Time Data Integration: In some cases, real-time or streaming data from multiple sources can be
valuable. Implement data pipelines or streaming frameworks to capture, process, and integrate data in real-
time. This enables timely analysis and decision-making.

12. Iterative Analysis: Analyzing multiple data sources is often an iterative process. As you uncover
insights, validate findings, or encounter challenges, refine your approach, and explore additional data
sources if needed.

Integrating and analyzing multiple data sources in data science requires careful consideration of data quality,
integration techniques, and ethical considerations. By leveraging diverse sources, you can gain a more
comprehensive understanding of the problem domain and drive more accurate and informed decision-
making.
