
6001_DATASCIENCE WITH BIGDATA


UNIT-1

Data Science in a big data world:


Benefits and uses of data science and big data:
 Data science and big data are rapidly growing fields that offer a wide
range of benefits and uses across various industries. Some of the benefits
and uses of data science and big data are:
1. Improved decision-making: Data science and big data help
organizations make better decisions by analyzing and interpreting
large amounts of data. Data scientists can identify patterns,
trends, and insights that can be used to make informed decisions.
2. Increased efficiency: Data science and big data can help
organizations automate tasks, streamline processes, and optimize
operations. This can result in significant time and cost savings.
3. Personalization: With data science and big data, organizations
can personalize their products and services to meet the specific
needs and preferences of individual customers. This can lead to
increased customer satisfaction and loyalty.
4. Predictive analytics: Data science and big data can be used to
build predictive models that can forecast future trends and
behavior. This can be useful for businesses that need to anticipate
customer needs, market trends, or supply chain disruptions.
5. Fraud detection: Data science and big data can be used to detect
fraud and other types of financial crimes. By analyzing patterns
in financial data, data scientists can identify suspicious behavior
and prevent fraud.
6. Healthcare: Data science and big data can be used to improve
patient outcomes by analyzing large amounts of medical data.
This can lead to better diagnosis, treatment, and prevention of
diseases.
7. Marketing: Data science and big data can be used to improve
marketing strategies by analyzing consumer behavior and
preferences. This can help businesses target their marketing
campaigns more effectively and generate more leads and sales.
Facets of data:
Data can be characterized by several facets, including:
1. Volume: Refers to the amount of data that is generated and collected.
With the increasing prevalence of sensors, mobile devices, and social
media, data volumes are growing exponentially.
2. Velocity: Refers to the speed at which data is generated and processed.
Real-time data processing has become critical for many applications, such
as fraud detection and predictive maintenance.
3. Variety: Refers to the diversity of data sources and formats. Data can
come from structured sources such as databases, semi-structured sources
such as XML, or unstructured sources such as social media posts or
emails.
4. Veracity: Refers to the quality and accuracy of the data. Data can be
affected by errors, biases, and inconsistencies, which can impact the
results of data analysis.
5. Value: Refers to the usefulness and relevance of the data. Data must
provide meaningful insights or solve real-world problems to create value
for organizations.
6. Variability: Refers to the fluctuations and changes that occur in data
over time. For example, data may have seasonal patterns or show
different trends depending on the region or market.
7. Visualization: Refers to the ability to represent data in a way that is easy
to understand and analyze. Data visualization tools can help analysts and
decision-makers identify patterns and trends quickly.
8. Validity: Refers to the extent to which data measures what it is intended
to measure. Valid data is essential for making informed decisions based
on accurate insights.

The data science process:


The data science process typically involves the following steps:
1. Define the problem: The first step in the data science process is to define
the problem that you want to solve. This involves identifying the business
or research question that you want to answer and determining what data
you need to collect.
2. Collect and clean the data: Once you have identified the data that you
need, you will need to collect and clean the data to ensure that it is
accurate and complete. This involves checking for errors, missing values,
and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the
data, the next step is to explore and visualize the data. This involves
creating summary statistics, visualizations, and other descriptive analyses
to better understand the data.
4. Prepare the data: Once you have explored the data, you will need to
prepare the data for analysis. This involves transforming and
manipulating the data, creating new variables, and selecting relevant
features.
5. Build the model: With the data prepared, the next step is to build a model
that can answer the business or research question that you identified in
step one. This involves selecting an appropriate algorithm, training the
model, and evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to
evaluate its performance to ensure that it is accurate and effective. This
involves using metrics such as accuracy, precision, recall, and F1 score to
assess the model's performance.
7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating
the model into an application or workflow and ensuring that it can handle
real-world data and user inputs.
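
To make these steps concrete, here is a minimal sketch of steps 2 through 6 in Python using pandas and scikit-learn. The file name customers.csv and its columns are hypothetical placeholders, not a real dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Collect and clean the data (hypothetical file and columns)
df = pd.read_csv("customers.csv")
df = df.drop_duplicates().dropna()

# Prepare the data: select features and the target variable
X = df[["age", "monthly_spend", "tenure_months"]]
y = df["churned"]

# Build the model on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model on held-out data
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print("F1 score:", f1_score(y_test, predictions))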

The big data ecosystem and data science:


 The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
 The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
 Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to
store and manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark,
Apache Flink, and Apache Storm are used to process and analyze large
volumes of data in parallel across distributed computing clusters.
3. Querying: Big data querying systems such as Apache Hive, Apache Pig,
and Apache Drill are used to extract and transform data stored in big data
storage systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and
dashboards that enable data scientists and business analysts to explore
and understand data.
5. Machine learning: Big data machine learning platforms such as Apache
Mahout, TensorFlow, and Microsoft Azure Machine Learning are used to
build and deploy machine learning models at scale.
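
As a small illustration of the processing layer, the sketch below uses Apache Spark's Python API (PySpark) to load a file and aggregate it in parallel. The file path and column name are hypothetical placeholders; on a real cluster the path would typically point at HDFS or S3.

from pyspark.sql import SparkSession

# Start a Spark session (local here, but the same code runs on a cluster)
spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# Read a (hypothetical) CSV file into a distributed DataFrame
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Aggregate in parallel across the workers: count events per category
events.groupBy("category").count().orderBy("count", ascending=False).show()

spark.stop()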

Overview of the data science process:

The data science process can be summarized into a series of steps that are
typically followed in order to extract insights and knowledge from data. These
steps are as follows:

1. Problem definition: In this step, the problem that needs to be solved is clearly defined. This involves identifying the goals, scope, and objectives of the project, as well as any constraints and assumptions that need to be considered.

2. Data collection: This step involves gathering the necessary data from
various sources. This may include internal data sources, such as databases
and spreadsheets, as well as external sources, such as public data sets and
web scraping.

3. Data preparation: Once the data has been collected, it needs to be cleaned, preprocessed, and transformed into a format that can be used for analysis. This may involve tasks such as data cleaning, data wrangling, and data normalization.

4. Data exploration and visualization: This step involves exploring and visualizing the data to gain a better understanding of its properties and characteristics. This may include tasks such as data visualization, summary statistics, and correlation analysis.

5. Data modeling: In this step, mathematical and statistical models are developed to analyze the data and make predictions. This may include tasks such as regression analysis, classification, clustering, and time series analysis.

6. Model evaluation: Once the models have been developed, they need to
be evaluated to determine their accuracy and effectiveness. This may
involve tasks such as cross-validation, model selection, and hypothesis
testing.

7. Deployment: Finally, the insights and knowledge gained from the data
analysis are deployed in the form of reports, dashboards, and other
visualizations that can be used to inform decision-making and drive
business value.

Defining research goals and creating a project charter:

Defining research goals and creating a project charter are important initial
steps in any data science project, as they set the stage for the entire project and
help ensure that it stays focused and on track.

Here are some key considerations for defining research goals and creating
a project charter in data science:

Identify the problem or question you want to answer: What is the business
problem or research question that you are trying to solve? It's important to
clearly define the problem or question at the outset of the project, so that
everyone involved is on the same page and working towards the same goal.

 Define the scope of the project: Once you have identified the problem or
question, you need to define the scope of the project. This includes
specifying the data sources you will be using, the variables you will be
analyzing, and the timeframe for the project.
 Determine the project objectives: What do you hope to achieve with the
project? What are your key performance indicators (KPIs)? This will help
you measure the success of the project and determine whether you have
achieved your goals.

 Identify the stakeholders: Who are the key stakeholders in the project?
This could include business leaders, data analysts, data scientists, and
other team members. It's important to identify all the stakeholders upfront
so that everyone is aware of their role in the project and can work
together effectively.

 Create a project charter: The project charter is a document that summarizes the key information about the project, including the problem or question, the scope of the project, the objectives, the stakeholders, and any constraints or risks. It's a critical document that helps ensure everyone involved in the project is on the same page and understands what is expected of them.

Retrieving data:

Retrieving data is an essential step in the data science process as it provides the
raw material needed to analyze and derive insights. There are various ways to
retrieve data, and the methods used depend on the type of data and where it is
stored.

Here are some common methods for retrieving data in data science:

 File import: Data can be retrieved from files in various formats, such as
CSV, Excel, JSON, or XML. This is a common method used to retrieve
data that is stored locally.

 Web scraping: Web scraping involves using scripts to extract data from
websites. This is a useful method for retrieving data that is not readily
available in a structured format.

 APIs: Many applications and services provide APIs (Application Programming Interfaces) that allow data to be retrieved programmatically. APIs can be used to retrieve data from social media platforms, weather services, financial data providers, and many other sources.
 Databases: Data is often stored in databases, and SQL (Structured Query
Language) can be used to retrieve data from databases. Non-relational
databases such as MongoDB or Cassandra are also popular for storing
and retrieving data.

 Big Data platforms: When dealing with large amounts of data, big data
platforms such as Hadoop, Spark, or NoSQL databases can be used to
retrieve data efficiently.
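
The sketch below shows three of these retrieval methods in Python, using pandas, requests, and the built-in sqlite3 module. The file names, URL, and table are hypothetical placeholders.

import sqlite3
import pandas as pd
import requests

# File import: read a local CSV file into a DataFrame
sales = pd.read_csv("sales.csv")

# API: request JSON data from a (hypothetical) web service
response = requests.get("https://api.example.com/v1/orders", params={"year": 2023})
orders = pd.DataFrame(response.json())

# Database: run a SQL query against a local SQLite database
conn = sqlite3.connect("warehouse.db")
customers = pd.read_sql_query("SELECT id, name, country FROM customers", conn)
conn.close()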

Cleansing, integrating and transforming data

Cleansing, integrating, and transforming data are essential steps in the data
preparation process in data science. These steps are necessary to ensure that the
data is accurate, consistent, and usable for analysis. Here's an overview of each
step:

 Data Cleansing: This step involves identifying and correcting or removing any errors, inconsistencies, or missing values in the data. Some common techniques used for data cleansing include removing duplicates, filling in missing values, correcting spelling errors, and dealing with outliers.

 Data Integration: In many cases, data comes from multiple sources, and
data integration is needed to combine the data into a single dataset. This
can involve matching and merging datasets based on common fields or
keys, and handling any discrepancies or inconsistencies between the
datasets.

 Data Transformation: Data transformation involves converting the data into a format that is more suitable for analysis. This can involve converting categorical variables into numerical variables, scaling or normalizing data, and creating new variables or features from existing data.
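
A minimal pandas sketch of the three steps, assuming two hypothetical files that share a customer_id key; the column names are illustrative only.

import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical order data
customers = pd.read_csv("customers.csv")  # hypothetical customer data

# Cleansing: remove duplicates and fill missing numeric values
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Integration: merge the two sources on a common key
data = orders.merge(customers, on="customer_id", how="left")

# Transformation: encode a categorical column and scale a numeric one
data = pd.get_dummies(data, columns=["region"])
data["amount_scaled"] = (data["amount"] - data["amount"].mean()) / data["amount"].std()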

Exploratory data analysis:

Exploratory data analysis (EDA) is the process of analyzing and summarizing data sets in order to gain insights and identify patterns.

The main goal of EDA is to understand the data, rather than to test a
particular hypothesis. The process typically involves visualizing the data
using graphs, charts, and tables, as well as calculating summary statistics
such as mean, median, and standard deviation.
Some common techniques used in EDA include:

 Descriptive statistics: This involves calculating summary statistics such as mean, median, mode, standard deviation, and range.

 Data visualization: This involves creating graphs, charts, and other visual representations of the data, such as histograms, scatter plots, and box plots.

 Data transformation: This involves transforming the data to make it easier to analyze, such as normalizing or standardizing the data, or log transforming skewed data.

 Outlier detection: This involves identifying and analyzing data points that are significantly different from the other data points.

 Correlation analysis: This involves examining the relationship between different variables in the data set, such as calculating correlation coefficients or creating correlation matrices.

Overall, EDA is an important step in any data analysis project, as it helps to identify any patterns, outliers, or other trends in the data that may be relevant to the analysis. It also helps to ensure that the data is clean, complete, and ready for further analysis.
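
The following sketch shows how several of these EDA techniques look in Python with pandas and matplotlib, assuming a hypothetical dataset with a numeric column named value.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical dataset

# Descriptive statistics for every numeric column
print(df.describe())

# Correlation analysis between numeric variables
print(df.select_dtypes(include="number").corr())

# Data visualization: histogram and box plot of one column
df["value"].hist(bins=30)
plt.show()
df.boxplot(column="value")
plt.show()

# Outlier detection: flag points more than 3 standard deviations from the mean
z = (df["value"] - df["value"].mean()) / df["value"].std()
print(df[z.abs() > 3])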

UNIT – 2

What is machine learning and why should you care about it:

Machine learning is the process of using algorithms to analyze data in order to detect
patterns and make predictions. Machine learning has become increasingly important in recent
years due to the vast amounts of data being generated by companies, organizations, and
individuals. By leveraging machine learning, companies can gain insights on customer
behavior, purchase patterns, and even predict future trends and behaviors. This enables them
to make better decisions, optimize processes, and improve customer experience. As a result,
companies that use machine learning effectively can gain a significant competitive advantage
over those that don't.
The modelling process:
The modeling process in data science is an iterative process that involves the following
steps:
1. Define the Problem: The first step of the modeling process is to define the problem that
needs to be solved. This involves understanding the context of the problem and the data that
is available.
2. Data Collection: The next step is to collect the data that is necessary to solve the problem.
This includes collecting data from sources such as databases, web APIs, and text files.
3. Data Preparation: After the data has been collected, it must be prepared for use in the
modeling process. This includes cleaning the data, filling in missing values, transforming the
data, and creating features.
4. Model Training: Once the data is ready, the model can be trained. This involves selecting
the appropriate algorithms, tuning their parameters, and training them on the data.
5. Model Evaluation: After the model has been trained, it must be evaluated to determine its
performance. This includes measuring the accuracy of the model and assessing its ability to
generalize.
6. Model Deployment: Finally, the trained model can be deployed in a production
environment. This involves integrating the model into a system and ensuring it is running
optimally.
Types of machine learning:
1. Supervised Learning: This type of machine learning involves training a model on a
labeled dataset, where the model is taught to predict the output for a given input.
Examples include classification and regression.

2. Unsupervised Learning: This type of machine learning involves training a model on an unlabeled dataset, where the model is taught to find hidden patterns and insights from the data without any external guidance. Examples include clustering and dimensionality reduction.
3. Reinforcement Learning: This type of machine learning involves training a model to
take certain actions in an environment in order to maximize a reward. Examples
include playing games and robot navigation.

4. Transfer Learning: This type of machine learning involves using knowledge gained from one task to improve performance on another task. Examples include using pre-trained networks for image recognition and natural language processing.

Semi-supervised learning in data science:


 Semi-supervised learning is an area of machine learning that deals with
training models using both labeled and unlabeled data.

 It is an approach used when labeled data is scarce and expensive to obtain. Semi-supervised learning algorithms use both labeled and unlabeled data to improve the accuracy of a model.

 The unlabeled data provides additional information that helps to improve the
generalization of the model.

 Semi-supervised learning techniques can be used in a variety of applications in data science, including natural language processing, computer vision, and bioinformatics.
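
As a brief sketch, scikit-learn's semi-supervised module can fit a model on a mix of labeled and unlabeled samples, with unlabeled targets marked as -1; the toy data below is generated purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Toy dataset of 200 points; pretend only 20 of them are labeled
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y_partial = y.copy()
rng = np.random.RandomState(0)
unlabeled = rng.choice(len(y), size=180, replace=False)
y_partial[unlabeled] = -1  # -1 marks unlabeled samples

# Fit using both the labeled and the unlabeled points
model = LabelPropagation()
model.fit(X, y_partial)

# Check how well the held-back labels were recovered
print("accuracy on originally unlabeled points:",
      (model.predict(X[unlabeled]) == y[unlabeled]).mean())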
Handling large data on a single computer:
Large data sets can be difficult to analyze on a single computer. To make it easier, there
are a few things you can do:

1. Use parallel computing: Parallel computing is a technique that allows you to split
up a large data set into smaller chunks and run them simultaneously on multiple
computers or cores. This can greatly reduce the amount of time it takes to analyze
the data.

2. Use cloud computing: Cloud computing allows you to store large data sets in the
cloud and analyze them using virtual machines. This eliminates the need to have
powerful hardware in-house, and can significantly reduce the cost of data analysis.

3. Use distributed computing: Distributed computing is a technique that allows you to spread large data sets across multiple computers and analyze them in parallel. This can significantly reduce the amount of time needed to analyze the data.

4. Use data compression: Data compression can reduce the size of large data sets,
making them easier to store and analyze on a single computer.

5. Use data visualization: Data visualization can help you get a better understanding
of your data, and can make it easier to analyze large data sets on a single
computer.
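
As a simple example of working within a single machine's memory, pandas can stream a large CSV file in chunks instead of loading it all at once; the file name and column are placeholders.

import pandas as pd

total = 0.0
rows = 0

# Process the (hypothetical) large file 100,000 rows at a time
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    rows += len(chunk)

print("rows processed:", rows)
print("overall mean amount:", total / rows)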
The problems you face when handling large data:
 Data Storage: Storing large data sets can be challenging due to the amount of
space and resources required. Data must be structured and organized to be
useful and efficient.

 Data Cleaning: Large data sets often contain missing values, outliers, and
incorrect data types, making it difficult to get an accurate picture of the data.
Data cleaning is essential to ensure the accuracy of any analysis.

 Data Analysis: Analyzing large data sets can be complex and time consuming. Specialized techniques, including machine learning, may be required to process, visualize, and interpret the data and to gain meaningful insights from it.

 Security: Large data sets can contain sensitive information, making it important to maintain security and privacy. Appropriate measures must be taken to protect the data from unauthorized access.

 Computing Power: Large data sets require large amounts of computing power to process and analyze. This can be expensive and difficult to access.


General techniques for handling large volumes of data:


1. Use Distributed Computing: Distributed computing involves breaking down large
tasks into smaller parts and distributing them to different machines to be processed in
parallel. This can greatly improve the speed and efficiency of data processing, and is
particularly helpful when dealing with large volumes of data.
2. Use a Database: Using a database to store and manage large volumes of data is a
great way to ensure data integrity and scalability. Most databases have built-in features to
help with querying, sorting, and filtering data, which can help make data analysis easier and
more efficient.
3. Use Streaming Data: Streaming data is a type of data that is delivered in near real-time. This can be very helpful in dealing with large volumes of data, since it allows for processing to occur as soon as the data is received, rather than waiting for the entire dataset to be collected before beginning analysis.
4. Compress Data: Compression is a great way to reduce the size of large datasets,
which can help reduce the amount of time needed for processing. Compression algorithms
can also help reduce the amount of storage space needed to store large amounts of data.
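
As a quick sketch of the compression technique, pandas can write and read gzip-compressed files directly; the file names here are illustrative.

import pandas as pd

df = pd.read_csv("large_dataset.csv")  # hypothetical uncompressed file

# Write a gzip-compressed copy; text data often shrinks substantially
df.to_csv("large_dataset.csv.gz", index=False, compression="gzip")

# Reading it back works the same way; decompression is handled automatically
df2 = pd.read_csv("large_dataset.csv.gz")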

General programming tips for dealing with large datasets:


1. Keep your data organized and structured. Use a database or spreadsheet program to
store, track and maintain your data.
2. Break up large datasets into smaller, more manageable chunks. This will help you
more easily find and access specific data points.
3. Utilize tools such as parallel computing, machine learning and data mining to help
analyze and process large datasets.
4. Make use of specialized software that is designed to handle large datasets.
5. Take advantage of cloud computing to store and manage large datasets.
6. Use data visualization tools to help you make sense of large datasets.
7. Utilize tools such as Apache Spark and Hadoop to help with processing large
datasets.
8. Regularly backup your data to protect against data loss.
9. Consider using data compression to reduce the size of datasets and make them
easier to store and manage.
10. Employ security measures to protect your data from unauthorized access.

UNIT – 3

Distributing data storage and processing with frameworks:


Distributing data storage and processing with frameworks involves using a
framework such as Apache Spark or Hadoop to process large amounts of data across multiple nodes.
A framework allows for the data to be efficiently stored and processed in a distributed manner,
allowing for faster and more efficient processing. This is often used for large-scale data analysis,
machine learning, and other tasks that require complex data processing. By using distributed data
storage and processing, companies can reduce costs and improve efficiency.

Join the NoSQL movement:


Joining the NoSQL movement in data science is a great way to get involved in the
data science community. NoSQL refers to a non-relational type of database that is designed to store
and retrieve data in a way that is more flexible and scalable than traditional relational databases.
NoSQL databases are often used for handling big data, since they can store and process large
amounts of data quickly and efficiently. NoSQL databases can also be used to store unstructured
data, making them an ideal choice for web applications and data science projects. To join the NoSQL
movement, consider taking courses in NoSQL technology, attending conferences, and networking
with other professionals in the field.

NoSQL in data science:


NoSQL databases play an important role in data science because they can store and retrieve large volumes of semi-structured and unstructured data that do not fit neatly into relational tables. Document stores such as MongoDB, wide-column stores such as Cassandra, and graph databases such as Neo4j are commonly used alongside traditional relational databases. In a typical workflow, data is retrieved from these stores and then analyzed with programming languages such as Python or R, together with their statistical and machine learning libraries.
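
A minimal sketch of working with a NoSQL document store from Python using the pymongo driver; the connection string, database, and collection are hypothetical.

from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance
client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]

# Insert a flexible, schema-less document
db.reviews.insert_one({"product": "laptop", "rating": 5, "text": "Great battery life"})

# Query documents and pull them into Python for analysis
high_ratings = list(db.reviews.find({"rating": {"$gte": 4}}))
print(len(high_ratings), "reviews with rating >= 4")

client.close()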

UNIT – 4
The rise of graph databases:
 In recent years, graph databases have become increasingly popular in data
science due to their ability to efficiently store and analyze complex and
interconnected data.
 Graph databases are a type of NoSQL database that uses graph theory to
represent and store data, where nodes represent entities and edges
represent the relationships between them.
 One of the key advantages of graph databases is their ability to easily
model and query highly connected data, such as social networks,
recommendation engines, and knowledge graphs. They can also be used to
perform real-time analysis and graph-based algorithms, such as centrality,
clustering, and pathfinding.
 Graph databases are particularly useful for data scientists who work with
complex and interconnected data, as they provide a more natural and
intuitive way to represent and query this type of data.
 They can also help data scientists to identify patterns and relationships in
their data that might not be immediately apparent using traditional
relational databases.
 Some popular graph databases include Neo4j, JanusGraph, and Amazon
Neptune.
 These databases are often used in combination with other tools and
technologies such as Python, R, and machine learning libraries to build
powerful data science applications.
 Overall, the rise of graph databases in data science has opened up new
possibilities for analyzing and understanding complex and interconnected
data, and is likely to continue to play an important role in the field of data
science in the coming years.

Introducing connected data and graph databases:


Connected data:
 Connected data refers to data that is inherently interconnected, where
relationships between different entities are just as important as the entities
themselves.
 This type of data is often found in social networks, recommendation
engines, knowledge graphs, and other domains where understanding the
relationships between entities is key to making sense of the data.
Graph database:
 Graph databases are a type of NoSQL database that are designed to store
and manage connected data.
 They use a graph model to represent the data, where nodes represent
entities and edges represent the relationships between them.
 Graph databases are particularly well-suited for managing complex and
highly interconnected data, and can be used to perform complex graph-
based algorithms and real-time analysis.
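
A small sketch of creating and querying connected data with the Neo4j Python driver and the Cypher query language; the connection details, credentials, and example nodes are hypothetical.

from neo4j import GraphDatabase

# Connect to a (hypothetical) local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes represent entities; the edge represents their relationship
    session.run(
        "CREATE (a:Person {name: $a})-[:FRIENDS_WITH]->(b:Person {name: $b})",
        a="Alice", b="Bob",
    )

    # Query the relationship back out
    result = session.run(
        "MATCH (p:Person)-[:FRIENDS_WITH]->(f:Person) "
        "RETURN p.name AS person, f.name AS friend"
    )
    for record in result:
        print(record["person"], "is friends with", record["friend"])

driver.close()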
Text mining and text analytics:
Text mining in the real world:
 Text mining, also known as text analytics, is a process of analyzing and
extracting valuable insights from unstructured textual data.
 It has become an essential component of data science, as a large amount
of data is generated every day in the form of text, such as emails, social
media posts, customer feedback, and news articles.
 In the real world, text mining is used in various industries, including
finance, healthcare, marketing, and customer service.
 Here are a few examples of how text mining is applied in different
domains:

1. Finance: Text mining is used to analyze financial news and reports to identify trends and predict market movements. Sentiment analysis is used to analyze social media data and identify public opinion on financial products and services.
2. Healthcare: Text mining is used to analyze clinical data, patient
feedback, and electronic medical records to identify patterns and gain
insights into patient care. It is also used for disease surveillance and drug
discovery.
3. Marketing: Text mining is used to analyze customer feedback, social
media data, and online reviews to identify customer preferences,
sentiment, and behavior. This information is used to improve marketing
strategies and customer experience.
4. Customer service: Text mining is used to analyze customer support
chats and emails to identify common issues and improve the customer
service experience. It can also be used to identify patterns and predict
customer behavior.
Text mining techniques:
 Text mining is the process of analyzing large volumes of unstructured
text data to extract useful insights and patterns.
 It is an important component of data science because it enables us to
extract insights from text data that would otherwise be difficult to obtain.
 There are several text mining techniques used in data science, including:

1. Text Preprocessing: This involves cleaning and preparing the text data for analysis. Text preprocessing techniques include tokenization, stemming, stop-word removal, and part-of-speech tagging.
2. Sentiment Analysis: This technique involves analyzing the
sentiment or emotion expressed in the text data. It can be used to
analyze social media posts, customer reviews, and other types of text
data.
3. Named Entity Recognition: This technique involves identifying and
extracting named entities such as people, places, organizations, and
dates from text data. It is commonly used in natural language
processing and information retrieval.
4. Topic Modeling: This technique involves identifying topics that are
discussed in a large corpus of text data. It can be used to cluster
similar documents, summarize text data, and extract key themes.
5. Text Classification: This technique involves categorizing text data
into predefined categories or classes. It can be used for tasks such as
spam filtering, sentiment analysis, and topic categorization.
6. Text Summarization: This technique involves generating a
summary of a large text document. It can be used to extract key
information from long documents and to create summaries for news
articles and other types of content.
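
The sketch below combines two of these techniques, preprocessing/vectorization and text classification, using scikit-learn; the tiny labeled dataset is invented purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 1 = positive sentiment, 0 = negative
texts = [
    "I love this product, it works great",
    "Excellent service and fast delivery",
    "Terrible quality, broke after one day",
    "Very disappointed, would not recommend",
]
labels = [1, 1, 0, 0]

# TfidfVectorizer handles tokenization, lowercasing and stop-word removal;
# MultinomialNB is a common baseline classifier for text
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# Classify new, unseen pieces of text
print(model.predict(["fast delivery and great quality",
                     "broke after a day, very disappointed"]))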
Data visualization to the end user:
Data visualization is the process of presenting data in a visual format, such as charts, graphs,
and diagrams, to help people better understand the information and draw insights from it.
When it comes to presenting data visualization to end-users, there are several important
considerations to keep in mind:

Know your audience: The first step is to understand who your end-users are and what their
needs are. Different people have different levels of knowledge and experience with data, so
you need to tailor your visualizations to their level of expertise.

Choose the right visualization: There are many different types of visualizations, each suited
to different types of data and insights. Choose the one that best suits the data you are
presenting and the insights you want to convey.

Keep it simple: Don't overwhelm your audience with too much data or too many visual
elements. Keep your visualizations simple and easy to understand.

Use color wisely: Color can be a powerful tool in data visualization, but it can also be
distracting or misleading if not used correctly. Use color sparingly and with purpose.

Provide context: Make sure to provide context for your data, such as comparing it to
historical data or industry benchmarks. This will help your audience understand the
significance of the data you are presenting.

Make it interactive: Interactive visualizations can be more engaging and allow users to
explore the data in more depth. Consider using tools like sliders, filters, or hover-over effects
to make your visualizations more interactive.

Test and iterate: Finally, it's important to test your visualizations with your end-users and
iterate based on their feedback. This will help you create more effective visualizations that
meet their needs and help them draw meaningful insights from the data.
Data visualization options:
There are many options for data visualization, depending on the type of data you are working
with, the story you want to tell, and the audience you want to reach. Here are some common
data visualization options:

Bar Charts: These are useful for comparing different categories or groups of data, such as
sales figures for different products or the performance of different teams.

Line Charts: These are used to show trends over time, such as stock prices or website traffic.

Pie Charts: These are useful for showing proportions or percentages of a whole, such as the
market share of different companies.

Scatter Plots: These are used to show the relationship between two variables, such as the
correlation between temperature and ice cream sales.

Heat Maps: These are used to show the density or distribution of data across a geographic
region or other space.

Tree Maps: These are useful for showing hierarchical data, such as the breakdown of a
company's budget by department.

Network Diagrams: These are used to show connections between nodes, such as social
network connections or organizational charts.

Word Clouds: These are useful for showing the frequency of words or concepts in a text,
such as a survey response or social media analysis.

Infographics: These combine multiple data visualizations and other design elements to tell a
story or convey information in a visually appealing way.

There are many other data visualization options as well, and the choice of which one to use
depends on the specific data and the story you want to tell.

Cross filtering:
Cross filtering is a technique used in data analysis to explore the relationships between
different variables in a dataset. In cross filtering, the user selects one or more values for a
variable, and the other variables in the dataset are filtered based on those selected values.

For example, imagine you have a dataset that includes information about customer purchases,
including the customer's age, gender, location, and purchase amount. Using cross filtering,
you could select a specific age range, and the dataset would be filtered to only show
purchases made by customers within that age range. You could then further refine the results
by selecting a specific location, or by filtering by gender.

Cross filtering can help identify patterns and trends in data, and can be useful in business,
marketing, and scientific research applications. It is often used in data visualization tools to
enable interactive exploration of data.
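
The idea can be sketched in a few lines of pandas on an invented purchases table: filtering on one variable immediately narrows what you see for the other variables.

import pandas as pd

purchases = pd.DataFrame({
    "age":      [22, 35, 41, 29, 52, 38],
    "gender":   ["F", "M", "F", "M", "F", "M"],
    "location": ["NY", "LA", "NY", "SF", "LA", "NY"],
    "amount":   [120, 80, 200, 60, 150, 90],
})

# First filter: select an age range
selected = purchases[purchases["age"].between(30, 45)]

# The other variables now reflect only that selection
print(selected["location"].value_counts())
print(selected.groupby("gender")["amount"].sum())

# Refine further by location
print(selected[selected["location"] == "NY"])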

The JavaScript MapReduce library:


JavaScript itself does not ship with a distributed MapReduce framework, but its arrays provide built-in map and reduce methods, and third-party libraries offer MapReduce-style functionality.

Packages with MapReduce-style helpers are published on npm (for example, a package named "map-reduce" is available at https://www.npmjs.com/package/map-reduce), but the same map/reduce pattern can be expressed directly with the built-in Array.prototype.map and Array.prototype.reduce methods.

Here's an example of using the built-in map and reduce methods to count the number of occurrences of each word in a list of sentences:

const sentences = [
  'The quick brown fox jumps over the lazy dog.',
  'She sells seashells by the seashore.',
  'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
];

const wordCounts = sentences
  // Map step: turn each sentence into an array of lowercase words
  .map(sentence => sentence.toLowerCase().match(/[a-z]+/g) || [])
  // Reduce step: fold the word arrays into a single object of counts
  .reduce((acc, words) => {
    words.forEach(word => {
      acc[word] = (acc[word] || 0) + 1;
    });
    return acc;
  }, {});

console.log(wordCounts);
// Output:
// {
// the: 3,
// quick: 1,
// brown: 1,
// fox: 1,
// jumps: 1,
// over: 1,
// lazy: 1,
// dog: 1,
// she: 1,
// sells: 1,
// seashells: 1,
// by: 1,
// seashore: 1,
// how: 1,
// much: 1,
// wood: 2,
// would: 1,
// a: 2,
// woodchuck: 2,
// chuck: 2,
// if: 1,
// could: 1
// }
In this example, the map step converts each sentence to lowercase and splits it into an array of words, and the reduce step counts the number of occurrences of each word, starting from an empty object as the initial value.

Note that this is a single-machine illustration of the map/reduce pattern. There are many other use cases and variations of MapReduce that can be implemented in JavaScript, either directly or with third-party libraries.

Creating an interactive dashboard with dc.js:

Prepare the data: The first step is to prepare the data that will be used to create the dashboard.
The data should be in a format that can be easily imported into dc.js, such as CSV or JSON.

Set up the environment: You'll need to set up your environment with all the necessary
dependencies. You can use a package manager such as npm or yarn to install dc.js and its
dependencies.

Create the charts: Once the data is ready and the environment is set up, you can start creating
the charts that will make up the dashboard. dc.js provides a wide range of chart types, such as
bar charts, line charts, pie charts, and more.

Create the dashboard: Once you have created the individual charts, you can combine them into a dashboard by placing their container elements on the same HTML page. Because the charts share the same crossfilter instance, they filter one another automatically.

Add interactivity: The final step is to add interactivity to the dashboard. This can be done by
using dc.js features such as filtering, brushing, and zooming.
Here is a basic example of creating a dashboard using dc.js:

// Import the data (d3 v4 callback API)
d3.csv("data.csv", function(error, data) {

  // Parse dates and coerce numeric values (assumes dates like 2023-01-31)
  var parseDate = d3.timeParse("%Y-%m-%d");
  data.forEach(function(d) {
    d.date = parseDate(d.date);
    d.value = +d.value;
  });

  // Create a crossfilter object shared by all charts
  var ndx = crossfilter(data);

  // Define the dimensions
  var dateDim = ndx.dimension(function(d) { return d.date; });
  var categoryDim = ndx.dimension(function(d) { return d.category; });

  // Define the groups
  var dateGroup = dateDim.group().reduceSum(function(d) { return d.value; });
  var categoryGroup = categoryDim.group().reduceSum(function(d) { return d.value; });

  // Create the charts, bound to <div> elements on the page
  var dateChart = dc.lineChart("#date-chart");
  var categoryChart = dc.pieChart("#category-chart");

  // Configure the charts
  dateChart
    .dimension(dateDim)
    .group(dateGroup)
    .x(d3.scaleTime().domain([dateDim.bottom(1)[0].date, dateDim.top(1)[0].date]))
    .renderArea(true);

  categoryChart
    .dimension(categoryDim)
    .group(categoryGroup);

  // Render all charts; because they share one crossfilter,
  // clicking or brushing one chart filters the others automatically
  dc.renderAll();

});
This example creates two charts: a line chart that shows the total value of the data over time, and a pie chart that shows the breakdown of the data by category. Because both charts are built on the same crossfilter instance, selecting a pie slice or brushing the line chart filters the other chart automatically, and all charts are drawn with the dc.renderAll() function.

Dashboard development tools:

There are several dashboard development tools available in the market, both open-source and
commercial, that can be used to create interactive and visually appealing dashboards. Some
of the popular dashboard development tools are:

Tableau: Tableau is a leading business intelligence and data visualization tool that offers a
wide range of features to create interactive dashboards.

Power BI: Power BI is a Microsoft product that enables users to create and share interactive
dashboards, reports, and data visualizations.
QlikView: QlikView is a business intelligence tool that allows users to create interactive
dashboards and reports that can be accessed from anywhere.

Domo: Domo is a cloud-based platform that enables users to create and share dashboards,
reports, and data visualizations.

Google Data Studio: Google Data Studio is a free web-based tool that allows users to create
interactive dashboards and reports using data from multiple sources.

Klipfolio: Klipfolio is a cloud-based dashboard and reporting tool that offers a wide range of
customization options to create interactive dashboards.

Looker: Looker is a cloud-based data analytics and business intelligence platform that offers
a wide range of features to create interactive dashboards and reports.

Dash: Dash is an open-source framework for building analytical web applications that can be
used to create interactive dashboards.

These tools offer different features, pricing plans, and level of complexity. Therefore, it is
important to assess the specific requirements of your dashboard project before choosing a
tool.

Data Ethics:
Introduction
Data ethics refers to the moral principles and values that govern the collection, processing,
use, and storage of data. It involves the responsible handling of data, taking into account
issues such as privacy, security, transparency, fairness, and accountability. Data ethics is
becoming increasingly important as the volume and variety of data being collected by
organizations continues to grow, and as advances in technology make it easier to manipulate
and analyze this data.

Some key principles of data ethics include:

Privacy: Respecting the privacy rights of individuals and protecting their personal data from
unauthorized access, use, or disclosure.
Security: Ensuring that data is kept secure and protected from cyber threats, theft, or loss.

Transparency: Being open and honest about how data is collected, used, and shared.

Fairness: Ensuring that data is used fairly, without discrimination or bias.

Accountability: Taking responsibility for the use of data and being accountable for any
negative consequences that may arise.

Consent: Obtaining informed consent from individuals before collecting, processing, or sharing their data.

Data ethics is important because it helps to build trust between organizations and their
stakeholders, including customers, employees, and the general public. By following ethical
principles when handling data, organizations can ensure that they are acting in the best
interests of their stakeholders, and that they are complying with legal and regulatory
requirements.

Building Bad Data Products:


Building bad data products can have negative consequences for both the developers and the
end-users. Here are some examples of how bad data products can cause problems:

Poor accuracy: If a data product provides inaccurate information or recommendations, it can lead to incorrect decisions and actions by end-users. For example, a health app that provides incorrect medical advice could be harmful to users.

Bias: If a data product is biased, it can lead to unfair or discriminatory outcomes. For
example, an AI-powered hiring tool that is biased against certain groups of candidates could
perpetuate existing inequalities.

Privacy concerns: If a data product collects or shares personal data without appropriate
consent or safeguards, it can lead to privacy violations and breach of trust. For example, a
fitness tracker app that shares user data with third-party advertisers without user consent
could be a violation of privacy.

Poor user experience: If a data product is difficult to use or understand, it can frustrate users
and lead to low adoption and usage rates. For example, a financial planning app that is overly
complex and difficult to navigate could turn users away.

To avoid building bad data products, developers should prioritize data quality, accuracy,
fairness, and user privacy. They should also involve diverse stakeholders and subject matter
experts in the development process to identify and mitigate potential risks and biases.
Additionally, they should regularly test and validate their products to ensure that they meet
user needs and expectations.
Trading Off Accuracy and Fairness:
In the context of machine learning, there is often a trade-off between accuracy and fairness.
Accuracy refers to the ability of a model to correctly predict outcomes, while fairness refers
to the equitable treatment of different groups or individuals.

For example, a model trained to predict creditworthiness may accurately predict whether
someone is likely to default on a loan, but may unfairly discriminate against certain groups of
people, such as those of a certain race or gender. In this case, there is a trade-off between
accuracy and fairness, as improving accuracy may come at the cost of fairness.

To address this trade-off, various techniques have been developed to ensure that machine
learning models are both accurate and fair. One such technique is called "fairness through
awareness," which involves explicitly taking into account the impact of the model's
predictions on different groups of people. This can be achieved by adjusting the model's
output to ensure that it does not unfairly discriminate against any particular group.

Another approach is to use a "trade-off" framework, where the model is optimized for both
accuracy and fairness simultaneously. This involves finding a balance between the two
objectives, rather than optimizing for one at the expense of the other.

Ultimately, achieving both accuracy and fairness in machine learning models requires careful
consideration of the trade-offs involved, as well as an understanding of the potential biases
and ethical implications of the model's predictions. It is important to ensure that machine
learning models are not only accurate, but also fair and ethical, in order to promote trust,
transparency, and social responsibility in their use.
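
One simple way to surface this trade-off is to report a model's accuracy and selection rate separately for each group rather than as a single overall number; the small table below is invented purely for illustration.

import pandas as pd

# Invented example: loan decisions for two groups (1 = approve)
results = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual":    [1,   0,   1,   0,   1,   0,   1,   0],
    "predicted": [1,   0,   1,   1,   0,   0,   1,   0],
})

# Per-group accuracy: how often the model is right for each group
accuracy = (results["actual"] == results["predicted"]).groupby(results["group"]).mean()

# Per-group selection rate: how often each group receives a positive decision
selection_rate = results.groupby("group")["predicted"].mean()

print("accuracy by group:")
print(accuracy)
print("selection rate by group:")
print(selection_rate)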
Collaboration:
Collaboration is the act of working together with one or more individuals or groups to
achieve a common goal or objective. Collaboration can take many forms and can occur in a
variety of settings, including in the workplace, in academic environments, and in social and
community contexts.

Collaboration involves individuals sharing their knowledge, skills, and resources with others,
and working together to solve problems, complete tasks, or achieve shared goals.
Collaboration can be facilitated through a variety of methods, including communication tools,
technology platforms, and in-person meetings and workshops.

Collaboration can be beneficial in many ways, including by allowing individuals to learn from one another, build stronger relationships, and achieve better results than they would working alone. Successful collaboration requires effective communication, mutual respect, and a shared commitment to the goal or objective at hand.

Interpretability
Interpretability refers to the ability to explain or understand the behavior or decisions of a
complex system or model in a way that is clear, concise, and understandable to humans. In
the context of machine learning and artificial intelligence, interpretability is an important
aspect that allows humans to understand the reasoning behind the decisions made by these
systems. It can help to build trust, improve accountability, and ensure fairness and ethical use
of the technology.

There are various techniques and methods used for interpretability, including feature
importance analysis, model visualization, sensitivity analysis, and explanation generation.
These techniques can help to provide insights into how a model works, what features are
most important for its decision-making, and how different input values affect its output.

Interpretability is especially important in domains where the consequences of decisions made by machine learning models can have a significant impact on people's lives, such as healthcare, finance, and criminal justice. In such cases, the ability to explain the reasoning behind the decisions is essential to ensure transparency, accountability, and fairness.

Recommendations
Data ethics is an essential aspect of the data-driven world we live in today. Here are some
recommendations for practicing ethical data handling:
Obtain consent: Always ensure that the data you collect is obtained with the consent of the
person whose data you are collecting. Provide clear and concise information on the purpose
of collecting the data and how it will be used.

Protect personal data: Protect personal data by implementing measures such as encryption,
anonymization, and access controls. Always ensure that the data you collect is kept secure
and that there is no unauthorized access.

Transparency: Be transparent about how you handle data. This means providing clear
information about the data you collect, the purpose of collecting it, and how it will be used.

Fairness: Ensure that data is handled fairly and that there is no discrimination or bias in how
it is collected, processed, or used.

Respect for privacy: Respect the privacy of individuals by ensuring that the data you collect
is used only for its intended purpose and not shared or used in ways that violate privacy.

Accountability: Take responsibility for your actions and the data you collect. Ensure that you
have processes in place to address any issues that may arise.

Continuous learning: Stay up-to-date with developments in data ethics and continually
evaluate and improve your practices.

By following these recommendations, you can ensure that you are handling data ethically and
responsibly.
Biased Data
When it comes to dealing with biased data, there are several steps you can take to mitigate the
problem and ensure your recommendations are as unbiased as possible:

Identify and acknowledge the bias: The first step in dealing with biased data is to recognize
that it exists. You should examine the data and identify any potential sources of bias, whether
they are related to the collection process, the sample size, or other factors.
Diversify your data sources: To reduce the impact of bias, it's important to gather data from a
variety of sources. This can help to counteract any individual biases that may be present in
the data.

Use unbiased metrics: When evaluating your data, it's important to use metrics that are
objective and unbiased. For example, if you are evaluating the effectiveness of a marketing
campaign, you might use metrics like conversion rate, click-through rate, or customer
retention rate, rather than subjective measures like brand awareness.

Regularly monitor and update your data: It's important to regularly review your data and
update it as necessary. This can help you identify any changes in the data or the underlying
environment that may impact the accuracy or bias of your recommendations.

Use machine learning techniques: Machine learning can help to identify and mitigate bias in
your data. For example, you might use techniques like data augmentation, feature selection,
or oversampling to address any imbalances or biases in your data.

Overall, dealing with biased data requires a thoughtful and proactive approach. By taking
steps to identify and address biases, you can ensure that your recommendations are as
accurate and unbiased as possible.

Data Protection

Data protection refers to the measures and practices taken to safeguard personal data and
ensure its privacy and security. Personal data can include information such as names,
addresses, phone numbers, email addresses, identification numbers, financial information,
medical records, and more.

Data protection is important because personal data can be vulnerable to theft, misuse, and
unauthorized access, which can lead to identity theft, financial fraud, and other types of harm.
Data protection laws and regulations aim to protect individuals' privacy rights and ensure that
their personal data is collected, processed, and stored securely and lawfully.

Some common data protection practices include data encryption, access controls, data backup
and recovery, regular data audits and assessments, and employee training on data handling
and protection. Many countries have laws and regulations in place to protect personal data,
such as the European Union's General Data Protection Regulation (GDPR) and the United
States' California Consumer Privacy Act (CCPA).
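
As a small illustration of encryption as one of these practices, the Python cryptography library's Fernet recipe provides symmetric encryption in a few lines. This is only a sketch: a real system also needs careful key management, access controls, and compliance with the applicable regulations.

from cryptography.fernet import Fernet

# Generate a secret key (in practice, store and manage this key securely)
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt a piece of personal data before storing it
token = fernet.encrypt(b"jane.doe@example.com")

# Only holders of the key can decrypt it again
original = fernet.decrypt(token)
print(original.decode())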

Go Forth and do Data Science:


Here are some steps that can help you get started with Data Science:

Learn the fundamentals of programming: Before you dive into Data Science, it's important to
have a strong foundation in programming. You should learn a programming language such as
Python, R, or SQL.

Study Mathematics and Statistics: Data Science is heavily reliant on mathematics and
statistics. Understanding concepts such as linear algebra, calculus, probability, and statistics
is crucial to become a good Data Scientist.

Study Machine Learning: Machine Learning is a critical component of Data Science. It is important to have a solid understanding of different machine learning algorithms, how they work, and when to use them.

Learn Data Wrangling: Data Wrangling involves cleaning, transforming, and preparing data
for analysis. It is a time-consuming but necessary process to ensure that your data is accurate
and usable.

Practice Data Visualization: Data visualization is the art of representing data in a graphical
form. It is important to be able to present data in a way that is easy to understand and visually
appealing.

Build a Portfolio: Build projects using different techniques and present them in your
portfolio. You can use Kaggle, a platform for data science competitions, to gain exposure and
build your portfolio.

Learn from Others: Attend meetups, conferences, and online communities where you can
learn from other Data Scientists, ask questions and get feedback.
Remember, Data Science is a constantly evolving field, and there is always more to learn.
Stay curious and keep learning. Good luck on your Data Science journey!

IPython

IPython is an interactive computing environment that is commonly used for data analysis and
scientific computing. In the context of data ethics, IPython can be a useful tool for exploring
ethical issues related to data, as well as for analyzing and visualizing data to gain insights that
can inform ethical decision-making.

One way IPython can be used in data ethics is for exploring biases in data. Data can be biased
in many ways, such as through selection bias, measurement bias, or confounding variables.
By using IPython to analyze and visualize data, researchers can identify and explore potential
biases in their data, which can inform decisions about how to collect, analyze, and interpret
data in an ethical manner.

Another way IPython can be used in data ethics is for exploring the ethical implications of
data-driven decisions. Data-driven decisions can have far-reaching impacts on individuals
and society, and it is important to consider the ethical implications of these decisions. By
using IPython to analyze and visualize data, researchers can explore the potential impacts of
different decision-making scenarios and identify potential ethical concerns that should be
taken into account.

Finally, IPython can be used for communicating about ethical issues related to data. By using
IPython notebooks to document data analyses and ethical considerations, researchers can
share their work with others and facilitate conversations about the ethical implications of
data. This can help ensure that ethical considerations are integrated into data analysis and
decision-making processes.

Mathematics

Mathematics plays an important role in data ethics because it provides a framework for
analyzing and interpreting data in a way that is fair, transparent, and unbiased. Here are some
specific ways in which mathematics is used in data ethics:
Statistical analysis: Statistics is a branch of mathematics that is used to analyze and interpret
data. In data ethics, statistical analysis can be used to identify biases in data and to ensure that
data is being collected and analyzed in a fair and unbiased way.

Machine learning algorithms: Machine learning algorithms are a type of mathematical model
that is used to analyze and interpret large datasets. In data ethics, machine learning algorithms
can be used to identify biases in data and to ensure that data is being collected and analyzed
in a fair and unbiased way.

Data privacy: Cryptography is a branch of mathematics that is used to protect data privacy. In
data ethics, cryptography can be used to protect sensitive data and ensure that data is being
used in an ethical way.

Fairness in algorithms: Mathematics can be used to develop algorithms that are fair and
unbiased. For example, fairness can be measured mathematically using statistical methods,
and algorithms can be designed to minimize unfairness and bias.

Overall, mathematics plays a crucial role in data ethics by providing the tools and techniques
needed to ensure that data is being collected, analyzed, and used in an ethical way.

Not from Scratch


Go is a programming language that is often used for building high-performance applications.
Although it is not as widely used in data science as languages like Python and R, it is still
possible to perform data science tasks in Go with the help of third-party libraries and tools.

One way to get started with data science in Go is to use existing libraries and tools rather than
building everything from scratch. Some popular libraries for data manipulation and analysis
in Go include:

Gonum: a set of numerical libraries for Go that includes packages for linear algebra,
optimization, and statistics.

Gorgonia: a library for machine learning and deep learning in Go that is similar to
TensorFlow and PyTorch.
Gota: a data frame and data manipulation library for Go that provides functionality similar to
that of the pandas library in Python.

In addition to these libraries, there are also a number of tools available for data visualization
in Go, such as Plotly and Gonum Plot.

Overall, while Go may not be the most popular language for data science, there are still
plenty of resources available for those who want to use it for this purpose. By using existing
libraries and tools, you can get up and running quickly and focus on the specific data science
tasks that you want to accomplish.
Find data
The first step in doing data science is to find data to work with. Useful sources include public data sets, web APIs, files exported from internal systems, and data gathered by web scraping. Once you have a dataset in a file format that your tools can read, you can perform some basic data analysis.

Here are some steps to perform a simple analysis on a dataset you have found:

Load the data: The first step is to load the data into memory, for example by reading it from a CSV or JSON file.

Parse the data: Once the data is loaded into memory, you can parse it to extract the relevant information. Depending on the format of your data, you may need to write your own parsing code.

Clean the data: Data cleaning is an important step in data science. You may need to remove missing values, outliers, or duplicates from your data.

Analyze the data: Once the data is cleaned, you can perform some basic data analysis. For example, you can calculate the mean, median, and mode of a variable, or calculate the correlation between two variables.

Visualize the data: Data visualization is an important part of data science. You can create simple plots and charts to visualize your data.

For more complex analysis, languages with rich data science ecosystems such as Python, R, or Julia are usually the most practical choice.

Do Data Science
The best way to learn data science is to practice it with real data and real tools. Here is some background on how the Go programming language relates to data science work.

Go (also known as Golang) is a programming language developed by Google. It is known for its simplicity, efficiency, and concurrency features. Go has a growing community of developers and is often used for building backend systems, network programming, and cloud-native applications.

Data Science, on the other hand, is an interdisciplinary field that combines statistical analysis,
machine learning, and computer science to extract insights from data. It involves various
stages such as data collection, data cleaning, data transformation, and model building.

If you are interested in pursuing a career in Data Science, learning Go programming language
can be beneficial in some cases, especially if you are interested in building data-intensive
applications or working with big data. Go has several libraries and frameworks that can be
used for data analysis and processing, such as Gota, Gonum, and GoLearn.

However, it is worth noting that other programming languages such as Python and R are
more commonly used in the Data Science community due to their extensive libraries, tools,
and community support specifically designed for Data Science. Therefore, if you are just
starting with Data Science, it may be more beneficial to learn Python or R first before
exploring other programming languages such as Go.
