UNIT I INTRODUCTION
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing – Big Data

Big Data:
Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as the RDBMS.
I. Data Science:
• Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains.
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are the different types of data?
o Velocity—At what speed is new data generated?
• A fourth V is often added:
o Veracity—How accurate is the data?
• Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today.
• What sets a data scientist apart from a statistician is the ability to work with big data and experience in machine learning, computing, and algorithm building. Typical tools include Hadoop, Pig, Spark, R, Python, and Java, among others.
• Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
• Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience.
o Eg: Google AdSense, which collects data from internet users so relevant commercial messages can be matched to the person browsing the internet.
o Eg: MaxPoint is an example of real-time personalized advertising.
• Human resource professionals:
o use people analytics and text mining to screen candidates,
o monitor the mood of employees, and
o study informal networks among coworkers
• Financial institutions use data science:
o to predict stock markets, determine the risk of lending money, and
o learn how to attract new clients for their services
• Governmental organizations:
o employ internal data scientists to discover valuable information,
o share their data with the public
o Eg: Data.gov is but one example; it’s the home of the US Government’s open
data.
o Governmental organizations have also collected 5 billion data records from widespread applications such as Google Maps, Angry Birds, email, and text messages, among many other data sources.
• Nongovernmental organizations:
o World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of their fundraising efforts.
o Eg: DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
• Universities:
o Universities use data science not only in their research but also to enhance the study experience of their students.
o Massive open online courses (MOOCs) produce a lot of data, which allows universities to study how this type of learning can complement traditional classes.
o Eg: Coursera, Udacity, and edX
Structured data:
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• It is easy to store structured data in tables within databases or Excel files and to query it with Structured Query Language (SQL).
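As a minimal sketch of what structured data looks like in practice, the snippet below stores a small, hypothetical people table in an in-memory SQLite database and queries it with SQL (the table and column names are invented for illustration):

import sqlite3

# Structured data: every record has the same fixed fields (name, age, country),
# so it fits naturally into a database table and can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [("Alice", 34, "FR"), ("Bob", 29, "US")])
for row in conn.execute("SELECT name, age FROM people WHERE country = 'FR'"):
    print(row)
conn.close()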
Unstructured data:
• Unstructured data is data that isn’t easy to fit into a data model
• The content is context-specific or varying.
• Eg: E-mail
• Email contains structured elements such as the sender, title, and body text
• Eg: It’s a challenge to find the number of people who have written an email
complaint about a specific employee because so many ways exist to refer to a
person.
• The challenge is compounded by the thousands of different languages and dialects in use.
Natural language:
• A human-written email is also a perfect example of natural language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of specific data science
techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition, summarization, text
completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machines without human intervention.
• Machine-generated data is becoming a major data resource.
• Eg: Wikibon has forecast that the market value of the industrial Internet will be
approximately $540 billion in 2020.
• International Data Corporation has estimated there will be 26 times more
connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graph data.
• Graph-based data is a natural way to represent social networks, and its structure allows you to calculate the shortest path between two people (see the sketch after this list).
• Graph-based data can be found on many social media websites.
• Eg: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
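As a small illustration of the shortest-path idea mentioned above, the sketch below builds a tiny, hypothetical social network with the networkx library (the names and connections are invented for illustration):

import networkx as nx

# Hypothetical social network: nodes are people, edges are "knows" relationships
g = nx.Graph()
g.add_edges_from([("Ann", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Dave"), ("Ann", "Eve")])

# The graph structure lets us compute the shortest path between two people
print(nx.shortest_path(g, source="Ann", target="Dave"))  # ['Ann', 'Bob', 'Carol', 'Dave']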
Streaming data:
• The data flows into the system when an event happens instead of being loaded into
a data store in a batch.
• Examples are the “What’s trending” topics on Twitter, live sporting or music events, and the stock market.
The Data Science Process:
• A structured data science approach helps you maximize your chances of success in a data science project at the lowest cost.
• The first step of this process is setting a research goal.
• The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project.
• Record the outcome in a project charter.
• The next step is retrieving data: assess the relevance and quality of the data that’s readily available within the company.
• Company data - data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT
professionals.
• Data mart: A data mart is a subset of the data warehouse and will be serving a
specific business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
• Challenge: As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the
company.
• Chinese Walls: These policies translate into physical and digital barriers called
Chinese walls. These “walls” are mandatory and well-regulated for customer data.
Don’t be afraid to shop around:
If the data you need isn’t available within the company, look outside it; other organizations and open-data providers may be able to supply it.
Data preparation:
The model needs the data in a specific format, so data transformation is a necessary step.
It’s a good habit to correct data errors as early in the process as possible.
Cleansing data:
Data cleansing is a subprocess of the data science process that focuses on removing errors in the data, so that the data becomes a true and consistent representation of the processes it originates from.
Types of errors:
Interpretation errors - e.g., a person’s age is recorded as greater than 300 years.
Inconsistencies - e.g., putting “Female” in one table and “F” in another when they represent the same thing.
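A minimal sketch of fixing such an inconsistency with pandas (the column name and codes are hypothetical):

import pandas as pd

# Hypothetical column where the same category is coded in two different ways
df = pd.DataFrame({"gender": ["Female", "F", "Male", "M", "F"]})

# Map the inconsistent codes onto one consistent representation
df["gender"] = df["gender"].replace({"F": "Female", "M": "Male"})
print(df["gender"].value_counts())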
DATA ENTRY ERRORS:
• Data collection and data entry are error-prone processes.
• Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
• Eg: transmission errors
REDUNDANT WHITESPACE:
• Whitespaces tend to be hard to detect but cause errors like other redundant
characters.
• Eg: a mismatch of keys such as “FR ” – “FR”
• Fixing redundant whitespace - in Python you can use the strip() method to remove leading and trailing spaces.
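A minimal sketch of the problem and the fix (the key values are hypothetical):

# A trailing space makes two keys that should match compare as unequal
key_a = "FR "
key_b = "FR"
print(key_a == key_b)          # False: the redundant whitespace causes a mismatch

# strip() removes leading and trailing whitespace before comparing or joining
print(key_a.strip() == key_b)  # True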
OUTLIERS
• An outlier is an observation that seems to be distant from other observations.
• The normal distribution, or Gaussian distribution, is the most common distribution
in natural sciences.
When plotting the data and assuming a normal distribution, unusually high (or low) values can point to outliers.
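A minimal sketch of flagging outlier candidates under a normality assumption, using z-scores (the data and the 3-standard-deviation threshold are illustrative):

import numpy as np

# Hypothetical measurements: mostly well-behaved values plus one extreme observation
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=5.0, scale=1.0, size=100), 55.0)

# Under a normality assumption, observations more than 3 standard deviations
# from the mean are commonly treated as outlier candidates.
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 3])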
JOINING TABLES
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table.
• To join tables, you use variables that represent the same object in both tables; these common fields are known as keys.
• When these keys also uniquely define the records in the table, they are called primary keys.
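A minimal sketch of joining two hypothetical tables on a shared key with pandas:

import pandas as pd

# Hypothetical client and purchase tables sharing the key column "client_id"
clients = pd.DataFrame({"client_id": [1, 2, 3],
                        "region": ["North", "South", "East"]})
purchases = pd.DataFrame({"client_id": [1, 2, 2],
                          "amount": [100, 250, 80]})

# Join on the key to enrich each purchase with the client's information
enriched = purchases.merge(clients, on="client_id", how="left")
print(enriched)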
APPENDING TABLES
• Appending or stacking tables is effectively adding observations from one table to another table (see the sketch after this list).
• With brushing and linking we combine and link different graphs and tables or
views so changes in one graph are automatically transferred to the other graphs.
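A minimal sketch of appending (stacking) two hypothetical tables with the same columns using pandas:

import pandas as pd

# Hypothetical monthly sales tables with identical columns
jan = pd.DataFrame({"store": ["A", "B"], "sales": [1200, 900]})
feb = pd.DataFrame({"store": ["A", "B"], "sales": [1500, 950]})

# Appending adds the observations of one table below the other
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)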
Pareto Diagram:
• A Pareto diagram combines a sorted bar chart of values with a line showing their cumulative total, making it easy to see which categories contribute the most.
Build the model:
• With clean data in place and a good understanding of the content, we’re ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we’re modeling.
• The techniques we’ll use now are borrowed from the field of machine learning,
data mining, and/or statistics.
We need to select the variables we want to include in the model and a modeling technique.
We’ll need to consider model performance and whether our project meets all the requirements to use the model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?
Model execution:
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
• These packages use several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available
can speed up the process.
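A minimal sketch of model execution with Scikit-learn; the data below is randomly generated purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: a target that depends linearly on two predictors plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Fit a linear regression model and inspect the estimated coefficients
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)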
Predictor significance—Coefficients are great, but sometimes not enough evidence exists to show that the influence is there. This is where the p-value comes in: a low p-value (commonly below 0.05) indicates that it is unlikely the predictor has no influence at all.
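A minimal sketch of inspecting predictor significance with StatsModels; the data is randomly generated, and one of the two predictors is deliberately pure noise, so its p-value should be large:

import numpy as np
import statsmodels.api as sm

# Illustrative data: the target depends only on the first predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

# OLS reports a p-value per coefficient; a small p-value (e.g. < 0.05)
# is taken as evidence that the predictor really has an influence.
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())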
Model diagnostics and model comparison
Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used
to evaluate the model afterward. The principle here is simple: the model should work on
unseen data.
Mean square error is a simple measure: for every prediction, check how far it was from the truth, square this error, and average the squared errors over all predictions.
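A minimal sketch of evaluating a model on a holdout sample with Scikit-learn (the data is randomly generated for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data, split into a training part and a holdout part
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit on the training data only, then score the model on the unseen holdout sample
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_hold, model.predict(X_hold)))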
X. Data Mining:
Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
XI. Data Warehouses:
• A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
• To facilitate decision making, the data in a data warehouse are organized around
major subjects (e.g., customer, item, supplier, and activity).
• The data are stored to provide information from a historical perspective, such as in
the past 6 to 12 months, and are typically summarized.
• For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or,
summarized to a higher level, for each sales region.
• A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count.
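A minimal sketch of the data-cube idea using pandas; the transaction table, column names, and figures are hypothetical, and the pivot table below corresponds to one two-dimensional slice of a cube whose cells hold an aggregate measure (here, total sales):

import pandas as pd

# Hypothetical transaction records with two dimensions: item_type and store
sales = pd.DataFrame({
    "item_type": ["phone", "phone", "laptop", "laptop", "phone"],
    "store":     ["North", "South", "North", "North", "South"],
    "amount":    [300, 320, 900, 950, 310],
})

# Summarizing per item type and store rather than keeping every transaction
cube_slice = sales.pivot_table(values="amount", index="item_type",
                               columns="store", aggfunc="sum")
print(cube_slice)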