
CS3352 – Foundations of Data Science

UNIT I INTRODUCTION

Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing

Big Data:

Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as, for example,
the RDBMS.

I. Data Science:

• Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains.
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• Fourth V:
• Veracity: How accurate is the data?
• Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today.
• What sets a data scientist apart from a statistician is the ability to work with big data and
experience in machine learning, computing, and algorithm building.
• Common tools include Hadoop, Pig, Spark, R, Python, and Java, among others.

II. Benefits and uses of data science and big data

• Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
• Commercial companies in almost every industry use data science and big data to
gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience.
o Eg: Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet
o MaxPoint - example of real-time personalized advertising.
• Human resource professionals:
o people analytics and text mining to screen candidates,
o monitor the mood of employees, and
o study informal networks among coworkers
• Financial institutions use data science:
o to predict stock markets, determine the risk of lending money, and
o learn how to attract new clients for their services
• Governmental organizations:
o internal data scientists to discover valuable information,
o share their data with the public
o Eg: Data.gov is but one example; it’s the home of the US Government’s open
data.
o Security and intelligence organizations have reportedly collected some 5 billion
data records from widespread applications such as Google Maps, Angry Birds,
email, and text messages, among many other data sources.
• Nongovernmental organizations:
o World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of their fundraising efforts.
o Eg: DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
• Universities:
o Use data science in their research but also to enhance the study experience of
their students.
o Massive open online courses (MOOCs) produce a lot of data, which allows
universities to study how this type of learning can complement traditional
classes.
o Eg: Coursera, Udacity, and edX

III. Facets of data:


The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming

Structured data:
• Structured data is data that depends on a data model and resides in a fixed field
within a record.
• It is easy to store structured data in tables within databases or Excel files, and it is
typically managed and queried with Structured Query Language (SQL).

Unstructured data:

• Unstructured data is data that isn’t easy to fit into a data model
• The content is context-specific or varying.
• Eg: E-mail
• Email contains structured elements such as the sender, title, and body text
• Eg: It is a challenge to find the number of people who have written an email
complaint about a specific employee, because so many ways exist to refer to a
person.
• The thousands of different languages and dialects add further complexity.

Natural language:
• A human-written email is also a perfect example of natural language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of specific data science
techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition, summarization, text
completion, and sentiment analysis.
• Human language is ambiguous in nature.

Machine-generated data:
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machines without human intervention.
• Machine-generated data is becoming a major data resource.
• Eg: Wikibon has forecast that the market value of the industrial Internet will be
approximately $540 billion in 2020.
• International Data Corporation has estimated there will be 26 times more
connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph
is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• Graph structures use nodes, edges, and properties to represent and store graph data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate, for example, the shortest path between two people.
• Graph-based data can be found on many social media websites.
• Eg: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.

Audio, image, and video:


• Audio, image, and video are data types that pose specific challenges to a data
scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out
to be challenging for computers.
• Major League Baseball Advanced Media captures approximately 7 TB of video per
game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements so that
metrics can be calculated in real time.
• DeepMind succeeded at creating an algorithm that is capable of learning how to
play video games.
• This algorithm takes the video screen as input and learns to interpret everything
via a complex process of deep learning.
• Google later acquired DeepMind as part of its artificial intelligence development plans.

Streaming data:
• The data flows into the system when an event happens instead of being loaded into
a data store in a batch.
• Examples are the “What’s trending” section on Twitter, live sporting or music events, and
the stock market.

The data science process:

• The data science process typically consists of six steps:


o Setting the research goal
o Retrieving data
o Data preparation
o Data exploration
o Data modeling or model building
o Presentation and automation

Figure: The data science process.

IV. Overview of the data science process:

• A structured data science approach helps you maximize your chances of success in
a data science project at the lowest cost.
• The first step of this process is setting a research goal.
• The main purpose here is to make sure all the stakeholders understand the what,
how, and why of the project.
• Record the outcome in a project charter.

Step 1: Defining research goals and creating a project charter


• A project starts by understanding your project's what, why, and how.
• The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable.
• This information is then best placed in a project charter.

Spend time understanding the goals and context of your research:


• An essential outcome is the research goal that states the purpose of your assignment
in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
• Keep asking questions and devising examples until you understand:
o the business expectations,
o how your research is going to change the business, and
o how they will use your results

Create a project charter:


• Make a formal agreement on the deliverables.
• All this information is best collected in a project charter.
• A project charter requires teamwork, and your input covers at least the following:
■A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline

V. Step 2: Retrieving data

• The next step in data science is to retrieve the required data.


• Sometimes we need to go into the field and design a data collection process.
• Many companies will have already collected and stored the data we need.
• Data can also be bought from third parties.
• Look outside the organization as well: high-quality data is often freely available for
public and commercial use.
• Data can be stored in many forms, ranging from simple text files to tables in a
database.

Start with data stored within the company

• The first step is to assess the relevance and quality of the data that is readily available
within the company.
• Company data - data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT
professionals.
• Data mart: A data mart is a subset of the data warehouse and will be serving a
specific business unit.
• Data lakes: Data lakes contain data in its natural or raw format.
• Challenge: As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the
company.
• Chinese Walls: These policies translate into physical and digital barriers called
Chinese walls. These “walls” are mandatory and well-regulated for customer data.
Don’t be afraid to shop around:

• Many companies specialize in collecting valuable information.


• Nielsen and GfK are well-known examples in the retail industry.
• Twitter, LinkedIn, and Facebook offer data as a service.

Do data quality checks now to prevent problems later:


• Do data correction and cleansing during data retrieval: check whether the data is
equal to the data in the source document and whether you have the right data types.
• If you discover outliers later, in the exploratory phase, they can still point to a data
entry error.
VI. Step 3: Cleansing, integrating, and transforming data

The model needs the data in a specific format, so data transformation is also part of this step.
It is a good habit to correct data errors as early on in the process as possible.

Cleansing data:
Data cleansing is a subprocess of the data science process that focuses on removing errors
in the data, so that the data becomes a true and consistent representation of the processes it
originates from.
Types of errors:
Interpretation errors - e.g., taking a person's age at face value when it is greater than 300 years.
Inconsistencies - e.g., putting "Female" in one table and "F" in another when they represent
the same thing.
DATA ENTRY ERRORS:
• Data collection and data entry are error-prone processes.
• Some errors arise from human sloppiness, whereas others are due to machine or
hardware failure.
• Eg: transmission errors

REDUNDANT WHITESPACE:
• Whitespaces tend to be hard to detect but cause errors like other redundant
characters.
• Eg: a mismatch of keys such as “FR ” – “FR”
• To fix redundant whitespace in Python, use the strip() function to remove leading
and trailing spaces, as sketched below.
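
A minimal Python sketch of this fix (the DataFrame and column names here are hypothetical, not from the notes):

import pandas as pd

# Hypothetical country-code column containing values such as "FR " and "FR"
df = pd.DataFrame({"country_code": ["FR ", "FR", " DE", "DE"]})

# strip() removes leading and trailing whitespace from a single string
assert "FR ".strip() == "FR"

# For a whole column, pandas offers the vectorized str.strip()
df["country_code"] = df["country_code"].str.strip()
print(df["country_code"].unique())  # ['FR' 'DE']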

FIXING CAPITAL LETTER MISMATCHES:


• Capital letter mismatches are common, e.g. the distinction between "Brazil" and "brazil".
• Convert strings to lowercase, for example with .lower() in Python:
"Brazil".lower() == "brazil".lower() should result in True.

IMPOSSIBLE VALUES AND SANITY CHECKS:


• Sanity checks are another valuable type of data check.
• Check the value against physically or theoretically impossible values, such as
people taller than 3 meters or someone with an age of 299 years.
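
A minimal sketch of such a sanity check in Python, assuming a hypothetical table with age and height_m columns:

import pandas as pd

# Hypothetical data containing a few physically impossible values
people = pd.DataFrame({"age": [34, 299, 41], "height_m": [1.75, 1.68, 3.40]})

# Flag rows that violate simple sanity rules for manual review
invalid = people[(people["age"] < 0) | (people["age"] > 120) |
                 (people["height_m"] <= 0) | (people["height_m"] > 3)]
print(invalid)  # the rows with age 299 and height 3.40 m are flagged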

OUTLIERS
• An outlier is an observation that seems to be distant from other observations.
• The normal distribution, or Gaussian distribution, is the most common distribution
in natural sciences.

In a plot of the distribution, unexpectedly high values can point to outliers when a normal
distribution is assumed.
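
The notes assume a normal distribution; one common practical alternative is the interquartile-range (IQR) rule, sketched below on a purely hypothetical series of measurements:

import numpy as np

# Hypothetical measurements with one suspiciously large value
values = np.array([10.2, 9.8, 9.9, 10.1, 10.3, 9.7, 10.5, 55.0])

# IQR rule: points beyond 1.5 * IQR from the quartiles are candidate outliers
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # [55.]

Flagged points are candidates for inspection, not automatic deletions; an outlier can also be a genuine, interesting observation.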

DEALING WITH MISSING VALUES:
• Missing values aren't necessarily wrong, but they still need to be handled separately.
• Common techniques include omitting the observations, setting the value to null,
imputing a static value such as 0 or the mean, or imputing a value modeled from the
other variables.
• Each option has trade-offs: omitting observations loses information, while imputation
can distort the distribution.
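
A minimal pandas sketch of two of these options, dropping versus mean imputation, on a hypothetical income column:

import numpy as np
import pandas as pd

# Hypothetical data set with one missing income value
df = pd.DataFrame({"age": [25, 31, 47, 52],
                   "income": [42000, np.nan, 58000, 61000]})

dropped = df.dropna(subset=["income"])                # omit the observation
imputed = df.fillna({"income": df["income"].mean()})  # impute the mean
print(imputed)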

DEVIATIONS FROM A CODE BOOK:


• Detecting errors in larger data sets against a code book or against standardized
values can be done with the help of set operations.
• A code book is a description of your data, a form of metadata.
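
A minimal sketch of such a set-operation check, with hypothetical category codes standing in for a real code book:

# Codes allowed by the (hypothetical) code book
codebook_values = {"M", "F", "U"}

# Codes actually observed in the data set
observed_values = {"M", "F", "Female", "U", "f"}

# The set difference reveals values that deviate from the code book
deviations = observed_values - codebook_values
print(deviations)  # {'Female', 'f'}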

DIFFERENT UNITS OF MEASUREMENT


• When integrating two data sets, we have to pay attention to their respective units
of measurement.
• Eg: Some data sets contain prices per gallon while others contain prices per liter.
DIFFERENT LEVELS OF AGGREGATION
• Having different levels of aggregation is similar to having different types of
measurement.
• Eg: A data set containing data per week versus one containing data per work week.

Correct errors as early as possible:


• A good practice is to remediate data errors as early as possible in the data collection
chain and to fix as little as possible inside your analysis, correcting the problem at its
source instead.
• The data collection process is error-prone, and in a big organization it involves many
steps and teams.
• Data should be cleansed when acquired, for many reasons:
• Not everyone spots the data anomalies.
• If errors are not corrected early on in the process, the cleansing will have to be
repeated for every project that uses the data.
• Data errors may point to a business process that isn’t working as designed.
• Data errors may point to defective equipment, etc.,
• Data errors can point to bugs in software or in the integration of software.
• Data manipulation doesn’t end with correcting mistakes; still need to combine your
incoming data.

Combining data from different data sources:


• Data varies in size, type, and structure, ranging from databases and Excel files to
text documents.

THE DIFFERENT WAYS OF COMBINING DATA:


• There are two operations for combining information from different data sets.
• The first operation is joining: enriching an observation from one table with
information from another table.
• The second operation is appending or stacking: adding the observations of one
table to those of another table.

JOINING TABLES
• Joining tables allows you to combine the information of one observation found in one
table with the information that you find in another table.
• The variables the tables have in common are used as keys; when these keys also
uniquely define the records in the table, they are called primary keys.
APPENDING TABLES
• Appending or stacking tables is effectively adding observations from one table to
another table.
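
A minimal pandas sketch of both operations, using hypothetical tables and a hypothetical client_id key:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "name": ["Ann", "Bob"]})
regions = pd.DataFrame({"client_id": [1, 2], "region": ["North", "South"]})
jan_sales = pd.DataFrame({"client_id": [1], "amount": [100]})
feb_sales = pd.DataFrame({"client_id": [2], "amount": [250]})

# Joining: enrich each client observation with its region via the key column
enriched = clients.merge(regions, on="client_id", how="left")

# Appending (stacking): add the observations of one table to another
all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True)
print(enriched)
print(all_sales)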

USING VIEWS TO SIMULATE DATA JOINS AND APPENDS

• To avoid duplication of data, we can virtually combine data with views.


• For example, sales data from different months can be combined virtually into a yearly
sales view instead of duplicating the data.
• The drawback is that, while a physical table join is performed only once, the join that
creates the view is recreated every time it is queried.

ENRICHING AGGREGATED MEASURES


• Data enrichment can also be done by adding calculated information to the table.
• Eg: the total number of sales, or what percentage of total stock has been sold in a
certain region.
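
A minimal sketch of such an enrichment in pandas, adding a regional total and a share-of-total column to a hypothetical sales table:

import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South"],
                      "units_sold": [120, 80, 200]})

# Enrich each row with its regional total and its percentage of overall sales
sales["region_total"] = sales.groupby("region")["units_sold"].transform("sum")
sales["pct_of_total"] = 100 * sales["units_sold"] / sales["units_sold"].sum()
print(sales)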
TRANSFORMING DATA

• Relationships between an input variable and an output variable aren't always linear.
• Take, for instance, a relationship of the form y = ae^(bx).
• Taking the logarithm of y turns this into the linear relationship log y = log a + bx,
which greatly simplifies the estimation problem.
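
A minimal numpy sketch of this idea, fitting a straight line to log y on synthetic data generated from y = ae^(bx); the parameter values are made up for illustration:

import numpy as np

# Synthetic data from y = a * exp(b * x) with a = 2.0 and b = 0.5, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(0.5 * x) * rng.normal(1.0, 0.02, size=x.size)

# After taking log(y) the relationship is linear: log y = log a + b * x
b_hat, log_a_hat = np.polyfit(x, np.log(y), deg=1)
print("estimated a:", np.exp(log_a_hat), "estimated b:", b_hat)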

REDUCING THE NUMBER OF VARIABLES


• Sometimes we have too many variables and need to reduce the number, because some
of them don't add new information to the model.
• Having too many variables in your model makes the model difficult to handle, and
certain techniques don't perform well when you overload them with too many
input variables.
• Data scientists use special methods, such as principal component analysis (PCA), to
reduce the number of variables while retaining the maximum amount of information.
• Reducing the number of variables makes it easier to understand the key values.
• These variables, called “component1” and “component2,” are both combinations of
the original variables.
• They’re the principal components of the underlying data structure.
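
A minimal scikit-learn sketch of principal component analysis on a synthetic data set with five correlated variables (all values here are made up):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five columns driven by only two underlying factors
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Keep the two principal components that capture most of the variation
pca = PCA(n_components=2)
components = pca.fit_transform(X)       # "component1" and "component2"
print(pca.explained_variance_ratio_)    # their combined share is close to 1.0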

TURNING VARIABLES INTO DUMMIES


• Variables can be turned into dummy variables.
• Dummy variables can only take two values: true(1) or false(0).
• They're used to indicate the absence or presence of a categorical effect that may
explain the observation.
• An example is turning one column named Weekdays into the columns Monday
through Sunday.
• Use an indicator to show whether the observation was on a Monday: put 1 in the
Monday column and 0 elsewhere.
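
A minimal pandas sketch of turning a hypothetical weekday column into dummy variables:

import pandas as pd

obs = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"],
                    "sales": [10, 12, 9, 20]})

# Each weekday becomes its own indicator column (1/True on that day, 0/False elsewhere)
dummies = pd.get_dummies(obs["weekday"], prefix="is")
obs = pd.concat([obs, dummies], axis=1)
print(obs)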
VII. Step 4: Exploratory data analysis

Information becomes much easier to grasp when shown in a picture.


During exploratory data analysis we use graphical techniques to gain an understanding of
the data and the interactions between variables.

Visualization techniques range from simple line graphs or histograms to more complex
composite diagrams.


Brushing and Linking:

• With brushing and linking we combine and link different graphs and tables or
views so changes in one graph are automatically transferred to the other graphs.
Pareto Diagram:

• A Pareto diagram is a combination of the values and a cumulative distribution.


• It's easy to see from such a diagram that the first 50% of the countries contain slightly
less than 80% of the total amount.
• If this graph represented customer buying power and we sell expensive products,
we probably don't need to spend our marketing budget in every country; we could
start with the first 50%.
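
A minimal matplotlib sketch of a Pareto diagram for hypothetical per-country values, combining the sorted bar values with their cumulative percentage:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values per country, sorted from largest to smallest
values = pd.Series({"US": 50, "CN": 30, "DE": 10, "FR": 6, "IN": 4})
values = values.sort_values(ascending=False)
cumulative_pct = 100 * values.cumsum() / values.sum()

fig, ax1 = plt.subplots()
ax1.bar(values.index, values.values)                             # individual values
ax2 = ax1.twinx()
ax2.plot(values.index, cumulative_pct, marker="o", color="red")  # cumulative distribution
ax2.set_ylabel("cumulative %")
plt.show()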

• In a histogram a variable is cut into discrete categories and the number of
occurrences in each category is summed up and shown in the graph.
• The boxplot doesn't show how many observations are present but does offer an
impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing measures at
the same time.
Tabulation, clustering, and other modeling techniques can also be a part of exploratory
analysis.

VIII. Step 5: Build the models

• With clean data in place and a good understanding of the content, we’re ready to
build models with the goal of making better predictions, classifying objects, or
gaining an understanding of the system that we’re modeling.

• The techniques we’ll use now are borrowed from the field of machine learning,
data mining, and/or statistics.

• Building a model is an iterative process.


• The way we build our model depends on whether we go with classic statistics or
more recent machine learning techniques, and on the type of technique we want to use.
• Either way, most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison

Model and variable selection

We need to select the variables to include in the model and a modeling technique.
We'll need to consider model performance and whether the project meets all the
requirements to use the model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?

Model execution:

• Most programming languages already have model-building libraries; Python, for
example, has StatsModels and Scikit-learn.
• These packages use several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available
can speed up the process.
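
A minimal StatsModels sketch of executing a linear model on synthetic data; the coefficients below are invented purely so the summary output ties in with the discussion of model fit and predictor coefficients that follows:

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on two predictors plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 0.7658 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Fit ordinary least squares; the summary reports R-squared, the adjusted
# R-squared, the coefficients, and their p-values
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())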

Model fit—For this the R-squared or adjusted R-squared is used.


• This measure is an indication of the amount of variation in the data that gets
captured by the model.
• The difference between the adjusted R-squared and the R-squared is minimal here,
because the adjusted R-squared is essentially the R-squared corrected with a penalty
for model complexity.
• A model gets complex when many variables or features are introduced.

Predictor variables have a coefficient—For a linear model this is easy to interpret.


• In our example if you add “1” to x1, it will change y by “0.7658”.
• It’s easy to see how finding a good predictor can be your route.
• Eg: If, for instance, you determine that a certain gene is significant as a cause for
cancer, this is important knowledge, even if that gene in itself doesn’t determine
whether a person will get cancer.
• But how certain are we that a gene really has that impact? This is called significance.

Predictor significance—Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there. This is what the p-value indicates: with the commonly
used cutoff of 0.05, a significant result means there is at most a 5% chance of seeing such a
coefficient if the predictor had no influence at all.
Model diagnostics and model comparison
Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used
to evaluate the model afterward. The principle here is simple: the model should work on
unseen data.

Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and average these squared errors over all predictions.
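
A minimal scikit-learn sketch of a holdout evaluation with mean square error; X and y below are synthetic stand-ins for your own feature matrix and target:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real features and a real target
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=200)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))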

IX. Step 6: Presenting findings and building applications on top of them


• Some stakeholders will want the work repeated over and over again, because they
value the predictions of our models or the insights that we produced.
• For this reason, we need to automate our models.
• This doesn't always mean that we have to redo the whole analysis every time.
• Sometimes it's sufficient to implement only the model scoring; other times we might
build an application that automatically updates reports, Excel spreadsheets, or
PowerPoint presentations.

X. Data Mining:

• Data mining turns a large collection of data into knowledge.


• A search engine (e.g., Google) receives hundreds of millions of queries every day.
• Each query can be viewed as a transaction where the user describes her or his
information need.
• For example, Google’s Flu Trends uses specific search terms as indicators of flu
activity.
• It found a close relationship between the number of people who search for flu-
related information and the number of people who actually have flu symptoms.
• In summary, the abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data-rich but information-poor situation.
• The fast-growing, tremendous amount of data, collected and stored in large and
numerous data repositories, has far exceeded the human ability to comprehend it
without powerful tools.
• As a result, data collected in large data repositories become “data tombs”—data
archives that are seldom visited.
• Unfortunately, however, the manual knowledge input procedure is prone to biases
and errors and is extremely costly and time consuming.
• The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden nuggets” of
knowledge.
• Other terms have a meaning similar to data mining, for example: knowledge mining
from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.

Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in
the process of knowledge discovery. The knowledge discovery process is an iterative
sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)

XI. Data Warehouses
• A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.

Eg: All Electronics

• To facilitate decision making, the data in a data warehouse are organized around
major subjects (e.g., customer, item, supplier, and activity).
• The data are stored to provide information from a historical perspective, such as in
the past 6 to 12 months, and are typically summarized.
• For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item type for each store or,
summarized to a higher level, for each sales region.
• A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count.

• By providing multidimensional data views and the precomputation of summarized
data, data warehouse systems can provide inherent support for online analytical
processing (OLAP).
• Online analytical processing operations make use of background knowledge
regarding the domain of the data being studied to allow the presentation of data at
different levels of abstraction.
• Such operations accommodate different user viewpoints.
• Examples of OLAP operations include drill-down and roll-up, which allow the user
to view the data at differing degrees of summarization.
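
OLAP engines implement these operations natively; as a rough illustration only, a pandas pivot table can mimic a drill-down and roll-up on a hypothetical sales table:

import pandas as pd

sales = pd.DataFrame({"region": ["East", "East", "West"],
                      "store": ["E1", "E2", "W1"],
                      "item_type": ["phone", "phone", "laptop"],
                      "amount": [100, 150, 300]})

# Drill-down view: totals per region, store, and item type
detail = sales.pivot_table(values="amount", index=["region", "store"],
                           columns="item_type", aggfunc="sum", fill_value=0)

# Roll-up: summarize to a higher level, per region only
rollup = sales.groupby("region")["amount"].sum()
print(detail)
print(rollup)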
