Unit 1 Notes
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research
goals – Retrieving data – Data Preparation - Exploratory Data analysis – build the model – presenting
findings and building applications - Data Mining - Data Warehousing – Basic statistical descriptions of
Data.
BIG DATA
Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as relational database
management systems (RDBMS).
Characteristics of Big Data:
The characteristics of big data are often referred to as the three Vs:
• Volume—How much data is there?
• Variety—How diverse are different types of data?
• Velocity—At what speed is new data generated?
Often these characteristics are complemented with a fourth V, veracity: How accurate is the
data? These four properties make big data different from the data found in traditional data management
tools. Consequently, the challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized
techniques to extract the insights.
DATA SCIENCE
As the amount of data continues to grow, the need to leverage it becomes more important.
Data science involves using methods to analyze massive amounts of data and extract the knowledge it
contains. Data science and big data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.
Data science is an evolutionary extension of statistics capable of dealing with the massive
amounts of data produced today. It adds methods from computer science to the repertoire of statistics.
Benefits and uses of data science:
Data science and big data are used almost everywhere in both commercial and non-commercial
settings.
Commercial companies in almost every industry use data science and big data to gain insights into
their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience, as well as to
cross-sell, up-sell, and personalize their offerings.
• A good example of this is Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet.
• MaxPoint (http://maxpoint.com/us) is another example of real-time personalized advertising.
• Human resource professionals use people analytics and text mining to screen candidates,
monitor the mood of employees, and study informal networks among coworkers.
• Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services.
Governmental organizations are also aware of data’s value. Many governmental organizations not
only rely on internal data scientists to discover valuable information, but also share their data with the
public.
• This data is used to gain insights or build data-driven applications.
• Data.gov is the home of the US Government’s open data. A data scientist in a governmental
organization gets to work on diverse projects such as detecting fraud and other criminal activity
or optimizing project funding.
• The American National Security Agency and the British Government Communications
Headquarters use data science and big data to monitor millions of individuals.
• These organizations collected 5 billion data records from widespread applications such as
Google Maps, Angry Birds, email, and text messages, among many other data sources. Then
they applied data science techniques to distill information.
Nongovernmental Organizations (NGOs) use data to raise money and defend their causes.
• The World Wildlife Fund (WWF), for instance, employs data scientists to increase the
effectiveness of their fundraising efforts.
• Many data scientists devote part of their time to helping NGOs, because NGOs often lack the
resources to collect data and employ data scientists.
• DataKind is one such group of data scientists that devotes its time to the benefit of mankind.
Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes. Examples of
MOOCs are Coursera, Udacity and edX.
FACETS OF DATA
In data science and big data, you’ll come across many different types of data, and each of them
tends to require different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured Data
Structured data is data that depends on a data model and resides in a fixed field within a record.
It is easy to store structured data in tables within databases or Excel files, as shown in figure. SQL, or
Structured Query Language, is the preferred way to manage and query data that resides in databases.
There can be structured data that is difficult to store in a traditional relational database.
Hierarchical data such as a family tree is one such example. More often, data comes unstructured.
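As a minimal sketch of querying structured data with SQL from Python, assuming a small, hypothetical customers table (the table name, columns, and values are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO customers (name, country) VALUES (?, ?)",
                 [("Alice", "Brazil"), ("Bob", "India")])

# Structured data in fixed fields can be queried directly with SQL.
for row in conn.execute("SELECT name FROM customers WHERE country = ?", ("Brazil",)):
    print(row[0])
conn.close()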
Unstructured Data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-
specific or varying. One example of unstructured data is the regular email as shown in figure.
Although email contains structured elements such as the sender, title and body text, it is a challenge to
find the number of people who have written an email complaint about a specific employee because so
many ways exist to refer to a person. The thousands of different languages and dialects out there
further complicate this.
Natural Language
Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics. The natural language
processing community has had success in entity recognition, topic recognition, summarization, text
completion, and sentiment analysis, but models trained in one domain don’t generalize well to other
domains. Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text as
it is ambiguous by nature.
Machine-generated Data
Machine-generated data is information that is automatically created by a computer, process,
application, or other machine without human intervention. Machine-generated data is becoming a
major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost &
Sullivan to refer to the integration of complex physical machinery with networked sensors and
software) will be approximately $540 billion in 2020.
IDC (International Data Corporation) has estimated there will be 26 times more connected
things than people in 2020. This network is commonly referred to as the internet of things.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs and telemetry as
shown in figure.
The machine data shown in figure would fit in a classic table-structured database. This isn't
the best approach for highly interconnected or “networked” data, where the relationships between
entities have a valuable role to play.
DEFINING RESEARCH GOALS
The outcome of this phase should be a clear research goal, a good understanding of the context, well-defined
deliverables and a plan of action with a timetable. This information is then best placed in a project
charter. The length and formality differ between projects and companies.
Spend time understanding the goals and context of your research
An essential outcome is the research goal that states the purpose of your assignment in a clear
and focused manner. Understanding the business goals and context is critical for project success.
Create a project charter
Clients like to know upfront what they are paying for, so after getting a good understanding of
the business problem, try to get a formal agreement on the deliverables. All this information is best
collected in a project charter. For any significant project this would be mandatory.
A project charter requires teamwork and the inputs should cover the following:
• A clear research goal
• The project mission and context
• How you are going to perform the analysis
• What resources you expect to use
• Proof that it's an achievable project, or a proof of concept
• Deliverables and a measure of success
• A timeline
The client can use this information to make an estimate of the project costs and the data and people
required for the project to become a success.
RETRIEVING DATA
The next step in data science is to retrieve the required data as shown in figure. Sometimes
there is a need to go into the field and design a data collection process. Many companies will have
already collected and stored the data, and what they don't have can often be bought from third parties.
More and more organizations also make high-quality data freely available for public and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective is acquiring all the data needed. This may be difficult, and even if you succeed, data needs
polishing to be of any use to you.
Start with data stored within the company (Internal data)
The first step is to assess the relevance and quality of the data that is readily available within
the company. Most companies have a program for maintaining key data, so much of the cleaning work
may already be done.
This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals. The primary goal of a database is
data storage, while a data warehouse is designed for reading and analyzing that data.
A data mart is a subset of the data warehouse and geared toward serving a specific business
unit. While data warehouses and data marts are home to preprocessed data, data lakes contain data in
its natural or raw format. Getting access to data is another difficult task.
Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more. These policies translate into physical and
digital barriers called Chinese walls. These “walls” are mandatory and well-regulated for customer
data in most countries. Getting access to the data may take time and involve company politics.
Don’t be afraid to shop around (External Data)
If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. Other companies provide data so that you, in
turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
Although certain companies consider data a highly valuable asset, more and more
governments and organizations share their data for free with the world. This data can be of excellent
quality; how good it is depends on the institution that creates and manages it.
The information they share covers a broad range of topics such as the number of accidents or
amount of drug abuse in a certain region and its demographics. This data is helpful when you want to
enrich proprietary data but also convenient when training your data science skills.
Table shows only a small selection from the growing number of open-data providers.
Do data quality checks to prevent problems
A good portion of the project time should be spent on doing data correction and cleansing. The
retrieval of data is the first time we’ll inspect the data in the data science process. Most of the errors
encountered during the data gathering phase are easy to spot.
The data is investigated during the import, data preparation and exploratory phases. The
difference is in the goal and the depth of the investigation. During data retrieval, check to see if the
data is equal to the data in the source document and to see if we have the right data types.
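As a minimal sketch of such a first check in Python with pandas, assuming a hypothetical source file customers.csv:

import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical file retrieved from a source system
print(df.dtypes)                    # do the data types match the source document?
print(df.head())                    # eyeball a few rows against the source
print(df.shape)                     # does the number of rows match what was expected?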
DATA PREPARATION (Cleansing, Integrating and transforming data)
The data received from the data retrieval phase is likely to be "a diamond in the rough," so the
task is to prepare it for use in the modeling and reporting phase. The model needs the data in a specific
format, so data transformation will always come into play.
It is good to correct data errors as early on in the process as possible. Figure shows the most
common actions to take during the data cleansing, integration, and transformation phase.
Cleansing Data:
Data cleansing is a subprocess of the data science process that focuses on removing errors in
the data so that the data becomes a true and consistent representation of the processes it originates
from. There are at least two types of errors.
The first type is the interpretation error, such as when you take the value in your data for
granted, like saying that a person’s age is greater than 300 years. The second type of error points to
inconsistencies between data sources or against your company’s standardized values. The table shows
an overview of the types of errors that can be detected with easy checks.
General solution: Try to fix the problem early in the data acquisition chain or else fix it in the
program. Sometimes there is a need to use more advanced methods, such as simple modeling, to find
and identify data errors; diagnostic plots can be especially insightful.
Data entry errors
Data collection and data entry are error-prone processes. They often require human
intervention, as humans make typos or lose their concentration for a second and introduce an error into
the chain. But data collected by machines or computers isn’t free from errors either.
Errors can arise from human sloppiness, whereas others are due to machine or hardware
failure. Examples of errors originating from machines are transmission errors or bugs in the extract,
transform, and load phase (ETL). For small data sets we can check every value by hand.
Detecting data errors when the variables we study don’t have many classes can be done by
tabulating the data with counts. When we have a variable that can take only two values: “Good” and
“Bad”, we can create a frequency table and see if those are truly the only two values present.
In the table, the values "Godo" and "Bade" point out that something went wrong in at least 16 cases
(Table: Detecting outliers on simple variables with a frequency table). Most errors of this type are
easy to fix with simple assignment statements and if-then-else rules:
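The sketch below applies such rules to the "Godo"/"Bade" typos from the frequency-table example; the list of values is hypothetical.

values = ["Good", "Godo", "Bad", "Bade", "Good"]   # hypothetical observed values

cleaned = []
for v in values:
    if v == "Godo":
        cleaned.append("Good")     # fix the known typo
    elif v == "Bade":
        cleaned.append("Bad")
    else:
        cleaned.append(v)

print(cleaned)                     # ['Good', 'Good', 'Bad', 'Bad', 'Good']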
Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
Fixing redundant whitespaces can be done by most programming languages. They all provide string
functions that will remove the leading and trailing whitespaces. For instance, in Python you can use the
strip() function to remove leading and trailing spaces.
Fixing capital letter mismatches
Capital letter mismatches are common. Most programming languages make a distinction
between “Brazil” and “brazil”. In this case we can solve the problem by applying a function that
returns both strings in lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower()
should result in True.
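A minimal sketch in Python combining the two string fixes described above, on a hypothetical list of country names:

countries = ["  Brazil ", "brazil", "BRAZIL\t"]    # hypothetical raw values

# strip() removes leading and trailing whitespace; lower() normalizes capitalization.
cleaned = [c.strip().lower() for c in countries]
print(cleaned)                                     # ['brazil', 'brazil', 'brazil']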
Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here we check the value against physically or
theoretically impossible values such as people taller than 3 meters or someone with an age of 299
years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
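For example, a minimal sketch applying this rule to a hypothetical list of ages, separating valid values from suspect ones:

ages = [25, 299, 42, -3, 118]                      # hypothetical observations

valid = [a for a in ages if 0 <= a <= 120]         # pass the sanity check
suspect = [a for a in ages if not (0 <= a <= 120)] # physically impossible values

print(valid)     # [25, 42, 118]
print(suspect)   # [299, -3]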
Outliers
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers
on the upper side when a normal distribution is expected. The normal distribution or Gaussian
distribution is the most common distribution.
Distribution plots are helpful in detecting outliers and in understanding the variable. It
shows most cases occurring around the average of the distribution and the occurrences decrease when
further away from it. The high values in the bottom graph can point to outliers when assuming a
normal distribution.
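A minimal sketch of a simple numeric outlier check, assuming a roughly normal variable; the observations and the two-standard-deviation cut-off are only illustrative:

import statistics

x = [4.9, 5.1, 5.0, 4.8, 5.2, 5.1, 9.7]   # hypothetical observations

print(min(x), max(x))                     # minimum and maximum as a first check
mean = statistics.mean(x)
sd = statistics.stdev(x)

# Flag values more than two standard deviations from the mean as possible outliers.
outliers = [v for v in x if abs(v - mean) > 2 * sd]
print(outliers)                           # [9.7]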
Dealing with missing values
Missing values aren’t necessarily wrong, but we still need to handle them separately. Certain
modeling techniques can’t handle missing values. They might be an indicator that something went
wrong in data collection or that an error happened in the ETL process. Common techniques used are
listed in the table. Which technique to use and when depends on the particular case.
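As a minimal sketch of two of the common techniques (omitting rows with missing values and imputing the column mean) with pandas; the small DataFrame is hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 42], "income": [3000, 4200, np.nan]})

dropped = df.dropna()                              # omit observations with missing values
imputed = df.fillna(df.mean(numeric_only=True))    # replace missing values with the column mean

print(dropped)
print(imputed)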
Deviations from a code book
Detecting errors in larger data sets against a code book or against standardized values can be
done with the help of set operations. A code book is a description of data, a form of metadata.
It contains things such as the number of variables per observation, the number of observations,
and what each encoding within a variable means. For instance, "0" equals "negative" and "5" stands for
"very positive". A code book also tells the type of data we are looking at: is it hierarchical, graph, or
something else?
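A minimal sketch of such a check with Python set operations; the code book and observed values are hypothetical:

codebook = {"0", "1", "2", "3", "4", "5"}   # allowed encodings from the code book
observed = {"0", "2", "5", "9", "n/a"}      # codes actually found in the data

deviations = observed - codebook            # set difference: codes not in the code book
print(deviations)                           # {'9', 'n/a'}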
Combining data from different data sources
The data comes from several different places and we need to integrate these different sources.
Data varies in size, type and structure, ranging from databases and Excel files to text documents.
The different ways of combining the data
There are two operations to combine information from different data sets.
• The first operation is joining: enriching an observation from one table with information from
another table.
• The second operation is appending or stacking: adding the observations of one table to those of
another table.
When the data is combined, we have the option to create a new physical table or a virtual table
by creating a view. The advantage of a view is that it doesn’t consume more disk space.
Joining Tables
Joining tables allows us to combine the information of one observation found in one table with
the information that is found in another table. The focus is on enriching a single observation.
Let’s say that the first table contains information about the purchases of a customer and the
other table contains information about the region where your customer lives. Joining the tables allows
us to combine the information so that we can use it for a model, as shown in figure.
To join tables, we use variables that represent the same object in both tables, such as a date, a
country name, or a Social Security number. These common fields are known as keys. When these keys
also uniquely define the records in the table they are called primary keys.
One table may have buying behaviour and the other table may have demographic information
on a person. In figure both tables contain the client’s name, and this makes it easy to enrich the client
expenditures with the region of the client.
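A minimal sketch of such a join with pandas, using the client name as the key; both small tables are hypothetical:

import pandas as pd

purchases = pd.DataFrame({"client": ["Alice", "Bob"], "spent": [120, 80]})
regions = pd.DataFrame({"client": ["Alice", "Bob"], "region": ["North", "South"]})

# Enrich each purchase observation with the client's region.
enriched = purchases.merge(regions, on="client", how="left")
print(enriched)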
Appending Tables
Appending or stacking tables is effectively adding observations from one table to another table.
Figure shows an example of appending tables. One table contains the observations from the month
January and the second table contains observations from the month February. The result of appending
these tables is a larger one with the observations from January as well as February.
Appending data from tables is a common operation but requires an equal structure in the tables
being appended.
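A minimal sketch of appending two hypothetical monthly tables with the same structure using pandas:

import pandas as pd

january = pd.DataFrame({"client": ["Alice"], "spent": [120]})
february = pd.DataFrame({"client": ["Bob"], "spent": [80]})

# Stack the February observations under the January ones.
year_so_far = pd.concat([january, february], ignore_index=True)
print(year_so_far)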
Using views to simulate data joins and appends
To avoid duplicating data, we can combine data virtually with views. The problem with physically
duplicating data is that more storage space is needed. For small data sets that may not cause problems,
but if every table consists of terabytes of data, duplication becomes problematic. For this
reason, the concept of a view was invented.
A view behaves as if we are working on a table, but this table is nothing but a virtual layer that
combines the tables. Figure shows how the sales data from the different months is combined virtually
into a yearly sales table instead of duplicating the data.
Views do come with a drawback, however. While a table join is only performed once, the join
that creates the view is recreated every time it’s queried, using more processing power than a pre-
calculated table would have.
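A minimal sketch of simulating the append with a view instead of a physical table, using SQL through Python's sqlite3 module; the monthly sales tables and their columns are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_jan (client TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_feb (client TEXT, amount REAL)")
conn.execute("INSERT INTO sales_jan VALUES ('Alice', 120)")
conn.execute("INSERT INTO sales_feb VALUES ('Bob', 80)")

# The view combines both tables virtually; no extra disk space is used,
# but the UNION is re-evaluated every time the view is queried.
conn.execute("""CREATE VIEW sales_year AS
                SELECT * FROM sales_jan
                UNION ALL
                SELECT * FROM sales_feb""")
print(conn.execute("SELECT COUNT(*) FROM sales_year").fetchone()[0])   # 2
conn.close()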
EXPLORATORY DATA ANALYSIS
A Pareto diagram is a combination of the values and a cumulative distribution. It's easy to see
from this diagram that the first 50% of the countries contain slightly less than 80% of the total amount.
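A minimal sketch of drawing a Pareto diagram with matplotlib; the per-country counts are made up for illustration:

import matplotlib.pyplot as plt

counts = sorted([50, 30, 10, 5, 3, 2], reverse=True)            # values, largest first
total = sum(counts)
cumulative = [sum(counts[:i + 1]) / total for i in range(len(counts))]

fig, ax1 = plt.subplots()
ax1.bar(range(len(counts)), counts)                             # the individual values
ax2 = ax1.twinx()
ax2.plot(range(len(counts)), cumulative, marker="o", color="black")  # cumulative share
ax2.set_ylim(0, 1.05)
plt.show()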
Figure shows another technique: brushing and linking. With brushing and linking we combine
and link different graphs and tables (or views) so changes in one graph are automatically transferred to
the other graphs. This interactive exploration of data facilitates the discovery of new insights.
Figure shows the average score per country for questions. Not only does this indicate a high
correlation between the answers, but it’s easy to see that when we select several points on a subplot,
the points will correspond to similar points on the other graphs. In this case the selected points on the
left graph correspond to points on the middle and right graphs, although they correspond better in the
middle and right graphs.
Two other important graphs are the histogram shown in figure and the boxplot shown in figure.
In a histogram a variable is cut into discrete categories, and the number of occurrences in each category
is summed up and shown in the graph. The boxplot, on the other hand, doesn't show how many
observations are present but does offer an impression of the distribution within categories. It can show
the maximum, minimum, median, and other characterizing measures at the same time.
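A minimal sketch of both plot types with matplotlib, drawn on randomly generated (hypothetical) data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=1000)   # hypothetical observations

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=20)   # histogram: counts per discrete category (bin)
ax2.boxplot(data)         # boxplot: median, quartiles, and extreme values
plt.show()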
These techniques are mainly visual, but in practice they’re certainly not limited to visualization
techniques. Tabulation, clustering, and other modeling techniques can also be a part of exploratory
analysis. Even building simple models can be a part of this step.
BUILD THE MODELS
With clean data in place and a good understanding of the content, we are ready to build models
with the goal of making better predictions, classifying objects, or gaining an understanding of the
system that we are modeling. This phase is much more focused than the exploratory analysis step,
because we know what we are looking for and what we want the outcome to be. Figure shows the
components of model building.
Building a model is an iterative process. Most of the models consist of the following main
steps:
• Selection of a modeling technique and variables to enter in the model
• Execution of the model
• Diagnosis and model comparison
Model and variable selection
We need to select the variables we want to include in our model and a modeling technique.
Many modeling techniques are available, and we need to choose the right one for the problem. The
model's performance and the requirements for using it have to be considered. Other factors are:
• Must the model be moved to a production environment and, if so, would it be easy to
implement?
• How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
• Does the model need to be easy to explain?
Model execution
Once we have chosen a model, we will need to implement it in code. The following listing
shows the execution of a linear prediction model.
Executing a linear prediction model on semi-random data
Linear regression tries to fit a line while minimizing the distance to each point
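The listing itself is not reproduced here, so the following is a minimal sketch consistent with the description, assuming the numpy and statsmodels libraries: two semi-random predictors, a target built from them plus noise, and an ordinary least squares fit.

import numpy as np
import statsmodels.api as sm

predictors = np.random.random(1000).reshape(500, 2)                     # x1 and x2
target = predictors.dot(np.array([0.4, 0.6])) + np.random.random(500)  # linear relation plus randomness

model = sm.OLS(target, predictors)   # ordinary least squares: fit a line minimizing the distance to each point
results = model.fit()
print(results.summary())             # R-squared, p-values, and coefficients; values differ from run to run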
We created predictor values that are meant to predict how the target variables behave. For a
linear regression, a “linear relation” between each x (predictor) and the y (target) variable is assumed,
as shown in figure. We created the target variable based on the predictors by adding a bit of
randomness, which gives us a well-fitting model. The results.summary() call outputs the table shown in
figure. The exact outcome depends on the random values we got.
R-squared and Adj. R-squared indicate model fit: higher is better, but too high is suspicious.
The p-value shows whether a predictor variable has a significant influence on the target. Lower is
better, and < 0.05 is often considered "significant."
Linear equation coefficients: y = 0.7658 x1 + 1.1252 x2.
PRESENTING FINDINGS AND BUILDING APPLICATIONS
After analysing the data and building a well-performing model, we have to present the findings
as shown in figure. The predictions of the models or the insights produced are of great value. For this
reason, we need to automate the models. We can also build an application that automatically updates
reports, Excel spreadsheets, or PowerPoint presentations.
DATA MINING
Why data mining?
We live in a world where vast amounts of data are collected daily. Analyzing such data is an important
need. Data mining provides tools to discover knowledge from data. It can be viewed as a result of the
natural evolution of information technology. Data mining turns a large collection of data into
knowledge.
• Terabytes or petabytes of data pour into computer networks, the World Wide Web (WWW),
and various data storage devices every day from business, society, science and engineering,
medicine, and almost every other aspect of daily life. This growth of available data volume is a
result of the computerization of our society and the fast development of powerful data
collection and storage tools.
• Businesses worldwide generate gigantic data sets, including sales transactions, stock trading
records, product descriptions, sales promotions, company profiles and performance, and
customer feedback. For example, large stores, such as Wal-Mart, handle hundreds of millions
of transactions per week at thousands of branches around the world.
• Scientific and engineering practices generate high orders of petabytes of data in a continuous
manner, from remote sensing, process measuring, scientific experiments, system performance,
engineering observations, and environment surveillance. Global backbone telecommunication
networks carry tens of petabytes of data traffic every day.
• The medical and health industry generates tremendous amounts of data from medical records,
patient monitoring, and medical imaging.
• Billions of Web searches supported by search engines process tens of petabytes of data daily.
Communities and social media have become increasingly important data sources, producing
digital pictures and videos, blogs, Web communities and various kinds of social networks.
• The list of sources that generate huge amounts of data is endless. Powerful and versatile tools
are needed to automatically uncover valuable information from the tremendous amounts of data
and to transform such data into organized knowledge. This necessity has led to the birth of data
mining.
What is data mining?
Data mining is searching for knowledge (interesting patterns) in data. Data mining is an essential step
in the process of knowledge discovery. The knowledge discovery process is shown in Figure as an
iterative sequence of the following steps:
• Data cleaning: To remove noise and inconsistent data
• Data integration: Where multiple data sources may be combined
• Data selection: Where data relevant to the analysis task are retrieved from the database
• Data transformation: Where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations.
• Data mining: An essential process where intelligent methods are applied to extract data
patterns.
• Pattern evaluation: To identify the interesting patterns representing knowledge based on
interestingness measures.
• Knowledge presentation: Where visualization and knowledge representation techniques are
used to present mined knowledge to users.
Steps 1 through 4 are different forms of data pre-processing, where data is prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, web, other information
repositories or data that are streamed into the system dynamically.
What kinds of data can be mined?
The most basic forms of data for mining applications are database data, data warehouse data
and transactional data. Data mining can also be applied to other forms of data such as data streams,
ordered/sequence data, graph or networked data, spatial data, text data, multimedia data and the
WWW.
Database data
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data known as a database and a set of software programs to manage and
access the data. The software programs provide mechanisms for defining database structures and data
storage, for specifying and managing concurrent, shared, or distributed data access and for ensuring
consistency and security of the information stored despite system crashes or attempts at unauthorized
access.
Relational databases are one of the most commonly available and richest information
repositories and thus they are a major data form in the study of data mining.
Data warehouses
A data warehouse is a repository of information collected from multiple sources stored under a
unified schema and usually residing at a single site. Data warehouses are constructed via a process of
data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
For example, rather than storing the details of each sales transaction, the data warehouse may
store a summary of the transactions per item type for each store or summarized to a higher level for
each sales region.
A data warehouse is usually modeled by a multidimensional data structure, called a data cube,
in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as count or sum (sales amount). A data cube provides
a multidimensional view of data and allows the precomputation and fast access of summarized data.
Transactional Data
Each record in a transactional database captures a transaction such as a customer’s purchase, a
flight booking or a user’s clicks on a webpage. A transaction typically includes a unique transaction
identity number (trans ID) and a list of the items making up the transaction, such as the items
purchased in the transaction. A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information about the salesperson or
the branch, and so on.
Other kinds of data
There are many other kinds of data that have versatile forms and structures and rather different
semantic meanings. Such kinds of data can be seen in many applications:
• Time-related or sequence data, e.g., historical records, stock exchange data, and time-series and
biological sequence data.
• Data streams, e.g., video surveillance and sensor data, which are continuously transmitted.
• Spatial data, e.g., maps.
• Engineering design data, e.g., the design of buildings, system components, or integrated circuits.
• Hypertext and multimedia data, including text, image, video, and audio data.
• Graph and networked data, e.g., social and information networks.
• The Web, which is a huge, widely distributed information repository made available by the
Internet.
These applications bring about new challenges, such as how to handle data carrying special
structures (sequences, trees, graphs, and networks) and specific semantics (ordering, image, audio and
video contents, and connectivity), and how to mine patterns that carry rich structures and semantics.
What kinds of patterns can be mined?
• Class/Concept Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations, and Correlations
• Classification and Regression for Predictive Analysis
• Cluster Analysis
• Outlier Analysis
Technologies used in data mining
Statistics
Statistics studies the collection, analysis, interpretation or explanation and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of mathematical
functions that describe the behaviour of the objects in a target class in terms of random variables and
their associated probability distributions. Statistical models are widely used to model data and data
classes. For example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. Statistical methods can also be used to verify data mining results.
Machine Learning
For classification and clustering tasks, machine learning research often focuses on the accuracy
of the model. In addition to accuracy, data mining research places strong emphasis on the efficiency
and scalability of mining methods on large data sets, as well as on ways to handle complex types of
data and explore new, alternative methods.
Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Many data mining tasks need to handle large data sets or even real-time,
fast streaming data. Therefore, data mining can make good use of scalable database technologies to
achieve high efficiency and scalability on large datasets. Data mining tasks can be used to extend the
capability of existing database systems to satisfy advanced users' sophisticated data analysis
requirements.
A data warehouse integrates data originating from multiple sources and various timeframes. It
consolidates data in multidimensional space to form partially materialized data cubes. The data cube
model not only facilitates OLAP in multidimensional databases but also promotes multidimensional
data mining.
Information Retrieval
Information retrieval (IR) is the science of searching for documents or information in
documents. Documents can be text or multimedia, and may reside on the Web. A text document which
may involve one or multiple topics can be regarded as a mixture of multiple topic models. By
integrating information retrieval models and data mining techniques, we can find the major topics in a
collection of documents and for each document in the collection, the major topics involved.
Which kinds of applications are targeted?
Business Intelligence
Business intelligence (BI) technologies provide historical, current, and predictive views of
business operations. Examples include reporting, online analytical processing, business performance
management, competitive intelligence, benchmarking, and predictive analytics.
Data mining is the core of business intelligence. Online analytical processing tools in business
intelligence rely on data warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in business
intelligence, for which there are many applications in analyzing markets, supplies, and sales.
Web search engines
A Web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list sometimes called hits. The hits may
consist of web pages, images, and other types of files. Some search engines also search and return data
available in public databases or open directories.
Web search engines are essentially very large data mining applications. Various data mining
techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding which pages
should be crawled and the crawling frequencies), to indexing (e.g., selecting pages to be indexed and
deciding to which extent the index should be constructed), to searching (e.g., deciding how pages
should be ranked, which advertisements should be added, and how the search results can be
personalized or made "context aware").
Search engines pose grand challenges to data mining. First, they have to handle a huge and
ever-growing amount of data. Second, Web search engines often have to deal with online data.
Another challenge is maintaining and incrementally updating a model on fast growing data streams.
Third, Web search engines often have to deal with queries that are asked only a very small number of
times.
Major issues in data mining
Major issues in data mining research are categorized into five groups: mining methodology,
user interaction, efficiency and scalability, diversity of data types, and data mining and society.
Mining Methodology:
Mining various and new kinds of knowledge: Data mining covers a wide spectrum of data
analysis and knowledge discovery tasks, from data characterization and discrimination to association
and correlation analysis, classification, regression, clustering, outlier analysis, sequence analysis, and
trend and evolution analysis. These tasks may use the same database in different ways and require the
development of numerous data mining techniques.
Mining knowledge in multidimensional space: When searching for knowledge in large data
sets, we can explore the data in multidimensional space. That is, we can search for interesting patterns
among combinations of dimensions (attributes) at varying levels of abstraction. Such mining is known
as (exploratory) multidimensional data mining. In many cases, data can be aggregated or viewed as a
multidimensional data cube. Mining knowledge in cube space can substantially enhance the power and
flexibility of data mining.
Data mining as an interdisciplinary effort: The power of data mining can be substantially
enhanced by integrating new methods from multiple disciplines.
Boosting the power of discovery in a networked environment: Most data objects reside in a
linked or interconnected environment, whether it be the Web, database relations, files, or documents.
Semantic links across multiple data objects can be used to advantage in data mining.
Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors,
exceptions, or uncertainty, or are incomplete. Errors and noise may confuse the data mining process,
leading to the derivation of erroneous patterns.
User Interaction:
The user plays an important role in the data mining process.
Interactive mining: The data mining process should be highly interactive. Thus, it is important
to build flexible user interfaces and an exploratory mining environment, facilitating the user's
interaction with the system.
Incorporation of background knowledge: Background knowledge, constraints, rules, and other
information regarding the domain under study should be incorporated into the knowledge discovery
process.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played
an important role in flexible searching because they allow users to pose ad hoc queries. Similarly,
high-level data mining query languages or other high-level flexible user interfaces will give users the
freedom to define ad hoc data mining tasks.
Presentation and visualization of data mining results: The data mining system should present
data mining results vividly and flexibly, so that the discovered knowledge can be easily understood
and directly usable by humans. This is especially crucial if the data mining process is interactive. It
requires the system to adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.
Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms. As
data amounts continue to multiply, these two factors are especially critical.
Efficiency and scalability of data mining algorithms: Data mining algorithms must be efficient
and scalable in order to effectively extract information from huge amounts of data in many data
repositories or in dynamic data streams.
Parallel, distributed and incremental mining algorithms: The enormous size of many data sets,
the wide distribution of data and the computational complexity of some data mining methods are
factors that motivate the development of parallel and distributed data-intensive mining algorithms.
Cloud computing and cluster computing, which use computers in a distributed and
collaborative way to tackle very large-scale computational tasks, are also active research themes in
parallel data mining.
Diversity of database types
The wide diversity of database types brings about challenges to data mining.
These include the following.
Handling complex types of data: Diverse applications generate a wide spectrum
of new data types, from structured data such as relational and data warehouse data to semi-structured
and unstructured data, from stable data repositories to dynamic data streams, from simple data objects
to temporal data, biological sequences, sensor data, spatial data, hypertext data, multimedia data,
software program code, Web data and social network data.
Mining dynamic, networked, and global data repositories: Multiple sources of data are
connected by the internet and various kinds of networks, forming gigantic, distributed, and
heterogeneous global information systems and networks. The discovery of knowledge from different
sources of structured, semi-structured, or unstructured yet interconnected data with diverse data
semantics poses great challenges to data mining.
Data mining and society
Social impacts of data mining: The improper disclosure or use of data and the potential
violation of individual privacy and data protection rights are areas of concern that need to be
addressed.
Privacy-preserving data mining: Data mining will help scientific discovery, business
management, economy recovery and security protection (e.g., the real-time discovery of intruders and
cyber-attacks).
Invisible data mining: More and more systems should have data mining functions built within
so that people can perform data mining or use data mining results simply by mouse clicking, without
any knowledge of data mining algorithms. Intelligent search engines and Internet-based stores perform
such invisible data mining by incorporating data mining into their components to improve their
functionality and performance.
BASIC STATISTICAL DESCRIPTIONS OF DATA
Basic statistical descriptions can be used to identify properties of the data and highlight which
data values should be treated as noise or outliers.
Measuring the Central Tendency: Mean, Median, and Mode
The attribute X, like salary, is recorded for a set of objects. Let x1,x2,...,xN be the set of N
observed values or observations for X. Here, these values may also be referred to as the data set (for
X). If we were to plot the observations for salary, where would most of the values fall? This gives us
an idea of the central tendency of the data.
Measures of central tendency include the mean, median, mode, and midrange. The most
common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. Let
x1,x2,...,xN be a set of N values or observations, such as for some numeric attribute X, like salary. The
mean of this set of values is
x̄ = (x1 + x2 + ··· + xN) / N
This corresponds to the built-in aggregate function, average (avg() in SQL), provided in relational
database systems.
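A minimal sketch computing the mean in Python for a hypothetical list of salary observations, mirroring SQL's avg():

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # hypothetical observations
mean = sum(salaries) / len(salaries)                            # arithmetic mean
print(mean)                                                     # 58.0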
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation and
Interquartile Range
The measures such as range, quantiles, quartiles, percentiles, and the interquartile range are
used to assess the dispersion or spread of numeric data. The five-number summary which can be
displayed as a boxplot is useful in identifying outliers. Variance and standard deviation also indicate
the spread of a data distribution.
Range, Quartiles and Interquartile Range
Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The range of the set is
the difference between the largest (max()) and smallest (min()) values. Quantiles are points taken at
regular intervals of a data distribution, dividing it into essentially equal size consecutive sets.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It
corresponds to the median. The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution. They are more commonly
referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles, they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the most
widely used forms of quantiles.
The quantiles plotted are quartiles. The three quartiles divide the distribution into four equal-
size consecutive subsets. The second quartile corresponds to the median. The quartiles give an
indication of a distribution’s center, spread and shape.
The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.
The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest 75% (or highest 25%) of
the data. The second quartile is the 50th percentile. As the median, it gives the center of the data
distribution.
The distance between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the interquartile range (IQR) and
is defined as
IQR = Q3 − Q1
Variance and Standard Deviation
The variance of N observations x1, x2, ..., xN for a numeric attribute X is
σ² = (1/N) Σ (xi − x̄)²
where x̄ is the mean value of the observations, as defined earlier. The standard deviation, σ, of the
observations is the square root of the variance, σ².
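A minimal sketch computing these dispersion measures with numpy on the same hypothetical salary observations used above:

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])   # hypothetical observations

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                  # interquartile range: Q3 - Q1
variance = x.var()             # sigma squared, using 1/N as in the formula above
std_dev = x.std()              # sigma: the square root of the variance

print(q1, median, q3, iqr, variance, std_dev)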
Graphic Displays of Basic Statistical Descriptions of Data
The graphic displays of basic statistical descriptions include quantile plots, quantile–quantile plots,
histograms, and scatter plots. These graphs are helpful for the visual inspection of data, which is
useful for data preprocessing. The first three of these show
univariate distributions (i.e., data for one attribute), while scatter plots show bivariate distributions
(i.e., involving two attributes).