Data Science Module 1 Notes
KDD (Knowledge Discovery in Databases) is a computer science field specializing in extracting previously unknown and interesting information from raw data. KDD is the whole process of making sense of data by developing appropriate methods and techniques. The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Handling missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured variable.
• Cleaning with data discrepancy detection and data transformation tools.
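The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine; the records, the mean-fill rule, and the duplicate key are all invented for the example.

```python
# A minimal data-cleaning sketch: fill missing values with the column mean
# and drop exact duplicate records. The data below is made up.

def clean(records):
    """Fill missing ages with the mean age and drop duplicate records."""
    ages = [r["age"] for r in records if r["age"] is not None]
    mean_age = sum(ages) / len(ages)
    seen, cleaned = set(), []
    for r in records:
        row = dict(r)
        if row["age"] is None:          # handle a missing value
            row["age"] = mean_age
        key = (row["name"], row["age"])
        if key not in seen:             # drop an exact duplicate
            seen.add(key)
            cleaned.append(row)
    return cleaned

data = [
    {"name": "Ana", "age": 30},
    {"name": "Ben", "age": None},   # missing value
    {"name": "Ana", "age": 30},     # duplicate
]
print(clean(data))
```

Real pipelines would use a library such as pandas for the same operations, but the logic (impute, then deduplicate) is the same.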
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse). Data integration uses data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. For this we can use neural networks, decision trees, Naive Bayes, clustering, and regression methods.
Data Transformation
Data Transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
• Data mapping: assigning elements from the source to the destination to capture transformations.
• Code generation: creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures. It finds an interestingness score for each pattern, and uses summarization and visualization to make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make decisions.
Difference between DBMS and Data Mining
Main Difference: DBMS is the infrastructure for storing and managing data, while data mining is a process of analyzing and extracting knowledge from the data stored in the DBMS.
Scope: DBMS focuses on efficiently managing and storing data, ensuring data integrity and
security. Data mining, on the other hand, focuses on analyzing data to discover meaningful
patterns and insights.
Purpose: DBMS is used for data storage, retrieval, and management. Data mining is used for
knowledge discovery and gaining insights from the data.
Functionality: DBMS provides functionalities for data storage, retrieval, and manipulation. Data
mining employs algorithms and statistical techniques to identify patterns and relationships within
the data.
Role: DBMS serves as the foundation for data storage and retrieval, enabling efficient data
handling. Data mining is a process that builds on top of the data stored in the DBMS to extract
valuable information.
What is OLAP?
OLAP stands for Online Analytical Processing. It is a computing method that allows users to
extract useful information and query data in order to analyze it from different angles. For
example, OLAP business intelligence queries usually aid in financial reporting, budgeting, sales forecasting, trend analysis, and other purposes. It enables the user to analyze database
information from different database systems simultaneously. OLAP data is stored in
multidimensional databases.
OLAP and data mining look similar since they operate on data to gain knowledge, but the major
difference is how they operate on data. OLAP tools provide multidimensional data analysis and a
summary of the data.
Data Mining Techniques
Data mining techniques are algorithms and methods used to extract information and insights from data sets.
1. Regression
Regression is a data mining technique that is used to model the relationship between a
dependent variable and one or more independent variables. In regression analysis, the goal is to
fit a mathematical model to the data that can be used to make predictions or forecasts about the
dependent variable based on the values of the independent variables.
There are many different types of regression models, including linear regression, logistic
regression, and non-linear regression. In general, regression models are used to answer questions
such as:
• What is the relationship between the dependent and independent variables?
• How well does the model fit the data?
• How accurate are the predictions or forecasts made by the model?
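As a concrete illustration, a simple linear regression with one independent variable can be fitted by ordinary least squares in a few lines. The data points below are invented and lie exactly on y = 2x + 1, so the fitted coefficients are known in advance.

```python
# A minimal sketch of simple linear regression fitted by ordinary
# least squares (one independent variable, toy data).

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                 # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)           # 2.0 1.0
```

Logistic and non-linear regression follow the same idea (fit parameters minimizing an error measure) but use different model forms and fitting procedures.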
2. Classification
Classification is a data mining technique that is used to predict the class or category of an item or
instance based on its characteristics or attributes. There are many different types of classification
models, including decision trees, k-nearest neighbours, and support vector machines. In general,
classification models are used to answer questions such as:
• What is the relationship between the classes and the attributes?
• How well does the model fit the data?
• How accurate are the predictions made by the model?
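One of the classifiers named above, k-nearest neighbours, is simple enough to sketch directly. The training points and labels below are invented for illustration.

```python
# A minimal sketch of k-nearest-neighbour classification (k = 3) on
# toy two-dimensional points with made-up labels.

def knn_predict(train, query, k=3):
    """Predict the majority label among the k nearest training points."""
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)   # majority vote

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (0.5, 0.5)))   # A
print(knn_predict(train, (5.5, 5.5)))   # B
```

Decision trees and support vector machines answer the same question (which class does this instance belong to?) with very different model structures.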
3. Clustering
Clustering is a data mining technique that is used to group items or instances in a data set into
clusters or groups based on their similarity or proximity. In clustering analysis, the goal is to
identify and explore the natural structure or organization of the data, and to uncover hidden
patterns and relationships.
There are many different types of clustering algorithms, including k-means clustering, hierarchical
clustering, and density-based clustering. In general, clustering is used to answer questions such
as:
• What is the natural structure or organization of the data?
• What are the main clusters or groups in the data?
• How similar or dissimilar are the items in the data set?
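The k-means algorithm mentioned above can be sketched for one-dimensional points. The data, the initial centers, and the fixed iteration count are all chosen for illustration; real implementations iterate until convergence.

```python
# A minimal sketch of k-means clustering on one-dimensional points
# (invented data), run for a fixed number of iterations.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:                       # assignment step
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # two obvious groups
print(sorted(kmeans(points, [0.0, 10.0])))
```

The two returned centers settle near 1.0 and 9.0, matching the two natural groups in the data.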
4. Association Rule Mining
Association rule mining is a data mining technique that is used to discover relationships and co-occurrence patterns between items in a data set, such as products frequently purchased together. There are many different algorithms and methods for association rule mining, including the Apriori algorithm and the FP-growth algorithm. In general, association rule mining is used to answer questions such as:
• What are the main patterns and rules in the data?
• How strong and significant are these patterns and rules?
• What are the implications of these patterns and rules for the data set and the domain?
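The two core measures behind algorithms like Apriori, support and confidence, can be computed directly. The shopping transactions below are invented; a full Apriori implementation would add candidate generation and pruning on top of these measures.

```python
# A minimal sketch of association-rule measures (support and confidence)
# on made-up shopping transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}
print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # 2/3
```

A rule is kept only if both its support and its confidence exceed user-chosen thresholds; that is what "strong and significant" means in the questions above.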
5. Dimensionality Reduction
Dimensionality reduction is a data mining technique that is used to reduce the number of
dimensions or features in a data set while retaining as much information and structure as
possible. There are many different methods for dimensionality reduction, including principal
component analysis (PCA), independent component analysis (ICA), and singular value
decomposition (SVD). In general, dimensionality reduction is used to answer questions such as:
• What are the main dimensions or features in the data set?
• How much information and structure can be retained in a lower-dimensional space?
• How can the data be visualized and analyzed in a lower-dimensional space?
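For two-dimensional data, the first principal component of PCA has a closed form, which makes a self-contained sketch possible. The points below are invented and perfectly correlated; PCA in higher dimensions uses eigen-decomposition of the covariance matrix instead of this two-dimensional shortcut.

```python
import math

# A minimal dimensionality-reduction sketch: project 2-D points onto
# their first principal axis (closed-form PCA, valid only for 2-D).

def first_pc_projection(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # principal-axis angle
    u = (math.cos(theta), math.sin(theta))          # unit direction
    return [(x - mx) * u[0] + (y - my) * u[1] for x, y in points]

points = [(1, 1), (2, 2), (3, 3), (4, 4)]   # perfectly correlated
print(first_pc_projection(points))
```

Because the points lie exactly on a line, the single projected coordinate retains all of the data's variance: two dimensions reduced to one with no loss of structure.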
6. Anomaly Detection: Anomaly detection identifies outliers or anomalies in data that deviate from normal patterns. It is used for detecting fraud, network intrusions, and equipment failures. Techniques include statistical methods, clustering-based approaches, and machine learning algorithms such as isolation forests and one-class SVM (Support Vector Machine).
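The simplest of the statistical methods mentioned above is z-score flagging: mark any value too many standard deviations from the mean. The sensor readings and threshold below are invented for illustration.

```python
import statistics

# A minimal statistical anomaly-detection sketch: flag values whose
# z-score exceeds a threshold (toy sensor readings with one injected anomaly).

def zscore_outliers(values, threshold=3.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)          # population std deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 50.0]  # 50.0 is the anomaly
print(zscore_outliers(readings, threshold=2.0))  # [50.0]
```

Note the weakness of this simple approach: a large outlier inflates the mean and standard deviation it is judged against, which is one reason robust or model-based methods (isolation forests, one-class SVM) are used in practice.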
7. Sequential Pattern Mining: Sequential pattern mining discovers patterns that occur sequentially or temporally in data. It is used in applications such as analyzing customer behavior over time or identifying patterns in sequences of events. Examples include the PrefixSpan algorithm and the GSP (Generalized Sequential Pattern) algorithm.
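The basic operation underlying GSP-style algorithms is counting how many sequences contain a given pattern in order (not necessarily contiguously). The clickstream sequences below are invented for illustration.

```python
# A minimal sequential-pattern sketch: count the support of an ordered
# (not necessarily contiguous) pattern in made-up clickstream sequences.

def contains_in_order(sequence, pattern):
    """True if pattern's items appear in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # 'in' consumes the iterator

sequences = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "home", "product"],
]

pattern = ["home", "product", "cart"]
support = sum(contains_in_order(s, pattern) for s in sequences)
print(support)   # 2: the first two sequences contain the pattern in order
```

Full algorithms such as GSP grow candidate patterns one item at a time and keep only those whose support exceeds a threshold.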
8. Text Mining: Text mining techniques extract useful information from unstructured text data.
This includes tasks such as sentiment analysis, topic modeling, named entity recognition, and
document classification. Techniques such as natural language processing (NLP) and machine
learning algorithms are commonly used in text mining.
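A typical first step in all of the text-mining tasks above is turning raw text into a term-frequency table after lowercasing and removing stop words. The document and the (deliberately tiny) stop-word list below are invented.

```python
from collections import Counter

# A minimal text-mining preprocessing sketch: lowercase, strip punctuation,
# remove stop words, and count term frequencies. Toy data throughout.

STOP_WORDS = {"the", "is", "a", "of"}

def term_frequencies(text):
    words = [w.strip(".,").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

doc = "The price of the product is low. The product is popular."
print(term_frequencies(doc))
```

These counts are the input to the techniques named above: sentiment analysis, topic modeling, and document classification all start from some such vector representation of text.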
Benefits of Data Mining
• Enhanced competitiveness: Data mining can help organizations gain a competitive edge by uncovering new business opportunities and identifying areas for improvement.
• Improved customer service: Data mining can help organizations better understand their
customers and tailor their products and services to meet their needs.
• Fraud detection: Data mining can be used to identify fraudulent activities by detecting
unusual patterns and anomalies in data.
• Predictive modeling: Data mining can be used to build models that can predict future
events and trends, which can be used to make proactive decisions.
• New product development: Data mining can be used to identify new product opportunities
by analyzing customer purchase patterns and preferences.
• Risk management: Data mining can be used to identify potential risks by analyzing data on
customer behavior, market conditions, and other factors.
Challenges and Issues in Data Mining
1. Data Quality
The quality of data used in data mining is one of the most significant challenges. The accuracy,
completeness, and consistency of the data affect the accuracy of the results obtained. The
data may contain errors, omissions, duplications, or inconsistencies, which may lead to
inaccurate results.
To address these challenges, data mining practitioners must apply data cleaning and data preprocessing techniques to improve the quality of the data.
2. Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as
sensors, social media, and the internet of things (IoT). The complexity of the data may make it
challenging to process, analyze, and understand. In addition, the data may be in different
formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining.
3. Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The data
may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules on how
data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization and data
encryption techniques to protect the privacy and security of the data. Data anonymization
involves removing personally identifiable information (PII) from the data, while data
encryption involves using algorithms to encode the data to make it unreadable to
unauthorized users.
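One simple form of anonymization is pseudonymization: replacing a personally identifiable field with a salted hash before the record enters the mining pipeline. The salt value and record layout below are invented for illustration; real deployments must manage the salt securely and consider re-identification risk.

```python
import hashlib

# A minimal pseudonymization sketch: replace an identifying field with
# a salted SHA-256 digest. Salt and field names are made up.

SALT = "example-salt"   # in practice, keep the salt secret and rotated

def anonymize(record):
    digest = hashlib.sha256((SALT + record["email"]).encode()).hexdigest()
    return {**record, "email": digest[:12]}   # pseudonymous identifier

record = {"email": "alice@example.com", "purchases": 7}
print(anonymize(record))
```

The same input always maps to the same pseudonym, so records can still be joined and mined, while the original identity is not directly readable.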
4. Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the
dataset increases, the time and computational resources required to perform data mining
operations also increase.
To address this challenge, data mining practitioners use distributed computing frameworks
such as Hadoop and Spark.
5. Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is because the algorithms use a combination of statistical and mathematical techniques to identify patterns and relationships in the data. To address this challenge, data mining practitioners favor inherently interpretable models (such as decision trees) or apply explanation techniques that show which features drive a model's predictions.
Data Mining Applications
Scientific Analysis: Scientific simulations are generating bulks of data every day. This includes
data collected from nuclear laboratories, data about human psychology, etc. Data mining
techniques are capable of analyzing these data. Today we can capture and store new data faster than we can analyze the data already accumulated. Examples of scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support.
Business Transactions: Every business transaction is recorded and retained. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is the most important problem to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions, identify marketing approaches, and support decision-making. Example:
Market Basket Analysis: Market Basket Analysis is a technique based on the careful study of purchases made by a customer in a supermarket. It identifies patterns of items frequently purchased together by customers. This analysis can help companies promote deals, offers, and sales, and data mining techniques help achieve this analysis task. Examples:
Data mining concepts are used in sales and marketing to provide better customer
service, to improve cross-selling opportunities, and to increase direct mail response rates.
Customer retention: pattern identification and prediction of likely
defections are made possible by data mining.
Risk assessment and fraud detection also use data mining concepts for identifying
inappropriate or unusual behavior.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM)
method. This method generates patterns that can be used both by learners and educators. Using
EDM we can perform educational tasks such as:
Predicting students' admission to higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and
its outcomes to improve the targeting of high-value physicians and determine which marketing
activities will have the best effect in the upcoming months. In the insurance
sector, data mining can help to predict which customers will buy new policies, identify behavior
patterns of risky customers, and identify fraudulent behavior of customers.
Claims analysis, i.e., which medical procedures are claimed together.
Identifying successful medical therapies for different illnesses.
Characterizing patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply
data mining to identify the best prospects for its services. A large consumer goods
company can apply data mining to improve its sales process to retailers.
Determine the distribution schedules among outlets.
Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer
transaction data to identify customers most likely to be interested in a new credit product.
Credit card fraud detection.
Identify ‘Loyal’ customers.
Extraction of information related to customers.
Determine credit card spending by customer groups.