FDS-Content Beyond Syllabus
UNIT-1
Text mining is a component of data mining that deals specifically with unstructured text data. It
involves the use of natural language processing (NLP) techniques to extract useful information
and insights from large amounts of unstructured text data. Text mining can be used as a
preprocessing step for data mining or as a standalone process for specific tasks.
In data mining, text mining is mostly used to transform unstructured text data into structured data that can then be used for tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining and text analytics are related but distinct processes for extracting insights from
textual data. Text mining involves the application of natural language processing and machine
learning techniques to discover patterns, trends, and knowledge from large volumes of
unstructured text.
However, Text Analytics focuses on extracting meaningful information, sentiments, and context
from text, often using statistical and linguistic methods. While text mining emphasizes
uncovering hidden patterns, text analytics emphasizes deriving actionable insights for decision-
making. Both play crucial roles in transforming unstructured text into valuable knowledge, with
text mining exploring patterns and text analytics providing interpretative context.
Text mining is widely used in various fields, such as natural language processing, information
retrieval, and social media analysis. It has become an essential tool for organizations to extract
insights from unstructured text data and make data-driven decisions.
Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text databases. Various strategies and tools exist to mine text and find important data for prediction and decision-making. Selecting the right and accurate text mining procedure helps to improve both speed and time complexity.
Text Mining Process
Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure the genuine text is captured; it typically involves removing stop words and stemming (identifying the root form of each word) before indexing the data.
Processing and controlling tasks are then applied to review and further clean the data set.
The information processed in these steps is used to extract important and applicable data for effective and timely decision-making and trend analysis.
Text Summarization: To automatically extract a condensed version of a text that reflects its whole content.
Text Clustering: To segment texts into several clusters based on their mutual relevance.
Text Mining Techniques
Information Retrieval
In the process of information retrieval, we process the available documents and text data into a structured form so that we can apply different pattern recognition and analytical processes. It is the process of extracting relevant and associated patterns according to a given set of words or text documents.
For this, we use processes such as tokenization of the document and stemming, in which we extract the base word (root word) of each token, as sketched below.
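A minimal sketch of tokenization and stemming, assuming the NLTK library is available (the document itself does not name a specific toolkit; the sample sentence is illustrative):

import nltk
from nltk.stem import PorterStemmer

document = "Researchers are studying how mining techniques extract useful patterns"
tokens = document.lower().split()          # simple whitespace tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)   # e.g. 'studying' -> 'studi', 'techniques' -> 'techniqu'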
Information Extraction
Feature Extraction – In this process, we try to develop some new features from existing
ones. This objective can be achieved by parsing an existing feature or combining two or
more features based on some mathematical operation.
Feature Selection – In this process, we try to reduce the dimensionality of the dataset (a common issue when dealing with text data) by selecting a subset of features from the whole dataset. A short sketch combining both steps follows.
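Assuming scikit-learn is available (the document does not prescribe a particular library), a minimal sketch of both steps might look like this; the documents, labels, and k=5 are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical labelled documents (1 = positive feedback, 0 = negative)
docs = ["great product, fast delivery",
        "poor quality, very disappointed",
        "excellent support and great value",
        "disappointed by slow delivery"]
labels = [1, 0, 1, 0]

# feature extraction: turn raw text into TF-IDF features
X = TfidfVectorizer().fit_transform(docs)

# feature selection: keep the 5 features most associated with the labels
X_reduced = SelectKBest(chi2, k=5).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)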
Text Mining Applications
Academic and Research Field: In education, different text-mining tools and strategies are used to examine instructional patterns in a specific region or research field. The main purpose of text mining in research is to help discover and organize research papers and relevant material from various fields on one platform.
Life Science: Life science and healthcare industries produce an enormous volume of textual and numerical data about patient records, diseases, medicines, symptoms, and treatments. Filtering this data to find the relevant text for decision-making in a biological data repository is a major challenge.
Social Media: Text mining is used to dissect and analyze social media applications, monitoring and investigating online content such as plain text from internet news, web journals, emails, blogs, etc. Text mining tools help to track and analyze the number of posts, likes, and followers on a social media network.
Business Intelligence: Text mining plays an important role in business intelligence, helping different organizations and enterprises analyze their customers and competitors to make better decisions.
Advantages of Text Mining
Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of unstructured text data.
Cost-effective: Text mining can be a cost-effective way to process text, as it eliminates the need for manual data entry.
Disadvantages of Text Mining
Complexity: Text mining can be a complex process requiring advanced skills in natural language processing and machine learning.
Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.
High Computational Cost: Text mining requires high computational resources, and it
may be difficult for smaller organizations to afford the technology.
Limited to Text Data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.
Noise in text mining results: Text mining of documents may result in mistakes. It’s
possible to find false links or to miss others.
UNIT-II
INFERENTIAL STATISTICS
Inferential statistics is a branch of statistics that involves using sample data to make inferences
or draw conclusions about a larger population. It allows researchers to generalize their findings
beyond the specific data they have collected and to make predictions or hypotheses about the
population based on the sample data.
Inferential statistics includes techniques such as hypothesis testing, confidence intervals, and
regression analysis. These techniques help researchers assess the reliability of their findings and
determine whether they are likely to apply to the broader population.
Inferential statistics are important because they help us draw broad conclusions from limited amounts of data. This is especially useful in areas like science, business, economics, and the social sciences, where decisions must be based on data. It helps us understand phenomena better and predict outcomes with a stated level of confidence.
Inferential statistics and descriptive statistics are two branches of statistics that serve different purposes:
1. Descriptive Statistics: Descriptive statistics summarizes and describes the main features of the data that have actually been collected, using measures such as the mean, median, and standard deviation along with tables and charts. It makes no claims beyond the observed sample itself.
2. Inferential Statistics: Inferential statistics, on the other hand, involves using sample data
to make inferences or draw conclusions about a larger population. It allows researchers to
generalize their findings from the sample to the population and to make predictions or
hypotheses about the population based on the sample data. Inferential statistics includes
techniques such as hypothesis testing, confidence intervals, and regression analysis.
These techniques help researchers assess the reliability of their findings and determine
whether they are likely to apply to the broader population.
Confidence Intervals: Confidence intervals provide a range of values within which the
true population parameter is likely to fall with a certain level of confidence. For example,
a 95% confidence interval for the population mean indicates that we are 95% confident
that the true population mean falls within the interval.
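As a small illustration (the sample values below are hypothetical, and SciPy is assumed to be available), a 95% confidence interval for a population mean can be computed from sample data as follows:

import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 4.6, 5.3, 4.8, 5.1, 4.4, 4.9])   # hypothetical sample
n = len(sample)
mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")

If the sampling were repeated many times, about 95% of intervals constructed this way would contain the true population mean.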
Hypothesis Testing Example: Consider a scenario where researchers aim to determine whether a new weight loss drug
outperforms the market’s leading medication. They conduct a study involving 100 overweight
individuals, randomly assigning 50 to receive the new drug and the remaining 50 to the current
medication. After a 12-week period, the average weight loss in each group is recorded.
Hypotheses:
Null Hypothesis (H0): The new weight loss drug is not more effective than the current
leading medication.
Alternative Hypothesis (H1): The new weight loss drug is more effective than the
current leading medication.
Significance Level:
Let’s set the significance level at α = 0.05, indicating a 5% chance of rejecting the null
hypothesis when it is actually true.
Test Statistic:
We can use the difference in average weight loss between the two groups as our test statistic.
Steps:
1. Collect Data: Measure the weight loss for each individual in both groups after 12 weeks.
2. Calculate Test Statistic: Find the difference in average weight loss between the two
groups.
3. Assumptions Check: Ensure that the conditions for using a t-test or z-test (depending on
sample size and other factors) are met.
4. Determine Critical Value or p-value: Using the appropriate statistical test (e.g., t-test
for smaller samples, z-test for larger samples), find the critical value or p-value
associated with the test statistic.
5. Make Decision: If the p-value is less than the significance level (α), reject the null
hypothesis. Otherwise, fail to reject the null hypothesis.
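A hedged sketch of this procedure in Python (the weight-loss numbers are simulated, not real study data; SciPy 1.6+ is assumed for the alternative argument):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical weight loss (kg) for the two groups of 50 participants each
new_drug = rng.normal(loc=6.0, scale=2.0, size=50)
current_drug = rng.normal(loc=5.0, scale=2.0, size=50)

# one-sided two-sample t-test: H1 says the new drug gives greater weight loss
t_stat, p_value = stats.ttest_ind(new_drug, current_drug, alternative='greater')

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")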
UNIT-III
HYPOTHESIS TESTING
1. Formulating Hypotheses:
Null Hypothesis (H0): This is the default assumption or the statement of no effect. It
represents what you're trying to test against. For example, if you're testing a new drug,
the null hypothesis might be that the drug has no effect.
Alternative Hypothesis (H1 or Ha): This is the opposite of the null hypothesis. It
represents what you're trying to find evidence for. Using the drug example, the alternative
hypothesis might be that the drug has a positive effect.
2. Choosing a Test Statistic:
The choice of test statistic depends on the nature of the data and the hypothesis being tested. Common test statistics include the t-test, z-test, chi-square test, ANOVA, etc. For example, if you're comparing means between two groups, you might use a t-test.
3. Setting the Significance Level:
The significance level, denoted by α, is the probability of rejecting the null hypothesis when it is true. Common values for α are 0.05 or 0.01, indicating a 5% or 1% chance of making a Type I error (rejecting the null hypothesis when it's actually true).
4. Collecting Data and Calculating the Test Statistic:
Collect a sample of data that is representative of the population you're interested in.
Calculate the test statistic based on the sample data and the chosen test.
5. Determining the Critical Region:
The critical region is the set of values of the test statistic for which the null hypothesis will be rejected. This critical region is determined based on the chosen significance level and the distribution of the test statistic under the null hypothesis.
6. Making a Decision:
If the test statistic falls within the critical region, the null hypothesis is rejected in favor
of the alternative hypothesis. This means there is enough evidence to suggest that the
alternative hypothesis is true.
If the test statistic does not fall within the critical region, the null hypothesis is not
rejected. This means there is not enough evidence to suggest that the alternative
hypothesis is true.
7. Drawing Conclusions:
Based on the decision made in step 6, conclusions are drawn about the population
parameter being tested.
If the null hypothesis is rejected, it suggests that there is a significant difference or effect.
If it is not rejected, it suggests that there is no significant difference or effect.
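Putting steps 4 through 6 together, here is a minimal sketch using a one-sample z-test (the numbers are illustrative, and the population standard deviation is assumed known):

from scipy import stats

# H0: mu = 100 versus H1: mu != 100 (two-sided test)
sample_mean, mu0, sigma, n = 104.5, 100, 15, 50   # hypothetical values
z = (sample_mean - mu0) / (sigma / n ** 0.5)      # test statistic

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)            # critical value, about 1.96

if abs(z) > z_crit:
    print(f"z = {z:.2f} falls in the critical region (|z| > {z_crit:.2f}): reject H0")
else:
    print(f"z = {z:.2f} is outside the critical region: fail to reject H0")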
Types of Errors:
Type I Error: Rejecting the null hypothesis when it is actually true (false positive).
Type II Error: Failing to reject the null hypothesis when it is actually false (false
negative).
Considerations:
Sample size: Larger sample sizes generally provide more reliable results.
Assumptions: Many hypothesis tests rely on certain assumptions about the data, such as
normality or independence. These assumptions should be checked.
Power analysis: This assesses the probability of correctly rejecting the null hypothesis
when it is false.
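As an illustration of power analysis (a sketch assuming the statsmodels package is available; the effect size, alpha, and power values are illustrative), one can solve for the sample size per group needed to detect a medium effect with 80% power:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.1f}")   # roughly 64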
Hypothesis testing is a powerful tool for drawing conclusions from data, but it's important to
interpret the results carefully and understand the limitations of the chosen test and the data.
UNIT-IV
Sorting Arrays
An ordered sequence is any sequence whose elements are arranged in a defined order, such as numeric or alphabetical, ascending or descending.
NumPy provides a function called np.sort() that returns a sorted copy of a specified array.
import numpy as np
arr = np.array([3, 2, 0, 1])   # example numeric array (assumed values)
print(np.sort(arr))
The sort() function also works on arrays of strings and other data types:
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])   # example string array (assumed values)
print(np.sort(arr))
If you use the sort() function on a 2-D array, each inner array (row) will be sorted:
Example
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])   # example 2-D array (assumed values)
print(np.sort(arr))
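Beyond the default behaviour shown above, np.sort also accepts an axis argument; a brief sketch (the array values are illustrative):

import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr, axis=1))   # sort each row (the default)
print(np.sort(arr, axis=0))   # sort down each column instead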
Introducing Pandas String Operations
Tools like NumPy make vectorized operations on arrays of numeric data simple and fast. For example:

import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
array([ 4,  6, 10, 14, 22, 26])

For an array of strings, however, NumPy does not provide such simple access, and thus you are stuck using a more verbose (long-winded) loop syntax:

data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values, so this approach requires putting in extra checks:

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:
import pandas as pd
names = pd.Series(data)
names
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object
We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
names.str.capitalize()
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
Using tab completion on this str attribute will list all the vectorized string methods available to
Pandas.
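For example, a few of those vectorized methods applied to the same Series (a small illustrative selection, not an exhaustive list):

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])
names.str.lower()            # lowercase every entry; missing values are preserved
names.str.len()              # length of each string (NaN for the missing entry)
names.str.startswith('P')    # element-wise startswith test
names.str.contains('e')      # element-wise substring test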
UNIT-V
K-MEANS CLUSTERING
Introducing k-Means
The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
• The “cluster center” is the arithmetic mean of all the points belonging to the cluster.
• Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the k-means model. We will soon dive into exactly how
the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-
means result.
First, let’s generate a two-dimensional dataset containing four distinct blobs. To emphasize that
this is an unsupervised algorithm, we will leave the labels out of the visualization
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# four well-separated blobs; n_samples=300 is an assumed value, the rest as given
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let’s visualize the results by plotting the data colored by these labels. We will also plot the
cluster centers as determined by the k-means estimator
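The plotting code itself is not shown in the text; continuing from the code above, a sketch using matplotlib (assumed here) that matches the description might look like this:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()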
In the k-means algorithm, the “E-step” or “Expectation step” is so named because it involves updating our
expectation of which cluster each point belongs to. The “M-step” or “Maximization step”
is so named because it involves maximizing some fitness function that defines the
location of the cluster centers—in this case, that maximization is accomplished by taking
a simple mean of the data in each cluster.
The literature about this algorithm is vast, but can be summarized as follows: under
typical circumstances, each repetition of the E-step and M-step will always result in a
better estimate of the cluster characteristics.
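To make the E-step/M-step description concrete, here is a minimal from-scratch sketch in NumPy (this is not the scikit-learn implementation; it runs a fixed number of iterations and omits a convergence check):

import numpy as np

def kmeans_em(X, n_clusters, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centers by picking random data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = distances.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        new_centers = []
        for k in range(n_clusters):
            points = X[labels == k]
            # keep the old center if a cluster happens to end up empty
            new_centers.append(points.mean(axis=0) if len(points) else centers[k])
        centers = np.array(new_centers)
    # final assignment against the updated centers
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    return centers, labels

Applied to the blob data generated earlier, kmeans_em(X, 4) should recover essentially the same four cluster centers as scikit-learn's KMeans, up to the ordering of the labels.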