FDS-Content Beyond Syllabus
UNIT-1
Text mining is a component of data mining that deals specifically with unstructured text data. It
involves the use of natural language processing (NLP) techniques to extract useful information
and insights from large amounts of unstructured text data. Text mining can be used as a
preprocessing step for data mining or as a standalone process for specific tasks.
In data mining, text mining is mostly used to transform unstructured text data into structured data that can then be used for tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining and text analytics are related but distinct processes for extracting insights from
textual data. Text mining involves the application of natural language processing and machine
learning techniques to discover patterns, trends, and knowledge from large volumes of
unstructured text.
However, Text Analytics focuses on extracting meaningful information, sentiments, and context
from text, often using statistical and linguistic methods. While text mining emphasizes
uncovering hidden patterns, text analytics emphasizes deriving actionable insights for decision-
making. Both play crucial roles in transforming unstructured text into valuable knowledge, with
text mining exploring patterns and text analytics providing interpretative context.
Text mining is widely used in various fields, such as natural language processing, information
retrieval, and social media analysis. It has become an essential tool for organizations to extract
insights from unstructured text data and make data-driven decisions.
Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text databases. Various strategies and tools exist to mine text and find important data for prediction and decision-making. Selecting the right and accurate text mining procedure helps to improve both speed and time complexity.
Text Mining Process
Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure the genuine text is captured; it typically involves removing stop words and stemming (identifying the root form of each word) before indexing the data.
Processing and controlling tasks are then applied to review and further clean the data set.
The information processed in these steps is used to extract important and applicable data for effective and timely decision-making and trend analysis.
Text Summarization: To automatically extract a condensed version of a text that reflects its whole content.
Text Clustering: To segment texts into several clusters based on their mutual relevance.
Text Mining Techniques
Information Retrieval
In the process of information retrieval, we process the available documents and text data into a structured form so that we can apply different pattern recognition and analytical processes. It is the process of extracting relevant and associated patterns according to a given set of words or text documents.
For this, we use processes such as tokenization of the document and stemming, in which we extract the base word (root word) of each token, as sketched below.
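A minimal sketch of tokenization and stemming, assuming the NLTK library is available (the document itself does not name a specific toolkit; the sample sentence is illustrative):

import nltk
from nltk.stem import PorterStemmer

document = "Researchers are studying how mining techniques extract useful patterns"
tokens = document.lower().split()          # simple whitespace tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)   # e.g. 'studying' -> 'studi', 'techniques' -> 'techniqu'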
Information Extraction
Feature Extraction – In this process, we try to develop some new features from existing
ones. This objective can be achieved by parsing an existing feature or combining two or
more features based on some mathematical operation.
Feature Selection – In this process, we try to reduce the dimensionality of the dataset (a common issue when dealing with text data) by selecting a subset of features from the whole dataset. A short sketch combining both steps follows.
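Assuming scikit-learn is available (the document does not prescribe a particular library), a minimal sketch of both steps might look like this; the documents, labels, and k=5 are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical labelled documents (1 = positive feedback, 0 = negative)
docs = ["great product, fast delivery",
        "poor quality, very disappointed",
        "excellent support and great value",
        "disappointed by slow delivery"]
labels = [1, 0, 1, 0]

# feature extraction: turn raw text into TF-IDF features
X = TfidfVectorizer().fit_transform(docs)

# feature selection: keep the 5 features most associated with the labels
X_reduced = SelectKBest(chi2, k=5).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)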
Text Mining Applications
Academic and Research Field: In education, different text-mining tools and strategies are used to examine instructional patterns in a specific region or research field. The main purpose of text mining in research is to help discover and organize research papers and relevant material from various fields on one platform.
Life Science: Life science and healthcare industries produce an enormous volume of textual and numerical data about patient records, diseases, medicines, symptoms, and treatments. Filtering this data to find the relevant text for decision-making in a biological data repository is a major challenge.
Social Media: Text mining is used to dissect and analyze social media applications, monitoring and investigating online content such as plain text from internet news, web journals, emails, blogs, etc. Text mining tools help to track and analyze the number of posts, likes, and followers on a social media network.
Business Intelligence: Text mining plays an important role in business intelligence, helping different organizations and enterprises analyze their customers and competitors to make better decisions.
Advantages of Text Mining
Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of unstructured text data.
Cost-effective: Text mining can be a cost-effective way to process text, as it eliminates the need for manual data entry.
Disadvantages of Text Mining
Complexity: Text mining can be a complex process requiring advanced skills in natural language processing and machine learning.
Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.
High Computational Cost: Text mining requires high computational resources, and it
may be difficult for smaller organizations to afford the technology.
Limited to Text Data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.
Noise in text mining results: Text mining of documents may result in mistakes. It’s
possible to find false links or to miss others.
UNIT-II
INFERENTIAL STATISTICS
Inferential statistics is a branch of statistics that involves using sample data to make inferences
or draw conclusions about a larger population. It allows researchers to generalize their findings
beyond the specific data they have collected and to make predictions or hypotheses about the
population based on the sample data.
Inferential statistics includes techniques such as hypothesis testing, confidence intervals, and
regression analysis. These techniques help researchers assess the reliability of their findings and
determine whether they are likely to apply to the broader population.
Inferential statistics are important because they help us draw broad conclusions from limited amounts of data. This is especially useful in areas like science, business, economics, and the social sciences, where decisions must be based on data. It helps us understand phenomena better and predict outcomes with a stated level of confidence.
Inferential statistics and descriptive statistics are two branches of statistics that serve different purposes:
1. Descriptive Statistics: Descriptive statistics summarizes and describes the main features of the data that have actually been collected, using measures such as the mean, median, and standard deviation along with tables and charts. It makes no claims beyond the observed sample itself.
2. Inferential Statistics: Inferential statistics, on the other hand, involves using sample data
to make inferences or draw conclusions about a larger population. It allows researchers to
generalize their findings from the sample to the population and to make predictions or
hypotheses about the population based on the sample data. Inferential statistics includes
techniques such as hypothesis testing, confidence intervals, and regression analysis.
These techniques help researchers assess the reliability of their findings and determine
whether they are likely to apply to the broader population.
Confidence Intervals: Confidence intervals provide a range of values within which the
true population parameter is likely to fall with a certain level of confidence. For example,
a 95% confidence interval for the population mean indicates that we are 95% confident
that the true population mean falls within the interval.
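As a small illustration (the sample values below are hypothetical, and SciPy is assumed to be available), a 95% confidence interval for a population mean can be computed from sample data as follows:

import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 4.6, 5.3, 4.8, 5.1, 4.4, 4.9])   # hypothetical sample
n = len(sample)
mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")

If the sampling were repeated many times, about 95% of intervals constructed this way would contain the true population mean.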
Hypothesis Testing Example: Consider a scenario where researchers aim to determine whether a new weight loss drug
outperforms the market’s leading medication. They conduct a study involving 100 overweight
individuals, randomly assigning 50 to receive the new drug and the remaining 50 to the current
medication. After a 12-week period, the average weight loss in each group is recorded.
Hypotheses:
Null Hypothesis (H0): The new weight loss drug is not more effective than the current
leading medication.
Alternative Hypothesis (H1): The new weight loss drug is more effective than the
current leading medication.
Significance Level:
Let’s set the significance level at α = 0.05, indicating a 5% chance of rejecting the null
hypothesis when it is actually true.
Test Statistic:
We can use the difference in average weight loss between the two groups as our test statistic.
Steps:
1. Collect Data: Measure the weight loss for each individual in both groups after 12 weeks.
2. Calculate Test Statistic: Find the difference in average weight loss between the two
groups.
3. Assumptions Check: Ensure that the conditions for using a t-test or z-test (depending on
sample size and other factors) are met.
4. Determine Critical Value or p-value: Using the appropriate statistical test (e.g., t-test
for smaller samples, z-test for larger samples), find the critical value or p-value
associated with the test statistic.
5. Make Decision: If the p-value is less than the significance level (α), reject the null
hypothesis. Otherwise, fail to reject the null hypothesis.
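A hedged sketch of this procedure in Python (the weight-loss numbers are simulated, not real study data; SciPy 1.6+ is assumed for the alternative argument):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical weight loss (kg) for the two groups of 50 participants each
new_drug = rng.normal(loc=6.0, scale=2.0, size=50)
current_drug = rng.normal(loc=5.0, scale=2.0, size=50)

# one-sided two-sample t-test: H1 says the new drug gives greater weight loss
t_stat, p_value = stats.ttest_ind(new_drug, current_drug, alternative='greater')

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")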
UNIT-III
HYPOTHESIS TESTING
1. Formulating Hypotheses:
Null Hypothesis (H0): This is the default assumption or the statement of no effect. It
represents what you're trying to test against. For example, if you're testing a new drug,
the null hypothesis might be that the drug has no effect.
Alternative Hypothesis (H1 or Ha): This is the opposite of the null hypothesis. It
represents what you're trying to find evidence for. Using the drug example, the alternative
hypothesis might be that the drug has a positive effect.
2. Choosing a Test Statistic:
The choice of test statistic depends on the nature of the data and the hypothesis being tested. Common test statistics include the t-test, z-test, chi-square test, ANOVA, etc. For example, if you're comparing means between two groups, you might use a t-test.
3. Setting the Significance Level:
The significance level, denoted by α, is the probability of rejecting the null hypothesis when it is true. Common values for α are 0.05 or 0.01, indicating a 5% or 1% chance of making a Type I error (rejecting the null hypothesis when it's actually true).
4. Collecting Data and Calculating the Test Statistic:
Collect a sample of data that is representative of the population you're interested in.
Calculate the test statistic based on the sample data and the chosen test.
5. Determining the Critical Region:
The critical region is the set of values of the test statistic for which the null hypothesis will be rejected. This critical region is determined based on the chosen significance level and the distribution of the test statistic under the null hypothesis.
6. Making a Decision:
If the test statistic falls within the critical region, the null hypothesis is rejected in favor
of the alternative hypothesis. This means there is enough evidence to suggest that the
alternative hypothesis is true.
If the test statistic does not fall within the critical region, the null hypothesis is not
rejected. This means there is not enough evidence to suggest that the alternative
hypothesis is true.
7. Drawing Conclusions:
Based on the decision made in step 6, conclusions are drawn about the population
parameter being tested.
If the null hypothesis is rejected, it suggests that there is a significant difference or effect.
If it is not rejected, it suggests that there is no significant difference or effect.
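Putting steps 4 through 6 together, here is a minimal sketch using a one-sample z-test (the numbers are illustrative, and the population standard deviation is assumed known):

from scipy import stats

# H0: mu = 100 versus H1: mu != 100 (two-sided test)
sample_mean, mu0, sigma, n = 104.5, 100, 15, 50   # hypothetical values
z = (sample_mean - mu0) / (sigma / n ** 0.5)      # test statistic

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)            # critical value, about 1.96

if abs(z) > z_crit:
    print(f"z = {z:.2f} falls in the critical region (|z| > {z_crit:.2f}): reject H0")
else:
    print(f"z = {z:.2f} is outside the critical region: fail to reject H0")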
Types of Errors:
Type I Error: Rejecting the null hypothesis when it is actually true (false positive).
Type II Error: Failing to reject the null hypothesis when it is actually false (false
negative).
Considerations:
Sample size: Larger sample sizes generally provide more reliable results.
Assumptions: Many hypothesis tests rely on certain assumptions about the data, such as
normality or independence. These assumptions should be checked.
Power analysis: This assesses the probability of correctly rejecting the null hypothesis
when it is false.
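As an illustration of power analysis (a sketch assuming the statsmodels package is available; the effect size, alpha, and power values are illustrative), one can solve for the sample size per group needed to detect a medium effect with 80% power:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.1f}")   # roughly 64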
Hypothesis testing is a powerful tool for drawing conclusions from data, but it's important to
interpret the results carefully and understand the limitations of the chosen test and the data.
UNIT-IV
Sorting Arrays
An ordered sequence is any sequence whose elements are arranged in a defined order, such as numeric or alphabetical, ascending or descending.
NumPy provides a function called np.sort() that returns a sorted copy of a specified array.
import numpy as np
arr = np.array([3, 2, 0, 1])   # example numeric array (assumed values)
print(np.sort(arr))
The sort() function also works on arrays of strings and other data types:
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])   # example string array (assumed values)
print(np.sort(arr))
If you use the sort() function on a 2-D array, each inner array (row) will be sorted:
Example
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])   # example 2-D array (assumed values)
print(np.sort(arr))
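Beyond the default behaviour shown above, np.sort also accepts an axis argument; a brief sketch (the array values are illustrative):

import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr, axis=1))   # sort each row (the default)
print(np.sort(arr, axis=0))   # sort down each column instead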
Introducing Pandas String Operations
Tools like NumPy make vectorized operations on arrays of numeric data simple and fast. For example:

import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
array([ 4,  6, 10, 14, 22, 26])

For an array of strings, however, NumPy does not provide such simple access, and thus you are stuck using a more verbose (long-winded) loop syntax:

data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values, so this approach requires putting in extra checks:

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:
import pandas as pd
names = pd.Series(data)
names
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object
We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
names.str.capitalize()
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
Using tab completion on this str attribute will list all the vectorized string methods available to
Pandas.
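For example, a few of those vectorized methods applied to the same Series (a small illustrative selection, not an exhaustive list):

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])
names.str.lower()            # lowercase every entry; missing values are preserved
names.str.len()              # length of each string (NaN for the missing entry)
names.str.startswith('P')    # element-wise startswith test
names.str.contains('e')      # element-wise substring test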
UNIT-V
K-MEANS CLUSTERING
Introducing k-Means
The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
• The “cluster center” is the arithmetic mean of all the points belonging to the cluster.
• Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the k-means model. We will soon dive into exactly how
the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-
means result.
First, let’s generate a two-dimensional dataset containing four distinct blobs. To emphasize that
this is an unsupervised algorithm, we will leave the labels out of the visualization
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# four well-separated blobs; n_samples=300 is an assumed value, the rest as given
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let’s visualize the results by plotting the data colored by these labels. We will also plot the
cluster centers as determined by the k-means estimator
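The plotting code itself is not shown in the text; continuing from the code above, a sketch using matplotlib (assumed here) that matches the description might look like this:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()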
In the k-means algorithm, the “E-step” or “Expectation step” is so named because it involves updating our
expectation of which cluster each point belongs to. The “M-step” or “Maximization step”
is so named because it involves maximizing some fitness function that defines the
location of the cluster centers—in this case, that maximization is accomplished by taking
a simple mean of the data in each cluster.
The literature about this algorithm is vast, but can be summarized as follows: under
typical circumstances, each repetition of the E-step and M-step will always result in a
better estimate of the cluster characteristics.
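To make the E-step/M-step description concrete, here is a minimal from-scratch sketch in NumPy (this is not the scikit-learn implementation; it runs a fixed number of iterations and omits a convergence check):

import numpy as np

def kmeans_em(X, n_clusters, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centers by picking random data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = distances.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        new_centers = []
        for k in range(n_clusters):
            points = X[labels == k]
            # keep the old center if a cluster happens to end up empty
            new_centers.append(points.mean(axis=0) if len(points) else centers[k])
        centers = np.array(new_centers)
    # final assignment against the updated centers
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    return centers, labels

Applied to the blob data generated earlier, kmeans_em(X, 4) should recover essentially the same four cluster centers as scikit-learn's KMeans, up to the ordering of the labels.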