
FOUNDATIONS OF DATA SCIENCE

CONTENT BEYOND SYLLABUS

UNIT-1

Text mining and text analytics

What is Text Mining?

Text mining is a component of data mining that deals specifically with unstructured text data. It
involves the use of natural language processing (NLP) techniques to extract useful information
and insights from large amounts of unstructured text data. Text mining can be used as a
preprocessing step for data mining or as a standalone process for specific tasks.

Text Mining in Data Mining

In data mining, text mining is used primarily to transform unstructured text data into structured data that can then be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
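As a concrete illustration of this transformation, the short sketch below turns a handful of free-text comments into a structured document-term matrix that classification or clustering algorithms can consume. It assumes scikit-learn's CountVectorizer; the text above does not prescribe any particular library, and the feedback snippets are made up.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical customer-feedback snippets (unstructured text)
docs = [
    "The delivery was fast and the packaging was great",
    "Terrible customer service and slow delivery",
    "Great product, will buy again",
]

# Turn the raw text into a structured bag-of-words matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary (requires scikit-learn 1.0+)
print(X.toarray())                         # one row per document, one column per word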

Text Mining vs. Text Analytics

Text mining and text analytics are related but distinct processes for extracting insights from
textual data. Text mining involves the application of natural language processing and machine
learning techniques to discover patterns, trends, and knowledge from large volumes of
unstructured text.

Text analytics, by contrast, focuses on extracting meaningful information, sentiments, and context from text, often using statistical and linguistic methods. While text mining emphasizes uncovering hidden patterns, text analytics emphasizes deriving actionable insights for decision-making. Both play crucial roles in transforming unstructured text into valuable knowledge, with text mining exploring patterns and text analytics providing interpretative context.

Why is Text Mining Important?

Text mining is widely used in various fields, such as natural language processing, information
retrieval, and social media analysis. It has become an essential tool for organizations to extract
insights from unstructured text data and make data-driven decisions.

Text mining is the process of extracting useful information and nontrivial patterns from large volumes of text data. Various strategies and tools exist to mine text and find important data for prediction and decision making. Choosing an appropriate and accurate text mining procedure improves processing speed and reduces the overall time complexity of the analysis.
Text Mining Process

 Gathering unstructured information from various sources available in different document formats, for example, plain text, web pages, PDF records, etc.

 Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure to capture the genuine text; it removes stop words and applies stemming (the process of identifying the root of a word) before indexing the data.

 Processing and controlling tasks are applied to review and further clean the data set.

 Pattern analysis is implemented in the Management Information System (MIS).

 Information processed in the above steps is used to extract important and applicable insights for a powerful and convenient decision-making process and trend analysis.

Common Methods for Analyzing Text

 Text Summarization: Automatically producing a condensed version of a text that reflects its overall content.

 Text Categorization: Assigning a text to one of several categories predefined by users.

 Text Clustering: Segmenting texts into several clusters based on their content relevance.
Text Mining Techniques

Information Retrieval

In the process of information retrieval, we try to process the available documents and text data into a structured form so that we can apply different pattern recognition and analytical processes. It is a process of extracting relevant and associated patterns according to a given set of words or text documents.

For this, we use processes such as tokenization of the document and stemming, in which we try to extract the base word, or root word, of each term.
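For instance, a minimal sketch of tokenization followed by stemming might look like the following; NLTK's PorterStemmer is used purely as an illustration, since the text does not mandate any particular library, and the sentence is made up.

import re
from nltk.stem import PorterStemmer  # assumes the nltk package is installed

text = "Text mining discovers interesting patterns in collections of documents"

# Tokenization: split the raw text into lowercase word tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Stemming: reduce each token to its root (base) form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(tokens)
print(stems)  # e.g. 'mining' -> 'mine', 'patterns' -> 'pattern', 'documents' -> 'document'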

Information Extraction

It is a process of extracting meaningful words from documents.

 Feature Extraction – In this process, we try to derive new features from existing ones. This can be achieved by parsing an existing feature or by combining two or more features through some mathematical operation.

 Feature Selection – In this process, we try to reduce the dimensionality of the dataset, which is a common issue when dealing with text data, by selecting a subset of features from the whole dataset. A brief sketch of both steps follows this list.
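A brief sketch of both steps, assuming scikit-learn's TfidfVectorizer for feature extraction and SelectKBest for feature selection; the library choice and the tiny labelled dataset are assumptions for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative labelled documents: 1 = positive feedback, 0 = negative feedback
docs = [
    "great product and fast delivery",
    "excellent quality, very happy",
    "late delivery and poor support",
    "broken on arrival, poor quality",
]
labels = [1, 1, 0, 0]

# Feature extraction: derive TF-IDF features from the raw text
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Feature selection: keep only the k features most associated with the labels
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, labels)

print(X.shape, "->", X_reduced.shape)                          # dimensionality is reduced
print(tfidf.get_feature_names_out()[selector.get_support()])   # the selected terms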

Text Mining Applications


Digital Library: Various text mining strategies and tools are used to extract patterns and trends from journals and proceedings stored in text database repositories.

Academic and Research Field: In education, different text-mining tools and strategies are used to examine instructional patterns in a specific region or research field. The main purpose of text mining in the research field is to help discover and organize research papers and relevant material from various fields on one platform.

Life Science: The life science and healthcare industries produce an enormous volume of textual and numerical data regarding patient records, illnesses, medicines, symptoms, and treatments of diseases. Filtering this data to find the relevant text for decision making in a biological data repository is a major challenge.

Social Media: Text mining is used to dissect and analyze web-based media applications, monitoring and investigating online content such as plain text from internet news, emails, and blogs. Text mining tools help to identify and analyze the number of posts, likes, and followers on a social media network.

Business Intelligence: Text mining plays an important role in business intelligence, helping organizations and enterprises analyze their customers and competitors to make better decisions.

Advantages of Text Mining

 Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of unstructured text data.

 Variety of Applications: Text mining has a wide range of applications, including sentiment analysis, named entity recognition, and topic modeling.

 Improved Decision Making: Insights extracted from text support better, data-driven decisions.

 Cost-effective: Text mining can be cost-effective, as it reduces the need for manual data entry and review.

Disadvantages of Text Mining

 Complexity: Text mining can be a complex process requiring advanced skills in natural
language processing and machine learning.

 Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.

 High Computational Cost: Text mining requires high computational resources, and it
may be difficult for smaller organizations to afford the technology.
 Limited to Text Data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.

 Noise in text mining results: Text mining of documents may produce mistakes; it is possible to find spurious relationships or to miss real ones.

 Lack of transparency: Text mining is frequently viewed as an opaque process in which large corpora of text documents are input and new information is produced.

UNIT-II

INFERENTIAL STATISTICS

What is Inferential Statistics?

Inferential statistics is a branch of statistics that involves using sample data to make inferences
or draw conclusions about a larger population. It allows researchers to generalize their findings
beyond the specific data they have collected and to make predictions or hypotheses about the
population based on the sample data.

Inferential statistics includes techniques such as hypothesis testing, confidence intervals, and
regression analysis. These techniques help researchers assess the reliability of their findings and
determine whether they are likely to apply to the broader population.

Inferential statistics is important because it helps us draw broad conclusions from limited amounts of data. This is especially useful in areas like science, business, economics, and the social sciences, where decisions must be made based on data. It helps us understand phenomena better and predict outcomes with a stated level of confidence.

Inferential vs Descriptive Statistics

Inferential statistics and descriptive statistics are two branches of statistics that serve different
purposes:

1. Descriptive Statistics: Descriptive statistics is concerned with describing and summarizing the features of a dataset. It involves methods such as calculating measures of central tendency (mean, median, mode), measures of dispersion (variance, standard deviation), and visualizing data through graphs and charts (histograms, box plots). Descriptive statistics are used to understand the basic characteristics of the data, such as its distribution, variability, and central tendency. A brief example follows this list.

2. Inferential Statistics: Inferential statistics, on the other hand, involves using sample data
to make inferences or draw conclusions about a larger population. It allows researchers to
generalize their findings from the sample to the population and to make predictions or
hypotheses about the population based on the sample data. Inferential statistics includes
techniques such as hypothesis testing, confidence intervals, and regression analysis.
These techniques help researchers assess the reliability of their findings and determine
whether they are likely to apply to the broader population.
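As a quick illustration of the descriptive side, the usual summaries can be computed directly with pandas; this is a minimal sketch and the exam scores below are made up.

import pandas as pd

scores = pd.Series([72, 85, 90, 66, 85, 78, 95, 85, 70, 88])  # illustrative exam scores

print(scores.mean(), scores.median(), scores.mode().tolist())  # central tendency
print(scores.var(), scores.std())                              # dispersion (ddof=1 by default)
print(scores.describe())                                       # quick overall summary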

Types of Inferential Statistics

1. Hypothesis Testing: Hypothesis testing involves making decisions about a population parameter based on sample data. It typically involves formulating a null hypothesis (H0) and an alternative hypothesis (Ha), collecting sample data, and using statistical tests to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

2. Regression Analysis: Regression analysis is used to examine the relationship between one or more independent variables and a dependent variable. It helps in predicting the value of the dependent variable based on the values of the independent variables.

3. Confidence Intervals: Confidence intervals provide a range of values within which the
true population parameter is likely to fall with a certain level of confidence. For example,
a 95% confidence interval for the population mean indicates that we are 95% confident
that the true population mean falls within the interval.
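As a small illustration of the last point, a 95% confidence interval for a population mean can be computed from a sample as follows; this is a minimal sketch assuming NumPy and SciPy, and the sample values are made up.

import numpy as np
from scipy import stats

sample = np.array([4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.5, 4.7, 5.1])  # illustrative data

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value

lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")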

Inferential Statistics: Evaluating the Efficacy of New Weight Loss Drugs

Consider a scenario where researchers aim to determine whether a new weight loss drug
outperforms the market’s leading medication. They conduct a study involving 100 overweight
individuals, randomly assigning 50 to receive the new drug and the remaining 50 to the current
medication. After a 12-week period, the average weight loss in each group is recorded.

Here’s a simple example of inferential statistics calculation:

Hypotheses:

 Null Hypothesis (H0): The new weight loss drug is not more effective than the current
leading medication.

 Alternative Hypothesis (H1): The new weight loss drug is more effective than the
current leading medication.

Significance Level:

Let’s set the significance level at α = 0.05, indicating a 5% chance of rejecting the null
hypothesis when it is actually true.

Test Statistic:

We can use the difference in average weight loss between the two groups as our test statistic.
Steps:

1. Collect Data: Measure the weight loss for each individual in both groups after 12 weeks.

2. Calculate Test Statistic: Find the difference in average weight loss between the two
groups.

3. Assumptions Check: Ensure that the conditions for using a t-test or z-test (depending on
sample size and other factors) are met.

4. Determine Critical Value or p-value: Using the appropriate statistical test (e.g., t-test
for smaller samples, z-test for larger samples), find the critical value or p-value
associated with the test statistic.

5. Make Decision: If the p-value is less than the significance level (α), reject the null
hypothesis. Otherwise, fail to reject the null hypothesis.
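A minimal code sketch of steps 1–5, assuming SciPy's ttest_ind and simulated (made-up) weight-loss figures rather than real trial data; the alternative= keyword requires a reasonably recent SciPy.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
new_drug = rng.normal(loc=6.0, scale=2.0, size=50)  # simulated kg lost on the new drug
current = rng.normal(loc=5.0, scale=2.0, size=50)   # simulated kg lost on the current drug

alpha = 0.05
# Welch's two-sample t-test, one-sided: is mean weight loss greater on the new drug?
t_stat, p_value = stats.ttest_ind(new_drug, current, equal_var=False, alternative="greater")

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: evidence that the new drug is more effective")
else:
    print("Fail to reject H0: no significant evidence of greater effectiveness")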

UNIT-III

HYPOTHESIS TESTING

Hypothesis testing is a fundamental concept in statistics used to make decisions or inferences about population parameters based on sample data. It involves formulating a hypothesis, collecting data, and using statistical techniques to evaluate the evidence against the hypothesis. Here's a detailed explanation of hypothesis testing:

1. Formulating Hypotheses:

 Null Hypothesis (H0): This is the default assumption or the statement of no effect. It
represents what you're trying to test against. For example, if you're testing a new drug,
the null hypothesis might be that the drug has no effect.
 Alternative Hypothesis (H1 or Ha): This is the opposite of the null hypothesis. It
represents what you're trying to find evidence for. Using the drug example, the alternative
hypothesis might be that the drug has a positive effect.

2. Choosing a Test Statistic:

 The choice of test statistic depends on the nature of the data and the hypothesis being
tested. Common test statistics include the t-test, z-test, chi-square test, ANOVA, etc.
 For example, if you're comparing means between two groups, you might use a t-test.

3. Setting a Significance Level (α):

 The significance level, denoted by α, is the probability of rejecting the null hypothesis
when it is true. Common values for α are 0.05 or 0.01, indicating a 5% or 1% chance of
making a Type I error (rejecting the null hypothesis when it's actually true).
4. Collecting Data and Calculating the Test Statistic:

 Collect a sample of data that is representative of the population you're interested in.
 Calculate the test statistic based on the sample data and the chosen test.

5. Determining the Critical Region:

 The critical region is the set of values of the test statistic for which the null hypothesis
will be rejected.
 This critical region is determined based on the chosen significance level and the
distribution of the test statistic under the null hypothesis.

6. Making a Decision:

 If the test statistic falls within the critical region, the null hypothesis is rejected in favor
of the alternative hypothesis. This means there is enough evidence to suggest that the
alternative hypothesis is true.
 If the test statistic does not fall within the critical region, the null hypothesis is not
rejected. This means there is not enough evidence to suggest that the alternative
hypothesis is true.

7. Drawing Conclusions:

 Based on the decision made in step 6, conclusions are drawn about the population
parameter being tested.
 If the null hypothesis is rejected, it suggests that there is a significant difference or effect.
If it is not rejected, it suggests that there is no significant difference or effect.
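To make the procedure concrete, here is a compact sketch mapping steps 1–7 to a one-sample t-test; the choice of test, the sample values, and the hypothesized mean of 50 are illustrative assumptions, not taken from the text.

import numpy as np
from scipy import stats

# Steps 1-3: H0: population mean = 50, H1: mean != 50, test statistic = t, alpha = 0.05
sample = np.array([52.1, 49.8, 53.4, 51.0, 50.7, 52.8, 48.9, 51.6])  # illustrative data
alpha = 0.05

# Step 4: calculate the test statistic (and p-value) from the sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Step 5: determine the critical region for a two-sided test
t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)

# Steps 6-7: make a decision and draw a conclusion
if abs(t_stat) > t_crit:  # equivalently: p_value < alpha
    print("Reject H0: the mean differs significantly from 50")
else:
    print("Fail to reject H0: no significant difference from 50")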

Types of Errors:

 Type I Error: Rejecting the null hypothesis when it is actually true (false positive).
 Type II Error: Failing to reject the null hypothesis when it is actually false (false
negative).

Considerations:

 Sample size: Larger sample sizes generally provide more reliable results.
 Assumptions: Many hypothesis tests rely on certain assumptions about the data, such as
normality or independence. These assumptions should be checked.
 Power analysis: This assesses the probability of correctly rejecting the null hypothesis
when it is false.
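For example, the sample size needed per group for a two-sample t-test can be estimated with a power analysis; the sketch below assumes the statsmodels package and an assumed medium effect size of 0.5.

from statsmodels.stats.power import TTestIndPower

# How many subjects per group are needed to detect a medium effect (d = 0.5)
# at a 5% significance level with 80% power?
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group under these assumptions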

Hypothesis testing is a powerful tool for drawing conclusions from data, but it's important to
interpret the results carefully and understand the limitations of the chosen test and the data.
UNIT-IV

SORTING ARRAYS AND VECTORIZED STRING OPERATIONS

Sorting Arrays

Sorting means putting elements in an ordered sequence.

Ordered sequence is any sequence that has an order corresponding to elements, like numeric or
alphabetical, ascending or descending.

NumPy provides a function called np.sort() that returns a sorted copy of a specified array.

import numpy as np

arr = np.array([3, 2, 0, 1])

print(np.sort(arr))

Sort the array alphabetically:

import numpy as np

arr = np.array(['banana', 'cherry', 'apple'])

print(np.sort(arr))

Sorting a 2-D Array

If you use sort() on a 2-D array, each row (inner array) is sorted independently:

Example

Sort a 2-D array:

import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])

print(np.sort(arr))
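By default, np.sort() sorts along the last axis (each row). The axis argument can be used to sort columns or the flattened array instead, and np.argsort() returns the indices that would sort an array:

import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])

print(np.sort(arr, axis=0))                # sort each column
print(np.sort(arr, axis=None))             # sort the flattened array
print(np.argsort(np.array([3, 2, 0, 1])))  # indices that would sort the array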
Introducing Pandas String Operations

Vectorization of operations simplifies the syntax of operating on arrays of data.


For example:

import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
array([ 4, 6, 10, 14, 22, 26])

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:

data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values unless we add extra checks (such as [s if s is None else s.capitalize() for s in data]). For example:

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:

import pandas as pd
names = pd.Series(data)
names
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any
missing values:

names.str.capitalize()
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object

Using tab completion on this str attribute will list all the vectorized string methods available to
Pandas.
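A few of these vectorized methods applied to the same Series are shown below; they mirror Python's built-in string methods and skip over missing values, and the particular selection here is just an illustration.

names.str.lower()          # 'peter', 'paul', None, 'mary', 'guido'
names.str.len()            # length of each string (NaN for the missing entry)
names.str.startswith('P')  # element-wise boolean test
names.str.split()          # split each string on whitespace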

UNIT-V

K-MEANS CLUSTERING

Introducing k-Means

The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:

• The “cluster center” is the arithmetic mean of all the points belonging to the cluster.

• Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k-means model. We will soon dive into exactly how
the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-
means result.

First, let’s generate a two-dimensional dataset containing four distinct blobs. To emphasize that this is an unsupervised algorithm, we will leave the labels out of the visualization.

In[2]: import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);


By eye, it is relatively easy to pick out the four clusters. The k-means algorithm does

this automatically, and in Scikit-Learn uses the typical estimator API:

In[3]: from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(X)

y_kmeans = kmeans.predict(X)

Let’s visualize the results by plotting the data colored by these labels. We will also plot the
cluster centers as determined by the k-means estimator

In[4]: plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
The good news is that the k-means algorithm (at least in this simple case) assigns the
points to clusters very similarly to how we might assign them by eye. But you might
wonder how this algorithm finds these clusters so quickly!

After all, the number of possible combinations of cluster assignments is exponential in the number of data points; an exhaustive search would be very, very costly. Fortunately for us, such an exhaustive search is not necessary; instead, the typical approach to k-means involves an intuitive iterative approach known as expectation–maximization.

k-Means Algorithm: Expectation–Maximization

Expectation–maximization (E–M) is a powerful algorithm that comes up in a variety of contexts within data science. k-means is a particularly simple and easy-to-understand application of the algorithm, and we will walk through it briefly here. In short, the expectation–maximization approach consists of the following procedure:

1. Guess some cluster centers

2. Repeat until converged

a. E-Step: assign points to the nearest cluster center

b. M-Step: set the cluster centers to the mean

Here the “E-step” or “Expectation step” is so named because it involves updating our
expectation of which cluster each point belongs to. The “M-step” or “Maximization step”
is so named because it involves maximizing some fitness function that defines the
location of the cluster centers—in this case, that maximization is accomplished by taking
a simple mean of the data in each cluster.
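A minimal from-scratch sketch of this loop is shown below, using NumPy and scikit-learn's pairwise_distances_argmin helper for the E-step; it is simplified for clarity (no handling of empty clusters and no multiple random restarts, unlike Scikit-Learn's KMeans).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Guess some cluster centers by picking random points from the data
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]

    while True:
        # 2a. E-step: assign each point to the nearest cluster center
        labels = pairwise_distances_argmin(X, centers)

        # 2b. M-step: move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])

        # Repeat until converged (the centers stop moving)
        if np.all(centers == new_centers):
            break
        centers = new_centers

    return centers, labels

# Same kind of data as above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
centers, labels = find_clusters(X, 4)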

The literature about this algorithm is vast, but can be summarized as follows: under
typical circumstances, each repetition of the E-step and M-step will always result in a
better estimate of the cluster characteristics.
