Exam 2

Uploaded by

JUBAYAD

Week-10

1. What are the three differences between primary and secondary data?

2. What are the different types of application programming interface (API)?

a. Public API
b. REST API: historical data
c. Streaming API: real-time data
d. Web crawling

Week-11

• Supervised vs unsupervised learning

In this class, we only learn unsupervised ML, where there are no target
variables. If the data has a target variable, the task is supervised ML; if it
has no target variable, the task is unsupervised ML.

Topic modeling is unsupervised ML.

• Structured vs unstructured data

Structured data: data organized in rows and columns, as in an Excel sheet.

Unstructured data: video, audio, and textual data. Unstructured text can be
converted into structured data using a term-by-document matrix (TDM), e.g. the
bag-of-words representation.
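
As a rough illustration of how a term-by-document matrix imposes structure on text, the sketch below counts words in a small made-up corpus. The documents and the function name term_document_matrix are assumptions for this example, not part of the course material.

```python
from collections import Counter

# Hypothetical three-document corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Bag of words: word order is ignored, only per-document counts are kept.
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(f"{term:<5} {row}")
```

Each row is a term and each column a document, so the free text becomes a table of numbers that standard data mining methods can work on.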

• Text mining concepts

Define the process of text analysis.

Text analytics = information retrieval + information extraction + data
mining + web mining

Text mining = a semi-automated process of extracting knowledge from
unstructured data sources, a.k.a. text data mining or knowledge discovery in
textual databases

To perform text mining, first impose structure on the data, then mine the
structured data.

Text Mining Terminology (Unit of Analysis: Document)

Unstructured or semistructured data

Corpus (plural: corpora) - a collection of documents

Terms - each word in a document is a term

Concepts

Stemming - cutting a word down to its stem so that different forms of the
same word are treated as one term

Stop words (and include words) - words we do not need in our analysis, such
as articles (a, an, the, etc.)

Synonyms (and polysemes)

Tokenisation - the process of breaking up a given text into units called
tokens

Lemmatization - removing inflectional endings only, to return the base or
dictionary form of a word

Term dictionary

Word frequency

Part-of-speech tagging

Term-by-document matrix

Occurrence matrix

Transformation
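
A minimal sketch of three of the terms above (tokenisation, stop-word removal, and stemming), using only the Python standard library. The stop-word list and the suffix rules in crude_stem are toy assumptions for illustration; this is not the Porter stemmer or any library's implementation.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in"}

def tokenize(text):
    """Tokenisation: split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Stemming: strip a few common suffixes (toy rules, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The analysts are mining the documents and mined texts"
tokens = tokenize(text)
kept = remove_stop_words(tokens)
stems = [crude_stem(t) for t in kept]
print(stems)
```

Here "mining" and "mined" both reduce to "min", which is not a dictionary word. That is the difference from lemmatization, which would return the dictionary form "mine".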

The Three-Step/Task Text Mining Process

Input: a variety of relevant unstructured (and semi-structured) data sources
such as text, XML, HTML, etc.

Task 1 - Establish the Corpus: collect and organize the domain-specific
unstructured data.
Output: a collection of documents in some digitized format for computer
processing.

Task 2 - Create the Term-Document Matrix: introduce structure to the corpus.
Output: a flat file called the term-document matrix, where the cells are
populated with the term frequencies.

Task 3 - Extract Knowledge: discover novel patterns from the T-D matrix.
Output: a number of problem-specific classification, association, and
clustering models and visualizations.

Flow: data and text go in, knowledge comes out, with feedback loops between
the tasks.

TF-IDF

A high weight in tf–idf is reached by a high term frequency (in the given document) and
a low document frequency of the term in the whole collection of documents; the
weights hence tend to filter out common terms.
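
The sketch below works through one common variant of the weighting, tf-idf(t, d) = tf(t, d) * ln(N / df(t)), where tf is the raw count of term t in document d, N is the number of documents, and df is the number of documents containing t. The corpus and the function name tf_idf are assumptions for this example; other variants scale tf or smooth idf differently.

```python
import math
from collections import Counter

# Hypothetical corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
for term in ("the", "cat", "mat"):
    print(term, round(weights[0][term], 3))
```

In the first document, "the" appears in every document and so gets weight 0, "cat" appears in two of three documents and gets a small weight, and "mat" appears only here and gets the largest weight of the three.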

Week-12

Sentiment Analysis Process

Two classification dimensions: objective vs. subjective, and negative vs.
positive.

Step 1 - Sentiment Detection
Comes right after the retrieval and preparation of the text documents.
It is also called detection of objectivity: fact [= objectivity] versus
opinion [= subjectivity].

Step 2 - N-P Polarity Classification
Given an opinionated piece of text, the goal is to classify the opinion as
falling under one of two opposing sentiment polarities:
N [= negative] versus P [= positive].

Step 3 - Target Identification
The goal of this step is to accurately identify the target of the expressed
sentiment (e.g., a person, a product, an event, etc.).
The level of difficulty depends on the application domain.

Step 4 - Collection and Aggregation
Once the sentiments of all text data points in the document are identified
and calculated, they are to be aggregated.
Word -> Statement -> Paragraph -> Document


Sentiment analysis workflow (flowchart):

1. Read data and create corpus
2. Parse and pre-process
3. Extract words (bag of words)
4. Tag the documents using the MPQA positive and negative lists:
   Positive = 1, Negative = -1
5. Term frequency (TF, absolute): how many times the + or - words occur
6. Aggregate Tag * TF at the review level
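
A minimal sketch of the lexicon-based scoring described above: each positive word counts +1, each negative word counts -1, and the tags are summed over every occurrence to get a review-level score. The two word sets are a hypothetical stand-in for the MPQA lists, and the function names and sample reviews are assumptions for this example.

```python
import re

# Hypothetical mini-lexicon standing in for the MPQA positive/negative lists.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def review_score(review):
    """Sum +1 per positive word and -1 per negative word occurrence."""
    tokens = re.findall(r"[a-z]+", review.lower())
    score = 0
    for token in tokens:
        if token in POSITIVE:
            score += 1   # tag = +1, counted once per occurrence (tag * TF)
        elif token in NEGATIVE:
            score -= 1   # tag = -1, counted once per occurrence
    return score

def polarity(score):
    """Map the aggregated score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great phone, I love the screen but the battery is bad",
    "Terrible service and poor support",
    "It arrived on Tuesday",
]
scores = [review_score(r) for r in reviews]
labels = [polarity(s) for s in scores]
print(list(zip(scores, labels)))
```

The third review contains no lexicon words, so it scores 0; in the four-step process it would be treated as objective text in Step 1 and not passed on to polarity classification.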

Week 14 – Future trends


• Definitions
• Deep learning – complex neural networks – non-linear relationships
• Transfer learning – using pre-trained models
• Generative AI – model able to create new content
• Reinforcement learning – reward-based ML
• Federated models – local models trained on separate devices or sites are combined into a global model
• IoT
• Machine to Machine communications
• Sensors
• Automated algorithms
