Data Analytics, Chapter 4: Social Media and Text Analytics
tags data = '' for dat print (data. get. in page.find_all("p"): xt ()) Output The first few lines of the output is shown below. Text mining - Wikipedia Text mining, also referred to as text data mining, similar to text analytics, is the process of detiving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”[1] Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising pattems and trends by means such as statistical | pattern learning. iii. Programming languages: Programming languages, like Python and R, and scientific software, like SAS, offer packages for interacting with APIs and have libraries for interacting with most major digital platforms. For example, Tweepy, for Python, and twitteR, for R, have become standard for downloading Twitter data, 5. Key Social Media Analytics Methods The data generated from Online Social Networks is vast, noisy, distributed and dynamic. This requires multi-disciplinary and appropriate techniques such as statistics, data mining, graph- based mining, computational linguistics etc. techniques to analyze such large, complex, and frequently changing social media data. The three primary methods for social media analytics mainly include social network analysis, text analysis/mining, and trend analytics. 5.1 Social Network Analysis Online Social Network (OSN) data is huge, dynamic, noisy, and scattered. Social network analysis and mining allow analyzing such complex and dynamic data to generate required outcomes. SNA emphasizes on analyzing the users in a network (often referred to as nodes) and their connections among each other (often termed as edges). Based on the above structure, it is(> Socia! Media and Text Analytics wision\ 4-11 possible to study the structure of connections among users, relationships between media users, organizations, user communities, users from a particular demographic group etc. ‘There are various issues in social network data mining including link prediction, community detection, influence maximization, expert finding, and prediction of trust and distrust among individuals, j. _ Link prediction: Link prediction in Online Social Networks is all about predicting the possibility of a future association between two nodes, knowing that there is no association between these nodes at present. This is possible by mining the bulk amount of data available in OSNs to find out “who is a friend of whom” or “which products have an association with other products”. By analyzing and gathering useful information about an individual or product, OSNs can infer new interactions among members or nodes of an OSN that are likely to occur in the near future. The link prediction problem is a common feature found in many social networking sites for possible friends’ suggestions as found on Facebook or LinkedIn. The generic link prediction framework is shown in the figure below. - Similarity-based approach : [Similarity ‘Order Top ranked T7L_measures scores pairs Leaming-based approach Learning models Positive instances Figure 4.5: A generic link prediction framework The static social network is fed as an input to the framework. The framework then applies one of two approaches for prediction of future links in the social network. @. 
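To make the framework concrete, the short sketch below, a minimal example assuming the third-party networkx library, scores the currently non-connected node pairs of a small toy network with the Jaccard similarity coefficient and keeps the top-N ranked pairs as predicted links. It follows the similarity-based approach described next; the graph, the similarity measure, and the value of N are illustrative choices only.

import networkx as nx

# Toy static social network: nodes are users, edges are existing friendships.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"),
                  ("B", "D"), ("C", "D"), ("D", "E")])

# Score every currently non-connected pair by neighborhood overlap (Jaccard).
candidate_pairs = list(nx.non_edges(G))
scores = nx.jaccard_coefficient(G, candidate_pairs)

# Order the similarity scores and keep the top-N ranked pairs as predicted links.
top_n = 3
ranked = sorted(scores, key=lambda triple: triple[2], reverse=True)[:top_n]
for u, v, score in ranked:
    print(f"Predicted future link {u}-{v} (similarity {score:.2f})")

In practice, other similarity measures such as common neighbors or Adamic-Adar can be plugged into the same pipeline, while the learning-based approach replaces the ranking step with a trained classifier.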
~ Similarity-based approach: In the similarity-based approach, a similarity score between mon-connected nodes in the network is calculated. A list of the top-N ranked links is prepared for link prediction. ee2. 4-12 /wiion Data Analytics Learning-based approach: In the learning-based approach, a machine learning classification model assigns a label that is binary - positive or negative to each non-connected pair of nodes. A positive value indicates that there is a chance of better connectivity between the non-connected pair of nodes whereas a negative value indicates that there is very little chance of connectivity between the non. connected pair nodes. Community detection: Community or group detection in OSNs is based on studying the OSN structure to cluster individuals into groups by finding which individuals correlate more with each other than with other users. A user belonging to the same community is expected to share similar tastes, likes and dislikes, Such detection of communities can help to further make an assessment about what products, services and activities an individual might be interested in. An illustration of three OSN communities is depicted in the following figure. Various approaches are followed for community detection in social networks. These include the following: a Figure 4.6: Online Social Network Communities Traditional ‘clustering methods: The various traditional clustering methods of community detection is mainly divided into hierarchical, spectral, and partitional methods. Link-based clustering methods: The social network is modeled as a graph with nodes and edges. In link-based community detection methods, the strength of connections between nodes is explored,om Social Media and Text Analytics wsiom\ 4-13 © Bea tesed methods: These community detection methods emphasize the generation of communities based on the common topic of interests. Here, communities that are topically similar are explored. d. — Topic-link based methods: This is a hybrid approach that considers both, the strength of connections between nodes as well as finding communities that are topically similar. Influence maximization: Nowadays, as OSNs are attracting millions of people, people rely on making decisions based on the influence of such sites. For example, influence propagation can help make a decision on which product to purchase, which audio/video to watch, which community to join, and so on. Thus, influence propagation has become vital for effective viral marketing, where companies can convince customers to buy products through the help of those active people in OSNs who can play a key role in influencing others. In case of viral marketing, an influence maximization technique is used to find a set of few influential users of a social network. These users are made to influence others abcut the goodness and usefulness of a product so that it can create a cascade of influence of buying the same product by the users’ friends. However, in influence maximization, the main challenge is to generate the best influential users (the seed set) in the social network. The generic influence maximization framework is shown below: Figure 4.7: A generic influence maximization framework$14 /epee Sex cays Bere, an un-weighted social graph is fed as input to the model. Th uence dif; PES acton propagation traces which are then used by an in J graph is pr of each edge. } to generate the seed set which is considered as ace maximization model. increasingly used to finding so ems faced by people. 
Some social networking tes pi topics and discussions. Experts are people who are Knowledge sharing environment like question and answer communities. The expen Snding is the task of generating and grouping experss of a social network based on his'ter expertise on certain topics. For doing so, the main task adapted in expent findings is to Fetieve a ranked list of top-N experts who are well conversant on a given topic. The basic WSea behind finding such expenss is to take their help in questions answering and problear solving. Prediction of trust and distrust among individuals: With so much user interaction xd coment created, the question of whom and what to tust has become an increasingly important challenge on the web. A user is likely to encounter hundreds of pieces of user- Senerated content each day, and some of it will need to be evaluated for trustworthiness. ‘Trost information can help a user make decisions, sor and filter information. receive recommendations, and develop a context within a community with Tespect to whom 10 wus and why. In these contexts. knowing whom to trust is important, however knowing whom to distrust is equally useful. ‘The following figure shows a simple example of generating trusted users in a social network. The network consists of four nodes (or users) and the solid line connectivity shows the trust between nodes (or users). For instance, node A trusts node C and node D while node B trusts node C. This indicates that there is a strong likelihood that node B will also must node D (shown with dotted line).Social Media and Text Anaiyics wision\ 4-15, culating distrust is a more Calculating distrust is a more challenging and complex problem as compared to finding sted users ina r trusted u ina network, This is because the trust fac tor is considered transitive, i.e., if and user B trusts user C, maximum likelihood that user A ee eres uret By then, it can be concluded that there is a also trusts user C. However, the distrust factor cannot be ed transitive. This is so because if user A distrusts user B, and user B distrusts user C, this does not guarantee that user A will distrust user C. consid 6. Introduction to Natural Language Processing In recent years, personal assistants like Alexa, Siri etc. have gained widespread use in our daily lives. They can "talk" to us just as’ any other human would. To make computers communicate with humans in our natural language is an extremely complicated and challenging task. Computers can easily understand data in a structured, well-defined form. Human languages however, are unstructured. Context, meaning, ambiguity, pronunciations, synonyms, etc. all play an important role in communication using human languages. Natural Language Processing (NLP) is the driving force behind personal assistant apps, translation and others. Natural Language Processing or NLP is a subset of Artificial Intelligence (AD), which is Tesponsible for the understanding of human language by a machine or a robot. It deals with the interaction between humans and computers using a natural language. Applications of NLP Today, NLP powered software helps in our daily lives in many ways. For example: Personal assistants: Smart assistants like Apple’s Siri, Amazon's Alexa, Google assistant etc. recognize patterns in speech, then infer meaning and provide a useful response. Predictive text: Techniques like autocorrect, autocomplete, and predictive text on our browser, IDE, desktop apps (e.g. Microsoft Word) etc. 
Language translation: Today, translators like Google Translate can translate languages more accurately and present grammatically-correct results. Sentiment analysis: Sentiment analysis is able to recognize subtle nuances in emotions and opinions — and determine how positive or negative they are. It is often used to Understand what customers like and dislike about a specific service, feature or product,5 4-16 /wiion Data Anaytcs V. Text analytics: Text analytics converts unstructured text hs into cea data joy analysis using different linguistic, statistical, and machine beating tec niques An Nip tool will typically analyze customer interactions, such as social media comments o, eviews, posts, messages ete, to derive meaningful insights. Text summarization summarizes text, by extracting the most important information. Its main goal is to Simplify the process of going through vast amounts of data, such as scientific papers, news conten, or legal documentation, vi. Chatbots and digital calls: Automated systems direct customer calls to a service Tepresentative or online chatbots, which respond to customer requests with helpfy) information. We all hear “this call may be recorded for training purposes”. The Tecordings are analyzed by an NLP system to learn from and improve in the future. vil. Speech recognition: Speech recognition technology uses natural language processing to transform spoken language into a machine-readable format. Speech recognition systems are an essential part of virtual assistants, like Siri, Alexa, and Google Assistant, for example. However, there are more and more use cases of speech recognition in business, viii, Spam filtering: One of the most common problems these days is unwanted emails, Classifying emails as spam or not spam requires analyzing the emails and looks for specific criteria on which to base the judgments. Text analytics, sentiment analysis and text summarization will be covered in the next sections. 6.1 Text Analytics Online social media data is often used as text in the form of comments, tweets, reviews, blogs, articles, replies, discussions, emails, or feedback provided by several online users, Also, social media and the internet contain millions of documents and web pages with huge volumes of text. This text contains hidden knowledge which, if uncovered can give valuable information and insights to individuals, organizations and businesses. Most of the text data is unstructured and “scattered around the web. If this text data is gathered, collated, structured, and analyzed correctly, valuable knowledge can be derived from it. Organizations can use these insights 10 take actions that enhance profitability, customer Satisfaction, etc, Text analytics is the automated process of Wanslating large volumes of unstructured text int quantitative data to uncover insights, trends, and patterns. It combines machine leaming. statistical and linguistic techniques to analyze large volumes of unstructured text or text that does not have a predefined format, to derive insights and patterns,m\ 4-17 Social Main and Taxt Anaivticn advantages of text analytics elp businesses to understand custome ; Help be understand customer trends, product performance, and service quality This results in quick decision making, enhancing business intelligence, increased productivity, and cost savings Helps researchers to explore a great deal of pre-existing literature in a short time, extracting What is relevant to their study. ji, Assists. 
in understanding general trends and opinions in the society that enable governments and political bodies in decision making. Text analytic techniques help search engines and information retrieval systems to improve their performance, thereby providing fast user experiences. y._ Refine user content recommendation systems by categorizing related content. in the data. vi. Mentify critical topic: There are several different techniques applied in text analytics. These include tokenization, bag of words, word weighting, stemming and lemmatization etc. Tokenization In order to get the computer to understand any text, we need to break the text into smaller parts in a way that the machine can understand. The first part of any NLP process is simply breaking a piece of text into its constituent parts (usually words). This process is called “tokenization". Tokenization is a key (and mandatory) aspect of working with text data. Tokenization is a way of separating a piece of text into smaller units called tokens. It's the Process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. The token occurrences in & document can be used directly as @ vector representing that document. ‘This immediately turns an unstructured string (text document) into a data structure suitable for machine learning, There are different methods and libraries available to perform tokenization. NLTK, Gensim, Keras are some of the libraries that can be used to accomplish the task. NLTKC (Natural Language ToolKit) is # powerful Python package that provides a set of diverse natural languages Algorithms, It is free, open source, easy t0 use, large community, and well documented. NLTK Consists of the most common algorithms such as tokenizing, part-of-speech tagging, stemming, Sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer ‘© analysis, preprocess, and understand the written text.On 4-18 /wision Data Anaiytics First install NLTK in your PC, To install it go to the command prompt and type. pip install nitk Once nitk is installed, type the following commands. This will download all the tools for text analytics, import nitk nltk.download('all') Types of tokenizations i, Sentence tokenization: Sentence tokenization breaks text paragraph into sentences, For word tokenization, we use sent_tokenize (text) Example: from nltk.tokenize import sent_tokenize, word_tokenize text='This is a sentence. This is another sentence’ sent_tokenize (text) Itgives {'This is a sentence.', ‘This is another sentence'] ii, Word tokenization: This breaks a text paragraph into words. For word tokenization, we use word_tokenize (text) It gives ['This',; "is"; 'a', ‘sentence’, '.', 'This!, 'is', ‘another’, ‘sentence'] One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words refer to the new words which are encountered at testing. These new words do not exist in the vocabulary, iii. Tokenization using Regular Expressions: A regular expression is a special character Sequence that helps you match or find other strings or sets of strings using that sequence as a pattern. We can use the re library in Python to work with regular expression. The re-findall() function finds all the words that match the pattern passed on it and stores it in the list.Sect Moca and Tet Aneto Whew AID phe “Ww” represents “any ‘ The represents “any word character” which usually means alphanumeric (letters. numbers) and underscore (_). 
‘+’ means any number of times. So [\w’]+ signals that the code should find all the alphanumeric characters until any other character is encountered. Examples: The following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences: rom nltk.tokenize import RegexpTokenizer text = 'I won $1000 as 1st prize in the competition on 15/6/2022 tokenizer = RegexpTokenizer('\w+| [\d\.J+I\$+') tokenizer.tokenize (text) | Output ", ‘won’, '$', 1000, ‘as’, "Ist, ‘prize’, else. text = "Hello goodBye TEST split" tokenizer = RegexpTokenizer('[A-Z] \w+') tokenizer.tokenize (text) ‘A dataset is referred to as corpus in nltk. A corpus is essentially a collection of sentences which serves as an input. For further processing, a corpus is broken down into smaller pieces and processed. There are several datasets which can be used with nltk. To use them, we need to download them. We can download them by executing this: nltk. download () Bag of Words The most basic concept in NLP is that of a “bag-of-words,” also called a frequency distribution. Whenever we apply any algorithm in NLP, it works on numbers. We cannot directly feed our text into that algorithm. Hence, Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps 4 count of the total occurrences of most frequently used Words. Thus, a bag of words is a representation of text that describes the occurrence of words Within a document.2 4-20 /vision Data Anaytics Let us consider an example of three sentences. Sentence 1: Python is the best language Python is good Sentence 2: Python is good for machine learning Sentence 3: [love Python The bag of words for the above three sentences would look like: best | for | good | 1 | is | language | leaming | love machine | Python | the Sentencei |i [0 | 1 [ole 1 0 0 0 gees Sentencez] 0 [1 | 1 fold 0 1 0 1 if 0 Sentence3 0 [0 | 0 [ilo 0 0 1 0 Aeszh 28 Sentence 1 can be represented in the vector form as: [1,0,1,0,2,1,0,0,0,2,1] To create a bag of words, ‘some text preprocessing is necessary. Typically, we convert text to lower case, remove all non-word characters and remove all punctuations. After this, the bag of words can be created. Step 1: Step 2: Output Apply sentence tokenizer corpus = nltk.sent_tokenize (text) The result is as follows: ("Python is the best language Python is gocd.', ‘Python is good for machine learning. "I love Python'} Preprocessing the text As you can see, our text contains uppercase letters and Punctuations. So, we first convert our text into lower case and then remove the Punctuation from our text. Removing punctuation can result in multiple empty spaces. We will remove the empty spaces from the text using regex. for i in range(len(corpus }): corpus[i] = corpus [i] .lower() corpus[i] = re.sub(r'\W'," ',corpus[i}) corpus[i] = re.sub(r'\s+',* ',corpus[i}) print (corpus) [python is the best language python is good ', ‘python is good for machine learning ', ‘love python] As you can see, all letters are lowercase and there are no punctuation symbols. Now, we will generate bag of words for this corpus.aN Social Media and Text Analytics wision \ 4-21 Step 3: Generate vocabulary and frequency distribution Step 4: The next step is to tokenize the sentences in the corpus and create a dictionary that contains words and their corresponding frequencies in the corpus. 
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1
print(wordfreq)

Here, wordfreq is a dictionary which stores the words as keys and their frequencies as the values.

{'python': 4, 'is': 3, 'the': 1, 'best': 1, 'language': 1, 'good': 2, 'for': 1, 'machine': 1, 'learning': 1, 'i': 1, 'love': 1}

Generate the vector representation of the sentences in the text

The final step is to convert the sentences in our corpus into their corresponding vector representation. For each word in the wordfreq dictionary, if the word exists in the sentence, its count is added to the vector; otherwise 0 is added.

sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    vec = []
    for w in wordfreq:
        vec.append(sentence_tokens.count(w))
    sentence_vectors.append(vec)
print(sentence_vectors)

Output
[[2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]

Word weighting

In large documents, the number of words is very large. There are words in a document that occur many times, for example, words like "the", "is", "of", and so forth, but they may not be as important as other words that appear less frequently. In such cases, we have to determine which words are worth counting in the word vector and what weight we should assign to them.

Term weighting or word weighting is a procedure that takes place during the text indexing process in order to assess the value of each term to the document. It is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness. The following techniques are used for calculating the term weights:

TF-IDF: "Term Frequency-Inverse Document Frequency" (TF-IDF) is one of the most popular term-weighting schemes today; most text-based recommender systems in digital libraries use tf-idf. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The TF-IDF weight is composed of two terms:

a. Normalized Term Frequency (tf): The term frequency indicates the number of occurrences of a particular term t in document d. The term frequency tf of a term t in document d is given as:

tf(t, d) = N(t, d) / ||D||

where N(t, d) = number of times the term t occurs in document d, and ||D|| = total number of terms in the document.

b. Inverse document frequency (idf): The IDF measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones. This decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N = |D| is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears (i.e. tf(t, d) ≠ 0). If the term is not in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.

The above two, i.e., TF and IDF, can be combined to calculate a term's tf-idf, which gives the frequency of a term adjusted for how rarely it is used. The intuition behind TF-IDF is that rarer words are more important.
tfidf(t, d, D) = u(t, d) - idf(t, D) Let's take an example to get a clearer understanding Document 1: Itis not going to rain today (7 words) Document 2: Tam not going out today (6 words) Document 3: Iam going to college (5 words) Here, N = 3 documents The TF, IDF and TF-IDF calculations are shown in the table below: TF ‘TFADF SO tor 2 me Di | b2 | 03 ‘Am | /7=0_ | 1/6=0.17 .2 | Log(ai2)=0.18| 0 | 0.030 | 0.036 o6=0 | 115-02.| Log(ai)=048] 0 | 0 | 0.096 116=0.17 | 1/5=02 | Log(aa-0 [0 | 0 | 0 1/6=0.17 | 1/5=0.2 | Log(s/2)=0.18 | 0 _| 0.030 | 0.036 076-0 | 0/5=0 | Log(st)= 0.48 | 0.067| 0 076-0 | 0/5=0 | Log(3/1)= 0.48 | 0.067] 0 1/6=0.17 | 0/5=0 | Log(a/2)= 0.18 | 0.025 | 0.030 out_| 070 _| 176-017 | 0-0 | togis)=0.48 | 0 | 0.082 Rain | 17=0.14| 0/6=0 | 05-0 | Log(3/t)= 0.48 | 0.067| 0 To | o7=0 | 06-0 | 15-02| Log(si)-048| 0 | 0 | 0. today | 170.14 | 1/6=0.17 | 0/5=0 | Log(a/2) -0.18 | 0.025 | 0,030 2|glejele/ele From the TF-IDF values in the table above, we can infer that even though the word “going” appears in all three documents, it is not an important word. The words “is”, “it” and “rain” are important in document 1, “out” is most important in document 2 and “to” and “college” are most important in document 3. Implementation ‘There is no direct support for TF-IDF in NLTK. However, the TFIDFVectorizer in the sklearn library can be used for this purpose. The code is given below. Here, we use the CountVectorizer with the fit_transform() method, The result is converted into an array and passed to a pandas dataframe for display.on 4-24 /wsion Data Anaitics Document1 = 'It is not going to rain today’ Document2 = 'T am not going out today’ Document 3 = 'r am going to college’ Doc = [Document1, Document2, Document 3] i from sklearn, feature_extraction.text import Count Vectorizer #Get the frequency counts Import the vectorizer vectorizer = CountVectorizer() X = vectorizer. fit_transform (Doc) print (vectorizer.get_feature_names()) Doc_Term_Matrix = pd.DataFrame(X.toarray(),columns= vectorize r.get_feature_names ()) To get the TF-IDF values, we use the Tfidf Vectorizer class. import pandas as pd . from sklearn,feature_extraction.text import T£idfVectorizer vectorizer = TfidfVectorizer() tfidf = vectorizer,fit_transform(Doc), Doc_Term Matrix = pd.DataFrame(tfidf.toarray(),columns= vecto rizer.get_feature_names ()) print (Doc_Term Matrix) output Sa canta gains te ts es eal meta © 0.000000 c.coaco0 0261840 0.43803 442509 0.337285 coc0000 0.449500 o.sa7zas 0.97295 1 2432087 0.000000 0.96315 caG0000 oc00000 0.439067 0.568481 ‘0.000000 0.000000 0 «85067 | 2 omoen o6o745 O28 osm ceaxm owe catte eumee tanen nen "ol One thing to note Is that the tf welghts are normalized. You can easily check this by ‘squaring and adding the weight values along each row of the document-term matrix, n-Grams The bag of words does not take into consideration the order of the words in which they appear in 4 document, and only individual words are counted. In some cases, the order of the words might be important. N-grams captures the context in which the words are used together. For example, it might be better to consider two words like “New York” instead of breaking it into individualor Social Media and Text Analytics wsion\ 4-25 words like “New” and “York” An N-Gram is a sequence of N-words in a sentence. Here, N is an integer which st ands for the number of words in the sequence. For example, if we put N=1, then it is referred to as “uni-gram, If you put N=2, then it is a bi-gram. 
If we substitute N: then it is a tri-gram, Example. Consider the sentence “I like the N”. The bi-grams are: “I like”, “like the”, “the rain” For better text analytics, you can create a bag-of-words out of n-grams, run TF-IDF on them. The NLTK ngrams funetion can be used to gene from nltk import ngrams te n-grams of the text. sentence = 'I like the rain! [ output | bigrams = ngrams(sentence.split(! '), n=2) bio | for x in bigrams: eed print (x) | Stop Words Natural languages like English contain main words that occur very frequently such as “the”, “for”, “and”, etc. These words occupy a lot of space in the Bag of words or n-grams. Processing these words increases the time required for the text analysis. Search engines are programmed to ignore these words. Immediately, we can recognize that some words carry more meaning than other words. We can also see that some words are just plain useless, and are filler words, Such words are called "stop words". Stop words are words that just contain no meaning, and we want to remove them. We can create our own stop word list or use the standard list of stop words in the nltk library. from nltk.corpus import stopwords print (stopwords..words ('english')) oho, “she her, hers, Herself, ‘its’, ‘itself, ‘them’, ‘their’, ‘thei a’, Which’, who’, whom’, ‘this’ ‘that’, “that these’ those ‘am, ‘is, ‘are’, 'been’, ‘being’, ‘have’, ‘has’, ‘had’, ‘having’, oe ty Src, trough’ dung betoe afr bove’, blow, ‘against’, ‘between’, ‘into’, ‘throug! in re’, ' , ‘to’, Fe ae aac, On, off ‘over, ‘under, ‘egal, turner, hen’ ‘ones’ hare’, ere: II, ‘any’, ‘both’, ‘each’, Yew, ‘more’, ‘most’ ‘other’, ‘some’, ‘such’, ‘no’, . ‘can’, ‘will, ust’, ‘don’, “dont, ‘should’, ahh ", “aren't, ‘couldn’, “couldnt, ‘didn’, "didn't, ‘dower’, "dosent, haven’, "havent", ‘isn’, “isn't’, ‘ma’, ‘mightn’, *mightrit", ‘mustr', “mustn't’, ‘needn’, “needn'’, ‘shan’, “shan‘’, ‘shouldn’; "shouldn't", ‘wasn’, "wasn't", ‘weren’, ‘weren't’, won’, “wont”, ‘wouldn’, "wouldn't"4-26 /wiion Data Ananytics To remove stop words from our text, we must tokenize the text and check if the words belong to the set of stop words, If yes, it is removed. The following code removes the stop words from the given text import nltk from nitk.corpu! stop_words = set (stopwords .words (english')) from nitk.tokenize import word tokenize , text = "It seems like it is going to rain today. I will no text_tokens = word_tokenize (text, lower ()) tokens_without_sw = (word for word in text_tokens if not word in is import stopwords (tokens_without_sw) Output [seems’, ‘ike’, ‘going’, ‘rain’, ‘today pr Stemming and lemmatization Consider two sentences given below: i, He was swimming. ii, He was taking a swim. The meaning of the above two sentences is the same. It is very easy for humans to understand this, but for machines, both sentences will be treated as different. Hence, text analytics is a very complex process because the same word can have different meaning depending upon the context or the same meaning can be conveyed using different words. Stemming and Lemmatization are two techniques that are widely used in tagging systems, indexing, SEOs, Web search results, and information retrieval. Stemming and lemmatization are methods used by search engines and chatbots to analyze the meaning behind a word. For example, whenever we run a search for some text, we get the results not only for the exact text but also for other possible forms of the words in the text. 
For example, when we search for fishes, we may get results for fish, fishing, fisherman ete. In grammar, inflection is the modification of a word to express different grammatical categories Such as tense, case, voice, aspect, person, number, gender, and mood, For example, playing, Played, plays are all inflections of the word play. The aim of both processes is the same: reducing the inflectional forms of each word into a common base ot root. Stemming Stemming is a technique used to extract the base form of the words by removing affixes from them. The stem is just the part of a word that doesn’t change despite the inflection of the word. Ita Soci Meda and TextAnavice wisn. 4-27 is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat, Stemming reduces the inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language. For the words “produce,” “producer,” “product,” “production,” and “producing,” the stem will be “produc: Different stemmers are available for different languages in Python nltk. For the English language, you can choose between PorterStammer or LancasterStammer. from nltk.stem import PorterStemmer from nltk.stem import LancasterStemmer porter = PorterStemmer() lancaster=LancasterStemmer () nitk.stem is a package that performs stemming using different classes. PorterStemmer is one of the classes, so we import it using the above lines of code. Now, we create a wordlist and find the stem of the word using both the stemmers. porter = PorterStemmer() lancaster=LancasterStemmer () wordlist = ["frien "friendship", "friends", "running", "runs", "ran", wcrigs", “ciying", "eating", "eaten", "studies", "otudying"] print ("{0:20}{1:20}{2:20}". format ("Word", "Porter Stemmer", "Lancaster St emmer") ) for word in word_list: print ("{0:20}{1:20}{2:20}".format (word, porter. stem(word) , lancaster. stem(word) )) Output Word Porter Stemmer Lancaster Stemmer | friend friend friend | 1 friendship friendship friend friends friend friend running run z run run | | a F ran ran ran cries ori ri crying cri oy eating eat eat eaten eaten eat studies studi study Studying studi study NLTK has RegexpStemmer class with the help of which we can construct our own stemmer. For non-english language, the SnowballStemmer class is used. Stemming uses the stem of the word, while lemmatization uses the context in which the word is being used.2 4-28 /wSion Data Analytics Lemmatization Text analytics involves another important technique called “lemmatization.” The “lemma” for g word is the base or root word from which it is derived. The output we will get after lemmatization is called ‘lemma’, which is a root word rather than root stem. For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words. Because Jemmatization returns an actual word of the language, it is uséd where it is necessary to get valid words. The NLTK Lemmatization method is based on WorldNet's built-in morph function. Python NLTK provides WordNet Lemmatizer that uses the WordNet Database to lookup lemmas of words. The following code uses the WordNetLemmatizer to generate the lemma of words ina wordlist. 
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
word_list = ["friend", "friendship", "friends", "running", "runs", "ran",
             "cries", "crying", "eating", "eaten", "eats", "studies", "studying"]
print("{0:20}{1:20}".format("Word", "Lemma"))
for word in word_list:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

Output
Word                Lemma
friend              friend
friendship          friendship
friends             friend
running             running
runs                run
ran                 ran
cries               cry
crying              cry
eating              eating
eaten               eaten
eats                eats
studies             study
studying            studying

As seen above, the lemma for cries and crying is cry. But studying, eating etc. do not give the right results. This is because the NLTK lemmatizer requires POS (Parts of Speech) information to be provided explicitly, otherwise lemmatization will not give the right results. Extracting the lemma for a given word requires knowing its part of speech, which in turn requires analysis of the surrounding sentence. The problem with lemmatization is that, in general, it is extremely computationally expensive, because it requires a certain amount of understanding of the text.

Difference between Stemming and Lemmatization

Stemming | Lemmatization
Stemming extracts the base form of a word by removing or replacing word suffixes. | Lemmatization gives the root word or lemma from which the word is derived.
Stemming is faster because it chops words without knowing the context of the word in the given sentences. | Lemmatization is slower as compared to stemming since it needs to know the context of the word before proceeding.
The word may not be a valid word. | The word is a valid word.
It is a rule-based approach. | It is a dictionary-based approach.
Accuracy is less. | The accuracy is more.
Stemming is preferred when the meaning of the word is not important for analysis. | Lemmatization is preferred when the meaning of the word is important for analysis.
Example: "studies" => "studi" | Example: "studies" => "study"

Synonyms

When we are trying to analyze the text, the words themselves are less important than the "meaning." This means that we might want to replace related terms such as "big" and "large" by a single identifier. These identifiers are often called "synsets," for sets of synonyms. Many NLP packages use synsets to analyze a piece of text. The simplest use of synsets is to take a piece of text and replace every word with its corresponding synset. Although this may seem simple to carry out, the language characteristics may not make it feasible. For example, a word may have different meanings in different contexts. Hence, simply replacing it by another word may not be semantically correct.

Example:
from nltk.corpus import wordnet
syns = wordnet.synsets("see")
for synonym in syns:
    print(synonym.lemmas()[0].name())

Output
see
understand
witness
learn
watch
interpret
examine
experience

From the above example, witness, interpret, examine etc. are synonyms of see. In order to replace a word by its synonym, it is important to know the parts of speech.

Parts of Speech Tagging

Words in text belong to different categories like nouns, verbs, adverbs, adjectives etc. Language processing tasks require the words to be classifi
ed into the correct category for better noun, pronoun, verb, adverb, article, jection, and a word need to be fit into the proper part of The process of classifying words into their Parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Example: from nltk import word_tokenize, pos_tag sentence = "I like sports" Print (nltk.pos_tag(word_tokenize (sentence) )) Here, PRP is preposition, VBP is Verb, Present tense, NN: the POS tags can be seen by calling the function: nltk.help.upenn_tagset () 'S is noun, plural. The complete list of Let us now apply POS tagging to our Wordnet Lemmatizer example. As seen earlier, words did not yield the correct lemma because they were not tagged appropriately. Words like ‘eating’, ‘studying’ etc. remained the same after lemmatization. This is because these words are treated as a noun (default tag) in the given sentence rather than a verb. To overcome this, we use POS (Part of Speech) tags and provide the correct ‘part-of-speec! tag as the second argument to Jemmatize(). As seen above, the nitk,pos_tag() returns a tuple with the POS tag. The key here is to map NLTK’s POS tags to the format wordnet lemmatizer would accept. The default POS tag is NOUN and that it does not output the correct lemma for a verb. If the POS tag starts with N, we map it to wordnet NOUN. If it starts with J, then we map it to wordnet ADV and so on, For this, we create.a dictionary that maps the POS tags to the wordnet lemmatizer.(ds Social Media and Text Analytics vision 4-31 The code is as follows: from nltk.corpus import wordnet from nitk.stem.wordnet import WordNet Lemmatizer from nltk import word tokenize, pos ag F tag_map = ("J": wordnet ..ADJ, "NN": wo OUN, "Vv" t:. VERB Bene Wordnet .ADV) wordnet .NOUN, : wordnet’. Z wordlist = ("friend", "friendship", "friends", *running", "runs", "ran", ‘cries", "crying", "eating", "eaten", "eats", "studies", "studying"] print (nltk.pos_tag(word_list)) lemma_function = WordNetLemmatizer() print ("(0:20}({1:20}". format ("Word", "Lemma") ) for token, tag in pos_tag(word_list): jemma = lemma_function.lemmatize (token, tag_map(tag{0]]) print ("{0:20){1:20}". format (token, lemma) ) Output (friend, ‘NN’, (friendship’, 'NN’), (riends’, 'VBZ’, (‘running’, 'VBG), (runs’, ‘NNS)), (‘ran’, ‘VD’ (cries', 'NNS’), ‘crying’, 'VBG'), (‘eating’, 'VBG'), ‘eaten’, J}, (‘eats’, 'NNS)), ‘studies’, 'NNS'), (‘studying’, ‘VBG')] Word Lemma friend friend friendship friendship friends friends running run runs run | ran run i cries cry | i crying ory i i eat i eaten | eats study study | Now the correct lemma is returned by the lemmatizer for the words. 7. Sentiment Analysis Sentiment means feeling, emotion, or opinion. A sentiment can be positive, negative or neutral. Humans can intuitively understand the sentiments from spoken words, messages, text, feedback or reviews, Using NLP, statistics, or machine learning methods bd extract, Mdeotify, oF otherwise characterize the sentiment content of a text unit is called as Sentiment analysis. It is sometimes Teferred to as opinion mining. The key aspect of sentiment analysis ato) analyze « body of text for understanding the opinion expressed by it. 
Typically, we quantify this sentiment with a ™ See Son 4-32 /wibion Data Anaivics Positive or negative value, called polarity, The overall sentiment is often inferred as positive, neutral or negative from the sign of the polarity score. Figure 4.9: Sentiment Some of the questions which Sentiment analysis attempts to answer are: i, _ Is this product review p ive or negative? ii, Is the customer satisfied or dissatisfied with the service? ili, Based on a sample of tweets, how are people responding to event? iv. _ How have bloggers’ attitudes about the government changed since the election? Sentiment analysis can help you determine the ratio of positive to negative reactions about a Specific topic. You can analyze text, such as comments, tweets, feedback and product reviews, to obtain insights from the people. Sentiment analysis can be classified as: i, Document-level sentiment analysis: In this level, the e1 unit and the whole document is classifi ii, Sentence-level sentiment analysis: Sentence level. Bach sentence of the negative opinions it expresses. iii, Aspect-level sentiment analysis: ntire document is chosen as one ied as positive, negative or neutral, 1m this level, the entire document is partitioned into document is then classified based on the positive orSoci Meda and Text Anaics wistn\ 4-33 Decision Tree Ciassitiers Linear l Ciass.hers —_ Lo Li Nearal Newone |_J Rueased | | | Ciassitiers ——— r>[_Naive Bayes Eee Bayesian Network Classifiers Lf Maximum Entropy Ly oP eee LfSemantie For the supervised machine learning approach, we typically need pre-labeled data. If no labeled dataset is available, then unsupervised techniques are used for predicting the sentiment. In the lexicon based approach, knowledge bases, ontologies, databases, and lexicons are used that have detailed information, specially prepared just for sentiment analysis. A lexicon is a dictionary, vocabulary, or a book of words. In our case, lexicons are special dictionaries or vocabularies that have been created for analyzing senuments. Most of these lexicons have a list of positive and negative polar words with some score associated with them, and using various techniques like the position of words, surrounding words, context, parts of speech, phrases, and so on, scores are assigned to the text documents for which we want to compute the sentiment. ‘After aggregating these scores, we get the final sentiment. Various popular lexicons are used for sentiment analysis, including AFINN lexicon, Bing Liu’s lexicon, MPQA subjectivity lexicon, ‘SentiWordNet, VADER lexicon, TextBlob lexicon etc. One of the most popular rule-based sentiment analysis models is VADER. VADER, or Valence ‘Aware Dictionary and sEntiment Reasoner, is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media. Since it is tuned for social media ‘content, it performs best on the content you can find on social media.Je 4-34 /usion Data Analytics Let us consider an example of a Lexicon-based sentiment analysis on a set of user feedback. First we have to import the libraries. import nitk nltk.download('vader_lexicon') from nltk.sentiment.vader import SentimentIntensityAnalyzer sia = Sentiment IntensityAnalyzer () We use the polarity_score calculator of the SentimentIntensity Analyzer model. This model gives us four scores in a dictionary: (i) Negativity, (ii) Positivity, (iii) Neutrality score of the sentence, and finally, (iv) Compound sentiment score of the sentence. 
We create a list of strings representing feedback. For each feedback, we display the polarity scores and the overall sentiment. sentences.= ['The product is terrible’, 'The service was very slow', 'I am extremely satisfied with this product’, 'The movie is entertaining’ , ‘Just booked two days at this hotel’, 'The app is not user friendly ' 1 for sentence in sentences: print ("Feedback -' + sentence) pol_dict’= sia.polarity_scores (sentence) print (pol_dict) if pol_dict ['compound'] >= 0,01 : print ("Overall sentiment: Positive") elif pol_dict ['compound'] <= -0.01: print ("Overall sentiment: Negative") else : print ("Overall sentiment [ Output Feedback -The product is terrible {neg': 0.508, ‘neu’: 0.492, ‘pos’: 0.0, ‘compound’: -0.4767} Overall sentiment: Negative Feedback -The service was very slow {{neg': 0.0, ‘neu’: 1.0, ‘pos’: 0.0, ‘compound: 0.0} Overall sentiment: Neutral Feedback -| am extremely satisfied with this product {'neg’: 0.0, ‘neu’: 0.618, ‘pos': 0.382, ‘compound’: 0.4754} Overall sentiment: Positive Feedback -The movie is entertaining {neg': 0.0, ‘neu’: 0.508, ‘pos’: 0.492, ‘compound’: 0.4404) Overall sentiment: Positive Feedback -Just booked two days at this hotel {{neg’: 0.0, ‘neu’: 1.0, ‘pos': 0.0, ‘compound’: 0.0} Overall sentiment: Neutral Feedback -The app is not user friendly {'neg’: 0.345, ‘neu’: 0.655, ‘pos’: 0.0, 'compound': -0.3875} Overall sentiment: Negative Neutral")Social Macia and Text Analytics wion\ 4-35 8. Document or Text Summarization Millions of documents, web pages and websites exist on the internet today, Going through a vast amount of content becomes very difficult to extract information on a certain topic. Often you are unable to find the right content that you need. There is a lot of redundant and overlapping data in the articles which leads to a lot of wastage of time If done manually, the entire process of text summarization will be a very complex and tedious task, and sometimes maybe almost impossible within a short time frame. In such a case, one easy hhine learning algorithms that can be trained to identify the important ordingly produce a summary of the document. solution is to use 1 sections of a document and Text summarization is an NLP technique that extracts text from a large amount of data. It helps in creating a shorter version of the large text available. Advantages i, Reduces reading time. Helps in better research work. : Increases the amount of information that can fit in an area. iv. When researching documents, summaries make the selection process easier. ‘Automatic summarization improves the effectiveness of indexing. vi Automatic summarization algorithms are less biased than human summarizers. Disadvantages i, Might miss out on certain sentences affecting the summary’s meaning. There are mainly two types of text summarization approaches followed in text analytics:2, 4-36 /usion — Data Anaivtios Extractive Summarization: This approach uses a machine learning approach to assign weights to sentences in the text and use.the results to generate text summaries. Sentences are divided into sections and weights are assigned to the sections. Each section is then ranked based on the importance and relevance of the text. The extracted sections are combined to ultimately form a text summary. In Extractive Summarization, we are identifying important phrases or sentences from the original text and extract only these Phrases from the text. These extracted sections would be the summary. 
Extractive Summarizer In the Abstractive Summarization approach, we work on Generating new sentences from the original text. This approach uses a deep learning approach to paraphrase a document to generate the text summary. The sentences generated through this approach might not even be present in the original text. One advantage of the abstraction-based text summarization approach compared to the extraction-based text summarization approach is that there are fewer grammatical inaccuracies found in the generated summarized textual output. Text Sentence 1 Sentence 2 Sentence 3 Sentence 4 Summary Sentence 2” Sentence 4 ii, | Abstractive Summarization: Text Sentence 1 Sentence 2 Sentence 3 Sentence 4 Comparison Steps Obtain data Text preprocessing iii. Sentence and Word tokenization iv. Find the word frequencies v. Caleullate the weighted frequencies of the sentencesvi. vii. Social Media and Text Analytics waion\ 4-37 Sort sentences in descending order of weights Summarize the article Let us now apply these steps to summarize a document. We will use Extractive summarization technique. i, Obtain the data Let us consider the following paragraph for summarization. Alternatively, you can obtain the contents from a website or any document. In the following example, we have taken a text paragraph stored in variable “text”, text = """ With the present explosion of data circulating the digital space, which is mostly non-structured textual data, there is a need to develop automatic text summarization tools that allow people to get insights from them easily. Millions of web pages and websites exist on the Internet today. Going through a vast amount of content becomes very difficult to extract information on a certain topic. Currently, we enjoy quick access to enormous amounts of information. However, most of this information is redundant, insignificant, and may not convey the intendéd meaning. For example, if you are looking for specific information from an online news article, you may Nave to dig through its content and spend a lot of time weeding out the unnecessary stuff before getting the information you want. Therefore, using automatic text Summarizers capable of extracting useful information that leaves out inessential and insignificant data is becoming vital. The better way to deal with this problem is to summarize the text data which is available in large amounts to smaller sizes. Implementing summarization can enhance the readability of documents, reduce the time spent in researching for information, and allow for more {Information to be fitted in a particular area. Text summarization is the technique for generating a concise and precise summary of voluminous texts while focusing on the sections that convey useful information, and without losing the overall meaning. Automatic text summarization aims to transform lengthy documents into shortened versions, something which could be difficult and costly to undertake if done manually. Machine learning algorithms fan be trained to comprehend documents and identify the sections that convey important facts and information before producing the required sunmarized texts. For example, the image below is of this news article that has been fed into a machine learning algorithm to generate @ summary. Text Preprocessing In this step, we remove extra brackets, symbols etc. in the text, This is done using regular expression. In our example, since we have considered a simple paragraph, the preprocessing step need not be applied. 
However, for general text summarizations, the preprocessing is very important.he 4-38 /ision Data Analytics #Removing Square Brackets and Extra Spaces import re text = re.sub(r'((0-9]*]', ' ', text) text = re.sub(r's+', ' ', text) #Removing special characters and digits formatted_text = re.sub('{*a-zA-Z]', | ', text) formatted_text = re.sub(r's+', ' ', formatted_text) iii, Calculate the word frequencies and weighted word frequencies First, we calculate the frequency of occurrence of each word. Finally, to find the weighted frequency, we can simply divide the number of occurrences of all the words by the frequency of the most occurring word. from nltk.corpus import stopwords from nltk.tokenize import word_tokenize, sent_tokenize “stopWords = set (stopwords.words ("english") ) words = word_tokenize(formatted_text) #Creating a frequency table of words wordfreq = (} for word in words: if word in stopWords: continue if word’ in wordfre; ‘Sample output wordfreq[word] += 1 else: {Automatic’: 0.1, wordfreq[word] = 1 | ‘Currently’ #Compute the weighted frequencies ‘For: 0.2, maximum_frequency = max (wordfreq.values ()) oe for word in wordfreq.keys(): wordfreq{word] = (wordfreq[word) /maximam_frequency) iv. Calculate the sentence scores * In the code below, we first create an empty sentenceValue dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding Scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words. We then check if the word exists in the wordfreq dictionary. If it exists, the frequency is added to the sentence value. #Creating a dictionary to keep the score sentences = sent_tokenize(text) sentenceValue = {} for sentence in sentences: for word, freq in wordfreq. items( if word in sentence. lower( if sentence in sentenceValue: sentenceValue[sentence] += freq tof each sentencea Soclal Media and Text Anaiyics vision \ 4-39 else: senter ceValue(sentence) = freq y. _ Pick the top ‘n’ sentences to generate the summary The heapg library is a priority queue, We use this to pick the top n sentences for the summary. import heapq summary = '! summary_sentences = heapq.nlargest (4, sentenceValue, key=senten! alue.get) summary = ' '.join(summary_sentences) print (summary) After running the above code, the summary dbtained is: Text summarization is the technique for generating a concise and precise summary of voluminous texts while focusing on the sections that convey useful information, and without losing the overall meaning. Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized texts. For example, if you are looking for specific information from an ‘online news article, you may have to dig through its content and spend a’ lot of time weeding out the unnecessary stuff before getting the information you want. Therefore, using automatic text summarizers capable of extracting useful information that leaves out inessential and insignificant data is becoming vital. 8.1 Trend Analytics In many cases, we are interested in predicting future events. For example, stock or housing prices in the near future, temperatures in the next week or month ete. Trend analytics is the widespread practice of collecting information and attempting to spot a pattern. Trend analytics mainly involves determining the possible drifts or trends over a period of time. 
Usually, historical trends are analyzed to determine future trends for a given phenomenon or feature. So, trend analytics is used to predict future events. The main task in trend analytics is to analyze data stored during a period of time and identify a trend in this data with time, The purpose of trend analysis is to spot a prevalent trend within a user group and/or to determine how a trend developed/would develop over time. There are several applications of trend analytics - for instance, predicting the stock market, predict sales of a particular item, etc. Various tools are used for trend analysis ranging from simple linear based tools such as linear regression to many complex non-linear toolsOn 4-40 /vibion Date Anaiytics There are three types of methods in trend analysis Temporal trend analysis: Temporal analysis deals with time, It allows one to examine and model the change in the value of a feature or va ble in a dataset over time. An example of temporal trend analysis is the time-series analysis. Time series analysis deals ‘with statistical data that are placed in chronological order, that is, according to time. It deals with two variables, one being time and the other being a particular phenomenon, Time series can be constituted by three components - Short-term movement (periodic changes), long-term movement (secular trend), and random movement (irregular trend) ‘One of the important characteristics of the time series is its stationarity. A time series is considered stationary if its behavior does not change over time. The statistical properties ‘of the stationary time series such as the mean and variance remain constant over time. However, most of the time-series problems that are encountered are non-stationary. Non- stationary time series do not have a Constant mean or variance and follows a trend by either drifting upward or backward. Advantages a. Helpful in figuring relationships between user groups from different generations. b. Helpful in predicting future events based on those of the past. Limitations a Historical data may not be an accurate representation of trends. b. The trend may not be replicable. Geographic trend analysis: Many applications tely on location information to provide Personalized and localized services to users, For example, delivery services, weather related services etc. Location based services and Geographic trend analysis is mainly involved in analyzing the trend of products, users ot other elements within or across geographic locations. This helps in finding the pattern of trends within a specific geographic location and uses various Seographic related factors such as culture, climate. food habits, etc. This kind of data is comparatively easy to analyze and interpret. Advantages a. ~ Easy and reliable. b. Helpful in figuring our commonalities . and differ user groups belonging to the same as well as different oe Bs beeen= o> \ Social Media and Text Analytics vision 4-41 Limitations & __ The analysis is limited to geography b. May be influenced by factors such as culture, etc specific to the user groups of the geography, Intuitive trend analysis: Intuitive approach to trend analysis is more often used when there is a lack of large statistical data Tequired to carry out trend analysis. The analyst analyzes the trend within or across user’ groups based on some logical explanation, behavioral patterns or other elements perceived by a futurist. It is helpful for prediction- making without the need for large amounts of statistical data. 
ii. Geographic trend analysis: Many applications rely on location information to provide personalized and localized services to users - for example, delivery services, weather-related services, etc. Geographic trend analysis is mainly involved in analyzing the trend of products, users or other elements within or across geographic locations. It helps in finding the pattern of trends within a specific geographic location and uses various geography-related factors such as culture, climate, food habits, etc. This kind of data is comparatively easy to analyze and interpret.

Advantages
a. Easy and reliable.
b. Helpful in figuring out commonalities and differences between user groups belonging to the same as well as different geographic locations.

Limitations
a. The analysis is limited to geography.
b. May be influenced by factors such as culture, etc., specific to the user groups of the geography.

iii. Intuitive trend analysis: The intuitive approach to trend analysis is more often used when there is a lack of the large statistical data required to carry out trend analysis. The analyst analyzes the trend within or across user groups based on some logical explanation, behavioral patterns or other elements perceived by a futurist. It is helpful for making predictions without the need for large amounts of statistical data. However, some issues with this methodology are the overreliance on the knowledge and logic provided by futurists and researchers, which makes it prone to researcher bias.

Advantages
a. Helpful in making predictions that are not backed by large amounts of statistical data.

Limitations
a. There is overreliance on the knowledge and logic of researchers.
b. It is prone to researcher bias.
c. It is the most difficult form of trend analysis.

Challenges to Social Media Analytics

Volume and Velocity as a Challenge: A huge volume of social media data is generated every second. Capturing, storing and analyzing the millions of media items that appear every second is a real challenge. This requires proper storage, organization and sophisticated tools for analysis.

Diversity as a Challenge: Social media users and the content they generate are extremely diverse, multilingual, and vary across time and space. Not every social media item is important or worth analyzing. Due to the noisy and diverse nature of social media data, separating relevant content from noise is challenging and time-consuming.

Unstructured Data as a Challenge: Data in databases are highly structured and can be easily analyzed. Social media data is highly unstructured and consists of text, graphics, actions, and relations. Text data such as tweets, feedback, reviews and comments may have incorrect grammatical structure and be filled with symbols, abbreviations, acronyms, and emoticons, thus representing a significant challenge for extracting business intelligence.

Social Media Analytics Accuracy: Owing to the challenges of volume, velocity, and diversity, the accuracy of social media analytics is questionable. As huge volumes of unstructured data are generated over social media, the accuracy of social listening decreases.

Privacy and Ethical Issues: Two important issues to bear in mind are the privacy and ethical issues related to mining data from social media platforms. The use of social media can reveal information that may lead to privacy violations if not properly managed by the user. Hence, users must be actively aware of the privacy and security policies of the social media networks.

Exercises

A. Multiple Choice Questions:
1. A structure made up of a set of social actors (such as individuals or organizations), their inter-relations, and other social interactions between actors is called ___.
   a. Social media   b. Social graph   c. Social network   d. Social layer
2. Which of the following is not a stage in the social media analytics process?
   a. Data capturing   b. Data preparation   c. Data understanding   d. Data presentation
3. Which of the seven layers is used to measure the popularity and influence of a product, service, or idea over social media?
   a. Text   b. Hyperlinks   c. Actions   d. Networks
4. In which of the seven layers are the nodes considered as the users and the edges considered as the links or connections among users?
   a. Text   b. Hyperlinks   c. Actions   d. Networks
5. A dataset is referred to as ___ in nltk.
   a. Corpus   b. Token   c. Text   d. Data
6. Bag of words represents ___.
   a. ___   b. Frequency distribution   c. Lexicon   d. Tokens
7. TF-IDF is a ___ technique.
   a. Tokenizing   b. Frequency distribution   c. Stemming   d. Word weighting
8. ___ is the process of tracking, collecting and analyzing data from social networks.
   a. Social media   b. Social network   c. Social media analytics   d. Social analysis
9. Which of the following is not a mechanism for accessing social media data?
   a. APIs   b. Web scraping   c. Programming languages   d. Web mining
10. ___ involves writing a computer program to fetch data from a website and extract the required information from it.
   a. Web crawling   b. Web scraping   c. Web parsing   d. ___
11. What predicts the possibility of a future association between two nodes in a social network, knowing that there is no association between these nodes at present?
   a. Web mining   b. Link prediction   c. Community detection   d. Trust prediction
12. The ___ method clusters individuals into groups by finding which individuals correlate more with each other than with other users.
   a. Link prediction   b. Community detection   c. Expert finding   d. Trust prediction
13. Which is the most effective way to group individuals having similar interests and likes?
   a. Community mining   b. Content processing   c. Community detection   d. Community identification
14. Which of the following approaches for community detection in social networks explores the strength of connections between nodes?
   a. Traditional clustering   b. Link-based clustering   c. Topic-based clustering   d. Topic-link based clustering
15. Which is the automatic discovery of new, previously unknown information, by automatically extracting information from different written resources?
   a. Text mining   b. ___   c. Data analytics   d. ___
16. ___ is a technique to break the text into smaller parts in a way that the machine can understand.
   a. Analysis   b. Tokenization   c. Stemming   d. Regular expression
17. Which of the following considers the order of words in the text?
   a. Bag of words   b. N-grams   c. TF-IDF   d. Stop words
18. A sentiment is quantified with a positive or negative value, called ___.
   a. Polarity   b. Emotion   c. Synonym   d. Parts of speech
19. NLP stands for ___.
   a. Natural Learning Process   b. Natural Language Processing   c. Neural Learning Programming   d. Natural Language Programming
20. Which of the following is not an application of NLP?
   a. Personal assistants   b. Sentiment analysis   c. Speech engines   d. Translation
21. Which Python package supports natural language processing?
   a. nltk   b. pandas   c. ___   d. ___
22. ___ is used in the indexing process in order to assess the value of a word to a document.
   a. ___   b. TF-IDF   c. Bag of words   d. Lemma
23. ___
   a. ___   b. ___   c. Bag of words   d. Stem
24. The technique used to extract the base form of the words by removing affixes from them is called ___.
   a. Lemmatization   b. ___   c. Stemming   d. Indexing
25. ___ is a word that is the base form of a word from which it is derived.
   a. Lemma   b. Token   c. Stem   d. Index
26. Parts of speech are also called ___.
   a. Word classes   b. ___   c. Inflections   d. Tags
27. Which of the following is not a sentiment analysis type?
   a. Document-level analysis   b. Word-level analysis   c. Sentence-level analysis   d. Aspect-level analysis
28. Which analysis is carried out to gain insights from the user-generated content of social media data?
   a. Lexicon   b. ___   c. Community   d. Trend
29. Extractive summarization:
   a. Generates new sentences from the original text
   b. Identifies the most important phrases or sentences from the text
   c. Deletes random sentences from the original text
   d. Extracts the words having the highest frequency from the text
30. Abstractive summarization:
   a. Identifies the most important phrases or sentences from the text
   b. Deletes random sentences from the original text
   c. Extracts the words having the highest frequency from the text
   d. Generates new sentences from the original text
31. Which of the following is not an example of trend analysis?
   a. Predicting stock prices   b. Weather forecasting   c. Sentiment analysis   d. Time series analysis
32. Which method is used for identifying influential users of a social network?
   a. Expert finding   b. Influence maximization   c. Community detection   d. Link prediction
33. Which analysis is meant to identify the view or emotion in text?
   a. Community detection   b. Trend   c. Lexicon   d. Sentiment
34. What deals with the interaction between humans and computers using natural language?
   a. Natural Language Processing   b. Big data analytics   c. Social media analytics   d. Text analytics
35. Which of the following is not a method in trend analysis?
   a. Geographic trend analysis   b. Temporal trend analysis   c. Intuitive trend analysis   d. Fourier analysis
36. Which is the process of compressing a document into a shorter version by preserving its meaning?
   a. Text reporting   b. Text summarization   c. Text observation   d. Text compression
37. Which analysis mainly involves determining the possible patterns over a period of time?
   a. Trend analysis   b. Social media analytics   c. Sentiment analysis   d. Pattern analysis
38. ___ is a numerical statistic that reflects how important a word is to a document in a collection or corpus.
   a. Frequency   b. Term weighting   c. TF-IDF   d. ___

B. State True or False
1. Social media analytics is the process of tracking, collecting and analyzing data from social networks such as Facebook, Instagram, LinkedIn and Twitter.
2. In POS tagging, we identify whether the words in a sentence are nouns, verbs, adjectives etc.
3. Social media analytics mainly deals with determining possible patterns over a period of time.
4. Text analytics is the process of deriving high-quality information from text.
5. A structure made up of a set of social actors such as individuals or organizations, their inter-relations, and other social interactions between these actors is called a social network.
6. The steps in the social media analytics process are data gathering, data preparation and data presentation.
7. TF-IDF is a tokenizing method.
8. Extractive summarization generates new sentences from the original text.
9. Text summarization is the process of compressing a document into a shorter version by preserving its meaning.
10. The view or emotion behind a text is identified using trend analysis.
11. Sentiment analysis calculates an emotion or polarity from the text.
12. Stemming extracts the base form of the words by removing affixes from them.
13. Stemming gives words that may not be valid words.
14. The lemma is the root word.
15. Bag of words is a numeric representation of text that describes the occurrence of words within a document.
16. Regular expression breaks the text into smaller parts in a way that the machine can understand.
17. Text analysis is the automatic discovery of new, previously unknown information from text.
18. TF-IDF assigns more importance to frequently occurring words.
19. Sentiment analysis identifies the emotion behind words.
21. Link-based clustering methods study the edges and nodes of the social network in order to detect communities.
22. Topic-based clustering methods consider both the strength of connections between nodes as well as finding communities that are topically similar.
23. Link prediction is used to find the active users of a social network.
24. NLP deals with the interaction between humans and computers using natural language.
25. Text analytics is the automated process of analyzing structured text to uncover insights, trends, and patterns.
26. TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus.
27. N-grams give the most important words in the text.
28. Tokenization means reducing the inflectional forms of each word into a common base or root.
29. Temporal trend analytics analyzes the trend over people and locations.
30. Intuitive trend analysis analyzes the trend within or across user groups based on some logical explanation, behavioral patterns or other elements.
31. The main challenge for social media analytics is the volume of structured data available.
32. Geographic trend analysis is mainly involved in analyzing the trend of products.
33. Social media data can be accessed using APIs.
34. Search engine analytics pays attention to analyzing historical search data to generate informative search engine statistics.
35. The volume refers to the speed at which the data is getting accumulated.
36. NLP is used for speech recognition and optical character recognition.
37. Abstractive summarization is faster than extractive summarization.

C. Short Answer Questions
1. What is a social network?
2. What is social media analytics?
3. What are the stages in the social media analytics process?
4. List the steps performed in the data capturing phase of the social media analytics process.
5. Which social media analytics process step deals with data visualization tools to display the output in an easily interpretable form?
6. List the seven layers of Social Media Analytics.
7. What is text analytics?
8. Define NLP.
9. State the applications of NLP.

D. Long Answer Questions
6. What is link prediction? Explain the generic link prediction framework.
7. What is community detection? Explain the various approaches for community detection.
8. What are influence maximization and expert finding?
9. What is influence maximization? Explain its framework diagrammatically.
10. What is expert finding? How is an expert found? Explain with an example.
11. Explain the concept of trust. How is the prediction of trust and distrust among individuals done?
12. What is natural language processing? Give its applications.
13. What is text analytics? State its advantages.
14. Explain how tokenization is carried out in text analytics.
15. Explain the concept of bag of words with an example.
16. What is word weighting? How is it done in text analytics?
17. What is TF-IDF? How is it calculated? Explain with an example.
18. Explain the terms: N-grams, stop words.
19. Explain stemming and lemmatization.
20. What is the difference between stemming and lemmatization?
21. What is the purpose of parts of speech tagging?
22. What is Sentiment Analysis? How is it carried out?
23. What are the types of Sentiment analysis?
24. What is the need for document or text summarization? What are its advantages?
25. Explain the types of text summarization.
26. What is trend analytics? List the types of trend analytics.
27. Explain temporal trend analytics with its advantages and limitations.
28. Explain geographic trend analytics with its advantages and limitations.
29. What are the challenges of social media analytics?
30. For the following text, create the bag of words:
    i. Sentence 1: He likes to play football
    ii. Sentence 2: She likes to run and play tennis
    iii. Sentence 3: The girl likes to play cricket and tennis
31. Consider the following sentences:
    i. Sentence 1: We will play outside today if it does not rain
    ii. Sentence 2: I like to play in the rain
    iii. Sentence 3: The weather is sunny today
    iv. Create a bag of words for the above.
32. Consider the following sentences:
    i. Sentence 1: Online mode is better for flexibility
    ii. Sentence 2: Offline mode is better for communication
    iii. Sentence 3: Hybrid mode is good combination
    iv. Calculate the TF-IDF and identify the most important words.

Answers

A.
1. c   2. b   3. c   4. d   5. a   6. b   7. d   8. c   9. d   10. b
11. b   12. b   13. c   14. b   15. a   16. b   17. b   18. a   19. b   20. c
21. a   22. b   23. a   24. c   25. a   26. a   27. b   28. a   29. b   30. d
31. c   32. b   33. d   34. a   35. d   36. b   37. a   38. b

B.
1. True   2. True   3. False   4. False   5. True   6. False   7. False   8. False   9. True   10. False
11. True   12. True   13. True   14. True   15. True   16. False   17. True   18. False   19. True   20. False
21. True   22. False   23. False   24. True   25. False   26. True   27. False   28. False   29. False   30. True
31. False   32. False   33. True   34. True   35. False   36. False   37. False