Final Project File Chat Insight - Aranya
On
“CHAT INSIGHT”
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
In
Computer Science & Engineering
May 2024
INDEX
CHAPTER 1: INTRODUCTION
CHAPTER 2: OBJECTIVES
CHAPTER 3: SYSTEM REQUIREMENTS
3.1 H/W Requirements
3.2 S/W Requirements
3.3 Introduction to Tools/Technologies/ S/W used in project
CHAPTER 4: S/W REQUIREMENTS ANALYSIS
4.1 Problem Definition
4.2 Modules and their Functionalities
CHAPTER 5: S/W DESIGN
5.1 S/W Development Lifecycle Model
5.2 Progress flow chart of Project
CHAPTER 6: SOURCE CODE
6.1 Frontend code
6.2 Backend code
CHAPTER 7: RESULTS
CHAPTER 8: FUTURE SCOPE
CHAPTER 9: CONCLUSION
CHAPTER 10: REFERENCES
CERTIFICATE
This is to certify that the project report entitled “CHAT INSIGHT”, done by Ms. ARANYA
DHULL, Roll No. 20/CSE/109, University Roll No. 1066691, of Vaish College of
Engineering, Rohtak, towards partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in Computer Science & Engineering, is a bonafide record
of the work carried out by her under my supervision and guidance.
ACKNOWLEDGEMENT
I take this opportunity to express my profound gratitude and deep regards to my guide Ms.
Urvashi (Asst. Prof.) for her exemplary guidance, monitoring, and constant encouragement
throughout the course of this thesis. The blessings, help, and guidance given by her from time
to time shall carry me a long way in the journey of life on which I am about to embark.
I also take this opportunity to express a deep sense of gratitude to Dr. Bijender Bansal,
Head, Department of Computer Science and Engineering, Rohtak, for his cordial support,
valuable information, and guidance, which helped me in completing this task through various
stages. I am grateful to my college teammates who helped me as part of their team.
I am obliged to the staff members of the Computer Department for the valuable information
provided by them in their respective fields. I am grateful for their cooperation during the
period of my project.
Lastly, I thank the Almighty, my parents, brother, sisters, and friends for their constant
encouragement, without which this project would not have been possible.
ARANYA
1066691
Department of Computer Science
Chapter 1: Introduction
1.1 In today's digital age, communication primarily occurs through various digital platforms
such as messaging applications, social media, and emails. The vast amount of textual data
generated through these channels presents a unique opportunity to extract valuable insights,
sentiments, and trends from conversations. However, manually analyzing such data can be
time-consuming and challenging. To address this issue, we introduce the Comprehensive
Chat Analyzer Tool, a project aimed at automating the analysis of textual conversations.
The scope of the project encompasses the development of a robust software application
capable of processing large volumes of text data, performing various NLP tasks such as
sentiment analysis, topic modeling, and named entity recognition, and presenting the results
in a clear and intuitive manner.
Text Data Import: Users can import text data from various sources, including text files, CSV
files, and APIs.
Preprocessing: The tool preprocesses the text data by removing noise, such as special
characters and stopwords, and tokenizing the text into meaningful units.
Sentiment Analysis: Analyzes the sentiment of the conversations, categorizing them as
positive, negative, or neutral.
Topic Modeling: Identifies the main topics discussed within the conversations using
techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization
(NMF).
Named Entity Recognition: Extracts named entities such as persons, organizations, and
locations mentioned in the conversations.
Visualization and Reporting: Presents the analysis results through interactive visualizations
such as word clouds, sentiment distribution charts, and topic clusters. Users can also generate
comprehensive reports summarizing the insights obtained.
1.1.4. System Architecture
The Chat Analyzer Tool follows a modular architecture comprising the following
components:
Data Input Module: Responsible for importing text data from various sources.
Preprocessing Module: Handles text preprocessing tasks such as noise removal and
tokenization.
Analysis Modules: Includes modules for sentiment analysis, topic modeling, and named
entity recognition.
Visualization Module: Generates interactive visualizations based on the analysis results.
Reporting Module: Generates detailed reports summarizing the insights obtained from the
conversations.
The architecture ensures scalability, flexibility, and ease of maintenance, allowing for future
enhancements and additions of new analysis modules.
Data Import: Text data is imported from the specified sources, such as chat logs or social
media feeds.
Text Cleaning: Noise such as special characters, URLs, and punctuation marks is removed
from the text.
Tokenization: The text is tokenized into individual words or phrases for further analysis.
Stopword Removal: Common stopwords such as 'and', 'the', and 'is' are removed from the
text to focus on meaningful content.
The preprocessing stage lays the foundation for accurate analysis and insights generation.
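A minimal sketch of these preprocessing steps in Python (the regular expressions and the tiny stopword list are illustrative assumptions, not the tool's actual configuration):

import re

STOPWORDS = {"and", "the", "is", "a", "an", "of", "to", "in"}  # illustrative subset

def preprocess(text):
    # text cleaning: strip URLs, then everything except letters and spaces
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # tokenization: split on whitespace
    tokens = text.split()
    # stopword removal: keep only meaningful words
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Check https://example.com - the demo is live!"))
# -> ['check', 'demo', 'live']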
Sentiment analysis is a crucial aspect of the Chat Analyzer Tool, enabling users to understand
the overall sentiment of the conversations. The tool employs machine learning algorithms and
lexicon-based approaches to classify the sentiment of each message or conversation as
positive, negative, or neutral.
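As a sketch of the lexicon-based route, NLTK's VADER analyzer scores a message in a few lines; the ±0.05 cutoff on the compound score is VADER's conventional threshold:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

scores = sia.polarity_scores("The support team was quick and helpful!")
if scores['compound'] >= 0.05:
    label = "positive"
elif scores['compound'] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)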
1.2 The Chat Analyzer project is designed to analyze text-based conversations, extracting
valuable insights such as sentiment analysis, topic modeling, and user engagement metrics. It
involves processing large volumes of chat data to understand patterns, sentiments, and user
behavior for various purposes like improving customer service, understanding social
interactions, or monitoring online communities.
The Chat Analyzer project is a comprehensive system that employs natural language
processing (NLP) techniques to analyze text-based conversations. With a focus on processing
large volumes of chat data, the project aims to extract valuable insights, including sentiment
analysis, topic modeling, and user engagement metrics.
1.3 "Chat Analyzer" can refer to various tools or software designed to analyze conversations
that occur in chat platforms. These platforms could include messaging apps, social media
platforms, customer service chats, and more. Here are some common features and
functionalities of chat analyzers:
1.3.1. Sentiment Analysis:
Chat analyzers can determine the overall sentiment of a conversation by analyzing the
language used. They identify whether the conversation is positive, negative, or neutral. This
can be useful for businesses to gauge customer satisfaction or for researchers studying online
interactions.
For businesses with customer support chat systems, chat analyzers can help optimize
response times, identify common issues, and route inquiries to the appropriate departments or
agents.
Given the sensitive nature of chat data, chat analyzers often include features to ensure privacy
and security compliance, such as data encryption and anonymization.
Overall, chat analyzers play a valuable role in understanding and extracting insights from
textual conversations, whether for customer service improvement, market research, or social
media monitoring.
Chapter 2: Objectives
2.1 A "Chat Insight" typically refers to a software tool or system designed to analyze
conversations, often in text form, between individuals or groups. The objectives of such a
tool can vary depending on its intended use and the specific features it offers. Here are some
common objectives of a chat insight:
2.1.1 Sentiment Analysis: One of the primary objectives is to gauge the sentiment of the
conversation. This involves determining whether the overall tone of the conversation is
positive, negative, or neutral. Understanding sentiment can be useful in various contexts, such
as customer service interactions, social media monitoring, or market research.
2.1.2 Topic Detection: Another key objective is to identify the main topics being discussed
in the conversation. By analyzing the language used and identifying keywords or themes, the
chat analyzer can categorize conversations into different topics or subjects. This can help in
organizing and summarizing large volumes of text data.
2.1.3 User Profiling: Chat insight tools often aim to create profiles of the individuals
participating in the conversation. This involves analyzing linguistic patterns, vocabulary, and
other factors to infer demographic information, interests, or behavioral traits of the users.
User profiling can be valuable for targeted advertising, personalized recommendations, or
social network analysis.
2.1.5 Language Understanding: Chat Insight may also focus on understanding the
semantics and context of the conversation. This involves tasks such as named entity
recognition (identifying names, organizations, locations, etc.), parsing sentence structure, and
resolving ambiguous references. Language understanding capabilities enable more
sophisticated analysis and interpretation of chat data.
2.1.6 Insights and Reporting: Ultimately, the goal of a chat insight is often to provide
actionable insights and generate reports based on the analysis. This could include
visualizations of sentiment trends over time, summaries of key topics discussed,
identification of influential users, or recommendations for improving communication
strategies.
2.1.7 Integration and Automation: Many chat analyzers are designed to integrate with
other systems or workflows and to automate certain tasks. For example, they might integrate
with customer relationship management (CRM) software to provide real-time analysis of
customer interactions, or with social media management tools to monitor brand mentions and
engagement.
2.1.8 Customization and Scalability: Depending on the specific use case and requirements,
chat analyzers may need to be highly customizable and scalable. Users may want to fine-tune
the analysis algorithms, define custom metrics, or process large volumes of data efficiently.
Providing flexibility and scalability ensures that the tool can adapt to diverse applications and
growing datasets.
By fulfilling these objectives, a chat analyzer can help organizations gain valuable insights
from their conversational data, improve decision-making processes, and enhance
communication strategies.
Chapter 3: System Requirements
System requirements refer to the specifications and capabilities necessary for a computer
system, software application, or project to operate effectively and efficiently. These
requirements encompass both hardware and software aspects and are essential for ensuring
that the system functions as intended and meets the needs of its users.
3.1.1 Hardware requirements of the device on which the project was built:
3.1.2 The display on the 10th-generation model expands to 10.9 inches. There is a 2360 x 1640
pixel resolution on board, which results in the same 264 ppi pixel density as the 9th-generation
model, though you of course get more screen within a very similar footprint.
3.1.3 The 10th-generation iPad supports True Tone as well as the Apple Pencil (still only 1st
generation), and it comes with 500 nits of brightness and a fingerprint-resistant oleophobic
coating.
3.1.5 Storage options for the 10th-generation iPad are the same as for the 9th-generation model,
meaning it starts at 64GB, with a 256GB option too. As mentioned, though, the iPad (10th
generation) switches to USB Type-C for charging instead of Lightning.
3.1.6 In terms of other hardware, there's a Smart Connector on the 10th generation model
again, as there is on the 9th generation model. There's also a 12-megapixel FaceTime HD
camera again, allowing for features like Centre Stage. On the rear, you'll also find a 12-
megapixel sensor.
3.1.7 The 3.5mm headphone jack has been removed for the 10th generation iPad.
- Processor: Modern multi-core processor (e.g., Intel Core i5 or AMD Ryzen) for optimal
performance.
- Memory (RAM): Minimum 4GB RAM, recommended 8GB or more for smoother
operation, especially when handling large datasets.
- Storage: At least 100MB of free disk space for installation, additional space for caching
and storing user data.
- Graphics: Standard integrated or dedicated graphics card for rendering user interface
elements.
- Display: Monitor with a resolution of at least 1024x768 pixels for optimal viewing of
charts and data.
3.2.1 Minimum software requirements are:
- Operating System: Compatibility with major desktop operating systems such as Windows
10, macOS, and Linux distributions (e.g., Ubuntu).
- Web Browser: Support for modern web browsers such as Google Chrome, Mozilla
Firefox, Safari, and Microsoft Edge for accessing web-based versions of the application.
- Runtime Environment: Installation of runtime environments like Node.js or Python (if
applicable) for server-side or scripting functionalities.
- Database: Compatibility with popular database systems such as MySQL, PostgreSQL, or
MongoDB for storing user preferences, watchlists, and historical data.
The tools and technologies used in a chat analyzer can vary depending on the specific
features and objectives of the analyzer. Here are some common tools and technologies used
in building chat analyzers:
Altair is designed to work seamlessly with Pandas DataFrames and supports a wide range of
chart types, including scatter plots, line charts, bar charts, histograms, heatmaps, and more.
1. Declarative Syntax: Altair uses a declarative grammar of graphics approach, where you
specify the visualization's appearance and properties using a high-level description. This
makes it easy to create complex visualizations with minimal code.
2. Integration with Pandas: Altair seamlessly integrates with Pandas DataFrames, allowing
you to visualize data directly from DataFrame objects. This makes it convenient for data
analysis workflows, as you can easily explore and visualize datasets without needing to
preprocess them extensively.
4. Rich Set of Chart Types: Altair supports a wide variety of chart types, including basic
charts like scatter plots and bar charts, as well as more advanced visualizations like trellis
plots, facet grids, and layered charts. This allows you to choose the most appropriate
visualization type for your data and analysis goals.
5. Customization and Theming: Altair provides options for customizing the appearance and
style of visualizations, including control over colors, markers, axes, labels, and legends
6. Export and Sharing: Altair allows you to export visualizations to various formats, including
PNG, SVG, and HTML. This makes it easy to share visualizations with others or embed them
in reports, presentations, or web applications.
Overall, Altair is a powerful and user-friendly library for creating interactive and expressive
visualizations in Python, particularly suited for data exploration, analysis, and
communication tasks.
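As a brief sketch (the DataFrame below is invented for the example), a bar chart takes only a declarative encoding:

import altair as alt
import pandas as pd

df = pd.DataFrame({"author": ["Asha", "Ben", "Chen"],   # toy data
                   "messages": [120, 85, 230]})

chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("author:N", sort="-y"),  # nominal axis, sorted by count
    y="messages:Q",                  # quantitative axis
    color="author:N",
).properties(width=400, height=300)

chart.save("messages_by_author.html")  # export for sharing or embedding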
1. Calendar Heatmaps: calmap specializes in creating calendar heatmaps, where each cell
represents a day on the calendar, and the color intensity represents the value of the data for
that day. This allows for easy visualization of patterns and trends over time.
2. Customization: calmap provides options for customizing the appearance of the calendar
heatmap, including the color map, color range, cell size, and cell shape. This allows you to
tailor the visualization to your specific needs and preferences.
3. Support for Time Series Data: calmap is designed to work with time-series data, making it
easy to visualize temporal patterns and trends. It supports various time resolutions, including
daily, monthly, and yearly aggregations.
4. Integration with Matplotlib: calmap is built on top of Matplotlib, a popular plotting library
in Python. This means that calmap can be seamlessly integrated into Matplotlib figures and
plots, allowing for easy customization and combination with other types of visualizations.
5. Ease of Use: calmap provides a simple and intuitive interface for creating calendar
heatmaps, with straightforward functions for plotting time-series data. This makes it easy for
users to generate high-quality visualizations with minimal effort.
Overall, calmap is a useful tool for visualizing time-series data in a calendar heatmap format,
allowing for easy interpretation and analysis of temporal patterns and trends. It's particularly
well-suited for tasks such as visualizing daily or monthly fluctuations in data, identifying
seasonal patterns, and detecting anomalies or outliers in time-series datasets.
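A minimal sketch, assuming a pandas Series of daily message counts indexed by date:

import calmap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

days = pd.date_range("2024-01-01", "2024-12-31", freq="D")
events = pd.Series(np.random.poisson(20, len(days)), index=days)  # toy counts

# one calendar row for the chosen year; darker cells mean more messages
calmap.yearplot(events, year=2024, cmap="YlGn")
plt.show()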
When you use `%matplotlib inline` in a Jupyter Notebook cell, it tells Jupyter to display
Matplotlib plots directly within the notebook, rather than opening a separate window or
generating a file. This is often used for convenience and to keep all visualizations within the
notebook itself.
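For example, a notebook cell like this renders its figure below the cell:

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 8])  # the figure appears inline, not in a window
plt.title("Inline rendering demo")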
3.3.5 NumPy (1.24.2)
NumPy is a fundamental package for scientific computing with Python. It provides support
for large, multi-dimensional arrays and matrices, along with a collection of mathematical
functions to operate on these arrays efficiently. NumPy is widely used in various fields such
as data science, machine learning, engineering, and scientific research.
The version number "1.24.2" refers to a specific release version of the NumPy library. In
software development, version numbers typically follow a format like "major.minor.patch",
where:
- "Major" version changes indicate significant updates, often with breaking changes or major
new features.
- "Minor" version changes usually add new features or enhancements without breaking
existing functionality.
- "Patch" version changes typically address bugs or issues found in the software.
The version number "1.5.3" would follow the typical format for software versioning, where:
Scikit-learn, often abbreviated as sklearn, is an open-source machine learning library for
Python. It is built on top of other popular Python libraries such as NumPy, SciPy, and
matplotlib. Scikit-learn provides a simple and efficient toolset for data mining and data
analysis tasks, including classification, regression, clustering, dimensionality reduction, and
model selection.
1. Consistent API: Scikit-learn provides a consistent interface for various machine learning
algorithms, making it easy to experiment with different models without needing to learn new
syntax for each one.
3. Model Evaluation and Selection: Scikit-learn offers tools for model evaluation, including
various metrics for assessing model performance such as accuracy, precision, recall, F1-
score, ROC curves, and AUC-ROC. It also provides utilities for hyperparameter tuning and
cross-validation to select the best-performing model.
4. Preprocessing and Feature Engineering: The library includes utilities for preprocessing
data, such as scaling, normalization, encoding categorical variables, and handling missing
values. It also provides feature extraction and transformation methods for creating new
features or transforming existing ones.
5. Pipeline: Scikit-learn allows you to construct machine learning pipelines that chain
together multiple processing steps, such as data preprocessing, feature selection, and model
training, into a single object. This makes it easy to apply the same preprocessing steps
consistently to training and test data (a minimal sketch follows this list).
6. Integration with other Libraries: Scikit-learn integrates well with other Python libraries
such as pandas for data manipulation and matplotlib for data visualization.
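Picking up the pipeline feature from point 5, a minimal sketch (the four training texts are invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great service", "terrible wait", "loved it", "awful support"]
labels = ["pos", "neg", "pos", "neg"]  # toy labels

# vectorizer and classifier fitted together; the same transform is
# applied automatically at prediction time
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the service was great"]))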
SciPy is an open-source Python library that is used for scientific and technical computing. It
builds on NumPy, another Python library, and provides additional functionality for
optimization, integration, interpolation, linear algebra, statistics, signal processing, and more.
SciPy is a fundamental library for many scientific computing tasks in Python, providing
efficient and optimized implementations of numerous mathematical algorithms and functions.
It is often used alongside other libraries like NumPy, matplotlib, and scikit-learn to perform
various tasks ranging from data analysis to simulation to optimization.
SciPy is organized into several submodules, each focusing on different aspects of scientific
computing:
4. SciPy.linalg: Provides linear algebra routines for solving linear systems, computing
eigenvalues and eigenvectors, singular value decomposition (SVD), and other matrix
operations.
5. SciPy.sparse: Deals with sparse matrices and related operations, which are often
encountered in large-scale scientific computing and numerical simulations.
6. SciPy.stats: Contains a wide range of statistical functions and probability distributions for
statistical analysis and hypothesis testing.
7. SciPy.signal: Offers signal processing tools, including filtering, spectral analysis, and
waveform generation.
8. SciPy.special: Provides special functions such as Bessel functions, gamma functions, and
elliptic functions, which are commonly used in scientific computing and mathematical
physics.
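A small sketch of two of these submodules in action:

import numpy as np
from scipy import linalg, stats

# scipy.linalg: solve the linear system Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))          # [2. 3.]

# scipy.stats: two-sample t-test on toy response times
group_a = [12.1, 11.8, 12.4, 12.0]
group_b = [13.0, 12.9, 13.4, 13.1]
print(stats.ttest_ind(group_a, group_b))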
2. Built-in Themes and Color Palettes: Seaborn includes several built-in themes and color
palettes that can be easily applied to plots to improve their aesthetics and readability. These
themes and palettes are designed to work well together and provide consistent styling across
different plot types.
3. Statistical Estimation: Seaborn provides functions for estimating and visualizing statistical
relationships in data, including linear regression models, correlation matrices, and kernel
density estimation plots.
4. Categorical Data Plotting: Seaborn offers specialized support for visualizing categorical
data, including functions for creating grouped bar plots, point plots, count plots, and
categorical scatter plots.
5. Matrix Plotting: Seaborn includes functions for creating matrix plots, such as cluster maps
and heatmap plots, which are useful for visualizing hierarchical clustering and pairwise
relationships in data.
6. Flexible Configuration Options: Seaborn provides a range of configuration options for
customizing plot appearance and behavior, including control over axis labels, titles, legends,
tick marks, and grid lines.
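A short sketch, with invented data, combining the theming and categorical plotting described above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")  # built-in theme

# toy per-author sentiment counts
df = pd.DataFrame({"author": ["Asha"] * 3 + ["Ben"] * 3,
                   "sentiment": ["pos", "neg", "neu"] * 2,
                   "count": [40, 10, 30, 22, 18, 25]})

sns.barplot(data=df, x="author", y="count", hue="sentiment")
plt.title("Sentiment counts by author")
plt.show()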
1. Memory Mapping: SMmap allows you to memory-map large files so that you can access
their contents as if they were stored in memory. This enables you to work with files that are
larger than the available physical memory.
2. Sliding Window Technique: SMmap uses a sliding window technique to efficiently manage
memory maps. Instead of mapping the entire file into memory at once, it maps only a portion
of the file (window) into memory at a time. As you access different parts of the file, SMmap
automatically adjusts the window to ensure that the necessary data is available in memory.
3. Efficient Access: By using memory mapping, SMmap provides efficient random access to
large files. It minimizes the overhead associated with reading and writing large amounts of
data by leveraging the operating system's virtual memory system.
SMmap is particularly useful in scenarios where you need to process or analyze large datasets
that are stored in files, such as log files, databases, or scientific data files.
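SMmap's manager classes build on the same primitive as Python's standard-library mmap module; the underlying idea can be sketched with the standard library alone (the file path is hypothetical):

import mmap

# map a (possibly huge) file read-only; the OS pages data in on demand
with open("chat_export.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(mm[:80])                 # first 80 bytes, no full read
        idx = mm.find(b"location:")    # search without loading the file
        print("first match at byte", idx)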
3.3.11 Streamlit (1.28.1)
Streamlit is a Python library that enables you to create interactive web applications for
machine learning, data science, and other analytical tasks with minimal effort.
1. Simple and Intuitive: Streamlit allows you to build web applications using only Python
scripting. You can create interactive apps with just a few lines of code, without needing to
write HTML, CSS, or JavaScript.
2. Fast Iteration: With Streamlit, you can see the results of your code changes instantly in the
web app, thanks to its automatic rerunning and updating mechanism. This enables fast
iteration and experimentation during development.
3. Wide Range of Widgets: Streamlit provides a variety of widgets for building interactive
elements in your apps, including sliders, dropdowns, text inputs, buttons, and more. These
widgets make it easy to create user interfaces for controlling and visualizing data.
4. Integration with Python Libraries: Streamlit seamlessly integrates with popular Python
libraries such as Pandas, Matplotlib, Plotly, and scikit-learn, allowing you to leverage their
capabilities for data manipulation, visualization, and machine learning within your web apps.
Streamlit has gained popularity in the data science and machine learning community for its
simplicity and ease of use in building interactive applications.
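A minimal sketch of a Streamlit page in the spirit of this project (the widget labels are invented); it runs with "streamlit run app.py":

import pandas as pd
import streamlit as st

st.title("Chat Insight demo")

uploaded = st.file_uploader("Upload an exported chat file", type="txt")
if uploaded is not None:
    lines = uploaded.getvalue().decode("utf-8").splitlines()
    st.write(f"{len(lines)} lines loaded")
    # toy chart: distribution of line lengths
    st.bar_chart(pd.Series([len(l) for l in lines]).value_counts().head(20))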
1. Progress Bars: TQDM makes it easy to add progress bars to your Python loops and
iterables. This provides users with visual feedback on the progress of tasks, particularly
useful for lengthy computations or when dealing with large datasets.
2. Customization: TQDM offers various customization options for progress bars. You can
adjust the appearance, including the bar's style, color, width, and position. Additionally, you
can configure features like dynamic updates, nested progress bars, and estimated time
remaining.
3. Integration: TQDM integrates seamlessly with a wide range of Python data structures and
libraries, including lists, dictionaries, NumPy arrays, Pandas dataframes, and file objects. It
can be used with loops, list comprehensions, map functions, and other iterable constructs.
5. Extensibility: TQDM is highly extensible, allowing you to create custom progress bars and
integrate them into your projects. You can subclass the TQDM class to implement custom
behavior or develop plugins to extend its functionality.
TQDM is widely used in various Python projects, particularly in data science, machine
learning, and computational tasks, to monitor progress and enhance the user experience.
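Wrapping any iterable is enough to get a live progress bar, as this sketch shows:

from time import sleep
from tqdm import tqdm

messages = [f"message {i}" for i in range(500)]

parsed = []
for msg in tqdm(messages, desc="Parsing chat"):  # renders a progress bar
    sleep(0.001)  # stand-in for real parsing work
    parsed.append(msg.upper())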
1. Geocoding: Geopy enables you to geocode addresses, meaning it can convert human-
readable addresses into geographic coordinates (latitude and longitude). This is useful for
tasks such as mapping, location-based services, and data analysis.
2. Reverse Geocoding: Conversely, Geopy supports reverse geocoding, which involves
converting geographic coordinates into human-readable addresses. This can be helpful for
identifying the location of a point on a map or finding nearby places based on coordinates.
3. Support for Multiple Providers: Geopy supports multiple geocoding providers, allowing
you to choose the service that best fits your needs. Some of the supported providers include
Google Maps API, Bing Maps API, OpenStreetMap Nominatim, ArcGIS, and more.
4. Pluggable Architecture: Geopy has a pluggable architecture, making it easy to add support
for additional geocoding providers or customize existing ones. This flexibility allows
developers to adapt Geopy to their specific requirements or preferences.
5. Simple API: Geopy provides a simple and consistent API for geocoding and reverse
geocoding operations, making it easy to integrate into your Python projects. It abstracts away
the complexities of interacting with various geocoding services, allowing you to focus on
your application logic.
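A sketch of both directions using the free Nominatim provider (the user_agent string is an arbitrary identifier chosen by the caller):

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="chat-insight-demo")

# geocoding: address -> coordinates
loc = geolocator.geocode("Rohtak, Haryana, India")
print(loc.latitude, loc.longitude)

# reverse geocoding: coordinates -> address
addr = geolocator.reverse((28.8955, 76.6066))
print(addr.address)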
Chapter 4: Software Requirements Analysis
Software requirements analysis is the process of identifying, eliciting, documenting, and
validating the needs and expectations of stakeholders for a software system. It's a crucial
phase in the software development lifecycle as it lays the foundation for the entire project.
4.1.1 Chat Insight is a statistical analysis tool for WhatsApp chats. Working on the chat files
that can be exported from WhatsApp, it generates various plots showing, for example, which
other participant a user responds to the most. We propose to employ dataset manipulation
techniques to gain a better understanding of the WhatsApp chats present on our phones.
4.1.2 "In today's digital age, communication via chat platforms has become ubiquitous,
spanning various domains such as customer service, social networking, and online
collaboration. However, the sheer volume and complexity of textual conversations pose
significant challenges in extracting meaningful insights and optimizing communication
processes. Businesses and organizations need a robust solution to analyze chat data
effectively, identify patterns, sentiments, and key information, and derive actionable insights
to enhance customer satisfaction, streamline operations, and make data-driven decisions. The
problem statement for a chat analyzer is to develop a comprehensive tool or software solution
that can automatically process, analyze, and visualize chat data from diverse sources,
providing users with valuable insights into conversation dynamics, sentiment trends, topic
clusters, and other relevant metrics. This solution should prioritize accuracy, scalability,
privacy, and ease of use, catering to the needs of businesses, researchers, and other
stakeholders seeking to leverage chat data for various purposes."
Organizations likewise need to extract meaningful insights such as positive or negative
sentiment, prevalent topics of discussion, and key influencers driving conversations.
5. Streamlining Operations:
- Example: A financial institution uses chat-based communication for internal discussions
among employees. They need a solution to analyze these conversations to identify areas for
process improvement, compliance risks, or opportunities for cost reduction.
In summary, the problem statement for a chat insight revolves around addressing the
challenges associated with analyzing large volumes of chat data, extracting meaningful
insights, optimizing communication processes, and ultimately leveraging this information to
improve customer satisfaction, streamline operations, and make informed decisions across
various domains and industries.
4.2 Modules and their functionalities
Here are the modules commonly used in a Chat Insight project.
4.2.1. Data Preprocessing Module:
The data preprocessing module cleans and standardizes the raw chat text through the following tasks:
1. Text Cleaning:
- Text data often contains various artifacts, such as HTML tags, special characters, and
punctuation, which need to be removed to ensure uniformity and consistency. Text cleaning
involves tasks like stripping HTML tags, removing non-alphanumeric characters, and
handling special cases like emojis or emoticons.
2. Tokenization:
- Tokenization is the process of splitting text into smaller units, typically words or phrases,
known as tokens. This step breaks down the text into its fundamental components, making it
easier to analyze and process. Tokenization can be performed using simple whitespace
splitting or more advanced techniques that consider punctuation and context.
3. Normalization:
- Normalization aims to standardize the text by converting it to a common format. This may
involve converting text to lowercase to ensure case-insensitive matching, expanding
contractions (e.g., "don't" to "do not"), and replacing abbreviations or acronyms with their
full forms.
4. Stemming or Lemmatization:
- Stemming and lemmatization are techniques used to reduce words to their base or root
forms, allowing for more effective analysis. Stemming involves stripping suffixes from words
to obtain their stems (e.g., "running" becomes "run"), while lemmatization uses linguistic
rules to return the canonical form of a word (e.g., "ran" becomes "run"). Both techniques help
consolidate variations of words and improve text normalization.
5. Spell Checking:
- Spell checking is optional but can be beneficial for improving the quality of text data. It
involves identifying and correcting spelling errors, ensuring that the text is accurate and
interpretable. Spell checking algorithms can automatically suggest corrections for misspelled
words based on dictionary lookup or statistical models.
By performing these preprocessing tasks, the data preprocessing module ensures that the chat
data is clean, standardized, and ready for subsequent analysis tasks such as sentiment
analysis, named entity recognition, and topic modeling. This module plays a critical role in
ensuring the accuracy and reliability of the insights derived from the chat data.
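A sketch contrasting stemming and lemmatization with NLTK (the wordnet resource needs a one-time download):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "ran", "studies"]:
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
# stemming clips suffixes; lemmatization maps to dictionary forms,
# so "ran" lemmatizes to "run" while its stem stays "ran"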
4.2.2. Sentiment Analysis Module:
1. Classification Approach:
- Sentiment analysis typically employs a classification approach, where each chat message
is classified into one or more sentiment categories, such as positive, negative, or neutral. This
classification can be binary (positive/negative) or multiclass (positive/negative/neutral)
depending on the granularity of sentiment analysis required.
3. Lexicon-Based Approaches:
- Lexicon-based approaches rely on predefined sentiment lexicons or dictionaries that map
words to sentiment scores. Each word in a chat message is assigned a sentiment score, and
the overall sentiment of the message is determined based on the aggregation of these scores.
Lexicon-based methods are computationally efficient but may struggle with context-
dependent sentiment and nuances in language.
4. Rule-Based Systems:
- Rule-based systems use a set of predefined rules or patterns to identify sentiment in text.
These rules may be based on linguistic patterns, syntactic structures, or domain-specific
knowledge. Rule-based systems offer transparency and interpretability but may require
manual crafting of rules and lack the flexibility of machine learning approaches.
7. Evaluation Metrics:
- The performance of sentiment analysis models is evaluated using metrics such as
accuracy, precision, recall, and F1-score. These metrics quantify the model's ability to
correctly classify chat messages into their respective sentiment categories. Additionally,
domain-specific evaluation may be necessary to ensure the model's effectiveness in real-
world applications.
The sentiment analysis module is integral to understanding the emotional tone and attitude
conveyed in chat conversations. By accurately identifying sentiment, businesses can gauge
customer satisfaction, detect emerging trends, and make informed decisions to enhance user
experience and brand perception.
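These metrics are available off the shelf in scikit-learn; a sketch with invented predictions:

from sklearn.metrics import classification_report

y_true = ["pos", "neg", "neu", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "pos"]

# per-class precision, recall and F1, plus overall accuracy
print(classification_report(y_true, y_pred))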
4.2.3. Named Entity Recognition (NER) Module:
- NER is a natural language processing task that identifies and categorizes named entities
mentioned in text, such as people, organizations, locations, dates, and other entities of
interest. This module extracts relevant entities from chat conversations, enabling users to
identify key entities and topics discussed. NER can be useful for various applications,
including trend analysis, topic modeling, and entity-based sentiment analysis.
The Named Entity Recognition (NER) module in a chat insight project is designed to identify
and classify named entities mentioned in text data. Here are some key functionalities of the
NER module:
1. Entity Identification:
- The primary functionality of the NER module is to identify named entities within chat
messages. Named entities can include various types of entities such as persons, organizations,
locations, dates, numerical values, and more. The module scans the text data to identify spans
of text that represent named entities.
2. Entity Classification:
- Once named entities are identified, the NER module classifies them into predefined
categories. Common categories include:
- Person: Individual names of people.
- Organization: Names of companies, institutions, or groups.
- Location: Names of places, including cities, countries, and landmarks.
- Date: Specific dates or time expressions.
- Numeric: Numerical values, such as quantities or measurements.
- Others: Additional categories specific to the domain, such as product names or event
names.
3. Contextual Understanding:
- The NER module takes into account the context surrounding named entities to improve
accuracy and relevance. Contextual understanding helps distinguish between entities that may
have multiple meanings or interpretations based on the surrounding text. For example, "Paris"
could refer to both the city in France and a person's name.
4. Multi-lingual Support:
- Depending on the application requirements, the NER module may support multiple
languages. It should be capable of recognizing named entities in different languages and
adapting its classification approach accordingly. Multi-lingual support enhances the
versatility and usability of the chat analyzer across diverse linguistic contexts.
Overall, the NER module enhances the capability of the chat insight to extract structured
information from unstructured text data, enabling users to identify and analyze named entities
of interest within chat conversations.
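As a sketch, spaCy's pretrained pipeline covers the identification and classification steps in a few lines (the small English model is installed once with "python -m spacy download en_core_web_sm"; the sentence is invented):

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Sara met the Acme Corp team in Paris on Friday.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Sara -> PERSON, Acme Corp -> ORG, Paris -> GPE, Friday -> DATE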
4.2.4. Topic Modeling Module:
1. Topic Discovery:
- The primary functionality of the Topic Modeling module is to discover latent topics within
the chat data. It identifies groups of words that frequently co-occur across multiple chat
messages, suggesting the presence of underlying themes or subjects of discussion.
2. Unsupervised Learning:
- Topic modeling is typically an unsupervised learning task, meaning it does not require
labeled data for training. Instead, it automatically learns the latent topics present in the chat
data based on the distribution of words across documents.
3. Algorithm Selection:
- The Topic Modeling module may employ various algorithms to perform topic discovery.
Common algorithms include Latent Dirichlet Allocation (LDA), Non-Negative Matrix
Factorization (NMF), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic
Analysis (PLSA). Each algorithm has its own strengths and weaknesses, and the choice
depends on factors such as the size of the dataset, computational resources, and desired
interpretability.
4. Topic Representation:
- Once topics are discovered, the Topic Modeling module represents them in a meaningful
way for interpretation. Each topic is typically represented as a distribution of words, with
words ranked by their probability of occurring in the topic. This representation allows users
to understand the key terms associated with each topic.
5. Topic Labeling:
- The module may provide functionality for labeling topics based on the most representative
words. Automatic topic labeling helps users quickly understand the content of each topic
without having to manually inspect the entire list of associated words.
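A compact sketch of LDA-based topic discovery with scikit-learn; the four "documents" are invented chat snippets:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the delivery was late again",
        "late shipment and delivery issues",
        "loved the new app update",
        "the app update looks great"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# print the top words per topic, the usual "topic representation"
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", ", ".join(top))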
4.2.5. Visualization Module:
1. Data Exploration:
- The primary functionality of the Visualization module is to enable users to explore the
chat data visually. This includes displaying summary statistics, such as message frequency,
word counts, and user activity over time. Visualization facilitates initial data inspection and
helps users gain a high-level understanding of the dataset.
2. Sentiment Visualization:
- The module can visualize the sentiment distribution of chat messages using various charts
or graphs. Common visualization techniques include sentiment histograms, pie charts, or
stacked bar charts showing the proportion of positive, negative, and neutral messages.
Sentiment visualization provides insights into the overall sentiment trends in the chat data.
4. Topic Visualization:
- The module visualizes the topics discovered through topic modeling to aid in their
interpretation. This may include interactive topic proportion plots, word clouds showing the
most representative words for each topic, or dendrogram-based topic hierarchies. Topic
visualization enables users to explore the content of topics and understand their relationships.
6. Network Visualization:
- In cases where chat data involves interactions between users or entities, the module may
visualize the communication network. This could involve creating node-link diagrams or
social network graphs showing connections between users based on message exchanges.
Network visualization provides insights into the structure of communication and the
relationships between participants.
By providing these functionalities, the Visualization module enables users to visually inspect,
analyze, and communicate insights derived from chat data effectively.
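For instance, the sentiment-distribution view described above can be sketched as a stacked bar chart (the counts are invented):

import matplotlib.pyplot as plt
import pandas as pd

# toy per-author sentiment counts
counts = pd.DataFrame(
    {"positive": [40, 22], "neutral": [30, 25], "negative": [10, 18]},
    index=["Asha", "Ben"])

counts.plot(kind="bar", stacked=True)
plt.ylabel("messages")
plt.title("Sentiment distribution by author")
plt.tight_layout()
plt.show()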
Chapter 5: S/W Design
5.1.1 For chat analyzer software, we want a development lifecycle model that allows for
iterative development, frequent testing, and continuous improvement.
2. Design:
- Create a high-level architectural design outlining the components of the chat analyzer.
- Design the user interface for interacting with the analyzer.
- Develop data models and algorithms for analyzing chat data.
3. Implementation:
- Develop the software according to the design specifications.
- Implement algorithms for sentiment analysis, entity recognition, topic modeling, etc.
- Integrate with any external APIs or services required for additional functionality.
4. Testing:
- Conduct unit testing for individual components to ensure they function as expected.
- Perform integration testing to ensure different modules work together seamlessly.
- Conduct functional testing to verify that the software meets the specified requirements.
- Perform performance testing to ensure the software can handle the expected workload.
5. Deployment:
- Prepare the software for deployment in the target environment.
- Set up necessary infrastructure, including servers, databases, and security measures.
- Deploy the software to production or staging environments.
8. Continuous Improvement:
- Continuously monitor the performance and usage of the chat analyzer.
- Collect data on user behavior and system performance to inform future updates.
- Incorporate user feedback and market trends to prioritize future development efforts.
This lifecycle model combines elements of iterative development, agile methodologies, and
continuous improvement to ensure the chat analyzer software meets the evolving needs of its
users.
For chat analyzer software, we can consider using an Agile software development model,
specifically Scrum. Scrum is well suited for projects where requirements may evolve over
time and there is a need for frequent feedback and adaptation. Here are some steps to adapt
Scrum to the development of a chat analyzer:
2. Sprint Planning:
- Plan short development cycles called sprints, typically lasting 2-4 weeks.
- During sprint planning, select a subset of items from the product backlog to work on
during the sprint. These items should deliver tangible value to the users.
3. Daily Standups:
- Hold daily standup meetings where the development team discusses progress, challenges,
and plans for the day.
- These meetings ensure transparency and keep the team aligned towards sprint goals.
4. Development:
- Develop and implement the selected backlog items during the sprint.
- Use iterative development practices to deliver working software increments at the end of
each sprint.
5. Sprint Review:
- At the end of each sprint, conduct a sprint review meeting where the development team
demonstrates the completed work to stakeholders.
- Gather feedback from stakeholders to validate whether the delivered features meet their
expectations.
6. Sprint Retrospective:
- Hold a sprint retrospective meeting after the sprint review to reflect on what went well,
what could be improved, and any adjustments needed for the next sprint.
- Identify and address process improvements to enhance team productivity and product
quality.
7. Incremental Delivery:
- Continuously deliver increments of the chat analyzer software with each sprint, allowing
stakeholders to see tangible progress and provide feedback throughout the development
process.
8. Adaptation:
- Use feedback gathered from stakeholders during sprint reviews to adapt the product
backlog and adjust development priorities as needed.
- Embrace change and flexibility to ensure the chat analyzer meets evolving requirements
and user needs.
By following the Scrum framework, we can effectively manage the development of the chat
analyzer software in a collaborative and adaptive manner, delivering value to users
incrementally while remaining responsive to changes in requirements and feedback.
5.2 Progress flowchart of Project
Chapter 6: Source Code
6.1 Frontend code
import streamlit as st
import pandas as pd  # restored imports; the listing begins mid-file

st.set_page_config(
    page_title="CHAT INSIGHT",
    page_icon="🧊",
    layout="wide",
    initial_sidebar_state="expanded",
    menu_items={},  # the menu entries were lost in the original listing
)

st.cache_data.clear()
st.cache_resource.clear()

st.write("""
## Chat insight
""")
st.markdown(
    """
    V1.0 2024-02-25:
    - Maps for shared locations

    ### Info
    - Most of the charts are based on group chats, but it works for DMs too.
    - Sometimes WhatsApp can have problems with date formats while exporting.
    - It may take around 2 minutes to process a 20 MB chat file on the server.
    - Possible to-dos:
        - Aggregate multiple people into one. Sometimes a user can have multiple
          numbers, and we should give a chance to see them as one single user.
        - Exportable PDF
        - More prescriptive analytics
        - Demo chat
    - source: [ankur](https://github.com/Ankurkatri/Chatinsight/)
    """
)
def app():
    session_state = st.session_state

    @st.cache_data
    def read_sample_data():
        df = pd.read_csv(
            "https://raw.githubusercontent.com/koftezz/whatsapp-chat-analyzer/"
            "0aee084ffb8b8ec4869da540dc95401b8e16b7dd/data/sample_file.txt",
            header=None)
        return df.to_csv(index=False).encode('utf-8')

    csv = read_sample_data()
    st.download_button(
        label="Download sample chat",  # label reconstructed; lost in the listing
        data=csv,
        file_name='data/sample_file.txt')

    # the file-uploader line was lost in the listing; reconstructed
    file = st.file_uploader("Upload an exported WhatsApp chat", type="txt")
    df = read_file(file)
    df = df.sort_values("timestamp")
    df = df[3:]
    edited_df = df[["author"]].drop_duplicates()
    edited_df = st.data_editor(edited_df)
    # NOTE: this block was fragmented in the original listing; widget labels
    # and the preprocessing call are reconstructed and marked as assumptions.
    author_list = df.groupby('author').size().index.tolist()

    with st.form(key='my_form_to_submit'):
        selected_authors = st.multiselect(
            "Select authors to include",  # label reconstructed
            author_list)
        selected_lang = st.radio(
            "Chat language",  # label reconstructed
            ("English", 'Turkish'))
        submit_button = st.form_submit_button(label='Submit')

    if not selected_authors:
        st.info("No selection, using all authors.", icon="")  # text reconstructed
        selected_authors = df["author"].drop_duplicates().tolist()

    if submit_button:
        df = preprocess(df=df,  # call name taken from the backend listing
                        selected_lang=selected_lang,
                        selected_authors=selected_authors)
        st.dataframe(df)

    # fixed colour per author for the matplotlib charts
    author_color_lookup = {author: f'C{i}' for i, author in
                           enumerate(author_list)}
    def formatted_barh_plot(s,
                            pct_axis=False,
                            thousands_separator=False,
                            color_labels=True,
                            sort_values=True,
                            width=0.8,
                            **kwargs):
        if sort_values:
            s = s.sort_values()
        s.plot(kind='barh',
               color=s.index.to_series().map(
                   author_color_lookup).fillna(
                   'grey'),
               width=width,
               **kwargs)
        if color_labels:
            # colour each y-tick label to match its author's bar colour
            # (loop header reconstructed; the original lines were fragmented)
            for color, tick in zip(
                    s.index.to_series().map(
                        author_color_lookup).fillna(
                        'grey'),
                    plt.gca().yaxis.get_major_ticks()):
                tick.label1.set_color(color)
        if pct_axis:
            if type(pct_axis) == int:
                decimals = pct_axis
            else:
                decimals = 0
            plt.gca().xaxis.set_major_formatter(
                ticker.PercentFormatter(1, decimals=decimals))
        elif thousands_separator:
            plt.gca().xaxis.set_major_formatter(
                ticker.FuncFormatter(
                    lambda x, p: format(int(x), ',')))
        return plt.gca()
f"people " \
st.write(msg)
st.write(
o = basic_stats(df=df)
st.dataframe(o, use_container_width=True)
o = stats_overall(df=df)
st.dataframe(o)
    author_df = trend_stats(df=df)
    st.dataframe(author_df, use_container_width=True)

    o = pd.DataFrame(
        df.groupby('author')["message"].count()).reset_index()
    most_active = \
        o.sort_values("message", ascending=False).iloc[0][
            'author']
    total_msg = o.sort_values("message",
                              ascending=False).iloc[0][
        'message']
    c = alt.Chart(o).mark_bar().encode(
        x=alt.X("author", sort="-y"),
        y=alt.Y('message:Q'),
        color='author',
    )
    rule = alt.Chart(o).mark_rule(color='red').encode(
        y='mean(message):Q'
    )
    c = (c + rule).properties(width=600, height=600,
                              title='Messages by author')  # title reconstructed
    st.altair_chart(c)
    o = activity(df=df)
    # the most-active lookup and the st.info text were fragmented in the
    # original listing; reconstructed, with the message text assumed
    most_active = o.sort_values("Activity %",
                                ascending=False).iloc[0]['author']
    activity_pct = o.sort_values("Activity %",
                                 ascending=False).iloc[0]['Activity %']
    st.info(f"{most_active} is the most active author "
            f"({activity_pct} of days).")

    c = alt.Chart(o).mark_bar().encode(
        x=alt.X("author:N", sort="-y"),
        y=alt.Y('Activity %:Q'),
        color='author',
    )
    rule = alt.Chart(o).mark_rule(color='red').encode(
        y='mean(Activity %):Q'
    )
    c = (c + rule).properties(width=600, height=600,
                              title='Activity % by author'
                              )
    st.altair_chart(c)
    with st.expander("More info"):
        min_year = df.year.max() - 5
        smoothed_daily_activity_df = smoothed_daily_activity(df=df)
        st.area_chart(smoothed_daily_activity_df)

    st.write("""
    """)
    min_year = df.year.max() - 3
    o = relative_activity_ts(df=df)
    st.area_chart(o)

    st.write("""
    0 - Monday
    6 - Sunday
    """)
    o = activity_day_of_week_ts(df=df)
    st.line_chart(o)

    st.write("""
    """)
    b = activity_time_of_day_ts(df=df)
    c = alt.Chart(b).transform_fold(
        selected_authors,
        as_=['author', "message"]
    ).mark_line().encode(
        x=alt.X('utchoursminutes(timestamp):T', axis=alt.Axis(
            format='%H:00'),
                scale=alt.Scale(type='utc')),
        y='message:Q',
        color='author:N'
    ).properties(width=1000, height=600)
    st.altair_chart(c)

    # Response matrix
    st.write("""
    ## Response Matrix
    """)
    with st.container():
        fig = response_matrix(df=df)
        st.pyplot(fig)

    st.write("""
    """)

    # Response time
    o = df.copy()  # working frame; the original assignment was lost
    fig, ax = plt.subplots()
    plt.subplot(121)
    o["response_time"] = (
        (o.timestamp - o.timestamp.shift(1)).dt.seconds
        .replace(0, np.nan)
        .div(60)
        .apply(np.log10))
    o = o[["author", "response_time"]]
    o.groupby("author")["response_time"].apply(
        sns.kdeplot)
    plt.ylabel('Relative frequency', fontsize=8)
    plt.subplot(122)
    response = df.copy()  # working frame; the original assignment was lost
    response["response_time"] = (
        (response.timestamp - response.timestamp.shift(1)).dt.seconds
        .replace(0, np.nan)
        .div(60))
    response.groupby("author").median()["response_time"].pipe(
        formatted_barh_plot)
    plt.ylabel('')
    plt.tight_layout()

    with st.container():
        slow_typer = response.groupby("author").median()[
            "response_time"]. \
            sort_values()[-1:].index[0]
        st.pyplot(fig)

    std = response.response_time.std()
    mean = response.response_time.mean()
    # outlier filter reconstructed from the surviving fragment "* std]"
    response = response[response.response_time < mean + 3
                        * std]

    c = alt.Chart(response).mark_point(size=60).encode(
        x='letters',
        y='response_time',
        color='author',
    )
    c = (c + c.transform_regression("letters",
                                    'response_time').mark_line()). \
        properties(width=1000, height=600,
                   title='Response time vs. length of '
                         'a message'
                   ).interactive()
    st.write("Do longer messages get a longer "
             "response time?")  # caption reconstructed from the fragment
    with st.container():
        st.altair_chart(c)

    # max_spammer / max_spam come from the backend helper; the message
    # template was fragmented in the original listing
    st.write("""%s sent the longest streak with %s consecutive
    messages. """ % (
        max_spammer, max_spam))

    st.write("""
    """)
    year_content = year_month(df=df)
    total_messages = year_content.sort_values("message",
                                              ascending=False).iloc[
        0].message
    year = int(year_content.sort_values("message",
                                        ascending=False).iloc[
        0].YearMonth / 100)
    month = \
        year_content.sort_values("message", ascending=False).iloc[
            0].YearMonth % 100
    c = alt.Chart(year_content).mark_bar().encode(
        x=alt.X("YearMonth:O"),
        y=alt.Y('message:Q'),
        color='year:O',
    )
    rule = alt.Chart(year_content).mark_rule(color='red').encode(
        y='mean(message):Q'
    )
    c = (c + rule).properties(width=1000, height=600,
                              title='Messages by month')  # title reconstructed
    st.altair_chart(c)

    st.write("""
    """)
"\n- Right chart shows the adjusted values based "
"on "
fig = sunburst(df=df)
st.pyplot(fig)
# st.write("""
# """)
# fig = radar_chart(df=df)
# st.pyplot(fig)
fig = heatmap(df=df)
st.pyplot(fig)
if locations.shape[0] > 0:
# geolocator = Nominatim(user_agent="loc_finder")
Page 53 of 98
# st.dataframe(pd.DataFrame(locations.groupby(["country", "town"])["lat"]
# .count()).rename(columns={"lat":
# "count"}).sort_values(
# "count", ascending=False))
st.map(locations)
st.cache_data.clear()
st.cache_resource.clear()
if __name__ == "__main__":
app()
6.2 Backend code
import streamlit as st
import pandas as pd
import numpy as np
import datetime
import tempfile
import math
# restored imports (used later in this listing but missing from it):
import altair as alt
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from scipy.ndimage import gaussian_filter
from wordcloud import WordCloud
def create_wordcloud(selected_user, df):
    f = open('stop_hinglish.txt', 'r')
    stop_words = f.read()

    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    temp = df  # the original filtering of media/system rows was lost here

    def remove_stop_words(message):
        y = []
        # loop reconstructed: keep only words outside the stopword list
        for word in message.lower().split():
            if word not in stop_words:
                y.append(word)
        return " ".join(y)

    wc = WordCloud(width=500, height=500, min_font_size=10,
                   background_color='white')
    temp['message'] = temp['message'].apply(remove_stop_words)
    df_wc = wc.generate(temp['message'].str.cat(sep=" "))  # reconstructed
    return df_wc
def set_custom_matplotlib_style():
    plt.style.use('seaborn-dark')
    plt.rcParams['axes.titlesize'] = 13.0
    plt.rcParams['axes.titleweight'] = 500
    plt.rcParams['figure.titlesize'] = 13.0
    plt.rcParams['figure.titleweight'] = 500
    plt.rcParams['text.color'] = '#242121'
    plt.rcParams['xtick.color'] = '#242121'
    plt.rcParams['ytick.color'] = '#242121'
    plt.rcParams['axes.labelcolor'] = '#242121'
    plt.rcParams['font.family'] = ['Source Sans Pro', 'Verdana', 'sans-serif']
    return None
@st.cache_data(show_spinner=False)
def read_file(file):
    bytes_data = file.getvalue()
    # temporary-file creation reconstructed; the original line was lost
    with tempfile.NamedTemporaryFile(delete=False) as temp:
        temp.write(bytes_data)
    parser = WhatsAppParser(temp.name)  # parser import not shown in the listing
    parser.parse_file()
    df = parser.parsed_messages.get_df()
    return df
def preprocess(df: pd.DataFrame,  # header line reconstructed
               selected_lang: str,
               selected_authors: list):
    # language settings: most of the lookup dictionary was lost in the
    # listing; only one Turkish entry survives as an example
    lang = {
        "English": {},  # the English entries were likewise lost
        "Turkish": {
            "gif": "GIF dahil edilmedi",
        },
    }

    df["timestamp"] = pd.to_datetime(df["timestamp"],
                                     errors='coerce')
    df["date"] = df["timestamp"].dt.strftime('%Y-%m-%d')
    df = df.loc[df["author"].isin(selected_authors)]
    df = df.sort_values(["timestamp"])

    locations = df.loc[df["is_location"] == 1]
    if locations.shape[0] > 0:
        locs = locations.message.str.split(":", expand=True)  # reconstructed
        locs[1] = locs[1].str[27:]

    df['is_link'] = ~df.message.str.extract(r'(https?:\S*)',
                                            expand=False).isnull() * 1
    df['msg_length'] = df.message.str.len()
    # media flags: the left-hand sides were lost in the listing and are
    # reconstructed from the column names used later in basic_stats
    df['is_image'] = df.message.isin(lang[selected_lang]["picture"]) * 1
    df['is_video'] = df.message.isin(lang[selected_lang]["video"]) * 1
    df['is_gif'] = df.message.isin(lang[selected_lang]["gif"]) * 1
    df['is_sticker'] = df.message.isin(lang[selected_lang]["sticker"]) * 1
    df['is_audio'] = df.message.isin(lang[selected_lang]["audio"]) * 1
    df['is_deleted'] = (df.message.isin(lang[selected_lang]["deleted"])) * 1
    # (a commented-out emoji frequency block using collections.Counter and
    # emoji.emoji_list appeared here in the original listing)

    # drop phone-number-only authors and rows with a missing author
    df = df[~(~df.author.str.extract(r'(\+)',
                                     expand=False).isnull() |
              df.author.isnull())]

    # a message starts a conversation when more than 7 hours have passed
    # since the previous one; the comparison was fragmented in the listing
    df['is_conversation_starter'] = ((df.timestamp -
                                      df.timestamp.shift(1)) >
                                     pd.Timedelta('7 hours')) * 1
df = df.drop("hour", axis=1).groupby(
'author').mean().rename(
columns={
"words": "Words",
"letters": "Letters",
"is_link": "Link",
"is_image": "Image",
Page 60 of 98
"is_video": "Video",
"is_gif": "GIF",
"is_audio": "Audio",
"is_sticker": "Sticker",
"is_deleted": "Deleted",
"is_location": "Location"
).style.format({
'Words': '{:.2f}',
'Letters': '{:.1f}',
'Link': '{:.2%}',
'Image': '{:.2%}',
'Video': '{:.2%}',
'GIF': '{:.2%}',
'Audio': '{:.2%}',
'Sticker': '{:.2%}',
'Deleted': '{:.2%}',
'Location': '{:.2%}'
}).background_gradient(axis=0)
return df
authors = df[["author"]].drop_duplicates()
temp = df.loc[df["is_image"] == 1]
images = pd.DataFrame(
temp.groupby("author")["is_image"].sum() / temp[
"is_image"].sum()).reset_index()
Page 61 of 98
temp = df.loc[df["is_video"] == 1]
"is_video"].sum()).reset_index()
temp = df.loc[df["is_link"] == 1]
"is_link"].sum()).reset_index()
temp = df.loc[df["is_conversation_starter"] == 1]
con_starters = pd.DataFrame(
temp.groupby("author")["is_conversation_starter"].sum() / temp[
"is_conversation_starter"].sum()).reset_index()
temp = df.loc[df["is_gif"] == 1]
gifs = pd.DataFrame(
temp.groupby("author")["is_gif"].sum() / temp[
"is_gif"].sum()).reset_index()
temp = df.loc[df["is_audio"] == 1]
audios = pd.DataFrame(
temp.groupby("author")["is_audio"].sum() / temp[
"is_audio"].sum()).reset_index()
temp = df.loc[df["is_sticker"] == 1]
stickers = pd.DataFrame(
temp.groupby("author")["is_sticker"].sum() / temp[
"is_sticker"].sum()).reset_index()
temp = df.loc[df["is_deleted"] == 1]
delete = pd.DataFrame(
Page 62 of 98
temp.groupby("author")["is_deleted"].sum() / temp[
"is_deleted"].sum()).reset_index()
temp = df.loc[df["is_location"] == 1]
locs = pd.DataFrame(
temp.groupby("author")["is_location"].sum() / temp[
"is_location"].sum()).reset_index()
    # the merges joining the per-author shares above into `authors`
    # were lost in the original listing
    authors = authors.fillna(
        {"is_sticker": 0,
         "is_gif": 0,
         "is_audio": 0,
         "is_video": 0,
         "is_conversation_starter": 0,
         "is_deleted": 0,
         "is_location": 0}).rename(
        columns={
            "is_link": "Link",
            "is_image": "Image",
            "is_video": "Video",
            "is_gif": "GIF",
            "is_audio": "Audio",
            "is_sticker": "Sticker",
            "is_deleted": "Deleted",
            "is_location": "Location"}
    ).style.format({
        'Link': '{:.2%}',
        'Image': '{:.2%}',
        'Video': '{:.2%}',
        'GIF': '{:.2%}',
        'Audio': '{:.2%}',
        'Sticker': '{:.2%}',
        'Deleted': '{:.2%}',
        'Location': '{:.2%}'
    }).background_gradient(axis=0)
    return authors
df["year"] = df["timestamp"].dt.year
min_year = df.year.max() - 5
['author',
'timestamp']).first().unstack(
level=0).resample('D').sum().msg_length.fillna(0)
smoothed_daily_activity_df = pd.DataFrame(
gaussian_filter(daily_activity_df,
(6, 0)),
index=daily_activity_df.index,
Page 64 of 98
columns=daily_activity_df.columns)
# fig, ax = plt.subplots()
# subplots = daily_activity_df.plot(figsize=[8,2*len(df.author.unique())],
# ax = smoothed_daily_activity_df.plot(figsize=[8, 2*len(
# plt.xlabel('')
# plt.tight_layout()
# plt.subplots_adjust(wspace=10, hspace=10)
# st.pyplot(fig)
return smoothed_daily_activity_df
distinct_dates = df[["date"]].drop_duplicates()
distinct_authors = df[["author"]].drop_duplicates()
distinct_authors['key'] = 1
distinct_dates['key'] = 1
on="key").drop(
"key", 1)
activity = pd.DataFrame(
df.groupby(["author", "date"])[
"words"].nunique()).reset_index()
activity["start_date"] = activity.groupby(["author"])[
"date"].transform(
Page 65 of 98
"min")
on=["date", "author"],
how="left")
distinct_dates["max_date"] = df.date.max()
['max_date', 'start_date']].apply(pd.to_datetime)
distinct_dates["date_diff"] = (
distinct_dates['max_date'] - distinct_dates[
'start_date']).dt.days
o = distinct_dates.groupby("author").agg(
.rename(columns={
"""
:param colors:
:param n_bins:
:return:
"""
Page 66 of 98
cmap_name = 'temp_cmap'
return (cm)
def relative_daily_activity(df: pd.DataFrame):
    min_year = df.year.max() - 3
    daily_activity_df = df.loc[df.year >= min_year].groupby(
        ['author', 'timestamp']).first().unstack(level=0).resample(
        'D').sum().msg_length.fillna(0)
    smoothed_daily_activity_df = pd.DataFrame(
        gaussian_filter(daily_activity_df, (6, 0)),
        index=daily_activity_df.index,
        columns=daily_activity_df.columns)
    # Normalise each day so the authors' shares sum to 1.
    o = smoothed_daily_activity_df.div(
        smoothed_daily_activity_df.sum(axis=1),
        axis=0)
    return o


def weekday_activity(df: pd.DataFrame):
    # Total message length per weekday (0 = Monday ... 6 = Sunday).
    o = df.groupby(
        [df.timestamp.dt.dayofweek, df.author]).msg_length.sum(
    ).unstack().fillna(0)
    # o.index = ["Monday", "Tuesday", "Wednesday", "Thursday",
    #            "Friday", "Saturday", "Sunday"]
    return o
def activity_time_of_day_ts(df: pd.DataFrame):
    a = df.groupby(
        [df.timestamp.dt.time,
         df.author]).msg_length.sum().unstack().fillna(0)
    # Make sure every minute of the day is present.
    a = a.reindex(
        [datetime.time(h, m) for h in range(24) for m in
         range(60)]).fillna(0)
    # Temporarily add the tail at the start and the head at the end
    # of the day so the smoothing wraps around midnight.
    a = pd.concat([a.tail(120), a, a.head(120)])
    # The smoothing call was lost in the listing; a Gaussian filter
    # along the time axis matches the surrounding code.
    b = pd.DataFrame(gaussian_filter(a, (60, 0)),
                     index=a.index,
                     columns=a.columns)
    b = b.iloc[120:-120]
    b = b.reset_index()
    b = b.rename_axis(None, axis=1)
    b['timestamp'] = pd.to_datetime(b['timestamp'].astype(str))
    # o = b.reset_index()
    # o = o.sort_values("timestamp")
    # b = o.set_index(["hour", "timestamp"])
    # o = b.plot(ax=plt.gca())
    # plt.xticks(range(0, 24*60*60 + 1, 3*60*60))
    # plt.xlabel('Time of day')
    # plt.ylabel('Relative activity')
    # plt.ylim(0, plt.ylim()[1])
    # plt.gca().legend(title=None)
    # plt.tight_layout()
    # st.pyplot(o)
    return b
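The pad-smooth-trim idiom above makes the smoothing circular: without it, activity just before midnight would not bleed into the first minutes of the day. The same effect is available from scipy's wrap mode; an illustration on made-up data:

import numpy as np
from scipy.ndimage import gaussian_filter1d

minute_activity = np.zeros(24 * 60)
minute_activity[0] = 60.0  # a burst right at midnight (made up)

# mode='wrap' treats the day as circular, like the pad/trim above.
smoothed = gaussian_filter1d(minute_activity, sigma=60, mode='wrap')
print(smoothed[:2], smoothed[-2:])  # mass on both sides of midnight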
def response_matrix(df: pd.DataFrame):
    fig, ax = plt.subplots()
    # Count how often each author directly follows each other
    # author; only the heatmap's keyword arguments survived in the
    # listing, so the seaborn call is a reconstruction.
    ax = (df
          .groupby([df.author.rename('Message author'),
                    df.author.shift(1).rename('Responding to...')])
          .size()
          .unstack()
          .pipe(sns.heatmap,
                cmap='viridis',
                cbar=False))
    plt.title('Response Matrix\n ')
    # The annotation string was lost in the listing.
    plt.gca().text(.5, 1.04, '',
                   transform=plt.gca().transAxes)
    plt.gca().set_yticklabels(plt.gca().get_yticklabels(),
                              va='center',
                              minor=False,
                              fontsize=8)
    # plt.gcf().text(0, 0, ..., va='bottom')
    plt.tight_layout()
    return fig
def longest_streak(df: pd.DataFrame):
    # Longest run of consecutive messages from one author. Name
    # reconstructed; in the original listing the maximum was
    # updated unconditionally, which reported the most recent
    # streak rather than the longest one.
    prev_sender = None
    max_spammer = None
    max_spam = 0
    tmp_spam = 0
    for jj in range(len(df)):
        current_sender = df['author'].iloc[jj]
        if current_sender == prev_sender:
            tmp_spam += 1
            if tmp_spam > max_spam:
                max_spam = tmp_spam
                max_spammer = current_sender
        else:
            tmp_spam = 0
        prev_sender = current_sender
    return max_spammer, max_spam
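The same scan can be written without an explicit Python loop. The sketch below is an illustration, not code from the report: marking each change of author and cumulative-summing the marks gives every unbroken run its own id.

# Vectorised alternative (illustration only, not from the report):
run_id = (df['author'] != df['author'].shift()).cumsum()
runs = df.groupby(run_id)['author'].agg(['first', 'size'])
longest = runs.loc[runs['size'].idxmax()]
print(longest['first'], longest['size'])  # author and run length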
def monthly_content(df: pd.DataFrame):
    # The construction of this table was lost in the listing;
    # counting messages per 'YearMonth' matches the column name
    # that survived.
    year_content = df.groupby('YearMonth').size().reset_index(
        name='messages')
    year_content = year_content.sort_values('YearMonth')
    return year_content
def weekday_radar(df: pd.DataFrame):
    # Figure setup reconstructed; only the subplot_kw survived.
    fig, ax = plt.subplots(1, 2, figsize=(7, 3),
                           subplot_kw={'projection': 'radar'})
    current_year = df.year.max()
    last_year = current_year - 1
    # The two chat-miner radar calls were lost apart from alpha=0;
    # passing a year-filtered frame follows chat-miner's examples.
    vis.radar(df.loc[df['year'] == last_year], ax=ax[0], alpha=0)
    vis.radar(df.loc[df['year'] == current_year], ax=ax[1],
              alpha=0)
    ax[0].set_title(last_year)
    ax[1].set_title(current_year)
    plt.tight_layout()
    return fig
def yearly_heatmaps(df: pd.DataFrame):
    current_year = df.year.max()
    last_year = current_year - 1
    if df.loc[df["year"] == last_year].shape[0] > 0:
        # Subplot setup and the first call's leading arguments were
        # lost; reconstructed symmetrically to the surviving call.
        fig, ax = plt.subplots(2, 1)
        vis.calendar_heatmap(df, year=last_year, linewidth=0,
                             monthly_border=True,
                             cmap='Oranges',
                             ax=ax[0])
        vis.calendar_heatmap(df, year=current_year,
                             linewidth=0,
                             monthly_border=True, ax=ax[1])
        ax[0].set_title(last_year)
        ax[1].set_title(current_year)
    else:
        ax = vis.calendar_heatmap(df, year=current_year,
                                  linewidth=0,
                                  monthly_border=True)
        ax.set_title(current_year)
        fig = plt.gcf()
    return fig
def activity_sunburst(df: pd.DataFrame):
    # Setup and leading arguments reconstructed around the surviving
    # keywords, following chat-miner's sunburst examples.
    fig, ax = plt.subplots(1, 2, figsize=(7, 3),
                           subplot_kw={'projection': 'polar'})
    vis.sunburst(df, highlight_max=True,
                 isolines=[2500, 5000],
                 isolines_relative=False, ax=ax[0])
    vis.sunburst(df, highlight_max=False,
                 isolines=[0.5, 1],
                 color='C1', ax=ax[1])
    return fig
def trend_stats(df: pd.DataFrame):
    author_df = df["author"].value_counts().reset_index()
    # The target column name was lost; "Messages" is assumed.
    author_df.rename(columns={"index": "Author",
                              "author": "Messages"},
                     inplace=True)
    df["year"] = df["timestamp"].dt.year
    df["month"] = df["timestamp"].dt.month
    max_year = df.year.max()
    max_month = df.loc[
        df.year == max_year].month.max()
    # Sortable year-month key (expression reconstructed).
    df["yearmonth"] = df["year"] * 100 + df[
        "month"]
    # Drop the (possibly incomplete) latest month so the trends are
    # not skewed; this filter reconstructs a lost df.loc line.
    df = df.loc[
        ~((df.year == max_year) & (df.month == max_month))]
    temp_df = df.pivot_table(
        index=["yearmonth"], columns=["author"],
        values="message", aggfunc="count")
    temp_df.columns = [c[-1] if isinstance(c, tuple) else c for c in
                       temp_df.columns]
    temp_df = temp_df.reset_index().sort_values(
        ["yearmonth"])
    temp_df.set_index('yearmonth', inplace=True)
    # Trend direction over the last 12, 6 and 3 months (the target
    # column names are assumed).
    author_df["Trend 12m"] = author_df[
        "Author"].apply(
        lambda x: trendline(temp_df.tail(12)[x]))
    author_df["Trend 6m"] = author_df[
        "Author"].apply(
        lambda x: trendline(temp_df.tail(6)[x]))
    author_df["Trend 3m"] = author_df[
        "Author"].apply(
        lambda x: trendline(temp_df.tail(3)[x]))
    return author_df
def word_counts(df: pd.DataFrame):
    # Join on spaces (the original ''.join would glue the last word
    # of one message to the first word of the next).
    words_lst = (
        ' '.join(df["message"].values.astype(str))).split(' ')
    df = pd.DataFrame.from_dict(Counter(words_lst),
                                orient='index',
                                columns=[
                                    "count"]).reset_index().rename(
        columns={'index': 'word'})
    df.sort_values('count', ascending=False,
                   inplace=True, ignore_index=True)
    # The derived column computed here was lost; a share-of-total
    # column is one plausible reconstruction.
    df["share"] = df["count"].apply(
        lambda c: c / df["count"].sum())
    return df
def trendline(data, order=1):
    # Reconstructed around the surviving slope check; labels assumed.
    coeffs = np.polyfit(range(len(data)), list(data), order)
    slope = coeffs[-2]
    if slope > 0:
        return "Rising"
    else:
        if slope < 0:
            return "Falling"
        else:
            return "Flat"

def extract_emojis(s):
    # Body lost; emoji.emoji_list recovers a string's emojis.
    return ''.join(m['emoji'] for m in emoji.emoji_list(s))

def gcd(a, b):  # Euclid's algorithm; def line lost in the listing
    if b == 0:
        return a
    return gcd(b, a % b)
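A quick sanity check of trendline on made-up series (the labels follow the reconstruction above):

rising = pd.Series([10, 12, 15, 19, 24, 30])
falling = pd.Series([30, 24, 19, 15, 12, 10])
print(trendline(rising))   # slope > 0 -> "Rising"
print(trendline(falling))  # slope < 0 -> "Falling"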
def findnum(s):
    # Smallest natural number the decimal string s must be
    # multiplied by to become whole, i.e. the denominator of its
    # reduced fraction form.
    n = len(s)
    count_after_dot = 0
    dot_seen = 0
    num = 0
    for i in range(n):
        if s[i] != '.':
            # Digit accumulation reconstructed; the line was lost.
            num = num * 10 + int(s[i])
            if dot_seen == 1:
                count_after_dot += 1
        else:
            dot_seen = 1
    # No dot seen: the number is already a natural.
    if dot_seen == 0:
        return 1
    # Result is the denominator divided by gcd(numerator, denominator).
    dem = 10 ** count_after_dot
    return dem // gcd(num, dem)


def percent_helper(percent):
    # Express a percentage as a whole-number ratio; the branch
    # bodies were lost in the listing and are reconstructed.
    ans = findnum(str(percent))
    if ans == 1:
        return f"{int(percent)} in 100"
    else:
        return f"{int(percent * ans)} in {100 * ans}"
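Taken together, these helpers expect a message DataFrame with columns such as author, timestamp, message, msg_length, date and the is_* flags. Below is a minimal sketch of wiring them into the Streamlit app via chat-miner's WhatsAppParser; the loading code is an assumption (the app's own entry point is not part of this listing), and the parser API differs slightly between chat-miner releases:

import streamlit as st
from chatminer.chatparsers import WhatsAppParser

# Parse an exported WhatsApp chat. Newer chat-miner releases use
# parse_file() plus parsed_messages.get_df(); older ones offer
# parse_file_into_df(). Check the installed version.
parser = WhatsAppParser("chat_export.txt")
parser.parse_file()
df = parser.parsed_messages.get_df()
df["msg_length"] = df["message"].str.len()

st.pyplot(response_matrix(df))   # who responds to whom
st.pyplot(yearly_heatmaps(df))   # calendar activity
st.dataframe(trend_stats(df))    # per-author message trends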
Chapter 7: Results
[Pic 7.1 – Pic 7.13: the application's result figures, reproduced full-page in the original report.]
Chapter 8: Future Scope
8.1 The future scope of the Chat Insight project is promising, with several potential avenues for expansion and enhancement:
8.1.6. Integration with Chatbots and AI Assistants:
Integrating Chat Insight with chatbots and AI assistants would enable automated insight generation and decision-making based on real-time chat analysis, improving the efficiency and effectiveness of chat-based interactions; a small sketch of such an integration follows.
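As one concrete illustration, the analyzer's statistics could sit behind a small HTTP endpoint that a chatbot platform calls after each conversation. The sketch below uses Flask, and every name in it (the /insights route, the message payload shape) is hypothetical, since no integration code exists in the project yet:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/insights", methods=["POST"])
def insights():
    # Hypothetical endpoint: a chatbot posts recent messages and
    # gets aggregate statistics back.
    messages = request.get_json()["messages"]
    words = sum(len(m["text"].split()) for m in messages)
    return jsonify({
        "messages": len(messages),
        "words": words,
        "participants": sorted({m["author"] for m in messages}),
    })

if __name__ == "__main__":
    app.run(port=5000)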
By pursuing these directions, the Chat Insight project can continue to evolve and adapt to the growing demands of analyzing chat data across diverse contexts and domains.
Chapter 9: Conclusion
9.1 In conclusion, the Chat Insight project offers a comprehensive solution for analyzing and extracting insights from textual conversations across various chat platforms. By combining natural language processing techniques, machine learning algorithms, and interactive visualization tools, Chat Insight lets users build a deeper understanding of chat data and derive actionable insights for a wide range of applications. The project addresses key challenges in chat data analysis, including sentiment analysis, named entity recognition, topic modeling, and visualization, giving users a holistic view of conversation dynamics, sentiment trends, and thematic content. With these capabilities, businesses can improve customer satisfaction, streamline communication processes, and make data-driven decisions that support growth and innovation.
9.2 Moving forward, the Chat Insight project has promising avenues for further development, including real-time analysis, multimodal support, advanced NLP techniques, and domain-specific solutions. By continuing to adapt to the changing needs of chat data analysis, the project can keep helping organizations unlock the full potential of their textual conversations and achieve meaningful outcomes in the digital age.
Chapter 10: References
https://dvatvani.com/joweich/chat-miner
https://github.com/joweich/chat-miner
Thanks to chat-miner for its easy WhatsApp parsing tools and its excellent charts.