
A Project Report

On

“CHAT INSIGHT”
Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology
In
Computer Science & Engineering

Submitted to-                                   Submitted by-

Ms. Urvashi                                     Aranya Dhull
Assistant Professor                             University Roll No. 1066691

VAISH COLLEGE OF ENGINEERING


(Affiliated to Maharishi Dayanand University, Rohtak)
ROHTAK- 124001

May 2024
INDEX

CHAPTER 1: INTRODUCTION
CHAPTER 2: OBJECTIVES
CHAPTER 3: SYSTEM REQUIREMENTS
3.1 H/W Requirements
3.2 S/W Requirements
3.3 Introduction to Tools/Technologies/S/W used in project
CHAPTER 4: S/W REQUIREMENTS ANALYSIS
4.1 Problem Definition
4.2 Modules and their Functionalities
CHAPTER 5: S/W DESIGN
5.1 S/W Development Lifecycle Model
5.2 Progress flow chart of Project
CHAPTER 6: SOURCE CODE
6.1 Frontend code
6.2 Backend code
CHAPTER 7: RESULTS
CHAPTER 8: FUTURE SCOPE
CHAPTER 9: CONCLUSION
CHAPTER 10: REFERENCES

CERTIFICATE

This is to certify that the project report entitled "CHAT INSIGHT", done by Ms. ARANYA
DHULL, Roll No. 20/CSE/109, University Roll No. 1066691, of Vaish College of
Engineering, Rohtak, towards partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in Computer Science & Engineering, is a bonafide record
of the work carried out by her under my supervision and guidance.

Date: Ms. Urvashi


Place: (Assistant Professor)
Vaish College of Engineering,
Rohtak

Dr. Bijender Bansal


H.O.D. (Computer Science)
Vaish College of Engineering,
Rohtak

ACKNOWLEDGEMENT

I take this opportunity to express my profound gratitude and deep regards to my guide, Ms.
Urvashi (Asst. Prof.), for her exemplary guidance, monitoring and constant encouragement
throughout the course of this project. The blessings, help and guidance given by her from
time to time shall carry me a long way in the journey of life on which I am about to embark.
I also take this opportunity to express a deep sense of gratitude to Dr. Bijender Bansal,
Head of the Department of Computer Science and Engineering, Rohtak, for his cordial
support, valuable information and guidance, which helped me in completing this task through
its various stages. I am also grateful to my teammates at my college, who helped me as part
of their team.
I am obliged to the staff members of the Computer Department for the valuable information
provided by them in their respective fields. I am grateful for their cooperation during the
period of my project.
Lastly, I thank the Almighty, my parents, brother, sisters and friends for their constant
encouragement, without which this project would not have been possible.

ARANYA
1066691
Department of Computer Science

Chapter 1: Introduction

1.1 In today's digital age, communication primarily occurs through various digital platforms
such as messaging applications, social media, and emails. The vast amount of textual data
generated through these channels presents a unique opportunity to extract valuable insights,
sentiments, and trends from conversations. However, manually analyzing such data can be
time-consuming and challenging. To address this issue, we introduce the Comprehensive
Chat Analyzer Tool, a project aimed at automating the analysis of textual conversations.

1.1.1. Project Overview


The Comprehensive Chat Analyzer Tool is a software application designed to analyze text-
based conversations from diverse sources, including but not limited to chat logs, social media
interactions, and email threads. The tool utilizes natural language processing (NLP)
techniques to extract valuable information from the data, enabling users to gain insights into
sentiment, topics discussed, and entities mentioned within the conversations.

1.1.2. Purpose and Scope


The primary purpose of the Chat Analyzer Tool is to streamline the process of analyzing
textual conversations, allowing users to extract meaningful insights efficiently. The tool
caters to a wide range of users, including researchers, marketers, customer service
representatives, and social media analysts, who can benefit from understanding the
underlying patterns and sentiments in conversations.

The scope of the project encompasses the development of a robust software application
capable of processing large volumes of text data, performing various NLP tasks such as
sentiment analysis, topic modeling, and named entity recognition, and presenting the results
in a clear and intuitive manner.

1.1.3. Features and Functionality


The Chat Analyzer Tool offers the following key features:

Text Data Import: Users can import text data from various sources, including text files, CSV
files, and APIs.
Preprocessing: The tool preprocesses the text data by removing noise, such as special
characters and stopwords, and tokenizing the text into meaningful units.
Sentiment Analysis: Analyzes the sentiment of the conversations, categorizing them as
positive, negative, or neutral.
Topic Modeling: Identifies the main topics discussed within the conversations using
techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization
(NMF).

Named Entity Recognition: Extracts named entities such as persons, organizations, and
locations mentioned in the conversations.
Visualization and Reporting: Presents the analysis results through interactive visualizations
such as word clouds, sentiment distribution charts, and topic clusters. Users can also generate
comprehensive reports summarizing the insights obtained.
1.1.4. System Architecture
The Chat Analyzer Tool follows a modular architecture comprising the following
components:

Data Input Module: Responsible for importing text data from various sources.
Preprocessing Module: Handles text preprocessing tasks such as noise removal and
tokenization.
Analysis Modules: Includes modules for sentiment analysis, topic modeling, and named
entity recognition.
Visualization Module: Generates interactive visualizations based on the analysis results.
Reporting Module: Generates detailed reports summarizing the insights obtained from the
conversations.
The architecture ensures scalability, flexibility, and ease of maintenance, allowing for future
enhancements and additions of new analysis modules.

1.1.5. Data Collection and Preprocessing


Before analysis, the textual data undergoes a preprocessing stage to ensure its cleanliness and
suitability for analysis. This stage involves several steps, including:

Data Import: Text data is imported from the specified sources, such as chat logs or social
media feeds.
Text Cleaning: Noise such as special characters, URLs, and punctuation marks is removed
from the text.
Tokenization: The text is tokenized into individual words or phrases for further analysis.
Stopword Removal: Common stopwords such as 'and', 'the', and 'is' are removed from the
text to focus on meaningful content.
The preprocessing stage lays the foundation for accurate analysis and insights generation.
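
As an illustration of these steps, a minimal Python sketch of the cleaning, tokenization, and
stopword-removal stage might look as follows (the stopword list here is a small illustrative
subset, not the full list the tool would use):

import re

# Small illustrative stopword list; a production run would use a fuller set.
STOPWORDS = {"and", "the", "is", "a", "an", "of", "to", "in"}

def preprocess(message: str) -> list[str]:
    """Clean one chat message and return its meaningful tokens."""
    text = message.lower()
    text = re.sub(r"https?://\S+", " ", text)         # strip URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # strip punctuation and special characters
    tokens = text.split()                             # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The delivery was late!! See https://example.com and the invoice."))
# ['delivery', 'was', 'late', 'see', 'invoice']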

1.1.6. Sentiment Analysis

Sentiment analysis is a crucial aspect of the Chat Analyzer Tool, enabling users to understand
the overall sentiment of the conversations. The tool employs machine learning algorithms and
lexicon-based approaches to classify the sentiment of each message or conversation as
positive, negative, or neutral.
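
As a hedged sketch of the lexicon-based route, the example below uses NLTK's VADER
analyzer (an assumed choice for illustration; the report does not name a specific lexicon) with
the conventional thresholds on the compound score:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def label_sentiment(message: str) -> str:
    # Classify a message as positive, negative, or neutral from the VADER compound score.
    compound = sia.polarity_scores(message)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

for msg in ["Thanks, that fixed my issue!", "This is the worst support ever.", "Order 4521 shipped."]:
    print(msg, "->", label_sentiment(msg))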

1.1.7. Topic Modeling


Topic modeling is employed to identify the underlying themes or topics within the
conversations. Techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative
Matrix Factorization (NMF) are utilized to extract topics based on the frequency and co-
occurrence of words within the text data.
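
A minimal LDA sketch with scikit-learn (which the project already lists among its tools) could
look like this; the sample messages and the number of topics are illustrative only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

messages = [
    "the delivery of my order was delayed again",
    "refund for the cancelled order has not arrived",
    "love the new app update, the dark mode is great",
    "app keeps crashing after the latest update",
]

# Bag-of-words counts feed the LDA model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(messages)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(counts)

# Show the top words defining each discovered topic.
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")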

1.1.8. Named Entity Recognition


Named Entity Recognition (NER) is utilized to identify and extract named entities such as
persons, organizations, and locations mentioned within the conversations. This information
provides valuable insights into the key entities associated with the discussions.
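
One common way to implement this step is with spaCy's pretrained pipeline; spaCy is not
listed among the project's tools, so the sketch below is an illustrative assumption rather than
the tool's actual implementation:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Priya from Acme Corp said the Delhi warehouse will reopen on Monday, "
        "and Rahul asked about the Mumbai office.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. PERSON, ORG, GPE, DATE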

1.1.9. Visualization and Reporting


The Chat Analyzer Tool offers interactive visualizations to present the analysis results in a
visually appealing and intuitive manner. Visualizations such as word clouds, sentiment
distribution charts, and topic clusters allow users to explore the insights derived from the
conversations effectively. Additionally, users can generate detailed reports summarizing the
key findings and trends observed in the data.
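
For instance, a word cloud and a sentiment distribution chart of the kind described above could
be produced with matplotlib and the wordcloud package (the latter is an assumption, as it is
not in the report's tool list), using placeholder data:

import matplotlib.pyplot as plt
from wordcloud import WordCloud   # assumed package, not listed in the report's tool set

messages = ["delivery was late", "great support team", "late refund", "support resolved it fast"]
sentiment_counts = {"positive": 2, "negative": 2, "neutral": 0}   # e.g. output of the sentiment module

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Word cloud over all messages combined.
wc = WordCloud(width=400, height=300, background_color="white").generate(" ".join(messages))
ax1.imshow(wc, interpolation="bilinear")
ax1.axis("off")
ax1.set_title("Word cloud")

# Sentiment distribution bar chart.
ax2.bar(sentiment_counts.keys(), sentiment_counts.values(), color=["green", "red", "grey"])
ax2.set_title("Sentiment distribution")

plt.tight_layout()
plt.show()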

1.2 The Chat Analyzer project is a comprehensive system that employs natural language
processing (NLP) techniques to analyze text-based conversations, extracting valuable insights
such as sentiment analysis, topic modeling, and user engagement metrics. It involves
processing large volumes of chat data to understand patterns, sentiments, and user behavior
for purposes such as improving customer service, understanding social interactions, and
monitoring online communities.
1.3 "Chat Analyzer" can refer to various tools or software designed to analyze conversations
that occur in chat platforms. These platforms could include messaging apps, social media
platforms, customer service chats, and more. Here are some common features and
functionalities of chat analyzers:

1.3.1. Sentiment Analysis:
Chat analyzers can determine the overall sentiment of a conversation by analyzing the
language used. They identify whether the conversation is positive, negative, or neutral. This
can be useful for businesses to gauge customer satisfaction or for researchers studying online
interactions.

1.3.2. Keyword Tracking:


These tools can track specific keywords or phrases within conversations. This could be
helpful for businesses monitoring mentions of their brand, product names, or topics related to
their industry.

1.3.3. Language Understanding:


Advanced chat analyzers may utilize natural language processing (NLP) techniques to
understand the context of conversations. They can identify entities, such as people, places, or
events mentioned in the chat, and extract meaningful insights from the text.

1.3.4. Conversation Flow Analysis:


Chat analyzers can analyze the flow of conversation, identifying patterns such as turn-taking,
topic shifts, and engagement levels. This can help businesses understand customer
interactions and improve their communication strategies.

1.3.5. Customer Support Optimization:

For businesses with customer support chat systems, chat analyzers can help optimize
response times, identify common issues, and route inquiries to the appropriate departments or
agents.

1.3.6. Historical Data Analysis:


Chat analyzers can analyze historical chat data to identify trends over time. Businesses can
use this information to make data-driven decisions and improve their operations.

1.3.7. Privacy and Security:

Given the sensitive nature of chat data, chat analyzers often include features to ensure privacy
and security compliance, such as data encryption and anonymization.

Overall, chat analyzers play a valuable role in understanding and extracting insights from
textual conversations, whether for customer service improvement, market research, or social
media monitoring.

Chapter 2: Objectives

2.1 A "Chat Insight" typically refers to a software tool or system designed to analyze
conversations, often in text form, between individuals or groups. The objectives of such a
tool can vary depending on its intended use and the specific features it offers. Here are some
common objectives of a chat insight:

2.1.1 Sentiment Analysis: One of the primary objectives is to gauge the sentiment of the
conversation. This involves determining whether the overall tone of the conversation is
positive, negative, or neutral. Understanding sentiment can be useful in various contexts, such
as customer service interactions, social media monitoring, or market research.

2.1.2 Topic Detection: Another key objective is to identify the main topics being discussed
in the conversation. By analyzing the language used and identifying keywords or themes, the
chat analyzer can categorize conversations into different topics or subjects. This can help in
organizing and summarizing large volumes of text data.

2.1.3 User Profiling: Chat insight tools often aim to create profiles of the individuals
participating in the conversation. This involves analyzing linguistic patterns, vocabulary, and
other factors to infer demographic information, interests, or behavioral traits of the users.
User profiling can be valuable for targeted advertising, personalized recommendations, or
social network analysis.

2.1.4 Anomaly Detection: Detecting anomalies or unusual patterns in conversations is
another objective of chat insight. This could include identifying outliers in sentiment,
detecting sudden changes in topic, or flagging suspicious language that may indicate
fraudulent activity, harassment, or other problematic behavior.

2.1.5 Language Understanding: Chat Insight may also focus on understanding the
semantics and context of the conversation. This involves tasks such as named entity
recognition (identifying names, organizations, locations, etc.), parsing sentence structure, and
resolving ambiguous references. Language understanding capabilities enable more
sophisticated analysis and interpretation of chat data.

2.1.6 Insights and Reporting: Ultimately, the goal of a chat insight is often to provide
actionable insights and generate reports based on the analysis. This could include
visualizations of sentiment trends over time, summaries of key topics discussed,
identification of influential users, or recommendations for improving communication
strategies.

2.1.7 Integration and Automation: Many chat analyzers are designed to integrate with
other systems or workflows and to automate certain tasks. For example, they might integrate
with customer relationship management (CRM) software to provide real-time analysis of
customer interactions, or with social media management tools to monitor brand mentions and
engagement.

2.1.8 Customization and Scalability: Depending on the specific use case and requirements,
chat analyzers may need to be highly customizable and scalable. Users may want to fine-tune
the analysis algorithms, define custom metrics, or process large volumes of data efficiently.
Providing flexibility and scalability ensures that the tool can adapt to diverse applications and
growing datasets.

By fulfilling these objectives, a chat analyzer can help organizations gain valuable insights
from their conversational data, improve decision-making processes, and enhance
communication strategies.

Chapter 3: System Requirements

System requirements refer to the specifications and capabilities necessary for a computer
system, software application, or project to operate effectively and efficiently. These
requirements encompass both hardware and software aspects and are essential for ensuring
that the system functions as intended and meets the needs of its users.

3.1 Hardware Requirements

3.1.1 Hardware requirements of the device on which the project was built:

iPad (10th generation) display


10.9-inch display
True Tone
Apple Pencil compatibility
The iPad (10th generation) has a larger display than its predecessor, which offers a 10.2-inch
Retina display with a 2160 x 1620 pixel resolution, resulting in a pixel density of 264ppi.

3.1.2 The display on the 10th-gen model expands to 10.9-inches. There's a 2360 x 1640
pixel resolution on board, which results in the same 264ppi pixel density as the 9th gen
model, though you of course get more screen within a very similar footprint.

3.1.3 The 10th generation iPad supports True Tone, as well as the Apple Pencil - still only 1st
Generation - and it comes with 500 nits of brightness and a fingerprint-resistant oleophobic
coating.

iPad (10th gen) hardware and specs


A14 chip
64/256GB storage
USB-C
3.1.4 The iPad (10th generation) runs on the A14 Bionic chip, which makes sense as this is
still an upgrade from the 9th generation but a step down from the iPad Air that runs on the
M1 chip. It's the same chip as the iPhone 12 models and iPad Air (2020).

3.1.5 Storage options for the 10th generation iPad are the same as the 9th generation model -
meaning it starts at 64GB, with a 256GB option too. As mentioned though, the iPad (10th
generation) switches to USB Type-C for charging in place of Lightning.

3.1.6 In terms of other hardware, there's a Smart Connector on the 10th generation model
again, as there is on the 9th generation model. There's also a 12-megapixel FaceTime HD
camera again, allowing for features like Centre Stage. On the rear, you'll also find a 12-
megapixel sensor.

3.1.7 The 3.5mm headphone jack has been removed for the 10th generation iPad.

iPad (10th gen) software


iPadOS 16
The Apple iPad (10th generation) will be compatible with iPadOS 16. It will therefore offer
the same user experience as other iPads in the company's portfolio, though it will miss out on
a couple of features.

3.1.8 Minimum Hardware requirements for building project are:

- Processor: Modern multi-core processor (e.g., Intel Core i5 or AMD Ryzen) for optimal
performance.
- Memory (RAM): Minimum 4GB RAM, recommended 8GB or more for smoother
operation, especially when handling large datasets.
- Storage: At least 100MB of free disk space for installation, additional space for caching
and storing user data.
- Graphics: Standard integrated or dedicated graphics card for rendering user interface
elements.
- Display: Monitor with a resolution of at least 1024x768 pixels for optimal viewing of
charts and data.

3.2 Software Requirements

3.2.1 Minimum software requirements are:

- Operating System: Compatibility with major desktop operating systems such as Windows
10, macOS, and Linux distributions (e.g., Ubuntu).
- Web Browser: Support for modern web browsers such as Google Chrome, Mozilla
Firefox, Safari, and Microsoft Edge for accessing web-based versions of the application.
- Runtime Environment: Installation of runtime environments like Node.js or Python (if
applicable) for server-side or scripting functionalities.
- Database: Compatibility with popular database systems such as MySQL, PostgreSQL, or
MongoDB for storing user preferences, watchlists, and historical data.

3.3 Introduction to Tools/Technologies/S/W used in project

The tools and technologies used in a chat analyzer can vary depending on the specific
features and objectives of the analyzer. Here are some common tools and technologies used
in building chat analyzers:

3.3.1 Altair (4.2.2)


Altair is a declarative statistical visualization library for Python, designed to make it easy to
create interactive visualizations of data. It allows users to generate a wide range of static and
interactive plots with concise and intuitive syntax.

Altair is designed to work seamlessly with Pandas DataFrames and supports a wide range of
chart types, including scatter plots, line charts, bar charts, histograms, heatmaps, and more.

Here are some key features and characteristics of Altair:

1. Declarative Syntax: Altair uses a declarative grammar of graphics approach, where you
specify the visualization's appearance and properties using a high-level description. This
makes it easy to create complex visualizations with minimal code.

2. Integration with Pandas: Altair seamlessly integrates with Pandas DataFrames, allowing
you to visualize data directly from DataFrame objects. This makes it convenient for data
analysis workflows, as you can easily explore and visualize datasets without needing to
preprocess them extensively.

3. Interactive Visualizations: Altair enables you to create interactive visualizations that
respond to user interactions such as hovering, clicking, zooming, and panning. This allows
users to explore and interact with data dynamically, gaining insights in an intuitive manner.

4. Rich Set of Chart Types: Altair supports a wide variety of chart types, including basic
charts like scatter plots and bar charts, as well as more advanced visualizations like trellis
plots, facet grids, and layered charts. This allows you to choose the most appropriate
visualization type for your data and analysis goals.

5. Customization and Theming: Altair provides options for customizing the appearance and
style of visualizations, including control over colors, markers, axes, labels, and legends.

6. Export and Sharing: Altair allows you to export visualizations to various formats, including
PNG, SVG, and HTML. This makes it easy to share visualizations with others or embed them
in reports, presentations, or web applications.

Overall, Altair is a powerful and user-friendly library for creating interactive and expressive
visualizations in Python, particularly suited for data exploration, analysis, and
communication tasks.
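
A small hedged example of the declarative syntax described above, using illustrative per-user
message counts:

import altair as alt
import pandas as pd

# Example per-user message counts, as might be produced by the analysis modules.
df = pd.DataFrame({
    "user": ["Aranya", "Rahul", "Priya", "Sam"],
    "messages": [120, 95, 210, 60],
})

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(x="user:N", y="messages:Q", tooltip=["user", "messages"])
    .properties(title="Messages per user")
)
chart.save("messages_per_user.html")   # interactive HTML export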

3.3.2 Calmap (0.0.9)


Calmap is a Python library that provides a calendar heatmap visualization for time series
data.
Calmap is a Python library for creating calendar heatmaps, also known as "yearly heatmaps"
or "time series heatmaps." These visualizations are useful for displaying time-series data
across days of the week, months, or years, with each cell colored according to the intensity of
the data.

Here are the main features and characteristics of calmap:

1. Calendar Heatmaps: calmap specializes in creating calendar heatmaps, where each cell
represents a day on the calendar, and the color intensity represents the value of the data for
that day. This allows for easy visualization of patterns and trends over time.

2. Customization: calmap provides options for customizing the appearance of the calendar
heatmap, including the color map, color range, cell size, and cell shape. This allows you to
tailor the visualization to your specific needs and preferences.

3. Support for Time Series Data: calmap is designed to work with time-series data, making it
easy to visualize temporal patterns and trends. It supports various time resolutions, including
daily, monthly, and yearly aggregations.

4. Integration with Matplotlib: calmap is built on top of Matplotlib, a popular plotting library
in Python. This means that calmap can be seamlessly integrated into Matplotlib figures and
plots, allowing for easy customization and combination with other types of visualizations.

5. Ease of Use: calmap provides a simple and intuitive interface for creating calendar
heatmaps, with straightforward functions for plotting time-series data. This makes it easy for
users to generate high-quality visualizations with minimal effort.

Overall, calmap is a useful tool for visualizing time-series data in a calendar heatmap format,
allowing for easy interpretation and analysis of temporal patterns and trends. It's particularly
well-suited for tasks such as visualizing daily or monthly fluctuations in data, identifying
seasonal patterns, and detecting anomalies or outliers in time-series datasets.
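
A minimal sketch of a yearly heatmap with calmap, assuming a pandas Series of daily message
counts indexed by date (the counts here are synthetic):

import numpy as np
import pandas as pd
import calmap
import matplotlib.pyplot as plt

# Synthetic daily message counts for one year, indexed by calendar date.
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
counts = pd.Series(np.random.poisson(20, len(days)), index=days)

# One year of chat activity as a calendar heatmap (darker cells = more messages).
fig, ax = plt.subplots(figsize=(12, 3))
calmap.yearplot(counts, year=2023, ax=ax, cmap="YlGn")
ax.set_title("Messages per day, 2023")
plt.show()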

3.3.3 Chat Miner (0.4.0)


chat-miner provides lean parsers for every major chat platform, transforming chats into data
frames. Artistic visualizations allow you to explore your data and create artwork from your
chats.

3.3.4 Matplotlib inline (0.1.6)


"Matplotlib inline" is not a specific package or library on its own. Instead, it's typically a
feature or mode used in the Matplotlib library when working in interactive Python
environments like Jupyter Notebooks.

When you use `%matplotlib inline` in a Jupyter Notebook cell, it tells Jupyter to display
Matplotlib plots directly within the notebook, rather than opening a separate window or
generating a file. This is often used for convenience and to keep all visualizations within the
notebook itself.
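
A typical notebook cell using this mode might look as follows (illustrative data):

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 25, 18, 30])   # rendered directly below the cell
plt.title("Messages per week")
plt.show()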

3.3.5 NumPy (1.24.2)
NumPy is a fundamental package for scientific computing with Python. It provides support
for large, multi-dimensional arrays and matrices, along with a collection of mathematical
functions to operate on these arrays efficiently. NumPy is widely used in various fields such
as data science, machine learning, engineering, and scientific research.

The version number "1.24.2" refers to a specific release version of the NumPy library. In
software development, version numbers typically follow a format like "major.minor.patch",
where:

- "Major" version changes indicate significant updates, often with breaking changes or major
new features.
- "Minor" version changes usually add new features or enhancements without breaking
existing functionality.
- "Patch" version changes typically address bugs or issues found in the software.

3.3.6 Pandas (1.5.3)


Pandas is a popular Python library for data manipulation and analysis, particularly for
working with structured data and time series data. It provides data structures like DataFrame
and Series, as well as functions and methods for performing operations such as filtering,
grouping, and merging data.

The version number "1.5.3" would follow the typical format for software versioning, where:

- "1" is the major version number, indicating significant updates or changes.


- "5" is the minor version number, indicating smaller updates or additions.
- "3" is the patch version number, typically indicating bug fixes or minor improvements.

3.3.7 Scikit-Learn (1.2.1)

Scikit-learn, often abbreviated as sklearn, is an open-source machine learning library for
Python. It is built on top of other popular Python libraries such as NumPy, SciPy, and
matplotlib. Scikit-learn provides a simple and efficient toolset for data mining and data
analysis tasks, including classification, regression, clustering, dimensionality reduction, and
model selection.

Here are some key features and components of scikit-learn:

1. Consistent API: Scikit-learn provides a consistent interface for various machine learning
algorithms, making it easy to experiment with different models without needing to learn new
syntax for each one.

2. Supervised and Unsupervised Learning Algorithms: It includes implementations of a wide
range of supervised learning algorithms such as linear models, support vector machines,
decision trees, random forests, gradient boosting, and naive Bayes, as well as unsupervised
learning algorithms such as clustering, dimensionality reduction, and anomaly detection.

3. Model Evaluation and Selection: Scikit-learn offers tools for model evaluation, including
various metrics for assessing model performance such as accuracy, precision, recall, F1-
score, ROC curves, and AUC-ROC. It also provides utilities for hyperparameter tuning and
cross-validation to select the best-performing model.

4. Preprocessing and Feature Engineering: The library includes utilities for preprocessing
data, such as scaling, normalization, encoding categorical variables, and handling missing
values. It also provides feature extraction and transformation methods for creating new
features or transforming existing ones.

5. Pipeline: Scikit-learn allows you to construct machine learning pipelines that chain
together multiple processing steps, such as data preprocessing, feature selection, and model
training, into a single object. This makes it easy to apply the same preprocessing steps
consistently to training and test data.

6. Integration with other Libraries: Scikit-learn integrates well with other Python libraries
such as pandas for data manipulation and matplotlib for data visualization.
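
The pipeline idea in point 5 can be sketched as follows; the tiny labeled sample is illustrative
only, and a real sentiment model would need far more training data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great help, thank you", "very slow response", "issue resolved quickly", "still waiting, terrible"]
labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),   # text -> TF-IDF features
    ("clf", MultinomialNB()),       # Naive Bayes classifier
])
model.fit(texts, labels)

print(model.predict(["thank you for the quick help"]))   # expected: ['positive']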

3.3.8 SciPy (1.10.0)

SciPy is an open-source Python library that is used for scientific and technical computing. It
builds on NumPy, another Python library, and provides additional functionality for
optimization, integration, interpolation, linear algebra, statistics, signal processing, and more.

SciPy is a fundamental library for many scientific computing tasks in Python, providing
efficient and optimized implementations of numerous mathematical algorithms and functions.
It is often used alongside other libraries like NumPy, matplotlib, and scikit-learn to perform
various tasks ranging from data analysis to simulation to optimization.

SciPy is organized into several submodules, each focusing on different aspects of scientific
computing:

1. SciPy.optimize: Provides functions for numerical optimization, including unconstrained
and constrained optimization, least squares minimization, and root finding.

2. SciPy.integrate: Contains functions for numerical integration and solving ordinary
differential equations (ODEs), including both initial value problems and boundary value
problems.

3. SciPy.interpolate: Offers tools for interpolation and smoothing, including spline
interpolation, interpolation on unstructured data, and radial basis function interpolation.

4. SciPy.linalg: Provides linear algebra routines for solving linear systems, computing
eigenvalues and eigenvectors, singular value decomposition (SVD), and other matrix
operations.

5. SciPy.sparse: Deals with sparse matrices and related operations, which are often
encountered in large-scale scientific computing and numerical simulations.

6. SciPy.stats: Contains a wide range of statistical functions and probability distributions for
statistical analysis and hypothesis testing.

7. SciPy.signal: Offers signal processing tools, including filtering, spectral analysis, and
waveform generation.

8. SciPy.special: Provides special functions such as Bessel functions, gamma functions, and
elliptic functions, which are commonly used in scientific computing and mathematical
physics.
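
As a small illustration, scipy.stats can test whether daily chat volume correlates with average
sentiment (the numbers below are made up):

from scipy import stats

# Daily message counts vs. daily average sentiment score (illustrative data).
messages_per_day = [40, 55, 32, 70, 48, 61, 52]
avg_sentiment    = [0.10, 0.25, -0.05, 0.40, 0.12, 0.30, 0.20]

r, p_value = stats.pearsonr(messages_per_day, avg_sentiment)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")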

3.3.9 Seaborn (0.12.2)

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level
interface for creating attractive and informative statistical graphics.

Here's an overview of Seaborn's key features and capabilities:

1. Statistical Visualization: Seaborn simplifies the process of creating statistical visualizations
by providing functions that automatically aggregate and summarize data. It offers support for
various plot types, including scatter plots, line plots, bar plots, histograms, box plots, violin
plots, and heatmaps.

2. Built-in Themes and Color Palettes: Seaborn includes several built-in themes and color
palettes that can be easily applied to plots to improve their aesthetics and readability. These
themes and palettes are designed to work well together and provide consistent styling across
different plot types.

3. Statistical Estimation: Seaborn provides functions for estimating and visualizing statistical
relationships in data, including linear regression models, correlation matrices, and kernel
density estimation plots.

4. Categorical Data Plotting: Seaborn offers specialized support for visualizing categorical
data, including functions for creating grouped bar plots, point plots, count plots, and
categorical scatter plots.

5. Matrix Plotting: Seaborn includes functions for creating matrix plots, such as cluster maps
and heatmap plots, which are useful for visualizing hierarchical clustering and pairwise
relationships in data.

6. Flexible Configuration Options: Seaborn provides a range of configuration options for
customizing plot appearance and behavior, including control over axis labels, titles, legends,
tick marks, and grid lines.
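
A brief hedged example of a categorical plot of the kind described above, on illustrative data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "sender": ["Aranya", "Rahul", "Aranya", "Priya", "Rahul", "Aranya"],
    "sentiment": ["positive", "neutral", "positive", "negative", "positive", "neutral"],
})

sns.set_theme(style="whitegrid")
sns.countplot(data=df, x="sender", hue="sentiment")   # messages per sender, split by sentiment
plt.title("Sentiment by sender")
plt.show()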

3.3.10 SMmap (5.0.0)


SMmap is a Python library that provides a pure Python implementation of a sliding window
memory map manager. It's primarily used for efficient handling of large memory-mapped
files in Python. Memory-mapped files are used when you need to work with large datasets
that may not fit entirely into memory.

Here's a brief overview of SMmap:

1. Memory Mapping: SMmap allows you to memory-map large files so that you can access
their contents as if they were stored in memory. This enables you to work with files that are
larger than the available physical memory.

2. Sliding Window Technique: SMmap uses a sliding window technique to efficiently manage
memory maps. Instead of mapping the entire file into memory at once, it maps only a portion
of the file (window) into memory at a time. As you access different parts of the file, SMmap
automatically adjusts the window to ensure that the necessary data is available in memory.

3. Efficient Access: By using memory mapping, SMmap provides efficient random access to
large files. It minimizes the overhead associated with reading and writing large amounts of
data by leveraging the operating system's virtual memory system.

4. Pure Python Implementation: SMmap is implemented in pure Python, making it easy to
install and use on different platforms without requiring any external dependencies.

SMmap is particularly useful in scenarios where you need to process or analyze large datasets
that are stored in files, such as log files, databases, or scientific data files.

3.3.11 Streamlit (1.28.1)
Streamlit is a Python library that enables you to create interactive web applications for
machine learning, data science, and other analytical tasks with minimal effort

Here are the key features and capabilities of Streamlit:

1. Simple and Intuitive: Streamlit allows you to build web applications using only Python
scripting. You can create interactive apps with just a few lines of code, without needing to
write HTML, CSS, or JavaScript.

2. Fast Iteration: With Streamlit, you can see the results of your code changes instantly in the
web app, thanks to its automatic rerunning and updating mechanism. This enables fast
iteration and experimentation during development.

3. Wide Range of Widgets: Streamlit provides a variety of widgets for building interactive
elements in your apps, including sliders, dropdowns, text inputs, buttons, and more. These
widgets make it easy to create user interfaces for controlling and visualizing data.

4. Integration with Python Libraries: Streamlit seamlessly integrates with popular Python
libraries such as Pandas, Matplotlib, Plotly, and scikit-learn, allowing you to leverage their
capabilities for data manipulation, visualization, and machine learning within your web apps.

Streamlit has gained popularity in the data science and machine learning community for its
simplicity and ease of use in building interactive applications.
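
A minimal sketch of a Streamlit front end for the analyzer might look like this (the file name
app.py and the placeholder analysis are assumptions for illustration):

# app.py -- run with:  streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Chat Insight")

uploaded = st.file_uploader("Upload an exported chat file (.txt)")
if uploaded is not None:
    lines = uploaded.read().decode("utf-8").splitlines()
    st.write(f"Loaded {len(lines)} lines of chat.")

    # Placeholder analysis: distribution of message lengths.
    df = pd.DataFrame({"length": [len(line) for line in lines]})
    st.bar_chart(df["length"].value_counts().sort_index())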

3.3.12 Tqdm (4.64.1)


TQDM is a Python library that provides a fast, extensible progress bar for loops and iterables.
It stands for "taqaddum" in Arabic, which means "progress" or "advancement".

Here's a summary of its main features and functionality:

1. Progress Bars: TQDM makes it easy to add progress bars to your Python loops and
iterables. This provides users with visual feedback on the progress of tasks, particularly
useful for lengthy computations or when dealing with large datasets.

2. Customization: TQDM offers various customization options for progress bars. You can
adjust the appearance, including the bar's style, color, width, and position. Additionally, you
can configure features like dynamic updates, nested progress bars, and estimated time
remaining.

3. Integration: TQDM integrates seamlessly with a wide range of Python data structures and
libraries, including lists, dictionaries, NumPy arrays, Pandas dataframes, and file objects. It
can be used with loops, list comprehensions, map functions, and other iterable constructs.

4. Performance: TQDM is designed for efficiency, minimizing overhead to your code. It
employs smart algorithms to estimate iteration times and update progress bars in real-time
without significantly impacting performance.

5. Extensibility: TQDM is highly extensible, allowing you to create custom progress bars and
integrate them into your projects. You can subclass the TQDM class to implement custom
behavior or develop plugins to extend its functionality.

TQDM is widely used in various Python projects, particularly in data science, machine
learning, and computational tasks, to monitor progress and enhance the user experience.
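
For example, wrapping a preprocessing loop in tqdm adds a progress bar with one extra line:

from tqdm import tqdm

messages = ["message %d" % i for i in range(100_000)]   # placeholder chat messages

cleaned = []
for msg in tqdm(messages, desc="Preprocessing"):   # progress bar over the loop
    cleaned.append(msg.lower().strip())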

3.3.13 Geopy (2.3.0)


Geopy is a Python library that provides geocoding and reverse geocoding capabilities,
allowing you to convert addresses into geographic coordinates (latitude and longitude) and
vice versa. It interfaces with various geocoding services such as Google Maps, Bing Maps,
OpenStreetMap Nominatim, and more.

Here's a breakdown of Geopy's main features and functionalities:

1. Geocoding: Geopy enables you to geocode addresses, meaning it can convert human-
readable addresses into geographic coordinates (latitude and longitude). This is useful for
tasks such as mapping, location-based services, and data analysis.

2. Reverse Geocoding: Conversely, Geopy supports reverse geocoding, which involves
converting geographic coordinates into human-readable addresses. This can be helpful for
identifying the location of a point on a map or finding nearby places based on coordinates.

3. Support for Multiple Providers: Geopy supports multiple geocoding providers, allowing
you to choose the service that best fits your needs. Some of the supported providers include
Google Maps API, Bing Maps API, OpenStreetMap Nominatim, ArcGIS, and more.

4. Pluggable Architecture: Geopy has a pluggable architecture, making it easy to add support
for additional geocoding providers or customize existing ones. This flexibility allows
developers to adapt Geopy to their specific requirements or preferences.

5. Simple API: Geopy provides a simple and consistent API for geocoding and reverse
geocoding operations, making it easy to integrate into your Python projects. It abstracts away
the complexities of interacting with various geocoding services, allowing you to focus on
your application logic.
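
A short geocoding and reverse-geocoding example with the Nominatim provider (the
user_agent string is a hypothetical identifier, and the lookup requires network access):

from geopy.geocoders import Nominatim

# Nominatim requires an identifying user_agent string (hypothetical name here).
geolocator = Nominatim(user_agent="chat-insight-demo")

location = geolocator.geocode("Rohtak, Haryana, India")
print(location.latitude, location.longitude)

address = geolocator.reverse((location.latitude, location.longitude))
print(address.address)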

Chapter 4: Software Requirements Analysis
Software requirements analysis is the process of identifying, eliciting, documenting, and
validating the needs and expectations of stakeholders for a software system. It's a crucial
phase in the software development lifecycle as it lays the foundation for the entire project.

4.1 Problem Definition

4.1.1 Chat Insight is a statistical analysis tool for WhatsApp chats. Working on the chat files
that can be exported from WhatsApp, it generates various plots showing, for example, which
other participant a user responds to the most. We propose to employ dataset manipulation
techniques to gain a better understanding of the WhatsApp chats present on our phones.

4.1.2 "In today's digital age, communication via chat platforms has become ubiquitous,
spanning various domains such as customer service, social networking, and online
collaboration. However, the sheer volume and complexity of textual conversations pose
significant challenges in extracting meaningful insights and optimizing communication
processes. Businesses and organizations need a robust solution to analyze chat data
effectively, identify patterns, sentiments, and key information, and derive actionable insights
to enhance customer satisfaction, streamline operations, and make data-driven decisions. The
problem statement for a chat analyzer is to develop a comprehensive tool or software solution
that can automatically process, analyze, and visualize chat data from diverse sources,
providing users with valuable insights into conversation dynamics, sentiment trends, topic
clusters, and other relevant metrics. This solution should prioritize accuracy, scalability,
privacy, and ease of use, catering to the needs of businesses, researchers, and other
stakeholders seeking to leverage chat data for various purposes."

Let's break down the problem statement with examples:

1. Volume and Complexity of Chat Data:


- Example: A customer support department in an e-commerce company receives thousands
of chat messages daily from customers inquiring about products, seeking assistance with
orders, or lodging complaints. The sheer volume of text data makes it challenging for support
agents to manually analyze and respond to each message efficiently.

2. Extracting Meaningful Insights:


- Example: A social media monitoring team wants to understand the public sentiment
towards their brand by analyzing comments and messages across various platforms. They
need to extract meaningful insights such as positive or negative sentiment, prevalent topics of
discussion, and key influencers driving conversations.

3. Optimizing Communication Processes:


- Example: A software development team uses a chat platform for real-time collaboration.
They face challenges in tracking project progress, identifying bottlenecks in communication,
and ensuring all team members are engaged and informed about updates and decisions.

4. Enhancing Customer Satisfaction:


- Example: A hospitality company relies on chatbots to handle customer inquiries and
bookings. However, they struggle to ensure that the chatbots accurately understand customer
requests, provide relevant information, and resolve issues promptly, leading to decreased
customer satisfaction.

5. Streamlining Operations:
- Example: A financial institution uses chat-based communication for internal discussions
among employees. They need a solution to analyze these conversations to identify areas for
process improvement, compliance risks, or opportunities for cost reduction.

In summary, the problem statement for a chat insight revolves around addressing the
challenges associated with analyzing large volumes of chat data, extracting meaningful
insights, optimizing communication processes, and ultimately leveraging this information to
improve customer satisfaction, streamline operations, and make informed decisions across
various domains and industries.

4.2 Modules and their functionalities
Here are the five core modules commonly used in a Chat Insight project:

4.2.1 Data Preprocessing Module:


- This module is responsible for cleaning and preparing the raw chat data for analysis. It
may include tasks such as text normalization (e.g., removing punctuation, converting to
lowercase), tokenization (splitting text into words or phrases), stop word removal, and
stemming or lemmatization (reducing words to their base form). Data preprocessing ensures
that the text data is standardized and ready for further analysis.
The data preprocessing module in a chat insight project plays a crucial role in preparing raw
chat data for further analysis. Here are some tasks performed by data preprocessing module:

1. Text Cleaning:
- Text data often contains various artifacts, such as HTML tags, special characters, and
punctuation, which need to be removed to ensure uniformity and consistency. Text cleaning
involves tasks like stripping HTML tags, removing non-alphanumeric characters, and
handling special cases like emojis or emoticons.

2. Tokenization:
- Tokenization is the process of splitting text into smaller units, typically words or phrases,
known as tokens. This step breaks down the text into its fundamental components, making it
easier to analyze and process. Tokenization can be performed using simple whitespace
splitting or more advanced techniques that consider punctuation and context.

3. Normalization:
- Normalization aims to standardize the text by converting it to a common format. This may
involve converting text to lowercase to ensure case-insensitive matching, expanding
contractions (e.g., "don't" to "do not"), and replacing abbreviations or acronyms with their
full forms.

4. Stemming or Lemmatization:
- Stemming and lemmatization are techniques used to reduce words to their base or root
forms, allowing for more effective analysis. Stemming involves stripping suffixes from words
to obtain their stems (e.g., "running" becomes "run"), while lemmatization uses linguistic
rules to return the canonical form of a word (e.g., "ran" becomes "run"). Both techniques help
consolidate variations of words and improve text normalization.

5. Spell Checking:
- Spell checking is optional but can be beneficial for improving the quality of text data. It
involves identifying and correcting spelling errors, ensuring that the text is accurate and
interpretable. Spell checking algorithms can automatically suggest corrections for misspelled
words based on dictionary lookup or statistical models.

By performing these preprocessing tasks, the data preprocessing module ensures that the chat
data is clean, standardized, and ready for subsequent analysis tasks such as sentiment
analysis, named entity recognition, and topic modeling. This module plays a critical role in
ensuring the accuracy and reliability of the insights derived from the chat data.
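
To make the tokenization, stopword-removal, and lemmatization steps above concrete, here is
a hedged sketch using NLTK (an assumed library choice, since the report's tool list does not
name one for this module):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)   # one-time resource downloads

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(message: str) -> list[str]:
    tokens = word_tokenize(message.lower())                # tokenization
    tokens = [t for t in tokens if t.isalnum()]            # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]    # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]       # lemmatization (noun form by default)

print(normalize("The drivers were running late and the orders were not delivered!"))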

4.2.2. Sentiment Analysis Module:


- This module analyzes the sentiment expressed in chat messages, determining whether the
text conveys positive, negative, or neutral sentiment. It typically employs machine learning or
lexicon-based approaches to classify the sentiment of each message. Sentiment analysis is
crucial for understanding customer satisfaction, identifying potential issues or grievances, and
monitoring public opinion towards a brand or topic.
The sentiment analysis module in a chat insight project is responsible for determining the
sentiment expressed in chat messages. The functions performed by this module are:

1. Classification Approach:
- Sentiment analysis typically employs a classification approach, where each chat message
is classified into one or more sentiment categories, such as positive, negative, or neutral. This
classification can be binary (positive/negative) or multiclass (positive/negative/neutral)
depending on the granularity of sentiment analysis required.

2. Machine Learning Models:


- Machine learning models are commonly used for sentiment analysis, especially for tasks
requiring more nuanced sentiment classification. Supervised learning algorithms, such as
Support Vector Machines (SVM), Naive Bayes, and neural networks (e.g., Recurrent Neural
Networks, Convolutional Neural Networks), are trained on labeled datasets to predict the
sentiment of unseen chat messages.

3. Lexicon-Based Approaches:
- Lexicon-based approaches rely on predefined sentiment lexicons or dictionaries that map
words to sentiment scores. Each word in a chat message is assigned a sentiment score, and
the overall sentiment of the message is determined based on the aggregation of these scores.

Lexicon-based methods are computationally efficient but may struggle with context-
dependent sentiment and nuances in language.

4. Rule-Based Systems:
- Rule-based systems use a set of predefined rules or patterns to identify sentiment in text.
These rules may be based on linguistic patterns, syntactic structures, or domain-specific
knowledge. Rule-based systems offer transparency and interpretability but may require
manual crafting of rules and lack the flexibility of machine learning approaches.

5. Aspect-Based Sentiment Analysis:


- Aspect-based sentiment analysis goes beyond overall sentiment classification to identify
the sentiment expressed towards specific aspects or entities mentioned in the chat messages.
This approach decomposes the sentiment of each message into multiple aspects (e.g., product
features, customer service quality) and analyzes the sentiment associated with each aspect
separately.

6. Handling Context and Negation:


- Sentiment analysis models need to account for contextual information and negation in
text. Contextual clues such as modifiers (e.g., "very good," "extremely bad") and intensifiers
(e.g., "excellent," "terrible") can influence the sentiment expressed in a message. Negation
cues (e.g., "not," "never") invert the sentiment of words following them, requiring models to
handle negation effectively.

7. Evaluation Metrics:
- The performance of sentiment analysis models is evaluated using metrics such as
accuracy, precision, recall, and F1-score. These metrics quantify the model's ability to
correctly classify chat messages into their respective sentiment categories. Additionally,
domain-specific evaluation may be necessary to ensure the model's effectiveness in real-
world applications.

The sentiment analysis module is integral to understanding the emotional tone and attitude
conveyed in chat conversations. By accurately identifying sentiment, businesses can gauge
customer satisfaction, detect emerging trends, and make informed decisions to enhance user
experience and brand perception.
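
A compact sketch of the machine-learning route with its evaluation metrics (precision, recall,
F1) using scikit-learn; the handful of labeled messages below is illustrative only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Tiny illustrative training set; a real system would use thousands of labeled messages.
train_texts = ["love the quick response", "issue solved, thanks a lot", "very helpful agent",
               "terrible service, still broken", "worst experience ever", "slow and unhelpful"]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

test_texts = ["thanks, very quick and helpful", "broken again, terrible"]
test_labels = ["positive", "negative"]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train_texts), train_labels)

pred = clf.predict(vec.transform(test_texts))
print(classification_report(test_labels, pred, zero_division=0))   # precision / recall / F1 per class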

4.2.3. Named Entity Recognition (NER) Module:
- NER is a natural language processing task that identifies and categorizes named entities
mentioned in text, such as people, organizations, locations, dates, and other entities of
interest. This module extracts relevant entities from chat conversations, enabling users to
identify key entities and topics discussed. NER can be useful for various applications,
including trend analysis, topic modeling, and entity-based sentiment analysis.
The Named Entity Recognition (NER) module in a chat insight project is designed to identify
and classify named entities mentioned in text data. Here are some key functionalities of the
NER module:

1. Entity Identification:
- The primary functionality of the NER module is to identify named entities within chat
messages. Named entities can include various types of entities such as persons, organizations,
locations, dates, numerical values, and more. The module scans the text data to identify spans
of text that represent named entities.

2. Entity Classification:
- Once named entities are identified, the NER module classifies them into predefined
categories. Common categories include:
- Person: Individual names of people.
- Organization: Names of companies, institutions, or groups.
- Location: Names of places, including cities, countries, and landmarks.
- Date: Specific dates or time expressions.
- Numeric: Numerical values, such as quantities or measurements.
- Others: Additional categories specific to the domain, such as product names or event
names.

3. Contextual Understanding:
- The NER module takes into account the context surrounding named entities to improve
accuracy and relevance. Contextual understanding helps distinguish between entities that may
have multiple meanings or interpretations based on the surrounding text. For example, "Paris"
could refer to both the city in France and a person's name.

4. Multi-lingual Support:
- Depending on the application requirements, the NER module may support multiple
languages. It should be capable of recognizing named entities in different languages and
adapting its classification approach accordingly. Multi-lingual support enhances the
versatility and usability of the chat analyzer across diverse linguistic contexts.

5. Error Handling and Ambiguity Resolution:


- The NER module should handle cases of ambiguity or uncertainty in entity recognition. It
may incorporate strategies to deal with ambiguous references, overlapping entities, or entities
with multiple interpretations. Error handling mechanisms ensure robustness and reliability in
entity extraction, even in challenging linguistic scenarios.

Overall, the NER module enhances the capability of the chat insight to extract structured
information from unstructured text data, enabling users to identify and analyze named entities
of interest within chat conversations.

4.2.4. Topic Modeling Module:


- This module identifies the underlying topics or themes present in a collection of chat
messages. It employs algorithms such as Latent Dirichlet Allocation (LDA) or Non-Negative
Matrix Factorization (NMF) to discover latent topics based on the distribution of words
across documents. Topic modeling helps users gain insights into the main subjects of
conversation, detect emerging trends, and categorize messages according to their thematic
content.
The Topic Modeling module in a Chat Insight project is responsible for identifying
underlying themes or topics present in a collection of chat messages. Here are the key
functionalities of the Topic Modeling module:

1. Topic Discovery:
- The primary functionality of the Topic Modeling module is to discover latent topics within
the chat data. It identifies groups of words that frequently co-occur across multiple chat
messages, suggesting the presence of underlying themes or subjects of discussion.

2. Unsupervised Learning:
- Topic modeling is typically an unsupervised learning task, meaning it does not require
labeled data for training. Instead, it automatically learns the latent topics present in the chat
data based on the distribution of words across documents.

3. Algorithm Selection:
- The Topic Modeling module may employ various algorithms to perform topic discovery.
Common algorithms include Latent Dirichlet Allocation (LDA), Non-Negative Matrix
Factorization (NMF), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic
Analysis (PLSA). Each algorithm has its own strengths and weaknesses, and the choice
depends on factors such as the size of the dataset, computational resources, and desired
interpretability.

4. Topic Representation:
- Once topics are discovered, the Topic Modeling module represents them in a meaningful
way for interpretation. Each topic is typically represented as a distribution of words, with
words ranked by their probability of occurring in the topic. This representation allows users
to understand the key terms associated with each topic.

5. Topic Labeling:
- The module may provide functionality for labeling topics based on the most representative
words. Automatic topic labeling helps users quickly understand the content of each topic
without having to manually inspect the entire list of associated words.
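
A hedged sketch of topic discovery, representation, and simple automatic labeling using NMF
from scikit-learn (the messages and the choice of two topics are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

messages = [
    "refund for my cancelled order still pending",
    "order delivery delayed by a week",
    "app crashes every time I open the chat screen",
    "new app update broke the login screen",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(messages)

nmf = NMF(n_components=2, random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(X)            # message-by-topic weight matrix

terms = tfidf.get_feature_names_out()
for k, component in enumerate(nmf.components_):
    label = ", ".join(terms[i] for i in component.argsort()[-3:][::-1])
    print(f"Topic {k}: {label}")             # simple automatic topic label from top words

print(doc_topics.argmax(axis=1))             # dominant topic per message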

4.2.5. Visualization Module:


- This module generates visualizations to present the results of chat analysis in a clear and
intuitive manner. It may include various types of visualizations such as word clouds, bar
charts, line charts, heatmaps, and network graphs. Visualization aids users in understanding
patterns, trends, and relationships within the chat data, facilitating data exploration,
interpretation, and communication of insights.
These modules collectively form the backbone of a chat insight project, enabling users to
preprocess, analyze, and visualize chat data to extract meaningful insights and derive
actionable conclusions.
The Visualization module in a chat insight project is responsible for presenting the results of
data analysis in a visual format. Here are the key functionalities of the Visualization module:

1. Data Exploration:
- The primary functionality of the Visualization module is to enable users to explore the
chat data visually. This includes displaying summary statistics, such as message frequency,
word counts, and user activity over time. Visualization facilitates initial data inspection and
helps users gain a high-level understanding of the dataset.

2. Sentiment Visualization:
- The module can visualize the sentiment distribution of chat messages using various charts
or graphs. Common visualization techniques include sentiment histograms, pie charts, or
stacked bar charts showing the proportion of positive, negative, and neutral messages.
Sentiment visualization provides insights into the overall sentiment trends in the chat data.

3. Named Entity Visualization:


- Named entities extracted from chat messages can be visualized to provide insights into the
entities mentioned most frequently. This may involve generating word clouds, tag clouds, or
bar charts showing the frequency of different entity types (e.g., person names, organization
names, locations). Named entity visualization helps users identify key entities and topics of
discussion.

4. Topic Visualization:
- The module visualizes the topics discovered through topic modeling to aid in their
interpretation. This may include interactive topic proportion plots, word clouds showing the
most representative words for each topic, or dendrogram-based topic hierarchies. Topic
visualization enables users to explore the content of topics and understand their relationships.

5. Time Series Visualization:


- Chat data often includes timestamps, allowing for the visualization of temporal trends and
patterns. The module can generate time series plots or line charts showing message
frequency, sentiment scores, or topic proportions over time. Time series visualization helps
users identify trends, seasonality, and anomalies in chat activity.

6. Network Visualization:
- In cases where chat data involves interactions between users or entities, the module may
visualize the communication network. This could involve creating node-link diagrams or
social network graphs showing connections between users based on message exchanges.
Network visualization provides insights into the structure of communication and the
relationships between participants.

By providing these functionalities, the Visualization module enables users to visually inspect,
analyze, and communicate insights derived from chat data effectively.
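
To make the sentiment visualization functionality above concrete, here is a minimal sketch
using the same Streamlit and Altair stack as the rest of the project. It assumes a dataframe
with a hypothetical `sentiment` column (the current Chat Insight source does not compute
sentiment scores); the data is made up purely for illustration.

# Illustrative sketch only: bar chart of sentiment label counts.
# The `sentiment` column is a hypothetical output of a sentiment analysis step.
import pandas as pd
import altair as alt
import streamlit as st

df = pd.DataFrame({"sentiment": ["positive", "negative", "neutral", "positive"]})

counts = (df["sentiment"].value_counts()
          .rename_axis("sentiment")
          .reset_index(name="messages"))

chart = (alt.Chart(counts)
         .mark_bar()
         .encode(x="sentiment:N", y="messages:Q", color="sentiment:N")
         .properties(width=400, height=300, title="Sentiment distribution"))
st.altair_chart(chart)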

Chapter 5: S/W Design

5.1 S/W Development Lifecycle Model

5.1.1 For the chat analyzer software, we want a development lifecycle model that allows for
iterative development, frequent testing, and continuous improvement. The phases below
outline such a model.

1. Requirements Gathering and Analysis:


- Define the objectives and scope of the chat analyzer software.
- Gather requirements from stakeholders, including features, functionalities, and
performance expectations.
- Analyze the data sources (chat logs) and the types of analysis required.

2. Design:
- Create a high-level architectural design outlining the components of the chat analyzer.
- Design the user interface for interacting with the analyzer.
- Develop data models and algorithms for analyzing chat data.

3. Implementation:
- Develop the software according to the design specifications.
- Implement algorithms for sentiment analysis, entity recognition, topic modeling, etc.
- Integrate with any external APIs or services required for additional functionality.

4. Testing:
- Conduct unit testing for individual components to ensure they function as expected.
- Perform integration testing to ensure different modules work together seamlessly.
- Conduct functional testing to verify that the software meets the specified requirements.
- Perform performance testing to ensure the software can handle the expected workload.

5. Deployment:

- Prepare the software for deployment in the target environment.
- Set up necessary infrastructure, including servers, databases, and security measures.
- Deploy the software to production or staging environments.

6. Feedback and Evaluation:


- Gather feedback from users and stakeholders on the deployed software.
- Evaluate the performance of the chat analyzer against the initial objectives.
- Identify areas for improvement or additional features based on feedback and evaluation.

7. Maintenance and Updates:


- Provide ongoing maintenance and support for the deployed software.
- Address any bugs or issues reported by users.
- Release updates with new features or improvements based on feedback and changing
requirements.

8. Continuous Improvement:
- Continuously monitor the performance and usage of the chat analyzer.
- Collect data on user behavior and system performance to inform future updates.
- Incorporate user feedback and market trends to prioritize future development efforts.

This lifecycle model combines elements of iterative development, agile methodologies, and
continuous improvement to ensure the chat analyzer software meets the evolving needs of its
users.

For the chat analyzer software, we can use an Agile software development model,
specifically Scrum. Scrum is well suited to projects where requirements may evolve over
time and there is a need for frequent feedback and adaptation. Here are the steps to adapt
Scrum to the development of the chat analyzer:

1. Product Backlog Creation:


- Define a product backlog containing all the desired features and functionalities of the chat
analyzer. This backlog is dynamic and can evolve throughout the project.

2. Sprint Planning:
- Plan short development cycles called sprints, typically lasting 2-4 weeks.
- During sprint planning, select a subset of items from the product backlog to work on
during the sprint. These items should deliver tangible value to the users.

3. Daily Standups:
- Hold daily standup meetings where the development team discusses progress, challenges,
and plans for the day.
- These meetings ensure transparency and keep the team aligned towards sprint goals.

4. Development:
- Develop and implement the selected backlog items during the sprint.
- Use iterative development practices to deliver working software increments at the end of
each sprint.

5. Sprint Review:
- At the end of each sprint, conduct a sprint review meeting where the development team
demonstrates the completed work to stakeholders.
- Gather feedback from stakeholders to validate whether the delivered features meet their
expectations.

6. Sprint Retrospective:
- Hold a sprint retrospective meeting after the sprint review to reflect on what went well,
what could be improved, and any adjustments needed for the next sprint.
- Identify and address process improvements to enhance team productivity and product
quality.

7. Incremental Delivery:
- Continuously deliver increments of the chat analyzer software with each sprint, allowing
stakeholders to see tangible progress and provide feedback throughout the development
process.

8. Adaptation:
- Use feedback gathered from stakeholders during sprint reviews to adapt the product
backlog and adjust development priorities as needed.
- Embrace change and flexibility to ensure the chat analyzer meets evolving requirements
and user needs.

By following the Scrum framework, we can effectively manage the development of the chat
analyzer software in a collaborative and adaptive manner, delivering value to users
incrementally while remaining responsive to changes in requirements and feedback.

5.2 Progress flowchart of Project

Chapter 6: Source Code

6.1 Frontend code

from geopy.geocoders import Nominatim

from helpers import *

import altair as alt

st.set_page_config(

page_title="CHAT INSIGHT",

page_icon="🧊",

layout="wide",

initial_sidebar_state="expanded",

menu_items={

'Report a bug': "https://github.com/Ankurkatri/Chatinsight/issues",

'About': "# This is Chat Insight"

}

)

st.cache_data.clear()

st.cache_resource.clear()

st.write("""

## Chat insight

""")

with st.expander("About this app"):

st.markdown(

"""

V1.0 2024-02-25:

### What's New?

- GIF, Sticker, Audio, Deleted, Location Message Statistics.

- Maps for shared location

- Talkativeness & Messaging Trends

- General Formatting & Chart Redesign

### Info

- This does not save your chat file.

- Note that it only supports English and Turkish right now.

- Most of the charts are based on group chats but it works for DMs too;

some of the charts will be pointless but give it a shot.

- Sometimes WhatsApp can have problems with date formats while exporting.

If there is an error after uploading, check your file's date format;

there might be some inconsistency in the date formatting.

- It may take around 2 minutes to process a 20 MB chat file on the

server.

- Possible to-dos:

- Aggregate multiple people into one. Sometimes a user can have multiple

numbers and we should give a chance to see them as one single user.

- Charts can be changed by year via a dropdown.

- Add emoji support

- Exportable pdf

- More prescriptive

- Demo chat

- Last but not least - Thanks to [chat-miner](https://github.com/joweich/chat-miner)

for the easy WhatsApp parsing tool and their awesome charts. Thanks to

[Dinesh Vatvani](https://dvatvani.github.io/whatsapp-analysis.html) for his great analysis.

- source: [ankur](https://github.com/Ankurkatri/Chatinsight/)

"""

)

def app():

session_state = st.session_state

if "df" not in session_state:

file = st.file_uploader("Upload WhatsApp chat file without media. "

"The file should be .txt", type="txt")

@st.cache_data

def read_sample_data():

df = pd.read_csv(

"https://raw.githubusercontent.com/koftezz/whatsapp-chat-analyzer/"

"0aee084ffb8b8ec4869da540dc95401b8e16b7dd/data/sample_file.txt", header=None)

return df.to_csv(index=False).encode('utf-8')

csv = read_sample_data()

st.download_button(

label="Download sample chat file.",

data=csv,

file_name='data/sample_file.txt',

)

if file is not None:

df = read_file(file)

df = df.sort_values("timestamp")

# The first three entries are most likely the group creation messages.

df = df[3:]

edited_df = df[["author"]].drop_duplicates()

edited_df["combined authors"] = edited_df["author"]

edited_df = st.data_editor(edited_df)

author_list = edited_df["combined authors"].drop_duplicates().tolist()

with st.form(key='my_form_to_submit'):

selected_authors = st.multiselect(

"Choose authors of the group",

author_list)

selected_lang = st.radio(

"What\'s your Whatsapp Language?",

("English", 'Turkish'))

submit_button = st.form_submit_button(label='Submit')

if submit_button and len(selected_authors) < 2:

st.warning("Proceeding with all of the authors. Please "

"check that there might be some problematic "

"authors.",

icon="")

selected_authors = df["author"].drop_duplicates().tolist()

if submit_button:

df, locations = preprocess_data(df=df,

selected_lang=selected_lang,

selected_authors=selected_authors)

with st.expander("Show the `Chat` dataframe"):

st.dataframe(df)

# Set up colors to use for each author to

# keep them consistent across the analysis

author_list = df.groupby('author').size().index.tolist()

author_color_lookup = {author: f'C{n}' for n, author in

enumerate(author_list)}

author_color_lookup['Group total'] = 'k'

def formatted_barh_plot(s,

pct_axis=False,

thousands_separator=False,

color_labels=True,

sort_values=True,

width=0.8,

**kwargs):

if sort_values:

s = s.sort_values()

s.plot(kind='barh',

color=s.index.to_series().map(

author_color_lookup).fillna(

'grey'),

width=width,

**kwargs)

if color_labels:

for color, tick in zip(

s.index.to_series().map(

author_color_lookup).fillna(

'grey'),

plt.gca().yaxis.get_major_ticks()):

tick.label1.set_color(

color) # set the color property

if pct_axis:

if type(pct_axis) == int:

decimals = pct_axis

else:

decimals = 0

plt.gca().xaxis.set_major_formatter(

ticker.PercentFormatter(1, decimals=decimals))

elif thousands_separator:

plt.gca().xaxis.set_major_formatter(

ticker.FuncFormatter(

lambda x, p: format(int(x), ',')))

return plt.gca()

msg = f"## Overall Summary\n" \

f"{len(df)} total messages from" \

f" {len(df.author.unique())} " \

f"people " \

f"from {df.date.min()} to {df.date.max()}."

st.write(msg)

# Basic summary of messages

st.write(

"## Basic summary of messages")

with st.expander("More info"):

st.write("All the "

"statistics are averages of the respective "

"columns. For example: Link means the average share of "

"messages sent with a link.\n- Conversation starter is "

"defined as a message sent at least 7 hours after the"

" previous message on the thread\n")

o = basic_stats(df=df)

st.dataframe(o, use_container_width=True)

# Basic summary of messages

st.write("## Summary across authors")

o = stats_overall(df=df)

st.dataframe(o)

st.write("## Talkativeness & Messaging Trends")

author_df = trend_stats(df=df)

st.dataframe(author_df, use_container_width=True)

# Total messages sent stats

st.write("## Number of Messages Sent By Author")

o = pd.DataFrame(

df.groupby('author')["message"].count()).reset_index()

most_active = \

o.sort_values("message", ascending=False).iloc[0][

'author']

total_msg = o.sort_values("message",

ascending=False).iloc[0][

'message']

st.write(f"Here is the chatterbox of the group :red"

f"[{most_active}], with a total of"

f" {total_msg} messages. All by him/herself. 🤯")

c = alt.Chart(o).mark_bar().encode(

x=alt.X("author", sort="-y"),

y=alt.Y('message:Q'),

color='author',

)

rule = alt.Chart(o).mark_rule(color='red').encode(

y='mean(message):Q'

)

c = (c + rule).properties(width=600, height=600,

title='Number of messages sent'

)

st.altair_chart(c)

# Average message length by use

o = activity(df=df)

most_active = \

o.sort_values("Activity %", ascending=False).iloc[0][

'author']

most_active_perc = o.sort_values("Activity %",

ascending=False).iloc[0][

'Activity %']

st.write(f"""## Activity Stats by Author""")

st.write(f":red[{most_active}] is online "

f"{round(most_active_perc, 2)}% of the conversations. "

f"Go get a job!")

with st.expander("More info"):

st.info(

"It shows, for each author, the percentage of days with an active"

" conversation on which that author sent at least"

" one message.")

c = alt.Chart(o).mark_bar().encode(

x=alt.X("author:N", sort="-y"),

y=alt.Y('Activity %:Q'),

color='author',

)

rule = alt.Chart(o).mark_rule(color='red').encode(

y='mean(Activity %):Q'

)

c = (c + rule).properties(width=600, height=600,

title='Activity % by author'

)

st.altair_chart(c)

# Smoothed stacked activity area timeseries plot

st.write("""## Activity Area Plot """)

with st.expander("More info"):

min_year = df.year.max() - 5

st.info("It is an absolute plot in which we can see who has "

"been more active in terms of total messages."

" It is smoothed with a gaussian "

"filter since the data is likely to be "

f"noisy.\nChart starts from year {min_year + 1}")

smoothed_daily_activity_df = smoothed_daily_activity(df=df)

st.area_chart(smoothed_daily_activity_df)

# Relative activity timeseries - 100% stacked area plot

st.write("""

## Relative Activity Area Plot

""")

with st.expander("More info"):

min_year = df.year.max() - 3

st.info("It is a relative plot in which we can see who has "

"been more active "

"with respect to each other. It basically shows how "

"the activity percentage of each author changes "

"over time. It is smoothed with a gaussian "

"filter since the data is likely to be "

f"noisy.\nChart starts from year {min_year + 1}")

o = relative_activity_ts(df=df)

st.area_chart(o)

# Timeseries: Activity by day of week

st.write("""

## Activity by day of week

0 - Monday

6 - Sunday

""")

o = activity_day_of_week_ts(df=df)

st.line_chart(o)

# Timeseries: Activity by time of day

st.write("""

## Activity by time of day (UTC)

""")

b = activity_time_of_day_ts(df=df)

c = alt.Chart(b).transform_fold(

selected_authors,

as_=['author', "message"]

).mark_line().encode(

x=alt.X('utchoursminutes(timestamp):T', axis=alt.Axis(

format='%H:00'),

scale=alt.Scale(type='utc')),

y='message:Q',

color='author:N'

).properties(width=1000, height=600)

st.altair_chart(c)

# Response matrix

st.write("""

## Response Matrix

""")

with st.expander("More info"):

st.info("This does not consider the content of the "

"message. It is based on who the previous "

"sender of the message is. Self-consecutive messages "

"within 3 minutes are excluded.")

with st.container():

fig = response_matrix(df=df)

st.pyplot(fig)

st.write("""

## Response Time Distribution

""")

with st.expander("More info"):

st.info("Self-consecutive messages "

"within 3 minutes are excluded."

" Median response time shows that author X "

"responded to the messages at least y mins later, "

"half of the time.")

# Response time

prev_msg_lt_180_seconds = (df.timestamp - df.timestamp.shift(

1)).dt.seconds < 180

same_prev_author = (df.author == df.author.shift(1))

fig, ax = plt.subplots()

plt.subplot(121)

o = df[~(prev_msg_lt_180_seconds & same_prev_author)]

o["response_time"] = (

(o.timestamp - o.timestamp.shift(1)).dt.seconds

.replace(0, np.nan)

.div(60)

.apply(np.log10))

o = o[["author", "response_time"]]

o.groupby("author")["response_time"].apply(

sns.kdeplot)

plt.title('Response time distribution', fontsize=8)

plt.ylabel('Relative frequency', fontsize=8)

plt.xlabel('Response time (Mins)', fontsize=8)

locs, ticks = plt.xticks()

plt.xticks(locs, [f"$10^{{{int(loc)}}}$" for loc in locs])

plt.subplot(122)

o = df[~(prev_msg_lt_180_seconds & same_prev_author)]

o["response_time"] = (

(o.timestamp - o.timestamp.shift(1)).dt.seconds

.replace(0, np.nan)

.div(60))

response = o[["author", "response_time", "letters"]]

response.groupby("author").median()["response_time"].pipe(

formatted_barh_plot)

plt.title('Median response time', fontsize=8)

plt.ylabel('')

plt.xlabel('Response time (Mins)', fontsize=8)

plt.tight_layout()

with st.container():

slow_typer = response.groupby("author").median()[

"response_time"]. \

sort_values()[-1:].index[0]

st.write(f"Looks like :red[{slow_typer}] has much to do, "

f"except responding to you guys on time.👨‍💻👩‍💻")

st.pyplot(fig)

std = response.response_time.std()

mean = response.response_time.mean()

response = response.loc[response["response_time"] <= mean + 3

* std]

c = alt.Chart(response).mark_point(size=60).encode(

x='letters',

y='response_time',

color='author',

tooltip=["author", "response_time", "letters"]

)

c = (c + c.transform_regression("letters",

'response_time').mark_line()). \

properties(width=1000, height=600,

title='Response Time vs Number of letters in '

'a message'

).interactive()

st.write("## Is number of letters correlated with the "

"response time?")

with st.container():

st.altair_chart(c)

max_spammer, max_spam = spammer(df=df)

st.write("""

## Who is the spammer?

The most spam is from :red[%s] with %d consecutive

messages. """ % (

max_spammer, max_spam))

st.write("""

## Year x Month Total Messages

""")

year_content = year_month(df=df)

total_messages = year_content.sort_values("message",

ascending=False).iloc[

0].message

year = int(year_content.sort_values("message",

ascending=False).iloc[

0].YearMonth / 100)

month = \

year_content.sort_values("message", ascending=False).iloc[

0].YearMonth % 100

st.write(f"You broke the monthly record! A total of"

f" {total_messages} messages"

f" in {year}-{str(month).rjust(2, '0')}.💥 ")

c = alt.Chart(year_content).mark_bar().encode(

x=alt.X("YearMonth:O", ),

y=alt.Y('message:Q'),

color='year:O',

)

rule = alt.Chart(year_content).mark_rule(color='red').encode(

y='mean(message):Q'

)

c = (c + rule).properties(width=1000, height=600,

title='Total Number of messages '

'sent over years'

)

st.altair_chart(c)

st.write("""

## Sunburst: Message count per daytime

""")

with st.expander("More info"):

st.info("- Left chart shows the realized values."

"\n- Right chart shows the adjusted values based "

"on "

"max message count.")

fig = sunburst(df=df)

st.pyplot(fig)

# st.write("""

# Radarchart: Message count per weekday

# """)

# fig = radar_chart(df=df)

# st.pyplot(fig)

st.write(""" ## Heatmap: Message count per day """)

fig = heatmap(df=df)

st.pyplot(fig)

if locations.shape[0] > 0:

st.write(""" ## Map of Locations""")

# geolocator = Nominatim(user_agent="loc_finder")

# with st.expander("More info"):

# st.info("This map shows all the locations which are "

# "sent by the authors via whatsapp. The "

# "latitude and Longitude values are extracted "

# "from google maps.")

# with st.spinner('This may take a while. Wait for it...'):

# for i, row in locations.iterrows():

# location = geolocator.reverse((row.lat, row.lon)).raw

# locations.loc[i, "country"] = location["address"]["country"]

# locations.loc[i, "town"] = location["address"]["town"]

# st.write("### Top shared locations")

# st.dataframe(pd.DataFrame(locations.groupby(["country", "town"])["lat"]

# .count()).rename(columns={"lat":

# "count"}).sort_values(

# "count", ascending=False))

st.map(locations)

st.cache_data.clear()

st.cache_resource.clear()

if __name__ == "__main__":

app()

6.2 Backend code

import streamlit as st

import pandas as pd

import numpy as np

from chatminer.chatparsers import WhatsAppParser

import seaborn as sns

import datetime

import tempfile

import chatminer.visualizations as vis

from matplotlib import pyplot as plt

from matplotlib import ticker

from matplotlib.colors import LinearSegmentedColormap

from scipy.ndimage import gaussian_filter

import math

from collections import Counter

from wordcloud import WordCloud,STOPWORDS
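
# create_wordcloud: removes stop words (loaded from stop_hinglish.txt),
# group notifications and "<Media omitted>" placeholders, then builds a
# WordCloud image from the remaining message text of the selected user.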

def create_wordcloud(selected_user,df):

f = open('stop_hinglish.txt', 'r')

stop_words = f.read()

if selected_user != 'Overall':

df = df[df['user'] == selected_user]

temp = df[df['user'] != 'group_notification']

temp = temp[temp['message'] != '<Media omitted>\n']

def remove_stop_words(message):

y = []

for word in message.lower().split():

if word not in stop_words:

y.append(word)

return " ".join(y)

wc = WordCloud(width=500,height=500,min_font_size=10,background_color='white')

temp['message'] = temp['message'].apply(remove_stop_words)

df_wc = wc.generate(temp['message'].str.cat(sep=" "))

return df_wc

def set_custom_matplotlib_style():

plt.style.use('seaborn-dark')

plt.rcParams['figure.figsize'] = [10, 6.5]

plt.rcParams['axes.titlesize'] = 13.0

plt.rcParams['axes.titleweight'] = 500

plt.rcParams['figure.titlesize'] = 13.0

plt.rcParams['figure.titleweight'] = 500

plt.rcParams['text.color'] = '#242121'

plt.rcParams['xtick.color'] = '#242121'

plt.rcParams['ytick.color'] = '#242121'

plt.rcParams['axes.labelcolor'] = '#242121'

plt.rcParams['font.family'] = ['Source Sans Pro', 'Verdana', 'sans-serif']

return (None)

@st.cache_data(show_spinner=False)

def read_file(file):

with tempfile.NamedTemporaryFile(mode="wb") as temp:

with st.spinner('This may take a while. Wait for it...'):

bytes_data = file.getvalue()

temp.write(bytes_data)

parser = WhatsAppParser(temp.name)

parser.parse_file()

df = parser.parsed_messages.get_df()

return df
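
# preprocess_data: parses timestamps, keeps only the selected authors,
# flags link / image / video / GIF / sticker / audio / deleted / location
# messages according to the selected WhatsApp language, extracts latitude
# and longitude pairs from shared Google Maps links, drops phone-number
# authors, and marks conversation starters (messages sent more than
# 7 hours after the previous message).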

def preprocess_data(df: pd.DataFrame,

selected_lang: str,

selected_authors: list):

# language settings

lang = {"English": {"picture": "image omitted",

"video": "video omitted",

"gif": "GIF omitted",

"audio": "audio omitted",

"sticker": "sticker omitted",

"deleted": ["This message was deleted.",

"You deleted this message."],

"location": "Location https://"

},

"Turkish": {"picture": "görüntü dahil edilmedi",

"video": "video dahil edilmedi",

"gif": "GIF dahil edilmedi",

"audio": "ses dahil edilmedi",

"sticker": "Çıkartma dahil edilmedi",

"deleted": ["Bu mesaj silindi.",

"Bu mesajı sildiniz."],

"location": "Konum https://"}}

df["timestamp"] = pd.to_datetime(df["timestamp"],

errors='coerce')

df["date"] = df["timestamp"].dt.strftime('%Y-%m-%d')

df = df.loc[df["author"].isin(selected_authors)]

df = df.sort_values(["timestamp"])

df["is_location"] = (df.message.str.contains('maps.google') == True) * 1

locations = df.loc[df["is_location"] == 1]

df.loc[df.is_location == 1, 'message'] = np.nan

if locations.shape[0] > 0:

locs = locations["message"].str.split(" ", expand=True)

locs[1] = locs[1].str[27:]

locs = locs[1].str.split(",", expand=True)

locs = locs.rename(columns={0: "lat", 1: "lon"})

locs = locs.loc[(locs["lat"] != "") & (locs["lon"] != "") \

& (~locs["lat"].isna()) & (~locs["lon"].isna())]

locations = locs[["lat", "lon"]].astype(float).drop_duplicates()

# Deal with links (URLS) as messages

df['is_link'] = ~df.message.str.extract('(https?:\S*)',

expand=False).isnull() * 1

# Extract message length

df['msg_length'] = df.message.str.len()

df.loc[df.is_link == 1, 'msg_length'] = np.nan

# Deal with multimedia messages to flag them and

# set the text to null

df['is_image'] = (df.message == lang[selected_lang][

"picture"]) * 1

df.loc[df.is_image == 1, 'message'] = np.nan

df['is_video'] = (df.message == lang[selected_lang][

"video"]) * 1

df.loc[df.is_video == 1, 'message'] = np.nan

# Flag each media type and blank out the corresponding message text

df['is_gif'] = (df.message == lang[selected_lang][

"gif"]) * 1

df.loc[df.is_gif == 1, 'message'] = np.nan

df['is_sticker'] = (df.message == lang[selected_lang][

"sticker"]) * 1

df.loc[df.is_sticker == 1, 'message'] = np.nan

df['is_audio'] = (df.message == lang[selected_lang][

"audio"]) * 1

df.loc[df.is_audio == 1, 'message'] = np.nan

df['is_deleted'] = (df.message.isin(lang[selected_lang]["deleted"])) * 1

df.loc[df.is_deleted == 1, 'message'] = np.nan

# df["emoji_count"] = df["message"].apply(lambda word_list:

# collections.Counter([match[

# "message"]

# for word in

# word_list for

# match in

# emoji.emoji_list(

# word)]))

# Filter out rows with no known author

# or phone numbers as authors

df = df[~(~df.author.str.extract('(\+)',

expand=False).isnull() |

df.author.isnull())]

# Add field to flag the start of a new conversation

# Conversation starter defined as a message sent at least

# 7 hours after the previous message on the thread

df['is_conversation_starter'] = ((

df.timestamp - df.timestamp.shift(

1)) > pd.Timedelta(

'7 hours')) * 1

return df, locations

def basic_stats(df: pd.DataFrame):

df = df.drop("hour", axis=1).groupby(

'author').mean().rename(

columns={

"words": "Words",

"msg_length": "Message Length",

"letters": "Letters",

"is_link": "Link",

"is_conversation_starter": "Is Conversation Starter",

"is_image": "Image",

"is_video": "Video",

"is_gif": "GIF",

"is_audio": "Audio",

"is_sticker": "Sticker",

"is_deleted": "Deleted",

"is_location": "Location"

}

).style.format({

'Words': '{:.2f}',

'Message Length': '{:.1f}',

'Letters': '{:.1f}',

'Link': '{:.2%}',

'Is Conversation Starter': '{:.2%}',

'Image': '{:.2%}',

'Video': '{:.2%}',

'GIF': '{:.2%}',

'Audio': '{:.2%}',

'Sticker': '{:.2%}',

'Deleted': '{:.2%}',

'Location': '{:.2%}'

}).background_gradient(axis=0)

return df

def stats_overall(df: pd.DataFrame):

authors = df[["author"]].drop_duplicates()

temp = df.loc[df["is_image"] == 1]

images = pd.DataFrame(

temp.groupby("author")["is_image"].sum() / temp[

"is_image"].sum()).reset_index()

temp = df.loc[df["is_video"] == 1]

videos = pd.DataFrame(temp.groupby("author")["is_video"].sum() / temp[

"is_video"].sum()).reset_index()

temp = df.loc[df["is_link"] == 1]

links = pd.DataFrame(temp.groupby("author")["is_link"].sum() / temp[

"is_link"].sum()).reset_index()

temp = df.loc[df["is_conversation_starter"] == 1]

con_starters = pd.DataFrame(

temp.groupby("author")["is_conversation_starter"].sum() / temp[

"is_conversation_starter"].sum()).reset_index()

temp = df.loc[df["is_gif"] == 1]

gifs = pd.DataFrame(

temp.groupby("author")["is_gif"].sum() / temp[

"is_gif"].sum()).reset_index()

temp = df.loc[df["is_audio"] == 1]

audios = pd.DataFrame(

temp.groupby("author")["is_audio"].sum() / temp[

"is_audio"].sum()).reset_index()

temp = df.loc[df["is_sticker"] == 1]

stickers = pd.DataFrame(

temp.groupby("author")["is_sticker"].sum() / temp[

"is_sticker"].sum()).reset_index()

temp = df.loc[df["is_deleted"] == 1]

delete = pd.DataFrame(

temp.groupby("author")["is_deleted"].sum() / temp[

"is_deleted"].sum()).reset_index()

temp = df.loc[df["is_location"] == 1]

locs = pd.DataFrame(

temp.groupby("author")["is_location"].sum() / temp[

"is_location"].sum()).reset_index()

authors = pd.merge(authors, images, on=["author"], how="left")

authors = pd.merge(authors, videos, on=["author"], how="left")

authors = pd.merge(authors, audios, on=["author"], how="left")

authors = pd.merge(authors, con_starters, on=["author"], how="left")

authors = pd.merge(authors, links, on=["author"], how="left")

authors = pd.merge(authors, gifs, on=["author"], how="left")

authors = pd.merge(authors, stickers, on=["author"], how="left")

authors = pd.merge(authors, delete, on=["author"], how="left")

authors = pd.merge(authors, locs, on=["author"], how="left")

authors = authors.fillna(

{"is_sticker": 0,

"is_gif": 0,

"is_audio": 0,

"is_video": 0,

"is_conversation_starter": 0,

"is_deleted": 0,

"is_location": 0}).rename(

columns={

"is_link": "Link",

"is_conversation_starter": "Is Conversation Starter",

"is_image": "Image",

"is_video": "Video",

"is_gif": "GIF",

"is_audio": "Audio",

"is_sticker": "Sticker",

"is_deleted": "Deleted",

"is_location": "Location"

}

).style.format({

'Link': '{:.2%}',

'Is Conversation Starter': '{:.2%}',

'Image': '{:.2%}',

'Video': '{:.2%}',

'GIF': '{:.2%}',

'Audio': '{:.2%}',

'Sticker': '{:.2%}',

'Deleted': '{:.2%}',

'Location': '{:.2%}'

}).background_gradient(axis=0)

return authors
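
# smoothed_daily_activity: builds a per-author daily activity series
# (total message length per day over roughly the last five years) and
# smooths it with a gaussian filter so the area chart is less noisy.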

def smoothed_daily_activity(df: pd.DataFrame):

df["year"] = df["timestamp"].dt.year

min_year = df.year.max() - 5

daily_activity_df = df.loc[df["year"] > min_year].groupby(

['author',

'timestamp']).first().unstack(

level=0).resample('D').sum().msg_length.fillna(0)

smoothed_daily_activity_df = pd.DataFrame(

gaussian_filter(daily_activity_df,

(6, 0)),

index=daily_activity_df.index,

columns=daily_activity_df.columns)

# fig, ax = plt.subplots()

# subplots = daily_activity_df.plot(figsize=[8,2*len(df.author.unique())],

# subplots=True, sharey=True, lw=0.3, label=None)

# ax = smoothed_daily_activity_df.plot(figsize=[8, 2*len(

# df.author.unique())], subplots=True, ax=ax)

# [ax.set_title(auth) for auth, ax in zip(df.groupby('author').size().index, subplots)]

# [ax.set_ylabel('Activity (characters per day)') for auth, ax in zip(df.groupby('author').size().index,


subplots)]

# plt.xlabel('')

# [ax.legend(['Daily activity', 'Gaussian-smoothed']) for ax in subplots]

# [ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, p: format(int(x), ','))) for ax in


subplots]

# plt.tight_layout()

# plt.subplots_adjust(wspace=10, hspace=10)

# st.pyplot(fig)

return smoothed_daily_activity_df

def activity(df: pd.DataFrame):

distinct_dates = df[["date"]].drop_duplicates()

distinct_authors = df[["author"]].drop_duplicates()

distinct_authors['key'] = 1

distinct_dates['key'] = 1

distinct_dates = pd.merge(distinct_dates, distinct_authors,

on="key").drop(

"key", 1)

activity = pd.DataFrame(

df.groupby(["author", "date"])[

"words"].nunique()).reset_index()

activity["start_date"] = activity.groupby(["author"])[

"date"].transform(

"min")

activity["is_active"] = np.where(activity['words'] > 0, 1, 0)

distinct_dates = pd.merge(distinct_dates, activity,

on=["date", "author"],

how="left")

distinct_dates["max_date"] = df.date.max()

distinct_dates[['max_date', 'start_date']] = distinct_dates[

['max_date', 'start_date']].apply(pd.to_datetime)

distinct_dates["date_diff"] = (

distinct_dates['max_date'] - distinct_dates[

'start_date']).dt.days

o = distinct_dates.groupby("author").agg(

{"is_active": "sum", "date_diff": "max"})

o["is_active_percent"] = 100 * (o["is_active"] / o["date_diff"])

return o.reset_index().drop(["is_active", "date_diff"], 1) \

.rename(columns={

"is_active_percent": "Activity %"

})

def create_colormap(colors=['w', 'g'], n_bins=256):

"""

Function to create bespoke linear segmented color map.

Will be useful to create colormaps for each user

consistent with their colour scheme

:param colors:

:param n_bins:

:return:

"""

n_bins = 256 # Discretizes the interpolation into bins

cmap_name = 'temp_cmap'

# Create the colormap

cm = LinearSegmentedColormap.from_list(cmap_name, colors, N=n_bins)

return (cm)

def relative_activity_ts(df: pd.DataFrame):

min_year = df.year.max() - 3

daily_activity_df = df.loc[df["year"] > min_year].groupby(

['author', 'timestamp']).first().unstack(level=0).resample(

'D').sum().msg_length.fillna(0)

smoothed_daily_activity_df = pd.DataFrame(

gaussian_filter(daily_activity_df, (6, 0)),

index=daily_activity_df.index,

columns=daily_activity_df.columns)

o = smoothed_daily_activity_df.div(

smoothed_daily_activity_df.sum(axis=1),

axis=0)

return o

def activity_day_of_week_ts(df: pd.DataFrame):

o = df.groupby(

[df.timestamp.dt.dayofweek, df.author]).msg_length.sum(

).unstack().fillna(0)

# o = o.reindex(["Monday", "Tuesday", "Wednesday",

# "Thursday", "Friday", "Saturday",

# "Sunday"])

return o

def activity_time_of_day_ts(df: pd.DataFrame):

a = df.groupby(

[df.timestamp.dt.time,

df.author]).msg_length.sum().unstack().fillna(0)

a = a.reindex(

[datetime.time(i, j) for i in range(24) for j in

range(60)]).fillna(0)

# Temporarily add the tail at the start and head and the end of the

# data frame for the gaussian smoothing

# to be continuous around midnight

a = pd.concat([a.tail(120), a, a.head(120)])

# Apply gaussian convolution

b = pd.DataFrame(gaussian_filter(a.values, (60, 0)),

index=a.index,

columns=a.columns)

# Remove the points temporarily added from the ends

b = b.iloc[120:-120]

b = b.reset_index()

b = b.rename_axis(None, axis=1)

b['timestamp'] = pd.to_datetime(b['timestamp'].astype(str))

# o = b.reset_index()

# o["hour"] = o["timestamp"].apply(lambda x: x.hour)

# # o["hour"] = o["hour"].astype(str) + ":00"

# o = o.sort_values("timestamp")

# b = o.set_index(["hour", "timestamp"])

# # Plot the smoothed data

# o = b.plot(ax=plt.gca())

# plt.xticks(range(0,24*60*60+1, 3*60*60))

# plt.xlabel('Time of day')

# plt.ylabel('Relative activity')

# plt.ylim(0, plt.ylim()[1])

# plt.title('Activity by time of day')

# plt.gca().legend(title=None)

# plt.tight_layout()

# st.pyplot(o)

return b
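
# response_matrix: excludes consecutive messages by the same author sent
# within 3 minutes, counts who the previous sender was for each remaining
# message, normalises each row, and plots a "who responds to whom" heatmap.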

def response_matrix(df: pd.DataFrame):

prev_msg_lt_180_seconds = (df.timestamp - df.timestamp.shift(

1)).dt.seconds < 180

same_prev_author = (df.author == df.author.shift(1))

fig, ax = plt.subplots()

ax = (df

[~(prev_msg_lt_180_seconds & same_prev_author)]

.groupby([df.author.rename('Message author'),

df.author.shift(1).rename('Responding to...')])

.size()

.unstack()

.pipe(lambda x: x.div(x.sum(axis=1), axis=0))

.pipe(sns.heatmap, vmin=0, annot=True, fmt='.0%',

cmap='viridis',

cbar=False))

plt.title('Reponse Martix\n ')

plt.gca().text(.5, 1.04,

"Author of previous message when a message is sent*",

ha='center', va='center', size=8,

transform=plt.gca().transAxes);

plt.gca().set_yticklabels(plt.gca().get_yticklabels(),

va='center',

minor=False,

fontsize=8)

# plt.gcf().text(0, 0,

# "*Excludes messages to self within 3 mins",

# va='bottom')

plt.tight_layout()

return fig
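
# spammer: walks through the messages in order and tracks the longest run
# of consecutive messages by the same author, returning that author and
# the length of the run.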

def spammer(df: pd.DataFrame):

prev_sender = []

max_spam = 0

tmp_spam = 0

for jj in range(len(df)):

current_sender = df['author'].iloc[jj]

if current_sender == prev_sender:

tmp_spam += 1

if tmp_spam > max_spam:

max_spam = tmp_spam

max_spammer = current_sender

else:

tmp_spam = 0

prev_sender = current_sender

return max_spammer, max_spam

def year_month(df: pd.DataFrame):

df[

'YearMonth'] = df.timestamp.dt.year * 100 + df.timestamp.dt.month

year_content = df.groupby(["year", 'YearMonth'], as_index=False).count()[

["year", 'YearMonth', 'message']]

year_content = year_content.sort_values('YearMonth')

return year_content

def radar_chart(df: pd.DataFrame):

fig, ax = plt.subplots(1, 2, figsize=(7, 3),

subplot_kw={'projection': 'radar'})

current_year = df.year.max()

last_year = current_year - 1

ax[0] = vis.radar(df.loc[df["year"] == last_year], ax=ax[0])

ax[1] = vis.radar(df.loc[df["year"] == current_year], ax=ax[1], color='C1',

alpha=0)

ax[0].set_title(last_year)

ax[1].set_title(current_year)

plt.tight_layout()

return fig

def heatmap(df: pd.DataFrame):

current_year = df.year.max()

last_year = current_year - 1

if df.loc[df["year"] == last_year].shape[0] > 0:

fig, ax = plt.subplots(2, 1, figsize=(9, 3))

ax[0] = vis.calendar_heatmap(df, year=last_year,

monthly_border=True,

cmap='Oranges',

ax=ax[0])

ax[1] = vis.calendar_heatmap(df, year=current_year,

linewidth=0,

monthly_border=True, ax=ax[1])

ax[0].set_title(last_year)

ax[1].set_title(current_year)

else:

fig, ax = plt.subplots(figsize=(9, 3))

ax = vis.calendar_heatmap(df, year=current_year,

linewidth=0,

monthly_border=True)

ax.set_title(current_year)

return fig

def sunburst(df: pd.DataFrame):

fig, ax = plt.subplots(1, 2, figsize=(7, 3),

subplot_kw={'projection': 'polar'})

ax[0] = vis.sunburst(df, highlight_max=True,

isolines=[2500, 5000],

isolines_relative=False, ax=ax[0])

ax[1] = vis.sunburst(df, highlight_max=False,

isolines=[0.5, 1],

color='C1', ax=ax[1])

return fig

def trend_stats(df: pd.DataFrame):

author_df = df["author"].value_counts().reset_index()

author_df.rename(columns={"index": "Author",

"author": "Number of messages"},

inplace=True)

author_df["Total %"] = round(

author_df["Number of messages"] * 100 / df.shape[0], 2)

author_df["Talkativeness"] = author_df["Total %"].apply(

lambda x: talkativeness(x, df["author"].nunique()))

df["year"] = df["timestamp"].dt.year

df["month"] = df["timestamp"].dt.month

max_year = df.year.max()

max_month = df.loc[

df.year == max_year].month.max()

df["yearmonth"] = df["year"] * 100 + \

df["month"]

df = df.loc[

df["yearmonth"] <= max_year * 100 + max_month]

temp_df = df.pivot_table(

index=["yearmonth"], columns=["author"],

values=["message"], aggfunc="count", fill_value=0)

temp_df.columns = [col_[1] for col_ in

temp_df.columns]

temp_df = temp_df.reset_index().sort_values(

["yearmonth"])

temp_df.set_index('yearmonth', inplace=True)

author_df["Messaging Trend Last 12 Months"] = author_df[

"Author"].apply(

lambda x: trendline(temp_df.tail(12)[x]))

author_df["Messaging Trend Last 6 Months"] = author_df[

"Author"].apply(

lambda x: trendline(temp_df.tail(6)[x]))

author_df["Messaging Trend Last 3 Months"] = author_df[

"Author"].apply(

lambda x: trendline(temp_df.tail(3)[x]))

return author_df

def word_stats(df: pd.DataFrame):

words_lst = (

''.join(df["message"].values.astype(str))).split(' ')

words_lst = [i for i in words_lst if len(i) > 3]

df = pd.DataFrame.from_dict(Counter(words_lst),

orient='index',

columns=[

"count"]).reset_index().rename(

columns={'index': 'word'})

df.sort_values('count', ascending=False,

inplace=True, ignore_index=True)

df[""] = df["count"].apply(

lambda x: percent_helper(x / df.shape[0]))

return df
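
# trendline: fits a degree-1 polynomial (least squares) to the monthly
# message counts and reports whether the slope is increasing or decreasing.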

def trendline(df: pd.DataFrame, order=1):

index = range(0, len(df))

coeffs = np.polyfit(index, list(df), order)

slope = coeffs[-2]

if slope > 0:

return "Increasing (" + str(round(slope, 2)) + ")"

else:

return "Decreasing (" + str(round(slope, 2)) + ")"

def talkativeness(percent_message, total_authors):

mean = 100 / total_authors

threshold = mean * .25

if percent_message > (mean + threshold):

return "Very talkative"

elif percent_message < (mean - threshold):

return "Quiet, untalkative"

else:

return "Moderately talkative"

def extract_emojis(s):

# Note: this helper assumes `import emoji`; emoji.UNICODE_EMOJI exists in emoji < 2.0.

return ''.join(c for c in s if c in emoji.UNICODE_EMOJI)

# Returns smallest integer k such that k

# * str becomes natural. str is an input floating point number #

def gcd(a, b):

if b == 0:

return a

return gcd(b, a % b)

def findnum(str):

# Find size of string representing a

# floating point number.

n = len(str)

# Below is used to find denominator in

# fraction form.

count_after_dot = 0

# Used to find value of count_after_dot

dot_seen = 0

# To find numerator in fraction form of

# given number. For example, for 30.25,

# numerator would be 3025.

num = 0

for i in range(n):

if str[i] != '.':

num = num * 10 + int(str[i])

if dot_seen == 1:

count_after_dot += 1

else:

dot_seen = 1

# If there was no dot, then number

# is already a natural.

if dot_seen == 0:

return 1

# Find denominator in fraction form. For example,

# for 30.25, denominator is 100

dem = int(math.pow(10, count_after_dot))

# Result is denominator divided by

# GCD-of-numerator-and-denominator. For example, for

# 30.25, result is 100 / GCD(3025,100) = 100/25 = 4

return dem / gcd(num, dem)

def percent_helper(percent):

percent = math.floor(percent * 100) / 100

if percent > 0.01:

ans = findnum(str(percent))

return "{} out of {} messages".format(int(percent * ans), int(1 * ans))

else:

return "<1 out of 100 messages"

Chapter 7: Results

Pic 7.1 – Pic 7.13: result screenshots of the Chat Insight application (images not
reproduced in this text version).
Chapter 8: Future Scope

8.1 The future scope of the Chat Insight project is promising, with several potential avenues
for expansion and enhancement. Some directions for further development are outlined below:

8.1.1. Real-time Analysis:


Implementing real-time analysis capabilities would enable the chat analyzer to process and
analyze chat data as it streams in, providing immediate insights and allowing for proactive
responses to emerging trends or issues.

8.1.2. Multimodal Analysis:


Integrating support for multimodal data, such as analyzing text alongside images, videos, or
voice recordings, would enable a more comprehensive understanding of conversations across
different modalities.

8.1.3. Advanced NLP Techniques:


Leveraging advanced natural language processing (NLP) techniques, such as transformer-
based models (e.g., BERT, GPT), would improve the accuracy and sophistication of tasks
such as sentiment analysis, named entity recognition, and topic modeling.
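
As an illustration of how a transformer-based model could be plugged into the sentiment
analysis step, the following is a minimal sketch assuming the Hugging Face transformers
library, which is not currently used in the project; the example messages are made up, and
the pipeline downloads a default English sentiment model on first use.

# Illustrative sketch only: transformer-based sentiment analysis.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default model on first use

messages = [
    "This group is the best, love you all!",
    "Why does nobody ever reply to me?",
]
for msg, result in zip(messages, sentiment(messages)):
    print(msg, "->", result["label"], round(result["score"], 3))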

8.1.4. Emotion Analysis:


Expanding the sentiment analysis module to include emotion analysis would enable the
identification of specific emotions expressed in chat messages, providing deeper insights into
user sentiment and engagement.

8.1.5. Personalization and User Profiling:


Implementing user profiling and personalization features would enable the chat insight to
tailor its analysis and recommendations based on individual user preferences, behavior, and
historical interactions.

8.1.6. Integration with Chatbots and AI Assistants:
Integrating the chat insight with chatbots and AI assistants would enable automated insights
generation and decision-making based on real-time chat analysis, enhancing the efficiency
and effectiveness of chat-based interactions.

8.1.7. Domain-specific Solutions:


Developing domain-specific versions of the chat analyzer tailored to specific industries or
use cases, such as healthcare, finance, e-commerce, or social media, would enable
customized analysis and insights generation tailored to the unique requirements and
challenges of each domain.

8.1.8. Enhanced Visualization and Interaction:


Continuously improving the visualization module with advanced visualization techniques,
interactive features, and support for immersive visualization environments (e.g., virtual
reality, augmented reality) would enhance user engagement and facilitate deeper exploration
and analysis of chat data.

8.1.9. Ethical and Privacy Considerations:


Incorporating robust ethical and privacy considerations into the design and development of
the chat analyzer to ensure responsible handling of sensitive chat data and compliance with
relevant regulations and guidelines.

8.1.10. Scalability and Performance Optimization:


Optimizing the scalability and performance of the chat analyzer to handle large volumes of
chat data efficiently, support concurrent user interactions, and maintain responsiveness under
heavy loads or spikes in demand.

By exploring these future scopes, the chat insight project can continue to evolve and adapt to
meet the growing demands and challenges of analyzing chat data in diverse contexts and
domains.

Chapter 9: Conclusion

9.1 In conclusion, the Chat Insight project offers a comprehensive solution for analyzing and
extracting insights from textual conversations across various chat platforms. Through the
integration of advanced natural language processing techniques, machine learning algorithms,
and interactive visualization tools, Chat Insight empowers users to gain a deeper
understanding of chat data and derive actionable insights for a wide range of applications.

The project addresses key challenges in chat data analysis, including sentiment analysis,
named entity recognition, topic modeling, and visualization, providing users with a holistic
view of conversation dynamics, sentiment trends, and thematic content. By leveraging the
capabilities of the chat analyzer, businesses can enhance customer satisfaction, optimize
communication processes, and make data-driven decisions to drive growth and innovation.

9.2 Moving forward, the Chat Insight project has promising directions for further
development, including real-time analysis, multimodal support, advanced NLP techniques,
and domain-specific solutions. By continuously evolving to meet the changing needs and
challenges of analyzing chat data, the Chat Insight project remains at the forefront of
enabling organizations to unlock the full potential of their textual conversations and drive
meaningful outcomes in the digital age.

Chapter 10: References

https://github.com/joweich/chat-miner

https://dvatvani.github.io/whatsapp-analysis.html

Thanks to chat-miner for the easy WhatsApp parsing tool and their awesome charts.

Thanks to Dinesh Vatvani for his great analysis.
