BI Case Study 3
Ismail Sahin
Meshack Oniera
Rakshini Prabu
Contents
Summary of the case:
Deliverables
Step 1: Text Analysis Application for Business:
Step 2: Data Collection Strategy
Step 3: Data Storage Strategy
Step 4: Text Corpus Construction in Python
Step 5: Discussion of Challenges
Results and Key Findings:
Conclusion:
Deliverables
Step 1: Text Analysis Application for Business:
● Data Source: User-generated content (reviews, forum posts, expert commentary)
pertaining to the iPhone 16.
● Business Value: Analyzing customer feedback provides direct insights into
customer perceptions, preferences, and pain points. This informs data-driven
decisions to improve products/services, enhance customer satisfaction, and
drive competitive advantage.
● Example: Identification of prevalent complaints regarding battery life serves as a
clear indicator for improvements in future iPhone models.
Step 3: Data Storage Strategy
1. Data Structure: The CSV file stores the data in a tabular format. Each row
represents a single review, and the columns represent different attributes of the
review and its analysis. Based on the notebook, the columns include:
● "Review #": The index or number of the review.
● "Review": The actual text of the iPhone 16 review.
● "Sentiment": The overall sentiment classification assigned to the review
(e.g., "Positive", "Negative", or "Neutral").
● "Score": The compound sentiment score generated by the VADER
sentiment analyzer (a normalized value from -1, most negative, to +1,
most positive).
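The relationship between the "Score" and "Sentiment" columns can be sketched as a simple threshold rule. This is only an illustration: the function name is hypothetical, and the ±0.05 cut-offs are the ones conventionally recommended for VADER compound scores, assumed here because the notebook's exact thresholds are not stated in this report.

```python
def label_from_compound(score: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is an assumption (the conventional VADER cut-off).
    """
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Illustrative scores only, not values from the project's dataset.
for example_score in (0.64, -0.30, 0.01):
    print(example_score, label_from_compound(example_score))
```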
2. Implementation: The pandas library is used to create a DataFrame from the
sentiment analysis results, and the DataFrame is then saved to a CSV file with
the to_csv() function. The index=False argument prevents pandas from writing
the DataFrame index as a separate column in the CSV.
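As a rough, dependency-free sketch of the resulting file layout, the snippet below writes the same four columns with Python's standard csv module instead of pandas. The helper name, file name, and sample rows are placeholders invented for illustration, not the project's code or data.

```python
import csv
import os
import tempfile

COLUMNS = ["Review #", "Review", "Sentiment", "Score"]


def save_results(rows, path):
    """Write one CSV row per review, with a header and no extra index column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows for illustration; not reviews from the project.
sample_rows = [
    {"Review #": 1, "Review": "Placeholder positive review",
     "Sentiment": "Positive", "Score": 0.64},
    {"Review #": 2, "Review": "Placeholder negative review, with a comma",
     "Sentiment": "Negative", "Score": -0.3},
]

output_path = os.path.join(tempfile.mkdtemp(), "sentiment_results.csv")
save_results(sample_rows, output_path)

# Read the file back to inspect what was stored.
with open(output_path, encoding="utf-8") as handle:
    saved_text = handle.read()
print(saved_text)
```

With pandas, an equivalent file would come from pd.DataFrame(sample_rows).to_csv(output_path, index=False); without index=False, an additional unnamed index column would appear before "Review #".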
3. Rationale: As previously stated, the reasons for choosing CSV in this context are:
● Simplicity: CSV is a very simple file format, making it easy to understand,
create, and parse.
● Portability: CSV files can be opened and viewed in almost any spreadsheet
program (Excel, Google Sheets, etc.) or text editor.
● Ease of Use with Pandas: pandas provides excellent support for reading
and writing CSV files, making it a natural choice for data analysis
workflows.
● Suitable for Tabular Data: The sentiment analysis results are inherently
tabular, making CSV a good fit.
4. Limitations: CSV also has drawbacks that matter as the dataset grows:
● Querying: CSV files are not efficient for complex queries. We would need to
load the entire file into memory and then perform filtering or searching.
● Concurrency: CSV files are not designed for concurrent access by multiple
users or processes.
● Data Integrity: CSV files do not enforce data types or constraints.
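The querying and data-integrity points can be seen in a short sketch. The CSV text below is placeholder data, and parse_score is a hypothetical helper; the aim is only to show that values come back as untyped strings and that a "query" amounts to scanning every row.

```python
import csv
import io

# Placeholder CSV content for illustration; not the project's data.
csv_text = (
    "Review #,Review,Sentiment,Score\n"
    "1,Placeholder review A,Positive,0.64\n"
    "2,Placeholder review B,Negative,-0.30\n"
    "3,Placeholder review C,Neutral,not-a-number\n"
)

# The whole file is parsed up front before anything can be filtered.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value is a string: the format itself enforces no column types.
all_strings = all(isinstance(row["Score"], str) for row in rows)


def parse_score(text):
    """Return the score as a float, or None when the cell is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


# Filtering is a scan over every row, with validation done by hand.
negative_ids = [
    row["Review #"]
    for row in rows
    if (score := parse_score(row["Score"])) is not None and score < 0
]
invalid_ids = [row["Review #"] for row in rows if parse_score(row["Score"]) is None]
print(all_strings, negative_ids, invalid_ids)
```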
Step 5: Discussion of Challenges
The team encountered several challenges during the text analysis process, which
influenced the project's efficiency, accuracy, and scalability.
1. Limitations in Manual Data Collection
● Challenge: Reviews were collected by hand, without APIs or web scraping
tools, which was labor-intensive and risked a biased sample.
● Proposed Solution: Employ web scraping tools (e.g., BeautifulSoup) or
leverage APIs (e.g., Reddit API, Google Search API) to automate data
collection, expand data volume, and ensure a more representative dataset.
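A minimal sketch of automated extraction is shown below. It uses Python's built-in html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The markup (review text inside p elements with a "review" class) and the class name ReviewExtractor are assumptions made for illustration; real review pages will differ, and fetching them is subject to each site's terms of use.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collect the text of every <p class="review"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._inside_review = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review" in classes:
            self._inside_review = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_review:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_review:
            # Join the collected fragments and normalize whitespace.
            self.reviews.append(" ".join("".join(self._buffer).split()))
            self._inside_review = False


# Placeholder page content; not taken from any real site.
sample_html = """
<div>
  <p class="review">Placeholder review one.</p>
  <p class="ad">Sponsored content</p>
  <p class="review featured">Placeholder <b>review</b> two.</p>
</div>
"""

extractor = ReviewExtractor()
extractor.feed(sample_html)
print(extractor.reviews)
```

With BeautifulSoup, the same selection could be written as soup.find_all("p", class_="review"), which is usually more robust against malformed HTML.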
2. Presence of Data Noise and Inconsistencies
● Challenge: Non-standard expressions, spelling errors, emojis, and
inconsistent formatting compromised analysis quality.
● Proposed Solution: Implement advanced text-cleaning functions, such as
emoji removal and typo correction with libraries like TextBlob, to enhance
data quality further.
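One possible shape for such a cleaning function is sketched below using only the standard library. The emoji pattern is a rough assumption that covers the main pictograph blocks rather than every emoji, and clean_review is a hypothetical name. Typo correction (for example with TextBlob) is not shown.

```python
import re

# Rough emoji coverage (an assumption, not exhaustive): pictographs, flags,
# miscellaneous symbols and dingbats, plus the variation selector and
# zero-width joiner that often accompany emoji.
EMOJI_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]"
)
WHITESPACE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Remove emoji and collapse runs of whitespace into single spaces."""
    without_emoji = EMOJI_PATTERN.sub(" ", text)
    return WHITESPACE.sub(" ", without_emoji).strip()


# "\U0001F60D" and "\U0001F525" are the heart-eyes and fire emoji.
raw_review = "Battery   is AMAZING!!! \U0001F60D\U0001F525\n"
print(clean_review(raw_review))
```

Punctuation and capitalization are deliberately left untouched here, since VADER uses both as intensity cues.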
3. Ambiguity and Complexity in Natural Language
● Challenge: Ambiguity in natural language, sarcasm, and mixed sentiments
complicated the interpretation of results from the VADER sentiment
analyzer.
● Proposed Solution: Combine VADER with other NLP models like TextBlob
or utilize transformer-based models like BERT (especially for larger
datasets) to enhance the accuracy of sentiment detection.
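One simple way to combine two analyzers is to average their polarity scores and flag reviews where they disagree strongly, since disagreement often points to sarcasm or mixed sentiment. The sketch below assumes both scores are already on a -1 to +1 scale (as VADER's compound score and TextBlob's polarity are); the function name and the 0.5 tolerance are arbitrary choices for illustration.

```python
def combine_scores(vader_score: float, second_score: float, tolerance: float = 0.5):
    """Average two polarity scores in [-1, 1] and flag strong disagreement.

    Returns (combined_score, needs_manual_review). The tolerance is an
    illustrative assumption, not a tuned value.
    """
    combined = (vader_score + second_score) / 2
    needs_manual_review = abs(vader_score - second_score) > tolerance
    return combined, needs_manual_review


# Illustrative scores only, not outputs from the project's dataset.
print(combine_scores(0.6, 0.4))   # analyzers roughly agree
print(combine_scores(0.7, -0.5))  # analyzers disagree: candidate for manual review
```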
4. Limited Size of the Dataset
● Challenge: The small dataset (15 reviews) limited the generalizability of
sentiment findings and increased susceptibility to outliers.
● Proposed Solution: Automate data collection to increase the dataset to
hundreds or thousands of reviews, enabling more thorough statistical
analysis and trend visualization.
5. Constraints in Storage and Data Management
● Challenge: CSV files are poorly suited to larger or real-time data and offer
limited support for querying and simultaneous access.
● Proposed Solution: Adopt a relational database such as SQLite or a NoSQL
option such as MongoDB to facilitate efficient storage, indexing, and
querying of unstructured text data for larger-scale analyses.
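As a sketch of the SQLite option, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, index name, and sample rows are hypothetical choices that mirror the CSV columns; they are not the project's schema or data.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory database for the sketch

# Constraints reject invalid rows, which CSV cannot do.
connection.execute(
    """
    CREATE TABLE reviews (
        review_id INTEGER PRIMARY KEY,
        review    TEXT NOT NULL,
        sentiment TEXT NOT NULL
                  CHECK (sentiment IN ('Positive', 'Negative', 'Neutral')),
        score     REAL NOT NULL CHECK (score BETWEEN -1.0 AND 1.0)
    )
    """
)
# An index lets sentiment filters avoid scanning the whole table.
connection.execute("CREATE INDEX idx_reviews_sentiment ON reviews (sentiment)")

# Placeholder rows for illustration; not reviews from the project.
sample_rows = [
    (1, "Placeholder review A", "Positive", 0.64),
    (2, "Placeholder review B", "Negative", -0.30),
    (3, "Placeholder review C", "Negative", -0.55),
]
connection.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", sample_rows)
connection.commit()

# Query negative reviews, most negative first.
negative_ids = [
    row[0]
    for row in connection.execute(
        "SELECT review_id FROM reviews WHERE sentiment = ? ORDER BY score",
        ("Negative",),
    )
]

# An out-of-range score violates the CHECK constraint and is rejected.
try:
    connection.execute("INSERT INTO reviews VALUES (4, 'Bad row', 'Negative', 7.5)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

print(negative_ids, constraint_enforced)
```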
4. Proposed Solutions:
● Automation of data collection using web scraping or APIs to increase
dataset size and diversity.
● Use of advanced NLP models (e.g., BERT) for more nuanced sentiment
analysis.
● Transition to structured databases (e.g., SQLite, MongoDB) for efficient
storage and querying of larger datasets.
Conclusion:
The text analysis project successfully demonstrated the value of customer feedback in
identifying areas for improvement and guiding product development for the iPhone 16.
While the sentiment analysis provided actionable insights into customer perceptions,
scalability and accuracy challenges were evident due to the manual data collection
process and limited dataset size. Future iterations should leverage automation tools
and advanced NLP techniques to enhance efficiency and analytical depth, ensuring
more comprehensive and reliable insights for business decision-making.