
Private Higher School of Information Technologies and Management of Nabeul

Internship Report

Presented in order to obtain the degree of

COMPUTER ENGINEER

SPECIALITY: BI

Elaborated by

Ben Hadj Kacem Souha

Design and Development of a Predictive Decision Support System for


B2B Lead Generation

Hosting Company

SATORIPOP

Supervised by

Academic Supervisor(s) Professional Supervisor(s)


Dr. Amor Messaoud Mr. Zied Machkena

Academic Year
2022 – 2023
Dedications

In heartfelt dedication to this endeavor:

To my beloved Mother, Father, and Brother, whose unending belief,


encouragement, and sacrifices have paved the way for my accomplishments. This
work stands as a tribute to your unwavering faith in me.

To the cherished memory of my Grandparents, who left our side a few months
before my graduation, you have left an indelible mark on my heart. Your wisdom
and love continue to prevail, even in your absence.

To the unwavering spirits of my dearest friends, Sabrine Bettaib, El Horry
Slim, Sourour Saied, and Soumaya Ben Mhamed. Your presence and unwavering
support have illuminated my journey and transformed challenges into opportunities
for growth, and have been a constant source of cheer and positivity; may your paths be paved with
even greater triumphs.

To my ever-faithful feline companions, my Cats, whose silent company provided


solace and comfort during countless hours of contemplation.

To the incredible souls who lent their hands and hearts, whether near or far, shaping
moments of camaraderie and shared dedication during the journey to complete this
work. Your warmth and encouragement infused each step with purpose.

With profound gratitude and a heart full of appreciation, I extend my thanks.

Acknowledgments:

I am profoundly grateful to have chosen this page as a space to convey my deep


appreciation to all individuals who played a pivotal role in the success of my
internship and provided their invaluable assistance during the composition of this
project.

We wish to extend our heartfelt gratitude to all those who contributed to the
accomplishment of our final year project and offered their unwavering support.

We are incredibly thankful to Mr. Amor Messaoud, my academic supervisor at


ITBS, for his continuous encouragement, availability, and invaluable support in
achieving our objectives. His guidance and dedication to mentoring have
been of immense value.

Our sincere gratitude also goes to Mr. Khaireddine Fredj, the director of the
company, for welcoming me and providing this opportunity. I would like to extend
my heartfelt thanks to Mr. Zied Machkena, my mentor at Satoripop, for his
insightful guidance that contributed to my continual growth throughout these six
months of internship.

Lastly, I extend a heartfelt thank you to all those who directly or indirectly
contributed to the success of my final year project.

Thank you all for being a part of this journey.

Contents

List of Acronyms 1

General introduction 1

1 Project Context 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Hosting Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Satoripop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Organizational structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Project Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Existing Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.4 Proposed Solution: Smart Lead Generation . . . . . . . . . . . . . . . . . . 6
1.4 Methodological Study and Planification . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Comparative Analysis of Project Management Approaches . . . . . . . . . . 6
1.4.2 Chosen Approach: Agile . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Project Management Frameworks . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.4 Adopted Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4.1 SCRUM Framework . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4.2 CRISP-ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.4.3 GIMSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Planification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 Project Planification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 State of the art 15


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Marketing Lead Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Lead Generation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Leads Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 WebScraping Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.2 Webscraping Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.3 The Role of Web Scraping in Lead Generation . . . . . . . . . . . . . . . . 18
2.5 Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 BI Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Decisional System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 Multidimensional Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.4 Data Warehouse Conception Approaches . . . . . . . . . . . . . . . . . . . 21
2.5.4.1 The Bottom-Up Approach . . . . . . . . . . . . . . . . . . . . . . 21
2.5.4.2 The Top-Down Approach . . . . . . . . . . . . . . . . . . . . . . 21
2.5.4.3 The Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.5 Comparative Study Between Data Integration Processes . . . . . . . . . . . 22
2.5.6 Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.7 OLAP Tabular vs Multidimensional Models . . . . . . . . . . . . . . . . . 24
2.5.8 The Role of Business Intelligence in B2B Lead Generation marketing . . . . 25
2.6 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1 Types of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . 27
2.7.1.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . 28
2.7.1.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 29
2.7.2 The Role of Machine Learning in The Marketing Field . . . . . . . . . . . . 30
2.7.3 Unsupervised Learning for Segmentation Problem . . . . . . . . . . . . . . 31
2.7.4 Similar Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Preliminary Analysis 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.1 Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.2 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.3 Non Functional Requirements . . . . . . . . . . . . . . . . . . . 37
3.2.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2.1 Use Case: User . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2.2 Use Case: Admin . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Project management with SCRUM . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Team and roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Product backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Release planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Architecture: MVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4.1 Model Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 View Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Template Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Framework: Django . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Client-side Interaction: Javascript . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 Front-end Design: Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.4 Integrated Development Environment (IDE) and Version Control . . . . . . . 45
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 First Release 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Presentation of Release 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Sprint 1.1: Account Creation and Admin Panel . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Sprint 1.1 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Increment of Sprint 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2.1 Registration and email verification . . . . . . . . . . . . . . . . . 47
4.3.2.2 Administrator Login Interface and Home page . . . . . . . . . . . 50
4.3.2.3 Users Management Interface . . . . . . . . . . . . . . . . . . . . 51
4.3.2.4 Group Management . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Sprint 1.2: Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.1 Sprint 1.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Related research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.3 Sources Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.4 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5.1 Linkedin Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5.2 Company employees . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.5.3 Website Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.6 Increment of Sprint 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.6.1 Leads Listing and Search Interface . . . . . . . . . . . . . . . . . 62
4.4.6.2 Leads Management . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 Second Release 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Presentation of Release 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Backlog of sprint 2.1: Prospects Segmentation . . . . . . . . . . . . . . . . . . . . . 65
5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.1 Data Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.2.1 Python for Machine Learning . . . . . . . . . . . . . . . . . . . . 70

5.4.2.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3.1 Further Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.3 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . 72
5.4.4 Agglomerative Hierarchical Clustering (AHC) . . . . . . . . . . . . . . . . 75
5.4.4.1 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.4.2 Evaluation and Profiles . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.5 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.5.1 Number of Clusters: Elbow Method . . . . . . . . . . . . . . . . 80
5.4.5.2 Evaluation and Profiles . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Increment of sprint 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 Sprint 2.2: Digital Maturity Assessment . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.1 Sprint 2.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2 Sprint 2.2 Increment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2.1 User-side Assessments . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2.2 Admin-side Assessment Management . . . . . . . . . . . . . . . 85
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6 Third Release 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Presentation of Release 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Sprint 3.1 Backlog: BI Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 BI Solution Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Multidimensional Conception of BI Solution . . . . . . . . . . . . . . . . . . . . . . 88
6.5.1 Global DW Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.2 Dimensions Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.3 Conception of Network Datamart . . . . . . . . . . . . . . . . . . . . . . . 91
6.5.4 Conception of Digital Datamart . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5.5 Conception of Company Datamart . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.6.1 KPI Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.6.2 Development of Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6.2.1 Mock-up scenario: ”Digital Exploration” . . . . . . . . . . . . . 95
6.6.2.2 Mock-up Scenario: ”Enterprise Exploration” . . . . . . . . . . . 95
6.6.2.3 Mock-up Scenario: ”Network Exploration” . . . . . . . . . . . . 95
6.7 Development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.7.1 Comparative Evaluation: Integration Software . . . . . . . . . . . . . . . . . 96
6.7.2 Comparative Evaluation: Reporting Software . . . . . . . . . . . . . . . . . 97
6.7.3 Ecosystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7.3.1 SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7.3.2 SSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.7.3.3 SQL Server Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.3.4 SSAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.3.5 PowerBI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8 Integration Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8.1 ETL: Extraction, Transformation, Loading . . . . . . . . . . . . . . . . . . 99
6.8.2 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.3 Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.4 Load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.8.5 Deployment and Configuration of SSIS Package . . . . . . . . . . . . . . . 104
6.8.5.1 SSIS package deployment . . . . . . . . . . . . . . . . . . . . . . 104
6.8.5.2 Jobs Planning . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.9 Analysis Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.1 OLAP Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.2 Deployment and Configuration of OLAP Cubes . . . . . . . . . . . . . . . . 109
6.10 Reporting Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.10.1 Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.10.2 Market Analysis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

General Conclusion 114

Bibliography 117

A Evaluation Metrics 120


A.1 The Silhouette Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.2 The Calinski-Harabasz Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

B Optimization Methods 121


B.0.1 Label Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.0.2 StandardScaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

List of Figures

1.1 Logo of the Hosting company: SATORIPOP . . . . . . . . . . . . . . . . . . . . . . 3


1.2 Project lifecycle in CRISP-ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Gantt chart of the project planification . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1 Business Intelligence project Architecture . . . . . . . . . . . . . . . . . . . . . . . 19


2.2 Star Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Snowflake Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Galaxy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Bottom-up approach by Ralph Kimball . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Top-Down approach by Bill Inmon . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 The Hybrid approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Machine learning representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Supervised Learning representation . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 Unsupervised Learning representation . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.11 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Machine Learning differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.1 Use-case diagram for user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


3.2 Use-case diagram for admin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 MVT Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Framework: Django . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 Javascript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.7 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 VS code and Github . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.1 Interface Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


4.2 Email verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Login interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Interface Request a Reset Password . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Interface Password reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Interface: Admin login . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Interface: Admin home page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 Interface: user Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.9 Interface: Added user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.10 Interface: Add - Update user groups . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Interface: update - remove user permissions . . . . . . . . . . . . . . . . . . . . . . 52
4.12 Interface: Update - Delete user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.13 Interface: Add Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.14 Interface: Group permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.15 Result: Group added successfully . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.16 Data Harvesting Process/steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.17 DataKund initiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.18 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.19 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.20 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.21 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.22 Scraping Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.23 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.24 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.25 listing interface in Home Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.26 Interface: Add lead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.27 Result: Lead added successfully . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.1 Anaconda logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


5.2 Jupyter Notebook logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Checking for outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4 Correlation matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 data distribution (industry, location, and size) . . . . . . . . . . . . . . . . . . . . . 73
5.6 data distribution (records, responsiveness, and contact) . . . . . . . . . . . . . . . . 73
5.7 data distribution (SSL/TLS, analytics, and Expiry status) . . . . . . . . . . . . . . . 74
5.8 Dataframe after label encoder transformation . . . . . . . . . . . . . . . . . . . . . 74
5.9 Dataframe after StandardScaler transformation . . . . . . . . . . . . . . . . . . . . 75
5.10 Dendrogram of Ward Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.11 Dendrogram of Complete Linkage . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.12 Dendrogram of Average Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.13 Silhouette analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.14 Chosen Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.15 the resulting elbow method for the Dataset . . . . . . . . . . . . . . . . . . . . . . . 81
5.16 Kmeans Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.17 PowerBI dashboard for Lead segmentation result . . . . . . . . . . . . . . . . . . . 84
5.18 Digital Maturity Assessment Form Interface . . . . . . . . . . . . . . . . . . . . . . 85
5.19 Interface for Digital Maturity Prediction Results and History . . . . . . . . . . . . . 85
5.20 Conducting a Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.21 Prediction History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.1 BI Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Global DW Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.3 Conception of Network Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Conception of Digital Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5 Conception of Company Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Scenario Prototype ”Digital Exploration” . . . . . . . . . . . . . . . . . . . . . . . 95
6.7 Scenario Prototype ”Company Exploration” . . . . . . . . . . . . . . . . . . . . . . 95
6.8 Scenario Prototype ”Network Exploration” . . . . . . . . . . . . . . . . . . . . . . 96
6.9 SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.10 SSIS software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.11 SSAS software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.12 PowerBI software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.13 The Loading of the Staging area . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.14 The Loading of the ODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.15 Transformation in table Company . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.16 The loading of the Data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.17 Fact Digital implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.18 SSIS package deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.19 SQL Agent: job steps configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.20 SQL Agent: job planning configuration . . . . . . . . . . . . . . . . . . . . . . . . 106
6.21 SQL Agent: job Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.22 SQL Agent: job Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.23 Digital analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.24 Network analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.25 Company Analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.26 Cubes deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.27 Dashboard Company analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.28 Dashboard Digital analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.29 Dashboard Website analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.30 Dashboard Interface for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

List of Tables

1.1 Agile vs Classic Project management approaches . . . . . . . . . . . . . . . . . . . 8

2.1 Comparative Analysis ETL vs ELT . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.2 Comparative Analysis of OLAP Models Tabular VS Multidimensional . . . . . . . . 25

3.1 Product Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


3.1 Product Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Releases and Their Execution Times . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.1 Sprint 1.1 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


4.2 Sprint 1.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.1 sprint 2.1 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


5.2 Count analytics Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Count contact Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Verification Date Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5 Responsive Or Not Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 SSL TLS Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.7 Expiry Status Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.8 Count records Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.9 Count analytics Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.10 Count contact Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.11 Verification Date Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.12 Expiry Status Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.13 ResponsiveOrNot Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.14 SSL TLS Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.15 Count records Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.16 Sprint 2.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.1 Backlog of sprint 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87


6.2 Dimensions Identification for DW . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 KPI of Fact Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 KPI of Fact Digital . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.5 KPI of Fact Company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6 SSIS vs Talend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

List of Acronyms

AHC Agglomerative Hierarchical Clustering


API Application Programming Interface
BI Business Intelligence
CRISP-ML Cross Industry Standard Process for Machine Learning
CSV Comma Separated Values
DAX Data Analysis Expression
DMARC Domain-based Message Authentication, Reporting, and Conformance
DWH Data Warehouse
ELT Extract-Load-Transform
GAID Google Analytics ID
GIMSI Generalization Information System Initiative
HTTP HyperText Transfer Protocol
AI Artificial Intelligence
IP Internet Protocol
KPI Key Performance Indicator
MDX Multidimensional Expressions
ML Machine Learning
ODS Operational Data Store
OLAP Online Analytical Processing
SCD Slowly Changing Dimension
SPF Sender Policy Framework
SQL Structured Query Language
SSAS SQL Server Analysis Services
SSIS SQL Server Integration Services
SSL Secure Sockets Layer
STG Staging Area
TLS Transport Layer Security
UML Unified Modeling Language

General Introduction

The realm of decision-based computing encompasses a wide array of tools, applications, and
methodologies enabling organizations to gather data from various sources. This data is then
prepared for analysis to generate reports, dashboards, and machine-learning models, making analysis
accessible to decision-makers and operational staff.

Currently, enterprises employ BI software to extract valuable insights from their extensive
data repositories. Such tools facilitate the extraction of information like competitive intelligence,
market trends, performance tracking, and reasons behind missed opportunities. Typically rooted in
historical data analysis, Business Intelligence now even incorporates machine learning for predictive
capabilities.

In this context, our culminating project, titled ”B2B lead generation and evaluation system”
at Satoripop, aims to implement a decision-support solution within the industrial sector, helping
decision-makers visualize the market and evaluate digital maturity.

Our report is structured into six chapters:

The primary focus is Business Understanding, where our first chapter provides an overall
comprehension of our project framework. It includes an introduction to the core business, a clear
delineation of the project scope, an identification of challenges, and a presentation of the envisioned
solution. Additionally, we elaborate on the development methodology and offer insights into our
project plan.
In the second chapter, we delve into various lead generation methods in marketing, lead assessment,
data collection, business intelligence, and the role of machine learning.
The third chapter addresses preliminary design, needs, stakeholders, and requirements, and introduces
SCRUM methodologies, architecture, and the development environment.
Moving on, the fourth chapter describes the first project release, covering account creation, the
administrator dashboard, and data collection.
In the fifth chapter, we present the second release, with a focus on prospect segmentation and the
implementation of company assessment.
Finally, the sixth chapter introduces the third release and its final sprint, which centers on business
intelligence: data warehouse design, performance indicators, and the integration, analysis, and
reporting phases.

Chapter 1

Project Context

1.1 Introduction
The initial chapter serves as an introduction to our project’s background. To begin with, we will
introduce the hosting organization, followed by an exploration of the issues at hand, the overall
framework, and project goals. Subsequently, we will delve into the proposed solution, outlining
the methodology, and providing an overview of the project’s timeline.

1.2 The Hosting Organization


1.2.1 Satoripop

Figure 1.1: Logo of the Hosting company: SATORIPOP

Satoripop (figure 1.1 presents its logo) is a software development company with a dedicated
creative team that provides software solutions tackling diverse industry challenges. It offers:

• AI Services, Mobile, and Web Development

• Cloud Computing

• UX/UI Design

• Digital Marketing

They also offer products utilizing cutting-edge technologies, such as:


• QuickText: A personalized AI-based Chatbot for the hospitality industry.

• VodFlow: A video streaming and management platform hosted on Azure.

• Swieq: An AI solution for customer behavior prediction for retailers and e-commerce
platforms.

• Coachingfoot: An online football game providing real-time accurate information for


entertainment and product promotion.

1.2.2 Organizational structure


The company was founded in 2014 by the current CEO, Khaireddine Fredj. Since then, it has grown
from a small group of engineers working from home to a team of 500 employees spread across Tunisia,
where the headquarters is located, and two subsidiaries in France and the United States.
The company’s organizational structure is notably innovative compared to traditional hierarchies.
While it has a CEO, a Chief Financial Officer, a Chief Technology Officer, and a Chief Human
Resources Officer, Satoripop adopts a departmental structure. It is divided into four main departments:

1. Retail

2. FSI (Financial Services Industry)

3. SMB (Small and Medium Businesses)

4. Cloud

In each of these departments, you will find employees with specific skills and expertise tailored
to their respective roles. For instance, there may be financial analysts within the FSI department and
administrative staff supporting various company activities. Additionally, each department may have a
dedicated project manager overseeing their operations and ensuring smooth coordination.

1.3 Project Presentation


1.3.1 Objective
The primary mission involves developing the engine of this tool, which automatically generates
data from online information. This development relies on the Python programming language,
along with Machine Learning and Business Intelligence techniques. With the completion of this
work, the tool can now retrieve, structure, and classify prospects. Our objective is to provide high-
quality data on targeted B2B prospects and analyze it to offer insights and visualizations for
decision-makers, enabling them to evaluate the prospects effectively.

1.3.2 Problem
Every company runs outbound sales and marketing campaigns in which prospects are essential,
as prospecting constitutes one of the primary tasks of any business. Currently, the online advertising
market is witnessing a significant boom in lead generation, leading to soaring demand for lead
management services. Unfortunately, 61% of B2B marketers consider generating high-quality
prospects one of their major challenges.

1.3.3 Existing Study


Wappalyzer Lead Generation [22]: This popular tool caters to web developers, marketers, and
researchers who seek insights into the technological stacks used by various websites. It can also be
employed to identify security vulnerabilities and compare technology stacks used by competitors.

• Advantages:

1. Precision: Wappalyzer is highly accurate in discovering technologies used on a website,


providing precise information.
2. User-Friendly: The user interface of Wappalyzer is simple and easy to use. Simply enter
a website’s URL to obtain information about the technologies used.
3. Availability: Wappalyzer is available as a browser extension for Chrome, Firefox, and
Opera, as well as a library for developers.

• Disadvantages:

1. Limitations of the Free Version: The free version of Wappalyzer has certain query
limitations. Regular usage may require upgrading to the paid version.
2. Limited Precision for Some Tools: Wappalyzer may not detect certain newer or less
popular technologies used on a website.
3. Data Privacy: As Wappalyzer collects information about the technologies used on a
website, data privacy concerns may arise. However, it is worth noting that Wappalyzer
does not collect personal data and does not store the collected information.

BuiltWith [20]: This lead generation tool identifies the technologies used by websites and provides
a list of websites that utilize specific technology, along with other data points such as contact
information and social media profiles.

• Advantages:

1. Detailed Technology Information: BuiltWith offers detailed information about the


technology being used, including software versions.
2. Comprehensive Data: It provides extensive data on e-commerce stores, advertising,
analytics, and other website technologies.

• Disadvantages:

1. Inconsistent Data Accuracy: Some users report that the accuracy of the provided data can
be inconsistent.

Datanyze [21]: This business intelligence platform offers a range of lead generation and market
analysis tools. It includes a technology tracking feature that identifies the technologies used by
websites, along with other data points such as funding, employee count, etc.

• Advantages:

1. Wide Range of Data Points: Datanyze offers an array of data points beyond technology
usage, including firmographics, funding data, and employee count.
2. Comprehensive Search Filters: It provides users with a comprehensive set of filters and
searches.

• Disadvantages:

1. The user interface can be somewhat complex and challenging to navigate.


2. The price is relatively expensive compared to other lead generation tools.

1.3.4 Proposed Solution: Smart Lead Generation


Our proposed solution, Smart Lead Generation, is a system that generates B2B prospects
and evaluates them using Machine Learning techniques. It also provides analytical reports and
visualizations to assist the Sales/Marketing team in their decision-making process.

1.4 Methodological Study and Planification


In the current landscape, data science holds a prominent position within organizations, leading
to collaborative efforts among teams to extract valuable insights from data, emphasizing teamwork
over individual work. Despite this trend, there remains a lack of comprehensive knowledge about the
actual dynamics of collaboration in practical settings.

1.4.1 Comparative Analysis of Project Management Approaches


During the implementation of a solution, it is crucial to monitor and manage the project to ensure
its smooth progress and achieve the expected outcomes. Based on a comparative analysis of classic
and agile approaches [23], we found the following results:

• Orientation of the Methodology:

– Classic Approaches: Classic project management methodologies are characterized by a
sequential and linear approach. They emphasize detailed planning from the beginning
of the project and follow a fixed sequence of phases: initiation, planning, execution,
monitoring, and closure.
– Agile Approaches: Agile approaches, like Scrum and Kanban, are iterative and
progressive. They focus on delivering value to the customer through small, frequent
increments rather than attempting to complete the entire project simultaneously. Agile
projects adapt to changes and continuously improve through regular iterations.

• Flexibility and Adaptability:

– Traditional methods are less flexible and adaptable to changes in requirements. Once a
phase is completed, it is difficult to make changes without going back and reworking the
previous phases, which can be time-consuming and costly.
– Agile approaches encourage change and favor adaptability. They prioritize
accommodating changing requirements and stakeholder feedback, making it easier to
make adjustments as development progresses.

• Project Planning:

– Classic methodologies require comprehensive and detailed planning from the start. The
entire project scope, schedule, and resources are defined upfront, and any modifications
require a formal change management process.
– Agile projects focus on planning incrementally. Planning is done progressively, typically
for the next iteration or sprint, allowing for more flexibility as the project advances.

• Delivery:

– Classic methodologies deliver the entire project as a single package at the end. The final
product is often tested and validated only after all development is completed.
– Agile approaches deliver the project through small, frequent iterations. Each iteration
results in a potentially shippable product, ensuring regular feedback and validation from
stakeholders.

• Roles and Responsibilities:

– Roles are often more rigid in classic methodologies, with distinct roles for project
managers, team members, and stakeholders.
– Agile approaches promote self-organizing teams, where members
collectively take responsibility for planning, executing, and delivering the work.
Traditional project management roles may be more flexible in agile environments.

• Communication and Collaboration:

– Communication in classic approaches often follows a formal and top-down structure with
predefined communication channels.

– Agile approaches encourage frequent communication, collaboration, and face-to-face


interactions among team members and stakeholders to foster a shared understanding of
project goals and progress.

• Risk Management:

– In classic approaches, risk management is typically performed early in the project and
is less dynamic throughout the project lifecycle.
– Agile projects integrate risk management throughout the development process, identifying
and addressing risks as they arise during iterations.

Table 1.1 summarises the main differences between traditional approaches and agile approaches.

Aspect        | Agile Approach                                         | Traditional Approach
Objective     | Meet customer expectations                             | Meet initial requirements and commitments
Change        | Open to change                                         | Not open to change
Delivery      | Regular deliveries                                     | One single delivery at the end of the project
Teamwork      | Collaborative work                                     | Each person perceives their own contribution to the work
Communication | Bidirectional and collaborative communication          | Downward communication
Improvement   | Promotes continuous improvement at all project stages  | Adjustments are considered only at the end

Table 1.1: Agile vs Classic Project management approaches

1.4.2 Chosen Approach: Agile


In summary, classical approaches are more suitable for projects with well-defined requirements
and limited changes, while agile approaches are preferred for projects requiring flexibility, frequent
feedback, and continuous adaptation to changing conditions. Each methodology has its strengths and
weaknesses, and the choice of the agile method was based on the specific nature and requirements of
our project.

1.4.3 Project Management Frameworks


There are several methodologies employed in project management, each with its unique strengths and
focus. Some of the popular methodologies include:

• Scrum: an agile project management methodology that emphasizes iterative and incremental
software development. It promotes interdisciplinary collaboration, transparency, and the
continuous delivery of features. Scrum projects are divided into iterations called ”sprints,”
typically lasting 2 to 4 weeks. This methodology encourages close collaboration among team
members, with an ideal Scrum team consisting of around 7 +/- 2 members. Communication
within the Scrum team is informal, fostering seamless information exchange. Scrum also
places a strong emphasis on transparency and visibility of ongoing work through artifacts like
the Scrum board. A key aspect of Scrum is the ongoing involvement of the client, with the
product owner acting on behalf of the client throughout the development process, allowing for
adjustments to priorities and features based on changing needs.

• XP (Extreme Programming): an agile software development methodology that prioritizes code


quality, automated testing, and short development cycles. It encourages close communication
between developers and clients. XP iterations can vary in duration from 1 to 6 weeks, providing
considerable flexibility. XP teams are typically small, with
fewer than 20 members, promoting close developer communication. Team communication
in XP is informal, encouraging tight collaboration. XP places a strong emphasis on code
quality, with practices such as Test-Driven Development (TDD) and pair programming. This
methodology also focuses on short development cycles to quickly deliver value-added features.
The client is involved throughout the development process, allowing for continuous adjustments
to requirements based on feedback and changing needs.

• KANBAN: a visual project management methodology that enables workflow tracking and
task management optimization. It emphasizes visualizing work in progress, limiting work in
progress, and continuous improvement. Unlike Scrum and XP, KANBAN does not rely on
fixed iterations. Work begins immediately after the completion of a previous task, offering
great flexibility. This methodology can be adopted by any competent or multidisciplinary team.
Team communication in KANBAN is often informal and can occur face-to-face. KANBAN
focuses on visualizing the workflow, with a KANBAN board displaying work in progress, tasks
to be done, and those already completed.

1.4.4 Adopted Methodology


1.4.4.1 SCRUM Framework

After studying various agile methodologies to improve client communication, enhance deliverable
visibility, and ensure high product quality, we have chosen to adopt the Scrum framework. Our
decision is motivated by the following reasons:

• Scrum is currently the most widely employed methodology.

• It aligns well with our project type.

Scrum principles are integral to agile software development techniques widely practiced
worldwide. Scrum comprises three core components: roles, artifacts, and timeframes.
Scrum is frequently employed in the software development process to address complex challenges,
consistently demonstrating increased productivity and reduced software development costs [23].
The core principle of the Scrum methodology involves incremental software development by
breaking the project into iterations or ”Sprints.” The goal is to deliver an incremental portion of
the software to the client at the end of each sprint. This methodology relies on iterative development
cycles lasting 2 to 4 weeks, making it conducive to accommodating adjustments compared to other
approaches.
Regarding Scrum roles, there are three primary ones:

• Product Owner: Responsible for managing the product backlog. Collaborates with stakeholders
to understand product needs and requirements, prioritizes backlog items, and defines them in
terms of ”User stories.” Ensures the development team comprehends product requirements and
goals and remains available to address queries.

• Development Team: Tasked with completing selected product backlog items for each sprint.
The team is self-organizing and may comprise developers, testers, designers, and other
professionals involved in product development. They work collectively to achieve sprint goals
and deliver high-quality features.

• Scrum Master: Responsible for implementing the Scrum methodology and ensuring the
development team adheres to the process. Facilitates sprint meetings, aids in resolving
issues and obstacles encountered by the development team, and ensures proper Scrum process
adherence.

Scrum also includes several meetings to maintain communication and collaboration within the
development team and with stakeholders. These meetings include sprint planning, daily Scrum, sprint
review, and sprint retrospective meetings.
Regarding Scrum artifacts, the methodology provides several essential elements, including the
product backlog, sprint backlog, burndown chart, and task board. These contribute to effective
management and visualization of ongoing work and sprint goals.

1.4.4.2 CRISP-ML

CRISP-ML, which stands for Cross-Industry Standard Process for Machine Learning, is a widely
adopted methodology for developing machine learning projects. It provides a well-structured
framework to carry out various tasks and activities throughout the machine learning lifecycle, from
problem understanding to model deployment. One of the primary reasons for embracing CRISP-ML
is its systematic and structured approach, which enhances the chances of success. It offers a clear
roadmap for the project, ensuring that all relevant steps are followed in the correct sequence with
well-defined inputs and outputs.
Another advantage of using CRISP-ML is its established and widely accepted nature, which
means ample support and resources are available for those who use it. This can be especially beneficial
for machine learning novices or complex projects. Overall, the use of CRISP-ML can significantly
increase the likelihood of success in a machine learning project by providing a clear and structured
approach while leveraging the support and resources within the machine learning community.
The CRISP-ML methodology is illustrated in Figure 1.2, taken from [1]; it encompasses the
following key stages:

1. Business and Data Understanding: The development of machine learning applications starts
with identifying the project’s scope, success criteria, and data quality verification to ensure
feasibility. Success criteria, including those related to the market, should be defined with
measurable performance indicators.

2. Data Engineering (Data Preparation): In this phase, data is prepared by selecting relevant
market segments and cleaning them to ensure data quality. Important features for market
segmentation are identified, and data is normalized to avoid errors.

3. Machine Learning Model Engineering: The modeling phase focuses on specifying machine
learning models suitable for market segmentation. Evaluation metrics include the ability to
identify market segments, model robustness, and interpretability.

4. Machine Learning Model Evaluation: After training, models are evaluated for their ability
to segment the market effectively and accurately. Performance, robustness, and interpretability
metrics are used to assess the models.

5. Deployment: Model deployment involves integrating machine learning models into existing
systems to enable real-time segmentation. Deployment approaches vary depending on market
segmentation needs, whether online or batch.

6. Monitoring and Maintenance: Once in production, models are monitored to ensure they
maintain their ability to segment the market accurately. Adjustments are made based on market
changes to ensure the continuous relevance of the segmentation.

Figure 1.2: Project lifecycle in CRISP-ML
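To make these stages more concrete, the following minimal Python sketch maps each CRISP-ML phase onto one step of a lead-segmentation workflow. The file name, feature columns, and retraining threshold are illustrative assumptions, not the project's actual implementation, which is detailed in later chapters.

```python
# Minimal, illustrative walk through the CRISP-ML stages for a lead-segmentation
# task. The file name, column names, and threshold are hypothetical placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import joblib

# 1. Business and data understanding: load the collected leads and verify feasibility.
leads = pd.read_csv("leads.csv")
assert len(leads) > 0, "No data collected; revisit scoping and data quality"

# 2. Data engineering: select segmentation features and normalize them.
features = leads[["count_contact", "count_records", "count_analytics"]].fillna(0)
X = StandardScaler().fit_transform(features)

# 3. Model engineering: fit a candidate clustering model.
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# 4. Model evaluation: a measurable success criterion, here the silhouette score.
score = silhouette_score(X, model.labels_)
print(f"Silhouette score: {score:.3f}")

# 5. Deployment: persist the model so the application can segment new leads.
joblib.dump(model, "segmentation_model.joblib")

# 6. Monitoring and maintenance: re-evaluate periodically and retrain on degradation.
if score < 0.25:  # illustrative threshold
    print("Segmentation quality degraded; schedule retraining.")
```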



CRISP-ML can be effectively combined with agile approaches like Scrum or Kanban. While it
provides a systematic and structured way to handle machine-learning projects, agile methodologies
bring flexibility, collaboration, and iterative delivery. By integrating both approaches, teams can
efficiently manage the complexities of data projects, steadily deliver value, and adapt to evolving
requirements and insights.

1.4.4.3 GIMSI

The GIMSI approach, an agile methodology, centers on users and meaningful insights in business
intelligence. It offers a structured framework for successful dashboard integration projects, focusing
on optimizing performance.
In addition, rigorous research has been conducted on GIMSI. Evolving technology and human
behavior pose challenges to businesses, requiring adaptability and proactive measures. Choosing the
right approach for aligning policies and strategies is complex. Notably, the GIMSI process consists
of 10 defined steps grouped into four phases:

1. Identification

• Examination of the company’s environment: This phase involves analyzing the economic
environment and the company’s strategy to outline the project’s scope clearly.
• Company identification: This step entails scrutinizing the organizational structure,
business processes, and involved stakeholders of the company.

2. Design

• Defining company objectives: During this stage, we thoroughly explore the strategic
aspirations of operational teams, seeking their tactical goals and specific ambitions.
• Defining a dashboard: This phase encompasses defining and characterizing an individual
dashboard for each team, serving as a decision-making aid with relevant performance
indicators.
• Selection of performance indicators: Choosing performance indicators is a crucial step
based on objectives, context, and stakeholders identified in prior stages, providing
valuable guidance for selecting the most relevant indicators.
• Information collection: This phase aims to gather essential data required for developing
indicators.
• Dashboard system: Constructing the dashboard system and ensuring overall consistency
control.

3. Implementation

• Software selection: Establishing an evaluation framework to choose appropriate software solutions.
• Integration and deployment: At this point, we proceed with implementing the chosen
software solutions, integrating them, and deploying them within the company.

4. Continuous Improvement: Ongoing monitoring of the system involves continuous assessment of performance and of how the decision-making system is used. This aids in identifying issues, recognizing improvement opportunities, and taking corrective actions to ensure smooth operation. Regular monitoring ensures ongoing system optimization and user satisfaction.

1.5 Planning
1.5.1 Project Planning
Our project unfolds through several pivotal phases that encompass diverse aspects of data-driven analysis and application development. In the initial stage of Planning and Requirement Analysis, we lay the foundation by meticulously outlining the project's scope and objectives.
Following this, we embark on the journey of Data Collection via Web Scraping, harnessing the power of automated data extraction to amass relevant information from online sources.
Our project's sophistication ascends with the Building of Segmentation Machine Learning Models. We create two models, meticulously training and validating them. Following a rigorous evaluation, we select the most suitable model, ensuring it aligns seamlessly with our objectives.
Subsequently, in the Data Integration phase, we utilize robust ETL (Extract, Transform, Load) processes to harmonize and merge the collected data. We establish a systematic job plan to ensure consistent updates, maintaining data relevance over time. This integrated dataset serves as the bedrock for the subsequent steps.
In the Data Analysis phase, we leverage Online Analytical Processing (OLAP) cubes to delve into the multidimensional insights hidden within the data. Visual dashboards provide a comprehensive visual representation of these insights, empowering stakeholders to gain quick and meaningful overviews.
The culmination arrives with the Development of a Web Application, an interactive platform showcasing the fruits of our labor. This app provides access to an array of valuable resources, including visually rich dashboards and a comprehensive list of web-scraped data. Furthermore, it furnishes the capability to make informed predictions using the segmentation model. This holistic approach encapsulates the essence of our project, merging cutting-edge analytics with user-friendly interactivity.

1.5.2 Gantt Chart


Figure 1.3 illustrates the timeline of the project planning and the various completed stages through a Gantt chart, following our chosen methodology:

Figure 1.3: Gantt chart of the project planning

1.6 Conclusion
This chapter allowed us to develop our work methodology and outline the different phases to follow. This framework provided us with the essential foundations to progress in our approach. We were able to grasp the challenges of our project by contextualizing it within the company where the internship took place, as well as by specifying our requirements in detail. This step will enable us to lay the necessary groundwork for the implementation of our project. The next chapter will present the fundamental theoretical principles underlying our solution.
Chapter 2

State of the art

2.1 Introduction
Following the presentation of the problem and the proposed solution, this chapter examines the current state of knowledge and developments in our project's domain. It delves into research work, methodologies, and existing solutions to our lead generation issue.
Subsequently, we dive into the specific realm of data collection and cleansing. We focus on web
scraping, a method we judiciously employed in the data collection phase. This approach enabled
us to acquire essential data and subsequently enter the realm of business intelligence to integrate and
prepare the data for analysis. As we progress, we explore machine learning methods, with a particular
focus on unsupervised learning techniques for segmentation problems in the field of Marketing.

2.2 Marketing Lead Generation


This pivotal step aims to establish a profound understanding of the requirements, concerns, and
business objectives that form the foundation of our project.
At the core of this stage lies a meticulous exploration of the operational challenges and strategic
opportunities that our hosting company seeks to address. It is imperative to grasp the contextual
intricacies.
Prospect generation is the process of identifying potential customers interested in a company's products and services. This is typically achieved through various marketing efforts and advertisements designed to attract the attention and interest of potential customers, encouraging them to provide contact details such as names, email addresses, and phone numbers. Once they provide their information, they become "prospects," and businesses follow up with them, usually through marketing emails and phone calls, to convert them into paying customers. Prospect generation plays a crucial role in the sales and marketing strategies of many companies, ensuring a steady flow of potential customers to interact with and ultimately convert into revenue.

According to [39], lead generation helps organizations increase brand awareness, establish relationships, and attract more potential customers to fill their sales pipeline. Organizations see their value increase effectively following the implementation of the tools and processes that are integral to lead generation.

2.2.1 Lead Generation Methods


According to [35], identifying potential clients, also known as lead generation, is carried out through live seminars, trade shows, advertising, and referrals. These methods are classified into two types:

• Reactive Methods: A customer contacts a company after seeing an advertisement. Sales representatives then handle the sale.

• Proactive Methods: A potential customer is cold-contacted by a company. Sales representatives attempt to verify that the potential customer is indeed a prospect.

2.3 Leads Evaluation


Businesses face fierce competition in the market, often leading them to expand their search for
potential prospects, resulting in an increased number of leads entering the customer relationship
management module. Engaging and tracking these prospects without evaluating their quality can
be challenging for companies, making a B2B lead qualification model necessary. Hence, the propensity of B2B and B2C prospects has recently been a carefully studied research area due to its significant impact on sales effectiveness and internal workflow optimization for customer handling.

According to [30], lead qualification is an essential task for the marketing team as
it enhances the efficiency of campaigns conducted by the sales teams. A well-qualified lead will help
the sales team increase the conversion rate. Other factors, such as time optimization, targeting the
right type of prospects, and transforming the lead management process to be more meaningful, also
play a crucial role.

Furthermore, Industry 4.0 emphasizes interconnectivity, predictive analytics, and machine learning to innovate business operations and growth, promoting quick responses in dynamic markets. The study in [37] examines the implementation of Industry 4.0 in businesses: its current state, its positive consequences, the high-level technological advancements it brings when implemented, and the alignment between production and intelligent digital technology.

2.4 Data Collection


2.4.1 WebScraping Fundamentals
According to [28], web scraping refers to the process of automatically extracting data from websites using computer programs or software. It is a particularly crucial process in modern Business
Intelligence. Web scraping is a technology that allows us to extract structured data from text formats
such as HTML. Web scraping is highly valuable in situations where data is not provided in a machine-
readable format like JSON or XML. It can also be used to gather information not available through

conventional methods. Research has shown that using web scraping produces more comprehensive,
accurate, and consistent data compared to manual data entry. Based on the results, it has been
concluded that web scraping is very useful in the information age.
The extracted data can then be used to create marketing lists and to target potential customers with specific offers or promotions. Web scraping can also help businesses gather data about their competitors,
including their marketing strategies and product offerings, which can be used to inform their own
sales and marketing efforts.

2.4.2 Webscraping Techniques


Based on [28], there are several different techniques for web scraping; some of the most common ones include:

• HTML Parsing: This involves analyzing the HTML code of a website to extract specific information (a short illustrative sketch follows this list):

1. Analyzing HTML structure using tools like Beautiful Soup and lxml to select specific
elements for extraction.
2. Using CSS, and XPath selectors to locate specific elements on a webpage for extraction,
such as tags, classes, IDs, and attributes.
3. Web Browser Automation: For more complex websites, automating the web browser may
be necessary to simulate human interaction. Tools like Selenium can be used for this
purpose.
4. Handling Speed Limitations: Websites may implement rate limits to prevent excessive
web scraping. To overcome this, web scraping experts can use techniques like rotating IP
addresses, breaking queries into multiple sessions, and setting delays between requests.
5. Data Storage and Processing: Once the data is extracted, it needs to be stored and
processed.

• Web Crawling: This involves automatically navigating a website to extract data from multiple
pages. Before starting a crawl, defining the crawl scope, i.e., the web pages to explore,
is important. Then, link crawling involves exploring the website’s internal links to collect
additional data.

• API Scraping: This refers to using an Application Programming Interface (API) to extract data
from a website.

• Screen Scraping: This involves extracting data directly from the visual elements of a website,
such as text, images, and forms.
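To make the HTML parsing technique described above more concrete, the following minimal Python sketch uses the requests and Beautiful Soup libraries to pull company names and links out of a listing page. The URL and CSS selectors are illustrative assumptions, not the actual pages targeted in this project.

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; the URL and selectors below are for illustration only.
URL = "https://example.com/companies?industry=software"

response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

companies = []
# Each result is assumed to be wrapped in a <div class="company-card"> element.
for card in soup.select("div.company-card"):
    name_tag = card.select_one("h2.company-name")
    link_tag = card.select_one("a")
    companies.append({
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "website": link_tag["href"] if link_tag and link_tag.has_attr("href") else None,
    })

print(companies)

The same structure-based selection could be expressed with XPath (via lxml) instead of CSS selectors, depending on the page layout.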

Additionally, based on [25], we found that web scraping using the regex method consumes the least memory compared to the HTML DOM and XPath methods. On the other hand, HTML DOM requires the least time and the least data consumption compared to the regex and XPath methods.

1. Depending on the complexity of the website and the data to be extracted, using an API is
the most reliable option if it provides accurate and up-to-date data. However, in the case of
limitations with the LinkedIn API, HTML structure analysis, especially using regex, is the
simplest in terms of memory usage and accessibility.

2. During website exploration, there may be constraints to consider, such as rate limits and
bandwidth limitations. Crawlers can be configured to comply with these constraints by slowing
down the exploration speed or dividing the exploration into multiple sessions. As a result,
errors may occur during website exploration, such as encountering 404 pages or server errors.
Crawlers can be configured to handle these errors by attempting to recover missing pages or
aborting the exploration.
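As a rough illustration of how these constraints can be respected, the short Python sketch below inserts a delay between requests and retries failed pages a limited number of times before giving up; the URLs, delays, and retry counts are arbitrary values chosen only for the example.

import time
import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholder URLs

def fetch_with_retries(url, retries=3, delay_seconds=2.0):
    # Try a few times, waiting between attempts, and give up on persistent errors.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
            if response.status_code == 404:
                return None  # missing page: no point in retrying
        except requests.RequestException:
            pass  # network error: retry after the delay
        time.sleep(delay_seconds * attempt)  # progressive back-off between attempts
    return None

pages = []
for url in urls:
    html = fetch_with_retries(url)
    if html is not None:
        pages.append(html)
    time.sleep(1.0)  # fixed pause between requests to respect rate limits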

2.4.3 The Role of Web Scraping in Lead Generation


This is where lead web scraping comes into play as an effective solution for promoting and engaging
with customers. Everyone desires targeted data of specific industry criteria.
Prospect generation involves collecting information about businesses and individuals within a
specific target audience that is likely to generate revenue. Leads are vital for commercial success, and
lead collecting is a valuable method to gather all the necessary information about potential buyers.
By utilizing lead scraping, one can browse web pages to find relevant prospects using specified
keywords. Interestingly, 91% of marketers consider new lead generation as their top priority.

2.5 Business Intelligence


2.5.1 BI Fundamentals
Business Intelligence (BI) encompasses various techniques, procedures, structures, and tools that
convert raw data from different sources into relevant, practical, and actionable insights through
analysis and transformations. These insights guide strategic, tactical, and operational decisions.
Analytical results are presented in forms such as reports and dashboards to make information
accessible and actionable for decision-makers and stakeholders. Regardless of the industry, businesses
share the common need for a BI system, aiming to achieve several objectives:

• Ensuring easy and rapid access to necessary information.

• Ensuring data credibility and quality, maintaining information coherence.

• Adapting to changes by updating system data according to evolving needs and technology while
keeping users informed of these modifications.

• Extracting significant business value from vast datasets using analytical tools to aid decision-
making.

2.5.2 Decisional System Architecture


Based on [36], a decision support system typically consists of four essential phases: data collection, data integration (ETL, ODS), data organization (data warehouse), and presentation (analysis tools, reporting, etc.).

Figure 2.1: Business Intelligence project Architecture

In Figure 2.1, taken from [2], the in-depth structure of such a decision-making system consists of the following components:

• Data Sources: These are varied and diverse data origins that can be generated both within and
outside the organization.

• ETL Process: This is a procedure involving collecting, transforming, and loading data into a
data warehouse or target system.

• Operational Data Store (ODS): This serves as an intermediate storage system between
operational data sources and the data warehouse. It allows real-time access to operational data
for daily activities and operational reports.

• Data Warehouse: This centralized database efficiently organizes and stores structured data from
different company sources. Its purpose is to facilitate analysis and decision-making by enabling
quick and consistent access to historical and current data.

• Data Marts: These are compact, specialized databases that consolidate domain or department-
specific data. They aim to provide aggregated and pre-formatted information suitable for precise
analyses and reports within a given context.

• OLAP Cubes: These enable interactive analysis of data using multidimensional structures.

• Data Visualization Tools: These tools visually represent information and data graphically and
intuitively.

2.5.3 Multidimensional Modeling


In the context of multidimensional modeling, a schema is composed of a central table, the fact table,
which holds a composite primary key, and additional auxiliary tables known as dimension tables.

• The fact table captures detailed data about specific events or transactions, with numerical
measures tied to various dimensions.

• Dimension tables offer extra descriptive data related to the fact table, providing specific
contexts and viewpoints for the recorded measures.

Three distinct logical data model categories exist:

1. Star Model (see Figure 2.2), where a central fact table is encircled by dimension tables, resembling a star shape and catering to user-friendly data exploration and multidimensional analysis (a small illustrative sketch follows this list).

Figure 2.2: Star Modeling

2. Snowflake Model, see figure 2.3, a variation where dimensions are divided into sub-tables,
forming a hierarchy that improves data normalization but can complicate queries.

Figure 2.3: Snowflake Modeling



3. Galaxy Model, see figure 2.4, which involves multiple star models sharing common dimensions,
potentially involving different facts and dimensions.

Figure 2.4: Galaxy Modeling
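As a small illustration of the star model, the following Python sketch (using pandas) joins a hypothetical fact table of lead events to two dimension tables and then aggregates a measure; all table and column names are invented for the example and do not correspond to the project's actual warehouse.

import pandas as pd

# Dimension tables: descriptive context for the facts (columns are illustrative).
dim_company = pd.DataFrame({
    "company_id": [1, 2],
    "industry": ["Software", "Retail"],
    "country": ["Tunisia", "France"],
})
dim_date = pd.DataFrame({
    "date_id": [20230101, 20230201],
    "year": [2023, 2023],
    "month": [1, 2],
})

# Fact table: one row per event, with foreign keys to the dimensions and a numeric measure.
fact_leads = pd.DataFrame({
    "company_id": [1, 1, 2],
    "date_id": [20230101, 20230201, 20230201],
    "leads_generated": [10, 7, 4],
})

# A typical star-schema query: join the facts to their dimensions, then aggregate the measure.
star = fact_leads.merge(dim_company, on="company_id").merge(dim_date, on="date_id")
print(star.groupby(["industry", "month"])["leads_generated"].sum())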

2.5.4 Data Warehouse Conception Approaches


First, we need to define strategies for creating a data warehouse architecture and conduct a comparison to identify the most fitting approach for our requirements.

2.5.4.1 The Bottom-Up Approach

Initiated by Ralph Kimball, it involves gradually building essential components toward a complete system. It starts with crafting data marts that cater to specific business needs, providing user-friendly reporting and analysis for particular processes, as shown in Figure 2.5 (taken from [3]).

Figure 2.5: Bottom-up approach by Ralph Kimball

2.5.4.2 The Top-Down Approach

Initiated by Bill Inmon, it begins with an overarching vision and moves towards specifics. Here, a data warehouse acts as a centralized repository using a standardized business model, as shown in Figure 2.6 (taken from [3]).

Figure 2.6: Top-Down approach by Bill Inmon

2.5.4.3 The Hybrid Approach

This approach combines Inmon's and Kimball's methodologies for efficient data warehouse design. In practice, many companies use a hybrid approach, employing Inmon's method to establish a centralized data warehouse and Kimball's technique to create data marts using a star schema. This blend reaps the benefits of both approaches, catering to the business's specific needs, as shown in Figure 2.7 (taken from [3]).

Figure 2.7: The Hybrid approach

2.5.5 Comparative Study Between Data Integration Processes


ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) describe processes of data cleaning, enrichment, and transformation from various sources before integrating the data for use in data analysis, business intelligence, and data science. Based on the analyses in [19], we have developed the comparative study illustrated in Table 2.1:

• Data Warehouse Support: ETL is the traditional process for structured data in the warehouse, on the cloud, or on-site; ELT is a modern process for loading structured/unstructured data into a cloud-based warehouse.
• Data Lake/Lakehouse: ETL is less suitable for Data Lakes; ELT is designed for Data Lakes and Data Marts.
• Dataset Size/Type: ETL is suitable for smaller relational datasets with complex transformations; ELT handles any size or type of data and is ideal for big structured/unstructured data.
• Implementation: ETL benefits from a mature tool ecosystem and available experts; ELT is relatively new, with a growing tool and expert ecosystem.
• Transformation: ETL transforms data outside the warehouse, before loading; ELT transforms data quickly after loading, with a potential slowdown in querying.
• Loading: ETL loading is slower than ELT because of its multi-step process; ELT loads directly and is faster due to a single step.
• Maintenance: ETL requires maintenance (on-site/cloud); ELT needs minimal maintenance due to continuous data availability.

Table 2.1: Comparative Analysis ETL vs ELT

2.5.6 Key Performance Indicators


The success of a business is interconnected with the overall performance of the company. This
connection allows for the identification of effective growth drivers and the formulation of a potent
action plan to bolster them. Additionally, business performance serves as a dependable measure
to pinpoint a company’s weaknesses, thereby paving the way for problem-solving and proactive
anticipation of potential challenges. There are 3 types of KPIs classified by Measurement Methods:

• Calculated KPI: A calculated KPI is an indicator determined through mathematical or statistical calculations. It is often based on quantitative data and can be objectively measured.

• Non-calculated KPI: A non-calculated KPI doesn't require calculations for determination. It usually relies on qualitative assessments, surveys, or subjective evaluations.

• Semi-calculated KPI: A semi-calculated KPI combines elements of calculation and qualitative assessments. It may include both quantitative data and subjective evaluations.

KPIs are generally grouped into four distinct categories, each of them with its distinct attributes:

• Strategic KPIs, offer a comprehensive look at a company’s health. Though not providing
intricate details, they are frequently used by executives to gauge return on investment, profit
margins, and total revenue.

• Functional KPIs center around specific company departments or functions. For instance, the marketing department measures the clicks on each email distribution. These KPIs can be strategic or operational, and they offer substantial value to specific user groups.

• Operational KPIs focus on shorter periods, assessing a company's performance from month to month or even day to day. They allow management to analyze specific processes or segments.

• Leading or Lagging KPIs describe the nature of the analyzed data: leading indicators predict forthcoming events, whereas lagging indicators reflect events that have already occurred and result from past operations.

2.5.7 OLAP Tabular vs Multidimensional Models


SQL Server Analysis Services provides modeling flexibility in the realm of business intelligence by
offering two distinct approaches: the tabular approach and the multidimensional approach illustrated
in table 2.2.
This range of options addresses the unique requirements of businesses and users, thereby
delivering a customized modeling experience for each scenario. With these available choices,
organizations can opt for the most suitable method based on their data modeling preferences and
specific characteristics.

• Storage: The tabular model uses columnar storage, which is optimized for performance and compression; the multidimensional model uses multidimensional storage with cubes, dimensions, and hierarchies.
• Data Structure: The tabular model utilizes a relational data model with tables and relationships, similar to a traditional relational database; the multidimensional model utilizes a multidimensional data model with cubes, dimensions, measures, and calculated members.
• Scalability: The tabular model scales well for large datasets and complex calculations due to column-based storage; the multidimensional model might be less scalable for complex calculations and large datasets due to the cube structure.
• Performance: The tabular model generally offers faster query performance due to columnar storage and in-memory processing; the multidimensional model might have slightly slower query performance, especially for complex queries.
• Aggregations: The tabular model aggregates data on the fly, which might impact query performance initially but improves over time with caching; the multidimensional model pre-aggregates data in the cube structure for faster query response times.
• Flexibility: The tabular model is more flexible and easier to model and customize due to its tabular nature and the DAX language; the multidimensional model is less flexible, especially for handling non-standard scenarios.
• Complexity: The tabular model is generally less complex and easier to understand for developers and business users; the multidimensional model can be more complex, especially for business users, due to the multidimensional structure.
• Usage: The tabular model is suitable for self-service BI scenarios and ad-hoc analysis; the multidimensional model is suitable for more traditional BI scenarios and complex analytical applications.
• Tooling: The tabular model supports Power BI, Analysis Services Tabular, and Azure Analysis Services; the multidimensional model supports Analysis Services Multidimensional and older versions of Excel-based Power Pivot.
• Data Model: The tabular model relies on relationships between tables and uses DAX expressions for calculations and measures; the multidimensional model uses cubes with predefined hierarchies, dimensions, measures, and MDX expressions for calculations.

Table 2.2: Comparative Analysis of OLAP Models: Tabular vs Multidimensional

According to [27], the collection and analysis of marketing data and information are the scientific basis for marketing decision-making. The study found that the key technologies supporting business intelligence include data warehousing, data mining, and OLAP. It also examined the application of business intelligence in corporate marketing decision-making.

2.5.8 The Role of Business Intelligence in B2B Lead Generation Marketing


The study in [29] discusses the applications of data-driven marketing, which is closely aligned with the concept of BI, and examines its influence on customer engagement and overall marketing effectiveness. The
research highlights the pivotal role of data analytics and insights in crafting targeted marketing
strategies that enhance customer engagement and drive business growth in the B2B context. This
study showcases the practical implementation and positive outcomes of leveraging data-driven
approaches in modern marketing practices, corroborating the importance of BI in B2B lead generation
marketing.
Business Intelligence (BI) has also emerged as a transformative catalyst. Leveraging data-
driven insights, BI empowers businesses to finely calibrate their marketing strategies, tailoring
campaigns to resonate with potential leads. Through precise segmentation and a nuanced
understanding of customer behaviors, BI guides resource allocation toward high-conversion
prospects, enhancing both efficiency and effectiveness. As a central pillar of modern marketing,
BI drives informed decision-making, sustains competitive advantage, and nurtures enduring customer
relationships in today’s dynamic landscape.

2.6 Artificial Intelligence


Artificial Intelligence (AI) has emerged as a transformative field encompassing a diverse range of
learning techniques, with machine learning and deep learning at its forefront. Machine learning, as
pioneered by Arthur Samuel in 1959, has evolved to become a cornerstone of AI, allowing systems
to improve their performance on tasks through data-driven algorithms. Deep learning, a subset of
machine learning, has gained prominence owing to its ability to emulate human neural networks,
resulting in remarkable advances in areas such as computer vision and natural language processing.
The application of deep neural networks has revolutionized AI by enabling computers to automatically
learn and extract intricate patterns from vast datasets. The dynamic interplay between these learning

types has catalyzed breakthroughs across industries, ushering in a new era of AI-powered innovation
with profound societal implications. For more details, see [31].

2.7 Machine Learning


Machine learning (ML) is a component of artificial intelligence and computer science that focuses
on utilizing algorithms and data to emulate human learning processes, progressively enhancing its
precision. These algorithms develop a model based on sample data, termed training data, to generate
forecasts or choices without requiring explicit programming.
ML empowers a system to autonomously learn from data, refine performance based on
experiences, and anticipate outcomes without direct programming. Leveraging historical data
samples, designated as training data, machine learning algorithms formulate mathematical models
that facilitate forecasts or decisions without explicit coding. By merging computer science and
statistics, machine learning forges predictive models via algorithms that glean insights from past
data. Enhanced data input leads to elevated performance.
ML models can be conceptualized as programs trained to detect patterns in new data and make
anticipatory projections. These models manifest as mathematical functions, receiving input data for
prediction and delivering corresponding outputs. Initially, these models are educated on a dataset and
subsequently equipped with an algorithm to analyze data, extract patterns from input information,
and accumulate knowledge. Once trained, these models become adept at predicting unseen datasets
as illustrated in Figure 2.8 taken from [4].

Figure 2.8: Machine learning representation

2.7.1 Types of Machine Learning


2.7.1.1 Supervised Learning

Supervised machine learning operates under the principle of guidance. This involves instructing
machines through the use of a ”labeled” dataset, whereby the machine is trained and subsequently
makes predictions based on this training. The term ”labeled data” signifies that specific inputs are
already linked to their respective outputs. To elaborate further, the process begins by training the
machine with input-output pairs, followed by tasking the machine with predicting outputs when
presented with a separate test dataset.

Figure 2.9: Supervised Learning representation

As illustrated in Figure 2.9 taken from [5], supervised learning embodies a category of
ML wherein the algorithm undergoes training with a dataset that includes both input data and
corresponding output labels. This training enables the algorithm to establish associations between
input data and the correct corresponding output, based on the provided labels. The overarching
objective of supervised learning is to facilitate accurate predictions for novel, unseen data, leveraging
the general patterns and relationships absorbed during the training phase.

The primary objective underlying the supervised learning approach is to establish a mapping
between the input variable (x) and the output variable (y). It can be divided into two distinct
categories:

• Classification algorithms are tailored to address classification quandaries, specified by categorical output variables such as ”Yes” or ”No,” ”Male” or ”Female,” ”Red” or ”Blue,” and so on. These algorithms are designed to predict the categories represented within the dataset.

• Regression algorithms are engineered to solve regression predicaments, wherein a linear relationship is observed between input and output variables. These algorithms prove efficacious in forecasting continuous output variables, exemplified by domains like market trends and weather predictions.
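A minimal supervised-learning sketch in Python (scikit-learn) is given below: a classifier is trained on labeled input-output pairs and then predicts labels for unseen data. The synthetic dataset and the choice of Logistic Regression are illustrative only and are not the data or models used in this project.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labeled dataset: inputs X already linked to their output labels y.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train on the labeled pairs, then predict on the held-out test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Accuracy on unseen data:", accuracy_score(y_test, predictions))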

2.7.1.2 Unsupervised Learning

Unsupervised learning stands as a distinctive approach in machine learning, where models operate without guided instruction from a training dataset. Instead, these models autonomously uncover concealed patterns and insights within the provided data, analogous to how humans assimilate new knowledge from their environment. In essence:
Unsupervised learning is a machine learning category wherein models are trained using unlabeled
datasets, enabling them to make informed decisions without directed oversight.
The algorithm’s objective within this context is to unveil latent patterns, structures, or relationships
embedded within the data, devoid of explicit steering. As illustrated in figure 2.10 taken from [6],

the core objective of unsupervised learning algorithms is to categorize unorganized datasets based on
similarities. Tasks like clustering and dimensionality reduction exemplify this paradigm. Clustering
involves amalgamating data points based on inherent attributes, while dimensionality reduction
techniques aspire to encapsulate complex data within a reduced-dimensional space while retaining
crucial insights. Unsupervised machine learning can be divided into two distinct categories, as
delineated below:

• Clustering: Employed when intrinsic groups within data necessitate discovery. This technique
groups objects so that the most similar items congregate, while dissimilarity dominates between
different groups. An instance is customer grouping by purchasing behavior.

• Association: This technique, a subset of unsupervised learning, unveils meaningful


relationships among variables within extensive datasets. It seeks interdependencies between
data items, forming variable correlations to maximize gain.

Figure 2.10: Unsupervised Learning representation

2.7.1.3 Semi-Supervised Learning

Semi-supervised learning combines aspects of both supervised and unsupervised learning. It learns from both labeled and unlabeled data, leveraging limited labeled examples and a more extensive pool of unlabeled instances to enhance its understanding and performance.
This technique mimics human learning, where we often gather more knowledge from our
environment without explicit guidance. In the context of machine learning, labeled data is expensive
and time-consuming to obtain, while unlabeled data is more abundant. Semi-supervised learning
exploits this abundance, using the unlabeled data to augment the learning process.
The semi-supervised learning process typically begins with training on the limited labeled data
available. It then utilizes the patterns and insights gleaned from this data to generalize and make
predictions on the larger set of unlabeled data. The addition of unlabeled data helps the AI system
refine its understanding of complex patterns and variations in the data.

While traditional supervised learning focuses solely on labeled data and unsupervised learning
deals with unlabeled data, semi-supervised learning offers a balanced approach that capitalizes on the
advantages of both paradigms. This technique is valuable in scenarios where acquiring large amounts
of labeled data is challenging or expensive, yet you want to improve model performance beyond what
unsupervised learning can achieve alone.

2.7.1.4 Reinforcement Learning

Figure 2.11 (taken from [7]) shows that reinforcement learning operates through a feedback-driven
procedure where an AI agent (a software component) autonomously explores its environment through
trial and error. It takes action, learns from its encounters, and enhances its performance. The
agent is rewarded for favorable actions and penalized for unfavorable ones, with the primary aim
of maximizing cumulative rewards.
In contrast to supervised learning, reinforcement learning lacks labeled data and solely relies on
experiential learning.

Figure 2.11: Reinforcement Learning

The process of reinforcement learning mirrors human learning, similar to how a child acquires knowledge through daily experiences. A tangible instance is playing a game, wherein the game serves as the
environment, the agent’s moves represent states, and the objective is to achieve a high score. The
agent receives feedback in the form of rewards and penalties.
Reinforcement learning’s operational paradigm has found applications across diverse domains
including game theory, operations research, information theory, and multi-agent systems.
Formally, a reinforcement learning challenge can be defined using the framework of a Markov
Decision Process (MDP). Within this context, the agent engages continually with the environment,
executing actions that result in environment responses and subsequent state transitions.

Reinforcement learning can be broadly categorized into two methodologies:

• Positive Reinforcement Learning: This approach reinforces desired behavior by introducing


positive elements, thereby increasing the likelihood of the behavior’s recurrence. It strengthens

the agent’s behavior and yields positive effects.

• Negative Reinforcement Learning: In contrast, negative reinforcement learning employs


avoidance techniques to encourage the recurrence of specific behavior. It aims to prevent
undesirable outcomes and fosters the repetition of desired actions.

Machine learning encompasses various types tailored for distinct tasks and precise results, divided as illustrated in Figure 2.12 (taken from [7]) and discussed in the Machine Learning Techniques article on its applications and challenges [33]:

Figure 2.12: Machine Learning differentiation

2.7.2 The Role of Machine Learning in The Marketing Field


Based on the article Impact of Machine Learning in Digital Marketing Applications [24], in light of emerging technologies, companies are adopting customized digital marketing strategies to enhance customer attraction and retention. These strategies leverage artificial intelligence applications
to consolidate and support marketing efforts. The utilization of machine learning techniques and
algorithms has been instrumental in streamlining marketing processes, boosting business profitability,
and becoming an integral part of digital marketing practices. Moreover, machine learning brings
significant benefits to market segmentation and lead evaluation, enhancing precision, personalization,
and efficiency in targeting the right audience and identifying high-potential prospects. Machine
learning algorithms offer a dual advantage: firstly, they address the issue of uneven distribution of
marketing targets, and secondly, they effectively mitigate the loss of potential consumers.

2.7.3 Unsupervised Learning for Segmentation Problem


Segmentation is a common task in machine learning where the goal is to partition data into distinct clusters or groups based on certain characteristics. There are various algorithms used for segmentation (a short illustrative sketch in Python follows the list below):

• K-Means Clustering: The algorithm seeks to minimize the sum of squared distances from
each point to the centroid of its assigned cluster. It assigns each data point to the nearest cluster
centroid and then updates the centroids based on the mean of the points in each cluster. This
process is repeated iteratively until convergence. K-means can work well when clusters are
well-defined and roughly spherical. Its performance can be evaluated using metrics like the
silhouette score or within-cluster sum of squares.

$$\arg\min_{\text{clusters}} \sum_{i=1}^{n} \min_{j=1}^{k} \lVert x_i - \mu_j \rVert^2$$

– $\arg\min_{\text{clusters}}$: the argument that minimizes over possible cluster assignments.
– $\sum_{i=1}^{n}$: the summation over all data points.
– $\min_{j=1}^{k}$: the minimum over the cluster centroids.
– $\lVert x_i - \mu_j \rVert^2$: the squared Euclidean distance between data point $x_i$ and centroid $\mu_j$.

• Hierarchical Clustering: Hierarchical clustering starts by considering each data point as a


separate cluster. Then, it iteratively merges clusters based on their similarity until all data points
are part of a single cluster or a stopping criterion is met. This algorithm creates a dendrogram
that represents the arrangement of data points into a tree-like structure. The linkage criterion (single, complete, average, Ward, etc.) defines how to calculate the distance between clusters; distances computed under the chosen criterion form the dendrogram that represents the hierarchy of clusters.

1. Single Linkage: defines the distance between two clusters as the minimum distance
between any pair of points, one from each cluster.

$$D(C_i, C_j) = \min_{x \in C_i,\ y \in C_j} \lVert x - y \rVert$$

– $D(C_i, C_j)$: the distance between clusters $C_i$ and $C_j$.
– $x$: a data point in cluster $C_i$.
– $y$: a data point in cluster $C_j$.
2. Complete Linkage: defines the distance between two clusters as the maximum distance
between any pair of points, one from each cluster.

$$D(C_i, C_j) = \max_{x \in C_i,\ y \in C_j} \lVert x - y \rVert$$

– $D(C_i, C_j)$: the distance between clusters $C_i$ and $C_j$.
– $x$: a data point in cluster $C_i$.
– $y$: a data point in cluster $C_j$.


3. Average Linkage: Defines the distance between two clusters as the average (mean)
distance between all pairs of points, where one point is from the first cluster, and the
other is from the second cluster.
$$D(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \lVert x - y \rVert$$

– $D(C_i, C_j)$: the distance between clusters $C_i$ and $C_j$.
– $|C_i|$: the number of data points in cluster $C_i$.
– $|C_j|$: the number of data points in cluster $C_j$.
– $x$: an individual data point from cluster $C_i$.
– $y$: an individual data point from cluster $C_j$.
4. Ward Linkage: Ward's linkage calculates the distance between two clusters based on the increase in the sum of squared distances from the centroids of each cluster, weighted by the sizes of the clusters.

$$D(C_i, C_j) = \sqrt{\frac{|C_i| \cdot |C_j|}{|C_i| + |C_j|}} \cdot \lVert \mu_i - \mu_j \rVert$$

– $D(C_i, C_j)$: the distance between clusters $C_i$ and $C_j$.
– $|C_i|$: the number of elements in cluster $C_i$.
– $|C_j|$: the number of elements in cluster $C_j$.
– $\mu_i$: the centroid of cluster $C_i$.
– $\mu_j$: the centroid of cluster $C_j$.

The quality of hierarchical clustering can be visualized through dendrograms, and cluster selection can be guided by metrics such as cophenetic correlation or silhouette score.

• Gaussian Mixture Models (GMM): GMM represents data as a mixture of Gaussian


distributions. It involves estimating the covariances, means, and mixing coefficients of the
Gaussian components. GMM assigns data points to different Gaussian distributions and
iteratively updates the parameters using the Expectation-Maximization (EM) algorithm.

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$$

– $p(x_i \mid \theta)$: the probability of data point $x_i$ given the parameters $\theta$.
– $k$: the number of Gaussian components in the mixture model.
– $\pi_j$: the mixing coefficient of the $j$-th Gaussian component.
– $\mathcal{N}(x_i \mid \mu_j, \Sigma_j)$: the Gaussian probability density function for data point $x_i$ with mean $\mu_j$ and covariance matrix $\Sigma_j$.

GMM can model clusters of arbitrary shapes and can capture data distribution complexities. Evaluation can be done using the log-likelihood or the Bayesian Information Criterion (BIC).

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN defines clusters based on density connectivity. Core points, density-reachable points, and noise points are defined using the epsilon (ε) and minimum points (MinPts) parameters. It forms clusters around core points with density-reachable points and identifies noise points. It doesn't require the number of clusters as an input but rather defines clusters based on the density connectivity of data points within a specified distance (ε) and minimum number of points (MinPts).

• Mean Shift: It is a non-parametric technique that seeks the modes of the data distribution. It involves shifting a window towards the higher-density region. It iteratively shifts data points towards areas of higher density until convergence. Clusters emerge around points that converge to the same mode. Model performance: Mean Shift can identify irregularly shaped clusters. Evaluation involves metrics like silhouette score or visual assessment.

$$m(x) = \frac{\sum_{i=1}^{n} K(x - x_i)\, x_i}{\sum_{i=1}^{n} K(x - x_i)}$$

– $m(x)$: the mean shift vector at point $x$.
– $n$: the number of data points in the dataset.
– $x$: the current data point.
– $x_i$: an individual data point from the dataset.
– $K$: the kernel function that assigns weights to data points based on their proximity to $x$.
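To ground these algorithms, the sketch below (Python, scikit-learn) runs K-means and a Gaussian mixture model on the same synthetic data and compares them with the silhouette score mentioned above. The data and parameter values are arbitrary illustrations, not the segmentation models built in this project.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic, unlabeled data with a few reasonably separated groups.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)

# K-means: minimizes the within-cluster sum of squared distances to the centroids.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Gaussian mixture: soft assignment to Gaussian components fitted with the EM algorithm.
gmm_labels = GaussianMixture(n_components=4, random_state=42).fit(X).predict(X)

# The silhouette score (higher is better) summarizes cluster cohesion and separation.
print("K-means silhouette:", round(silhouette_score(X, kmeans_labels), 3))
print("GMM silhouette:", round(silhouette_score(X, gmm_labels), 3))

The same comparison can be repeated for different numbers of clusters to guide the choice of k, which is how the silhouette metric is typically used in practice.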

2.7.4 Similar Solutions


Based on [34], we found that, to achieve an unsupervised classification that can be adopted in a decision-making process, the use of unsupervised learning techniques plays a crucial role in complementing traditional services with new business intelligence services that meet the needs of companies, stakeholders, and customers.
As per [38], the authors used the FP-growth algorithm to obtain sets of frequent items, extracting and analyzing user behavior association rules to derive feature vectors for user classification. They then employed the Naive Bayes algorithm on the feature vectors to implement cluster-based learning for more accurate marketing.
In line with [32], a supervised learning approach was used for lead scoring, with algorithms such as Logistic Regression and Decision Trees to predict the probability of purchase (utilizing knowledge and behavioral data). The authors found that the algorithm yielding the best performance is the Random Forest model.
In summary:

• The authors employed a variety of measures and validation approaches instead of relying solely
on accuracy criteria to evaluate model performance.

• The authors introduced processing time and computational power as useful criteria in model
selection to maintain stable performance on large datasets.

• ML can not only significantly enhance the performance of large-scale data exploration but also
achieve precise marketing and further increase the marginal profit by approximately 20% for
each product type.

2.8 Conclusion
In this chapter, we have examined the impact of data integration and predictive analysis technologies on economic and business decision-making. The next chapter will focus on presenting our work and outlining the different phases we followed.
Chapter 3

Preliminary Analysis

3.1 Introduction
In this chapter, we will begin by examining the requirements of this project. Following that, as we
embark on our web application project, we will establish the design foundations, and outline the
various tasks carried out following meetings with the Scrum Master and the Product Owner. During
these meetings, we formulated the project backlog and segmented it into iterations. Subsequently,
we will outline the overall architecture of our project, and finally, we will delve into the development
environment for the work.

3.2 Conception
3.2.1 Requirements Analysis
By implementing these functionalities, users gain the ability to make more informed business
decisions based on precise data and in-depth analysis.

3.2.1.1 Actors

User: Represents standard users who have access to the web application to view the site. This could be
a business director or any decision-maker who would benefit from leads, predictions, or visualizations
for an overall market analysis.
Administrator: This is the web management system administrator with elevated data access rights.
They can manage, review, add, modify, and delete significant elements and users.

3.2.1.2 Functional Requirements

• Data Extraction from Targeted Company Profiles: The process of gathering relevant
information and details about specific companies, such as their history, industry, size, location,
and key personnel, from various sources.

• Data Extraction from Targeted Company Websites: Collecting data from specific company
websites, which may include contact information and other relevant content to obtain insights


into their online presence.

• Data Integration: Combining data from different sources and formats into a unified and
consistent format, allowing decision-makers to analyze and make informed decisions based
on a comprehensive view of their data.

• Standardized Reports: Predefined and structured reports presenting key performance indicators
(KPIs) and metrics in a consistent format, enabling easy and quick access to essential business
information.

• Company Segmentation: The process of categorizing company websites or prospects into


distinct groups based on specific criteria to tailor marketing and sales strategies for more
effective targeting.

• Platform-Accessible Features: Providing user-friendly features and tools on a single platform,


allowing easy access and usage of various functions, such as data analysis, reporting, and lead
management, for streamlined and efficient operations.

From an Administrator’s Perspective

1. User Management:

• Viewing and searching for a user


• Adding a new user
• Modifying user details
• Deleting a user account

2. Group Management:

• Listing and searching for groups


• Creating a new group
• Modifying group details
• Deleting a group

3. Prospect Management:

• Listing and searching for prospects


• Adding a new prospect
• Modifying prospect information
• Deleting prospects

From a Client’s Perspective

1. Accessing the list of prospects

2. Conducting assessments

3. Viewing dashboards

3.2.1.3 Non-Functional Requirements

These are the inherent system characteristics, encapsulating the implicit requirements with which the system must comply. Among these, we highlight:

• Response Time: Swift responsiveness is sought after in an application, enabling a seamless and instantly interactive user experience.

• Security Measures: Safeguarding the confidentiality and accessibility of both application data and user data.

• Dependability: The data yielded by the application must be accurate and assured.

• User-Centric Design: User interfaces should exude user-friendliness, essentially manifesting simplicity, ergonomic design, and personalized adaptation.

3.2.2 Use Case Diagram


3.2.2.1 Use Case: User

The ideal user would be a sales or marketing manager who would benefit from the insights provided in our app. The user can perform the following actions, as illustrated in Figure 3.1:

1. Login and registration: The first step is to create an account which requires email verification,
then the user who has created his account can log in to the platform.

2. Leads Search and Filtering: The user can search leads by industry, location, or even company name, and can view the global list of the leads in the database in the form of a table.

3. Evaluate a company's digital maturity: The user can input the information related to the digital presence of a company through a form and predict its digital maturity; he can also view the history of predictions.

4. View Market analysis dashboards: View the dynamic visuals of multiple dashboards to gain
information and inspiration on his next marketing strategy.

Figure 3.1: Use-case diagram for user

3.2.2.2 Use Case: Admin

The web management system administrator can perform the following actions, which are illustrated in Figure 3.2:

1. Leads Management: The admin can manage the lead models defined in the application: create, update, delete, and view instances of the database models directly from the admin interface.

2. Prediction History Management: The admin can manage the prediction history models defined in the application: create, update, delete, and view instances of the database models directly from the admin interface.

3. Authentication and Authorization control: The admin panel requires users to log in with valid
credentials. It offers role-based access control, allowing different users to have varying levels
of access and control over different parts of the application. Superusers can assign permissions
and roles to other users, including creating new superusers.

4. Search and Filtering: The admin can easily search for records using keyword searches and apply
filters to narrow down results, enhancing usability for administrators managing large datasets.

5. List and Detail Views: The admin panel provides list views to display records in tabular format
and detail views for individual records, making it straightforward to view and manage data.

6. Actions and Bulk Operations: Developers can define custom actions that can be applied to
multiple records at once, simplifying bulk operations such as deleting or updating records.

7. Manage Users and Groups: Superusers can create, update, and delete user accounts as well as
manage groups and permissions.

Figure 3.2: Use-case diagram for admin

3.2.3 Class Diagram


Our class diagram is composed of the following classes which are illustrated in Figure 7.3

• User Description: The User class table is responsible for storing user accounts and
authentication information. It enables users to log in, access personalized content, and interact
with the web app’s features. User data includes attributes like username, email, password,
and permissions. This class is fundamental for managing user identities and access within the
application.

• Group Description: The Group class table represents user groups, which can help in organizing
and managing permissions efficiently. User accounts can be assigned to specific groups,
simplifying permission management by applying access controls to entire groups instead of
individual users. This class table typically includes attributes like group name and associated
permissions.

• Leads Description: The Prospect class table handles the management of potential clients or
prospects. It stores information about potential customers who have shown interest in the
services or products offered by the web app. Attributes within this table may include prospect
name, contact information, interaction history, and status (e.g., active, inactive).

• Data Description: The Assessment class table is used to store evaluation data conducted by
clients.

• Superuser Description: The Superuser class table represents the highest level of administrative
access within the application. Superusers have the authority to manage user accounts, groups,

and other administrative tasks. This class table may store attributes like username, email,
password, and additional permissions specific to superusers.

Figure 3.3: Class diagram
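As a rough sketch of how these classes might translate into Django models, the snippet below defines simplified Prospect and Assessment models; the field names and relations are assumptions made for illustration and do not reflect the project's exact schema (Django's built-in User and Group models would cover the remaining classes).

from django.conf import settings
from django.db import models

class Prospect(models.Model):
    # Simplified lead record; field names are illustrative.
    name = models.CharField(max_length=255)
    industry = models.CharField(max_length=100, blank=True)
    location = models.CharField(max_length=100, blank=True)
    website = models.URLField(blank=True)
    status = models.CharField(max_length=20, default="active")

    def __str__(self):
        return self.name

class Assessment(models.Model):
    # Evaluation performed by a user on a prospect (e.g., a predicted segment).
    prospect = models.ForeignKey(Prospect, on_delete=models.CASCADE, related_name="assessments")
    evaluated_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    predicted_segment = models.CharField(max_length=50)
    created_at = models.DateTimeField(auto_now_add=True)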

3.3 Project management with SCRUM


3.3.1 Team and roles
In our project, there are several key individuals, each playing a distinct role that contributes to the
project’s success.
Machkena Zied takes on the pivotal role of Product Owner. His primary mission involves
defining the project’s needs with precision. This encompasses determining the order of priority in
which various functionalities will be developed, a crucial aspect of the project’s strategic planning.
Additionally, Zied plays the role of the validation of tasks performed by the team, providing essential
input and evaluating proposed ideas to ensure they align with project goals.
El Horry Slim serves as the Scrum Master, a role vital for maintaining the project’s smooth
operation. His responsibilities extend to overseeing the correct application of the project
methodology, including aspects like Sprint planning and Scrum events. Furthermore, Slim is in
charge of fostering team engagement and helping them overcome obstacles that may arise during
the project's execution. He also assists the team in identifying solutions to challenges, and meticulously tracks completed tasks and proposed ideas for continuous improvement.
Ben Hadj Kacem Souha and El Horry Slim jointly contribute as valuable members of the Satoripop
interns team, serving in the capacity of the Development Team. Their multifaceted role encompasses

a spectrum of responsibilities, including generating solution proposals, conducting in-depth research


and development activities, participating in the design phase, and ultimately realizing these solutions.
This dynamic duo’s collaborative efforts are essential to the project’s success, as they bring innovation
and expertise to the development process.
In summary, the project benefits from the dedicated efforts of these individuals, each bringing
their unique skills and perspectives to the table. Their well-defined roles ensure efficient project
management, smooth implementation of methodologies, and the innovative development of solutions
that contribute to our project’s overall success.

3.3.2 Product backlog


Here is the list (refer to Table 3.1) of expected product features. The product backlog, beyond this aspect, represents the most crucial element of Scrum. It encompasses all the functional or technical characteristics that make up the desired product. Functional features are referred to as ”user stories,” while technical features are known as ”technical stories.”

Table 3.1: Product Backlog

Technical Stories:
• TS1, Preparation of BI dashboards (High): As a developer, I have to prepare the dashboards based on the study of the customer's needs.
• TS2, Data Extraction of Targeted Companies' Profiles (High): As a developer, I have to develop a system for gathering information about specific companies.
• TS3, Data Extraction of Targeted Companies' Websites (High): As a developer, I have to develop a system for gathering information about companies' websites.
• TS4, Data Extraction of Targeted Companies' Employees (High): As a developer, I have to develop a system for gathering information about companies' employees.
• TS5, Companies Segmentation (High): As a developer, I have to develop a feature for categorizing company websites or leads into distinct ranked groups.

User Stories:
• US1, Integration of BI dashboards (High): As a user, I want to get a better vision of the data via interactive decision-making dashboards.
• US2, Account Creation (Medium): As a user, I want to create an account to access the application.
• US3, Accessing the Prospect List (High): As a user, I want to access the list of prospects.
• US4, Performing Assessments (Medium): As a user, I want to perform assessments on listed prospects.

Admin Stories:
• AS1, User Management (Medium): As an admin, I want to perform user management actions.
• AS2, Group Management (Medium): As an admin, I want to perform group management actions.
• AS3, Managing Evaluation History (Medium): As an administrator, I want to perform actions related to the management of evaluation history.
• AS4, Prospect Management (High): As an admin, I want to perform leads management actions.

3.3.3 Release planning


After defining the product roadmap, we devised a delivery schedule based on the logical development
sequence. The project is split into three releases, with each module expected to be completed within
one release cycle. Table 3.2 lists the three releases along with their respective durations:

Table 3.2: Releases and Their Execution Times

Iteration ID | Functionalities | Duration
Release 1 | TS2 Extract the data of targeted companies' profiles; TS3 Extract the data of targeted companies' websites; TS4 Extract the data of targeted companies' employees; US2 Accessing the prospect list; AS1 User management; AS2 Group management; AS3 Prospect management | 8 weeks
Release 2 | TS5 Companies segmentation; US3 Performing evaluations; AS4 Evaluations management | 8 weeks
Release 3 | TS1 Preparation of BI dashboards; US1 Integration of BI dashboards | 8 weeks

3.4 Architecture: MVT


The Django architecture illustrated in Figure 3.4 (taken from [8]) promotes the separation of concerns,
making it easier to manage and maintain web applications and encouraging us to write clean and
reusable code.

Figure 3.4: MVT Architecture

3.4.1 Model Layer


The Model represents the data and the database schema of the application. It defines the structure of
the data and includes methods to interact with the database. Django uses Object-Relational Mapping
(ORM) to interact with the database, which allows developers to work with database tables as Python
classes and database records as objects.
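To make this concrete, a minimal sketch of what such a model could look like is given below; the Lead class and its field names are illustrative assumptions, not the project's actual schema.

from django.db import models

class Lead(models.Model):
    """A company prospect collected by the scraping pipeline (hypothetical)."""
    name = models.CharField(max_length=255)
    website = models.URLField(blank=True)
    industry = models.CharField(max_length=100, blank=True)
    company_size = models.CharField(max_length=50, blank=True)
    founded = models.PositiveIntegerField(null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name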

3.4.2 View Layer


The View handles the business logic and serves as the intermediary between the Model and the
Template. It receives user requests, processes data from the Model, and then renders the appropriate
Template to present the data back to the user. In Django, views are implemented as Python functions
or classes.
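As a hedged illustration, a function-based view tying the Model and Template layers together might look like the sketch below; the view name, model, and template path are assumptions made for the example.

from django.shortcuts import render
from .models import Lead  # hypothetical model from the previous sketch

def lead_list(request):
    """Query leads through the ORM and render them with a template."""
    leads = Lead.objects.order_by("name")
    return render(request, "leads/lead_list.html", {"leads": leads})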

3.4.3 Template Layer


The Template is responsible for the user interface and presentation of the data. It defines how
the data should be displayed to the users. Django templates use a simple template language with
placeholders for dynamic content, making it easy to generate dynamic HTML pages.

3.5 Development Environment


3.5.1 Framework: Django
Django (Figure 3.5, taken from [9]), a high-level Python web framework, served as the backbone
of the application. Its powerful features, including its model-view-template architecture, eased the
development process by providing a structured and organized framework.

Figure 3.5: Framework: Django

3.5.2 Client-side Interaction: Javascript


JavaScript (Figure 3.6, taken from [10]), a versatile scripting language, was employed to enhance
the user experience by adding interactive and dynamic elements to the application. This allowed for
real-time updates, form validation, and smooth transitions.

Figure 3.6: Javascript

3.5.3 Front-end Design: Bootstrap


The application's front-end design was facilitated by Bootstrap (Figure 3.7, taken from [11]), a popular
open-source CSS framework. Bootstrap's responsive grid system and pre-designed components made it
straightforward to build appealing user interfaces.

Figure 3.7: Bootstrap

3.5.4 Integrated Development Environment (IDE) and Version Control


Visual Studio Code (Figure 3.8, taken from [12]), a powerful integrated development environment,
was used for coding, debugging, and managing the project files. Its extensive features and user-
friendly interface greatly contributed to the development process. Git (Figure 3.8, taken from [12]),
a distributed version control system, was utilized to track changes in the project's source code. This
enabled collaboration among team members and provided the ability to roll back to previous versions
if needed.

Figure 3.8: VS code and Github

3.6 Conclusion
In this chapter, we began by defining the key participants in our application and outlining their
respective roles and use cases. Following that,
we delved into the functional and non-functional specifications of our solution. Subsequently, we
elaborated on the approach we will adopt for our project using the Scrum methodology. To wrap up,
we concluded this chapter with an overview of the software environment. The next chapter will be
entirely dedicated to the first deliverable, ”Delivery 1.”
Chapter 4

First Release

4.1 Introduction
After examining and defining our client’s overall requirements, this chapter will delve into the various
steps involved in developing the first delivery’s two sprints. We will begin by presenting the product
backlog for each sprint, followed by a detailed analysis, feature design, and ultimately, a showcase of
the user interfaces.

4.2 Presentation of Release 1


A meeting was organized with the Scrum team to determine the features to include in this delivery.
Our first delivery, titled ”Module: User and Prospect Management,” will consist of two sprints, as
follows:

• Sprint 1.1: Account Creation and Admin Panel

• Sprint 1.2: Data Collection and Prospect Management

For each sprint, we will present its sprint backlog, and an analysis will be explored to illustrate
the interfaces created.

4.3 Sprint 1.1: Account Creation and Admin Panel


This section focuses on presenting the product backlog for this sprint and includes screenshots of the
interfaces.


4.3.1 Sprint 1.1 Backlog


In Table 4.1, we will list the different user stories for this first sprint in the product backlog.

US/TS User Story


US2.1 As a user, I want to create a new account.
US2.2 As a user, I want email verification.
US2.3 As a user, I want to log in.
US2.4 As a user, I want to reset my password.
AS1.1 As an administrator, I want to view and search for users.
AS1.2 As an administrator, I want to add a new user.
AS1.3 As an administrator, I want to be able to modify user details.
AS1.4 As an administrator, I want to delete a user account.
AS2.1 As an administrator, I want to list and search for groups.
AS2.2 As an administrator, I want to create a new group.
AS2.3 As an administrator, I want to be able to modify group details.
AS2.4 As an administrator, I want to be able to delete a group.

Table 4.1: Sprint 1.1 Backlog

4.3.2 Increment of Sprint 1.1


4.3.2.1 Registration and email verification

Figure 4.1 depicts the interface that enables the user to create an account

Figure 4.1: Interface Registration



Figure 4.2 depicts the confirmation email:

Figure 4.2: Email verification

Figure 4.3 depicts the login interface from which the user can access the home page:

Figure 4.3: Login interface



Figure 4.4 depicts the password reset request interface, from which a user who has forgotten their
password can request a reset; an automatic reset email is then sent:

Figure 4.4: Interface Request a Reset Password

Figure 4.5 depicts the password reset interface itself, from which the user can set a new password:

Figure 4.5: Interface Password reset



4.3.2.2 Administrator Login Interface and Home page

Figure 4.6 depicts the interface that enables the admin to access his account:

Figure 4.6: Interface: Admin login

Figure 4.7 depicts the administrator’s home page :

Figure 4.7: Interface: Admin home page



4.3.2.3 Users Management Interface

Figure 4.8 depicts the user management interface, particularly adding the user :

Figure 4.8: Interface: user Add

Figure 4.9 depicts the successful addition of a user :

Figure 4.9: Interface: Added user



Figure 4.10 depicts the user group management interface, particularly adding or updating the user
group:

Figure 4.10: Interface: Add - Update user groups

Figure 4.11 depicts the user permissions management interface particularly updating or removing
user permissions:

Figure 4.11: Interface: update - remove user permissions



Figure 4.12 depicts the user management interface particularly updating or deleting users:

Figure 4.12: Interface: Update - Delete user

4.3.2.4 Group Management

Figure 4.13 depicts the group management interface, particularly adding group :

Figure 4.13: Interface: Add Group



Figure 4.14 depicts the group management interface, particularly managing group permissions:

Figure 4.14: Interface: Group permissions

Figure 4.15 depicts the resulting group added successfully :

Figure 4.15: result Group added successfully



4.4 Sprint 1.2: Data Collection


This section focuses on presenting the product backlog for this sprint, its implementation, and
interface screenshots.

4.4.1 Sprint 1.2 Backlog


In Table 4.2, we will outline the various user stories for the second sprint in the product backlog.

User Story/Technical Story Description


TS2 As a developer, I need to create a system for collecting
information about specific companies.
TS3 As a developer, I need to create a system for gathering
information from the websites of targeted companies.
TS4 As a developer, I need to create a system for gathering
information about employees of targeted companies.
US2 As a user, I want to access the list of prospects.
AS3 As an administrator, I want to perform lead management
actions.

Table 4.2: Sprint 1.2 Backlog

4.4.2 Related research


According to [26] research, they demonstrate that Web scraping tools and APIs play a significant
role in extracting information from the Internet. It is a vital technique in Marketing and Data Science,
especially for analyzing structured and unstructured data from Open Data and social media. However,
web scraping alone cannot replace research expertise, and relying solely on easily available data may
lead to erroneous conclusions and legal problems. To ensure data accuracy and compliance, data
management concepts such as data lakes and metadata management should be employed. While
some research may have been conducted to study web scraping techniques on LinkedIn or social
media platforms in general, it is essential to respect the website’s terms of service and adhere to legal
and ethical guidelines. Unethical or unauthorized web scraping can result in legal consequences and
may violate the platform’s policies.

4.4.3 Sources Identification


In this web scraping process, we utilized LinkedIn to extract a targeted list of companies based on
specific criteria such as location and industry. By leveraging web scraping tools, we collected relevant
data on these companies from their LinkedIn profiles. After obtaining the initial list of companies,
we embarked on a thorough exploration of each company’s profile to gather more in-depth data.
This involved looping through each profile to extract additional information. By navigating through
the profiles systematically, we obtained a comprehensive understanding of each company's presence
within the LinkedIn platform. However, data extraction efforts didn’t stop there. Recognizing the
significance of a company’s online presence, we further extended the scraping process to delve into

each company’s website. Utilizing a looping mechanism, we sent requests to each website in the list
and collected valuable data. This exhaustive approach allowed us to gain insights into the company’s
online content, enabling a deeper analysis.
By combining LinkedIn data with website scraping, we were able to create a robust dataset,
providing a comprehensive view of the companies’ profiles and online presence. This wealth
of information serves as a valuable resource for strategic decision-making, market research, and
competitive analysis. However, it is crucial to mention that during this web scraping process, we
adhered to ethical and legal guidelines, respecting LinkedIn’s terms of service and ensuring the
privacy and security of the scraped data. All of which is summarised in Figure 4.16.

Figure 4.16: Data Harvesting Process/steps

4.4.4 Development Environment


Python is an open-source, interpreted programming language, eliminating the need for compilation
before execution. For script execution, we utilized Jupyter and Anaconda.

4.4.5 Implementation
4.4.5.1 Linkedin Scraping

The main tool that helped us automate the data collection process and work efficiently in bulk is
DataKund, which provides several functions dedicated to social media platforms such as YouTube,
Facebook, Instagram, and LinkedIn. It is initialized as illustrated in Figure 4.17.

Figure 4.17: DataKund initiation

In the first part, we began by extracting a targeted list of companies based on specific criteria such
as location and sector. The code shown in Figure 4.18 illustrates the implementation of this function.

Figure 4.18: Code Extract

The scraping resulted in the following data:

• Website: The website URL of the company provides a direct link to their online presence and
offerings, allowing users to explore their products and services easily.

• Linkedin: The LinkedIn profile link of the company, giving insights into their professional
network, company updates, and potential collaborations.

• Industry: The industry category in which the company operates, providing an understanding of
its market focus and niche.

• Phone: The contact phone number of the company, allowing users to reach out for inquiries or
support.

• Company Size: The size of the company in terms of employee count, offering an idea of its
scale and workforce.

• Headquarters: The location of the company’s main office or headquarters, indicating their
primary operational base.

• Founded: The year in which the company was established, providing insights into its history
and experience in the industry.

4.4.5.2 Company employees

In the pursuit of gathering valuable insights into the key personnel of companies on our designated
list, a sophisticated script has been meticulously designed and executed. Our script, characterized by
its dynamic nature, adeptly traverses LinkedIn to extract vital information about major employees.
Beginning with the identification of each company's name from its LinkedIn URL, as shown in Figure 4.19,
the script proceeds to systematically search and collect data on individuals who hold prominent positions
within the organization, matching job titles against the following keyword list: keywords = ["CEO",
"Founder", "Owner", "Chef", "CTO", "Chief", "Executive", "Partner", "Director", "Vice President",
"Directeur", "Fondateur", "DGA", "PDG", "RH", "Responsible"]

Figure 4.19: Code Extract

Employing a systematic and thorough approach, our script iterates through LinkedIn search results,
capturing pertinent details such as names, job titles, locations, and profile links of individuals who
fit predefined criteria. This method ensures the comprehensive compilation of valuable data for our
analysis illustrated in Figure 4.20.

Figure 4.20: Code Extract



The script’s ability to adapt to different company profiles and the precision with which it identifies
major employees highlights its effectiveness in assisting our research efforts. This innovative approach
empowers us with a robust dataset for further analysis and strategic decision-making, ultimately
enhancing our understanding of the corporate landscape.
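Since the code extracts are shown only as figures, the sketch below illustrates the keyword-matching idea in isolation; the profiles list and its field names are hypothetical stand-ins for the records captured from the LinkedIn search results.

keywords = ["CEO", "Founder", "Owner", "Chef", "CTO", "Chief", "Executive",
            "Partner", "Director", "Vice President", "Directeur", "Fondateur",
            "DGA", "PDG", "RH", "Responsible"]

# Hypothetical records captured from the search results (name, title, location, link).
profiles = [
    {"name": "Jane Doe", "job_title": "Chief Executive Officer",
     "location": "Tunis", "profile_link": "https://www.linkedin.com/in/janedoe"},
    {"name": "John Smith", "job_title": "Software Engineer",
     "location": "Sousse", "profile_link": "https://www.linkedin.com/in/johnsmith"},
]

def is_key_employee(job_title):
    """Return True when the job title contains one of the target keywords."""
    title = (job_title or "").lower()
    return any(keyword.lower() in title for keyword in keywords)

# Keep only the profiles that match the predefined criteria.
key_people = [p for p in profiles if is_key_employee(p["job_title"])]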

4.4.5.3 Website Scraping

To extract data from websites, we employed the following tools:

• BeautifulSoup: was utilized to parse HTML and XML documents, simplifying our ability to
navigate the parse tree and extract relevant data from web pages.

• Requests: We employed Requests to perform HTTP requests to web pages. While it is not
specifically designed for HTML parsing like BeautifulSoup, it is frequently used in conjunction
with it to retrieve web pages before analysis. An example of using these two libraries is when we
collected the Google Analytics ID and the Publisher ID using the regex method, as illustrated
in Figure 4.21.

Figure 4.21: Code Extract
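Because the original extract appears only as a figure, the following is a minimal sketch of this Requests/BeautifulSoup/regex combination; the regular expressions are common patterns for Google Analytics and AdSense identifiers and are assumptions rather than the project's exact ones.

import re
import requests
from bs4 import BeautifulSoup

def extract_tracking_ids(url):
    """Fetch a page and search its markup for Google Analytics / AdSense IDs."""
    html = requests.get(url, timeout=10).text
    page = str(BeautifulSoup(html, "html.parser"))  # parsed, then re-serialized markup
    analytics = re.search(r"(UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12})", page)
    adsense = re.search(r"ca-pub-\d{10,16}", page)
    return (analytics.group(0) if analytics else None,
            adsense.group(0) if adsense else None)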

• SSL: has enabled us to work with SSL/TLS certificates. It provides tools for creating
secure SSL/TLS connections, managing certificates, and verifying the authenticity of SSL/TLS
certificates presented by remote servers.

• SOCKET: Python’s SOCKET module has provided us with an interface for handling sockets,
which act as endpoints for network communication.

• OpenSSL: This library has empowered us to effectively manage certificates and secure
sockets. The previously mentioned SSL library utilizes OpenSSL for performing underlying
cryptographic operations.
The utilization of these three components in our project is depicted in Figure 4.22.

Figure 4.22: Scraping Process
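As a rough sketch of how these modules can be combined (the hostname below is only an example), the server certificate of a site can be retrieved as follows:

import socket
import ssl

def fetch_certificate(hostname, port=443):
    """Open a TLS connection and return the peer certificate as a dictionary."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls_sock:
            return tls_sock.getpeercert()

cert = fetch_certificate("www.python.org")  # example host
print(cert.get("issuer"), cert.get("notBefore"), cert.get("notAfter"))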

• DNS.Resolver: Integrated within the Python DNS (Domain Name System) library, this tool
has provided us with the capability to automate DNS queries. This resource has enabled us to
extract SPF and DMARC records, as illustrated in Figure 4.23.

Figure 4.23: Code Extract
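A minimal sketch of such queries with dnspython (assuming version 2.x, where dns.resolver.resolve is available) could look like this:

import dns.resolver

def fetch_spf_dmarc(domain):
    """Return the SPF and DMARC TXT records published for a domain, if any."""
    spf = dmarc = None
    try:
        for rdata in dns.resolver.resolve(domain, "TXT"):
            text = b"".join(rdata.strings).decode()
            if text.startswith("v=spf1"):
                spf = text
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        pass
    try:
        for rdata in dns.resolver.resolve(f"_dmarc.{domain}", "TXT"):
            text = b"".join(rdata.strings).decode()
            if text.startswith("v=DMARC1"):
                dmarc = text
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        pass
    return spf, dmarc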

• IPWHOIS: We can retrieve information regarding IP addresses using the WHOIS protocol,
which is used to query databases containing data about Internet resources such as domain names
and IP address allocations. We were able to extract comprehensive details about the entity or
organization that owns a specific IP address, including contact information and registration
details, as illustrated in Figure 4.24.
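A hedged sketch of this lookup with the ipwhois package follows; the fields read from the RDAP result are typical keys, assumed here for illustration only.

from ipwhois import IPWhois

def whois_owner(ip_address):
    """Run an RDAP lookup and return registered owner details for an IP address."""
    result = IPWhois(ip_address).lookup_rdap()
    return result.get("asn_description"), result.get("network", {}).get("name")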

Used in combination, the libraries described above allowed us to efficiently extract the following data
from each website:

Figure 4.24: Code Extract

• Description: A brief overview of the company and its products/services.

• IP address: The unique IP address associated with the company’s website.

• IP country: The country where the company’s website server is located.

• Email: The contact email address of the company.

• Phones: The phone numbers through which the company can be contacted.

• Title: The title or designation associated with the data entry.

• Language: The primary language used on the company’s website.

• Verification Date: The date when the data was verified or extracted.

• Copyright year: The year when the company’s website content was copyrighted.

• Copyright owner: The entity or person owning the copyright of the website content.

• responsive Or Not: Indicates whether the company’s website is responsive or not.

• Issuer: The entity that issued the SSL/TLS certificate for the website.

• Cert Country: The country where the SSL/TLS certificate was issued.

• Cert state: The state associated with the SSL/TLS certificate.

• Cert Locality: The locality associated with the SSL/TLS certificate.

• Serial number: The unique serial number of the SSL/TLS certificate.

• Cert start date: The start date of the SSL/TLS certificate’s validity.

• Cert expiry: The expiry date of the SSL/TLS certificate.

• SSL TLS: Indicates whether SSL/TLS security is enabled on the website.

• Cert Protocol: The specific protocol used by the SSL/TLS certificate.

• Cert Organization: The organization associated with the SSL/TLS certificate.



• Schema type: The type of schema used for structured data on the website.

• Google Analytics ID: The unique Google Analytics ID linked to the website.

• AdSense ID: The unique AdSense ID associated with the website.

• Spf record: The Sender Policy Framework (SPF) record for the website’s domain.

• Dmarc record: The Domain-based Message Authentication, Reporting, and Conformance


(DMARC) record for the website’s domain.

• Facebook: The company’s profile URL on the Facebook social media platform.

• Twitter: The company’s profile URL on the Twitter social media platform.

• Instagram: The company’s profile URL on the Instagram social media platform.

• Youtube: The company’s profile URL on the YouTube platform.

• Owler: The company’s profile URL on the Owler platform.

• Pinterest: The company’s profile URL on the Pinterest platform.

• Skype: The Skype username or profile URL associated with the company.

• WhatsApp: The contact information (e.g., phone number) for the company on WhatsApp.

This extensive dataset provides a comprehensive representation of a company’s features and online
presence. It encompasses a range of informative fields that address various aspects of the business.

4.4.6 Increment of Sprint 1.2


4.4.6.1 Leads Listing and Search Interface

Figure 4.25 depicts the home page and the leads listing interface from which the user can search for
leads and navigate the web app:

Figure 4.25: listing interface in Home Page

4.4.6.2 Leads Management

Figure 4.26 depicts the lead management interface: Add a lead :

Figure 4.26: Interface: Add lead



Figure 4.27 depicts the resulting lead added successfully:

Figure 4.27: Result: Lead added successfully

4.5 Conclusion
In this chapter, we have presented the initial version of our solution, consisting of two iterations.
For each iteration, we began by introducing the product roadmap. Subsequently, we showcased the
various functionalities through visual representations and provided textual descriptions of specific
use cases. Lastly, we developed the graphical interfaces. This comprehensive approach allowed
us to gather extensive insights from the web landscape, enabling us to collect crucial information
relevant to our research objectives. The data acquired from various web portals and websites will
be systematically integrated into the next chapter, further enhancing our understanding of industry
trends, competitive landscapes, and market dynamics.
Chapter 5

Second Release

5.1 Introduction
In line with the same approach as the first version, we commence by presenting the second version
based on the product backlog for each sprint. The development of a model involves a series of well-
defined steps that are crucial for project success. In this chapter, we will introduce the various stages of
the process used. Furthermore, we will discuss and compare the implementation of two segmentation
models.

5.2 Presentation of Release 2


A meeting was convened with the Scrum team to delineate the features to be included in this release.
This second delivery, titled ”Module: Prospect Segmentation and Evaluation,” will consist of
two sprints, as outlined below:

• sprint 2.1: Prospects Segmentation

• sprint 2.2: Prospects Evaluation

For each sprint, we will present its sprint backlog, and an analysis will be conducted to illustrate
the interfaces developed.

5.3 Backlog of sprint 2.1: Prospects Segmentation


The development of a model involves a series of well-defined steps that are critical. We will introduce
the various stages of our process used to build a value model. Additionally, we will discuss and
compare the implementation of two segmentation models, as depicted in the sprint 2.1 backlog (Table
5.1).


Table 5.1: sprint 2.1 Backlog

US/TS/AS User Story


US5 As a user, I want to visualize the distribution of prospects based on
segmentation.
TS5 As a developer, I need to create functionality to categorize company
websites or leads into distinct and ranked groups.

5.4 Implementation
5.4.1 Data Identification
The data was acquired from different web portals and websites, harvested through web scraping.

1. Field Name: Industry

• Data Type: Text.


• Field Description: Industry category of the company.
• Field Format: Textual representation of the industry category.
• Importance for Segmentation Analysis: Crucial for segmenting companies based on
industry.

2. Field Name: Size

• Data Type: Text.


• Field Description: Company’s size or employee count.
• Field Format: Textual representation of the company’s size.
• Importance for Segmentation Analysis: relevant for company size segmentation.

3. Field Name: Location

• Data Type: Text.


• Field Description: Company’s geographical location.
• Field Format: Textual representation of the location.
• Importance for Segmentation Analysis: relevant for regional segmentation.

4. Field Name: Founded

• Data Type: Numeric.


• Field Description: Year the company was founded.
• Field Format: Numerical representation of the founding year.
• Importance for Segmentation Analysis: relevant for age-based segmentation.

5. Field Name: responsive Or Not

• Data Type: Text.


• Field Description: Indicates if the company’s website is responsive (mobile-friendly).
• Field Format: Textual representation (Yes/No).
• Importance for Segmentation Analysis: Relevant for user experience and engagement.

6. Field Name: Cert expiry

• Data Type: Date and Time.


• Field Description: Expiry date of the SSL/TLS certificate’s validity.
• Field Format: Date and time representation.
• Importance for Segmentation Analysis: Relevant for security analysis.

7. Field Name: SSL TLS

• Data Type: Text.


• Field Description: Type of SSL/TLS protocol used.
• Field Format: Textual representation of the protocol.
• Importance for Segmentation Analysis: Relevant for security analysis.

8. Field Name: GoogleAnalytics ID

• Data Type: Text.


• Field Description: Google Analytics tracking ID for website analytics.
• Field Format: Textual representation of the tracking ID.
• Importance for Segmentation Analysis: Relevant for website traffic and monetization
strategies.

9. Field Name: AdSense ID

• Data Type: Text.


• Field Description: Google AdSense ID for advertising integration.
• Field Format: Textual representation of the AdSense ID.
• Importance for Segmentation Analysis: Relevant for website traffic and monetization
strategies.

10. Field Name: Spf record

• Data Type: Text.


• Field Description: Sender Policy Framework (SPF) record for email authentication.
• Field Format: Textual representation of the SPF record.

• Importance for Segmentation Analysis: Relevant for email security analysis.

11. Field Name: Dmarc record

• Data Type: Text.


• Field Description: (DMARC) record for email authentication.
• Field Format: Textual representation of the DMARC record.
• Importance for Segmentation Analysis: Relevant for email security assessment.

12. Field Name: Email

• Data Type: Text.


• Field Description: Email address associated with the company.
• Field Format: Textual representation of the email address.
• Importance for Segmentation Analysis: Relevant for contact-based analysis.

13. Field Name: Phones

• Data Type: Text.


• Field Description: Phone numbers associated with the company.
• Field Format: Textual representation of phone numbers.
• Importance for Segmentation Analysis: Relevant for contact-based analysis.

14. Field Name: Facebook

• Data Type: Text.


• Field Description: Facebook page or profile link associated with the company.
• Field Format: Textual representation of the URL.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

15. Field Name: Twitter

• Data Type: Text.


• Field Description: Twitter profile link associated with the company.
• Field Format: Textual representation of the URL.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

16. Field Name: Linkedin

• Data Type: Text.


• Field Description: LinkedIn profile link of the company.

• Field Format: Textual representation of the LinkedIn URL.


• Importance for Segmentation Analysis: Relevant for social media presence analysis.

17. Field Name: Instagram

• Data Type: Text.


• Field Description: Instagram profile link associated with the company.
• Field Format: Textual representation of the URL.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

18. Field Name: Youtube

• Data Type: Text.


• Field Description: YouTube channel link associated with the company.
• Field Format: Textual representation of the URL.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

19. Field Name: Skype

• Data Type: Text.


• Field Description: Skype contact information for the company.
• Field Format: Textual representation of the contact details.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

20. Field Name: WhatsApp

• Data Type: Text.


• Field Description: WhatsApp contact information for the company.
• Field Format: Textual representation of the contact details.
• Importance for Segmentation Analysis: Relevant for social media presence analysis.

This extensive dataset comprises over 11,000 entries, each representing a company’s diverse
attributes and online presence. The dataset includes a wide range of fields, encompassing company
information such as industry, size, location, and founding year. With its rich and varied dimensions,
this dataset poses a unique challenge for unsupervised machine learning segmentation. By employing
advanced clustering techniques, we aim to unearth hidden patterns, groupings, and trends within
this unlabelled data, ultimately revealing valuable insights into the complex landscape of companies’
online identities and characteristics.

5.4.2 Development Environment


5.4.2.1 Python for Machine Learning

• Python is an open-source programming language. It operates as an interpreted language,
eliminating the need for compilation before execution: an ”interpreter” program facilitates
Python code execution on any computer. For script execution, we utilized Jupyter and Anaconda.

• Anaconda (Figure 5.1, taken from [13]) is a free and open-source distribution
of the Python and R programming languages. It is employed for crafting applications tailored
to data science and machine learning, encompassing large-scale data processing, predictive
analysis, and scientific computing. The aim is to streamline package management and
deployment.

Figure 5.1: Anaconda logo

• Jupyter notebooks (Figure 5.2, taken from [14]) are electronic notebooks capable of assembling
text, images, mathematical formulas, and executable code in a single document. They can be
manipulated interactively within a web browser and were originally designed for the Julia,
Python, and R programming languages.

Figure 5.2: Jupyter Notebook logo

5.4.2.2 Libraries

• NumPy is a Python library that provides support for arrays and matrices, and an extensive
collection of mathematical functions. This library is instrumental for performing mathematical,
logical, and statistical operations, making it a cornerstone in data analysis, scientific computing,
and machine learning workflows.

• SciPy is a comprehensive scientific library built on top of NumPy. It offers a plethora of


tools for advanced mathematics, optimization, signal processing, linear algebra, and statistics.
SciPy extends the capabilities of NumPy by providing specialized functions for tasks such as
numerical integration, interpolation, and solving differential equations. It’s a valuable asset for
researchers, engineers, and analysts working on complex scientific computations.

• Matplotlib is a versatile plotting library in Python, with which we can create various types of
static, interactive, and animated visualizations. With Matplotlib, users can generate 2D and 3D
plots, histograms, scatter plots, and more, making it an ideal choice for data visualization, and
aiding in the effective communication of insights derived from data analysis.

• Scikit-learn, often referred to as sklearn, is a machine-learning library that encompasses a


wide range of algorithms and tools. Sklearn is highly regarded for its ease of use, efficient
implementations, and integration with other scientific libraries.

• Yellowbrick’s Cluster visualizers are a part of the Yellowbrick library, designed to enhance the
understanding and tuning of clustering algorithms. These visualizers allow users to evaluate
clustering models, explore cluster tendencies, and assess the ideal number of clusters. By
providing insightful visualizations, Yellowbrick Cluster simplifies the process of identifying
meaningful patterns within data.

5.4.3 Data Preprocessing


5.4.3.1 Further Data Cleaning

This stage of our work involves handling anomalies within the data that may have passed through the
integration process; techniques such as fillna('0') and replace('N/A') are applied. The same goes for
eliminating outliers, i.e. data points or observations that significantly differ from the rest of the dataset,
deviating from its typical pattern or distribution, as shown in the boxplot in Figure 5.3.

Figure 5.3: Checking for outliers
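A minimal pandas sketch of this cleaning step is shown below; the column name and the IQR rule used to drop outliers are assumptions, since the report does not fix a specific outlier criterion.

import pandas as pd

def clean(dataframe, numeric_col):
    """Fill missing values, normalize 'N/A' entries, and drop IQR outliers."""
    df = dataframe.fillna("0").replace("N/A", "0")
    values = pd.to_numeric(df[numeric_col], errors="coerce")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    within_bounds = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[within_bounds]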



5.4.3.2 Feature Engineering

In our project, the primary purpose of feature engineering is to extract distinctive attributes from raw
data, thereby enhancing the representation of the underlying problem for predictive models.
To begin with, this step involves the selection and extraction of relevant features from the dataset,
which make a substantial contribution to the analysis task. Based on this, we have generated the
following features (a sketch of how they might be derived follows the list):

• Count analytics: number of analytic IDs.

• Count records: number of SPF and DMARC records.

• Count contact: number of contact platforms.

• Expiry Status: checks certificate expiration.
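The sketch below shows one way these features could be derived with pandas, assuming the scraped column names listed earlier (e.g. "GoogleAnalytics ID", "Spf record", "Cert expiry"); it is illustrative rather than the project's exact code.

import pandas as pd

def engineer_features(df):
    """Derive the engineered features from the scraped columns (hypothetical names)."""
    df = df.copy()
    df["Count analytics"] = df[["GoogleAnalytics ID", "AdSense ID"]].notna().sum(axis=1)
    df["Count records"] = df[["Spf record", "Dmarc record"]].notna().sum(axis=1)
    contact_cols = ["Facebook", "Twitter", "Instagram", "Youtube", "Skype", "WhatsApp"]
    df["Count contact"] = df[contact_cols].notna().sum(axis=1)
    expiry = pd.to_datetime(df["Cert expiry"], errors="coerce")
    df["Expiry Status"] = (expiry < pd.Timestamp.now()).astype(int)  # 1 = expired
    return df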

5.4.3.3 Exploratory Data Analysis

Next, the correlation matrix in Figure 5.4 reveals the relationships between variables in a dataset.

Figure 5.4: Correlation matrix

The matrix in the previous figure offers a visual representation that aids interpretation: it helps select
relevant features, reduce dimensionality, gain insights into segments, and avoid redundancy. Based on
these visuals, together with business relevance, we selected our variables, whose distributions are shown
in Figures 5.5, 5.6, and 5.7:

Figure 5.5: data distribution (industry, location, and size)

1. The first histogram in Figure 5.5 reveals a significant number of companies in the retail industry
compared to the rest of the industries.

2. The second histogram in Figure 5.5 reveals that most companies do not disclose their location,
so the majority are not assigned to one; this feature will nevertheless come in handy when
studying the distribution of companies on the map.

3. The third histogram in Figure 5.5, analyzing the companies' employee counts on LinkedIn,
reveals that the number of companies declines as the number of employees grows.

Figure 5.6: data distribution (records, responsiveness, and contact)

1. The fourth histogram Figure 5.6 reveals a significant number of companies that don’t have
Dmarc or SPF records while almost a third of them have both records.

2. The fifth histogram in Figure 5.6 reveals that most websites are split between responsive and
unassigned ones; after further inspection, those not assigned a positive responsiveness value will
be considered negative.

3. The sixth histogram in Figure 5.6, analyzing the number of contact panels listed on a single
website, shows that websites with 0 contact panels form the majority, and counts decrease as the
number of contact panels increases.

Figure 5.7: data distribution (SSL/TLS, analytics, and Expiry status)

1. The seventh histogram Figure 5.7 analyses the number of websites that have an SSL or TLS
certificate.

2. The eighth histogram Figure 5.7 shows how those certificates are distributed between those
certificates that are expired and those that are not.

3. The ninth and final histogram Figure 5.7 analyses the number of websites that have Google
Analytics ID or Adsense ID or both.

Finally, the data is transformed to make it suitable for analysis, including handling variables with
skewed distributions using techniques like logarithmic or power transformations.

• Categorical variables like 'Founded', 'Size', 'Location', and 'Industry' are encoded using
label_encoder.fit_transform(dataframe[col]); the result is shown in Figure 5.8.

Figure 5.8: Dataframe after label encoder transformation

• Numeric features are scaled or normalized using StandardScaler() to enhance analysis accuracy,
as shown in Figure 5.9.

Figure 5.9: Dataframe after StandardScaler transformation
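A short sketch of these two transformations with scikit-learn follows; dataframe stands for the cleaned DataFrame from the previous steps, and the list of numeric columns is an assumption.

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label-encode the categorical variables named in the text.
for col in ["Founded", "Size", "Location", "Industry"]:
    label_encoder = LabelEncoder()
    dataframe[col] = label_encoder.fit_transform(dataframe[col].astype(str))

# Scale the numeric (engineered) features.
numeric_cols = ["Count analytics", "Count records", "Count contact"]
dataframe[numeric_cols] = StandardScaler().fit_transform(dataframe[numeric_cols])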

5.4.4 Agglomerative Hierarchical Clustering (AHC)


Hierarchical clustering is an unsupervised machine learning technique employed for organizing
our unlabelled dataset into clusters. This method, also recognized as hierarchical cluster analysis
(HCA), constructs a hierarchical arrangement of clusters, illustrated as a dendrogram, to establish
relationships among data points.
The hierarchical clustering approach encompasses two distinct methods:

1. Agglomerative: This bottom-up strategy starts by treating individual data points as separate
clusters. Gradually, clusters are merged iteratively based on proximity.

2. Divisive: In contrast, the divisive approach takes a top-down stance. It starts with all data points
in a single cluster and then progressively divides clusters into smaller ones.

5.4.4.1 Model Building

Within the AHC algorithm, our process initiates with every data point forming an independent cluster.
The algorithm proceeds by iteratively merging the closest clusters. A notable advantage is that this
approach doesn't require prior knowledge of the expected number of clusters.
A crucial element in hierarchical clustering lies in determining the distance between clusters.
Several linkage techniques exist, each of which calculates this distance differently. Since the choice
of linkage can significantly impact the clustering results, we created the following dendrograms with
different linkage methods to explore our best options (a code sketch follows them):

• Ward’s Linkage This linkage method tends to form clusters with minimal within-cluster
variance, promoting consistency and robustness in the resulting clusters. The stability of Ward
linkage stems from its focus on optimizing the within-cluster variance, which leads to well-
defined and interpretable clusters that are less sensitive to noise and outliers as shown in Figure
5.10.

Figure 5.10: Dendrogram of Ward Linkage

• Complete Linkage (Maximum Linkage)


Figure 5.11 depicts the hierarchical relationships among clusters by employing the concept
of the furthest-point distance between clusters. As one descends within the dendrogram,
clusters merge based on the pair of data points that are farthest apart from each other. While
this approach typically results in well-defined and closely-knit clusters, it can also make the
clustering process susceptible to outliers and noise in the data.

Figure 5.11: Dendrogram of Complete Linkage

• Average Linkage
As we traverse the dendrogram, clusters gradually merge based on the average distance between
their data points. The approach depicted in Figure 5.12 strikes a balance between sensitivity to
outliers and cluster compactness. It can prove beneficial when dealing with data that exhibits
varying cluster sizes and shapes.

Figure 5.12: Dendrogram of Average Linkage
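A condensed sketch of how the dendrograms and the final model could be produced is given below; X stands for the encoded and scaled feature matrix, and the number of clusters is the one retained later in this section.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

X = dataframe.values  # encoded and scaled feature matrix from the preprocessing step

# One dendrogram per linkage method, to compare Ward, complete, and average linkage.
for method in ("ward", "complete", "average"):
    plt.figure(figsize=(10, 4))
    dendrogram(linkage(X, method=method))
    plt.title(f"Dendrogram - {method} linkage")
    plt.show()

# Final AHC model with three clusters and Ward linkage.
ahc = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = ahc.fit_predict(X)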

5.4.4.2 Evaluation and Profiles

The selection of three clusters was a judicious decision that stemmed from a comprehensive analysis
of multiple factors. This multi-faceted approach encompassed the utilization of silhouette analysis,
dendrogram exploration, and a meticulous alignment with our business requirements.

Silhouette analysis, a rigorous metric, played a pivotal role in evaluating the quality of cluster
formations. Through this analysis, we gained insights into the cohesion and separation of data points
within clusters. Our objective was to identify a configuration that exhibited well-defined, internally
homogeneous clusters while maintaining distinct boundaries between them as shown in Figure 5.13.

Figure 5.13: Silhouette analysis



The examination of dendrograms added another layer of understanding to our decision-making
process. By inspecting the branching patterns, we were able to discern natural stopping points
that indicated the optimal number of clusters. The different linkage dendrograms, coupled with the
silhouette analysis, further reinforced our choice of three clusters. Figure 5.14 exhibits the
AHC model evaluation.

Figure 5.14: Chosen Model Evaluation
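The two reported metrics can be computed with scikit-learn as in the sketch below, where X and labels come from the fitted model of the previous sketch:

from sklearn.metrics import silhouette_score, calinski_harabasz_score

print(f"Silhouette Score: {silhouette_score(X, labels):.3f}")
print(f"Calinski-Harabasz Index: {calinski_harabasz_score(X, labels):.3f}")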

Interpretation :

• Silhouette Score of 0.446 indicates that the clusters are reasonably distinct and have a moderate
level of cohesion.

• Calinski-Harabasz Index value of 11104.259 implies that the clustering model has managed to
create clusters that are highly separated and compact, indicating a strong quality of clustering.

Both of these evaluation metrics, the Silhouette Score and Calinski-Harabasz Index, indicate that
the Agglomerative Clustering model has performed well in creating distinct and cohesive clusters for
the given dataset. The higher values of these metrics suggest that the clusters are meaningful and
well-defined, providing valuable insights into the data's underlying structure. The resulting cluster
profiles are displayed in Tables 5.2 to 5.8:

Count analytics Cluster 0 Cluster 1 Cluster 2


0 80.073252% 80.977937% 81.502347%
1 19.709282% 18.783542% 18.309859%
2 0.217466% 0.238521% 0.187793%

Table 5.2: Count analytics Feature

Count contact Cluster 0 Cluster 1 Cluster 2


0 57.674259% 58.437686% 63.943662%
2 19.720728% 17.590936% 17.089202%
3 10.564267% 10.137150% 7.323944%
4 6.775781% 6.320811% 6.384977%
5 3.754149% 5.426357% 3.568075%
6 1.281905% 1.729278% 1.220657%
7 0.194575% 0.357782% 0.375587%
8 0.034337% NaN 0.093897%

Table 5.3: Count contact Feature



Verification Date Cluster 0 Cluster 1 Cluster 2


0 88.130937% 87.71616% 88.356808%
1 11.869063% 12.28384% 11.643192%

Table 5.4: Verification Date Feature

responsiveOrNot Cluster 0 Cluster 1 Cluster 2


0 48.472016% 48.062016% 49.014085%
1 1.590935% 1.431127% 1.220657%
2 49.937049% 50.506857% 49.765258%

Table 5.5: Responsive Or Not Feature

SSL TLS Cluster 0 Cluster 1 Cluster 2


0 60.215177% 64.877758% 63.849765%
1 39.784823% 35.122242% 36.150235%

Table 5.6: SSL TLS Feature

Expiry Status Cluster 0 Cluster 1 Cluster 2


0 47.213002% 46.809779% 45.727700%
1 39.784823% 35.122242% 36.150235%
2 13.002175% 18.067979% 18.122066%

Table 5.7: Expiry Status Feature

Count records Cluster 0 Cluster 1 Cluster 2


0 76.399222% 70.363745% 68.920188%
2 23.600778% 29.636255% 31.079812%

Table 5.8: Count records Feature

In conclusion, the selection of three clusters was a comprehensive endeavor, combining statistical
rigor with a keen awareness of our business landscape, ultimately leading to a cluster configuration
that resonates with both analytical and business efficacy.

5.4.5 K-means Clustering


K-means clustering represents an instance of Unsupervised Learning. This technique facilitates the
arrangement of untagged datasets into distinct clusters. It offers the capability to categorize data
into separate groups and offers a practical avenue for revealing the classifications of clusters within
untagged datasets autonomously, obviating the necessity for any instructional training.

This algorithm operates on the foundation of centroids, where each cluster is associated with a
centroid. The central objective of this algorithm is to minimize the cumulative distances between data
points and their respective clusters.
The procedure ingests untagged datasets as input, segments the dataset into ’k’ clusters, and
iterates the process until it achieves optimal clusters. The value of ’k’ is predetermined within this
algorithm.
The essence of the k-means clustering algorithm revolves around two primary functions:

1. Ascertaining the most suitable value for ’k’ center points or centroids through an iterative
progression.

2. Assigning every data point to its nearest k-center. These data points in proximity to a specific
k-center coalesce to form a distinct cluster.

Consequently, each cluster encompasses data points exhibiting shared attributes, distinguishing
itself from other clusters.

5.4.5.1 Number of Clusters: Elbow Method

This technique utilizes the principle of WCSS (Within Cluster Sum of Squares) value. WCSS
quantifies the aggregate variations confined within a cluster. The mathematical expression to
determine the WCSS value is:

WCSS = \sum_{i \in Cluster_1} distance(P_i, C_1)^2 + \sum_{i \in Cluster_2} distance(P_i, C_2)^2 + \sum_{i \in Cluster_3} distance(P_i, C_3)^2

where:

• WCSS: Within Cluster Sum of Squares, a measure of the total variations within the clusters.

• i: an index representing a data point in the respective clusters.

• Cluster 1, Cluster 2, Cluster 3: individual clusters in the dataset.

• P_i: the i-th data point.

• C_1, C_2, C_3: the centroids (center points) of Cluster 1, Cluster 2, and Cluster 3, respectively.

• distance(P_i, C_j): the distance between data point P_i and the centroid C_j of the corresponding
cluster (j indicates the cluster number).

To ascertain the most suitable cluster count, the elbow methodology adheres to the subsequent
steps:

1. It performs K-means clustering on a provided dataset, varying the value of K (ranging from 1
to 10).

2. For each K value, it computes the corresponding WCSS value.



3. A graph is generated, depicting the relationship between computed WCSS values and the count
of clusters (K).

4. The inflection point or the juncture resembling an arm on the plot designates the optimal K
value.

Since the graph in Figure 5.15, shown below, exhibits a distinct curvature resembling an elbow, the
optimal number of clusters for the K-means algorithm is 4.

Figure 5.15: the resulting elbow method for the Dataset
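A minimal sketch of the elbow computation with scikit-learn is shown below; KMeans exposes the WCSS of a fitted model through its inertia_ attribute, and X again denotes the prepared feature matrix:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.show()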

5.4.5.2 Evaluation and Profiles

Figure 5.16 exhibited the Kmeans model evaluation

• Silhouette Score of 0.461 indicates that the clusters are reasonably distinct and have a moderate
level of cohesion.

• Calinski-Harabasz Index value of 15433.048 implies that the clustering model has managed to
create clusters that are highly separated and compact, indicating a strong quality of clustering.

Figure 5.16: Kmeans Model Evaluation



Both of these evaluation metrics, the Silhouette Score and Calinski-Harabasz Index, indicate that the
Kmeans model has performed well in creating distinct and cohesive clusters for the given dataset. The
higher values of these metrics suggest that the clusters are meaningful and well-defined, providing
valuable insights into the data's underlying structure. The resulting cluster profiles are displayed in
Tables 5.9 to 5.15:

Count analytics Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 79.90% 80.13% 80.92% 81.48%
Category 1 19.92% 19.59% 18.84% 18.37%
Category 2 0.18% 0.28% 0.24% 0.15%

Table 5.9: Count analytics Feature

Count contact Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 57.10% 58.86% 57.16% 63.18%
Category 2 19.25% 19.79% 18.65% 17.99%
Category 3 11.61% 9.55% 10.03% 7.47%
Category 4 7.00% 6.61% 6.47% 6.05%
Category 5 3.64% 3.66% 5.55% 3.58%
Category 6 1.24% 1.26% 1.75% 1.27%
Category 7 0.16% 0.22% 0.39% 0.30%
Category 8 NaN 0.06% NaN 0.15%

Table 5.10: Count contact Feature

Verification Date Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 88.76% 87.64% 87.34% 88.20%
Category 1 11.24% 12.36% 12.66% 11.80%

Table 5.11: Verification Date Feature

Expiry Status Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 47.95% 46.64% 46.49% 45.71%
Category 1 40.06% 39.60% 35.49% 37.19%
Category 2 11.99% 13.76% 18.01% 17.10%

Table 5.12: Expiry Status Feature



ResponsiveOrNot Feature
Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 48.46% 48.42% 47.66% 49.81%
Category 1 1.60% 1.56% 1.61% 1.12%
Category 2 49.94% 50.01% 50.73% 49.07%

Table 5.13: ResponsiveOrNot Feature

SSL TLS Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 59.94% 60.40% 64.51% 62.81%
Category 1 40.06% 39.60% 35.49% 37.19%

Table 5.14: SSL TLS Feature

Count records Feature


Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 77.73% 75.47% 70.74% 69.60%
Category 2 22.27% 24.53% 29.26% 30.40%

Table 5.15: Count records Feature

5.5 Increment of sprint 2.1


Crucially, our decision wasn't solely grounded in statistical measures; it was aligned with the strategic
goals of our business. Each cluster's distribution was meticulously evaluated against the distinct
business needs we sought to address. This alignment ensured that the chosen cluster configuration for
the AHC model translated into meaningful insights and actionable strategies (we settled on 3 clusters
with Ward linkage, as it showed the most stability). The results are displayed in the dashboard in
Figure 5.17:

Figure 5.17: PowerBI dashboard for Lead segmentation result

5.6 Sprint 2.2: Digital Maturity Assessment


5.6.1 Sprint 2.2 Backlog
This section outlines the product backlog for this sprint (Table 5.16) and includes screenshots of the
interfaces.

Table 5.16: Sprint 2.2 Backlog

US/TS/AS User Story


US4 As a user, I want to conduct assessments on the listed prospects.
AS4 As an administrator, I want to perform lead management actions.

5.6.2 Sprint 2.2 Increment


5.6.2.1 User-side Assessments

Figure 5.18 displays the digital maturity assessment form, through which users can conduct individual
assessments.

Figure 5.18: Digital Maturity Assessment Form Interface

Figure 5.19 presents the results of the digital maturity assessment interface and the prediction
history.

Figure 5.19: Interface for Digital Maturity Prediction Results and History

5.6.2.2 Admin-side Assessment Management

Figure 5.20 displays the digital maturity assessment form interface for administrators to conduct
individual assessments.

Figure 5.20: Conducting a Prediction

Figure 5.21 presents the results of the digital maturity assessment interface for administrators and
the prediction history.

Figure 5.21: Prediction History

5.7 Conclusion
In this chapter, we initially created multiple clustering models and subjected them to specific
performance evaluation measures. Finally, we selected the top-performing model based on these
evaluations. Next, we developed a feature to conduct individual assessments of companies’ digital
maturity.
Chapter 6

Third Release

6.1 Introduction
Following the same approach as the second release, we begin by introducing Release 3, which is built
from the product backlog of its sprint. This chapter focuses on the next stage of the GIMSI approach,
helping us answer the question: what actions are necessary? Accordingly, we start by outlining the
architecture of our solution, defining the goals, selecting the indicators, designing our data model,
and finally showcasing prototype dashboards in the last step of this stage.

6.2 Presentation of Release 3


A meeting was organized with the Scrum team to determine the features to include in this delivery.
Our third delivery will consist of one sprint, For which, we will present its sprint backlog, and an
analysis will be explored to illustrate the Dashboards created.

6.3 Sprint 3.1 Backlog: BI Solution


In this section, we will introduce the Product Backlog specific to this iteration of the BI report
integration. This backlog will consist of items identified as the most relevant and of the highest
priority for end-users and business analysts. Table 6.1 outlines the Product Backlog for this version’s
iteration.

Table 6.1: Backlog of sprint 3.1

ID User Story
TS1 As a developer, I need to create dashboards that align with the customer’s
requirements.
US1 As a user, I desire enhanced data visibility through interactive decision-
support dashboards.


6.4 BI Solution Architecture


Here, we’ll create the blueprint for the decision system, dictating how components are organized.
The data model we’ve devised, outlining how information is structured, stored, and managed, will be
introduced. Figure 6.1 illustrates our decisional system architecture which will prepare our data for
further processing :

Figure 6.1: BI Architecture

Our system’s functional structure comprises several stages: gathering source data via web
scraping, transforming, and loading the collected data into a data warehouse for compatibility with
analysis and visualization tools. The data warehouse becomes prepared for analysis, which involves
extracting valuable insights through OLAP cubes from stored data. Results are communicated
effectively using interactive dashboards, reports, and visualizations to assist decision-makers in
comprehending data-driven trends and conclusions.

6.5 Multidimensional Conception of BI Solution


6.5.1 Global DW Conception
For our project, we’ve selected the Top-Down approach as it’s an effective method for data
warehouse design. It focuses on overall business requirements, normalizes data, and handles changes
efficiently. This approach ensures the data warehouse is aligned with the company’s needs, enhances
data quality, and reduces design and maintenance costs.
Each fact table offers a unique angle for analysis, in line with the architecture. All fact tables
have common dimensions. Our comprehensive data warehouse model follows the Galaxy structure
(Figure 6.2), which originates from star models. This complete depiction can be seen in the
provided diagram. The data warehouse comprises three fact tables: ”Fact Network”, ”Fact Digital”
and ”Fact Company”.

Figure 6.2: Global DW Conception

6.5.2 Dimensions Identification


The endeavor encompasses multiple avenues of examination, epitomized by distinct Dimensions that
we shall elaborate on in Table 6.2.

Dimension | Attributes | Description
Dim Industry | [IndustryID], [Industry] | The industry of activity
Dim Location | [LocationID], [Location] | The location of the headquarters
Dim Company | [CompanyID], [Size], [Linkedin], [L phone] | The company's information
Dim GAID | [GoogleAnalyticID], [GAID] | The Google Analytics ID information
Dim AdSense | [AdSenseID], [Adsense] | The Publisher ID information
Dim Social Media | [SMID], [Facebook], [Linkedin], [Twitter], [Instagram], [Youtube], [Skype], [Whatsapp] | The social media accounts listed on the website
Dim Contact info | [ContactID], [Phone], [Email] | The contact information listed on the website
Dim Founded | [FoundedID], [FoundedYear] | The time axis based on founding years
Dim Language | [LanguageID], [Language] | The website's language
Dim WebSite | [WebsiteID], [website], [Responsive], [Title] | The website's information
Dim IP | [IPID], [IP Adress], [IP Country] | The IP address information
Dim CopyRight | [CopyrightID], [Copyright Owner], [Copyright Year] | The website's copyright data
Dim Records | [RECORDSID], [Dmarc Record], [SPF Record] | The records listed for the website (SPF and DMARC)
Dim cert issuer | [Cert IssuerID], [Issuer], [Organisation] | The certificate's issuer
Dim certificate | [CertID], [SerialNumber], [SSL TLS], [Cert Protocol] | The SSL/TLS certificate
Dim cert location | [Cert locationID], [State], [Country], [Locality] | The location to which the certificate is attributed
Dim SchemaType | [SchemaID], [SchemaType] | The website's schema type

Table 6.2: Dimensions Identification for DW

6.5.3 Conception of Network Datamart


The conception of the Network data mart is illustrated in Figure 6.3 as a star model.

Figure 6.3: Conception of Network Datamart



6.5.4 Conception of Digital Datamart


The conception of the Digital data mart is illustrated in Figure 6.4 as a star model.

Figure 6.4: Conception of Digital Datamart

6.5.5 Conception of Company Datamart


The conception of the Company data mart is illustrated in Figure 6.5 as a star model.

Figure 6.5: Conception of Company Datamart



6.6 Key Performance Indicators


6.6.1 KPI Identification
At this juncture, our focus shifts towards identifying a comprehensive array of indicators aligned
with our goals. The meticulous selection of these indicators, which will form the bedrock of our
dashboards, constitutes a pivotal phase. This strategic endeavor aims to streamline decision-making
and offer a synthesized overview across the spectrum of collaborators. Anchored in our business
processes, we are poised to pinpoint performance indicators that impeccably cater to the requisites
and objectives of the Marketing Department.
Table 6.3 illustrates the KPI identification of the Network data mart.

Title: Network process analysis
Date: 08/08/2023
Author: Data Analysis team
Decision Maker: Chief Sales Officer
Process: Network Analysis

• Indicator 1 - Expiration Status KPI: expiry status of the SSL/TLS certification; formula: Current Date - Expiry Date.
• Indicator 2 - IP Address KPI: count of IP addresses; formula: total of IP addresses; query: total of IP addresses by country.
• Indicator 3 - SSL/TLS KPI: count of certificates; formula: total of certificates; queries: total of certificates by issuer, total of certificates by organisation.
• Indicator 4 - Issuers KPI: count of certificate issuers; formula: total of issuers; query: total of issuers by organisation.

Table 6.3: KPI of Fact Network
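As an illustration of how such an indicator is obtained from the star schema of Figure 6.3, the query
below counts distinct IP addresses per country. The table and column names follow the dimension names
of Table 6.2 and are assumptions about the physical naming, not an exact extract of our code.

SELECT  ip.[IP Country]                AS Country,
        COUNT(DISTINCT ip.[IP Adress]) AS TotalIPAddresses
FROM    dbo.Fact_Network AS f
JOIN    dbo.Dim_IP       AS ip ON ip.IPID = f.IPID
GROUP BY ip.[IP Country]
ORDER BY TotalIPAddresses DESC;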



Table 6.4 illustrates the KPI identification of the Digital data mart :

Title: Digital process analysis
Date: 08/08/2023
Author: Data Analysis team
Decision Maker: Chief Sales Officer
Process: Digital Analysis

• Indicator 1 - Analytics KPI: count of analytics IDs; formula: total of analytics IDs; query: total of analytics IDs activated by website.
• Indicator 2 - Records KPI: count of records; formula: total of SPF and DMARC records; query: total of records by website.
• Indicator 3 - Contact Panels KPI: count of contact panels; formula: total of contact panels; query: total of contact panels by website.

Table 6.4: KPI of Fact Digital

Table 6.5 illustrates the KPI identification of the Company data mart :

Title: Company process analysis
Date: 08/08/2023
Author: Data Analysis team
Decision Maker: Chief Sales Officer
Process: Company Analysis

• Indicator 1 - Company KPI: count of companies; formula: total of companies; queries: count of companies by industry, count of companies by location.
• Indicator 2 - Industry KPI: count of industries; formula: total of industries; query: total of industries by location.

Table 6.5: KPI of Fact Company

6.6.2 Development of Dashboards


This phase in our project encompasses the creation of dashboards, which emerge as the outcomes of
a decision support system. These dashboards serve as tools that offer decision-makers visibility and a
comprehensive understanding of the entire dataset.
The dashboards we are crafting consist of a collection of indicators, which empower the sales and
marketing managers at Satoripop to validate the efficiency of their decisions.

6.6.2.1 Mock-up scenario: ”Digital Exploration”

In this dashboard (Figure 6.6), we will present the KPI related to Fact Digital where we analyze the
digital presence of the companies by the number of social media platforms, analytic or traffic IDs,
and website responsiveness.

Figure 6.6: Scenario Prototype ”Digital Exploration”

6.6.2.2 Mock-up Scenario: ”Enterprise Exploration”

In this dashboard (Figure 6.7), we will present the KPIs related to Fact Company, where we analyze the
companies themselves: their number, industries, locations, and years of foundation.

Figure 6.7: Scenario Prototype ”Company Exploration”

6.6.2.3 Mock-up Scenario: ”Network Exploration”

In this dashboard (Figure 6.8), we will present the KPIs related to Fact Network, where we analyze the
companies' network footprint: IP addresses, SSL/TLS certificates and their issuers, and the detected
schema types.

Figure 6.8: Scenario Prototype ”Network Exploration”

6.7 Development environment


6.7.1 Comparative Evaluation: Integration Software
Table 6.6 compares the two integration tools considered: SQL Server Integration Services (SSIS) and Talend.

• Integration: SSIS offers native integration with Microsoft SQL Server; Talend supports a wide range of databases and technologies.
• Connectivity: SSIS is primarily aimed at Microsoft-based systems; Talend provides extensive connectivity with various systems.
• Scalability: SSIS is well-suited for small to medium-sized data projects; Talend handles large and complex data integration scenarios.
• Job Scheduling: SSIS relies on SQL Server Agent for scheduling tasks; Talend has a built-in job scheduler with flexible scheduling.
• Extensibility: SSIS supports custom script tasks and components; Talend is extensible with custom Java, Perl, and Ruby components.
• Data Transformation: SSIS provides various built-in data transformation functions; Talend offers a wide range of data transformation components.
• Data Quality: SSIS has basic data quality features and may require third-party tools; Talend has built-in data profiling and data quality capabilities.
• Performance: SSIS generally performs well with Microsoft-based systems; Talend's performance may vary depending on the integration scale.
• Deployment: SSIS deploys smoothly within the SQL Server ecosystem; Talend supports cloud, on-premises, and hybrid deployments.

Table 6.6: SSIS vs Talend

6.7.2 Comparative Evaluation: Reporting Software


1. Tableau: a widely used data visualization and business intelligence tool that enables users to
create interactive and visually appealing dashboards and reports.

2. QlikView and Qlik Sense: data discovery and visualization tools that offer in-memory data
processing and associative data modeling.

3. Amazon QuickSight: a cloud-powered business intelligence tool provided by Amazon Web


Services. It integrates with AWS data sources, builds interactive visualizations, and shares
them securely.

4. Google Data Studio: a free data visualization tool that connects to various data sources,
including data warehouses, to create customizable dashboards and reports.

5. Power BI: a BI and visualization tool created by Microsoft. It integrates seamlessly with


various data sources, including data warehouses, and offers a user-friendly interface to create
dashboards, reports, and visualizations.

Each of these tools offers a range of features and capabilities for data warehouse visualization.
However, Power BI remains a popular choice due to its ease of use, integration with Microsoft
technologies, and extensive community support. The choice of tool ultimately depends on factors
like the complexity of data, budget, scalability requirements, and user preferences.

6.7.3 Ecosystem
The Microsoft tools collectively form a robust ecosystem that enables us to manage databases,
perform data integration and transformation, conduct in-depth data analysis, and present data insights
effectively through interactive visualizations and reports.

6.7.3.1 SSMS

SSMS (Figure 6.9 taken from [15]), an acronym for SQL Server Management Studio, is a GUI
software developed by Microsoft, enabling tasks like database creation, SQL query execution, object
design, server monitoring, and data backup.

Figure 6.9: SSMS

6.7.3.2 SSIS

SQL Server Integration Services, or SSIS (Figure 6.10 taken from [16]), stands as Microsoft’s data
integration and workflow solution within the SQL Server toolkit. This platform facilitates the creation,
deployment, and oversight of data integration and transformation undertakings. By allowing users
to draw data from diverse origins, mold it into preferred structures, and then deliver it to designated
systems or repositories, SSIS accommodates intricate integration demands, including data refinement,
migration, and ETL operations.

Figure 6.10: SSIS software logo

6.7.3.3 SQL Server Agent

An integral element of Microsoft SQL Server, it facilitates the automation of tasks and jobs by offering
a scheduling system. Empowering users to effectively manage a range of operations, the SQL Server
Agent is indispensable for streamlining routine database maintenance and repetitive assignments,
leading to enhanced database dependability and performance.

6.7.3.4 SSAS

SSAS, short for SQL Server Analysis Services (Figure 6.11 taken from [17]), is a potent data tool from
Microsoft. It empowers users to craft and handle OLAP cubes, facilitating profound data analysis.

Figure 6.11: SSAS software logo

6.7.3.5 PowerBI

1. Power BI Desktop (Figure 6.12 taken from [18]) is a standalone application that serves as
the authoring tool for Power BI. It allows users to create more complex and sophisticated
data models, reports, and visualizations compared to the browser-based Power BI Service.
Power BI Desktop provides advanced data manipulation capabilities, supports the creation of
calculated measures and columns, and enables users to design and refine their data models
before publishing them to the Power BI Service.

2. Power BI Service is the cloud-based service offered by Microsoft for sharing, collaborating,
and consuming Power BI reports and dashboards. It allows users to publish their Power BI
Desktop reports to the cloud and securely share them. Power BI Service offers more features
like embedding reports into websites, data-driven alerts, and access to real-time data insights.

Figure 6.12: PowerBI software logo

6.8 Integration Phase


6.8.1 ETL: Extraction, Transformation, Loading
The data provisioning phase (ETL) constitutes a fundamental step in the data processing process. It
encompasses three vital stages: extraction, transformation, and loading. During the extraction phase
of our project, data is drawn from various sources. The transformation stage involves refining and
structuring the extracted data, making it consistent and aligned with our project’s requirements. This
transformation can encompass data cleansing, aggregation, and formatting, thereby enhancing their
quality and integrity. The culmination of this process lies in the loading stage, where the transformed

data finds its place in a designated target destination, a data warehouse. This phase plays a pivotal
role in ensuring that the data used in our project is consistent, accurate, and, most importantly, ready
for use and actionable insights.

6.8.2 Extraction
The extraction stage serves as the initial phase of our ETL process (extraction, transformation,
loading). This step is crucial to maintain the integrity of our extracted data and to prevent errors
or inconsistencies in subsequent process stages.
As shown in Figure 6.13, we first implemented the staging area, an intermediate environment where data
and files are temporarily stored, processed, or prepared before they are moved to their final destination
or used for further processing. Our staging area consolidates the newly added raw data from the source.

Figure 6.13: The Loading of the Staging area

To achieve this, we have implemented a script containing a truncate command for all tables. We
employed the Sequence Container component, an organizational unit that groups all arranged tasks.
This control container acts as a conductor, orchestrating the order of task execution within an SSIS
package.
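The truncate script mentioned above can be as simple as the following sketch, executed by an Execute SQL
Task placed at the start of the Sequence Container; the staging table names are illustrative.

TRUNCATE TABLE stg.Company;
TRUNCATE TABLE stg.Website;
TRUNCATE TABLE stg.Network;
-- ...one statement per staging table, so every run starts from an empty staging area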

6.8.3 Transform
In the transformation phase we aim to ensure data quality and accessibility. This involves basic
cleaning, removing any duplicates, and handling empty values. These transformations ensure data
consistency.
Figure 6.14 illustrates the loading into the Operational Data Store (ODS), a critical step in the data
management process: the extracted data is loaded into an operational data store, which provides
consistent storage before the transfer to our data warehouse.

Figure 6.14: The Loading of the ODS

During the next step (illustrated in Figure 6.15), we select the necessary data and perform the
data transformation process, which is not very complicated since we do not have many attributes.
For this, we used the following components:

• Sort: we used this component to arrange data rows in ascending order according to the
”Linkedin” column; it also allowed us to eliminate duplicates from the sorted output.

• Derived Column: in our context, this component modified existing columns by substituting
empty values with ”NA” (a T-SQL sketch of this dedup-and-default logic follows this list).

• Slowly Changing Dimension: this component detects and handles changes within dimensions
by updating changed attributes, inserting new records for new members, and retaining prior
records for members that remain unchanged.
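For readers more familiar with SQL than with SSIS components, the sketch below expresses the same
dedup-and-default logic in T-SQL; in the package itself this work is done by the Sort and Derived
Column components, and the table and column names here are illustrative.

;WITH ranked AS (
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY Linkedin ORDER BY Linkedin) AS rn
    FROM    stg.Company
)
SELECT  Linkedin,
        ISNULL(NULLIF(Phone, ''), 'NA') AS Phone,
        ISNULL(NULLIF(Email, ''), 'NA') AS Email
FROM    ranked
WHERE   rn = 1;  -- keep a single row per LinkedIn value, so duplicates are removed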

Figure 6.15: Transformation in table Company

6.8.4 Load
The third and final phase of our ETL process involves loading the previously extracted and
transformed data into their new storage location, namely the data warehouse. This phase transfers
the data to its ultimate destination.
The procedure of populating fact tables within the framework of galaxy modeling involves the
amalgamation of three distinct star models. This process enhances the comprehensive representation
of interconnected data relationships, ultimately contributing to a more informative galaxy model
illustrated in Figure 6.16.

Figure 6.16: The loading of the Data warehouse

As illustrated in Figure 6.17 for Fact Digital, the remaining fact tables go through the same steps.
Loading the three fact tables occurs in two main stages: first we prepare the data source used to
populate the dimensions, then we load the fact tables themselves. This data source is a stored
procedure that performs the joins between all dimensions and measures.
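The stored procedure behind this data source can be pictured roughly as follows. The object names, join
keys, and the subset of dimensions shown are assumptions for illustration; the real procedure joins every
dimension referenced by the fact table.

CREATE OR ALTER PROCEDURE dbo.usp_PrepareFactDigital
AS
BEGIN
    SET NOCOUNT ON;
    -- Resolve the surrogate keys by joining the prepared ODS rows to their dimensions
    SELECT  dw.WebsiteID,
            dc.ContactID,
            ds.SMID,
            dg.GoogleAnalyticID
    FROM    ods.Company          AS src
    JOIN    dbo.Dim_WebSite      AS dw ON dw.website  = src.website
    JOIN    dbo.Dim_Contact_Info AS dc ON dc.Email    = src.Email
    JOIN    dbo.Dim_Social_Media AS ds ON ds.Linkedin = src.Linkedin
    JOIN    dbo.Dim_GAID         AS dg ON dg.GAID     = src.GAID;
END;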
The procedure of populating our fact tables encompassed several stages within the ETL
methodology:

1. Data Extraction: The fact table data was drawn from the data source, which in our instance
was a flat file. We employed the Merge join component to amalgamate data from two separate
source files.

2. Data Transformation: The data underwent modifications in alignment with analytical requisites.
This involved employing the Aggregate function, as well as repeated instances of sorting and
joining. Following the integration of our data sources, we employed the lookup operation to
establish associations connecting the fact table with its corresponding dimension tables.

3. Data Loading: The transformed data was incrementally introduced into the data warehouse's
fact table using a Lookup component, effectively preventing the duplication of any pre-existing
data (the sketch after this list expresses the same idea in T-SQL).
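Conceptually, the lookup-based incremental load behaves like the following T-SQL, which inserts only
the rows whose key is not yet present in the fact table; the names are illustrative.

INSERT INTO dbo.Fact_Digital (WebsiteID, ContactID, SMID, GoogleAnalyticID)
SELECT  s.WebsiteID, s.ContactID, s.SMID, s.GoogleAnalyticID
FROM    dbo.vw_PreparedFactDigital AS s          -- output of the preparation step
WHERE   NOT EXISTS (SELECT 1
                    FROM dbo.Fact_Digital AS f
                    WHERE f.WebsiteID = s.WebsiteID);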

Figure 6.17: Fact Digital implementation

6.8.5 Deployment and Configuration of SSIS Package


To optimize the orchestration of our SSIS packages and streamline the integration of data, we have
initiated the deployment of these packages while also configuring the SQL Agent scheduler. This
strategic move empowers us to effectively schedule and oversee the execution of the packages at
designated intervals or in response to prearranged triggers. This process unfolded through two distinct
stages, each outlined below.

6.8.5.1 SSIS package deployment

Our SSIS package was successfully deployed onto the designated SQL Server instance, a process
vividly depicted in Figure 6.18. The package is systematically stored within the SQL Server
Integration Services Catalog, serving as a centralized hub for administering and housing SSIS
packages. By deploying the package in this manner, we ensure its centralized accessibility and
executable nature.

Figure 6.18: SSIS package deployment

6.8.5.2 Jobs Planing

The incorporation of job scheduling involves the deployment and automation of ETL updates. This
automation occurs through a structured sequence of three stages, executed in synchronization using
SQL Server Agent: the initial stage involves staging area loading, followed by ODS loading, and
culminating in data warehouse loading as shown in Figure 6.19.

Figure 6.19: SQL Agent: job steps configuration

To align with the frequency of the web scraping cycles, the job is scheduled daily at midnight and can
also be triggered manually, as shown in Figure 6.20.
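For reference, an equivalent job can be declared in T-SQL through the msdb stored procedures; the job,
step, and package names below are placeholders, and only the scheduling values (daily, 00:00) reflect
our actual configuration.

USE msdb;
EXEC dbo.sp_add_job         @job_name = N'DW_Nightly_Load';
EXEC dbo.sp_add_jobstep     @job_name  = N'DW_Nightly_Load',
                            @step_name = N'Load staging area',
                            @subsystem = N'SSIS',
                            @command   = N'/ISSERVER "\SSISDB\LeadGen\ETL\LoadStaging.dtsx" /SERVER "."';
-- ...two more steps are added the same way for the ODS load and the data warehouse load
EXEC dbo.sp_add_schedule    @schedule_name     = N'Daily_Midnight',
                            @freq_type         = 4,       -- daily
                            @freq_interval     = 1,       -- every day
                            @active_start_time = 000000;  -- 00:00:00
EXEC dbo.sp_attach_schedule @job_name = N'DW_Nightly_Load', @schedule_name = N'Daily_Midnight';
EXEC dbo.sp_add_jobserver   @job_name = N'DW_Nightly_Load';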

Figure 6.20: SQL Agent: job planning configuration

Figures 6.21 and 6.22 present the execution of the job configured in SQL Server Agent, showing the
successful execution of our SSIS package and the report it provides.

Figure 6.21: SQL Agent: job Execution



Figure 6.22: SQL Agent: job Report

6.9 Analysis Phase


6.9.1 OLAP Cubes
The online analysis phase focuses on creating cubes from the data warehouse. In this step, we
employed OLAP tools for data analysis, enhancing comprehension and providing a clear visualization
of the information.
During the analysis stage with SSAS, essential dimensions for multidimensional analysis were
identified. Dimensions represent the elements upon which end users wish to analyze data.
Once dimensions were identified, key measures for this analysis were also defined. Through
organizing and storing data in a multidimensional cube structure, we facilitate swift and flexible data
analysis across multiple dimensions.
For our project, we designed and implemented the following cubes to support further analyses beyond
those already available in Power BI:

1. The following diagram 6.23 presents the Digital analysis cube

Figure 6.23: Digital analysis Cube

2. The following diagram 6.24 presents the Network analysis Cube

Figure 6.24: Network analysis Cube



3. The following diagram 6.25 presents the Company analysis cube

Figure 6.25: Company Analysis Cube

6.9.2 Deployment and Configuration of OLAP Cubes


To make our cubes available for further analysis, we deployed them onto the designated SQL Server
Analysis Services instance, a process depicted in Figure 6.26. Deploying the cubes in this manner
ensures their centralized accessibility and allows them to be processed whenever needed.

Figure 6.26: Cubes deployment



6.10 Reporting Phase


Restitution is important for users because it is the interface through which they interact with the
decision-support information system to obtain results. The primary objective of our decision-support
system is to deliver answers to users' queries swiftly, without requiring advanced computer knowledge.
Our dashboards feature visuals that illustrate the results and insights derived from the dataset.

6.10.1 Dashboards
• The following dashboard (Figure 6.27) presents Dashboard 1: Company profiling. We used several
visualizations:

– Filters: to filter the visuals and the measures of the analysis axes.
– DAX Measures: to quantify the measures of the analysis axes:

CountCompanies = DISTINCTCOUNT(FactCompany[Company.Linkedin])

CountIndustry = DISTINCTCOUNT(FactCompany[Industry.Industry])

CountLocation = DISTINCTCOUNT(FactCompany[Location.Location])

– Map: to visualize the distribution of the companies on the global map.


– Table: to list the companies' names and their respective profile links.
– Donut Chart: to visualize the number of companies by Industry.
– Stacked Area Chart: to visualize the number of companies by year of foundation.

Figure 6.27: Dashboard Company analysis

• The following dashboard (Figure 6.28) presents Dashboard 2: Companies' digital analysis.

– filters: to filter the visuals and the measures of the analysis axis by the number of contact
panels.
– KPI Measures: to quantify the measures of the analysis axes:

CountEmail = DISTINCTCOUNT(FactDigital[ContactInfo.Email])

CountPhone = DISTINCTCOUNT(FactDigital[ContactInfo.Phone])

CountWebsite = DISTINCTCOUNT(FactDigital[Website.website])

FacebookAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Facebook])

InstagramAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Instagram])

LinkedInAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Linkedin])

NombreWebSite = DISTINCTCOUNT(FactDigital[Website.website])

TwitterAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Twitter])

WhatsappAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Whatsapp])

YoutubeAccounts = DISTINCTCOUNT(FactDigital[SocialMedia.Youtube])

– Pie Chart: to visualize the number of companies by website responsiveness and the GAID.

– Stacked Bar Chart: to visualize the number of companies by the number of social media
of each company.

Figure 6.28: Dashboard Digital analysis

• The following dashboard (Figure 6.29) presents Dashboard 3: Website analysis.

– Tables: to list the IP addresses by country, and the number of certificates issued by each
organization.
– KPI Measures: to quantify the measures of the analysis axes:

CountCertification = DISTINCTCOUNT(FactNetwork[Certification.SerialNumber])

CountIPAddresses = DISTINCTCOUNT(FactNetwork[IPAdress.IPAdress])

CountIssuer = DISTINCTCOUNT(FactNetwork[Certissuer.Issuer])

CountOrganisation = DISTINCTCOUNT(FactNetwork[Certissuer.Organisation])

CountSchemaTypes = DISTINCTCOUNT(FactNetwork[SchemaType.SchemaType])

– Donut Chart: to visualize the number of websites by schema type, and the number of
issuers to each organization.
– Bar Chart: to visualize types of TLS protocols.

Figure 6.29: Dashboard Website analysis

All these dashboards are shared on the Power BI service, allowing the customer to access them
through our web application. This enables the visualization of all KPIs conveniently in a unified
location.

6.10.2 Market Analysis Overview


As noted above, these dashboards are accessible to clients through our web application via the Power BI
service, which offers the convenience of visualizing all key performance indicators (KPIs) in one unified
location. Figure 6.30 showcases the market analysis interface, providing users with navigation access to
the dashboards.

Figure 6.30: Dashboard Interface for Analysis

6.11 Conclusion
In this chapter, we have delved into the various steps involved in crafting our decision-making solution
and the technical tools employed for this purpose. Visual representations in the form of screenshots
have been incorporated to illustrate the prototypes employed in our project. Additionally, we have
comprehensively examined the phases of integration and analysis, seamlessly integrating screenshots
to depict the interfaces developed within the different facets of our solution.
Perspectives

In the realm of our project’s future endeavors, we envision a path marked by innovation and
advancement. These forthcoming steps are poised to enhance our project’s capabilities and impact in
profound ways:

Our first objective is to implement the scraping code on a cloud-based virtual machine.
This strategic move will not only ensure scalability but also allow us to select the most suitable
cloud provider for our specific needs. Data storage will also undergo transformation in the cloud
environment, facilitating more efficient and accessible data management.
Taking a step further, we plan to migrate our data warehousing and data pipeline operations to
the cloud. This transition will provide us with expanded resources and capabilities for data analysis,
particularly focusing on employee data from various companies. By leveraging cloud infrastructure,
we aim to bolster our data processing capabilities, enabling us to derive deeper insights from the
extensive datasets we gather.
Our web application is poised for refinement and enrichment. To enhance user experience,
we intend to introduce advanced features such as detailed filtering and search functionality. This
will empower users to precisely tailor their data queries, facilitating more insightful exploration.
Additionally, we plan to incorporate an export function for prospect data, enabling users to extract
valuable information for further analysis.
As part of our project’s evolution, we have plans to optimize the evaluation processes.
This optimization will include the introduction of a profile management module, allowing users to
efficiently organize and track their interactions. Furthermore, we are preparing the groundwork for
the integration of a payment module, which will serve as a pivotal component for future monetization
strategies.

These prospective developments represent the natural evolution of our project, aiming to elevate
its functionality, accessibility, and usability. By embracing cloud technologies, refining our web
application, and introducing advanced features, we are poised to deliver a more comprehensive and
powerful tool for decision-makers and data analysts.

Conclusion

Embarking on the journey of our project, we venture into the realm of innovation and insight. In
this phase, we initiate the design process, a crucial juncture where the foundation for a functional and
impactful system is laid. Through this introduction, we unveil the roadmap that guides the creation of
a comprehensive decision support system, harmonizing the intricate dance of machine learning, web
scraping, and business intelligence.
At the heart of our endeavor lies the design blueprint, a strategic map that orchestrates the
arrangement of integral components. This blueprint is a guiding light, directing the interactions,
functionalities, and flow of information within the web application. This model not only shapes how
information is structured but also governs how it’s stored and managed. The synergy between data
and design forms the bedrock upon which our web app stands, enabling seamless user experiences
and insightful data exploration.
Enabling this intricate ecosystem is an ensemble of cutting-edge technologies and development
tools, each a brushstroke on the canvas of innovation. We unravel these tools, presenting a tableau that
brings together the prowess of machine learning algorithms, the finesse of web scraping techniques,
and the precision of business intelligence methodologies. These tools not only amplify the user
experience but also empower decision-makers with the ability to extract actionable insights from a
sea of data.
Venturing deeper, we delve into the meticulous steps of loading and constructing data warehouses,
modern repositories where information finds its sanctuary. These repositories are more than just
storage; they are engines of analysis, driving the generation of meaningful dashboards and insightful
reports. As we meticulously detail these steps, the essence of transforming raw data into consumable
knowledge comes to life.

Bibliography

[1] https://medium.com/thetechieguys/crisp-ml-q-.

[2] https://mentari-er.medium.com/membuat-rancangan-data-warehouse-classic-model-c22e46ccfbeb.

[3] https://bennyaustin.com/2010/05/02/kimball-and-inmon-dw-models/.

[4] https://www.javatpoint.com/machine-learning.

[5] https://www.javatpoint.com/supervised-machine-learning.

[6] https://nixustechnologies.com/unsupervised-machine-learning/.

[7] https://medium.com/@khang.pham.exxact/top-10-popular-data-science-algorithms-and-examples-part-1-of-2-52fc14604dd9.

[8] https://medium.com/@medunoyeeni/django-the-fun-part-understanding-the-framework-1bb4df54ab1f.

[9] https://www.upwork.com/en-gb/services/product/development-it-a-website-in-python-django-1371024921383915520.

[10] https://logos-world.net/javascript-logo/.

[11] https://toppng.com/showdownload/233950/bootstrap-featured-image-bootstrap-3-logo.

[12] https://wolfgang-ziegler.com/blog/note-taking-with-github-and-vscode.

[13] https://www.anaconda.com/.

[14] https://jupyter.org/.

[15] https://www.ubackup.com/enterprise-backup/sql-management-studio-backup-fhhbj.html.

[16] https://www.sarjen.com/ssis-advantages-disadvantages/.

[17] https://ramkedem.com/en/ssas-2/.

[18] https://logohistory.net/power-bi-logo/.

[19] ETL or ELT: The evolution of data delivery. QlikTech International AB (2022).

[20] BuiltWith, 2023.

[21] Datanyze, 2023.

[22] Wappalyzer, 2023.

[23] Aitken, A., I. V. Comparative analysis between traditional software engineering and agile
software development, 4749-4752. System Sciences International Conference (2013).

[24] Chen, N. Research on e-commerce database marketing based on machine learning algorithm,
337-340. Computational Intelligence and Neuroscience (2022).

[25] Gunawan, R., R. A. D.-I. F. F. Comparison of web scraping techniques, 1-5. Conference on
System Engineering and Industrial Enterprise (2019).

[26] Herrmann, M., H. L. Applied web scraping in market research, 125-125.

[27] Jin, Z. Research on business intelligence-based marketing decision-making, 5-10.
Technological Development of Enterprise (2008).

[28] Khder, M. Web scraping or web crawling: State of art, techniques, approaches, and application,
1-25. Advances in Soft Computing and its Applications (2021).

[29] Krafft, M., M. C. Data-driven marketing and its impact on customer engagement, 119-136.

[30] Laxmi Priya, V., H. K. Implementing lead qualification model using ICP, 81-90. Capital
Markets: Market Efficiency eJournal (2020).

[31] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 436-444.

[32] Nygård, R., Mezei, J. Automating lead scoring with machine learning: An experimental study,
1-10. International Conference on System Sciences (Jan. 2020).

[33] Peng, J., E. C. Machine learning techniques: Applications and challenges, 3299-3348. Frontiers.

[34] Piccialli, F., C. G. Decision making through unsupervised learning, 27-35. IEEE Intelligent
Systems (2020).

[35] Ramakrishnan, G., J. S. Automatic sales lead generation from web data, 100-101. 22nd
International Conference on Data Engineering (ICDE'06), IBM India Research Lab (2006).

[36] Ranjan, J. Business intelligence: Concepts, components, techniques and benefits, 61-68.
Journal of Theoretical and Applied Information Technology (2009).

[37] Romero, T., O. J. K. O. Business intelligence: business in industry 4.0, 2-10.

[38] Zhi, Z., R. H. A.-p. L. Research on referral service and big data mining for e-commerce with
machine learning, 35-38. Conference on Computer and Technology Applications (ICCTA) (2018).

[39] Świeczak, W. Lead generation strategy as a multichannel mechanism of growth of a modern
enterprise, 105-140. Marketing of Scientific and Research Organizations (2016).

Appendix A

Evaluation Metrics

A.1 The Silhouette Score


A measure of how well-separated the clusters are in comparison to their cohesion. It provides
insight into the quality of clustering by considering both the distance between data points within the
same cluster (a) and the distance to the nearest neighboring cluster (b). The silhouette score ranges
from -1 to 1, where a higher value indicates that data points are better clustered and well-separated.
Mathematical Formula:

Silhouette Score = (b - a) / max(a, b)
Variables:

• a: Average distance from a data point to other points within the same cluster.

• b: Smallest average distance from a data point to points in a different cluster.
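For illustration, a point with a = 2 and b = 5 has a silhouette of (5 - 2) / max(2, 5) = 0.6, indicating a fairly well-clustered point.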

A.2 The Calinski-Harabasz Index


Known as the Variance Ratio Criterion, the Calinski-Harabasz Index quantifies the ratio of between-
cluster variance to within-cluster variance. A higher Calinski-Harabasz Index suggests that the
clusters are well-separated and distinct.
Mathematical Formula:

Calinski-Harabasz Index = (B / W) × ((N - K) / (K - 1))
• B: Between-cluster variance (sum of squared distances between cluster centers and the overall
mean).

• W: Within-cluster variance (sum of squared distances within each cluster).

• N: Total number of data points.

• K: Number of clusters.
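For illustration, with B = 300, W = 100, N = 100 and K = 4, the index is (300 / 100) × ((100 - 4) / (4 - 1)) = 3 × 32 = 96.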

Appendix B

Optimization Methods

B.0.1 Label Encoder


The Label Encoder is employed for encoding categorical variables into numerical values, enabling
machine learning algorithms to work with categorical data. It assigns a unique integer to each
category.

B.0.2 StandardScaler
StandardScaler is a normalization technique used to standardize numerical features. It scales the
features to have a mean of 0 and a standard deviation of 1. This ensures that all features contribute
equally to the learning process, preventing features with larger scales from dominating the model.
Mathematical Formula:

Standardized Value = (x - µ) / σ
• x: Original feature value.

• µ: Mean of the feature values.

• σ: Standard deviation of the feature values.
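For illustration, a value x = 10 from a feature with µ = 6 and σ = 2 is standardized to (10 - 6) / 2 = 2.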

ABSTRACT
Our proposed solution, the Intelligent Lead Generation, is a system that generates B2B leads
and evaluates them using machine learning techniques. It also provides analytical reports and
visualizations to assist the Sales/Marketing team in their decision-making process through the
integration of business intelligence.

Keywords: lead generation, B2B, machine learning, analytical reports, visualizations, business
intelligence integration.

RÉSUMÉ
Notre solution proposée, la Génération Intelligente de Leads, est un système qui génère des prospects
B2B et les évalue à l’aide de techniques d’apprentissage automatique. Elle fournit également des
rapports analytiques et des visualisations pour assister l’équipe de Ventes/Marketing dans leur
processus de prise de décision en utilisant l’intégration de l’intelligence d’entreprise.

Mots-clés : génération de leads, B2B, apprentissage automatique, rapports analytiques,


visualisations, intelligence d’entreprise.

