Rapport PFE Smart Lead Generation Scrum English
Internship Report
COMPUTER ENGINEER
SPECIALITY: BI
Prepared by
Hosting Company
SATORIPOP
Supervised by
Academic Year
2022 – 2023
Dedications
To the cherished memory of my Grandparents, who left our side a few months
before my graduation, you have left an indelible mark on my heart. Your wisdom
and love continue to prevail, even in your absence.
To the incredible souls who lent their hands and hearts, whether near or far, shaping
moments of camaraderie and shared dedication during the journey to complete this
work. Your warmth and encouragement infused each step with purpose.
Acknowledgments
I wish to extend my sincere gratitude to all those who contributed to the
accomplishment of my final year project and offered their unwavering support.
My gratitude also goes to Mr. Khaireddine Fredj, the director of the
company, for welcoming me and providing this opportunity. I would also like to thank
Mr. Zied Machkena, my mentor at Satoripop, for his insightful guidance, which
contributed to my continual growth throughout these six months of internship.
Lastly, I extend a heartfelt thank you to all those who directly or indirectly
contributed to the success of my final year project.
Contents
List of Acronyms 1
General introduction 1
1 Project Context 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Hosting Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Satoripop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Organizational structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Project Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Existing Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.4 Proposed Solution: Smart Lead Generation . . . . . . . . . . . . . . . . . . 6
1.4 Methodological Study and Planning . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Comparative Analysis of Project Management Approaches . . . . . . . . . . 6
1.4.2 Chosen Approach: Agile . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Project Management Frameworks . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.4 Adopted Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4.1 SCRUM Framework . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4.2 CRISP-ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.4.3 GIMSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 Project Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Web Scraping Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.3 The Role of Web Scraping in Lead Generation . . . . . . . . . . . . . . . . 18
2.5 Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 BI Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Decisional System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 Multidimensional Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.4 Data Warehouse Conception Approaches . . . . . . . . . . . . . . . . . . . . 21
2.5.4.1 The Bottom-Up Approach . . . . . . . . . . . . . . . . . . . . . . 21
2.5.4.2 The Top-Down Approach . . . . . . . . . . . . . . . . . . . . . . 21
2.5.4.3 The Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.5 Comparative Study Between Data Integration Processes . . . . . . . . . . . 22
2.5.6 Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.7 OLAP Tabular vs Multidimensional Models . . . . . . . . . . . . . . . . . 24
2.5.8 The Role of Business Intelligence in B2B Lead Generation marketing . . . . 25
2.6 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1 Types of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . 27
2.7.1.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . 28
2.7.1.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 29
2.7.2 The Role of Machine Learning in The Marketing Field . . . . . . . . . . . . 30
2.7.3 Unsupervised Learning for Segmentation Problem . . . . . . . . . . . . . . 31
2.7.4 Similar Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Preliminary Analysis 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.1 Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.2 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.3 Non-Functional Requirements . . . . . . . . . . . . . . . . . . . 37
3.2.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2.1 Use Case: User . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2.2 Use Case: Admin . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Project management with SCRUM . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Team and roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Product backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Release planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Architecture: MVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Model Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 View Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Template Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Framework: Django . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Client-side Interaction: Javascript . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 Front-end Design: Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.4 Integrated Development Environment (IDE) and Version Control . . . . . . . 45
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 First Release 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Presentation of Release 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Sprint 1.1: Account Creation and Admin Panel . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Sprint 1.1 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Increment of Sprint 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2.1 Registration and email verification . . . . . . . . . . . . . . . . . 47
4.3.2.2 Administrator Login Interface and Home page . . . . . . . . . . . 50
4.3.2.3 Users Management Interface . . . . . . . . . . . . . . . . . . . . 51
4.3.2.4 Group Management . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Sprint 1.2: Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.1 Sprint 1.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Related research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.3 Sources Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.4 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5.1 Linkedin Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.5.2 Company employees . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.5.3 Website Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.6 Increment of Sprint 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.6.1 Leads Listing and Search Interface . . . . . . . . . . . . . . . . . 62
4.4.6.2 Leads Management . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Second Release 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Presentation of Release 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Backlog of sprint 2.1: Prospects Segmentation . . . . . . . . . . . . . . . . . . . . . 65
5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.1 Data Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.2.1 Python for Machine Learning . . . . . . . . . . . . . . . . . . . . 70
5.4.2.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3.1 Further Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.3 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . 72
5.4.4 Agglomerative Hierarchical Clustering (AHC) . . . . . . . . . . . . . . . . 75
5.4.4.1 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.4.2 Evaluation and Profiles . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.5 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.5.1 Number of Clusters: Elbow Method . . . . . . . . . . . . . . . . 80
5.4.5.2 Evaluation and Profiles . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Increment of sprint 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 Sprint 2.2: Digital Maturity Assessment . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.1 Sprint 2.2 Backlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2 Sprint 2.2 Increment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2.1 User-side Assessments . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2.2 Admin-side Assessment Management . . . . . . . . . . . . . . . 85
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6 Third Release 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Presentation of Release 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Sprint 3.1 Backlog: BI Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 BI Solution Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Multidimensional Conception of BI Solution . . . . . . . . . . . . . . . . . . . . . . 88
6.5.1 Global DW Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.2 Dimensions Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.3 Conception of Network Datamart . . . . . . . . . . . . . . . . . . . . . . . 91
6.5.4 Conception of Digital Datamart . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5.5 Conception of Company Datamart . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Key Performance Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.6.1 KPI Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.6.2 Development of Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6.2.1 Mock-up scenario: ”Digital Exploration” . . . . . . . . . . . . . 95
6.6.2.2 Mock-up Scenario: ”Enterprise Exploration” . . . . . . . . . . . 95
6.6.2.3 Mock-up Scenario: ”Network Exploration” . . . . . . . . . . . . 95
6.7 Development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.7.1 Comparative Evaluation: Integration Software . . . . . . . . . . . . . . . . . 96
6.7.2 Comparative Evaluation: Reporting Software . . . . . . . . . . . . . . . . . 97
6.7.3 Ecosystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7.3.1 SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7.3.2 SSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.3.3 SQL Server Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.3.4 SSAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.3.5 PowerBI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8 Integration Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8.1 ETL: Extraction, Transformation, Loading . . . . . . . . . . . . . . . . . . 99
6.8.2 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.3 Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.4 Load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.8.5 Deployment and Configuration of SSIS Package . . . . . . . . . . . . . . . 104
6.8.5.1 SSIS package deployment . . . . . . . . . . . . . . . . . . . . . . 104
6.8.5.2 Job Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.9 Analysis Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.1 OLAP Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.9.2 Deployment and Configuration of OLAP Cubes . . . . . . . . . . . . . . . . 109
6.10 Reporting Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.10.1 Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.10.2 Market Analysis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography 117
List of Figures
4.10 Interface: Add - Update user groups . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Interface: update - remove user permissions . . . . . . . . . . . . . . . . . . . . . . 52
4.12 Interface: Update - Delete user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.13 Interface: Add Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.14 Interface: Group permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.15 Result: Group added successfully . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.16 Data Harvesting Process/steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.17 DataKund initiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.18 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.19 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.20 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.21 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.22 Scraping Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.23 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.24 Code Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.25 Listing interface on the Home Page . . . . . . . . . . . . . . . . . . . . . . . . 63
4.26 Interface: Add lead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.27 Result: Lead added successfully . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1 BI Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Global DW Conception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Conception of Network Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Conception of Digital Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5 Conception of Company Datamart . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Scenario Prototype ”Digital Exploration” . . . . . . . . . . . . . . . . . . . . . . . 95
6.7 Scenario Prototype ”Company Exploration” . . . . . . . . . . . . . . . . . . . . . . 95
6.8 Scenario Prototype ”Network Exploration” . . . . . . . . . . . . . . . . . . . . . . 96
6.9 SSMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.10 SSIS software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.11 SSAS software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.12 PowerBI software logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.13 The Loading of the Staging area . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.14 The Loading of the ODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.15 Transformation in table Company . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.16 The loading of the Data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.17 Fact Digital implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.18 SSIS package deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.19 SQL Agent: job steps configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.20 SQL Agent: job planning configuration . . . . . . . . . . . . . . . . . . . . . . . . 106
6.21 SQL Agent: job Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.22 SQL Agent: job Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.23 Digital analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.24 Network analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.25 Company Analysis Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.26 Cubes deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.27 Dashboard Company analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.28 Dashboard Digital analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.29 Dashboard Website analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.30 Dashboard Interface for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
List of Tables
List of Acronyms
General Introduction
The realm of decision-support computing encompasses a wide array of tools, applications, and
methodologies enabling organizations to gather data from various sources. This data is then
prepared for analysis to generate reports, dashboards, and machine-learning models, making analysis
accessible to decision-makers and operational staff.
Currently, enterprises employ BI software to extract valuable insights from their extensive
data repositories. Such tools facilitate the extraction of information like competitive intelligence,
market trends, performance tracking, and reasons behind missed opportunities. Typically rooted in
historical data analysis, Business Intelligence now even incorporates machine learning for predictive
capabilities.
In this context, our culminating project, titled "B2B Lead Generation and Evaluation System"
and carried out at Satoripop, aims to implement a decision-support solution within the industrial
sector, helping decision-makers visualize the market and evaluate digital maturity.
The primary focus is Business Understanding, where our first chapter provides an overall
comprehension of our project framework. It includes an introduction to the core business, a clear
delineation of the project scope, an identification of challenges, and a presentation of the envisioned
solution. Additionally, we elaborate on the development methodology and offer insights into our
project plan.
In the second chapter, we delve into various lead generation methods in marketing, lead assessment,
data collection, business intelligence, and the role of machine learning.
The third chapter addresses preliminary design, needs, stakeholders, and requirements, and introduces
SCRUM methodologies, architecture, and the development environment.
Moving on, the fourth chapter describes the first project release, covering account creation, the
administrator dashboard, and data collection.
In the fifth chapter, we present the second release, with a focus on prospect segmentation and the
implementation of company assessment.
Finally, the sixth chapter introduces the third release, exploring sprint 5, which centers on business
intelligence, data warehouse design, performance indicators, and integration, analysis, and reporting
phases.
Chapter 1
Project Context
1.1 Introduction
The initial chapter serves as an introduction to our project’s background. To begin with, we will
introduce the hosting organization, followed by an exploration of the issues at hand, the overall
framework, and project goals. Subsequently, we will delve into the proposed solution, outlining
the methodology, and providing an overview of the project’s timeline.
Satoripop (figure 1.1 presents its logo) is a software development company with a dedicated
creative team that provides software solutions tackling diverse industry challenges. It offers:
• Cloud Computing
• UX/UI Design
• Digital Marketing
• Swieq: An AI solution that predicts customer behavior for retailers and e-commerce platforms.
1. Retail
4. Cloud
In each of these departments, you will find employees with specific skills and expertise tailored
to their respective roles. For instance, there may be financial analysts within the FSI department and
administrative staff supporting various company activities. Additionally, each department may have a
dedicated project manager overseeing their operations and ensuring smooth coordination.
1.3.2 Problem
Every company runs outbound sales and marketing campaigns, and sourcing prospects for them is
one of the primary tasks of any business. Currently, the online advertising market is witnessing a
significant boom in lead generation, leading to soaring demand for lead management services.
Unfortunately, 61% of B2B marketers consider generating high-quality leads to be one of their major
challenges.
• Advantages:
• Disadvantages:
1. Limitations of the Free Version: The free version of Wappalyzer has certain query
limitations. Regular usage may require upgrading to the paid version.
2. Limited Precision for Some Tools: Wappalyzer may not detect certain newer or less
popular technologies used on a website.
3. Data Privacy: As Wappalyzer collects information about the technologies used on a
website, data privacy concerns may arise. However, it is worth noting that Wappalyzer
does not collect personal data and does not store the collected information.
BuiltWith [20]: This lead generation tool identifies the technologies used by websites and provides
a list of websites that utilize a specific technology, along with other data points such as contact
information and social media profiles.
• Advantages:
• Disadvantages:
1. Inconsistent Data Accuracy: Some users report that the accuracy of the provided data can
be inconsistent.
Datanyze [21]: This business intelligence platform offers a range of lead generation and market
analysis tools. It includes a technology tracking feature that identifies the technologies used by
websites, along with other data points such as funding, employee count, etc.
• Advantages:
1. Wide Range of Data Points: Datanyze offers an array of data points beyond technology
usage, including firmographics, funding data, and employee count.
2. Comprehensive Search Filters: It provides users with a comprehensive set of filters and
searches.
• Disadvantages:
– Traditional Approaches: Traditional methodologies define the entire scope of the project
and follow a fixed sequence of phases: initiation, planning, execution, monitoring, and closure.
– Agile Approaches: Agile approaches, like Scrum and Kanban, are iterative and
progressive. They focus on delivering value to the customer through small, frequent
increments rather than attempting to complete the entire project simultaneously. Agile
projects adapt to changes and continuously improve through regular iterations.
– Traditional methods are less flexible and adaptable to changes in requirements. Once a
phase is completed, it is difficult to make changes without going back and reworking the
previous phases, which can be time-consuming and costly.
– Agile approaches encourage change and favor adaptability. They prioritize
accommodating changing requirements and stakeholder feedback, making it easier to
make adjustments as development progresses.
• Project Planning:
– Classic methodologies require comprehensive and detailed planning from the start. The
entire project scope, schedule, and resources are defined upfront, and any modifications
require a formal change management process.
– Agile projects focus on planning incrementally. Planning is done progressively, typically
for the next iteration or sprint, allowing for more flexibility as the project advances.
• Delivery:
– Classic methodologies deliver the entire project as a single package at the end. The final
product is often tested and validated only after all development is completed.
– Agile approaches deliver the project through small, frequent iterations. Each iteration
results in a potentially shippable product, ensuring regular feedback and validation from
stakeholders.
– Roles are often more rigid in classic methodologies, with distinct roles for project
managers, team members, and stakeholders.
– Agile Approaches: Agile approaches promote self-organizing teams, where members
collectively take responsibility for planning, executing, and delivering the work.
Traditional project management roles may be more flexible in agile environments.
– Communication in classic approaches often follows a formal and top-down structure with
predefined communication channels.
• Risk Management:
– In classic Approaches: Risk management is typically performed early in the project and
is less dynamic throughout the project lifecycle.
– Agile projects integrate risk management throughout the development process, identifying
and addressing risks as they arise during iterations.
Table 1.1 summarises the main differences between traditional approaches and agile approaches.
• Scrum: an agile project management methodology that emphasizes iterative and incremental
software development. It promotes interdisciplinary collaboration, transparency, and the
continuous delivery of features. Scrum projects are divided into iterations called ”sprints,”
typically lasting 2 to 4 weeks. This methodology encourages close collaboration among team
members, with an ideal Scrum team consisting of around 7 +/- 2 members. Communication
within the Scrum team is informal, fostering seamless information exchange. Scrum also
places a strong emphasis on transparency and visibility of ongoing work through artifacts like
the Scrum board. A key aspect of Scrum is the ongoing involvement of the client, with the
product owner acting on behalf of the client throughout the development process, allowing for
adjustments to priorities and features based on changing needs.
• KANBAN: a visual project management methodology that enables workflow tracking and
task management optimization. It emphasizes visualizing work in progress, limiting work in
progress, and continuous improvement. Unlike Scrum and XP, KANBAN does not rely on
fixed iterations. Work begins immediately after the completion of a previous task, offering
great flexibility. This methodology can be adopted by any competent or multidisciplinary team.
Team communication in KANBAN is often informal and can occur face-to-face. KANBAN
focuses on visualizing the workflow, with a KANBAN board displaying work in progress, tasks
to be done, and those already completed.
After studying various agile methodologies to improve client communication, enhance deliverable
visibility, and ensure high product quality, we have chosen to adopt the Scrum framework. Our
decision is motivated by the following reasons:
Scrum principles are integral to the most widely practiced agile software development techniques
worldwide. Scrum comprises three core elements: roles, artifacts, and time-boxed events.
Scrum is frequently employed in the software development process to address complex challenges,
consistently demonstrating increased productivity and reduced software development costs [23].
The core principle of the Scrum methodology involves incremental software development by
breaking the project into iterations or ”Sprints.” The goal is to deliver an incremental portion of
the software to the client at the end of each sprint. This methodology relies on iterative development
cycles lasting 2 to 4 weeks, making it conducive to accommodating adjustments compared to other
approaches.
Regarding Scrum roles, there are three primary ones:
• Product Owner: Responsible for managing the product backlog. Collaborates with stakeholders
to understand product needs and requirements, prioritizes backlog items, and defines them in
terms of ”User stories.” Ensures the development team comprehends product requirements and
goals and remains available to address queries.
• Development Team: Tasked with completing selected product backlog items for each sprint.
The team is self-organizing and may comprise developers, testers, designers, and other
professionals involved in product development. They work collectively to achieve sprint goals
and deliver high-quality features.
• Scrum Master: Responsible for implementing the Scrum methodology and ensuring the
development team adheres to the process. Facilitates sprint meetings, aids in resolving
issues and obstacles encountered by the development team, and ensures proper Scrum process
adherence.
Scrum also includes several meetings to maintain communication and collaboration within the
development team and with stakeholders. These meetings include sprint planning, daily Scrum, sprint
review, and sprint retrospective meetings.
Regarding Scrum artifacts, the methodology provides several essential elements, including the
product backlog, sprint backlog, burndown chart, and task board. These contribute to effective
management and visualization of ongoing work and sprint goals.
1.4.4.2 CRISP-ML
CRISP-ML, which stands for Cross-Industry Standard Process for Machine Learning, is a widely
adopted methodology for developing machine learning projects. It provides a well-structured
framework to carry out various tasks and activities throughout the machine learning lifecycle, from
problem understanding to model deployment. One of the primary reasons for embracing CRISP-ML
is its systematic and structured approach, which enhances the chances of success. It offers a clear
roadmap for the project, ensuring that all relevant steps are followed in the correct sequence with
well-defined inputs and outputs.
Another advantage of using CRISP-ML is its established and widely accepted nature, which
means ample support and resources are available for those who use it. This can be especially beneficial
for machine learning novices or complex projects. Overall, the use of CRISP-ML can significantly
increase the likelihood of success in a machine learning project by providing a clear and structured
approach while leveraging the support and resources within the machine learning community.
The CRISP-ML methodology, illustrated in Figure 1.2 (taken from [1]), encompasses the
following key stages:
1. Business and Data Understanding: The development of machine learning applications starts
with identifying the project’s scope, success criteria, and data quality verification to ensure
feasibility. Success criteria, including those related to the market, should be defined with
measurable performance indicators.
2. Data Engineering (Data Preparation): In this phase, data is prepared by selecting relevant
market segments and cleaning them to ensure data quality. Important features for market
segmentation are identified, and data is normalized to avoid errors.
3. Machine Learning Model Engineering: The modeling phase focuses on specifying machine
learning models suitable for market segmentation. Evaluation metrics include the ability to
identify market segments, model robustness, and interpretability.
4. Machine Learning Model Evaluation: After training, models are evaluated for their ability
to segment the market effectively and accurately. Performance, robustness, and interpretability
metrics are used to assess the models.
5. Deployment: Model deployment involves integrating machine learning models into existing
systems to enable real-time segmentation. Deployment approaches vary depending on market
segmentation needs, whether online or batch.
6. Monitoring and Maintenance: Once in production, models are monitored to ensure they
maintain their ability to segment the market accurately. Adjustments are made based on market
changes to ensure the continuous relevance of the segmentation.
CRISP-ML can be effectively combined with agile approaches like Scrum or Kanban. While it
provides a systematic and structured way to handle machine-learning projects, agile methodologies
bring flexibility, collaboration, and iterative delivery. By integrating both approaches, teams can
efficiently manage the complexities of data projects, steadily deliver value, and adapt to evolving
requirements and insights.
1.4.4.3 GIMSI
The GIMSI approach, an agile methodology, centers on users and meaningful insights in business
intelligence. It offers a structured framework for successful dashboard integration projects, focusing
on optimizing performance.
In addition, rigorous research has been conducted on GIMSI. Evolving technology and human
behavior pose challenges to businesses, requiring adaptability and proactive measures. Choosing the
right approach for aligning policies and strategies is complex. Notably, the GIMSI process consists
of 10 defined steps grouped into four phases:
1. Identification
• Examination of the company’s environment: This phase involves analyzing the economic
environment and the company’s strategy to outline the project’s scope clearly.
• Company identification: This step entails scrutinizing the organizational structure,
business processes, and involved stakeholders of the company.
2. Design
• Defining company objectives: During this stage, we thoroughly explore the strategic
aspirations of operational teams, seeking their tactical goals and specific ambitions.
• Defining a dashboard: This phase encompasses defining and characterizing an individual
dashboard for each team, serving as a decision-making aid with relevant performance
indicators.
• Selection of performance indicators: Choosing performance indicators is a crucial step
based on objectives, context, and stakeholders identified in prior stages, providing
valuable guidance for selecting the most relevant indicators.
• Information collection: This phase aims to gather essential data required for developing
indicators.
• Dashboard system: Constructing the dashboard system and ensuring overall consistency
control.
3. Implementation
1.5 Planning
1.5.1 Project Planning
Our project unfolds through several pivotal phases that cover diverse aspects of data-driven
analysis and application development. In the initial stage of Planning and Requirement Analysis,
we lay the foundation by outlining the project's scope and objectives.
Following this, we carry out Data Collection via Web Scraping, harnessing automated data
extraction to gather relevant information from online sources.
Next comes the Building of Segmentation Machine Learning Models. We create two models,
training and validating them; following a rigorous evaluation, we select the most suitable model and
ensure it aligns with our objectives.
Subsequently, in the Data Integration phase, we use ETL (Extract, Transform, Load) processes to
harmonize and merge the collected data. We establish a systematic job plan to ensure consistent
updates, maintaining data relevance over time. This integrated dataset serves as the foundation for
the subsequent steps.
In the Data Analysis phase, we leverage Online Analytical Processing (OLAP) cubes to explore the
multidimensional insights hidden within the data, and visual dashboards give stakeholders a quick,
comprehensive view of these insights.
The culmination is the Development of a Web Application, an interactive platform that exposes the
results of the previous phases. It provides access to visually rich dashboards and the list of
web-scraped leads, and it allows users to make predictions with the segmentation model. This
holistic approach merges analytics with user-friendly interactivity.
1.6 Conclusion
This chapter allowed us to develop our work methodology and outline the different phases to follow.
This framework provided us with the essential foundations to progress in our approach. We were able
to grasp the challenges of our project by contextualizing it within the company where the internship
took place, and by specifying our requirements in detail. This step will enable us to lay
the necessary groundwork for the implementation of our project. The next chapter will present the
fundamental theoretical principles underlying our solution.
Chapter 2
State of the Art
2.1 Introduction
Following the presentation of the problem and the proposed solution, this chapter embarks on an
examination of the current state of knowledge and developments in our project’s domain. It delves
into research work, methodologies, and existing solutions to our lead generation issue.
Subsequently, we dive into the specific realm of data collection and cleansing. We focus on web
scraping, a method we judiciously employed in the data collection phase. This approach enabled
us to acquire essential data and subsequently enter the realm of business intelligence to integrate and
prepare the data for analysis. As we progress, we explore machine learning methods, with a particular
focus on unsupervised learning techniques for segmentation problems in the field of Marketing.
According to [39], lead generation helps organizations increase brand awareness, establish
relationships, and attract more potential customers to fill their sales pipeline. It also affects the
significance of organizations, whose value effectively increases following the implementation of such tools.
The authors of [30] concluded that lead qualification is an essential task for the marketing team, as
it enhances the efficiency of campaigns conducted by the sales teams. A well-qualified lead will help
the sales team increase the conversion rate. Other factors, such as time optimization, targeting the
right type of prospects, and transforming the lead management process to be more meaningful, also
play a crucial role.
Research has shown that using web scraping produces more comprehensive,
accurate, and consistent data compared to manual data entry. Based on the results, it has been
concluded that web scraping is very useful in the information age.
These extracted data can then be used to create marketing lists and target potential customers with specific
offers or promotions. Web scraping can also help businesses gather data about their competitors,
including their marketing strategies and product offerings, which can be used to inform their own
sales and marketing efforts.
• HTML Parsing: This involves analyzing the HTML code of a website to extract specific
information (a minimal parsing sketch follows this list):
1. Analyzing the HTML structure using tools like Beautiful Soup and lxml to select specific
elements for extraction.
2. Using CSS and XPath selectors to locate specific elements on a webpage for extraction,
such as tags, classes, IDs, and attributes.
3. Web Browser Automation: For more complex websites, automating the web browser may
be necessary to simulate human interaction. Tools like Selenium can be used for this
purpose.
4. Handling Speed Limitations: Websites may implement rate limits to prevent excessive
web scraping. To overcome this, web scraping experts can use techniques like rotating IP
addresses, breaking queries into multiple sessions, and setting delays between requests.
5. Data Storage and Processing: Once the data is extracted, it needs to be stored and
processed.
• Web Crawling: This involves automatically navigating a website to extract data from multiple
pages. Before starting a crawl, defining the crawl scope, i.e., the web pages to explore,
is important. Then, link crawling involves exploring the website’s internal links to collect
additional data.
• API Scraping: This refers to using an Application Programming Interface (API) to extract data
from a website.
• Screen Scraping: This involves extracting data directly from the visual elements of a website,
such as text, images, and forms.
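To make the HTML parsing technique concrete, the following minimal sketch fetches a page with the requests library and extracts elements with Beautiful Soup. It is illustrative only: the URL, tag name, and CSS class are hypothetical and are not taken from our actual scrapers.

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; replace with a page whose terms of use allow scraping.
URL = "https://example.com/companies"

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Select elements by tag, class, or attribute (the selectors below are assumptions).
companies = []
for card in soup.select("div.company-card"):
    name_tag = card.find("h2")
    link_tag = card.find("a", href=True)
    companies.append({
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "website": link_tag["href"] if link_tag else None,
    })

print(companies)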
Additionally, based on [25], we found that web scraping using the regex method consumes the least
memory compared to the HTML DOM and XPath methods. On the other hand, HTML DOM has
the lowest execution time and the least data consumption compared to the regex and XPath methods.
1. Depending on the complexity of the website and the data to be extracted, using an API is
the most reliable option if it provides accurate and up-to-date data. However, in the case of
limitations with the LinkedIn API, HTML structure analysis, especially using regex, is the
simplest in terms of memory usage and accessibility.
2. During website exploration, there may be constraints to consider, such as rate limits and
bandwidth limitations. Crawlers can be configured to comply with these constraints by slowing
down the exploration speed or dividing the exploration into multiple sessions. In addition,
errors may occur during exploration, such as 404 pages or server errors; crawlers can be
configured to handle them by attempting to recover missing pages or aborting the exploration
(a small fetch-with-retry sketch follows this list).
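The rate-limit and error-handling points above can be sketched as follows; the delay, back-off factor, and retry count are arbitrary assumptions rather than tuned values from our crawler.

import time
import requests

def polite_get(url, retries=3, delay_seconds=2.0):
    """Fetch a URL while respecting a simple rate limit and recovering from errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 404:
                return None              # missing page: skip it instead of aborting the crawl
            response.raise_for_status()  # raise on other client or server errors
            return response.text
        except requests.RequestException:
            if attempt == retries:
                raise                    # give up after the last attempt
        time.sleep(delay_seconds * attempt)  # slow down between attempts

html = polite_get("https://example.com/")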
• Adapting to changes by updating system data according to evolving needs and technology while
keeping users informed of these modifications.
• Extracting significant business value from vast datasets using analytical tools to aid decision-
making.
Figure 2.1 (taken from [2]) shows the in-depth structure of such a decision-support system, which
consists of the following components:
• Data Sources: These are varied and diverse data origins that can be generated both within and
outside the organization.
• ETL Process: This is a procedure involving collecting, transforming, and loading data into a
data warehouse or target system.
• Operational Data Store (ODS): This serves as an intermediate storage system between
operational data sources and the data warehouse. It allows real-time access to operational data
for daily activities and operational reports.
• Data Warehouse: This centralized database efficiently organizes and stores structured data from
different company sources. Its purpose is to facilitate analysis and decision-making by enabling
quick and consistent access to historical and current data.
• Data Marts: These are compact, specialized databases that consolidate domain or department-
specific data. They aim to provide aggregated and pre-formatted information suitable for precise
analyses and reports within a given context.
• OLAP Cubes: These enable interactive analysis of data using multidimensional structures.
• Data Visualization Tools: These tools visually represent information and data graphically and
intuitively.
• The fact table captures detailed data about specific events or transactions, with numerical
measures tied to various dimensions.
• Dimension tables offer extra descriptive data related to the fact table, providing specific
contexts and viewpoints for the recorded measures.
1. Star Model (see Figure 2.2): a central fact table is surrounded by dimension tables, resembling
a star shape and supporting user-friendly data exploration and multidimensional analysis.
2. Snowflake Model (see Figure 2.3): a variation where dimensions are split into sub-tables,
forming a hierarchy that improves data normalization but can complicate queries.
3. Galaxy Model (see Figure 2.4): multiple star models sharing common dimensions, potentially
involving different facts and dimensions (a small illustrative sketch of querying a star schema follows).
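As a small illustration of how a star model is queried (the tables and columns below are hypothetical toy data, not our actual data marts), a fact table can be joined to a dimension and a measure aggregated along it:

import pandas as pd

# Hypothetical dimension and fact tables of a star schema.
dim_company = pd.DataFrame({
    "company_id": [1, 2, 3],
    "industry": ["Retail", "FSI", "Retail"],
})
fact_digital = pd.DataFrame({
    "company_id": [1, 2, 3, 1],
    "followers_count": [1200, 300, 450, 1500],
})

# Join the fact table to its dimension, then aggregate the measure per dimension member.
star = fact_digital.merge(dim_company, on="company_id")
print(star.groupby("industry")["followers_count"].sum())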
Initiated by Ralph Kimball, this approach gradually builds essential components toward a complete
system. It starts with crafting data marts that cater to specific business needs, providing user-friendly
reporting and analysis for particular processes, as shown in Figure 2.5 (taken from [3]).
Initiated by Bill Inmon, this approach begins with an overarching vision and moves toward specifics:
a data warehouse acts as a centralized repository built on a standardized business model, as shown in
Figure 2.6 (taken from [3]).
The hybrid approach combines Inmon's and Kimball's methodologies for efficient data warehouse
design. In practice, many companies employ Inmon's method to establish a centralized data
warehouse and Kimball's technique to create data marts using a star schema. This blend reaps the
benefits of both approaches, catering to the business's specific needs, as shown in Figure 2.7 (taken from [3]).
KPIs are generally grouped into four distinct categories, each with its own attributes:
• Strategic KPIs offer a comprehensive look at a company's health. Though they do not provide
intricate details, they are frequently used by executives to gauge return on investment, profit
margins, and total revenue.
• Functional KPIs center around specific company departments or functions. For instance,
the marketing department measures the clicks on each email distribution. These KPIs can be
strategic or operational, and they offer substantial value to specific user groups.
• Operational KPIs focus on shorter periods, assessing a company's performance from month
to month or even day to day. They allow management to analyze specific processes or segments.
• Leading or Lagging KPIs describe the nature of the analyzed data: leading indicators predict
forthcoming events, whereas lagging indicators reflect events that have already occurred and
result from past operations.
According to [27], the collection and analysis of marketing data and information are the
scientific basis for marketing decision-making. The study found that the key technologies supporting
business intelligence include data warehousing, data mining, and OLAP. It also examined the
application of business intelligence in corporate marketing decision-making.
The combination of these AI types has catalyzed breakthroughs across industries, ushering in a new
era of AI-powered innovation with profound societal implications. For more details, see [31].
Supervised machine learning operates under the principle of guidance. This involves instructing
machines through the use of a ”labeled” dataset, whereby the machine is trained and subsequently
makes predictions based on this training. The term ”labeled data” signifies that specific inputs are
already linked to their respective outputs. To elaborate further, the process begins by training the
machine with input-output pairs, followed by tasking the machine with predicting outputs when
presented with a separate test dataset.
As illustrated in Figure 2.9 taken from [5], supervised learning embodies a category of
ML wherein the algorithm undergoes training with a dataset that includes both input data and
corresponding output labels. This training enables the algorithm to establish associations between
input data and the correct corresponding output, based on the provided labels. The overarching
objective of supervised learning is to facilitate accurate predictions for novel, unseen data, leveraging
the general patterns and relationships absorbed during the training phase.
The primary objective underlying the supervised learning approach is to establish a mapping
between the input variable (x) and the output variable (y). It can be divided into two distinct
categories:
• Classification: the output variable is a discrete class label, for example whether a prospect is qualified or not.
• Regression: the output variable is a continuous numerical value, for example an estimated revenue figure.
Unsupervised learning stands as a distinctive approach in machine learning, where models operate
without guided instruction from a training dataset. Instead, these models autonomously uncover
concealed patterns and insights within the provided data; the process is analogous to how humans
assimilate new knowledge. In essence:
Unsupervised learning is a machine learning category wherein models are trained using unlabeled
datasets, enabling them to make informed decisions without directed oversight.
The algorithm’s objective within this context is to unveil latent patterns, structures, or relationships
embedded within the data, devoid of explicit steering. As illustrated in figure 2.10 taken from [6],
the core objective of unsupervised learning algorithms is to categorize unorganized datasets based on
similarities. Tasks like clustering and dimensionality reduction exemplify this paradigm. Clustering
involves amalgamating data points based on inherent attributes, while dimensionality reduction
techniques aspire to encapsulate complex data within a reduced-dimensional space while retaining
crucial insights. Unsupervised machine learning can be divided into two distinct categories, as
delineated below:
• Clustering: Employed when intrinsic groups within data necessitate discovery. This technique
groups objects so that the most similar items congregate, while dissimilarity dominates between
different groups. An instance is customer grouping by purchasing behavior.
• Association: Employed to discover relationships between variables in a dataset, identifying
items that frequently occur together, such as products commonly bought in the same basket.
While traditional supervised learning focuses solely on labeled data and unsupervised learning
deals with unlabeled data, semi-supervised learning offers a balanced approach that capitalizes on the
advantages of both paradigms. This technique is valuable in scenarios where acquiring large amounts
of labeled data is challenging or expensive, yet model performance needs to go beyond what
unsupervised learning can achieve alone.
Figure 2.11 (taken from [7]) shows that reinforcement learning operates through a feedback-driven
procedure where an AI agent (a software component) autonomously explores its environment through
trial and error. It takes action, learns from its encounters, and enhances its performance. The
agent is rewarded for favorable actions and penalized for unfavorable ones, with the primary aim
of maximizing cumulative rewards.
In contrast to supervised learning, reinforcement learning lacks labeled data and solely relies on
experiential learning.
The process of reinforcement learning mirrors human learning, much as a child acquires knowledge
through daily experiences. A tangible instance is playing a game, wherein the game serves as the
environment, the agent’s moves represent states, and the objective is to achieve a high score. The
agent receives feedback in the form of rewards and penalties.
Reinforcement learning’s operational paradigm has found applications across diverse domains
including game theory, operations research, information theory, and multi-agent systems.
Formally, a reinforcement learning challenge can be defined using the framework of a Markov
Decision Process (MDP). Within this context, the agent engages continually with the environment,
executing actions that result in environment responses and subsequent state transitions.
Machine learning encompasses various types tailored to distinct tasks and precise results, as
illustrated in Figure 2.12 (taken from [7]) from the Machine Learning Techniques article on its
applications and challenges [33]:
• K-Means Clustering: The algorithm seeks to minimize the sum of squared distances from
each point to the centroid of its assigned cluster. It assigns each data point to the nearest cluster
centroid and then updates the centroids based on the mean of the points in each cluster. This
process is repeated iteratively until convergence. K-means can work well when clusters are
well-defined and roughly spherical. Its performance can be evaluated using metrics like the
silhouette score or within-cluster sum of squares.
\[
\arg\min_{\text{clusters}} \; \sum_{i=1}^{n} \min_{j=1,\dots,k} \lVert x_i - \mu_j \rVert^{2}
\]
– \(\arg\min_{\text{clusters}}\): the argument that minimizes over possible cluster assignments.
– \(\sum_{i=1}^{n}\): the summation over all data points.
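A minimal K-means sketch with scikit-learn is shown below; the feature matrix is random toy data standing in for prospect features, so the printed numbers themselves are meaningless.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))        # toy feature matrix

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Within-cluster sum of squares (inertia):", kmeans.inertia_)
print("Silhouette score:", silhouette_score(X, labels))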
• Hierarchical (Agglomerative) Clustering: builds a hierarchy of nested clusters by repeatedly
merging the closest pair of clusters, where the notion of distance between clusters depends on
the linkage criterion:
1. Single Linkage: defines the distance between two clusters as the minimum distance
between any pair of points, one from each cluster.
The quality of hierarchical clustering can be visualized through dendrograms, and cluster
selection can be guided by metrics such as the cophenetic correlation or the silhouette score
(a small sketch follows).
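A minimal agglomerative clustering sketch follows (toy data again; the single-linkage method and the cut into three clusters are illustrative assumptions). It builds the merge history used by a dendrogram, computes the cophenetic correlation, and extracts flat cluster labels.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                      # toy feature matrix

Z = linkage(X, method="single")                   # single-linkage merge history (dendrogram data)
coph_corr, _ = cophenet(Z, pdist(X))              # how well the tree preserves pairwise distances
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters

print("Cophenetic correlation:", coph_corr)
print("Cluster sizes:", np.bincount(labels)[1:])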
• Gaussian Mixture Models (GMM): model the data as a weighted mixture of k Gaussian
components, so that each point \(x_i\) has density
\[
p(x_i \mid \theta) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)
\]
GMM can model clusters of arbitrary shapes and can capture data distribution complexities.
Evaluation can be done using the log-likelihood or the Bayesian Information Criterion (BIC);
a small model-selection sketch follows.
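A minimal Gaussian mixture sketch with scikit-learn is given below; the toy data and the range of candidate component counts are arbitrary assumptions, and the point is only to show BIC-based model selection.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                     # toy feature matrix

bic_per_k = {}
for k in range(2, 7):                             # candidate numbers of components
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=1).fit(X)
    bic_per_k[k] = gmm.bic(X)                     # lower BIC = better fit/complexity trade-off

best_k = min(bic_per_k, key=bic_per_k.get)
print("BIC per k:", bic_per_k)
print("Selected number of components:", best_k)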
• The authors employed a variety of measures and validation approaches instead of relying solely
on accuracy criteria to evaluate model performance.
• The authors introduced processing time and computational power as useful criteria in model
selection to maintain stable performance on large datasets.
• ML can not only significantly enhance the performance of large-scale data exploration but also
achieve precise marketing and further increase the marginal profit by approximately 20% for
each product type.
2.8 Conclusion
In this chapter, we have examined the impact of data integration and predictive analysis technologies
on economic and business decision-making. The next chapter will focus on presenting our work and
outlining the different phases we followed.
Chapter 3
Preliminary Analysis
3.1 Introduction
In this chapter, we will begin by examining the requirements of this project. Following that, as we
embark on our web application project, we will establish the design foundations, and outline the
various tasks carried out following meetings with the Scrum Master and the Product Owner. During
these meetings, we formulated the project backlog and segmented it into iterations. Subsequently,
we will outline the overall architecture of our project, and finally, we will delve into the development
environment for the work.
3.2 Conception
3.2.1 Requirements Analysis
By implementing these functionalities, users gain the ability to make more informed business
decisions based on precise data and in-depth analysis.
3.2.1.1 Actors
User: Represents standard users who have access to the web application to view the site. This could be
a business director or any decision-maker who would benefit from leads, predictions, or visualizations
for an overall market analysis.
Administrator: This is the web management system administrator with elevated data access rights.
They can manage, review, add, modify, and delete significant elements and users.
• Data Extraction from Targeted Company Profiles: The process of gathering relevant
information and details about specific companies, such as their history, industry, size, location,
and key personnel, from various sources.
• Data Extraction from Targeted Company Websites: Collecting data from specific company
websites, which may include contact information and other relevant content to obtain insights
• Data Integration: Combining data from different sources and formats into a unified and
consistent format, allowing decision-makers to analyze and make informed decisions based
on a comprehensive view of their data.
• Standardized Reports: Predefined and structured reports presenting key performance indicators
(KPIs) and metrics in a consistent format, enabling easy and quick access to essential business
information.
1. User Management:
2. Group Management:
3. Prospect Management:
2. Conducting assessments
3. Viewing dashboards
These are the non-functional requirements: inherent system characteristics that the system must satisfy. Among these, we highlight:
• Security Measures: safeguarding the confidentiality and accessibility of both the system's own data and user data.
• Dependability: the data yielded by the application must be accurate and reliable.
The ideal user would be a sales or marketing manager who would benefit from the insights provided by our app. As illustrated in Figure 7.1, they can perform the following actions:
1. Login and registration: The first step is to create an account which requires email verification,
then the user who has created his account can log in to the platform.
2. Leads Search and Filtering: The user can search leads by industry, location, or company name, and can view the global list of the leads in the database in the form of a table.
3. Evaluate a company's digital maturity: The user can input the information related to the digital presence of a company through a form and predict its digital maturity; they can also view the history of predictions.
4. View Market analysis dashboards: The user can view the dynamic visuals of multiple dashboards to gain information and inspiration for their next marketing strategy.
The web management system administrator can perform the following actions which are illustrated
in Figure 7.2
1. Leads Management: The admin can manage the lead records defined in the application: create, update, delete, and view lead instances directly from the admin interface.
2. Prediction History Management: The admin can likewise create, update, delete, and view prediction-history records directly from the admin interface.
3. Authentication and Authorization control: The admin panel requires users to log in with valid
credentials. It offers role-based access control, allowing different users to have varying levels
of access and control over different parts of the application. Superusers can assign permissions
and roles to other users, including creating new superusers.
4. Search and Filtering: The admin can easily search for records using keyword searches and apply
filters to narrow down results, enhancing usability for administrators managing large datasets.
5. List and Detail Views: The admin panel provides list views to display records in tabular format
and detail views for individual records, making it straightforward to view and manage data.
6. Actions and Bulk Operations: Developers can define custom actions that can be applied to
multiple records at once, simplifying bulk operations such as deleting or updating records.
7. Manage Users and Groups: Superusers can create, update, and delete user accounts as well as
manage groups and permissions.
• User Description: The User class table is responsible for storing user accounts and
authentication information. It enables users to log in, access personalized content, and interact
with the web app’s features. User data includes attributes like username, email, password,
and permissions. This class is fundamental for managing user identities and access within the
application.
• Group Description: The Group class table represents user groups, which can help in organizing
and managing permissions efficiently. User accounts can be assigned to specific groups,
simplifying permission management by applying access controls to entire groups instead of
individual users. This class table typically includes attributes like group name and associated
permissions.
• Leads Description: The Prospect class table handles the management of potential clients or
prospects. It stores information about potential customers who have shown interest in the
services or products offered by the web app. Attributes within this table may include prospect
name, contact information, interaction history, and status (e.g., active, inactive).
• Data Description: The Assessment class table stores the data of the evaluations conducted by clients.
• Superuser Description: The Superuser class table represents the highest level of administrative
access within the application. Superusers have the authority to manage user accounts, groups,
and other administrative tasks. This class table may store attributes like username, email,
password, and additional permissions specific to superusers.
3.6 Conclusion
In this chapter, we began by defining the key participants in our application and outlining their respective roles and use cases. Following that, we delved into the functional and non-functional specifications of our solution. Subsequently, we elaborated on the approach we will adopt for our project using the Scrum methodology. To wrap up, we concluded this chapter with an overview of the software environment. The next chapter will be entirely dedicated to the first deliverable, "Delivery 1".
Chapter 4
First Release
4.1 Introduction
After examining and defining our client’s overall requirements, this chapter will delve into the various
steps involved in developing the first delivery’s two sprints. We will begin by presenting the product
backlog for each sprint, followed by a detailed analysis, feature design, and ultimately, a showcase of
the user interfaces.
For each sprint, we will present its sprint backlog, and an analysis will be explored to illustrate
the interfaces created.
Figure 4.1 depicts the interface that enables the user to create an account:
Figure 4.3 depicts the login interface from which the user can access the home page:
Figure 4.4 depicts the password reset request interface from which the user can request to reset
his password in case he forgot it which will send him an automatic reset email:
Figure 4.5 depicts the actual password reset interface from which the user can reset his password
in case he forgot it:
Figure 4.6 depicts the interface that enables the admin to access his account:
Figure 4.8 depicts the user management interface, particularly adding a user:
Figure 4.10 depicts the user group management interface, particularly adding or updating the user
group:
Figure 4.11 depicts the user permissions management interface particularly updating or removing
user permissions:
Figure 4.12 depicts the user management interface particularly updating or deleting users:
Figure 4.13 depicts the group management interface, particularly adding a group:
Figure 4.14 depicts the user group management interface, particularly managing group permissions:
each company’s website. Utilizing a looping mechanism, we sent requests to each website in the list
and collected valuable data. This exhaustive approach allowed us to gain insights into the company’s
online content, enabling a deeper analysis.
By combining LinkedIn data with website scraping, we were able to create a robust dataset, providing a comprehensive view of the companies' profiles and online presence. This wealth of information serves as a valuable resource for strategic decision-making, market research, and competitive analysis. However, it is crucial to mention that during this web scraping process, we adhered to ethical and legal guidelines, respecting LinkedIn's terms of service and ensuring the privacy and security of the scraped data. All of this is summarised in Figure 4.16.
4.4.5 Implementation
4.4.5.1 Linkedin Scraping
The main tool that helped us automate the data collection process and work efficiently in bulk is DataKund, which provides several functions dedicated to social media platforms such as YouTube, Facebook, Instagram, and LinkedIn. It is initialized as illustrated in Figure 4.17.
In the first part, we began by extracting a targeted list of companies based on specific criteria such
as location and sector. The code shown in Figure 4.18 illustrates the implementation of this function.
• Website: The website URL of the company provides a direct link to their online presence and offerings, allowing users to explore their products and services easily.
• Linkedin: The LinkedIn profile link of the company, giving insights into their professional
network, company updates, and potential collaborations.
• Industry: The industry category in which the company operates, providing an understanding of
its market focus and niche.
• Phone: The contact phone number of the company, allowing users to reach out for inquiries or
support.
• Company Size: The size of the company in terms of employee count, offering an idea of its
scale and workforce.
• Headquarters: The location of the company’s main office or headquarters, indicating their
primary operational base.
• Founded: The year in which the company was established, providing insights into its history
and experience in the industry.
In the pursuit of gathering valuable insights into the key personnel of companies on our designated
list, a sophisticated script has been meticulously designed and executed. Our script, characterized by
its dynamic nature, adeptly traverses LinkedIn to extract vital information about major employees.
Beginning with the identification of each company's name from their respective LinkedIn URLs, in Figure 4.19, the script proceeds to systematically search and collect data on individuals who hold prominent positions within the organizations included in this list: keywords = ["CEO", "Founder", "Owner", "Chef", "CTO", "Chief", "Executive", "Partner", "Director", "Vice President", "Directeur", "Fondateur", "DGA", "PDG", "RH", "Responsible"].
Employing a systematic and thorough approach, our script iterates through LinkedIn search results,
capturing pertinent details such as names, job titles, locations, and profile links of individuals who
fit predefined criteria. This method ensures the comprehensive compilation of valuable data for our
analysis illustrated in Figure 4.20.
The script’s ability to adapt to different company profiles and the precision with which it identifies
major employees highlights its effectiveness in assisting our research efforts. This innovative approach
empowers us with a robust dataset for further analysis and strategic decision-making, ultimately
enhancing our understanding of the corporate landscape.
• BeautifulSoup: was utilized to parse HTML and XML documents, simplifying our ability to
navigate the analysis tree and extract relevant data from web pages.
• Requests: We employed Requests to perform HTTP requests to web pages. While it is not specifically designed for HTML parsing like BeautifulSoup, it is frequently used in conjunction with it to retrieve web pages before analysis. An example of using these two libraries is when we collected the Google Analytics ID and the Publisher ID using the regex method, as illustrated in Figure 4.21; a minimal sketch of this combination is given after this list.
• SSL: has enabled us to work with SSL/TLS certificates. It provides tools for creating
secure SSL/TLS connections, managing certificates, and verifying the authenticity of SSL/TLS
certificates presented by remote servers.
• SOCKET: Python’s SOCKET module has provided us with an interface for handling sockets,
which act as endpoints for network communication.
• OpenSSL: This library has empowered us to effectively manage certificates and secure
sockets. The previously mentioned SSL library utilizes OpenSSL for performing underlying
cryptographic operations.
The utilization of these three components in our project is depicted in Figure 4.22.
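The exact code is the one shown in Figures 4.21 and 4.22 and is not reproduced here; the sketch below is a simplified, hypothetical reconstruction of how Requests, BeautifulSoup, a regular expression, and the ssl/socket modules can be combined for this kind of extraction. The domain name and the ID patterns (UA-/G- for Google Analytics, pub- for the AdSense publisher ID) are assumptions made for illustration:

# Hedged sketch: fetch a page, extract tracking IDs with a regex, read the TLS certificate
import re
import ssl
import socket
import requests
from bs4 import BeautifulSoup

domain = "example.com"  # placeholder; the real domains come from the scraped company list
html = requests.get(f"https://{domain}", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print("Page title:", soup.title.get_text(strip=True) if soup.title else "N/A")

# Assumed ID patterns
ga_id = re.search(r"UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12}", html)
pub_id = re.search(r"pub-\d{10,20}", html)
print("Google Analytics ID:", ga_id.group(0) if ga_id else "N/A")
print("Publisher ID:", pub_id.group(0) if pub_id else "N/A")

# SSL/TLS certificate details (issuer, validity dates) via the ssl and socket modules
ctx = ssl.create_default_context()
with socket.create_connection((domain, 443), timeout=10) as raw_sock:
    with ctx.wrap_socket(raw_sock, server_hostname=domain) as tls_sock:
        cert = tls_sock.getpeercert()
print("Issuer:", dict(item[0] for item in cert["issuer"]))
print("Certificate valid from:", cert["notBefore"])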
• DNS.Resolver: Integrated within the Python DNS (Domain Name System) library, this tool
has provided us with the capability to automate DNS queries. This resource has enabled us to
extract SPF and DMARC records, as illustrated in Figure 4.23.
• IPWHOIS: We can retrieve information regarding IP addresses using the WHOIS protocol,
which is used to query databases containing data about Internet resources such as domain names
and IP address allocations. We were able to extract comprehensive details about the entity or
organization that owns a specific IP address, including contact information and registration
details, as illustrated in Figure 4.24.
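Again, the exact implementation is the one illustrated in Figures 4.23 and 4.24; the following is a hedged sketch of how dns.resolver (from the dnspython package) and IPWHOIS can be used for these lookups, with example.com as a placeholder domain:

# Hedged sketch of SPF/DMARC lookup and WHOIS enrichment (placeholder domain)
import socket
import dns.resolver           # dnspython
from ipwhois import IPWhois   # ipwhois

domain = "example.com"

def txt_records(name):
    """Return the TXT records of a DNS name, or an empty list if the lookup fails."""
    try:
        return [r.to_text() for r in dns.resolver.resolve(name, "TXT")]
    except Exception:
        return []

# SPF records live in the domain's TXT records; DMARC records under the _dmarc subdomain
spf = [r for r in txt_records(domain) if "v=spf1" in r]
dmarc = [r for r in txt_records(f"_dmarc.{domain}") if "v=DMARC1" in r]
print("SPF record:", spf or "N/A")
print("DMARC record:", dmarc or "N/A")

# RDAP/WHOIS lookup on the IP address behind the domain
ip = socket.gethostbyname(domain)
rdap = IPWhois(ip).lookup_rdap()
print("IP owner:", rdap.get("network", {}).get("name"))
print("Country:", rdap.get("asn_country_code"))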
As previously mentioned, libraries have been employed to carry out HTTP requests. In our scenario,
they are utilized in combination to efficiently extract data from websites as follows:
• Phones: The phone numbers through which the company can be contacted.
• Verification Date: The date when the data was verified or extracted.
• Copyright year: The year when the company’s website content was copyrighted.
• Copyright owner: The entity or person owning the copyright of the website content.
• Issuer: The entity that issued the SSL/TLS certificate for the website.
• Cert Country: The country where the SSL/TLS certificate was issued.
• Cert start date: The start date of the SSL/TLS certificate’s validity.
• Schema type: The type of schema used for structured data on the website.
• Google Analytics ID: The unique Google Analytics ID linked to the website.
• Spf record: The Sender Policy Framework (SPF) record for the website’s domain.
• Facebook: The company’s profile URL on the Facebook social media platform.
• Twitter: The company’s profile URL on the Twitter social media platform.
• Instagram: The company’s profile URL on the Instagram social media platform.
• Skype: The Skype username or profile URL associated with the company.
• WhatsApp: The contact information (e.g., phone number) for the company on WhatsApp.
This extensive dataset provides a comprehensive representation of a company’s features and online
presence. It encompasses a range of informative fields that address various aspects of the business.
Figure 4.25 depicts the home page and the leads listing interface from which the user can search for
leads and navigate the web app:
4.5 Conclusion
In this chapter, we have presented the initial version of our solution, consisting of two iterations.
For each iteration, we began by introducing the product roadmap. Subsequently, we showcased the
various functionalities through visual representations and provided textual descriptions of specific
use cases. Lastly, we developed the graphical interfaces. This comprehensive approach allowed
us to gather extensive insights from the web landscape, enabling us to collect crucial information
relevant to our research objectives. The data acquired from various web portals and websites will
be systematically integrated into the next chapter, further enhancing our understanding of industry
trends, competitive landscapes, and market dynamics.
Chapter 5
Second Release
5.1 Introduction
In line with the same approach as the first version, we commence by presenting the second version
based on the product backlog for each sprint. The development of a model involves a series of well-
defined steps that are crucial for project success. In this chapter, we will introduce the various stages of
the process used. Furthermore, we will discuss and compare the implementation of two segmentation
models.
For each sprint, we will present its sprint backlog, and an analysis will be conducted to illustrate
the interfaces developed.
5.4 Implementation
5.4.1 Data Identification
The data was acquired from different web portals and websites, which we harvested through web scraping.
This extensive dataset comprises over 11,000 entries, each representing a company’s diverse
attributes and online presence. The dataset includes a wide range of fields, encompassing company
information such as industry, size, location, and founding year. With its rich and varied dimensions,
this dataset poses a unique challenge for unsupervised machine learning segmentation. By employing
advanced clustering techniques, we aim to unearth hidden patterns, groupings, and trends within
this unlabelled data, ultimately revealing valuable insights into the complex landscape of companies’
online identities and characteristics.
• Anaconda (Figure 5.1, taken from [13]) stands as a free and open-source distribution of the Python and R programming languages. It is employed for crafting applications tailored to data science and machine learning, encompassing large-scale data processing, predictive analysis, and scientific computing. The aim is to streamline package management and deployment.
• Jupyter notebooks (Figure 5.2, taken from [14]) are electronic notebooks capable of assembling text, images, mathematical formulas, and executable code in a single document. They can be manipulated interactively within a web browser and were originally designed for the Julia, Python, and R programming languages.
5.4.2.2 Libraries
• NumPy is a Python library that provides support for arrays and matrices, and an extensive
collection of mathematical functions. This library is instrumental for performing mathematical,
logical, and statistical operations, making it a cornerstone in data analysis, scientific computing,
and machine learning workflows.
• Matplotlib is a versatile plotting library in Python, with which we can create various types of
static, interactive, and animated visualizations. With Matplotlib, users can generate 2D and 3D
plots, histograms, scatter plots, and more, making it an ideal choice for data visualization, and
aiding in the effective communication of insights derived from data analysis.
• Yellowbrick’s Cluster visualizers are a part of the Yellowbrick library, designed to enhance the
understanding and tuning of clustering algorithms. These visualizers allow users to evaluate
clustering models, explore cluster tendencies, and assess the ideal number of clusters. By
providing insightful visualizations, Yellowbrick Cluster simplifies the process of identifying
meaningful patterns within data.
This stage of our work involves handling anomalies within the data, which may or may not pass through the integration process. Techniques like fillna('0') and replace('N/A') are applied. The same goes for eliminating outliers: data points or observations that significantly differ from the rest of the dataset. Outliers are unusual or exceptional values that deviate from the typical pattern or distribution of the data, as shown in the boxplot in Figure 5.3.
In our project, the primary purpose of feature engineering is to extract distinctive attributes from raw
data, thereby enhancing the representation of the underlying problem for predictive models.
To begin with, this step involves the selection and extraction of relevant features from the dataset,
which make a substantial contribution to the analysis task. Based on this, we have generated the
following features:
Next, the correlation matrix in Figure 5.4 reveals the relationships between variables in a dataset.
The matrix in the previous figure offers a visual representation, aiding our interpretation by selecting relevant features, reducing dimensionality, providing insights into segments, and avoiding redundancy. Based on the visuals, and with further reference to business relevance, we selected our variables, whose distributions are shown in the plots in Figures 5.5, 5.6, and 5.7:
1. The first histogram in Figure 5.5 reveals a significant number of companies in the retail industry compared to the rest of the industries.
2. The second histogram in Figure 5.5 reveals that most companies do not disclose their location, so the majority of them are not assigned to a location; this feature will nevertheless come in handy when understanding the distribution of the companies on the map.
3. The third histogram in Figure 5.5, analyzing the companies' employee numbers on LinkedIn, reveals a decline in the number of companies as the number of employees grows.
1. The fourth histogram in Figure 5.6 reveals a significant number of companies that have neither DMARC nor SPF records, while almost a third of them have both records.
2. The fifth histogram in Figure 5.6 reveals that most websites are divided between responsive and unassigned ones; after further inspection, those not assigned a positive responsiveness will be considered negative.
3. The sixth histogram in Figure 5.6, analyzing the number of contact panels listed on a single website, shows that websites with 0 contact panels form the majority, while the count decreases as the number of contact panels increases.
1. The seventh histogram in Figure 5.7 analyses the number of websites that have an SSL or TLS certificate.
2. The eighth histogram in Figure 5.7 shows how those certificates are distributed between expired ones and those still valid.
3. The ninth and final histogram in Figure 5.7 analyses the number of websites that have a Google Analytics ID, an AdSense ID, or both.
Finally, the data is transformed to make it suitable for analysis, including handling variables with skewed distributions using techniques like logarithmic or power transformations.
• Categorical variables like 'Founded', 'Size', 'Location', and 'Industry' are encoded into numerical form so that they can be fed to the clustering algorithms.
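The exact encoder is not specified above; as an illustration, the sketch below assumes a simple label encoding for the categorical columns, a logarithmic transform for the skewed counts, and the StandardScaler described in Appendix B. The file name and column names are placeholders:

# Hedged preprocessing sketch; names are illustrative assumptions
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("leads_clean.csv")  # placeholder for the cleaned dataset

categorical_cols = ["Founded", "Size", "Location", "Industry"]
for col in categorical_cols:
    # Assumption: label encoding; one-hot encoding would be an alternative choice
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Reduce the skew of heavy-tailed counts with a log transform (log1p keeps zeros valid)
df["EmployeesOnLinkedin"] = np.log1p(df["EmployeesOnLinkedin"])

# Standardize the selected features so that no single scale dominates the clustering
features = categorical_cols + ["EmployeesOnLinkedin", "ContactPanels"]
X = StandardScaler().fit_transform(df[features])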
1. Agglomerative: This bottom-up strategy starts by treating individual data points as separate
clusters. Gradually, clusters are merged iteratively based on proximity.
2. Divisive: In contrast, the divisive approach takes a top-down stance. It starts with all data points
in a single cluster and then progressively divides clusters into smaller ones.
Within the AHC algorithm, our process initiates with every data point forming an independent cluster. The algorithm proceeds by iteratively merging the closest clusters. A notable advantage is that this approach doesn't require prior knowledge of the expected number of clusters.
A crucial element in hierarchical clustering lies in determining the distance between clusters. Various linkage techniques exist, each of which calculates the distance between clusters differently. The choice of linkage can significantly impact the clustering results, which is why we have created the following dendrograms with different linkage methods to explore our best options (a minimal sketch of how such dendrograms can be generated follows the list below):
• Ward's Linkage: This linkage method tends to form clusters with minimal within-cluster variance, promoting consistency and robustness in the resulting clusters. The stability of Ward
linkage stems from its focus on optimizing the within-cluster variance, which leads to well-
defined and interpretable clusters that are less sensitive to noise and outliers as shown in Figure
5.10.
• Average Linkage
As we traverse the dendrogram, clusters gradually merge based on the average distance between
their data points. The approach depicted in Figure 5.12 strikes a balance between sensitivity to
outliers and cluster compactness. It can prove beneficial when dealing with data that exhibits
varying cluster sizes and shapes.
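The dendrograms themselves were produced from our engineered feature matrix; the sketch below shows, on placeholder data, how such dendrograms can be generated with SciPy for the different linkage methods:

# Hedged dendrogram sketch; X stands in for the standardized feature matrix
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(200, 6))  # placeholder data

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, method in zip(axes, ["ward", "average", "single"]):
    Z = linkage(X, method=method)          # successive merges under the chosen linkage
    dendrogram(Z, ax=ax, no_labels=True)   # the cut height suggests the number of clusters
    ax.set_title(f"{method.capitalize()} linkage")
plt.tight_layout()
plt.show()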
The selection of three clusters was a judicious decision that stemmed from a comprehensive analysis
of multiple factors. This multi-faceted approach encompassed the utilization of silhouette analysis,
dendrogram exploration, and a meticulous alignment with our business requirements.
Silhouette analysis, a rigorous metric, played a pivotal role in evaluating the quality of cluster
formations. Through this analysis, we gained insights into the cohesion and separation of data points
within clusters. Our objective was to identify a configuration that exhibited well-defined, internally
homogeneous clusters while maintaining distinct boundaries between them as shown in Figure 5.13.
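The metrics reported in the interpretation below can be computed as in the following sketch; the placeholder data stands in for our engineered feature matrix:

# Hedged evaluation sketch for the agglomerative model
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(500, 6))  # placeholder data

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print("Silhouette Score:", silhouette_score(X, labels))
print("Calinski-Harabasz Index:", calinski_harabasz_score(X, labels))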
Interpretation:
• Silhouette Score of 0.446 indicates that the clusters are reasonably distinct and have a moderate
level of cohesion.
• Calinski-Harabasz Index value of 11104.259 implies that the clustering model has managed to
create clusters that are highly separated and compact, indicating a strong quality of clustering.
Both of these evaluation metrics, the Silhouette Score and Calinski-Harabasz Index, indicate that
the Agglomerative Clustering model has performed well in creating distinct and cohesive clusters for
the given dataset. The higher values of these metrics suggest that the clusters are meaningful and
well-defined, providing valuable insights into the data’s underlying structure which are displayed in
the following tables 5.2 to 5.8:
In conclusion, the selection of three clusters was a comprehensive endeavor, combining statistical
rigor with a keen awareness of our business landscape, ultimately leading to a cluster configuration
that resonates with both analytical and business efficacy.
This algorithm operates on the foundation of centroids, where each cluster is associated with a
centroid. The central objective of this algorithm is to minimize the cumulative distances between data
points and their respective clusters.
The procedure ingests untagged datasets as input, segments the dataset into ’k’ clusters, and
iterates the process until it achieves optimal clusters. The value of ’k’ is predetermined within this
algorithm.
The essence of the k-means clustering algorithm revolves around two primary functions:
1. Ascertaining the most suitable value for ’k’ center points or centroids through an iterative
progression.
2. Assigning every data point to its nearest k-center. These data points in proximity to a specific
k-center coalesce to form a distinct cluster.
Consequently, each cluster encompasses data points exhibiting shared attributes, distinguishing
itself from other clusters.
This technique utilizes the principle of WCSS (Within Cluster Sum of Squares) value. WCSS
quantifies the aggregate variations confined within a cluster. The mathematical expression to
determine the WCSS value is:
\[
\mathrm{WCSS} = \sum_{P_i \in \text{Cluster } 1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster } 2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster } 3} \mathrm{distance}(P_i, C_3)^2
\]
where :
• WCSS: Within Cluster Sum of Squares, a measure of the total variations within the clusters.
• distance(Pi , C j ): The distance between data point Pi and the centroid C j of the corresponding
cluster ( j indicates the cluster number).
To ascertain the most suitable cluster count, the elbow methodology adheres to the subsequent
steps:
1. It performs K-means clustering on a provided dataset, varying the value of K (ranging from 1 to 10).
2. For each value of K, the WCSS value is computed.
3. A graph is generated, depicting the relationship between computed WCSS values and the count of clusters (K).
4. The inflection point or the juncture resembling an arm on the plot designates the optimal K
value.
The visual representation of the elbow method takes a form analogous to a bent arm. Since the graph in Figure 5.15 exhibits a distinct curvature resembling an elbow, our optimal number of clusters for the K-means algorithm is 4.
• Silhouette Score of 0.461 indicates that the clusters are reasonably distinct and have a moderate
level of cohesion.
• Calinski-Harabasz Index value of 15433.048 implies that the clustering model has managed to
create clusters that are highly separated and compact, indicating a strong quality of clustering.
Both of these evaluation metrics, the Silhouette Score and the Calinski-Harabasz Index, indicate that the K-means model has performed well in creating distinct and cohesive clusters for the given dataset. The higher values of these metrics suggest that the clusters are meaningful and well-defined, providing valuable insights into the data's underlying structure, which are displayed in Tables 5.9 to 5.15:
ResponsiveOrNot Feature
Cluster 0 Cluster 1 Cluster 2 Cluster 3
Category 0 48.46% 48.42% 47.66% 49.81%
Category 1 1.60% 1.56% 1.61% 1.12%
Category 2 49.94% 50.01% 50.73% 49.07%
Figure 5.18 displays the digital maturity assessment form, through which users can conduct individual
assessments.
Figure 5.19 presents the results of the digital maturity assessment interface and the prediction
history.
Figure 5.19: Interface for Digital Maturity Prediction Results and History
Figure 5.20 displays the digital maturity assessment form interface for administrators to conduct
individual assessments.
Figure 5.21 presents the results of the digital maturity assessment interface for administrators and
the prediction history.
5.7 Conclusion
In this chapter, we initially created multiple clustering models and subjected them to specific
performance evaluation measures. Finally, we selected the top-performing model based on these
evaluations. Next, we developed a feature to conduct individual assessments of companies’ digital
maturity.
Chapter 6
Third Release
6.1 Introduction
Utilizing the same foundational concept as the second release, we commence by introducing Release 3, which is constructed from the product backlog of every sprint. This section focuses on the subsequent stage within the GIMSI approach, helping us address the question: what actions are necessary? Accordingly, we will start by outlining the structure of our solution, establishing the goals, selecting the benchmarks, creating our framework, and ultimately showcasing prototype dashboard illustrations during the final stride of this stage.
ID User Story
TS1 As a developer, I need to create dashboards that align with the customer’s
requirements.
US15 As a user, I desire enhanced data visibility through interactive decision-
support dashboards.
Our system’s functional structure comprises several stages: gathering source data via web
scraping, transforming, and loading the collected data into a data warehouse for compatibility with
analysis and visualization tools. The data warehouse becomes prepared for analysis, which involves
extracting valuable insights through OLAP cubes from stored data. Results are communicated
effectively using interactive dashboards, reports, and visualizations to assist decision-makers in
comprehending data-driven trends and conclusions.
Table 6.4 illustrates the KPI identification of the Digital data mart :
Table 6.5 illustrates the KPI identification of the Company data mart :
In the dashboards shown in Figures 6.6, 6.7, and 6.8, we present the KPIs related to Fact Digital, where we analyze the digital presence of the companies by the number of social media platforms, analytics or traffic IDs, and website responsiveness.
2. QlikView and Qlik Sense: are data discovery and visualization tools that offer in-memory data
processing and associative data modeling.
4. Google Data Studio: a free data visualization tool that connects to various data sources, including data warehouses, to create customizable dashboards and reports.
Each of these tools offers a range of features and capabilities for data warehouse visualization.
However, Power BI remains a popular choice due to its ease of use, integration with Microsoft
technologies, and extensive community support. The choice of tool ultimately depends on factors
like the complexity of data, budget, scalability requirements, and user preferences.
6.7.3 Ecosystem
The Microsoft tools collectively form a robust ecosystem that enables us to manage databases,
perform data integration and transformation, conduct in-depth data analysis, and present data insights
effectively through interactive visualizations and reports.
6.7.3.1 SSMS
SSMS (Figure 6.9 taken from [15]), an acronym for SQL Server Management Studio, is a GUI
software developed by Microsoft, enabling tasks like database creation, SQL query execution, object
design, server monitoring, and data backup.
6.7.3.2 SSIS
SQL Server Integration Services, or SSIS (Figure 6.10 taken from [16]), stands as Microsoft’s data
integration and workflow solution within the SQL Server toolkit. This platform facilitates the creation,
deployment, and oversight of data integration and transformation undertakings. By allowing users
to draw data from diverse origins, mold it into preferred structures, and then deliver it to designated
systems or repositories, SSIS accommodates intricate integration demands, including data refinement,
migration, and ETL operations.
6.7.3.3 SQL Server Agent
An integral element of Microsoft SQL Server, SQL Server Agent facilitates the automation of tasks and jobs by offering a scheduling system. Empowering users to effectively manage a range of operations, it is indispensable for streamlining routine database maintenance and repetitive assignments, leading to enhanced database dependability and performance.
6.7.3.4 SSAS
SSAS, short for SQL Server Analysis Services (Figure 6.11 taken from [17]), is a potent data tool from
Microsoft. It empowers users to craft and handle OLAP cubes, facilitating profound data analysis.
6.7.3.5 PowerBI
1. Power BI Desktop (Figure 6.12 taken from [18]) is a standalone application that serves as
the authoring tool for Power BI. It allows users to create more complex and sophisticated
data models, reports, and visualizations compared to the browser-based Power BI Service.
Power BI Desktop provides advanced data manipulation capabilities, supports the creation of
calculated measures and columns, and enables users to design and refine their data models
before publishing them to the Power BI Service.
2. Power BI Service is the cloud-based service offered by Microsoft for sharing, collaborating,
and consuming Power BI reports and dashboards. It allows users to publish their Power BI
Desktop reports to the cloud and securely share them. Power BI Service offers more features
like embedding reports into websites, data-driven alerts, and access to real-time data insights.
data finds its place in a designated target destination, a data warehouse. This phase plays a pivotal
role in ensuring that the data used in our project is consistent, accurate, and, most importantly, ready
for use and actionable insights.
6.8.2 Extraction
The extraction stage serves as the initial phase of our ETL process (extraction, transformation,
loading). This step is crucial to maintain the integrity of our extracted data and to prevent errors
or inconsistencies in subsequent process stages.
As shown in Figure 6.13, initially, we implemented the Staging area which is an intermediate location
or environment where data and files are temporarily stored, processed, or prepared before they are
moved to their final destination or utilized for further processing. Our staging area consolidates newly
added raw data from the source.
To achieve this, we have implemented a script containing a truncate command for all tables. We
employed the Sequence Container component, an organizational unit that groups all arranged tasks.
This control container acts as a conductor, orchestrating the order of task execution within an SSIS
package.
6.8.3 Transform
In the transformation phase we aim to ensure data quality and accessibility. This involves basic
cleaning, removing any duplicates, and handling empty values. These transformations ensure data
consistency.
Figure 6.14 illustrates the loading into the Operational Data Store (ODS), a critical step in the data management process. It involves loading the extracted data into an operational data store.
During the next step (illustrated in Figure 6.15), we select the necessary data and perform the data transformation process, which is not very complicated since we do not have many attributes. For this, we used the following components:
• Sort: We utilized this technique to arrange data rows in ascending order according to the
”Linkedin” column. It also facilitated the elimination of duplicates from this organized output.
• Derived Column: In our context, this element played a role in modifying existing columns by substituting empty values with "NA".
• Slowly changing Dimension: This mechanism effectively detects and handles changes within
dimensions. It accomplishes this by updating shifting attributes, inserting fresh records for new
members, and retaining prior records for members that remain unchanged.
6.8.4 Load
The third and final phase of our ETL process involves loading the previously extracted and
transformed data into their new storage location, namely the data warehouse. This phase transfers
the data to its ultimate destination.
The procedure of populating fact tables within the framework of galaxy modeling involves the
amalgamation of three distinct star models. This process enhances the comprehensive representation
of interconnected data relationships, ultimately contributing to a more informative galaxy model
illustrated in Figure 6.16.
As illustrated in Figure 6.17, and similarly to Fact Digital, the rest of the fact tables go through the same steps. Our process of loading the 3 fact tables occurs in two main stages: first, we must prepare our data source for populating the dimensions, and then we load the fact tables. This data source is a stored procedure that performs joins between all dimensions and measures.
The procedure of populating our fact tables encompassed several stages within the ETL
methodology:
1. Data Extraction: The fact table data was drawn from the data source, which in our instance
was a flat file. We employed the Merge join component to amalgamate data from two separate
source files.
2. Data Transformation: The data underwent modifications in alignment with analytical requisites.
This involved employing the Aggregate function, as well as repeated instances of sorting and
joining. Following the integration of our data sources, we employed the lookup operation to
establish associations connecting the fact table with its corresponding dimension tables.
3. Data Loading: The transformed data was incrementally introduced into the data warehouse’s
fact table using a lookup component, effectively preventing the duplication of any pre-existing
data.
Our SSIS package was successfully deployed onto the designated SQL Server instance, a process
vividly depicted in Figure 6.18. The package is systematically stored within the SQL Server
Integration Services Catalog, serving as a centralized hub for administering and housing SSIS
packages. By deploying the package in this manner, we ensure its centralized accessibility and
executable nature.
The incorporation of job scheduling involves the deployment and automation of ETL updates. This
automation occurs through a structured sequence of three stages, executed in synchronization using
SQL Server Agent: the initial stage involves staging area loading, followed by ODS loading, and
culminating in data warehouse loading as shown in Figure 6.19.
To align with the frequency of web scraping cycles, the scheduling occurs daily at midnight, or it can be initiated manually as shown in Figure 6.20.
We present in Figures 6.21 and 6.22 the execution of the configured SQL Agent job, the successful execution of our SSIS package, and the report provided.
6.10.1 Dashboards
• The dashboard in Figure 6.27 presents Dashboard 1: Companies Profiling. We used several visualisations:
– Filters: to filter the visuals and the measures of the analysis axis.
– DAX Measures: to quantify the numbers of the analysis axis.
– Filters: to filter the visuals and the measures of the analysis axis by the number of contact panels.
– KPI Measures: to quantify the numbers of the analysis axis.
– Pie Chart: to visualize the number of companies by website responsiveness and the GAID.
– Stacked Bar Chart: to visualize the number of companies by the number of social media
of each company.
– Tables: to list the IP addresses by country, and the number of certificates issued by each
organization.
– KPI Measures: to quantify the numbers of the analysis axis, for example:
CountSchemaTypes = DISTINCTCOUNT(FactNetwork[SchemaType.SchemaType])
– Donut Chart: to visualize the number of websites by schema type, and the number of
issuers to each organization.
– Bar Chart: to visualize types of TLS protocols.
All these dashboards are shared on the Power BI service, allowing the customer to access them
through our web application. This enables the visualization of all KPIs conveniently in a unified
location.
6.11 Conclusion
In this chapter, we have delved into the various steps involved in crafting our decision-making solution
and the technical tools employed for this purpose. Visual representations in the form of screenshots
have been incorporated to illustrate the prototypes employed in our project. Additionally, we have
comprehensively examined the phases of integration and analysis, seamlessly integrating screenshots
to depict the interfaces developed within the different facets of our solution.
Perspectives
In the realm of our project’s future endeavors, we envision a path marked by innovation and
advancement. These forthcoming steps are poised to enhance our project’s capabilities and impact in
profound ways:
Our first objective is to implement the scraping code on a cloud-based virtual machine.
This strategic move will not only ensure scalability but also allow us to select the most suitable
cloud provider for our specific needs. Data storage will also undergo transformation in the cloud
environment, facilitating more efficient and accessible data management.
Taking a step further, we plan to migrate our data warehousing and data pipeline operations to
the cloud. This transition will provide us with expanded resources and capabilities for data analysis,
particularly focusing on employee data from various companies. By leveraging cloud infrastructure,
we aim to bolster our data processing capabilities, enabling us to derive deeper insights from the
extensive datasets we gather.
Our web application is poised for refinement and enrichment. To enhance user experience,
we intend to introduce advanced features such as detailed filtering and search functionality. This
will empower users to precisely tailor their data queries, facilitating more insightful exploration.
Additionally, we plan to incorporate an export function for prospect data, enabling users to extract
valuable information for further analysis.
As part of our project’s evolution, we have plans to optimize the evaluation processes.
This optimization will include the introduction of a profile management module, allowing users to
efficiently organize and track their interactions. Furthermore, we are preparing the groundwork for
the integration of a payment module, which will serve as a pivotal component for future monetization
strategies.
These prospective developments represent the natural evolution of our project, aiming to elevate
its functionality, accessibility, and usability. By embracing cloud technologies, refining our web
application, and introducing advanced features, we are poised to deliver a more comprehensive and
powerful tool for decision-makers and data analysts.
Conclusion
Embarking on the journey of our project, we venture into the realm of innovation and insight. In
this phase, we initiate the design process, a crucial juncture where the foundation for a functional and
impactful system is laid. Through this introduction, we unveil the roadmap that guides the creation of
a comprehensive decision support system, harmonizing the intricate dance of machine learning, web
scraping, and business intelligence.
At the heart of our endeavor lies the design blueprint, a strategic map that orchestrates the
arrangement of integral components. This blueprint is a guiding light, directing the interactions,
functionalities, and flow of information within the web application. This model not only shapes how
information is structured but also governs how it’s stored and managed. The synergy between data
and design forms the bedrock upon which our web app stands, enabling seamless user experiences
and insightful data exploration.
Enabling this intricate ecosystem is an ensemble of cutting-edge technologies and development
tools, each a brushstroke on the canvas of innovation. We unravel these tools, presenting a tableau that
brings together the prowess of machine learning algorithms, the finesse of web scraping techniques,
and the precision of business intelligence methodologies. These tools not only amplify the user
experience but also empower decision-makers with the ability to extract actionable insights from a
sea of data.
Venturing deeper, we delve into the meticulous steps of loading and constructing data warehouses,
modern repositories where information finds its sanctuary. These repositories are more than just
storage; they are engines of analysis, driving the generation of meaningful dashboards and insightful
reports. As we meticulously detail these steps, the essence of transforming raw data into consumable
knowledge comes to life.
Bibliography
[1] https://medium.com/thetechieguys/crisp-ml-q-.
[2] https://mentari-er.medium.com/membuat-rancangan-data-warehouse-classic-model-c22e46ccfbeb.
[3] https://bennyaustin.com/2010/05/02/kimball-and-inmon-dw-models/.
[4] https://www.javatpoint.com/machine-learning.
[5] https://www.javatpoint.com/supervised-machine-learning.
[6] https://nixustechnologies.com/unsupervised-machine-learning/.
[7] https://medium.com/@khang.pham.exxact/top-10-popular-data-science-algorithms-and-examples-part-1-of-2-52fc14604dd9.
[8] https://medium.com/@medunoyeeni/django-the-fun-part-understanding-the-framework-1bb4df54ab1f.
[9] https://www.upwork.com/en-gb/services/product/development-it-a-website-in-python-django-1371024921383915520.
[10] https://logos-world.net/javascript-logo/.
[12] https://wolfgang-ziegler.com/blog/note-taking-with-github-and-vscode.
[13] https://www.anaconda.com/.
[14] https://jupyter.org/.
[15] https://www.ubackup.com/enterprise-backup/sql-management-studio-backup-fhhbj.html.
[16] https://www.sarjen.com/ssis-advantages-disadvantages/.
[17] https://ramkedem.com/en/ssas-2/.
[18] https://logohistory.net/power-bi-logo/.
[19] Etl or elt: The evolution of data delivery. QlikTech International AB. (2022).
[20] Builtwith, 2023.
[23] Aitken, A., I. V. Comparative analysis between traditional software engineering and agile
software development, 4749- 4752. System Sciences International Conference (2013).
[24] Chen, N. Research on e-commerce database. marketing based on machine learning algorithm,
337-340. Computational Intelligence and Neuroscience (2022).
[25] Gunawan, R., R. A. D.-I. F. F. Comparison of web scraping techniques, 1-5. Conference on
System Engineering and Industrial Enterprise (2019).
[28] Khder, M. Web scraping or web crawling: State of art, techniques, approaches ,and application,
1-25. Advances in Soft Computing and its Applications (2021).
[29] Krafft, M., M. C. Data-driven marketing and its impact on customer engagement, 119–136.
[30] Laxmi Priya, V., H. K. Implementing lead qualification model, using icp, 81-90. Capital
Markets: Market Efficiency eJournal (2020).
[32] Nygård, R. Mezei, J. Automating lead scoring with machine learning: An experimental study,
1-10. International Conference on System Sciences (Jan. 2020).
[33] Peng, J., E. C. Machine learning techniques : Applications and challenges, 3299–3348.
Frontiersin.
[34] Piccialli, F., C. G. Decision making through unsupervised learning, 27-35. IEEE Intelligent
Systems (2020).
[35] Ramakrishnan, G., J. S. Automatic sales lead generation from web data, 100-101. 22nd
International Conference on Data Engineering (ICDE’06). IBM India Research Lab (2006).
[36] Ranjan, J. Business intelligence: Concepts, components, techniques and benefits, 61 - 68.
Journal of Theoretical and applied information technology (2009).
[38] Zhi, Z, R. H. A.-p. L. Research on referral service and big data mining for e-commerce with
machine learning, 35-38. Conference on Computer and Technology Applications (ICCTA)
(2018).
[39] Świeczak, W., W. Lead generation strategy as a multichannel mechanism of growth of a modern
enterprise, 105 - 140. Marketing of Scientific, and Research Organizations (2016).
Appendix A
Evaluation Metrics
• a: Average distance from a data point to other points within the same cluster.
• K: Number of clusters.
Appendix B
Optimization Methods
B.0.2 StandardScaler
StandardScaler is a normalization technique used to standardize numerical features. It scales the
features to have a mean of 0 and a standard deviation of 1. This ensures that all features contribute
equally to the learning process, preventing features with larger scales from dominating the model.
Mathematical Formula:
\[
\text{Standardized Value} = \frac{x - \mu}{\sigma}
\]
• x: Original feature value.
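scikit-learn's StandardScaler applies this formula column-wise; a minimal sketch with illustrative values:

# Minimal StandardScaler sketch (illustrative values only)
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X)
print(scaler.mean_, scaler.scale_)   # the estimated mu and sigma
print(scaler.transform(X).ravel())   # the column now has mean 0 and standard deviation 1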
ABSTRACT
Our proposed solution, the Intelligent Lead Generation, is a system that generates B2B leads
and evaluates them using machine learning techniques. It also provides analytical reports and
visualizations to assist the Sales/Marketing team in their decision-making process through the
integration of business intelligence.
Keywords: lead generation, B2B, machine learning, analytical reports, visualizations, business
intelligence integration.
RÉSUMÉ
Notre solution proposée, la Génération Intelligente de Leads, est un système qui génère des prospects
B2B et les évalue à l’aide de techniques d’apprentissage automatique. Elle fournit également des
rapports analytiques et des visualisations pour assister l’équipe de Ventes/Marketing dans leur
processus de prise de décision en utilisant l’intégration de l’intelligence d’entreprise.