Dataset Usage Guidelines Dataset Overview: Data Splits

Uploaded by

chethan.work36

Dataset Usage Guidelines

Dataset Overview
This hackathon presents a rich, unstructured text dataset of approximately 1.56 lakh
(about 156,000) rows. The primary challenge is to classify each raw text description first
into a subcategory and then into a category. While the names of the categories and
subcategories are provided, participants are expected to explore the dataset to understand
and define the meaning and context of each one, and to ensure that these definitions are
aligned with Government of India rules and regulations.
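The two-level labelling described above can be sketched as a lookup from subcategory to category. The mapping below is an illustrative guess only: this document lists the names but does not state which subcategory belongs to which category, so every entry should be confirmed against the labeled data. The names `SUBCATEGORY_TO_CATEGORY` and `category_for` are hypothetical.

```python
from typing import Optional

# Illustrative guesses only: the guidelines list names but not the
# subcategory-to-category assignment, so verify each entry against the data.
SUBCATEGORY_TO_CATEGORY = {
    "Debit/Credit Card Fraud": "Financial Fraud Crimes",
    "UPI-Related Frauds": "Financial Fraud Crimes",
    "Ransomware": "Other Cyber Crime",
}

def category_for(subcategory: str) -> Optional[str]:
    """Return the parent category, or None when the subcategory is unmapped."""
    return SUBCATEGORY_TO_CATEGORY.get(subcategory.strip())
```

Returning `None` for an unmapped name, rather than a default category, keeps gaps in the mapping visible during exploration.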

The dataset is entirely raw, and no prior pre-processing has been performed. Participants
will have full control over the entire pipeline, from text cleaning and preparation to feature
engineering and model development. The unstructured nature of the data offers a range of
challenges that must be addressed by the participants, such as:

• Handling messy text, including typos, inconsistencies, or abbreviations

• Addressing potential ambiguities in the descriptions

• Managing imbalances between categories and subcategories

• Privilege escalation.
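As a minimal sketch of handling messy text, the function below lowercases a description, expands a few abbreviations, and strips punctuation. The abbreviation table is hypothetical and not taken from the dataset; real entries would come from exploring the descriptions.

```python
import re
import unicodedata

# Hypothetical abbreviation table; build the real one from the data.
ABBREVIATIONS = {
    "a/c": "account",
    "acc": "account",
    "txn": "transaction",
    "otp": "one time password",
}

def normalize(text: str) -> str:
    """Lowercase, fold Unicode variants, expand abbreviations, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Split on whitespace first so tokens such as "a/c" survive until lookup.
    tokens = [ABBREVIATIONS.get(tok.strip(".,;:!?"), tok) for tok in text.split()]
    text = " ".join(tokens)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Typo correction is deliberately left out here; whether it helps depends on the data and on the model used downstream.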

Data Splits
• Training Set (60%): Labeled data for model development; each row is associated with both a
category and a subcategory.
• Testing Set (20%): Unlabeled data for model evaluation during development.
• Validation Set (20%): Held-back data for unbiased final assessment.
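The organisers' own splitting procedure is not described here. Should participants want to carve a similar 60/20/20 hold-out from their labeled rows, one possible approach is a stratified split, which keeps each label's share roughly equal across the three parts. The function name and the `seed` parameter below are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return three lists of row indices, splitting each label by the given fractions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    parts = ([], [], [])
    for indices in by_label.values():
        rng.shuffle(indices)
        n = len(indices)
        cut1 = round(n * fractions[0])
        cut2 = cut1 + round(n * fractions[1])
        parts[0].extend(indices[:cut1])
        parts[1].extend(indices[cut1:cut2])
        parts[2].extend(indices[cut2:])  # remainder goes to the third part
    return parts
```

Labels with very few rows may end up with no rows in one of the parts, which is worth checking given the imbalances mentioned earlier.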

Participants are encouraged to experiment with a variety of models and techniques, ranging
from traditional methods to state-of-the-art NLP models. There are no limitations on the
choice of models or approaches. Popular techniques like BERT, TF-IDF, or even more
traditional algorithms like Naive Bayes or SVM can be utilized, depending on the participant's
preferred approach to text classification.
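To make the "traditional" end of that range concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over whitespace tokens, written with the standard library only. It is a baseline illustration under simple assumptions (whitespace tokenisation, add-one smoothing), not a recommended or official approach; the class name `NaiveBayesText` is hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The same `fit`/`predict` shape can be applied once for subcategories and once for categories, or a library implementation can be swapped in after the baseline is established.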

Category Names:
• Women/Child Related Crime
• Financial Fraud Crimes
• Other Cyber Crime

Subcategory Names:
• Child Pornography/Child Sexual Abuse Material (CSAM)
• Rape/Gang Rape-Sexually Abusive Content
• Sale, Publishing and Transmitting Obscene Material/Sexually Explicit Material
• Debit/Credit Card Fraud
• SIM Swap Fraud
• Internet Banking-Related Fraud
• Business Email Compromise/Email Takeover
• E-Wallet Related Frauds
• Fraud Call/Vishing
• Demat/Depository Fraud
• UPI-Related Frauds
• Aadhaar Enabled Payment System (AEPS) Fraud
• Email Phishing
• Cheating by Impersonation
• Fake/Impersonating Profile
• Profile Hacking/Identity Theft
• Provocative Speech of Unlawful Acts
• Impersonating Email
• Intimidating Email
• Online Job Fraud
• Online Matrimonial Fraud
• Cyber Bullying/Stalking/Sexting
• Email Hacking
• Damage to Computer Systems
• Tampering with Computer Source Documents
• Defacement/Hacking
• Unauthorized Access/Data Breach
• Online Cyber Trafficking
• Online Gambling/Betting Fraud
• Ransomware
• Cryptocurrency Crime
• Cyber Terrorism
• Any Other Cyber Crime
• Targeted scanning/probing of critical networks/systems.
• Compromise of critical systems/information.
• Unauthorised access to IT systems/data.
• Defacement of websites or unauthorized changes, such as inserting malicious code or
external links.
• Malicious code attacks (e.g., virus, worm, Trojan, Bots, Spyware, Ransomware, Crypto
miners).
• Attacks on servers (Database, Mail, DNS) and network devices (Routers).
• Identity theft, spoofing, and phishing attacks.
• Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks.
• Attacks on critical infrastructure, SCADA, operational technology systems, and wireless
networks.
• Attacks on applications (e.g., E-Governance, E-Commerce).
• Data breaches.
• Data leaks.
• Attacks on Internet of Things (IoT) devices and associated systems, networks, and
servers.
• Attacks or incidents affecting digital payment systems.
• Attacks via malicious mobile apps.
• Fake mobile apps.
• Unauthorised access to social media accounts.
• Attacks or suspicious activities affecting cloud computing systems, servers, software,
and applications.
• Attacks or malicious/suspicious activities affecting systems related to Big Data,
Blockchain, virtual assets, and robotics.
• Attacks on systems related to Artificial Intelligence (AI) and Machine Learning (ML).
• Backdoor attacks.
• Disinformation or misinformation campaigns.
• Supply chain attacks.
• Cyber espionage.
• Zero-day exploits.
• Password attacks.
• Web application vulnerabilities.
• Hacking.
• Malware attacks.

Key Expectations
• Perform Exploratory Data Analysis (EDA) to uncover patterns and insights within the
text.
• Implement text pre-processing strategies, such as cleaning, tokenization, and
normalization.
• Develop models to accurately classify text descriptions into the appropriate
categories and subcategories.
• Focus on the entire model evaluation pipeline, from EDA to the final model’s
performance metrics.
• Participants are encouraged to leverage a variety of models and techniques, from
traditional methods to state-of-the-art NLP models. There are no restrictions on the
choice of models or approaches, as long as they effectively address the task of
classifying the raw text descriptions. Creativity, depth of analysis, and model
performance will be key differentiators.
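The guidelines do not name a specific performance metric. Given the class imbalance noted earlier, one reasonable choice is macro-averaged F1, which weights every class equally regardless of its size. The sketch below assumes that choice; `macro_f1` is a hypothetical helper, not part of any scoring script described here.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

The function expects non-empty inputs of equal length. Reporting it alongside plain accuracy shows whether a model is only doing well on the largest classes.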
