20mis1025 Lab1

This notebook documents the preprocessing steps applied to a network traffic dataset (KDD_Train.csv). It loads and inspects the data, replaces the categorical class labels ('normal', 'anomaly') with numeric values (0, 1), one-hot encodes the categorical features (protocol_type, service, flag), checks for and imputes missing values, and splits the data into feature and target variables for modeling. The dataset has 125,973 rows and 42 columns, with 3 unique protocol types, 70 services, and 11 flags; one-hot encoding increases the column count to 123. Imputation turns out to be unnecessary because there are no missing values. The data is then split into a 122-column feature matrix X and a single-column target y for further analysis.
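For quick reference, the whole pipeline condenses to a few pandas calls. This is a sketch, assuming the /content/KDD_Train.csv path and column names used in the notebook below; note that pd.get_dummies with columns= prefixes each indicator with its source column name, so the encoded column names differ slightly from the notebook's dummy_df output.

import pandas as pd

# Load the training set (same path as in the Colab notebook)
df = pd.read_csv("/content/KDD_Train.csv")

# Encode the class labels: 'normal' -> 0, 'anomaly' -> 1
df.replace(('normal', 'anomaly'), (0, 1), inplace=True)

# One-hot encode the three categorical features (42 -> 123 columns)
df = pd.get_dummies(df, columns=['protocol_type', 'service', 'flag'])

# Split into a 122-column feature matrix and a single target column
X = df.drop(columns='class')
y = df['class']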


Lab1_pandas.ipynb - Colaboratory

#data preprocessing

#importing the libraries
import warnings
import pandas as pd

warnings.filterwarnings('ignore')

# Load the KDD training set
df = pd.read_csv("/content/KDD_Train.csv")

df.shape

(125973, 42)

print(df.shape)
df.head(10)

(125973, 42)
   duration protocol_type     service flag  src_bytes  dst_bytes  land  wrong_fragment  urgent  hot  ...
0         0           tcp    ftp_data   SF        491          0     0               0       0    0  ...
1         0           udp       other   SF        146          0     0               0       0    0  ...
2         0           tcp     private   S0          0          0     0               0       0    0  ...
3         0           tcp        http   SF        232       8153     0               0       0    0  ...
4         0           tcp        http   SF        199        420     0               0       0    0  ...
5         0           tcp     private  REJ          0          0     0               0       0    0  ...
6         0           tcp     private   S0          0          0     0               0       0    0  ...
7         0           tcp     private   S0          0          0     0               0       0    0  ...
8         0           tcp  remote_job   S0          0          0     0               0       0    0  ...
9         0           tcp     private   S0          0          0     0               0       0    0  ...

10 rows × 42 columns

#DATA PREPROCESSING: replace class labels 'normal' -> 0, 'anomaly' -> 1
df.replace(('normal', 'anomaly'), (0, 1), inplace=True)
df.head(10)

   duration protocol_type     service flag  src_bytes  dst_bytes  land  wrong_fragment  urgent  hot  ...
0         0           tcp    ftp_data   SF        491          0     0               0       0    0  ...
1         0           udp       other   SF        146          0     0               0       0    0  ...
2         0           tcp     private   S0          0          0     0               0       0    0  ...
3         0           tcp        http   SF        232       8153     0               0       0    0  ...
4         0           tcp        http   SF        199        420     0               0       0    0  ...
5         0           tcp     private  REJ          0          0     0               0       0    0  ...
6         0           tcp     private   S0          0          0     0               0       0    0  ...
7         0           tcp     private   S0          0          0     0               0       0    0  ...
8         0           tcp  remote_job   S0          0          0     0               0       0    0  ...
9         0           tcp     private   S0          0          0     0               0       0    0  ...

10 rows × 42 columns
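A quick sanity check of the label replacement (a suggested addition, assuming the target column is named 'class', as it is when X and y are built later):

# The class column should now contain only 0 (normal) and 1 (anomaly)
print(df['class'].unique())
print(df['class'].value_counts())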

#CATEGORICAL FEATURES
for column_name in df.columns:
    if df[column_name].dtypes == 'object':
        a = len(df[column_name].unique())
        print(column_name + " has " + str(a) + " unique values.")

protocol_type has 3 unique values.
service has 70 unique values.
flag has 11 unique values.
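These counts also explain the encoded width reported below: each categorical column is replaced by one indicator column per unique value, so the frame grows from 42 to 42 - 3 + (3 + 70 + 11) = 123 columns.

# Expected width after one-hot encoding: drop 3 categorical columns, add 3 + 70 + 11 indicators
print(42 - 3 + (3 + 70 + 11))  # 123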

#CONVERT CATEGORICAL DATA INTO BINARY VARIABLES BY ONE HOT ENCODING
df['protocol_type'].head(5)

0 tcp
1 udp
2 tcp
3 tcp
4 tcp
Name: protocol_type, dtype: object

df['protocol_type'].value_counts()

tcp 102689
udp 14993
icmp 8291
Name: protocol_type, dtype: int64

print(pd.get_dummies(df['protocol_type']).head(5))

   icmp  tcp  udp
0     0    1    0
1     0    0    1
2     0    1    0
3     0    1    0
4     0    1    0

def dummy_df(df):
    todummy_list = ['protocol_type', 'service', 'flag']
    for x in todummy_list:
        # dummy_na=False: NaNs are ignored; if True, a column is added to indicate NaNs
        dummies = pd.get_dummies(df[x], dummy_na=False)
        # Drop the original categorical column and concatenate its indicator columns
        df = df.drop(columns=x)
        df = pd.concat([df, dummies], axis=1)
    return df

#Applying the one-hot encoding function
df = dummy_df(df)
df.head(5)

   duration  src_bytes  dst_bytes  land  wrong_fragment  urgent  hot  num_failed_logins  logged_in  ...
0         0        491          0     0               0       0    0                  0          0  ...
1         0        146          0     0               0       0    0                  0          0  ...
2         0          0          0     0               0       0    0                  0          0  ...
3         0        232       8153     0               0       0    0                  0          1  ...
4         0        199        420     0               0       0    0                  0          1  ...

5 rows × 123 columns
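After encoding, a quick check (a suggested addition) that no object-dtype columns remain and that the width matches the expected 123:

# All remaining columns should be numeric after one-hot encoding
print(df.select_dtypes(include='object').columns.tolist())  # expected: []
print(df.shape[1])  # expected: 123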

# Check how many values are missing in each column
df.isnull().sum().sort_values(ascending=False).head()

duration 0
red_i 0
printer 0
pop_3 0
pop_2 0
dtype: int64

# Impute missing values using SimpleImputer from sklearn.impute

import numpy as np
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='median')
imr.fit(df)
df = pd.DataFrame(data=imr.transform(df), columns=df.columns)

df.isnull().sum().sort_values(ascending=False).head()

duration 0
red_i 0
printer 0
pop_3 0
pop_2 0
dtype: int64
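Since the check above shows no missing values, the SimpleImputer call is effectively a no-op. A sketch (not part of the original notebook) of making the step conditional:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Impute only if something is actually missing; otherwise leave the frame untouched
if df.isnull().values.any():
    imr = SimpleImputer(missing_values=np.nan, strategy='median')
    df = pd.DataFrame(data=imr.fit_transform(df), columns=df.columns)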

X = df.drop(columns='class')  # Drop the target column; X holds the 122 feature columns
y = df['class']               # Target: 0 = normal, 1 = anomaly
X.shape

(125973, 122)
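As a final consistency check (a suggested addition, based on the shapes reported above):

# Feature matrix and target must agree on the number of rows
assert X.shape == (125973, 122)
assert y.shape == (125973,)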
