
Split Data into Training and Testing in Python Without Sklearn
In machine learning and artificial intelligence, data is the backbone of every model, and the way that data is handled shapes the model's overall performance. One indispensable task is splitting the dataset into training and testing sets. While sklearn's train_test_split() is the most frequently used method, there are times when a Python developer may not have sklearn at hand or simply wants to understand how to achieve the same result manually. This article shows how to split data into training and testing sets without relying on sklearn, using Python's built-in random module and the numpy library.
The Rationale Behind Splitting Data
Before diving into the details, let's address the rationale. Machine learning algorithms need plenty of data to learn from. This data, the training set, helps the model discover patterns and make predictions. However, to evaluate the model's performance, we need data that the model has never seen before. This unseen data is the testing set.
Evaluating the model on the same data it was trained on hides overfitting: an overfitted model performs impressively on the training data but stumbles on unseen data. Consequently, the data is typically divided in a 70-30 or 80-20 proportion, where the larger portion is used for training and the smaller one for testing.
Example 1: Manually Splitting Data in Python
We'll start with a simple yet effective way of splitting the data using Python's built-in operations. The example used here is a list of integers, but the technique works for any indexable sequence.
Assume we have a dataset data as follows:
data = list(range(1, 101)) # data is a list of integers from 1 to 100
The goal is to split this data into 80% training data and 20% testing data.
First, we'll import the necessary library. The random module offers a variety of functions for generating random numbers, and we will use it to shuffle our data.
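In code, this step looks like the following (data is the list defined above):
import random

random.shuffle(data)   # shuffle the list in place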
After shuffling the data, we'll split it into training and testing sets. The split_index determines the point at which the data is divided; we calculate it as the product of split_ratio and the size of the dataset, converted to an integer.
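With an 80-20 split on our 100-item list, this looks like:
split_ratio = 0.8                            # we are using an 80-20 split here
split_index = int(split_ratio * len(data))   # 80 for a list of 100 items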
Finally, we use slicing to create the training and testing datasets. The training data consists of the elements from the start of the list up to split_index, and the testing data consists of the elements from split_index to the end of the list.
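The slicing itself is just:
train_data = data[:split_index]   # elements before the split point
test_data = data[split_index:]    # elements from the split point onward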
Example
import random

data = list(range(1, 101))   # data is a list of integers from 1 to 100

random.shuffle(data)         # shuffle the data in place

split_ratio = 0.8            # we are using an 80-20 split here
split_index = int(split_ratio * len(data))

train_data = data[:split_index]   # first 80% of the shuffled list
test_data = data[split_index:]    # remaining 20%

print("train_data =", train_data)
print("test_data =", test_data)
Output
train_data = [65, 51, 8, 82, 15, 32, 11, 74, 89, 29, 50, 34, 93, 84, 37, 7, 1, 83, 17, 24, 5, 33, 49, 90, 35, 57, 47, 73, 46, 95, 10, 80, 59, 94, 63, 27, 31, 52, 18, 76, 91, 71, 20, 68, 70, 87, 26, 64, 99, 42, 61, 69, 79, 12, 3, 66, 96, 75, 30, 22, 100, 14, 97, 56, 55, 58, 28, 23, 98, 6, 2, 88, 43, 41, 78, 60, 72, 39]
test_data = [45, 53, 48, 16, 9, 62, 13, 81, 92, 54, 21, 38, 25, 44, 85, 19, 40, 77, 67, 4]
As the code involves random shuffling of the data, the output may vary each time you run it.
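If you want the split to be reproducible from run to run, one option (not part of the original example; the seed value 42 is an arbitrary choice) is to seed the random number generator before shuffling:
import random

random.seed(42)              # any fixed seed makes the shuffle, and hence the split, repeatable
data = list(range(1, 101))
random.shuffle(data)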
Example 2: Using Numpy to Split Data
Another way to split data without sklearn is to use the numpy library. Numpy is a powerful library for numerical computation and can be used to construct and manipulate arrays efficiently.
Here's how you can split data using numpy:
First, import the numpy library and construct a numpy array. Then shuffle the array and, finally, split it.
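These first steps look like this (mirroring the full example below):
import numpy as np

data = np.array(range(1, 101))   # a numpy array of integers from 1 to 100
np.random.shuffle(data)          # shuffle the array in place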
The split index marks the point at which our data is divided into the training and testing subsets. It is obtained by multiplying the chosen split ratio (0.8 in our case, for an 80-20 split) by the total number of data points. The final step is to create the training and testing datasets using the calculated split index; we use array slicing for this operation.
Example
import numpy as np

data = np.array(range(1, 101))   # data is a numpy array of integers from 1 to 100

np.random.shuffle(data)          # shuffle the array in place

split_ratio = 0.8                # we are using an 80-20 split here
split_index = int(split_ratio * len(data))

train_data = data[:split_index]  # first 80% of the shuffled array
test_data = data[split_index:]   # remaining 20%

print("train_data =", train_data.tolist())   # tolist() prints the values like a regular Python list
print("test_data =", test_data.tolist())
Output
train_data = [52, 13, 87, 68, 48, 4, 34, 9, 74, 25, 30, 38, 90, 83, 54, 45, 61, 73, 80, 14, 70, 63, 75, 81, 97, 60, 96, 8, 43, 20, 79, 46, 50, 76, 18, 84, 26, 31, 71, 56, 22, 88, 64, 95, 91, 78, 69, 19, 42, 67, 77, 2, 41, 32, 11, 94, 40, 59, 17, 57, 99, 44, 5, 93, 62, 23, 3, 33, 47, 92]
test_data = [49, 66, 7, 58, 37, 98, 100, 24, 6, 55, 28, 16, 85, 65, 51, 35, 12, 10, 86, 29]
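In real projects the features and labels usually live in two separate arrays, and shuffling each one independently would break the pairing between them. A common numpy idiom (a sketch that goes beyond the original example; the names X and y are illustrative toy data) is to shuffle a single array of row indices and use it to slice both arrays:
import numpy as np

X = np.arange(200).reshape(100, 2)        # 100 samples with 2 features each (toy data)
y = np.arange(100)                        # one label per sample

indices = np.random.permutation(len(X))   # shuffled row indices 0..99
split_index = int(0.8 * len(X))           # 80-20 split

train_idx, test_idx = indices[:split_index], indices[split_index:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]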
Conclusion
Splitting data into training and testing sets is a crucial step in machine learning and data science projects. While sklearn provides a straightforward method for this task, it's valuable to understand how to achieve the same result manually. As we've demonstrated, this can be accomplished using Python's built-in operations or the numpy library.
Whether you use sklearn, Python's built-in operations, or numpy depends on your specific requirements and constraints. Each method has its advantages and disadvantages. The manual methods give you more control over the process, while sklearn's train_test_split() is simpler to use and offers additional features such as stratification.