
Load StackOverflow Questions Dataset using TensorFlow in Python
TensorFlow is an open-source machine learning framework provided by Google. It is used in conjunction with Python to implement algorithms, deep learning applications, and much more, both in research and in production. It comes with optimization techniques that help perform complicated mathematical operations quickly.
This is because it uses NumPy and multi-dimensional arrays. These multi-dimensional arrays are also known as ‘tensors’. The framework supports working with deep neural networks. It is highly scalable and comes with many popular datasets. It uses GPU computation and automates the management of resources. It comes with a multitude of machine learning libraries and is well supported and documented. The framework can run deep neural network models, train them, and create applications that predict relevant characteristics of the respective datasets.
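As a quick illustration of tensors, the snippet below (a minimal sketch, independent of the dataset code later in this article) builds a small rank-2 tensor and runs two common operations on it −
import tensorflow as tf

# A rank-2 tensor (a matrix) created from a nested Python list
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Element-wise reduction and matrix multiplication; these run on a GPU
# automatically when one is available
print(tf.reduce_sum(matrix))      # tf.Tensor(10.0, shape=(), dtype=float32)
print(tf.matmul(matrix, matrix))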
The ‘tensorflow’ package can be installed on Windows using the command below −
pip install tensorflow
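Once installed, a quick sanity check (a minimal sketch, not part of the tutorial code) confirms that the package imports correctly and reports its version −
import tensorflow as tf

# 'text_dataset_from_directory' used below requires a recent TensorFlow 2.x
# release (it was added around TF 2.3)
print(tf.__version__)

# On Colab, this lists the GPU when a GPU runtime is enabled
print(tf.config.list_physical_devices('GPU'))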
We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphical Processing Units). Colab has been built on top of Jupyter Notebook. Following is the code snippet to load the dataset containing StackOverflow questions using Python −
Example
from tensorflow.keras import preprocessing

batch_size = 32
seed = 42
print("The training parameters have been defined")

# 'train_dir' is the directory holding the training data, one sub-folder per
# class; it is created in the dataset-download step of the original tutorial
raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.25,
    subset='training',
    seed=seed)

# Print the first 100 characters of ten sample questions with their labels
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(10):
        print("Question: ", text_batch.numpy()[i][:100], '...')
        print("Label:", label_batch.numpy()[i])
Code credit − https://www.tensorflow.org/tutorials/load_data/text
Output
The training parameters have been defined
Found 8000 files belonging to 4 classes.
Using 6000 files for training.
Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can' ...
Label: 1
Question:  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds' ...
Label: 3
Question:  b'"option and validation in blank i want to add a new option on my system where i want to add two text' ...
Label: 1
Question:  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand th' ...
Label: 0
Question:  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like t' ...
Label: 1
Question:  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (i' ...
Label: 0
Question:  b'how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemt' ...
Label: 0
Question:  b'"is there a way to check a variable that exists in a different script than the original one? i\'m try' ...
Label: 3
Question:  b'"blank control flow i made a number which asks for 2 numbers with blank and responds with the corre' ...
Label: 0
Question:  b'"credentials cannot be used for ntlm authentication i am getting org.apache.commons.httpclient.auth.' ...
Label: 1
Explanation
The data is loaded off the disk and prepared into a form that is suited for training.
The ‘text_dataset_from_directory’ utility is used to create a labeled dataset.
‘tf.data’ is a powerful collection of tools used to build input pipelines.
A directory structure is passed to the ‘text_dataset_from_directory’ utility.
The StackOverflow questions dataset is divided into training and test datasets.
A validation set is created using the ‘validation_split’ argument, as shown in the sketch after this list.
The labels are either 0, 1, 2, or 3, one for each of the four question classes.
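For completeness, here is a minimal sketch of how the complementary validation split can be created from the same directory. It assumes the same ‘train_dir’, seed, and split fraction as the training call above, and uses the standard ‘tf.data’ cache and prefetch calls to speed up the input pipeline −
import tensorflow as tf
from tensorflow.keras import preprocessing

# The complementary 25% of the files; the seed must match the training call
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=32,
    validation_split=0.25,
    subset='validation',
    seed=42)

# The integer labels map to the alphabetically sorted class sub-folders
print(raw_val_ds.class_names)

# tf.data performance tools: cache batches in memory and overlap
# data preparation with training
AUTOTUNE = tf.data.AUTOTUNE
val_ds = raw_val_ds.cache().prefetch(buffer_size=AUTOTUNE)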