PandasAI Library from OpenAI

Last Updated : 16 Apr, 2025

In today's data-driven environment, we spend a lot of time editing, cleaning, and analyzing data. Pandas is a well-known Python library for data manipulation. It stores data in structures called DataFrames and lets you alter, clean, or analyze that data through operations such as plotting a bar graph, adding a new row or column, or filling in missing values. These tasks often take a lot of time that could be spent elsewhere. PandasAI is an extension to the pandas library that aims to make data analysis and manipulation more efficient.

What is PandasAI?

PandasAI is an extension to the pandas library that uses OpenAI's generative AI models. It lets you generate insights from a DataFrame using just a text prompt, building on the text-to-query generative AI developed by OpenAI. Data scientists and data analysts spend a lot of time preparing data before they can analyze it; PandasAI is meant to cut that preparation time so they can move on to the analysis itself.

PandasAI should not be used in place of pandas but in addition to it. You pose questions about the dataset to PandasAI and it returns answers, often as pandas DataFrames, which saves you from browsing the dataset and writing the queries by hand. Using the OpenAI API, PandasAI aims to let you virtually converse with the machine, which then produces the desired result instead of you having to program the task yourself. The model does this by generating Python code, which is then run against your DataFrame.

How to use PandasAI?

Step 1: Install the pandasai and openai libraries

!pip install -q pandasai openai

Step 2: Import the necessary libraries

Python

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

Step 3: Load the Dataset into a dataframe using a dictionary

Python

dataframe = {
    "country": [
        "Delhi",
        "Mumbai",
        "Kolkata",
        "Chennai",
        "Jaipur",
        "Lucknow",
        "Pune",
        "Bengaluru",
        "Amritsar",
        "Agra",
    ],
    "annual tax collected": [
        19294482072,
        28916155672,
        24112550372,
        34358173362,
        17454337886,
        11812051350,
        16074023894,
        14909678554,
        43807565410,
        146318441864,
    ],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98, 6.1, 4.23, 8.22, 6.87, 3.36],
}
# build a DataFrame from the dictionary
df = pd.DataFrame(dataframe)
df.head()

Output: (table showing the first five rows of df)

Note: You can also read data from a CSV file using pd.read_csv("file_location").
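As a minimal sketch of the CSV route mentioned in the note, the snippet below reads a small table with pd.read_csv. It uses an in-memory io.StringIO buffer in place of a real file so that it runs anywhere; the three sample rows are copied from the dictionary above, and the names csv_text and df_csv are illustrative only.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file you would pass its path,
# e.g. pd.read_csv("file_location").
csv_text = """country,annual tax collected,happiness_index
Delhi,19294482072,9.94
Mumbai,28916155672,7.16
Kolkata,24112550372,6.35
"""

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)   # (rows, columns)
print(df_csv.dtypes)  # column types inferred by pandas
```

The resulting DataFrame can be passed to PandasAI in the same way as one built from a dictionary.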

Step 4: Initialize an OpenAI Large Language Model (LLM)

Since PandasAI relies on an OpenAI LLM, we need to supply our OpenAI API key when creating the LLM object, using the following code:

Python

# pass the API key to the OpenAI LLM wrapper
# replace "YOUR_API_KEY" with your generated API key
llm = OpenAI(api_token='YOUR_API_KEY')
# initialize an instance of PandasAI with the OpenAI LLM
pandas_ai = PandasAI(llm, verbose=True, conversational=False)

If you do not have an OpenAI API key, you can create an account on the OpenAI platform and generate a new one. We are now all set to use the generative model to generate insights or clean data with PandasAI.
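Rather than pasting the key into the script, one option is to read it from an environment variable. The sketch below is only an illustration: load_api_key is a hypothetical helper, and the variable name OPENAI_API_KEY is an assumed convention, not something the steps above require.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a readable message instead of a later API error.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (not executed here, since it needs a real key):
# llm = OpenAI(api_token=load_api_key())
```

This keeps the key out of notebooks and scripts that may be shared with others.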

Step 5: Provide a text prompt and the dataframe to PandasAI

Python

# text prompt describing the operation to be performed on the dataset
PROMPT = "YOUR_TEXT_PROMPT"
# use the PandasAI instance to process the text prompt and the dataset
response = pandas_ai.run(df, prompt=PROMPT)
# print the response
print(response)

Automate Pandas operations with pandasai

Now let's try some prompts on our custom dataset.

Prompt 1: Performing sum operation

Python

response = pandas_ai.run(df, prompt="Calculate the total tax collected in north Indian cities")
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Mumbai', 'Jaipur', 'Kolkata', 'Delhi', 'Chennai', 'Lucknow', 'Hyderabad', 'Ahmedabad', 'Bangalore', 'Pune'],
        'annual tax collected': [3274294604, 7159422858, 8155677164, 3688595185, 3679908367, 4567890123, 2345678901, 3456789012, 5678901234, 6789012345],
        'happiness_index': [6.98, 8.07, 8.07, 6.35, 7.16, 7.89, 7.45, 7.12, 8.56, 7.23]}
df = pd.DataFrame(data)
# Filtering north Indian cities
north_cities = ['Jaipur', 'Kolkata', 'Delhi', 'Lucknow', 'Ahmedabad']
north_df = df[df['country'].isin(north_cities)]
# Calculating total tax collected in north Indian cities
total_tax_collected = north_df['annual tax collected'].sum()
print(total_tax_collected)
```
Code running:
```
data = {'country': ['Mumbai', 'Jaipur', 'Kolkata', 'Delhi', 'Chennai',
    'Lucknow', 'Hyderabad', 'Ahmedabad', 'Bangalore', 'Pune'],
    'annual tax collected': [3274294604, 7159422858, 8155677164, 3688595185,
    3679908367, 4567890123, 2345678901, 3456789012, 5678901234, 6789012345],
    'happiness_index': [6.98, 8.07, 8.07, 6.35, 7.16, 7.89, 7.45, 7.12, 
    8.56, 7.23]}
north_cities = ['Jaipur', 'Kolkata', 'Delhi', 'Lucknow', 'Ahmedabad']
north_df = df[df['country'].isin(north_cities)]
total_tax_collected = north_df['annual tax collected'].sum()
print(total_tax_collected)
```
Answer: 72673421680
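
Because the model decides for itself which cities count as "north Indian", it can help to reproduce the result in plain pandas. The sketch below reuses the city list from the "Code running" log above; note that this list includes Kolkata and omits Amritsar and Agra, so the total reflects the model's interpretation rather than a geographic reference.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "annual tax collected": [19294482072, 28916155672, 24112550372,
                             34358173362, 17454337886, 11812051350,
                             16074023894, 14909678554, 43807565410,
                             146318441864],
})

# City list copied from the log above; Ahmedabad is not in our dataset,
# so isin() simply finds no matching row for it.
north_cities = ["Jaipur", "Kolkata", "Delhi", "Lucknow", "Ahmedabad"]
total = df.loc[df["country"].isin(north_cities), "annual tax collected"].sum()
print(total)  # 72673421680, matching the answer above
```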

Prompt 2: Analyzing the dataset

Python

response = pandas_ai.run(df, prompt='Which are the 5 happiest cities?')
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Kolkata', 'Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
        'annual tax collected': [3560469532, 9597107067, 4821092001, 9053727452, 3738210455, 6489321000, 5183920000, 2874610000, 3958200000, 3129400000],
        'happiness_index': [6.35, 8.07, 7.16, 6.98, 6.98, 7.89, 7.45, 7.12, 6.78, 6.55]}
df = pd.DataFrame(data)
# Sorting the dataframe by happiness index in descending order
df = df.sort_values(by='happiness_index', ascending=False)
# Selecting the top 5 happiest cities
top_5_happiest_cities = df.head(5)['country'].tolist()
print(top_5_happiest_cities)
```
Code running:
```
data = {'country': ['Kolkata', 'Jaipur', 'Delhi', 'Mumbai', 'Chennai',
    'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
    'annual tax collected': [3560469532, 9597107067, 4821092001, 9053727452,
    3738210455, 6489321000, 5183920000, 2874610000, 3958200000, 3129400000],
    'happiness_index': [6.35, 8.07, 7.16, 6.98, 6.98, 7.89, 7.45, 7.12, 
    6.78, 6.55]}
top_5_happiest_cities = df.head(5)['country'].tolist()
print(top_5_happiest_cities)
```
Answer: ['Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Jaipur']
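
Note that the "Code running" log above no longer contains the sort_values call, so df.head(5) returns the first five rows of the unsorted dataframe and the answer shown is not the actual ranking. As a cross-check, the following sketch computes the five happiest cities directly with pandas; it is an independent verification, not the code PandasAI ran, and it only rebuilds the two columns it needs.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98,
                        6.1, 4.23, 8.22, 6.87, 3.36],
})

# nlargest sorts by the given column in descending order and keeps n rows.
top5 = df.nlargest(5, "happiness_index")["country"].tolist()
print(top5)  # ['Delhi', 'Bengaluru', 'Chennai', 'Mumbai', 'Jaipur']
```

Checking generated answers against a direct pandas query like this is a cheap way to catch cases where the executed code differs from the code the model proposed.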

Prompt 3: Performing sort operation

Python

response = pandas_ai.run(df, 
                         prompt='''sort the dataset in ascending order
                          according to happiness index''')
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
df_sorted = df.sort_values(by='happiness_index', ascending=True)
print(df_sorted)
```
Code running:
```
df_sorted = df.sort_values(by='happiness_index', ascending=True)
print(df_sorted)
```
Answer:      country  annual tax collected  happiness_index
9       Agra          146318441864             3.36
6       Pune           16074023894             4.23
5    Lucknow           11812051350             6.10
2    Kolkata           24112550372             6.35
8   Amritsar           43807565410             6.87
4     Jaipur           17454337886             6.98
1     Mumbai           28916155672             7.16
3    Chennai           34358173362             8.07
7  Bengaluru           14909678554             8.22
0      Delhi           19294482072             9.94

Prompt 4: Plotting a histogram

Python

PROMPT = """
Plot a histogram showing the tax collection of all north indian cities,
 take y axis as tax collected and x axis as indian cities"""
response = pandas_ai.run(df, prompt=PROMPT)
print(response)

Output:

Running PandasAI with openai LLM...

[Figure: histogram representing tax collected in north Indian cities]

Code generated:
```
north_cities = ['Delhi', 'Jaipur']
north_df = df[df['country'].isin(north_cities)]
import matplotlib.pyplot as plt
plt.bar(north_df['country'], north_df['annual tax collected'])
plt.xlabel('Indian cities')
plt.ylabel('Tax collected')
plt.title('Tax collection of north Indian cities')
plt.show()
```
Code running:
```
north_cities = ['Delhi', 'Jaipur']
north_df = df[df['country'].isin(north_cities)]
plt.bar(north_df['country'], north_df['annual tax collected'])
plt.xlabel('Indian cities')
plt.ylabel('Tax collected')
plt.title('Tax collection of north Indian cities')
plt.show()
```
Answer:
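
The generated code calls plt.bar, which draws one bar per city, so the result is strictly a bar chart. A histogram would instead count how many cities fall into each range of tax values. As a rough sketch of that distinction, the snippet below bins the tax column with NumPy; the bin edges are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

tax = pd.Series(
    [19294482072, 28916155672, 24112550372, 34358173362, 17454337886,
     11812051350, 16074023894, 14909678554, 43807565410, 146318441864],
    name="annual tax collected",
)

# Hypothetical bin edges: below 20 billion, 20 to 40 billion, 40 to 160 billion.
edges = [0, 20e9, 40e9, 160e9]
counts, _ = np.histogram(tax, bins=edges)
print(counts.tolist())  # [5, 3, 2]
```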

Prompt 5: Performing groupby operation

Python

PROMPT = "Group the cities in the dataset according to their happiness index"
response = pandas_ai.run(df, prompt=PROMPT)
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
# Import pandas library
import pandas as pd
# Create the dataframe
data = {'country': ['Chennai', 'Delhi', 'Mumbai', 'Kolkata', 'Jaipur', 'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
        'annual tax collected': [4115278226, 5211175683, 9898166675, 2429903829, 2640456722, 6329812345, 4781234567, 3214567890, 5678901234, 4321987654],
        'happiness_index': [6.98, 6.98, 8.07, 9.94, 7.16, 8.56, 7.89, 6.78, 8.23, 7.45]}
df = pd.DataFrame(data)
# Group the cities by their happiness index
grouped_df = df.groupby('happiness_index')
# Print the groups
for name, group in grouped_df:
    print("Happiness Index:", name)
    print(group)
```
Code running:
```
data = {'country': ['Chennai', 'Delhi', 'Mumbai', 'Kolkata', 'Jaipur',
    'Bangalore', 'Hyderabad', 'Pune', 'Ahmedabad', 'Surat'],
    'annual tax collected': [4115278226, 5211175683, 9898166675, 2429903829,
    2640456722, 6329812345, 4781234567, 3214567890, 5678901234, 4321987654],
    'happiness_index': [6.98, 6.98, 8.07, 9.94, 7.16, 8.56, 7.89, 6.78, 
    8.23, 7.45]}
grouped_df = df.groupby('happiness_index')
for name, group in grouped_df:
    print('Happiness Index:', name)
    print(group)
```
Answer: Happiness Index: 3.36
  country  annual tax collected  happiness_index
9    Agra          146318441864             3.36
Happiness Index: 4.23
  country  annual tax collected  happiness_index
6    Pune           16074023894             4.23
Happiness Index: 6.1
   country  annual tax collected  happiness_index
5  Lucknow           11812051350              6.1
Happiness Index: 6.35
   country  annual tax collected  happiness_index
2  Kolkata           24112550372             6.35
Happiness Index: 6.87
    country  annual tax collected  happiness_index
8  Amritsar           43807565410             6.87
Happiness Index: 6.98
  country  annual tax collected  happiness_index
4  Jaipur           17454337886             6.98
Happiness Index: 7.16
  country  annual tax collected  happiness_index
1  Mumbai           28916155672             7.16
Happiness Index: 8.07
   country  annual tax collected  happiness_index
3  Chennai           34358173362             8.07
Happiness Index: 8.22
     country  annual tax collected  happiness_index
7  Bengaluru           14909678554             8.22
Happiness Index: 9.94
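
Grouping on a continuous column such as happiness_index produces one group per distinct value, which here means one group per city. A variant that may be more useful is to bin the index first and group on the bins. The sketch below does this with pd.cut; the bin edges and the labels "low", "medium" and "high" are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata", "Chennai", "Jaipur",
                "Lucknow", "Pune", "Bengaluru", "Amritsar", "Agra"],
    "happiness_index": [9.94, 7.16, 6.35, 8.07, 6.98,
                        6.1, 4.23, 8.22, 6.87, 3.36],
})

# Bins are right-inclusive by default: (0, 5], (5, 7], (7, 10].
bins = [0, 5, 7, 10]
labels = ["low", "medium", "high"]
df["happiness_band"] = pd.cut(df["happiness_index"], bins=bins, labels=labels)

# Count the cities in each band.
band_sizes = df.groupby("happiness_band", observed=True).size()
print(band_sizes)
```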

Prompt 6: Describe the dataset

Python

PROMPT = "Give statistical information about dataset"
response = pandas_ai.run(df, prompt=PROMPT)
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# create the dataframe
data = {'country': ['Delhi', 'Chennai', 'Kolkata', 'Mumbai', 'Jaipur'],
        'annual tax collected': [6851245018, 5569913156, 497203726, 1780282822, 856852833],
        'happiness_index': [6.98, 6.35, 6.35, 6.98, 7.16]}
df = pd.DataFrame(data)
# describe the dataframe
print(df.describe())
```
Code running:
```
data = {'country': ['Delhi', 'Chennai', 'Kolkata', 'Mumbai', 'Jaipur'],
    'annual tax collected': [6851245018, 5569913156, 497203726, 1780282822,
    856852833], 'happiness_index': [6.98, 6.35, 6.35, 6.98, 7.16]}
print(df.describe())
```
Answer:        annual tax collected  happiness_index
count          1.000000e+01        10.000000
mean           3.570575e+10         6.728000
std            4.010314e+10         1.907149
min            1.181205e+10         3.360000
25%            1.641910e+10         6.162500
50%            2.170352e+10         6.925000
75%            3.299767e+10         7.842500
max            1.463184e+11         9.940000

Prompt 7: Check for missing values

Python

PROMPT = "Are there any missing values in the dataset"
response = pandas_ai.run(df, prompt=PROMPT)
print(response)

Output:

Running PandasAI with openai LLM...
Code generated:
```
import pandas as pd
# Creating the dataframe
data = {'country': ['Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
        'annual tax collected': [8203131465, 406012666, 6195812866, 8532100009, 2405598967],
        'happiness_index': [8.07, 6.98, 6.98, 7.16, 6.35]}
df = pd.DataFrame(data)
# Checking for missing values
print(df.isnull().values.any())
```
Code running:
```
data = {'country': ['Jaipur', 'Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
    'annual tax collected': [8203131465, 406012666, 6195812866, 8532100009,
    2405598967], 'happiness_index': [8.07, 6.98, 6.98, 7.16, 6.35]}
print(df.isnull().values.any())
```
Answer: False
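
The check above returns a single boolean for the whole dataframe. When values are missing, a per-column count and a fill step are usually the next things you need. The sketch below shows both in plain pandas on a small made-up sample with one value deliberately removed (the article's dataset has no missing values); filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Made-up sample: Mumbai's happiness_index is deliberately missing.
sample = pd.DataFrame({
    "country": ["Delhi", "Mumbai", "Kolkata"],
    "happiness_index": [9.94, np.nan, 6.35],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Fill the gap with the mean of the values that are present.
filled = sample.fillna({"happiness_index": sample["happiness_index"].mean()})
print(filled)
```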

Conclusion

In this article, we looked at PandasAI as a useful addition for users of the pandas library. It can run natural-language prompts, much as you would write SQL queries, and produce visualizations directly from a DataFrame, which can increase productivity by automating several routine steps. Even so, PandasAI does not replace pandas. Some operations, such as filling in missing data in a DataFrame, still call for pandas itself, and pandas' extensive ecosystem and wide range of features remain essential for demanding data manipulation and analysis tasks. PandasAI is best seen as a complement that makes working with data in Python more convenient.
