
AI in HC - 1

The document outlines an experiment to generate a synthetic dataset using the Faker library in Python, which creates realistic-looking fake data. It provides installation instructions for necessary libraries and demonstrates how to generate various data types such as names, addresses, emails, and company information. The final section includes a function to create a custom dataset with specified formats and random values, showcasing the versatility of the Faker library.

Experiment -1

Aim:- Generate a custom or synthetic dataset using Python.

Theory:- Faker is a Python library that generates fake data.
It is useful for creating realistic-looking datasets and can generate
many types of data. We'll explore those most relevant for customer
demos, but the documentation details all the "providers" of fake data
available in the library.

To begin, let's make sure we have the necessary libraries installed. In
addition to Faker and NumPy, we'll also need the handy pandas
library. The hana_ml library will be used to upload the dataset we
create to SAP HANA Cloud.

!pip install numpy
!pip install faker
!pip install pandas
!pip install hana_ml

import pandas as pd
from faker import Faker
import numpy as np

fake = Faker()

# First name
for _ in range(3):
    print(fake.first_name())
Output:-
Tyler
Mark
Susan
Faker has providers for many different types of data. We can
generate each attribute of a fake "customer" by calling the
appropriate provider.

# There are specific versions of these generators
# It can generate names
print('Male first names: ' + fake.first_name_male())
print('Female first names: ' + fake.first_name_female())
print('Last names: ' + fake.last_name())
print('Full names: ' + fake.name())

# Generate prefixes and suffixes (there are also gender specific
# versions e.g. prefix_female())
print('Prefix: ' + fake.prefix())
print('Suffix: ' + fake.suffix())

# Generate emails
print('Company emails: ' + fake.ascii_company_email())
print('Safe emails: ' + fake.ascii_safe_email())
print('Free emails: ' + fake.ascii_free_email())
print('ASCII Emails: ' + fake.ascii_email())
print('Emails: ' + fake.email())

Output:-

Male first names: Luis
Female first names: Lori
Last names: Burton
Full names: Mitchell Maynard
Prefix: Mr.
Suffix: DDS
Company emails: [email protected]
Safe emails: [email protected]
Free emails: [email protected]
ASCII Emails: [email protected]
Emails: [email protected]
If you prefer to create a company-focused dataset, you can do
that too.

# Company names
print('Company name: ' + fake.company())
print('Company suffix: ' + fake.company_suffix())

# Generate Address components
print('Street address: ' + fake.street_address())
print('Bldg #: ' + fake.building_number())
print('City: ' + fake.city())
print('Country: ' + fake.country())
print('Postcode: ' + fake.postcode())

# Or generate full addresses
print('Full address: ' + fake.address())

# Even generate motto, etc.
print('Catch phrase: ' + fake.catch_phrase())
print('Motto: ' + fake.bs())

Output:-

Company name: Park-Osborne
Company suffix: Ltd
Street address: 2694 Hughes View Suite 654
Bldg #: 5802
City: Craigfurt
Country: Iran
Postcode: 78482
Full address: 46463 Juan Fall Apt. 788
Port Benjamin, RI 60825
Catch phrase: Managed 5thgeneration adapter
Motto: redefine 24/365 markets

Generate columns that match specific formats

If you need to create fake data in a specific format, such as a
product code or iPhone model, you can do that too:

# Use bothify to generate random numbers(#) or letters(?). Can
# limit the letters used with letters=
print(fake.bothify('PROD-??-##', letters='ABCDE'))
print(fake.bothify('iPhone-#'))

# Create fake True/False values


# Random True/False
print(fake.boolean())

# Specify % True
print(fake.boolean(chance_of_getting_true=25))

For categorical columns, you can specify a list of values to
randomly choose from. Optionally, you can also specify the
weights to give to each value if you don't want each element in
the list to have an equal chance of being selected.

import numpy as np

industry = ['Automotive', 'Health Care', 'Manufacturing', 'High Tech', 'Retail']
# Specify probabilities of each category (must sum to 1.0)
weights = [0.6, 0.2, 0.1, 0.07, 0.03]
# p= specifies the probabilities of each category. Must sum to 1.0
print(np.random.choice(industry, p=weights))

# Generating choice without weights (equal probability on all elements)
print(np.random.choice(industry))

Output:- Health Care
Health Care

Because the selection is random, running the same code again can
give a different result:

Output:- Automotive
Manufacturing
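
A single draw says little about whether the weights are respected. As a rough sketch, an entire categorical column can be drawn in one call and its empirical shares compared with the weights. The use of np.random.default_rng, the seed value 0 and the column size of 10,000 are assumptions made for this illustration.

```python
from collections import Counter

import numpy as np

industry = ['Automotive', 'Health Care', 'Manufacturing', 'High Tech', 'Retail']
weights = [0.6, 0.2, 0.1, 0.07, 0.03]  # must sum to 1.0

rng = np.random.default_rng(0)  # hypothetical seed for repeatability

# size= draws the whole column at once instead of one value per call
column = rng.choice(industry, size=10_000, p=weights)

# Share of each category in the generated column
counts = Counter(column.tolist())
shares = {name: counts[name] / len(column) for name in industry}

for name, weight in zip(industry, weights):
    print(f'{name}: target {weight:.2f}, observed {shares[name]:.3f}')
```

With a column this large, the observed shares should land close to the targets, although they will rarely match exactly.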

# 1st argument is mean of distribution, 2nd is standard deviation
print(np.random.normal(1000, 100))
# Rounded result
print(round(np.random.normal(1000, 100)))

# Generate random integer between 0 and 4
print(np.random.randint(5))

Output:-
1174.2251307283339
961
0
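
The same numeric generators can fill a whole column in one call. The sketch below assumes NumPy's Generator interface (np.random.default_rng) instead of the np.random functions used above; the seed value 0 and the column size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed for repeatability

# 10,000 values from a normal distribution with mean 1000 and standard deviation 100
sales = rng.normal(loc=1000, scale=100, size=10_000)

# Round to the nearest whole number and store as integers
sales_rounded = np.rint(sales).astype(int)

# Integers from 0 to 4; `high` is exclusive, like np.random.randint(5)
priority = rng.integers(low=0, high=5, size=10_000)

print(round(sales.mean(), 1), round(sales.std(), 1))
print(priority.min(), priority.max())
```

The sample mean and standard deviation should be close to 1000 and 100, which is a quick sanity check that the arguments were passed in the intended order.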

print(fake.date_this_century().strftime('%m-%d-%Y'))
print(fake.date_this_decade().strftime('%m-%d-%Y'))
print(fake.date_this_year().strftime('%m-%d-%Y'))
print(fake.date_this_month().strftime('%m-%d-%Y'))
print(fake.time())

import pandas as pd

# Start and end dates to generate data
my_start = pd.to_datetime('01-01-2021')
my_end = pd.to_datetime('12-31-2021')

print(f'Random date between {my_start} & {my_end}')
fake.date_between_dates(my_start, my_end).strftime('%m-%d-%Y')

Output:-

01-28-2005
07-16-2020
03-19-2023
11-04-2023
18:31:29
Random date between 2021-01-01 00:00:00 & 2021-12-31 00:00:00
'11-04-2021'

print(fake.year())
print(fake.month())
print(fake.day_of_month())
print(fake.day_of_week())
print(fake.month_name())
print(fake.past_date('-1y'))
print(fake.future_date('+1d'))
Output:-
1994
11
20
Friday
January
2022-12-21
2023-11-25
We can now combine the code above to generate a custom dataset.

from faker import Faker
import numpy as np
import pandas as pd

industry = ['Automotive', 'Health Care', 'Manufacturing', 'High Tech', 'Retail']

fake = Faker()

def create_data(x):
    # Dictionary of records, keyed by row number
    b_user = {}
    for i in range(0, x):
        b_user[i] = {}
        b_user[i]['name'] = fake.name()
        b_user[i]['job'] = fake.job()
        b_user[i]['birthdate'] = fake.date_of_birth(minimum_age=18, maximum_age=65)
        b_user[i]['email'] = fake.company_email()
        b_user[i]['company'] = fake.company()
        b_user[i]['industry'] = fake.random_element(industry)
        b_user[i]['city'] = fake.city()
        b_user[i]['state'] = fake.state()
        b_user[i]['zipcode'] = fake.postcode()
        b_user[i]['netNew'] = fake.boolean(chance_of_getting_true=65)
        b_user[i]['sales_rounded'] = round(np.random.normal(1000, 200))
        b_user[i]['sales_decimal'] = np.random.normal(1000, 200)
        b_user[i]['priority'] = fake.random_digit()
        b_user[i]['industry2'] = np.random.choice(industry)
    return b_user

# Each outer key becomes a column, so transpose to get one row per record
df = pd.DataFrame(create_data(5)).transpose()
df.head(5)
Output:-
