
DE LAB MANUAL

EXPERIMENT -1

AIM : Installing and configuring Apache NiFi, Apache Airflow.

PROCEDURE :
Part 1: Installing and Configuring Apache NiFi on Windows

Apache NiFi is a powerful data integration tool for automating data flow. Here’s how to install
and configure it on Windows:

Step 1: Prerequisites

● Ensure that Java 8 or later is installed on your Windows system.


● Download and Install Java from: Oracle JDK.
● Add the Java bin directory to your PATH environment variable.

Step 2: Download Apache NiFi

1. Go to the official Apache NiFi website and download the Windows binary:
○ https://nifi.apache.org/download.html
2. Select the appropriate version and download the .zip file.

Step 3: Install Apache NiFi

1. Extract the ZIP file to a location of your choice, for example, C:\nifi.
2. Navigate to the NiFi folder (C:\nifi).

Step 4: Configure NiFi (Optional)

● You can configure NiFi by modifying the conf\nifi.properties file if needed (e.g.,
change ports, security settings).
● In most cases, you can use the default settings.

Step 5: Start NiFi


1. Open a Command Prompt as Administrator and navigate to the NiFi bin directory:

bash
cd C:\nifi\bin

2. Run the following command to start NiFi:

bash
nifi.bat start

Step 6: Access NiFi UI

● Open a browser and navigate to the NiFi Web UI:
○ http://localhost:8080/nifi
● You should now see the NiFi UI where you can configure processors, create workflows,
and monitor data flows.

Step 7: Stop NiFi


To stop NiFi, run the following command in the bin directory:

bash
nifi.bat stop

Part 2: Installing and Configuring Apache Airflow on Windows

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor
workflows. Here's how to install and configure it on Windows. (Note: Airflow does not officially
support native Windows; on a Windows machine it is usually run inside WSL 2 or a Docker
container, using the same commands shown below.)

Step 1: Prerequisites

● Ensure you have Python 3.7+ installed on your system.


● You may also need PIP (Python's package manager) installed.

Step 2: Install Apache Airflow

1. Set up a Virtual Environment (recommended to avoid dependency conflicts):

Open a command prompt (or PowerShell) and create a virtual environment:

bash
python -m venv airflow_env

Activate the virtual environment:

bash
airflow_env\Scripts\activate

2. Install Apache Airflow via pip:

bash
pip install apache-airflow

You can also install a specific version of Airflow:

bash
pip install apache-airflow==2.3.0

Step 3: Initialize Airflow Database


Airflow uses a metadata database. You need to initialize the database first:

bash
airflow db init

Step 4: Start Airflow Web Server and Scheduler

1. Start the Airflow Web Server:

Run this command to start the Airflow UI on localhost:8080:

bash
airflow webserver --port 8080

2. Start the Airflow Scheduler:

In another command prompt, start the Airflow scheduler:

bash
airflow scheduler

Step 5: Access Airflow UI

● Open a web browser and go to:
○ http://localhost:8080
● Airflow 2.x does not ship with default login credentials. Create an admin user first, for example:
○ airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com
○ You will be prompted for a password; log in with that username and password.

Step 6: Configure Airflow DAGs

1. Create a DAG by writing a Python script in the dags directory (typically located at
C:\Users\<YourUsername>\airflow\dags).
○ Example of a simple DAG (simple_dag.py):

python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 11, 23),
    'retries': 1,
}

# Define the DAG and schedule it to run once a day
dag = DAG('simple_dag', default_args=default_args,
          schedule_interval='@daily')

# Two placeholder tasks, wired so that start runs before end
start_task = DummyOperator(task_id='start', dag=dag)
end_task = DummyOperator(task_id='end', dag=dag)

start_task >> end_task

Step 7: Stop Airflow

1. Stop the webserver and scheduler:

○ To stop the web server, use Ctrl+C in the terminal where it's running.
○ Similarly, stop the scheduler with Ctrl+C in the scheduler terminal.

Troubleshooting Tips

For Apache NiFi:

1. Java Version Error:


○ If you receive an error related to Java, check your JAVA_HOME environment
variable and ensure it points to the correct Java version.
○ Restart your terminal after updating the PATH or JAVA_HOME variable.
2. Port Conflict:
○ NiFi runs by default on port 8080. If this port is in use, change it in the
nifi.properties file (located in the conf folder) by editing the
nifi.web.http.port property.

For Apache Airflow:

1. Database Initialization Errors:


○ Ensure you have initialized the Airflow database properly with airflow db
init.

○ If you encounter errors related to the database, clear the existing database (only in a
development environment) using:

bash
airflow db reset


2. Permissions:
○ Sometimes, Airflow needs elevated permissions (especially for Windows). If you
encounter permissions-related issues, ensure you're running the command
prompt as Administrator.

EXPERIMENT - 2

AIM : Installing and configuring Elasticsearch, Kibana, PostgreSQL, pgAdmin 4

PROCEDURE :

Step-by-Step Guide to Install and Configure Elasticsearch, Kibana,
PostgreSQL, and pgAdmin 4 on Windows

Part 1: Installing and Configuring Elasticsearch

Elasticsearch is a distributed search and analytics engine. Here's how to install it on Windows.

Step 1: Download Elasticsearch

1. Go to the official Elasticsearch download page.


2. Select the Windows version of Elasticsearch and download the ZIP file.

Step 2: Extract Elasticsearch

1. Once downloaded, extract the ZIP file to a directory, e.g., C:\elasticsearch.

Step 3: Configure Elasticsearch (Optional)

1. Navigate to the config folder within the extracted directory (e.g.,
C:\elasticsearch\config).
2. Open elasticsearch.yml with a text editor to configure settings like network host,
cluster name, etc.

For example, to bind Elasticsearch to localhost:

network.host: localhost

Step 4: Start Elasticsearch

1. Open Command Prompt as Administrator.

Navigate to the bin folder within the Elasticsearch directory:

cd C:\elasticsearch\bin

2. Run the following command to start Elasticsearch:

elasticsearch.bat

Step 5: Verify Elasticsearch Installation


Open a browser and go to:
http://localhost:9200

● You should see a JSON response confirming Elasticsearch is running.
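If you prefer to verify from a script, here is a minimal Python sketch of the same check. It assumes Elasticsearch is reachable over plain HTTP on localhost:9200, as in the configuration described above; recent Elasticsearch 8.x releases enable TLS and authentication by default, in which case this request will not work as-is.

python

import json
import urllib.request

# Request the Elasticsearch root endpoint and decode the JSON response
with urllib.request.urlopen("http://localhost:9200") as response:
    info = json.loads(response.read().decode("utf-8"))

# Print the cluster name and version number reported by the node
print(info["cluster_name"], info["version"]["number"])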

Part 2: Installing and Configuring Kibana

Kibana is a data visualization platform that works with Elasticsearch. Here’s how to install it on
Windows.

Step 1: Download Kibana

1. Go to the official Kibana download page.


2. Download the Windows ZIP version of Kibana.

Step 2: Extract Kibana

1. Extract the ZIP file to a directory, e.g., C:\kibana.

Step 3: Configure Kibana

1. Navigate to the config folder inside the Kibana directory (e.g., C:\kibana\config).
2. Open the kibana.yml file and modify the following configuration:

Set the elasticsearch.hosts to match your Elasticsearch server (usually
localhost:9200):

elasticsearch.hosts: ["http://localhost:9200"]

Step 4: Start Kibana

1. Open Command Prompt as Administrator.

2. Navigate to the bin folder in the Kibana directory:

cd C:\kibana\bin

3. Start Kibana by running:

kibana.bat

Step 5: Verify Kibana Installation


Open a browser and go to:
http://localhost:5601

● You should see the Kibana dashboard, confirming that Kibana is connected to
Elasticsearch.

Part 3: Installing and Configuring PostgreSQL

PostgreSQL is a powerful open-source relational database system. Follow these steps to install
it on Windows.

Step 1: Download PostgreSQL

1. Go to the official PostgreSQL download page.


2. Download the installer for Windows from the EnterpriseDB website.

Step 2: Install PostgreSQL

1. Run the PostgreSQL installer .exe file you downloaded.


2. Follow the installation wizard steps:
○ Choose your installation directory (default is C:\Program Files\PostgreSQL\xx).
○ Select components (PostgreSQL Server, pgAdmin, Stack Builder, etc.).
○ Set the password for the postgres user (remember this password).
○ Choose the port (default is 5432).
○ Select the locale (default is fine).

Step 3: Start PostgreSQL

● PostgreSQL should start automatically after installation. You can verify that the
PostgreSQL service is running by checking the Services window (services.msc).

Step 4: Connect to PostgreSQL via Command Line (Optional)

1. Open Command Prompt as Administrator.

2. Navigate to the PostgreSQL bin directory (usually in C:\Program Files\PostgreSQL\xx\bin).

3. Run the following command to log in as the postgres user:

psql -U postgres

4. Enter the password you set during installation.

Step 5: Verify PostgreSQL Installation


In the psql terminal, run the following SQL query to confirm PostgreSQL is working:
SELECT version();

Part 4: Installing and Configuring pgAdmin 4

pgAdmin is a web-based GUI for managing PostgreSQL databases. Here's how to install and
configure it.

Step 1: Download pgAdmin 4

1. Go to the official pgAdmin download page.


2. Download the Windows version of pgAdmin 4.

Step 2: Install pgAdmin 4

1. Run the pgAdmin 4 installer and follow the steps:


○ Choose installation directory and components.
○ Select whether to install pgAdmin for all users or only the current user.

Step 3: Launch pgAdmin 4

1. After installation, you can launch pgAdmin 4 from the Start Menu or desktop shortcut.
2. The first time you launch it, you will be prompted to set the master password (used for
securing your pgAdmin credentials).

Step 4: Add PostgreSQL Server in pgAdmin

1. Once inside pgAdmin, click on "Add New Server" in the browser panel.

2. In the General tab, name the server (e.g., PostgreSQL Server).

3. In the Connection tab, configure the following:

○ Host name/address: localhost
○ Port: 5432 (default)
○ Maintenance database: postgres
○ Username: postgres
○ Password: The password you set during PostgreSQL installation.
4. Click Save to add the server.

Step 5: Verify Connection

● You should now see your PostgreSQL server listed in pgAdmin. Expand the server, and
you can access the databases, tables, and run SQL queries.

Conclusion

By following these steps, you should have successfully installed and configured:

● Elasticsearch (for distributed search and analytics).


● Kibana (for data visualization connected to Elasticsearch).
● PostgreSQL (a relational database).
● pgAdmin 4 (a GUI for managing PostgreSQL databases).

Troubleshooting Tips

1. Elasticsearch:
○ If you face connection issues, ensure that Elasticsearch is running on
localhost:9200 and that no firewall is blocking the port.
2. Kibana:
○ If Kibana isn't displaying properly, verify that Elasticsearch is running and check
the Kibana logs located in C:\kibana\logs\.
3. PostgreSQL:
○ If you can't connect to PostgreSQL, ensure that the PostgreSQL service is
running and that your firewall allows connections on port 5432.
4. pgAdmin:
○ If pgAdmin can't connect to PostgreSQL, double-check the server settings in
pgAdmin, especially the username and password.

With all components installed and configured, you now have a robust stack for data storage,
search, analytics, and visualization.
EXPERIMENT -3

AIM : Reading and Writing files

a. Reading and writing files in Python

b. Processing files in Airflow

c. NiFi processors for handling files

d. Reading and writing data to databases in Python

e. Databases in Airflow

f. Database processors in NiFi

a. Reading and Writing Files in Python

Python provides built-in functions to read from and write to files. Below are examples for
common file operations.

Reading Files in Python

To read a file in Python, you can use the open() function with the r mode (read):

python

# Reading a file
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)

● open(): Opens the file in the specified mode ('r' for reading).
● with statement: Automatically closes the file after reading.

Writing Files in Python

To write data to a file, use the open() function with the w (write) or a (append) mode:

python

# Writing to a file
with open('output.txt', 'w') as file:
    file.write("Hello, this is a test!\n")
    file.write("Second line.")

● w mode: Overwrites the file (creates the file if it doesn't exist).
● a mode: Appends data to the end of the file (and also creates it if it doesn't exist).
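As a small illustration of the a mode, here is a sketch that assumes output.txt was created by the previous example:

python

# Appending to a file ('a' adds to the end instead of overwriting)
with open('output.txt', 'a') as file:
    file.write("\nThird line, appended.")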

Reading and Writing Binary Files in Python

For reading and writing binary files (e.g., images or PDFs), use the rb and wb modes
respectively:

python

# Reading a binary file
with open('image.jpg', 'rb') as file:
    data = file.read()

# Writing a binary file
with open('copy_image.jpg', 'wb') as file:
    file.write(data)

b. Processing Files in Apache Airflow


Apache Airflow allows you to automate file processing tasks, such as reading, writing, and
moving files between systems. Airflow uses DAGs (Directed Acyclic Graphs) to define
workflows, which can include file operations using operators like PythonOperator,
BashOperator, or specialized operators.

Example of File Processing in Airflow

python

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

# Function to read from and write to a file
def process_file():
    with open('/path/to/input_file.txt', 'r') as file:
        content = file.read()
    with open('/path/to/output_file.txt', 'w') as file:
        file.write(f"Processed: {content}")

# Define the DAG
dag = DAG('file_processing_dag', start_date=datetime(2024, 1, 1))

# Create a PythonOperator to call the function
process_file_task = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
    dag=dag
)

● PythonOperator: Executes a Python function.


● The function reads from an input file, processes the content, and writes the result to an
output file.

Using BashOperator to Process Files

You can also use the BashOperator to run shell commands, such as copying or moving files:

python

from airflow.operators.bash_operator import BashOperator

# Run a shell command that moves the input file to a new location
# (note: 'mv' assumes a Unix-style shell; on native Windows use 'move' instead)
move_file_task = BashOperator(
    task_id='move_file',
    bash_command='mv /path/to/input_file.txt /path/to/output_file.txt',
    dag=dag
)

c. NiFi Processors for Handling Files

Apache NiFi provides processors to handle files, such as reading,
writing, and moving files. Here are some of the key NiFi processors
for working with files:

1. GetFile

● Function: Reads files from a directory and sends them as flow
files.
● Use Case: Ingesting files from local directories into NiFi.
● Configuration: Set the Input Directory to the folder containing
files you want to read.

2. PutFile

● Function: Writes flow files to a directory.


● Use Case: Saving processed files to disk.
● Configuration: Set the Directory property to the folder where you
want to write the files.

3. ListFile

● Function: Lists files in a specified directory without consuming
them.
● Use Case: To list files in a directory for further processing.
● Configuration: Set the Directory to where the files are located.

4. FetchFile

● Function: Retrieves files from a local directory or network
location.
● Use Case: Fetching a specific file to process further in NiFi.
● Configuration: Specify the file path or URL to fetch the file.

5. ReplaceText

● Function: Reads the content of a flow file, modifies it using
regular expressions, and writes the modified content back.
● Use Case: Processing the contents of a file (e.g., replacing
specific words or formatting).

d. Reading and Writing Data to Databases in Python

Python supports database connections through libraries like psycopg2
(for PostgreSQL), pyodbc (for SQL Server), and SQLAlchemy (for ORM-
based access).

Example: Reading from PostgreSQL in Python

1. Install psycopg2 (on Windows the prebuilt psycopg2-binary package is often easier to install):

bash
pip install psycopg2

2. Connect to a PostgreSQL database and fetch data:

python

import psycopg2

# Connect to the PostgreSQL server
conn = psycopg2.connect(dbname="test_db", user="user",
                        password="password", host="localhost", port="5432")

cursor = conn.cursor()

# Query to fetch data
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()

# Print results
for row in rows:
    print(row)

# Close the connection
cursor.close()
conn.close()

Example: Writing to PostgreSQL in Python

python

import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(dbname="test_db", user="user",
                        password="password", host="localhost", port="5432")

cursor = conn.cursor()

# Insert data into table
cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)",
               ("Alice", 30))

# Commit changes and close the connection
conn.commit()
cursor.close()
conn.close()
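The intro above also mentions SQLAlchemy. Below is a minimal sketch of the same SELECT using a SQLAlchemy engine; the connection URL reuses the placeholder database name and credentials from the psycopg2 examples, so adjust it for your own setup.

python

from sqlalchemy import create_engine, text

# Build an engine from a connection URL (placeholder credentials)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/test_db")

# Run a query and print each row
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM users;"))
    for row in result:
        print(row)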

e. Databases in Airflow

Airflow offers a variety of operators to interact with databases.
Some commonly used operators include:

● PostgresOperator: To execute SQL commands on a PostgreSQL
database.
● MySqlOperator: To interact with MySQL databases.
● PythonOperator: To execute Python code, which can include
database interactions.
Example of Using PostgresOperator in Airflow

python

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

# Define the DAG
dag = DAG('postgres_example', start_date=datetime(2024, 1, 1))

# SQL command to run
sql_query = """
CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    age INT
);
"""

# Task to execute SQL command
create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',  # Connection ID to PostgreSQL
    sql=sql_query,
    autocommit=True,  # Automatically commits the transaction
    dag=dag
)

● PostgresOperator: Executes SQL queries on a PostgreSQL database.


● postgres_conn_id: Connection ID set in Airflow's connection
configuration.

f. Database Processors in NiFi

NiFi provides processors to interact with databases. Common database
processors include:

1. ExecuteSQL

● Function: Executes SQL queries (SELECT, INSERT, UPDATE, DELETE)
against a database.
● Use Case: Running custom SQL queries to interact with databases.
● Configuration: Provide a Database Connection Pooling Service
(e.g., DBCPConnectionPool), SQL statement, and other database
settings.

2. PutSQL

● Function: Executes SQL commands (INSERT, UPDATE, DELETE) to
update the database.
● Use Case: Insert or update data in a relational database.
● Configuration: Provide a Database Connection Pooling Service and
the SQL statement.

3. QueryDatabaseTable

● Function: Queries a database table and converts the rows into
flow files.
● Use Case: Extract data from a database table.
● Configuration: Set the SQL Query and Database Connection Pooling
Service.

4. GenerateTableFetch

● Function: Generates SQL queries to fetch data from a database.
● Use Case: Automatically generate queries to fetch records from a
table (useful for large tables).
● Configuration: Define Maximum Batch Size, Columns to Query, and
Database Connection Pooling Service.

EXPERIMENT -4

AIM: Working with Databases

a. Inserting and extracting relational data in Python

b. Inserting and extracting NoSQL database data in Python

c. Building database pipelines in Airflow

d. Building database pipelines in NiFi

a. Inserting and Extracting Relational Data in Python

To interact with a relational database (like PostgreSQL, MySQL, etc.)
in Python, you can use libraries such as psycopg2 (PostgreSQL) or
mysql-connector-python (MySQL). Here's an example using PostgreSQL:

Extracting Data from a Relational Database (PostgreSQL)

python

import psycopg2

# Connect to your database
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)

cursor = conn.cursor()

# Extract data (SELECT query)
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()

# Process and print data
for row in rows:
    print(f"ID: {row[0]}, Name: {row[1]}, Age: {row[2]}")

# Close connection
cursor.close()
conn.close()

Inserting Data into a Relational Database (PostgreSQL)

python

import psycopg2

# Connect to the PostgreSQL database
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)

cursor = conn.cursor()

# Insert data into the "users" table
cursor.execute("""
    INSERT INTO users (name, age)
    VALUES (%s, %s)
""", ("John Doe", 29))

# Commit the transaction
conn.commit()

# Close connection
cursor.close()
conn.close()

b. Inserting and Extracting NoSQL Database Data in Python

For NoSQL databases such as MongoDB, you can use the pymongo library
to interact with the database.

Extracting Data from MongoDB

python

from pymongo import MongoClient

# Connect to MongoDB (default localhost and port)
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Extract data
documents = collection.find()

# Print each document
for doc in documents:
    print(doc)

Inserting Data into MongoDB

python

from pymongo import MongoClient

# Connect to MongoDB (default localhost and port)
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Insert a document into the collection
collection.insert_one({
    "name": "Jane Doe",
    "age": 30,
    "email": "[email protected]"
})

c. Building Database Pipelines in Apache Airflow

Airflow allows you to create automated workflows (DAGs) for
interacting with databases. Here's an example of a simple pipeline
that creates a table in PostgreSQL and then inserts data into it.

Building a Database Pipeline in Airflow

python

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import psycopg2

# Define a function to insert data into PostgreSQL
def insert_data_to_db():
    # Connect to PostgreSQL
    conn = psycopg2.connect(
        dbname="your_database",
        user="your_user",
        password="your_password",
        host="localhost",
        port="5432"
    )
    cursor = conn.cursor()

    # Insert data into table
    cursor.execute("""
        INSERT INTO users (name, age)
        VALUES (%s, %s)
    """, ("New User", 25))

    # Commit changes and close the connection
    conn.commit()
    cursor.close()
    conn.close()

# Define the DAG
dag = DAG('database_pipeline_dag', start_date=datetime(2024, 1, 1))

# Task to execute SQL query
create_table_task = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',  # Connection ID in Airflow
    sql="""
        CREATE TABLE IF NOT EXISTS users (
            id SERIAL PRIMARY KEY,
            name VARCHAR(100),
            age INT
        );
    """,
    autocommit=True,
    dag=dag
)

# Task to insert data using Python function
insert_data_task = PythonOperator(
    task_id='insert_data_to_db',
    python_callable=insert_data_to_db,
    dag=dag
)

# Set task dependencies
create_table_task >> insert_data_task

● PostgresOperator: Executes SQL commands (like creating tables or
running queries).
● PythonOperator: Executes Python functions, such as inserting
data.
● DAG: Defines the directed acyclic graph (workflow) of tasks.

d. Building Database Pipelines in NiFi

Apache NiFi provides various processors to interact with relational
and NoSQL databases. Below are some NiFi processors that you can use
to extract data and insert data into databases.

1. Inserting Data into a Relational Database (PostgreSQL)

To insert data into a database, use the PutSQL processor in NiFi.
Here's how you can configure it:

Steps:

1. Add the PutSQL processor to your NiFi canvas.
2. Configure the processor to use a Database Connection Pooling
Service like DBCPConnectionPool for PostgreSQL.
3. Define your SQL statement for inserting data (e.g., INSERT INTO
users (name, age) VALUES (?, ?)).
4. Link your input data to the processor to dynamically insert data.

NiFi Processor Example for Inserting Data:

● Processor: PutSQL
● SQL Query: INSERT INTO users (name, age) VALUES (?, ?)
● Database Connection Pooling Service: DBCPConnectionPool

Configuration:

1. Database Connection Pooling Service: Configure the
DBCPConnectionPool to connect to your PostgreSQL database.
2. SQL Statement: Provide the SQL query for inserting data. You can
use NiFi Expression Language to pass values dynamically.
3. Batch Size: Adjust this based on how many records you want to
insert at once.

2. Extracting Data from a Relational Database (PostgreSQL)

To extract data from a database, use the ExecuteSQL processor in NiFi.

Steps:

1. Add the ExecuteSQL processor to your NiFi canvas.
2. Configure it to use the same Database Connection Pooling Service.
3. Define your SQL query to fetch data, e.g., SELECT * FROM users;.
4. Set the output format (usually flowfile with the result data).

NiFi Processor Example for Extracting Data:

● Processor: ExecuteSQL
● SQL Query: SELECT * FROM users;
● Database Connection Pooling Service: DBCPConnectionPool

Configuration:

1. Database Connection Pooling Service: Set up the connection
pooling service to interact with the database.
2. SQL Query: Set the SQL query you want to execute.
3. Output Flowfile: The results from the database query will be
output as flowfiles that you can process further in your
pipeline.

EXPERIMENT-5

AIM : Cleaning, Transforming and Enriching Data

a. Performing exploratory data analysis in Python

b. Handling common data issues using pandas

c. Cleaning data using Airflow

a. Performing Exploratory Data Analysis (EDA) in Python

Exploratory Data Analysis (EDA) is the first step in analyzing a
dataset, and it typically involves summarizing the data's key
characteristics and visualizing relationships between variables.

Steps for EDA in Python:

1. Load the dataset using Pandas.
2. Summarize the data.
3. Visualize the distribution of the data and relationships.
4. Identify outliers, missing values, and potential transformations.

Code for Performing EDA in Python

python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assuming a CSV file)
df = pd.read_csv('your_data.csv')

# 1. Basic Information
print("Dataset Info:")
print(df.info())  # Shows info about columns, non-null counts, and types

# 2. Statistical Summary
print("\nStatistical Summary:")
print(df.describe())  # Shows statistical summary of numeric columns

# 3. Handling Missing Data
print("\nMissing Data (percentage of missing values):")
print(df.isnull().mean() * 100)  # Displays percentage of missing values for each column

# 4. Visualizing Data Distribution (histograms)
df.hist(figsize=(12, 10), bins=30)
plt.suptitle('Data Distribution')
plt.show()

# 5. Correlation Matrix
correlation_matrix = df.corr(numeric_only=True)  # Correlation between numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title('Correlation Heatmap')
plt.show()

# 6. Visualizing Outliers (Box Plot)
plt.figure(figsize=(10, 8))
sns.boxplot(data=df)
plt.title('Box Plot for Outliers')
plt.show()

# 7. Pairplot for relationships between variables
sns.pairplot(df)
plt.suptitle('Pairplot of Variables', y=1.02)
plt.show()

What this code does:

1. df.info(): Provides the data types, non-null counts, and general
information.
2. df.describe(): Shows the statistical summary for numerical
columns (mean, standard deviation, min/max, etc.).
3. df.isnull().mean(): Identifies missing values and shows the
percentage of missing data for each column.
4. df.hist(): Plots histograms for each numerical column.
5. Correlation heatmap: Shows the correlation between numerical
columns in the dataset.
6. Boxplots: Identify potential outliers in the dataset.
7. Pairplot: Provides pairwise plots to examine relationships
between columns.

b. Handling Common Data Issues Using Pandas

When working with data in Python, Pandas provides a wide range of
functions to handle common issues such as missing data, duplicate
rows, and inconsistent data formats. Below are some common tasks:

Common Data Issues & Solutions:

1. Handling Missing Data:

Removing missing values:

python
df.dropna(inplace=True)  # Removes rows with any missing values

Filling missing values (using the mean, median, or custom values):

python
df['column_name'].fillna(df['column_name'].mean(), inplace=True)  # Fill with mean value

2. Handling Duplicates:

Removing duplicates:

python
df.drop_duplicates(inplace=True)

3. Handling Data Type Issues:

Converting columns to the correct data type:

python
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Convert non-numeric values to NaN

4. Handling Inconsistent Data (e.g., string values with
leading/trailing spaces):

Stripping spaces from string columns:

python
df['string_column'] = df['string_column'].str.strip()

5. Normalizing/Scaling Data:

Min-Max Scaling:

python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_to_scale']])

6. Renaming Columns:

python
df.rename(columns={'old_name': 'new_name'}, inplace=True)

7. Replacing Values:

Replacing specific values in a column:

python
df['column_name'] = df['column_name'].replace({'old_value': 'new_value'})

Code for Handling Missing Values and Duplicates

python

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv('your_data.csv')

# Remove rows with missing values
df.dropna(inplace=True)

# Fill missing values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Convert columns to the correct data type
df['date_column'] = pd.to_datetime(df['date_column'])

# Normalize a numeric column
scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])

# Strip leading/trailing spaces from string columns
df['column_name'] = df['column_name'].str.strip()

# Save cleaned data to a new CSV file
df.to_csv('cleaned_data.csv', index=False)

c. Cleaning Data Using Airflow

In Airflow, you can automate data cleaning tasks using the
PythonOperator or custom operators. Here's how you can build a DAG
that cleans and transforms data as part of an ETL pipeline.

Steps for Cleaning Data in Airflow:

1. Create a DAG to define the pipeline.
2. Use PythonOperator to clean data (remove missing values,
normalize, etc.).
3. Schedule the DAG to run periodically.

Code to Clean Data Using Airflow

python

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Function to clean data
def clean_data():
    # Load data
    df = pd.read_csv('your_data.csv')

    # Remove rows with missing values
    df.dropna(inplace=True)

    # Fill missing values with the mean of each numeric column
    df.fillna(df.mean(numeric_only=True), inplace=True)

    # Normalize a numeric column
    scaler = MinMaxScaler()
    df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])

    # Strip spaces from string columns
    df['column_name'] = df['column_name'].str.strip()

    # Save the cleaned data
    df.to_csv('cleaned_data.csv', index=False)

# Define the DAG
dag = DAG(
    'data_cleaning_dag',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',  # Runs every day
    catchup=False
)

# Task to clean data
clean_data_task = PythonOperator(
    task_id='clean_data',
    python_callable=clean_data,
    dag=dag
)

# This DAG has a single task, so there are no dependencies to set
clean_data_task
