
DE LAB MANUAL

EXPERIMENT -1

AIM : Installing and configuring Apache NiFi, Apache Airflow.

PROCEDURE :
Part 1: Installing and Configuring Apache NiFi on Windows

Apache NiFi is a powerful data integration tool for automating data flow. Here’s how to install
and configure it on Windows:

Step 1: Prerequisites

● Ensure that Java 8 or later is installed on your Windows system.


● Download and Install Java from: Oracle JDK.
● Add the Java bin directory to your PATH environment variable.

Step 2: Download Apache NiFi

1. Go to the official Apache NiFi website and download the Windows binary:
○ https://nifi.apache.org/download.html
2. Select the appropriate version and download the .zip file.

Step 3: Install Apache NiFi

1. Extract the ZIP file to a location of your choice, for example, C:\nifi.
2. Navigate to the NiFi folder (C:\nifi).

Step 4: Configure NiFi (Optional)

● You can configure NiFi by modifying the conf\nifi.properties file if needed (e.g.,
change ports, security settings).
● In most cases, you can use the default settings.

Step 5: Start NiFi


1. Open a Command Prompt as Administrator and navigate to the NiFi bin directory:

bash
cd C:\nifi\bin

2. Run the following command to start NiFi:

bash
nifi.bat start

Step 6: Access NiFi UI

● Open a browser and navigate to the NiFi Web UI:
○ http://localhost:8080/nifi
● You should now see the NiFi UI where you can configure processors, create workflows,
and monitor data flows.

Step 7: Stop NiFi


To stop NiFi, run the following command in the bin directory:

bash
nifi.bat stop

Part 2: Installing and Configuring Apache Airflow on Windows

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor
workflows. Here's how to install and configure it on Windows. (Note: Airflow does not officially
support native Windows; on a Windows machine it is usually run inside WSL 2 or a Docker
container, using the same commands shown below.)

Step 1: Prerequisites

● Ensure you have Python 3.7+ installed on your system.


● You may also need PIP (Python's package manager) installed.

Step 2: Install Apache Airflow

1. Set up a Virtual Environment (recommended to avoid dependency conflicts):

Open a command prompt (or PowerShell) and create a virtual environment:

bash
python -m venv airflow_env

Activate the virtual environment:

bash
airflow_env\Scripts\activate

2. Install Apache Airflow via pip:

bash
pip install apache-airflow

You can also install a specific version of Airflow:

bash
pip install apache-airflow==2.3.0

Step 3: Initialize Airflow Database


Airflow uses a metadata database. You need to initialize the database first:

bash
airflow db init

Step 4: Start Airflow Web Server and Scheduler

1. Start the Airflow Web Server:

Run this command to start the Airflow UI on localhost:8080:

bash
airflow webserver --port 8080

2. Start the Airflow Scheduler:

In another command prompt, start the Airflow scheduler:

bash
airflow scheduler

Step 5: Access Airflow UI

● Open a web browser and go to:
○ http://localhost:8080
● Airflow 2.x does not ship with default login credentials. Create an admin user first, for example:
○ airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com
○ You will be prompted for a password; log in with that username and password.

Step 6: Configure Airflow DAGs

1. Create a DAG by writing a Python script in the dags directory (typically located at
C:\Users\<YourUsername>\airflow\dags).
○ Example of a simple DAG (simple_dag.py):

python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 11, 23),
    'retries': 1,
}

# Define the DAG and schedule it to run once a day
dag = DAG('simple_dag', default_args=default_args,
          schedule_interval='@daily')

# Two placeholder tasks, wired so that start runs before end
start_task = DummyOperator(task_id='start', dag=dag)
end_task = DummyOperator(task_id='end', dag=dag)

start_task >> end_task

Step 7: Stop Airflow

1. Stop the webserver and scheduler:

○ To stop the web server, use Ctrl+C in the terminal where it's running.
○ Similarly, stop the scheduler with Ctrl+C in the scheduler terminal.

Troubleshooting Tips

For Apache NiFi:

1. Java Version Error:


○ If you receive an error related to Java, check your JAVA_HOME environment
variable and ensure it points to the correct Java version.
○ Restart your terminal after updating the PATH or JAVA_HOME variable.
2. Port Conflict:
○ NiFi runs by default on port 8080. If this port is in use, change it in the
nifi.properties file (located in the conf folder) by editing the
nifi.web.http.port property.

For Apache Airflow:

1. Database Initialization Errors:


○ Ensure you have initialized the Airflow database properly with airflow db
init.

○ If you encounter errors related to the database, clear the existing database (only in a
development environment) using:

bash
airflow db reset


2. Permissions:
○ Sometimes, Airflow needs elevated permissions (especially for Windows). If you
encounter permissions-related issues, ensure you're running the command
prompt as Administrator.

EXPERIMENT - 2

AIM : Installing and configuring Elasticsearch, Kibana, PostgreSQL, pgAdmin 4

PROCEDURE :

Step-by-Step Guide to Install and Configure Elasticsearch, Kibana,
PostgreSQL, and pgAdmin 4 on Windows

Part 1: Installing and Configuring Elasticsearch

Elasticsearch is a distributed search and analytics engine. Here's how to install it on Windows.

Step 1: Download Elasticsearch

1. Go to the official Elasticsearch download page.


2. Select the Windows version of Elasticsearch and download the ZIP file.

Step 2: Extract Elasticsearch

1. Once downloaded, extract the ZIP file to a directory, e.g., C:\elasticsearch.

Step 3: Configure Elasticsearch (Optional)

1. Navigate to the config folder within the extracted directory (e.g.,
C:\elasticsearch\config).
2. Open elasticsearch.yml with a text editor to configure settings like network host,
cluster name, etc.

For example, to bind Elasticsearch to localhost:

network.host: localhost

Step 4: Start Elasticsearch

1. Open Command Prompt as Administrator.

Navigate to the bin folder within the Elasticsearch directory:

cd C:\elasticsearch\bin

2. Run the following command to start Elasticsearch:

elasticsearch.bat

Step 5: Verify Elasticsearch Installation


Open a browser and go to:
http://localhost:9200

● You should see a JSON response confirming Elasticsearch is running.
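If you prefer to verify from a script, here is a minimal Python sketch of the same check. It assumes Elasticsearch is reachable over plain HTTP on localhost:9200, as in the configuration described above; recent Elasticsearch 8.x releases enable TLS and authentication by default, in which case this request will not work as-is.

python

import json
import urllib.request

# Request the Elasticsearch root endpoint and decode the JSON response
with urllib.request.urlopen("http://localhost:9200") as response:
    info = json.loads(response.read().decode("utf-8"))

# Print the cluster name and version number reported by the node
print(info["cluster_name"], info["version"]["number"])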

Part 2: Installing and Configuring Kibana

Kibana is a data visualization platform that works with Elasticsearch. Here’s how to install it on
Windows.

Step 1: Download Kibana

1. Go to the official Kibana download page.


2. Download the Windows ZIP version of Kibana.

Step 2: Extract Kibana

1. Extract the ZIP file to a directory, e.g., C:\kibana.

Step 3: Configure Kibana

1. Navigate to the config folder inside the Kibana directory (e.g., C:\kibana\config).
2. Open the kibana.yml file and modify the following configuration:

Set the elasticsearch.hosts to match your Elasticsearch server (usually
localhost:9200):

elasticsearch.hosts: ["http://localhost:9200"]

Step 4: Start Kibana

1. Open Command Prompt as Administrator.

2. Navigate to the bin folder in the Kibana directory:

cd C:\kibana\bin

3. Start Kibana by running:

kibana.bat

Step 5: Verify Kibana Installation


Open a browser and go to:
http://localhost:5601

● You should see the Kibana dashboard, confirming that Kibana is connected to
Elasticsearch.

Part 3: Installing and Configuring PostgreSQL

PostgreSQL is a powerful open-source relational database system. Follow these steps to install
it on Windows.

Step 1: Download PostgreSQL

1. Go to the official PostgreSQL download page.


2. Download the installer for Windows from the EnterpriseDB website.

Step 2: Install PostgreSQL

1. Run the PostgreSQL installer .exe file you downloaded.


2. Follow the installation wizard steps:
○ Choose your installation directory (default is C:\Program Files\PostgreSQL\xx).
○ Select components (PostgreSQL Server, pgAdmin, Stack Builder, etc.).
○ Set the password for the postgres user (remember this password).
○ Choose the port (default is 5432).
○ Select the locale (default is fine).

Step 3: Start PostgreSQL

● PostgreSQL should start automatically after installation. You can verify that the
PostgreSQL service is running by checking the Services window (services.msc).

Step 4: Connect to PostgreSQL via Command Line (Optional)

1. Open Command Prompt as Administrator.

2. Navigate to the PostgreSQL bin directory (usually in C:\Program Files\PostgreSQL\xx\bin).

3. Run the following command to log in as the postgres user:

psql -U postgres

4. Enter the password you set during installation.

Step 5: Verify PostgreSQL Installation


In the psql terminal, run the following SQL query to confirm PostgreSQL is working:
SELECT version();

Part 4: Installing and Configuring pgAdmin 4

pgAdmin is a web-based GUI for managing PostgreSQL databases. Here's how to install and
configure it.

Step 1: Download pgAdmin 4

1. Go to the official pgAdmin download page.


2. Download the Windows version of pgAdmin 4.

Step 2: Install pgAdmin 4

1. Run the pgAdmin 4 installer and follow the steps:


○ Choose installation directory and components.
○ Select whether to install pgAdmin for all users or only the current user.

Step 3: Launch pgAdmin 4

1. After installation, you can launch pgAdmin 4 from the Start Menu or desktop shortcut.
2. The first time you launch it, you will be prompted to set the master password (used for
securing your pgAdmin credentials).

Step 4: Add PostgreSQL Server in pgAdmin

1. Once inside pgAdmin, click on "Add New Server" in the browser panel.

2. In the General tab, name the server (e.g., PostgreSQL Server).

3. In the Connection tab, configure the following:

○ Host name/address: localhost
○ Port: 5432 (default)
○ Maintenance database: postgres
○ Username: postgres
○ Password: The password you set during PostgreSQL installation.
4. Click Save to add the server.

Step 5: Verify Connection

● You should now see your PostgreSQL server listed in pgAdmin. Expand the server, and
you can access the databases, tables, and run SQL queries.

Conclusion

By following these steps, you should have successfully installed and configured:

● Elasticsearch (for distributed search and analytics).


● Kibana (for data visualization connected to Elasticsearch).
● PostgreSQL (a relational database).
● pgAdmin 4 (a GUI for managing PostgreSQL databases).

Troubleshooting Tips

1. Elasticsearch:
○ If you face connection issues, ensure that Elasticsearch is running on
localhost:9200 and that no firewall is blocking the port.
2. Kibana:
○ If Kibana isn't displaying properly, verify that Elasticsearch is running and check
the Kibana logs located in C:\kibana\logs\.
3. PostgreSQL:
○ If you can't connect to PostgreSQL, ensure that the PostgreSQL service is
running and that your firewall allows connections on port 5432.
4. pgAdmin:
○ If pgAdmin can't connect to PostgreSQL, double-check the server settings in
pgAdmin, especially the username and password.

With all components installed and configured, you now have a robust stack for data storage,
search, analytics, and visualization.
EXPERIMENT -3

AIM : Reading and Writing files

a. Reading and writing files in Python

b. Processing files in Airflow

c. NiFi processors for handling files

d. Reading and writing data to databases in Python

e. Databases in Airflow

f. Database processors in NiFi

a. Reading and Writing Files in Python

Python provides built-in functions to read from and write to files. Below are examples for
common file operations.

Reading Files in Python

To read a file in Python, you can use the open() function with the r mode (read):

python

# Reading a file
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)

● open(): Opens the file in the specified mode ('r' for reading).
● with statement: Automatically closes the file after reading.

Writing Files in Python

To write data to a file, use the open() function with the w (write) or a (append) mode:

python

# Writing to a file
with open('output.txt', 'w') as file:
    file.write("Hello, this is a test!\n")
    file.write("Second line.")

● w mode: Overwrites the file (creates the file if it doesn't exist).
● a mode: Appends data to the end of the file (and also creates it if it doesn't exist).
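As a small illustration of the a mode, here is a sketch that assumes output.txt was created by the previous example:

python

# Appending to a file ('a' adds to the end instead of overwriting)
with open('output.txt', 'a') as file:
    file.write("\nThird line, appended.")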

Reading and Writing Binary Files in Python

For reading and writing binary files (e.g., images or PDFs), use the rb and wb modes
respectively:

python

# Reading a binary file
with open('image.jpg', 'rb') as file:
    data = file.read()

# Writing a binary file
with open('copy_image.jpg', 'wb') as file:
    file.write(data)

b. Processing Files in Apache Airflow


Apache Airflow allows you to automate file processing tasks, such as reading, writing, and
moving files between systems. Airflow uses DAGs (Directed Acyclic Graphs) to define
workflows, which can include file operations using operators like PythonOperator,
BashOperator, or specialized operators.

Example of File Processing in Airflow

python

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

# Function to read from and write to a file
def process_file():
    with open('/path/to/input_file.txt', 'r') as file:
        content = file.read()
    with open('/path/to/output_file.txt', 'w') as file:
        file.write(f"Processed: {content}")

# Define the DAG
dag = DAG('file_processing_dag', start_date=datetime(2024, 1, 1))

# Create a PythonOperator to call the function
process_file_task = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
    dag=dag
)

● PythonOperator: Executes a Python function.


● The function reads from an input file, processes the content, and writes the result to an
output file.

Using BashOperator to Process Files

You can also use the BashOperator to run shell commands, such as copying or moving files:

python

from airflow.operators.bash_operator import BashOperator

# Run a shell command that moves the input file to a new location
# (note: 'mv' assumes a Unix-style shell; on native Windows use 'move' instead)
move_file_task = BashOperator(
    task_id='move_file',
    bash_command='mv /path/to/input_file.txt /path/to/output_file.txt',
    dag=dag
)

c. NiFi Processors for Handling Files

Apache NiFi provides processors to handle files, such as reading,
writing, and moving files. Here are some of the key NiFi processors
for working with files:

1. GetFile

● Function: Reads files from a directory and sends them as flow
files.
● Use Case: Ingesting files from local directories into NiFi.
● Configuration: Set the Input Directory to the folder containing
files you want to read.

2. PutFile

● Function: Writes flow files to a directory.


● Use Case: Saving processed files to disk.
● Configuration: Set the Directory property to the folder where you
want to write the files.

3. ListFile

● Function: Lists files in a specified directory without consuming
them.
● Use Case: To list files in a directory for further processing.
● Configuration: Set the Directory to where the files are located.

4. FetchFile

● Function: Retrieves files from a local directory or network
location.
● Use Case: Fetching a specific file to process further in NiFi.
● Configuration: Specify the file path or URL to fetch the file.

5. ReplaceText

● Function: Reads the content of a flow file, modifies it using
regular expressions, and writes the modified content back.
● Use Case: Processing the contents of a file (e.g., replacing
specific words or formatting).

d. Reading and Writing Data to Databases in Python

Python supports database connections through libraries like psycopg2
(for PostgreSQL), pyodbc (for SQL Server), and SQLAlchemy (for ORM-
based access).

Example: Reading from PostgreSQL in Python

1. Install psycopg2 (on Windows the prebuilt psycopg2-binary package is often easier to install):

bash
pip install psycopg2

2. Connect to a PostgreSQL database and fetch data:

python

import psycopg2

# Connect to the PostgreSQL server
conn = psycopg2.connect(dbname="test_db", user="user",
                        password="password", host="localhost", port="5432")

cursor = conn.cursor()

# Query to fetch data
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()

# Print results
for row in rows:
    print(row)

# Close the connection
cursor.close()
conn.close()

Example: Writing to PostgreSQL in Python

python

import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(dbname="test_db", user="user",
                        password="password", host="localhost", port="5432")

cursor = conn.cursor()

# Insert data into table
cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)",
               ("Alice", 30))

# Commit changes and close the connection
conn.commit()
cursor.close()
conn.close()
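The intro above also mentions SQLAlchemy. Below is a minimal sketch of the same SELECT using a SQLAlchemy engine; the connection URL reuses the placeholder database name and credentials from the psycopg2 examples, so adjust it for your own setup.

python

from sqlalchemy import create_engine, text

# Build an engine from a connection URL (placeholder credentials)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/test_db")

# Run a query and print each row
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM users;"))
    for row in result:
        print(row)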

e. Databases in Airflow

Airflow offers a variety of operators to interact with databases.
Some commonly used operators include:

● PostgresOperator: To execute SQL commands on a PostgreSQL
database.
● MySqlOperator: To interact with MySQL databases.
● PythonOperator: To execute Python code, which can include
database interactions.
Example of Using PostgresOperator in Airflow

python

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

# Define the DAG
dag = DAG('postgres_example', start_date=datetime(2024, 1, 1))

# SQL command to run
sql_query = """
CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    age INT
);
"""

# Task to execute SQL command
create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',  # Connection ID to PostgreSQL
    sql=sql_query,
    autocommit=True,  # Automatically commits the transaction
    dag=dag
)

● PostgresOperator: Executes SQL queries on a PostgreSQL database.


● postgres_conn_id: Connection ID set in Airflow's connection
configuration.

f. Database Processors in NiFi

NiFi provides processors to interact with databases. Common database
processors include:

1. ExecuteSQL

● Function: Executes SQL queries (SELECT, INSERT, UPDATE, DELETE)
against a database.
● Use Case: Running custom SQL queries to interact with databases.
● Configuration: Provide a Database Connection Pooling Service
(e.g., DBCPConnectionPool), SQL statement, and other database
settings.

2. PutSQL

● Function: Executes SQL commands (INSERT, UPDATE, DELETE) to
update the database.
● Use Case: Insert or update data in a relational database.
● Configuration: Provide a Database Connection Pooling Service and
the SQL statement.

3. QueryDatabaseTable

● Function: Queries a database table and converts the rows into
flow files.
● Use Case: Extract data from a database table.
● Configuration: Set the SQL Query and Database Connection Pooling
Service.

4. GenerateTableFetch

● Function: Generates SQL queries to fetch data from a database.
● Use Case: Automatically generate queries to fetch records from a
table (useful for large tables).
● Configuration: Define Maximum Batch Size, Columns to Query, and
Database Connection Pooling Service.

EXPERIMENT -4

AIM: Working with Databases

a. Inserting and extracting relational data in Python

b. Inserting and extracting NoSQL database data in Python

c. Building database pipelines in Airflow

d. Building database pipelines in NiFi

a. Inserting and Extracting Relational Data in Python

To interact with a relational database (like PostgreSQL, MySQL, etc.)
in Python, you can use libraries such as psycopg2 (PostgreSQL) or
mysql-connector-python (MySQL). Here's an example using PostgreSQL:

Extracting Data from a Relational Database (PostgreSQL)

python

import psycopg2

# Connect to your database
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)

cursor = conn.cursor()

# Extract data (SELECT query)
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()

# Process and print data
for row in rows:
    print(f"ID: {row[0]}, Name: {row[1]}, Age: {row[2]}")

# Close connection
cursor.close()
conn.close()

Inserting Data into a Relational Database (PostgreSQL)

python

import psycopg2

# Connect to the PostgreSQL database
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)

cursor = conn.cursor()

# Insert data into the "users" table
cursor.execute("""
    INSERT INTO users (name, age)
    VALUES (%s, %s)
""", ("John Doe", 29))

# Commit the transaction
conn.commit()

# Close connection
cursor.close()
conn.close()

b. Inserting and Extracting NoSQL Database Data in Python

For NoSQL databases such as MongoDB, you can use the pymongo library
to interact with the database.

Extracting Data from MongoDB

python

from pymongo import MongoClient

# Connect to MongoDB (default localhost and port)
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Extract data
documents = collection.find()

# Print each document
for doc in documents:
    print(doc)

Inserting Data into MongoDB

python

from pymongo import MongoClient

# Connect to MongoDB (default localhost and port)
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Insert a document into the collection
collection.insert_one({
    "name": "Jane Doe",
    "age": 30,
    "email": "[email protected]"
})

c. Building Database Pipelines in Apache Airflow

Airflow allows you to create automated workflows (DAGs) for
interacting with databases. Here's an example of a simple pipeline
that creates a table in PostgreSQL and then inserts data into it.

Building a Database Pipeline in Airflow

python

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import psycopg2

# Define a function to insert data into PostgreSQL
def insert_data_to_db():
    # Connect to PostgreSQL
    conn = psycopg2.connect(
        dbname="your_database",
        user="your_user",
        password="your_password",
        host="localhost",
        port="5432"
    )
    cursor = conn.cursor()

    # Insert data into table
    cursor.execute("""
        INSERT INTO users (name, age)
        VALUES (%s, %s)
    """, ("New User", 25))

    # Commit changes and close the connection
    conn.commit()
    cursor.close()
    conn.close()

# Define the DAG
dag = DAG('database_pipeline_dag', start_date=datetime(2024, 1, 1))

# Task to execute SQL query
create_table_task = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',  # Connection ID in Airflow
    sql="""
        CREATE TABLE IF NOT EXISTS users (
            id SERIAL PRIMARY KEY,
            name VARCHAR(100),
            age INT
        );
    """,
    autocommit=True,
    dag=dag
)

# Task to insert data using Python function
insert_data_task = PythonOperator(
    task_id='insert_data_to_db',
    python_callable=insert_data_to_db,
    dag=dag
)

# Set task dependencies
create_table_task >> insert_data_task

● PostgresOperator: Executes SQL commands (like creating tables or
running queries).
● PythonOperator: Executes Python functions, such as inserting
data.
● DAG: Defines the directed acyclic graph (workflow) of tasks.

d. Building Database Pipelines in NiFi

Apache NiFi provides various processors to interact with relational
and NoSQL databases. Below are some NiFi processors that you can use
to extract data and insert data into databases.

1. Inserting Data into a Relational Database (PostgreSQL)

To insert data into a database, use the PutSQL processor in NiFi.
Here's how you can configure it:

Steps:

1. Add the PutSQL processor to your NiFi canvas.
2. Configure the processor to use a Database Connection Pooling
Service like DBCPConnectionPool for PostgreSQL.
3. Define your SQL statement for inserting data (e.g., INSERT INTO
users (name, age) VALUES (?, ?)).
4. Link your input data to the processor to dynamically insert data.

NiFi Processor Example for Inserting Data:

● Processor: PutSQL
● SQL Query: INSERT INTO users (name, age) VALUES (?, ?)
● Database Connection Pooling Service: DBCPConnectionPool

Configuration:

1. Database Connection Pooling Service: Configure the
DBCPConnectionPool to connect to your PostgreSQL database.
2. SQL Statement: Provide the SQL query for inserting data. You can
use NiFi Expression Language to pass values dynamically.
3. Batch Size: Adjust this based on how many records you want to
insert at once.

2. Extracting Data from a Relational Database (PostgreSQL)

To extract data from a database, use the ExecuteSQL processor in NiFi.

Steps:

1. Add the ExecuteSQL processor to your NiFi canvas.
2. Configure it to use the same Database Connection Pooling Service.
3. Define your SQL query to fetch data, e.g., SELECT * FROM users;.
4. Set the output format (usually flowfile with the result data).

NiFi Processor Example for Extracting Data:

● Processor: ExecuteSQL
● SQL Query: SELECT * FROM users;
● Database Connection Pooling Service: DBCPConnectionPool

Configuration:

1. Database Connection Pooling Service: Set up the connection
pooling service to interact with the database.
2. SQL Query: Set the SQL query you want to execute.
3. Output Flowfile: The results from the database query will be
output as flowfiles that you can process further in your
pipeline.

EXPERIMENT-5

AIM : Cleaning, Transforming and Enriching Data

a. Performing exploratory data analysis in Python

b. Handling common data issues using pandas

c. Cleaning data using Airflow

a. Performing Exploratory Data Analysis (EDA) in Python

Exploratory Data Analysis (EDA) is the first step in analyzing a
dataset, and it typically involves summarizing the data's key
characteristics and visualizing relationships between variables.

Steps for EDA in Python:

1. Load the dataset using Pandas.
2. Summarize the data.
3. Visualize the distribution of the data and relationships.
4. Identify outliers, missing values, and potential transformations.

Code for Performing EDA in Python

python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assuming a CSV file)
df = pd.read_csv('your_data.csv')

# 1. Basic Information
print("Dataset Info:")
print(df.info())  # Shows info about columns, non-null counts, and types

# 2. Statistical Summary
print("\nStatistical Summary:")
print(df.describe())  # Shows statistical summary of numeric columns

# 3. Handling Missing Data
print("\nMissing Data (percentage of missing values):")
print(df.isnull().mean() * 100)  # Displays percentage of missing values for each column

# 4. Visualizing Data Distribution (histograms)
df.hist(figsize=(12, 10), bins=30)
plt.suptitle('Data Distribution')
plt.show()

# 5. Correlation Matrix
correlation_matrix = df.corr(numeric_only=True)  # Correlation between numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title('Correlation Heatmap')
plt.show()

# 6. Visualizing Outliers (Box Plot)
plt.figure(figsize=(10, 8))
sns.boxplot(data=df)
plt.title('Box Plot for Outliers')
plt.show()

# 7. Pairplot for relationships between variables
sns.pairplot(df)
plt.suptitle('Pairplot of Variables', y=1.02)
plt.show()

What this code does:

1. df.info(): Provides the data types, non-null counts, and general
information.
2. df.describe(): Shows the statistical summary for numerical
columns (mean, standard deviation, min/max, etc.).
3. df.isnull().mean(): Identifies missing values and shows the
percentage of missing data for each column.
4. df.hist(): Plots histograms for each numerical column.
5. Correlation heatmap: Shows the correlation between numerical
columns in the dataset.
6. Boxplots: Identify potential outliers in the dataset.
7. Pairplot: Provides pairwise plots to examine relationships
between columns.

b. Handling Common Data Issues Using Pandas

When working with data in Python, Pandas provides a wide range of
functions to handle common issues such as missing data, duplicate
rows, and inconsistent data formats. Below are some common tasks:

Common Data Issues & Solutions:

1. Handling Missing Data:

Removing missing values:

python
df.dropna(inplace=True)  # Removes rows with any missing values

Filling missing values (using the mean, median, or custom values):

python
df['column_name'].fillna(df['column_name'].mean(), inplace=True)  # Fill with mean value

2. Handling Duplicates:

Removing duplicates:

python
df.drop_duplicates(inplace=True)

3. Handling Data Type Issues:

Converting columns to the correct data type:

python
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Convert non-numeric values to NaN

4. Handling Inconsistent Data (e.g., string values with
leading/trailing spaces):

Stripping spaces from string columns:

python
df['string_column'] = df['string_column'].str.strip()

5. Normalizing/Scaling Data:

Min-Max Scaling:

python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_to_scale']])

6. Renaming Columns:

python
df.rename(columns={'old_name': 'new_name'}, inplace=True)

7. Replacing Values:

Replacing specific values in a column:

python
df['column_name'] = df['column_name'].replace({'old_value': 'new_value'})

Code for Handling Missing Values and Duplicates

python

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv('your_data.csv')

# Remove rows with missing values
df.dropna(inplace=True)

# Fill missing values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Convert columns to the correct data type
df['date_column'] = pd.to_datetime(df['date_column'])

# Normalize a numeric column
scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])

# Strip leading/trailing spaces from string columns
df['column_name'] = df['column_name'].str.strip()

# Save cleaned data to a new CSV file
df.to_csv('cleaned_data.csv', index=False)

c. Cleaning Data Using Airflow

In Airflow, you can automate data cleaning tasks using the
PythonOperator or custom operators. Here's how you can build a DAG
that cleans and transforms data as part of an ETL pipeline.

Steps for Cleaning Data in Airflow:

1. Create a DAG to define the pipeline.
2. Use PythonOperator to clean data (remove missing values,
normalize, etc.).
3. Schedule the DAG to run periodically.

Code to Clean Data Using Airflow

python

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Function to clean data
def clean_data():
    # Load data
    df = pd.read_csv('your_data.csv')

    # Remove rows with missing values
    df.dropna(inplace=True)

    # Fill missing values with the mean of each numeric column
    df.fillna(df.mean(numeric_only=True), inplace=True)

    # Normalize a numeric column
    scaler = MinMaxScaler()
    df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])

    # Strip spaces from string columns
    df['column_name'] = df['column_name'].str.strip()

    # Save the cleaned data
    df.to_csv('cleaned_data.csv', index=False)

# Define the DAG
dag = DAG(
    'data_cleaning_dag',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',  # Runs every day
    catchup=False
)

# Task to clean data
clean_data_task = PythonOperator(
    task_id='clean_data',
    python_callable=clean_data,
    dag=dag
)

# This DAG has a single task, so there are no dependencies to set
clean_data_task
