DE Lab Manual
EXPERIMENT - 1
PROCEDURE :
Part 1: Installing and Configuring Apache NiFi on Windows
Apache NiFi is a powerful data integration tool for automating data flow. Here’s how to install
and configure it on Windows:
Step 1: Prerequisites
1. Go to the official Apache NiFi website and download the Windows binary:
○ https://nifi.apache.org/download.html
2. Select the appropriate version and download the .zip file.
Step 2: Extract and Configure NiFi
1. Extract the ZIP file to a location of your choice, for example, C:\nifi.
2. Navigate to the NiFi folder (C:\nifi).
● You can configure NiFi by modifying the conf\nifi.properties file if needed (e.g., to change ports or security settings).
● In most cases, you can use the default settings.
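For reference, a port change is a one-line edit in conf\nifi.properties. The snippet below is only a sketch: the property names are standard NiFi settings, but the values shown are examples and the defaults vary between NiFi versions.
properties
Code
# conf\nifi.properties (example values only)
nifi.web.https.host=127.0.0.1
nifi.web.https.port=8443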
Step 3: Start NiFi
1. Start NiFi by running bin\run-nifi.bat from the NiFi folder.
2. Once startup completes, open the NiFi UI in your browser (by default at https://localhost:8443/nifi; the exact host and port are set in conf\nifi.properties).
Part 2: Installing and Configuring Apache Airflow on Windows
Step 1: Prerequisites
● Apache Airflow is a Python package: make sure Python and pip are installed, then install Airflow with pip install apache-airflow.
Step 2: Start the Airflow Services
1. Start the Airflow web server:
   airflow webserver
2. Start the Airflow Scheduler:
   airflow scheduler
Step 3: Create a DAG
1. Create a DAG by writing a Python script in the dags directory (typically located at C:\Users\<YourUsername>\airflow\dags).
○ Example of a simple DAG (simple_dag.py):
python
code
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 11, 23),
    'retries': 1,
}

# Define the DAG and a single placeholder task
dag = DAG('simple_dag', default_args=default_args, schedule_interval='@daily')
start_task = DummyOperator(task_id='start', dag=dag)
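As a small extension of the sketch above (the end task here is hypothetical, not part of the original example), tasks can be chained so Airflow runs them in order:
python
code
# Add a second placeholder task to the same DAG
end_task = DummyOperator(task_id='end', dag=dag)

# The >> operator sets the execution order: start runs before end
start_task >> end_task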
○ To stop the web server, use Ctrl+C in the terminal where it's running.
○ Similarly, stop the scheduler with Ctrl+C in the scheduler terminal.
Troubleshooting Tips
1. Database Errors:
○ If you encounter errors related to the database, clear the existing database (only in a development environment) using:
bash
code
airflow db reset
2. Permissions:
○ Sometimes, Airflow needs elevated permissions (especially for Windows). If you
encounter permissions-related issues, ensure you're running the command
prompt as Administrator.
By: C BALAJIM.Tech
Asst. Prof.
EXPERIMENT - 2
PROCEDURE :
Elasticsearch is a distributed search and analytics engine. Here's how to install it on Windows.
● Start Elasticsearch by running bin\elasticsearch.bat from the extracted Elasticsearch folder (for example, C:\elasticsearch).
● Open your browser and go to http://localhost:9200.
● You should see a JSON response confirming Elasticsearch is running.
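The same check can also be scripted. This is a minimal sketch assuming the requests library is installed (pip install requests) and that Elasticsearch is running on its default port 9200 without authentication:
python
Code
import requests

# Query the Elasticsearch root endpoint and print the cluster information
response = requests.get("http://localhost:9200")
print(response.json())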
Kibana is a data visualization platform that works with Elasticsearch. Here’s how to install it on
Windows.
1. Navigate to the config folder inside the Kibana directory (e.g., C:\kibana\config).
2. Open the kibana.yml file and modify the following configuration:
elasticsearch.hosts: ["http://localhost:9200"]
3. Start Kibana by running:
   kibana.bat
● Open your browser and go to http://localhost:5601.
● You should see the Kibana dashboard, confirming that Kibana is connected to Elasticsearch.
PostgreSQL is a powerful open-source relational database system. Follow these steps to install
it on Windows.
1. Download the PostgreSQL installer for Windows from https://www.postgresql.org/download/windows/ and run it. During installation, set a password for the postgres superuser and keep the default port 5432.
● PostgreSQL should start automatically after installation. You can verify that the PostgreSQL service is running by checking the Services window (services.msc).
Verify the installation using psql:
1. Open a Command Prompt.
2. Navigate to the PostgreSQL bin directory (usually C:\Program Files\PostgreSQL\xx\bin).
3. Connect to the server with the psql client:
   psql -U postgres
4. Enter the password you set during installation.
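Optionally, the connection can also be verified from Python using psycopg2 (pip install psycopg2). A minimal sketch, assuming the default postgres user and the password chosen during installation:
python
Code
import psycopg2

# Connect to the default postgres database and print the server version
conn = psycopg2.connect(dbname="postgres", user="postgres",
                        password="your_password", host="localhost", port="5432")
cursor = conn.cursor()
cursor.execute("SELECT version();")
print(cursor.fetchone())
cursor.close()
conn.close()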
pgAdmin is a web-based GUI for managing PostgreSQL databases. Here's how to install and
configure it.
1. After installation, you can launch pgAdmin 4 from the Start Menu or desktop shortcut.
2. The first time you launch it, you will be prompted to set the master password (used for
securing your pgAdmin credentials).
1. Once inside pgAdmin, click on "Add New Server" in the browser panel.
2. On the General tab, give the server a name (e.g., Local PostgreSQL).
3. On the Connection tab, enter the following connection details:
○ Host name/address: localhost
○ Port: 5432 (default)
○ Maintenance database: postgres
○ Username: postgres
○ Password: The password you set during PostgreSQL installation.
4. Click Save to add the server.
● You should now see your PostgreSQL server listed in pgAdmin. Expand the server, and
you can access the databases, tables, and run SQL queries.
Conclusion
By following these steps, you should have successfully installed and configured:
● Elasticsearch
● Kibana
● PostgreSQL
● pgAdmin
Troubleshooting Tips
1. Elasticsearch:
○ If Elasticsearch doesn't respond at http://localhost:9200, check the logs in the logs\ folder of your Elasticsearch directory.
2. Kibana:
○ If Kibana isn't displaying properly, verify that Elasticsearch is running and check the Kibana logs located in C:\kibana\logs\.
3. PostgreSQL:
○ If you can't connect to PostgreSQL, ensure that the PostgreSQL service is running and that your firewall allows connections on port 5432.
4. pgAdmin:
○ If pgAdmin can't connect to the server, double-check the host, port, username, and password you entered when registering the server.
With all components installed and configured, you now have a robust stack for data storage,
search, analytics, and visualization.
EXPERIMENT - 3
a. Reading and Writing Files in Python
Python provides built-in functions to read from and write to files. Below are examples for
common file operations.
Reading Files in Python
To read a file in Python, you can use the open() function with the 'r' mode (read):
python
Code
# Reading a file
with open('your_file.txt', 'r') as file:
    content = file.read()
    print(content)
● open(): Opens the file in the specified mode ('r' for reading).
● with statement: Automatically closes the file after reading.
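For larger files it is often more convenient to iterate line by line instead of loading everything with read(). A short sketch, using the same placeholder file name:
python
Code
# Reading a file line by line
with open('your_file.txt', 'r') as file:
    for line in file:
        print(line.strip())  # strip() removes the trailing newline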
Writing Files in Python
To write data to a file, use the open() function with the w (write) or a (append) mode:
python
Code
# Writing to a file ('w' overwrites the file, 'a' appends to it)
with open('your_file.txt', 'a') as file:
    file.write("First line.\n")
    file.write("Second line.")
For reading and writing binary files (e.g., images or PDFs), use the rb and wb modes
respectively:
python
Code
with open('your_image.jpg', 'rb') as file:    # read binary
    data = file.read()
with open('copy_of_image.jpg', 'wb') as file:  # write binary
    file.write(data)
File Processing in Airflow
You can use the PythonOperator to read and write files from within an Airflow task:
python
Code
from airflow.operators.python_operator import PythonOperator

# Assumes a DAG object named `dag` has already been defined, as in the earlier example
def process_file():
    with open('/path/to/input_file.txt', 'r') as file:
        content = file.read()
    with open('/path/to/output_file.txt', 'w') as file:
        file.write(f"Processed: {content}")

process_file_task = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
    dag=dag
)
You can also use the BashOperator to run shell commands, such as copying or moving files:
python
Code
from airflow.operators.bash_operator import BashOperator

move_file_task = BashOperator(
    task_id='move_file',
    bash_command='mv /path/to/input_file.txt /path/to/output_file.txt',
    dag=dag
)
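If both tasks belong to the same DAG, you can set their order explicitly so the file is processed before it is moved. A minimal sketch, assuming the two tasks defined above:
python
Code
# Run the PythonOperator task first, then the BashOperator task
process_file_task >> move_file_task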
Common File Processors in NiFi
1. GetFile
● Function: Reads files from a directory and sends them as flow files.
● Use Case: Ingesting files from local directories into NiFi.
● Configuration: Set the Input Directory to the folder containing the files you want to read.
2. PutFile
● Function: Writes incoming flow files to a directory on the local file system.
● Use Case: Delivering processed files to an output folder.
3. ListFile
● Function: Lists the files in a directory and emits one flow file per file (metadata only).
● Use Case: Tracking new files in a directory, typically paired with FetchFile.
4. FetchFile
● Function: Reads the content of the file referenced by an incoming flow file.
● Use Case: Fetching the actual content for the entries produced by ListFile.
5. ReplaceText
● Function: Replaces text in the flow file content using literal or regular-expression matching.
● Use Case: Simple in-flight transformations of text-based files.
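To make the GetFile/PutFile pattern concrete, the sketch below does the equivalent in plain Python: it moves every file from a hypothetical input folder to an output folder. This is only an illustration of the data flow, not NiFi code.
python
Code
import shutil
from pathlib import Path

input_dir = Path(r'C:\nifi_input')    # plays the role of GetFile's Input Directory
output_dir = Path(r'C:\nifi_output')  # plays the role of PutFile's Directory
output_dir.mkdir(parents=True, exist_ok=True)

# Move every file from the input directory to the output directory
for file_path in input_dir.iterdir():
    if file_path.is_file():
        shutil.move(str(file_path), str(output_dir / file_path.name))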
Working with Databases in Python
Python supports database connections through libraries like psycopg2 (for PostgreSQL), pyodbc (for SQL Server), and SQLAlchemy (for ORM-based access).
Install psycopg2:
bash
Code
pip install psycopg2
1. Extracting Data from PostgreSQL:
python
Code
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(dbname="your_database", user="your_user",
                        password="your_password", host="localhost", port="5432")
cursor = conn.cursor()

# Run a query and fetch all rows
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()

# Print results
for row in rows:
    print(row)

# Close the connection
cursor.close()
conn.close()
2. Inserting Data into PostgreSQL:
python
Code
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(dbname="your_database", user="your_user",
                        password="your_password", host="localhost", port="5432")
cursor = conn.cursor()

# Insert a row and commit the transaction
cursor.execute("INSERT INTO users (name, age) VALUES (%s, %s)", ("John", 30))
conn.commit()

# Close the connection
cursor.close()
conn.close()
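The SQLAlchemy library mentioned above can run the same query through an engine. This is a minimal sketch, assuming pip install sqlalchemy and the same placeholder credentials:
python
Code
from sqlalchemy import create_engine, text

# Build an engine from a PostgreSQL connection URL (credentials are placeholders)
engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost:5432/your_database")

with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM users;"))
    for row in result:
        print(row)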
e. Databases in Airflow
Airflow can run SQL against PostgreSQL using the PostgresOperator:
python
Code
from airflow.operators.postgres_operator import PostgresOperator

# Assumes a DAG object named `dag` has already been defined
sql_query = """
CREATE TABLE IF NOT EXISTS users (
    name VARCHAR(100),
    age INT
);
"""

create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',  # Connection ID to PostgreSQL
    sql=sql_query,
    dag=dag
)
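Airflow can also query the same connection from Python code through a hook. The sketch below assumes the postgres_default connection and the dag object used above; note that newer Airflow releases import the hook from airflow.providers.postgres.hooks.postgres instead.
python
Code
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def fetch_users():
    # Reuse the Airflow connection instead of hard-coding credentials
    hook = PostgresHook(postgres_conn_id='postgres_default')
    for row in hook.get_records("SELECT * FROM users;"):
        print(row)

fetch_users_task = PythonOperator(
    task_id='fetch_users',
    python_callable=fetch_users,
    dag=dag
)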
NiFi also provides processors for working with databases:
1. ExecuteSQL
● Function: Runs a SQL SELECT query against a database and writes the result set to flow files.
2. PutSQL
● Function: Executes the INSERT/UPDATE/DELETE statements carried by incoming flow files.
3. QueryDatabaseTable
● Function: Incrementally fetches new rows from a table, tracking a maximum-value column between runs.
4. GenerateTableFetch
● Function: Generates paged SQL queries for a table so that downstream processors can fetch it in chunks.
EXPERIMENT - 4
a. Inserting and Extracting Relational Database Data in Python
Extracting Data from PostgreSQL
python
Code
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)
cursor = conn.cursor()

# Run a query and fetch all rows
cursor.execute("SELECT * FROM users;")
rows = cursor.fetchall()
for row in rows:
    print(row)

# Close connection
cursor.close()
conn.close()
Inserting Data into PostgreSQL
python
Code
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="your_database",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)
cursor = conn.cursor()

# Insert a row and commit the transaction
cursor.execute("""
    INSERT INTO users (name, age) VALUES (%s, %s)
""", ("John", 30))
conn.commit()

# Close connection
cursor.close()
conn.close()
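Query results are often easier to explore as a pandas DataFrame. A minimal sketch using pandas.read_sql with a SQLAlchemy engine (assuming pandas and sqlalchemy are installed; the credentials are the same placeholders as above):
python
Code
import pandas as pd
from sqlalchemy import create_engine

# Read the users table straight into a DataFrame
engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost:5432/your_database")
df = pd.read_sql("SELECT * FROM users;", engine)
print(df.head())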
b. Inserting and Extracting NoSQL Database Data in Python
For NoSQL databases such as MongoDB, you can use the pymongo library to interact with the database.
Extracting Data from MongoDB
python
Code
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Extract data
documents = collection.find()
for doc in documents:
    print(doc)
Inserting Data into MongoDB
python
Code
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client['your_database']
collection = db['your_collection']

# Insert a single document
collection.insert_one({
    "age": 30,
    "email": "[email protected]"
})
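Once documents are inserted, they can be filtered with a query document. A short sketch using the same placeholder database and collection names:
python
Code
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client['your_database']['your_collection']

# Find documents where age is greater than 25
for doc in collection.find({"age": {"$gt": 25}}):
    print(doc)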
c. Inserting Database Data in Airflow
python
Code
import psycopg2
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime

def insert_data_to_db():
    # Connect to PostgreSQL
    conn = psycopg2.connect(
        dbname="your_database",
        user="your_user",
        password="your_password",
        host="localhost",
        port="5432"
    )
    cursor = conn.cursor()
    cursor.execute("""
        INSERT INTO users (name, age) VALUES (%s, %s)
    """, ("John", 30))
    conn.commit()
    cursor.close()
    conn.close()

dag = DAG('postgres_insert_dag', start_date=datetime(2024, 1, 1), schedule_interval='@daily', catchup=False)

create_table_task = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_default',
    sql="""
        CREATE TABLE IF NOT EXISTS users (
            name VARCHAR(100),
            age INT
        );
    """,
    autocommit=True,
    dag=dag
)

insert_data_task = PythonOperator(
    task_id='insert_data_to_db',
    python_callable=insert_data_to_db,
    dag=dag
)

# Create the table before inserting data
create_table_task >> insert_data_task
d. Inserting and Extracting Database Data in NiFi
1. Inserting Data into a Relational Database (PostgreSQL)
Steps:
● Processor: PutSQL
● SQL Query: INSERT INTO users (name, age) VALUES (?, ?)
● Database Connection Pooling Service: DBCPConnectionPool
Configuration:
● In the DBCPConnectionPool service, set the Database Connection URL (e.g., jdbc:postgresql://localhost:5432/your_database), the Database Driver Class Name (org.postgresql.Driver), and the database user and password.
2. Extracting Data from a Relational Database (PostgreSQL)
Steps:
● Processor: ExecuteSQL
● SQL Query: SELECT * FROM users;
● Database Connection Pooling Service: DBCPConnectionPool
Configuration:
● Reuse the same DBCPConnectionPool controller service configured for the insert flow.
EXPERIMENT - 5
Exploratory Data Analysis (EDA) in Python
python
Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('your_data.csv')

# 1. Basic Information
print("Dataset Info:")
print(df.info())
# 2. Statistical Summary
print("\nStatistical Summary:")
print(df.describe())
# 3. Missing Values
print("\nPercentage of Missing Values:")
print(df.isnull().mean() * 100)
# 4. Histograms of numerical columns
df.hist(figsize=(10, 8))
plt.suptitle('Data Distribution')
plt.show()
# 5. Correlation Matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# 6. Boxplots to spot outliers
plt.figure(figsize=(10, 8))
sns.boxplot(data=df)
plt.show()
# 7. Pairplot of pairwise relationships
sns.pairplot(df)
plt.show()
1. df.info(): Displays the column names, data types, and non-null counts.
2. df.describe(): Shows the statistical summary for numerical columns (mean, standard deviation, min/max, etc.).
3. df.isnull().mean(): Identifies missing values and shows the
percentage of missing data for each column.
4. df.hist(): Plots histograms for each numerical column.
5. Correlation heatmap: Shows the correlation between numerical
columns in the dataset.
6. Boxplots: Identify potential outliers in the dataset.
7. Pairplot: Provides pairwise plots to examine relationships
between columns.
Data Cleaning in Python
1. Handling Missing Values:
○ Remove rows with missing values using df.dropna(inplace=True), or fill them (for example with the column mean) using df.fillna(df.mean(numeric_only=True), inplace=True).
2. Handling Duplicates:
Removing duplicates:
python
Code
df.drop_duplicates(inplace=True)
3. Handling Data Type Issues:
python
Code
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Convert non-numeric values to NaN
4. Handling Inconsistent Data (e.g., string values with leading/trailing spaces):
○ Strip the whitespace with df['column_name'] = df['column_name'].str.strip().
5. Normalizing/Scaling Data:
Min-Max Scaling:
python
Code
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_to_scale']])
6. Renaming Columns:
python
Code
df.rename(columns={'old_name': 'new_name'}, inplace=True)
7. Replacing Values:
○ Replace specific values with df['column_name'] = df['column_name'].replace('old_value', 'new_value').
Complete Data Cleaning Example:
python
Code
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv('your_data.csv')
# Handle missing values: drop rows with NaNs (or fill them with column means instead)
df.dropna(inplace=True)
df.fillna(df.mean(numeric_only=True), inplace=True)
# Fix data types
df['date_column'] = pd.to_datetime(df['date_column'])
# Normalize a numeric column
scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])
# Remove leading/trailing spaces from string values
df['column_name'] = df['column_name'].str.strip()
# Save the cleaned data
df.to_csv('cleaned_data.csv', index=False)
Automating Data Cleaning with Airflow
1. Create a DAG that loads the dataset.
2. Use PythonOperator to clean data (remove missing values, normalize, etc.).
3. Schedule the DAG to run periodically.
python
Code
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def clean_data():
    # Load data
    df = pd.read_csv('your_data.csv')
    # Handle missing values
    df.dropna(inplace=True)
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Normalize a numeric column
    scaler = MinMaxScaler()
    df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])
    # Remove leading/trailing spaces from string values
    df['column_name'] = df['column_name'].str.strip()
    # Save the cleaned data
    df.to_csv('cleaned_data.csv', index=False)

dag = DAG(
    'data_cleaning_dag',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',  # run the cleaning job periodically
    catchup=False
)

clean_data_task = PythonOperator(
    task_id='clean_data',
    python_callable=clean_data,
    dag=dag
)

clean_data_task