Untitled Document

Uploaded by

Anmol Shubham
Here is the fully detailed and complete version of your development environment setup, merging all components and examples:

---

## Development Environment Setup

### 1. Python and Libraries Setup

#### a. **Python Version**
- **Windows**: Python 3.10.11
- **Ubuntu**: Python 3.10.12
- **Usage**: Python is the base language used for various data processing libraries, machine learning frameworks, and other utilities.

**Example**:

```bash
# Windows
(venv) PS C:\Users\Olive\IdeaProjects\Anmol> python --version
Python 3.10.11

# Ubuntu
$ python3 --version
Python 3.10.12
```
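
The same check can be made from inside a script, which helps when the same code runs on both the Windows and Ubuntu machines. The following is a minimal sketch; the helper name `python_at_least` and the 3.10 minimum are assumptions chosen to match the versions listed above, not part of the original setup.

```python
import sys


def python_at_least(minimum=(3, 10), current=None):
    """Return True if the interpreter version is at least `minimum` (major, minor).

    `current` defaults to the running interpreter; it can be passed explicitly
    to check a version obtained elsewhere.
    """
    current = tuple(sys.version_info[:2]) if current is None else tuple(current)
    return current >= tuple(minimum)


if __name__ == "__main__":
    # Assumed policy for this environment: Python 3.10 or newer.
    found = sys.version.split()[0]
    if python_at_least((3, 10)):
        print("Python version OK:", found)
    else:
        print("Python 3.10+ expected, found:", found)
```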

#### b. **Installed Python Packages**


- **Packages**: FastAvro, PyArrow, NumPy, psutil, PySpark, etc.
- **Usage**: These packages are critical for data analysis, system monitoring, and big data
processing.

**Example**:

```bash
# List installed packages
(venv) PS C:\Users\Olive\IdeaProjects\Anmol> pip list
Package    Version
---------- --------
fastavro   1.9.7
numpy      2.1.1
psutil     6.0.0
pyarrow    17.0.0
pyspark    3.5.2
```
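
The installed versions can also be read programmatically with the standard-library `importlib.metadata` module, for example to log them at the start of a job. This is a sketch only; the function name `installed_versions` is hypothetical, and the package names are the ones shown in the `pip list` output above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            result[name] = None
    return result


if __name__ == "__main__":
    names = ["fastavro", "numpy", "psutil", "pyarrow", "pyspark"]
    for name, ver in installed_versions(names).items():
        print(f"{name:<10} {ver or 'not installed'}")
```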

### 2. Python Libraries and Their Usage

#### a. **PyArrow**
- **Version**: 17.0.0
- **Usage**: In-memory columnar data representation (Apache Arrow), efficient for exchanging data between tools such as Pandas and Spark.

**Example**:

```python
import pyarrow as pa

data = {'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']}


table = pa.Table.from_pydict(data)
print(table)
```

#### b. **Avro**
- **Version**: 1.10.2
- **Usage**: A row-oriented data serialization system widely used with Hadoop. Schemas are defined in JSON, which lets different systems and languages exchange data.

**Example**:

```python
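import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Parse the schema file. Reading does not strictly require it,
# because an Avro data file embeds its writer schema.
with open("example.avsc", "r") as schema_file:
    schema = avro.schema.parse(schema_file.read())

# Iterate over every record stored in the data file
with open("data.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    for record in reader:
        print(record)
    reader.close()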
```

#### c. **JSON**
- **Version**: Built into the Python standard library (`json` module)
- **Usage**: JSON is used for data interchange with APIs and for processing JSON-formatted data.

**Example**:

```python
import json

# Create a dictionary and convert it to JSON format


data = {'name': 'Alice', 'age': 30}
json_data = json.dumps(data)
print(json_data)

# Load JSON back to Python dict


loaded_data = json.loads(json_data)
print(loaded_data)
```

#### d. **PySpark**
- **Version**: 3.5.2
- **Usage**: PySpark is used for big data processing via Apache Spark, facilitating parallel
computation and large-scale analytics.

**Example**:

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .appName("Check Spark SQL Version")
        .master("local[*]")
        .getOrCreate()
    )

    # Print Spark SQL version
    print("Spark SQL Version:", spark.version)  # Output: Spark SQL Version: 3.5.2
    spark.stop()
```

---

### 3. Cloudera Environment Setup

#### a. **Cloudera Version**


- **Version**: Cloudera 5.12.0 (CDH 5.12.0, matching the `-cdh5.12.0` suffix of the component versions below)
- **Usage**: Managing and deploying Hadoop-based data processing systems.

**Example**:
```bash
$ cloudera-manager-server --version
Cloudera Manager Server: 5.12.0
```

#### b. **Hadoop Version**


- **Version**: Hadoop 2.6.0-cdh5.12.0
- **Usage**: Framework for distributed storage and processing of large datasets across multiple nodes.

**Example**:

```bash
$ hadoop version
Hadoop 2.6.0-cdh5.12.0
```
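
Version strings such as `2.6.0-cdh5.12.0` combine the upstream Apache release (`2.6.0`) with the CDH release it ships in (`5.12.0`). The sketch below shows one way to split the two parts so they can be compared numerically; the function name `split_cdh_version` and the regular expression are illustrative assumptions, not part of any Cloudera tooling.

```python
import re

# Optional leading "v" covers strings such as "v2.9.0-cdh5.12.0".
_CDH_VERSION = re.compile(r"^v?(\d+(?:\.\d+)*)-cdh(\d+(?:\.\d+)*)$")


def split_cdh_version(version_string):
    """Split '<upstream>-cdh<release>' into two tuples of ints.

    Returns None when the string does not follow that pattern.
    """
    match = _CDH_VERSION.match(version_string.strip())
    if match is None:
        return None
    upstream, cdh = match.groups()
    return (
        tuple(int(part) for part in upstream.split(".")),
        tuple(int(part) for part in cdh.split(".")),
    )


if __name__ == "__main__":
    print(split_cdh_version("2.6.0-cdh5.12.0"))
```

Returning tuples of integers rather than strings keeps comparisons correct for multi-digit components, for example `(5, 12, 0) > (5, 9, 0)`.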

#### c. **YARN Version**


- **Version**: 2.6.0-cdh5.12.0
- **Usage**: Resource management and job scheduling in Hadoop.

**Example**:

```bash
$ yarn version
Hadoop 2.6.0-cdh5.12.0
```

#### d. **HDFS Version**


- **Version**: 2.6.0-cdh5.12.0
- **Usage**: Distributed file system for scalable data storage.

**Example**:

```bash
$ hdfs version
Hadoop 2.6.0-cdh5.12.0
```

#### e. **Hive Version**


- **Version**: Hive 1.1.0-cdh5.12.0
- **Usage**: Data warehousing system for large-scale data analysis with SQL-like queries.

**Example**:
```bash
$ hive --version
Hive 1.1.0-cdh5.12.0
```

#### f. **Sqoop Version**


- **Version**: Sqoop 1.4.6-cdh5.12.0
- **Usage**: Tool to transfer data between Hadoop and relational databases.

**Example**:

```bash
$ sqoop version
Sqoop 1.4.6-cdh5.12.0
```

#### g. **Spark Version**


- **Version**: Spark 2.3.1
- **Usage**: Large-scale data processing for batch and streaming (near-real-time) analytics.

**Example**:

```bash
$ spark-submit --version
Spark 2.3.1
```

#### h. **Pig Version**


- **Version**: Pig 0.12.0-cdh5.12.0
- **Usage**: High-level data processing language for analyzing large datasets.

**Example**:

```bash
$ pig -version
Pig 0.12.0-cdh5.12.0
```

#### i. **HBase Version**


- **Version**: HBase 1.2.0-cdh5.12.0
- **Usage**: Non-relational, distributed database providing random, real-time read/write access to large datasets.

**Example**:
```bash
$ hbase version
HBase 1.2.0-cdh5.12.0
```

#### j. **Impala Version**


- **Version**: Impala Shell v2.9.0-cdh5.12.0
- **Usage**: SQL query execution engine for real-time, low-latency queries on HDFS data.

**Example**:

```bash
$ impala-shell --version
Impala Shell v2.9.0-cdh5.12.0
```
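
The version commands in this section can be collected in one pass from Python. The following is a minimal sketch under stated assumptions: the helper name `first_version_line` is hypothetical, the command list simply repeats the commands shown above, and the first output line is taken as the version banner, which may not hold for tools that print warnings first.

```python
import shutil
import subprocess
import sys


def first_version_line(command, timeout=30):
    """Run a version command and return its first non-empty output line.

    Returns None if the executable is not on PATH or the call fails.
    """
    if shutil.which(command[0]) is None:
        return None
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools may print their banner to stderr, so both streams are scanned.
    for line in (completed.stdout + "\n" + completed.stderr).splitlines():
        if line.strip():
            return line.strip()
    return None


if __name__ == "__main__":
    commands = [
        [sys.executable, "--version"],  # sanity check: the current interpreter
        ["hadoop", "version"],
        ["hive", "--version"],
        ["sqoop", "version"],
        ["pig", "-version"],
        ["hbase", "version"],
        ["impala-shell", "--version"],
    ]
    for command in commands:
        print(" ".join(command), "->", first_version_line(command) or "not available")
```

Checking `shutil.which` first lets the same script run on a workstation where the Hadoop tools are not installed; those entries are simply reported as not available.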

---

### 4. Additional Cloudera Environment Information

#### a. **AWS CLI Version**


- **Version**: aws-cli/1.16.188
- **Usage**: Managing AWS services from the command line.

**Example**:

```bash
$ aws --version
aws-cli/1.16.188 Python/2.6.6
```

#### b. **Bash Version**


- **Version**: GNU bash, version 4.1.2
- **Usage**: Command-line shell for executing commands and scripts.

**Example**:

```bash
$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
```

#### c. **Linux Distribution Information**


- **Linux**: Ubuntu 22.04.3 LTS
- **Usage**: Operating system details for system management and compatibility checks.

**Example**:

```bash
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
```
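
For compatibility checks in scripts, the `Key: value` lines printed by `lsb_release -a` can be turned into a dictionary. This is an illustrative sketch; the function name `parse_key_value_lines` is hypothetical, and the sample text is the output shown above.

```python
def parse_key_value_lines(text):
    """Parse 'Key: value' lines into a dict, skipping lines without a colon."""
    info = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            info[key.strip()] = value.strip()
    return info


SAMPLE_OUTPUT = """Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy"""

if __name__ == "__main__":
    release_info = parse_key_value_lines(SAMPLE_OUTPUT)
    print(release_info["Distributor ID"], release_info["Release"])
```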

---

This setup lists the essential components, their versions, and example commands for the development environment, giving an overview of the tools and technologies used across the Windows, Ubuntu, and Cloudera platforms.
