
Open Source Tools For Data Engineering - LinkedIn


5/8/24, 2:30 PM (7) Open source tools for Data Engineering | LinkedIn


Open source tools for Data Engineering
Midhun Pottammal
Data Engineer and Full Stack Expert | Hadoop, Spark, Kafka, Python, and NoSQL (Hive, Hbas…

February 14, 2024


Data Integration

1. Apache NiFi: A powerful and easy-to-use tool for moving data between systems.

2. Airbyte: An open-source data integration platform that helps you replicate your data in your warehouses, lakes, and databases.

3. Meltano: An open-source data integration tool that simplifies the process of extracting, loading, and transforming data.

4. Apache InLong: A platform for real-time data ingestion and complex event processing.

5. Apache SeaTunnel: A data transfer tool for efficiently moving large volumes of data.

Storage

1. HDFS: The Hadoop Distributed File System, designed for storing large files across multiple machines.

2. Apache Ozone: A scalable, redundant, and distributed object store for Hadoop.

3. Ceph: A distributed object, block, and file storage platform.

https://www.linkedin.com/pulse/open-source-tools-data-engineering-midhun-pottammal-cxvtf/ 1/5

4. MinIO: A high-performance, distributed object storage server.

Data Lake Platform

1. Apache Hudi: A data lake solution for managing large analytical datasets.

2. Apache Iceberg: A table format for storing huge, slow-moving tabular data.

3. Delta Lake: An open-source storage layer that brings ACID transactions to Apache Spark.

4. Apache Paimon: A data lake platform for managing and analyzing data at scale.

Event Processing

1. Kafka: A distributed event streaming platform capable of handling trillions of events a day.

2. Redpanda: A Kafka-compatible event streaming platform with a focus on performance and scalability.

3. Pulsar: A cloud-native, distributed messaging and streaming platform.

Data Processing & Computation

1. Apache Spark: An open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

2. Apache Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

3. Vaex: A Python library for lazy, out-of-core DataFrames.

4. Ray: A fast and simple framework for building and running distributed applications.

5. Dask: A flexible parallel computing library for analytic computing.

6. Polars: A blazingly fast DataFrame library implemented in Rust and using Apache Arrow.
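
Several of the libraries above (Vaex, Dask, Polars) offer lazy or chunked evaluation so that a dataset does not have to fit in memory at once. As a rough illustration of that idea only, the sketch below uses the Python standard library, not any of those libraries' APIs, to stream a CSV in fixed-size chunks and combine per-chunk partial sums into a mean. The function names `read_chunks` and `column_mean` and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def read_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict]]:
    """Lazily yield lists of at most chunk_size rows; nothing is read until iterated."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def column_mean(handle: TextIO, column: str, chunk_size: int = 2) -> float:
    """Compute a mean from per-chunk partial sums, holding one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in read_chunks(handle, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Hypothetical in-memory sample standing in for a file too large to load at once.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_amount = column_mean(sample, "amount")
print(mean_amount)
```

The real libraries add query planning, parallelism, and columnar storage on top of this basic pattern; the sketch only shows why a partial-aggregate design keeps memory use bounded.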

Database


OLTP
SQL: RDBMS (MySQL, PostgreSQL), In-Memory (Apache Ignite)
NoSQL: KV (Aerospike), Document (MongoDB), Graph (Neo4j), Multi-model (ArangoDB)

HTAP
NewSQL: StoneDB, TiDB

OLAP
Offline: Columnar (Databend), Time Series (TimescaleDB)
Real-time: Real-time OLAP (Druid, Pinot, ClickHouse, StarRocks), Search Engine, Streaming Database (Materialize, RisingWave)

Visualization

1. Superset: A modern, enterprise-ready business intelligence web application.

2. Rath

3. Redash: A visualization and dashboarding tool.

4. Metabase: An easy way to generate charts and dashboards, ask simple ad hoc queries without using SQL, and see detailed information about rows in your database.

Data Infrastructure

Kubernetes: An open-source container orchestration platform.

Ambari: A software project designed to enable system administrators to provision, manage, and monitor a Hadoop cluster.

Workflow Management & DataOps

1. Airflow: A platform to programmatically author, schedule, and monitor workflows.

2. Dagster: A data orchestrator for machine learning, analytics, and ETL.

3. Kestra: A workflow orchestrator for data pipeline management.

4. Temporal: An open-source, stateful microservices orchestration platform.

5. Mage: A workflow engine for orchestrating data pipelines.

6. Windmill: A platform for building and running data pipelines.


7. DolphinScheduler: A distributed, easily extensible visual DAG workflow scheduling system that handles complex dependencies in data processing and works out of the box.
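
DAG-based schedulers such as Airflow and DolphinScheduler must turn task dependencies into a valid run order before executing anything. A minimal sketch of that ordering step follows, using Python's standard-library `graphlib` rather than any of these tools' own APIs; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raise ValueError if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        # CycleError carries the offending cycle as its second argument.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


order = execution_order(pipeline)
print(order)
```

Real orchestrators layer scheduling, retries, and distributed execution on top of this; the sketch covers only dependency resolution.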

Monitoring

Prometheus + Mimir and Grafana + Loki

EFK (Elasticsearch, Fluentd, Kibana)

Metadata Management

1. DataHub: An open-source metadata platform for the modern data stack.

2. Amundsen: A data discovery and metadata platform.

3. Marquez: An open-source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata.
