Awesome Data Engineering - Complete Guide to Resources, Tools & Learning Paths

Awesome Data Engineering

A comprehensive, curated list of awesome data engineering resources, tools, frameworks, learning materials, and communities to help you master the art and science of data engineering.

Inspired by the awesome-data-engineering list and continuously updated with the latest industry trends.

Introduction

Data engineering has become one of the most critical disciplines in modern technology. As organizations generate ever-growing volumes of data, the need for skilled data engineers who can build robust, scalable pipelines and infrastructure has never been greater.

Since Python is the primary language for data engineering, it's essential to write clean, maintainable code. Learn Python best practices in our Zen of Python guide.

This awesome data engineering guide provides a comprehensive collection of:

  • Open-source tools and frameworks for building data pipelines
  • Databases and storage systems for different use cases
  • Learning resources including books, courses, and tutorials
  • Communities and conferences to network and stay current
  • Best practices and design patterns from industry leaders
  • Career guidance for aspiring and experienced data engineers

Whether you're just starting your data engineering journey or looking to level up your skills, this awesome data engineering resource will serve as your comprehensive roadmap.


Open-Source Data Engineering Tools

Data Processing Frameworks

Apache Spark

Website: spark.apache.org | License: Apache 2.0

The most popular unified analytics engine for large-scale data processing. Spark excels at batch processing, streaming, machine learning, and graph processing.

Key Features:

  • In-memory processing (10-100x faster than MapReduce)
  • APIs for Java, Scala, Python (PySpark), and R
  • Structured Streaming for real-time processing
  • MLlib for machine learning at scale

Why It's Awesome:

  • Industry standard with massive adoption
  • Excellent documentation and community support
  • Integrates with virtually every data source
  • Runs on Kubernetes, YARN, Mesos, or standalone

Learn More: Apache Spark Guide


Apache Flink

Website: flink.apache.org | License: Apache 2.0

A true stream processing framework designed for stateful computations over unbounded data streams with exactly-once semantics.

Key Features:

  • True streaming (not micro-batching)
  • Event time processing with watermarks
  • State management and fault tolerance
  • Sub-second latency

Why It's Awesome:

  • Best-in-class for real-time processing
  • Growing rapidly in popularity
  • Used by Alibaba, Uber, and Netflix

Apache Beam

Website: beam.apache.org | License: Apache 2.0

A unified programming model for batch and streaming data processing. Write once, run anywhere (Spark, Flink, Dataflow, Samza).

Key Features:

  • Portable pipelines across multiple runners
  • Unified batch and streaming API
  • Rich set of I/O connectors

Why It's Awesome:

  • Avoid vendor lock-in
  • Google Cloud Dataflow is built on Beam
  • Clean, expressive API

Workflow Orchestration

Apache Airflow

Website: airflow.apache.org | License: Apache 2.0

The most popular workflow orchestration platform for data engineering. Define, schedule, and monitor data pipelines as code (Python).

Key Features:

  • Python-based DAG (Directed Acyclic Graph) definition
  • Rich UI for monitoring and debugging
  • Extensive ecosystem of operators and integrations
  • Dynamic pipeline generation

Why It's Awesome:

  • Industry standard for orchestration
  • Managed services: AWS MWAA, Google Cloud Composer, Astronomer
  • 2000+ operators for common tasks
  • Active community with regular releases

Learn More: Airflow Workflows

Example DAG:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Placeholder callables -- replace with real extract/transform/load logic
def extract_data():
    ...

def transform_data():
    ...

def load_to_warehouse():
    ...

with DAG('etl_pipeline', start_date=datetime(2025, 1, 1), schedule='@daily') as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_to_warehouse)

    extract >> transform >> load

Prefect

Website: prefect.io | License: Apache 2.0

A modern alternative to Airflow with a Pythonic API and focus on "negative engineering" (handling failures gracefully).

Why It's Awesome:

  • Clean, modern Python API
  • Dynamic workflows
  • Better local development experience
  • Hybrid execution model

Dagster

Website: dagster.io | License: Apache 2.0

A data orchestrator designed for developing and maintaining data assets like tables, datasets, ML models, and reports.

Why It's Awesome:

  • Asset-centric approach
  • Type system for data
  • Built-in data catalog
  • Software-defined assets

Data Integration & ETL/ELT

Airbyte

Website: airbyte.com | License: MIT

The open-source alternative to Fivetran. 350+ connectors for syncing data from sources to destinations.

Key Features:

  • Pre-built connectors for popular sources
  • Custom connector development (Python, Java)
  • Incremental sync support
  • Open-source and self-hostable

Why It's Awesome:

  • Fastest-growing open-source ELT tool
  • Active community building connectors
  • Generous free tier for cloud version
  • Full control over your data

Use Cases:

  • Replicating SaaS data to warehouse
  • Building custom connectors
  • Budget-conscious projects

dbt (Data Build Tool)

Website: getdbt.com | License: Apache 2.0

The analytics engineering tool that has revolutionized data transformations. Transform data in your warehouse using SQL with software engineering best practices.

Key Features:

  • SQL-based transformations
  • Built-in testing framework
  • Automatic documentation generation
  • Data lineage visualization
  • Incremental models

Why It's Awesome:

  • Brings software engineering to analytics
  • Works with all major warehouses
  • Huge ecosystem of packages
  • Strong community and dbt Slack

Example dbt Model:

-- models/staging/stg_customers.sql
{{ config(materialized='view') }}

with source as (
    select * from {{ source('raw', 'customers') }}
),

cleaned as (
    select
        customer_id,
        lower(email) as email,
        first_name,
        last_name,
        created_at
    from source
    where email is not null
)

select * from cleaned

dbt Tests:

# models/staging/schema.yml
models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null

Singer

Website: singer.io | License: AGPL 3.0

An open-source standard for writing scripts that move data (taps and targets).

Why It's Awesome:

  • Simple, composable standard
  • Language-agnostic
  • Growing library of taps and targets

Message Queues & Streaming

Apache Kafka

Website: kafka.apache.org | License: Apache 2.0

The industry-standard distributed streaming platform. High-throughput, fault-tolerant publish-subscribe messaging.

Key Features:

  • Horizontally scalable (millions of messages/sec)
  • Distributed, replicated commit log
  • Kafka Streams for stream processing
  • Kafka Connect for integrations

Why It's Awesome:

  • De facto standard for event streaming
  • LinkedIn processes 7 trillion+ messages/day
  • Strong ecosystem (Confluent, AWS MSK)
  • Excellent for microservices architectures

Learn More: Apache Kafka Guide

Common Architecture:

Producers → Kafka → Stream Processing (Flink/Spark) → Consumers
                 ↘ Data Lake (S3/HDFS)
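
Kafka's core abstraction is a partitioned, append-only commit log: producers append records, and each consumer group tracks its own offset into the log, so independent consumers read the same stream at their own pace. A toy in-memory sketch of that model (illustrative only — this is not the Kafka client API, and `ToyLog` is a made-up class):

```python
from collections import defaultdict

class ToyLog:
    """Minimal append-only log mimicking one Kafka topic partition."""

    def __init__(self):
        self.records = []                # the commit log itself
        self.offsets = defaultdict(int)  # committed offset per consumer group

    def produce(self, value):
        self.records.append(value)
        return len(self.records) - 1     # offset assigned to the new record

    def consume(self, group, max_records=10):
        """Read from the group's committed offset, then advance it."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] += len(batch)
        return batch

log = ToyLog()
for event in ["click", "view", "purchase"]:
    log.produce(event)

# Independent consumer groups each read the full stream at their own pace
print(log.consume("analytics"))  # ['click', 'view', 'purchase']
print(log.consume("billing"))    # ['click', 'view', 'purchase']
print(log.consume("analytics"))  # [] -- this group is caught up
```

Because consumption only advances an offset rather than deleting records, the same events can be replayed by new consumers — the property that makes Kafka useful for both messaging and stream reprocessing.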

Apache Pulsar

Website: pulsar.apache.org | License: Apache 2.0

Cloud-native messaging and streaming platform with built-in multi-tenancy, geo-replication, and tiered storage.

Why It's Awesome:

  • Separates compute and storage
  • Native multi-tenancy
  • Auto-scaling
  • Alternative to Kafka for cloud-native architectures

RabbitMQ

Website: rabbitmq.com | License: MPL 2.0

Lightweight, easy-to-deploy message broker supporting multiple messaging protocols.

Why It's Awesome:

  • Simpler than Kafka for smaller workloads
  • Excellent for task queues
  • Battle-tested reliability

Databases & Storage

Relational Databases

PostgreSQL

  • Website: postgresql.org
  • Why It's Awesome: Most advanced open-source RDBMS, excellent for analytics with extensions like Citus and TimescaleDB

MySQL

  • Website: mysql.com
  • Why It's Awesome: Widely adopted, excellent for OLTP, easy to operate

MariaDB

  • Website: mariadb.org
  • Why It's Awesome: MySQL fork with additional features and better performance

NoSQL Databases

Apache Cassandra

  • Website: cassandra.apache.org
  • License: Apache 2.0
  • Why It's Awesome: Highly scalable, distributed NoSQL for write-heavy workloads. Used by Apple, Netflix, Instagram.

MongoDB

  • Website: mongodb.com
  • Why It's Awesome: Document database with flexible schema, excellent developer experience

Redis

  • Website: redis.io
  • Why It's Awesome: In-memory data structure store, perfect for caching and real-time applications

Apache HBase

  • Website: hbase.apache.org
  • License: Apache 2.0
  • Why It's Awesome: Distributed column-family database modeled after Google Bigtable

OLAP & Analytics Databases

ClickHouse

  • Website: clickhouse.com
  • License: Apache 2.0
  • Why It's Awesome: 100-1000x faster than traditional databases for analytics. Used by Cloudflare (11M req/sec), Uber, Spotify.

Apache Druid

  • Website: druid.apache.org
  • License: Apache 2.0
  • Why It's Awesome: Real-time analytics database designed for OLAP queries on event streams

Apache Pinot

  • Website: pinot.apache.org
  • License: Apache 2.0
  • Why It's Awesome: Real-time distributed OLAP datastore optimized for low-latency analytics. Used by LinkedIn, Uber, Microsoft.

Time-Series Databases

InfluxDB

  • Website: influxdata.com
  • Why It's Awesome: Purpose-built for time-series data, excellent for IoT and monitoring

TimescaleDB

  • Website: timescale.com
  • Why It's Awesome: PostgreSQL extension for time-series, combines relational and time-series capabilities

Prometheus

  • Website: prometheus.io
  • Why It's Awesome: Monitoring system with time-series database, standard for Kubernetes monitoring

Object Storage

MinIO

  • Website: min.io
  • License: AGPL 3.0
  • Why It's Awesome: High-performance S3-compatible object storage, self-hostable

Ceph

  • Website: ceph.io
  • License: LGPL 2.1
  • Why It's Awesome: Distributed storage system providing object, block, and file storage

Data Formats & Serialization

Apache Parquet

Website: parquet.apache.org | License: Apache 2.0

Columnar storage format optimized for analytics workloads. Industry standard for data lakes.

Why It's Awesome:

  • 10x compression compared to row-based formats
  • Predicate pushdown for efficient filtering
  • Works with all major processing engines
  • Self-describing schema

When to Use:

  • Data lake storage (S3, HDFS, ADLS)
  • Long-term archival
  • Analytics workloads
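
The columnar advantage is easy to see in miniature: when each field is stored contiguously, a query touching one field reads only that field's values. A pure-Python illustration of the layout difference (this is not Parquet itself, just the storage idea behind it):

```python
# Row-oriented: whole records stored together; any query scans every field
rows = [
    {"id": 1, "price": 9.99, "country": "US"},
    {"id": 2, "price": 4.50, "country": "DE"},
    {"id": 3, "price": 12.00, "country": "US"},
]

# Column-oriented: each field stored contiguously, like a Parquet column chunk
columns = {
    "id": [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
    "country": ["US", "DE", "US"],
}

# Average price over the row layout: touches all fields of every record
avg_row = sum(r["price"] for r in rows) / len(rows)

# Average price over the columnar layout: touches only the "price" column
avg_col = sum(columns["price"]) / len(columns["price"])

assert avg_row == avg_col  # same answer; at scale, far less data scanned
```

Columnar layout also makes per-column compression and predicate pushdown possible, since values of one type and one distribution are stored together.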

Apache Avro

Website: avro.apache.org | License: Apache 2.0

Row-based data serialization system with schema evolution support.

Why It's Awesome:

  • Compact binary format
  • Schema evolution (backward/forward compatibility)
  • Rich data structures
  • Code generation for multiple languages

When to Use:

  • Kafka message serialization
  • Schema registry integration
  • RPC frameworks
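
Avro's schema evolution works through field defaults: a reader using a newer schema can fill in fields that are missing from data written with an older one. A sketch of such a schema (record and field names are illustrative):

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "customer_id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "loyalty_tier", "type": "string", "default": "none"}
  ]
}
```

Because `loyalty_tier` declares a default, records serialized before that field existed can still be deserialized with this schema — the backward-compatibility guarantee that makes Avro a good fit for Kafka topics with long-lived data.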

Apache ORC

Website: orc.apache.org | License: Apache 2.0

Columnar storage format with high compression and efficient query performance.

Why It's Awesome:

  • Better compression than Parquet in some cases
  • ACID transaction support
  • Optimized for Hive and Presto

Apache Arrow

Website: arrow.apache.org | License: Apache 2.0

Cross-language development platform for in-memory data with zero-copy data sharing.

Why It's Awesome:

  • Eliminates serialization overhead
  • Fast data interchange between systems
  • Used by Pandas 2.0, Polars, DuckDB

Table Formats (Data Lakehouse)

Apache Iceberg

Website: iceberg.apache.org | License: Apache 2.0

Open table format for huge analytic datasets, bringing ACID transactions to data lakes.

Why It's Awesome:

  • ACID guarantees on S3/ADLS/GCS
  • Time travel and schema evolution
  • Hidden partitioning
  • Growing rapidly in adoption (Netflix, Apple, Adobe)

Delta Lake

Website: delta.io | License: Apache 2.0

Storage layer providing ACID transactions on top of data lakes, from Databricks.

Why It's Awesome:

  • Tight integration with Spark
  • Time travel and data versioning
  • Unified batch and streaming
  • Most mature lakehouse format

Apache Hudi

Website: hudi.apache.org | License: Apache 2.0

Streaming data lake platform optimized for CDC and upserts, from Uber.

Why It's Awesome:

  • Efficient upserts and deletes
  • Incremental data processing
  • Record-level concurrency control

Data Quality & Observability

Great Expectations

Website: greatexpectations.io | License: Apache 2.0

Framework for validating, documenting, and profiling data to maintain quality in data pipelines.

Key Features:

  • Expectation suites (data contracts)
  • Automated data profiling
  • Integration with major tools

Example:

import great_expectations as ge

# Classic Pandas-backed API shown here; newer GE versions organize
# these expectations into suites run by a validator
df = ge.read_csv("sales.csv")
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("price", min_value=0, max_value=10000)
df.expect_column_mean_to_be_between("quantity", min_value=1, max_value=100)

Soda Core

Website: soda.io | License: Apache 2.0

Data quality testing tool with a simple YAML syntax for defining checks.

Why It's Awesome:

  • Easy to learn YAML syntax
  • CLI and programmatic interfaces
  • Integrates with orchestrators
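
A taste of the SodaCL YAML syntax — a hedged sketch, with the table and column names purely illustrative:

```yaml
# checks.yml -- SodaCL sketch; dataset and column names are illustrative
checks for customers:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(customer_id) = 0
```

Each line is a check that passes or fails on a scan, which makes the file easy to run from CI or from an orchestrator task.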

dbt Tests

Part of dbt, provides built-in testing framework for data quality.

Example:

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

Infrastructure as Code

Terraform

Website: terraform.io | License: MPL 2.0

The standard for infrastructure as code, supporting all major cloud providers.

Example (Snowflake Database):

resource "snowflake_database" "analytics" {
  name    = "ANALYTICS_DB"
  comment = "Production analytics database"
}

resource "snowflake_schema" "staging" {
  database = snowflake_database.analytics.name
  name     = "STAGING"
}

Pulumi

Website: pulumi.com

Infrastructure as code using real programming languages (Python, TypeScript, Go).

Why It's Awesome:

  • Use familiar programming languages
  • Better testing and abstraction
  • Unified state management

Container & Orchestration

Docker

Website: docker.com

The standard for containerization, packaging applications with their dependencies.

Why It's Awesome:

  • Reproducible environments
  • Fast deployment
  • Ecosystem of pre-built images

Learn More: Apache Spark with Docker


Kubernetes

Website: kubernetes.io

Container orchestration platform for automating deployment, scaling, and management.

Why It's Awesome:

  • Run Spark on Kubernetes
  • Auto-scaling for data workloads
  • Industry standard for container orchestration

Learning Resources

Must-Read Books

1. Designing Data-Intensive Applications

Author: Martin Kleppmann | Level: Intermediate to Advanced

The bible of data engineering. Covers distributed systems, databases, data models, replication, partitioning, and more.

Why It's Awesome:

  • Timeless principles, not just tools
  • Excellent explanations of complex topics
  • Must-read for senior data engineers

Topics Covered:

  • Data models and query languages
  • Replication and consistency
  • Partitioning and sharding
  • Batch and stream processing
  • The future of data systems

2. Fundamentals of Data Engineering

Authors: Joe Reis, Matt Housley | Level: Beginner to Intermediate

Modern, comprehensive introduction to data engineering covering the entire lifecycle.

Why It's Awesome:

  • Up-to-date with 2025 practices
  • Practical, industry-focused
  • Great for career switchers

3. The Data Warehouse Toolkit

Author: Ralph Kimball | Level: Intermediate

The classic guide to dimensional modeling and data warehousing.

Why It's Awesome:

  • Foundational knowledge for analytics engineering
  • Practical design patterns
  • Still relevant in cloud era

4. Stream Processing with Apache Kafka and Spark

Authors: Various | Level: Intermediate

Deep dive into building real-time data pipelines.


5. Data Engineering with Python

Author: Paul Crickard | Level: Beginner

Hands-on guide to data engineering using Python ecosystem.


Online Courses & Tutorials

Coursera

  • Data Engineering, Big Data, and Machine Learning on GCP (Google Cloud)
  • Data Engineering Foundations (IBM)

Udemy

  • The Complete Hands-On Course to Master Apache Airflow
  • Apache Spark with Scala - Hands On with Big Data

Udacity

  • Data Engineering Nanodegree - Comprehensive program covering end-to-end data engineering

DataCamp

  • Data Engineer with Python Career Track
  • Data Engineer with SQL Career Track

A Cloud Guru / Pluralsight

  • Cloud-specific data engineering courses (AWS, GCP, Azure)

YouTube Channels

Seattle Data Guy

Practical advice, career guidance, and tool tutorials.

DataEngineering.TV

Deep dives into data engineering topics and tool reviews.

Advancing Analytics

Focus on modern data stack, dbt, and analytics engineering.


Blogs & Websites

Official Blogs:

Independent Blogs:


Podcasts

Data Engineering Podcast

Host: Tobias Macey. Interviews with data engineering practitioners and tool creators.

The Data Engineering Show

Hosts: Eldad Farkash and Boaz Farkash. Discussions on the modern data stack and industry trends.


Communities & Networking

Online Communities

Reddit

  • r/dataengineering - 200K+ members discussing tools, career advice, and best practices
  • r/datascience - Adjacent community with data pipeline discussions

Slack Communities

  • dbt Community Slack - 65K+ members (largest data community)
  • Locally Optimistic - Analytics and data professionals
  • Data Talks Club - Data engineering and ML discussions

Discord

  • Data Engineering Discord - Growing community for real-time discussions

Stack Overflow & Forums

  • Data Engineering Stack Exchange
  • Stack Overflow (tags: apache-spark, airflow, kafka, etc.)

Conferences & Events

Data + AI Summit

Organizer: Databricks. The premier conference for data, analytics, and AI, featuring deep dives into Spark, Delta Lake, and MLflow.

Kafka Summit

Organizer: Confluent. Focused on Apache Kafka and event streaming architectures.

Data Council

Community-driven conferences on data engineering, science, and analytics.

Airflow Summit

Dedicated to Apache Airflow users and contributors.

Local Meetups

  • Data Engineering Meetups (check meetup.com)
  • Apache Spark User Groups
  • Cloud-specific data groups (AWS, GCP, Azure)

Best Practices & Design Patterns

Data Pipeline Patterns

1. Lambda Architecture

Data Sources → Batch Layer (Spark) → Serving Layer → Queries
            ↘ Speed Layer (Flink) ↗
  • Combines batch and stream processing
  • Handles both real-time and historical data
  • Complexity: High

2. Kappa Architecture

Data Sources → Stream Processing (Flink/Kafka Streams) → Data Store → Queries
  • Everything as a stream
  • Simpler than Lambda
  • Modern preference

3. ELT vs ETL

  • ELT (Modern): Extract → Load → Transform (dbt in warehouse)
  • ETL (Traditional): Extract → Transform → Load (Spark/Airflow)
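
The difference is mostly about where the transform runs: in ELT, raw data lands in the warehouse first and SQL does the cleanup there. A minimal ELT sketch using the standard library's sqlite3 as a stand-in warehouse (table names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract: pull raw records from a source system (hard-coded here)
raw = [(1, "ALICE@X.COM"), (2, None), (3, "carol@y.com")]

# Load: land the data as-is in a raw table
conn.execute("create table raw_customers (id integer, email text)")
conn.executemany("insert into raw_customers values (?, ?)", raw)

# Transform: clean inside the warehouse with SQL -- the dbt step in ELT
conn.execute("""
    create table stg_customers as
    select id, lower(email) as email
    from raw_customers
    where email is not null
""")

print(conn.execute("select * from stg_customers order by id").fetchall())
# [(1, 'alice@x.com'), (3, 'carol@y.com')]
```

In traditional ETL, the cleanup in the final step would instead happen in an external engine (Spark, a Python task in Airflow) before anything is loaded.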

Data Quality Principles

  1. Data Contracts: Define expectations between producers and consumers
  2. Testing: Unit tests, integration tests, data quality tests
  3. Monitoring: Track pipeline health, data freshness, volume anomalies
  4. Documentation: Self-documenting pipelines (dbt docs, data catalogs)
  5. Validation: Validate at ingestion, transformation, and consumption
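
Validation at ingestion (principle 5) can be as simple as rejecting bad records before they enter the pipeline. A minimal sketch — the field names and thresholds are illustrative:

```python
def validate_record(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    if record.get("customer_id") is None:
        errors.append("customer_id is null")
    price = record.get("price")
    if price is None or not (0 <= price <= 10_000):
        errors.append(f"price out of range: {price}")
    return errors

records = [
    {"customer_id": 1, "price": 19.99},
    {"customer_id": None, "price": 5.00},
    {"customer_id": 3, "price": -2.00},
]

valid = [r for r in records if not validate_record(r)]
rejected = [r for r in records if validate_record(r)]
print(len(valid), len(rejected))  # 1 2
```

In practice the rejected records would be routed to a dead-letter table with their violation messages, so producers can be alerted rather than silently losing data.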

DataOps Best Practices

  1. Version Control: All code in Git (pipelines, transformations, infrastructure)
  2. CI/CD: Automated testing and deployment
  3. Environments: Dev, staging, production separation
  4. Monitoring & Alerting: Proactive issue detection
  5. Incident Response: Runbooks and on-call procedures

Career Development

Data Engineering Roadmap

Foundational Skills (0-6 months):

  1. SQL (advanced queries, window functions, CTEs)
  2. Python (Pandas, data manipulation)
  3. Linux/Bash
  4. Git version control
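
Window functions from skill 1 are worth practicing early, and no warehouse is needed: sqlite3 in the Python standard library supports them (SQLite 3.25+). A small runnable example with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer text, amount real)")
conn.executemany(
    "insert into orders values (?, ?)",
    [("alice", 10), ("alice", 30), ("bob", 20)],
)

# Running total per customer -- a classic window-function pattern
rows = conn.execute("""
    select customer, amount,
           sum(amount) over (partition by customer order by amount) as running
    from orders
    order by customer, amount
""").fetchall()

for row in rows:
    print(row)
# ('alice', 10.0, 10.0)
# ('alice', 30.0, 40.0)
# ('bob', 20.0, 20.0)
```

The same `over (partition by ... order by ...)` pattern carries over directly to Snowflake, BigQuery, and Postgres interviews.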

Intermediate Skills (6-18 months):

  5. Apache Spark (PySpark or Scala)
  6. Cloud platform (AWS, GCP, or Azure)
  7. Airflow or Prefect
  8. dbt for transformations
  9. Docker basics

Advanced Skills (18+ months):

  10. Kafka and stream processing
  11. Kubernetes for orchestration
  12. Terraform (Infrastructure as Code)
  13. Data modeling (Kimball, Data Vault)
  14. System design and architecture


Certifications

Cloud Certifications:

  • AWS Certified Data Analytics - Specialty
  • Google Professional Data Engineer
  • Microsoft Certified: Azure Data Engineer Associate

Tool-Specific:

  • Databricks Certified Data Engineer
  • Confluent Certified Developer for Apache Kafka
  • Snowflake SnowPro Core Certification

Salary Expectations (2025 US Market)

Level                           Experience   Salary Range
Junior Data Engineer            0-2 years    $80K - $120K
Mid-Level Data Engineer         2-5 years    $120K - $160K
Senior Data Engineer            5-8 years    $160K - $220K
Staff/Principal Data Engineer   8+ years     $220K - $350K+

Salaries vary significantly by location, company size, and industry. FAANG/tech companies pay 20-50% above these ranges.


Data Engineering Trends for 2025

  1. Data Lakehouse Dominance - Iceberg and Delta Lake replacing traditional warehouses
  2. Real-Time Everything - Shift from batch to streaming
  3. AI/ML Integration - LLMs built into data platforms
  4. Data Contracts - Formal agreements between teams
  5. FinOps for Data - Cost optimization becoming critical
  6. Data Mesh - Decentralized data ownership
  7. Open Source First - Avoiding vendor lock-in
  8. Serverless Data - Fully managed, auto-scaling services
  9. Python Ascendant - Python dominating over Scala/Java
  10. Reverse ETL - Syncing warehouse data back to operational systems

Awesome Data Engineering GitHub Lists

Curated Awesome Lists


How to Use This Awesome Data Engineering Guide

For Beginners

  1. Start with SQL and Python fundamentals
  2. Learn Linux/Bash basics
  3. Complete a beginner course (DataCamp, Udacity)
  4. Build small projects with Airflow + dbt
  5. Join r/dataengineering and dbt Slack

For Intermediate Engineers

  1. Master Apache Spark deeply
  2. Choose a cloud platform (AWS/GCP/Azure)
  3. Learn Kafka for streaming
  4. Contribute to open-source projects
  5. Attend conferences (Data + AI Summit)

For Advanced Engineers

  1. Design distributed systems
  2. Specialize (streaming, ML, infrastructure)
  3. Contribute to major open-source projects
  4. Speak at conferences
  5. Mentor junior engineers

Contributing to Awesome Data Engineering

This awesome data engineering list is continuously updated. If you have suggestions for:

  • Tools or frameworks to add
  • Learning resources
  • Best practices
  • Corrections or updates

Please contribute by submitting issues or pull requests on GitHub.


Conclusion

The awesome data engineering ecosystem is vast and constantly evolving. This comprehensive guide provides a solid foundation for your data engineering journey, whether you're:

  • Starting out and need a learning roadmap
  • Mid-career and want to explore new tools
  • Senior engineer staying current with trends
  • Hiring manager understanding the landscape

The future of data engineering is bright, with exciting developments in:

  • Lakehouse architectures
  • Real-time processing
  • Data quality and observability
  • AI/ML integration

Remember: The best data engineers are T-shaped - deep expertise in fundamentals (SQL, Python, distributed systems) with broad knowledge across the ecosystem. Focus on principles over tools, as tools change but principles endure.

Start with the basics, build projects, join communities, and never stop learning. The awesome data engineering community welcomes you!



Last updated: October 2025

Credits: Inspired by awesome-data-engineering and the broader awesome list movement.

Found this awesome data engineering guide helpful? Share it with your network and contribute your own suggestions!
