pg_lakehouse: Turning PostgreSQL into a Data Lakehouse
Database extensions let developers run PostgreSQL in many forms: Citus for distributed databases, Hydra for columnar analytics, and pgvector for vector similarity search. But what about turning PostgreSQL into a data lakehouse? The pg_lakehouse extension from ParadeDB makes this possible.
What is pg_lakehouse?
pg_lakehouse is an extension that lets PostgreSQL play the role DuckDB usually fills: an analytical query engine that reads data directly from object stores. It uses PostgreSQL's foreign data wrapper (FDW) API to connect to object stores and table formats, so data scientists and analysts can run analytical queries against that data without leaving PostgreSQL.
How does it work?
pg_lakehouse supports object stores including Amazon S3, Azure Blob Storage, and Google Cloud Storage, and table formats such as Parquet, CSV, and Delta Lake. To query a dataset, you create a foreign table that points at the data's path in the object store, then query that table with ordinary SQL from within PostgreSQL.
Example Use Case
As an example, consider a dataset of about 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket in us-east-1 provided by ParadeDB. With pg_lakehouse, you can connect to the bucket and query the data with SQL.
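A minimal sketch of what this could look like follows. It is modeled on the quickstart from an early pg_lakehouse release, so treat the specifics as assumptions: the handler and validator names (`s3_fdw_handler`, `s3_fdw_validator`), the server and table options, the bucket path, and the column types may all differ in the version you install. Check the current ParadeDB documentation before running it.

```sql
-- Sketch only: object names, options, and the S3 path are assumptions
-- based on an early pg_lakehouse release and may have changed since.
CREATE EXTENSION pg_lakehouse;

-- Register the S3 foreign data wrapper provided by the extension.
CREATE FOREIGN DATA WRAPPER s3_wrapper
HANDLER s3_fdw_handler
VALIDATOR s3_fdw_validator;

-- The bucket is public, so no credentials are supplied.
CREATE SERVER s3_server
FOREIGN DATA WRAPPER s3_wrapper
OPTIONS (region 'us-east-1', allow_anonymous 'true');

-- Column names follow the NYC yellow taxi trip schema; types are assumed.
CREATE FOREIGN TABLE trips (
    "VendorID"              INT,
    "tpep_pickup_datetime"  TIMESTAMP,
    "tpep_dropoff_datetime" TIMESTAMP,
    "passenger_count"       BIGINT,
    "trip_distance"         DOUBLE PRECISION,
    "RatecodeID"            DOUBLE PRECISION,
    "store_and_fwd_flag"    TEXT,
    "PULocationID"          REAL,
    "DOLocationID"          REAL,
    "payment_type"          DOUBLE PRECISION,
    "fare_amount"           DOUBLE PRECISION,
    "extra"                 DOUBLE PRECISION,
    "mta_tax"               DOUBLE PRECISION,
    "tip_amount"            DOUBLE PRECISION,
    "tolls_amount"          DOUBLE PRECISION,
    "improvement_surcharge" DOUBLE PRECISION,
    "total_amount"          DOUBLE PRECISION,
    "congestion_surcharge"  DOUBLE PRECISION,
    "Airport_fee"           DOUBLE PRECISION
)
SERVER s3_server
OPTIONS (
    path 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet',
    extension 'parquet'
);

-- Query the remote Parquet file like a regular Postgres table.
SELECT COUNT(*) FROM trips;

-- A typical analytical query: average trip distance by passenger count.
SELECT passenger_count, AVG(trip_distance) AS avg_distance
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count;
```

Once the foreign table exists, the rest is ordinary SQL. For this dataset the count should come back at roughly 3 million rows, and no data has to be copied into PostgreSQL first.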
Figure: Querying Parquet files from S3
Future Development
pg_lakehouse is still under heavy development, with a long roadmap ahead. Upcoming features include write support, which will let developers centralize data lake operations inside Postgres, as well as support for Apache Iceberg tables and wider object store coverage.
Figure: The world of PostgreSQL extensions
In conclusion, pg_lakehouse gives data scientists and analysts a way to run analytical workloads from PostgreSQL. Because it can connect to several object stores and table formats, it is a practical option for anyone who wants to use their PostgreSQL database as a data lakehouse.
Figure: pg_lakehouse architecture