Output Parquet files #32
I've been looking into transitioning to a lakehouse architecture, and I think it may not be practical to make that transition just yet. PostgreSQL + export seems sufficient to get us to a public web UI. Meanwhile, we can design the lakehouse more intentionally alongside @vasilmax's work on schema and definitions. Here are my thoughts:

PostgreSQL as Central Datastore, Parquet as Periodic Snapshots
If this works for everyone, I'll move forward with populating the database and producing some snapshots. If we want to move to Parquet and, say, DuckDB now, I think we can; it will just take more time.
Adding some questions that might help with decision-making. I think either approach can work with enough effort.

Questions
Summary

I feel there are two key points of strain in what you proposed: there's one database and one person running it. This could be understood as a "key person and technology dependency" risk: a bottleneck on implementation that could keep a team of people (and technologies) from acting together toward improving the project. Depending on what the project goals are, it may be important to consider alternatives.

Suggestions

Some suggestions and recommendations; everything here is optional and open to other opinions.

Community
Technical
Thank you! All excellent and valid points.

Community & Governance

Technical Discussion

My proposal for the interim Postgres + Parquet export approach is driven by the immediate need to unblock collaborative development by providing shared data. I see it as a pragmatic first step that lets us learn and iterate toward the more robust DuckDB/lakehouse architecture we both agree on. To address your specific technical questions within this "temporary bridge" context:
Why I'm suggesting this phased approach (Postgres now, DuckDB soon)
In summary, I fully agree that DuckDB is the better long-term direction for ease of local setup, analytical power, and potentially as part of the production lakehouse. My proposal is simply to use our current tools as a temporary bridge to:
This means the 'key person/technology dependency' risks you rightly identified are time-bound and managed, with the understanding that this is a temporary phase, not a destination. Looking forward to more discussion. =)
Jon is going to provide a data-dump ASAP.
As title says