Output Parquet files #32


Open
jring-o opened this issue May 5, 2025 · 4 comments
Assignees: jring-o

@jring-o
Collaborator

jring-o commented May 5, 2025

As title says

@jring-o jring-o self-assigned this May 5, 2025
@d33bs d33bs changed the title Output Parque files Output Parquet files May 7, 2025
@jring-o
Collaborator Author

jring-o commented May 9, 2025

I've been looking into transitioning to a lakehouse architecture, and I'm thinking it might not be practical to go through the effort of making this transition just yet. PostgreSQL + export seems like it will get us to the point where we can have a public web UI. Meanwhile, we can more intentionally design the lakehouse alongside @vasilmax's work on schema and definitions.

Here are my thoughts:

PostgreSQL as Central Datastore, Parquet as Periodic Snapshots

  • How it works:
    • We maintain one central non-experimental PostgreSQL database (persistent, can be run on-demand by the core team for processing). This is the live, evolving source of truth.
    • All non-experimental ingestions, API operations, and recipe executions that write data happen against this central PostgreSQL.
    • Periodically, we export the data from this central PostgreSQL into a set of Parquet files.
    • These Parquet files are then shared for local analysis.
    • This database can be run by a single person, or on a cheap cloud solution. No one other than core contributors will need to access it. Really only one person will need access to it.
    • Contributors run local instances of MOSS for experiments. Once they find an ingestion they'd like to include in the central database, they submit it to the central database queue.
    • Meanwhile, users submit ingestion requests into a queue through a web UI.
    • The person/people managing the central database run the ingestion queue periodically and export updated Parquet snapshots at some interval T (weekly? monthly? twice a month?); a rough sketch of such an export follows this list.
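
To make the snapshot step concrete, here is a minimal sketch of what the periodic export could look like, assuming a Python environment with pandas, SQLAlchemy, and pyarrow. The connection string and table names are placeholders, not the actual MOSS schema:

```python
# Minimal sketch of a periodic snapshot export; connection string and table
# names are illustrative placeholders, not the real MOSS schema.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to the central (non-public) PostgreSQL instance.
engine = create_engine("postgresql://moss@localhost:5432/moss")

# Illustrative table list; a real export would iterate over the actual schema.
TABLES = ["projects", "ingestions", "metrics"]

Path("snapshots").mkdir(exist_ok=True)
for table in TABLES:
    df = pd.read_sql_table(table, engine)          # pull the current contents
    df.to_parquet(f"snapshots/{table}.parquet")    # one Parquet file per table
```

A scheduled job (cron, or a GitHub Action if we want it visible) could run something like this on whatever cadence we settle on.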

If this works for everyone, I'll move forward with getting a nicely populated db formed with some snapshots.

If we want to move to Parquet and, let's say DuckDB now, I think we can, it will just take more time.

@d33bs
Collaborator

d33bs commented May 9, 2025

Adding in some questions that might help with decision making. I think either way can work with enough effort.

Questions

This database can be run by a single person, or on a cheap cloud solution. No one other than core contributors will need to access it. Really only one person will need access to it.

  • Who will run the database, and what happens if they're unwell, have schedule conflicts, or are otherwise unavailable? No one can be available all the time (and I feel we should have enough empathy for one another that we don't place someone in a challenging spot).
  • How will core contributors have enough confidence that they can use the database without interfering with "production"? I.e., how would we know that someone isn't running the exact same ingestion we are, colliding with their work?
  • For the one person who has access to the database: what will their responsibilities be, and how will we navigate decisions that involve evolving MOSS, the schema, or related concerns (how could we ensure good group-based decision making)?

Contributors run local instances of MOSS for experiments. Once they find an ingestion they'd like to include in the central database, they submit it to the central database queue.

  • Who will work on securing the database for external connection? Generally it is inadvisable to open public internet connections to a database like Postgres due to the risks involved (DDoS, security holes, brute-force attacks). Hardening it against these risks involves additional security effort (internal/external network configuration, patching, virtual compute configuration for others, etc.).
  • Running local instances of the MOSS database implies some form of workplan (documentation with complete directions) or infrastructure-as-code (IaC) which someone can run to build and run the same database that runs in "production" (we'd want to be sure the same functionality exists at both ends). This adds code and development complexity which could be important to consider.

Summary

I feel there are two key points of strain in what you proposed: there's one database and one person running the database. I think this could be understood as a "key person and technology dependency" risk, where there's a bottleneck on implementation that could disempower a team of people (and technologies) from acting together towards improving the project. It could be important to consider alternatives depending on what the project goals are.

Suggestions

Some suggestions and recommendations - everything here is optional and open to other opinions too.

Community

  • We add a software governance structure that empowers a team of people to take action on the project (and imbues the project with communal responsibility). I think this can change how we act and what we expect of the project in a way that benefits the goals (including what's discussed in this issue). At a minimum, I think this could be a set of loose expectations:
    • Focus on small chunks of work which are described in issues.
    • Issues have PRs which focus on just one issue at a time (being reasonable about it when there are multiple).
    • PRs require review approval from one other maintainer.
    • PR reviews include thoughtful and kind feedback for the author(s) ("be excellent to one another").
    • Large decisions are captured in "special" issues (we could use labels) or architecture decision records (ADRs) that become a part of the documentation of the project. By "large" I mean something which the community feels is big enough to warrant that much energy in discussion. These could possibly be informed by rejected PRs or similar contention where we need a different format to decide and find agreement.

Technical

  • We shift the existing code to use a DuckDB database, which should be mostly PostgreSQL compliant, thereby enabling people to have a local database without much setup cost (it'd be pinned to the DuckDB version and a list of extensions).
  • DuckDB could be used to create Parquet exports if we wanted, because it exports directly to Arrow and has COPY TO instructions which allow for direct exports (a small sketch follows this list).
  • We could also use DuckDB in production instead of Parquet (perhaps this could mean many DuckDB databases or leveraging MotherDuck).
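
For example, the COPY TO path in the DuckDB Python API is roughly this; the database file, table name, and output paths are hypothetical:

```python
# Minimal sketch of a DuckDB-based Parquet export; database file, table name,
# and output paths are hypothetical.
import duckdb

con = duckdb.connect("moss.duckdb")  # local DuckDB database file

# COPY ... TO writes a table (or any query result) directly to Parquet.
con.execute("COPY ingestions TO 'snapshots/ingestions.parquet' (FORMAT PARQUET)")

# The same mechanism works for filtered or ad-hoc query results.
con.execute("""
    COPY (SELECT * FROM ingestions WHERE status = 'accepted')
    TO 'snapshots/accepted_ingestions.parquet' (FORMAT PARQUET)
""")
```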

@jring-o
Collaborator Author

jring-o commented May 9, 2025

Thank you! All excellent and valid points.

Community & Governance
Absolutely on board with all your community suggestions. Looking forward to the community transition as discussed on Wednesday's call! This framework will be essential for guiding decisions, including the technical ones we're discussing.

Technical Discussion

My proposal for the interim Postgres + Parquet export approach is driven by the immediate need to unblock collaborative development by providing shared data. I see it as a pragmatic first step that allows us to learn and iterate towards the more robust DuckDB/lakehouse architecture we both agree on.

To address your specific technical questions within this "temporary bridge" context:

  1. Who will run the database and what happens if they're unwell, etc.?

    • You're right to flag the "one person" as a risk. My intention wasn't to saddle one individual indefinitely. For this temporary phase, this could be managed by a small, designated subset of core maintainers (e.g., 2-3 people including myself, if others are willing) who have access. The operational load is intended to be very low – periodic, scripted tasks (e.g., monthly or bi-monthly), not constant firefighting. This distributes the immediate, minimal load and provides backup.
    • The critical point is that this central Postgres instance is not a production query engine for the community. Its sole purpose is to be the 'write-master' for ingesting new data and the source for Parquet snapshot generation. The Parquet files are the primary artifacts for community consumption and local analysis.
  2. How will core contributors have enough confidence that they can use the database without interfering with "production"?

    • Since direct interaction with the central Postgres is limited to a few people running the ingestion queue, the risk of collision is focused there. The ingestion queue itself would be visible (e.g., a shared document, a GitHub project board, or a simple table in a shared space). We can establish a simple protocol (e.g., assign items, or PR-based additions to the queue) to prevent duplicate work on the central instance.
    • Contributors doing experiments would use local MOSS instances, seeded with the Parquet snapshots or not, ensuring they aren't touching the central ingestion pipeline.
  3. Responsibilities for the one person (or small group) with access & decision making

    • Responsibilities for this interim phase: Primarily, running the agreed-upon ingestion queue and executing the Parquet export script periodically.
    • Decision Making: Crucially, decisions on evolving MOSS, the schema, or what goes into the central database would not be unilateral. This is where the community governance you proposed becomes vital, even during this interim Postgres phase. The "people running the central DB" are executors of community-agreed changes, not gatekeepers of the schema.
  4. Who will work on securing the database for external connection?

    • Agreed. The central Postgres would not be exposed publicly. Access would be restricted. The Parquet files would be distributed via a secure, read-only mechanism (like the server @gpavlov2016 set up). The security effort is thus minimized for this temporary setup.
  5. Running local instances of MOSS

    • You're right, clear documentation or simple setup scripts for local MOSS instances are needed, regardless of the backend.
    • We have had someone follow the current documentation and successfully set up MOSS and run some ingestions locally, so the docs aren't terrible as is. But there were some issues, and they can definitely be made better.
    • For now, our key goal is to give contributors access to some MOSS structure data (Parquet snapshots) for analysis and model development; a short sketch of loading those snapshots locally follows this list. Setting up a local Postgres would be for those specifically wanting to test new ingestion processes or to experiment with schema changes before proposing them for the central queue.
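
As a rough illustration of that workflow, contributors could query the shared snapshots directly with DuckDB, with no database setup at all; the file name here is a placeholder for whatever the real exports end up being called:

```python
# Minimal sketch of local analysis against a shared Parquet snapshot;
# the file name is a placeholder for the actual exported snapshots.
import duckdb

# DuckDB queries Parquet files in place, so no local database setup is needed.
preview = duckdb.sql("""
    SELECT *
    FROM read_parquet('snapshots/ingestions.parquet')
    LIMIT 10
""").df()  # materialize as a pandas DataFrame for further analysis

print(preview)
```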

Why I'm suggesting this phased approach (Postgres now, DuckDB soon)

  1. We have an existing, working Postgres end-to-end system for ingestion. Leveraging this now gets common data (via Parquet) into everyone's hands the fastest, unblocking immediate collaborative development and experimentation.
  2. This interim phase allows the whole team to interact with a baseline dataset and the current system. This hands-on experience will provide invaluable insights for designing the schema, identifying pain points in current processes, and understanding data characteristics. This knowledge will make the eventual transition to DuckDB and the lakehouse architecture more informed and successful. We learn more about what we need before we build the final how.
  3. The shift to DuckDB and a proper lakehouse is significant. It will take time, experimentation, and focused effort (as you've outlined with your DuckDB suggestions). This interim approach buys us that time for @vasilmax's schema work and collective design, while still enabling immediate progress on data-driven collaboration.

In summary

I fully agree that DuckDB is the better long-term direction for ease of local setup, analytical power, and potentially as part of the production lakehouse. My proposal is simply to use our current tools as a temporary bridge to:

  • Immediately provide common data.
  • Collectively learn from this data and our current processes.
  • Thoughtfully design and implement the DuckDB/lakehouse architecture, making it more robust and fit-for-purpose based on shared experience.

This means the 'key person/technology dependency' risks you rightly identified are time-bound and managed, with the understanding that this is a temporary phase, not a destination.

Looking forward to more discussion = )

@jring-o
Collaborator Author

jring-o commented May 14, 2025

Jon is going to provide a data-dump ASAP
