Skip to content

packaging #2438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

packaging #2438

wants to merge 10 commits into from

Conversation

madhur-ob
Copy link
Collaborator

No description provided.

Now allows:
@Environment(vars=config_expr("..."))
  - This was broken since vars was a ConfigValue and not an updatable dictionary

@project(name=config_expr("get_my_name(cfg)"))
  - This was broken because when evaluating the config_expr, we did not
    consider the globals at the point of call but instead inside the file
    defining config_expr.

Tests are deployed internally.
  - fixes a case where configs and Runner/Deployer were not compatible due to
    configs not being processed in the same cwd

  - fixes an issue with constructing the config values (introduced in
    previous commit)

  - fixes the error message when both default and specified config files
    are not found
Previously something like:

RETRIES = config_expr("settings").retry

class MyFlow(FlowSpec):
  @retries(times=RETRIES)
  @step
  def start(self):
    ...

  @retries(times=RETRIES)
  @step
  def end(self):
    ...

would fail. Also, something like:

FOO = config_expr("settings").foo.bar
BAR = FOO.BAZ
BLAH = FOO.BAZ2

would also fail.

This fixes these issues by copying DelayEvaluator everytime and also
making the `__call__` function reentrant.
# The JSON module in Python3 deals with Unicode. Tar gives bytes.
info_str = (
self._tar.extractfile(os.path.basename(INFO_FILE)).read().decode("utf-8")
info_str = MFEnv.get_archive_content(self._tar, MetaFile.INFO_FILE).decode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be cleaned up better - MFEnv.from_archive(self._tar).get_info()

@@ -14,18 +14,14 @@
from .parameters import current_flow
from .user_configs.config_decorators import CustomStepDecorator
from .user_configs.config_parameters import (
ConfigValue,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used

unicode
except NameError:
unicode = str
basestring = str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can pull out Python2 changes in a separate PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THere are definitely other changes. We can revert this but I think we can leave this here for now and yes, do a separate PR to pull out all the other ones.

Comment on lines +316 to +333
tarball. This hook can return one of two things:
- a generator yielding a tuple of `(file_path, arcname)` to add files to
the code package. `file_path` is the path to the file on the local filesystem
and `arcname` is the path relative to the packaged code.
- a generator yielding a tuple of `(content, arcname, type)` where:
- type is a AddToPackageType
- for CODE_FILE:
- content: path to the file to include
- arcname: path relative to the code directory in the package
- for CODE_MODULE:
- content: name of the module
- arcame: None (ignored)
- for CONFIG_FILE:
- content: path to the file to include
- arcname: path relative to the config directory in the package
- for CONFIG_CONTENT:
- content: bytes to include
- arcname: path relative to the config directory in the package
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite funky - returning one of two things with different types. need to fix this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you suggest? As I mentioned:

  • we can make it a breaking change (change the signature of add_package)
  • we can add a different method (feels wonky too)
  • ??

# moving here. The reason is that this is needed to read extension information which needs
# to happen before mfenv gets packaged.

MFENV_DIR = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change MFENV to MFCODE. we already have --environment and @Environment - MFENV will only lead to more confusion here and everywhere.

Comment on lines +87 to +90
if version not in get_filname_map:
raise ValueError(
"Unsupported packaging version '%s'. Please update Metaflow" % version
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we actually ever hit this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes -- if a newer version of Metaflow has a newer packaging method and an old client is used to access stuff.

except:
# python 2
f.name = f.__func__.func_name
f.wrappers = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not related to this PR. It slipped through.

# here that is then used in MFEnv (it logically belongs in MFEnv but that file can't
# be loaded just yet).
global _info_file_content
global _info_file_present
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this extra global

_UNINITIALISED = object()         # sentinel
_info_file_content = _UNINITIALISED


def read_info_file():
    """Return the parsed info.json once, or None if the file is absent."""
    global _info_file_content

    if _info_file_content is _UNINITIALISED:
        info_filename = get_filename(MetaFile.INFO_FILE)
        if info_filename is not None:
            with open(info_filename, "r", encoding="utf-8") as f:
                _info_file_content = json.load(f)
        else:
            _info_file_content = None

    return _info_file_content

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make that change. Technically there is an extra global too (the _UNINITIALIZED) but it can be shared.

class MetaFile(Enum):
INFO_FILE = "INFO"
CONFIG_FILE = "CONFIG_PARAMETERS"
INCLUDED_DIST_INFO = "INCLUDED_DIST_INFO"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is INCLUDED_DIST_INFO - hard to understand what this implies - is there an EXCLUDED_DIST_INFO too somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can clarify: it means the information about any distribution that is included. So no, there is no excluded one. I can just hame it DIST_INFO.

)


class MetaFile(Enum):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for future generations looking at the code, it will be immensely useful to call out in detail what is included in these files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can add this. We didn't add any file (except dist info) but can make it more clear as part of making the code better.

)


class MetaFile(Enum):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, do we need to have three separate files? any value in merging all/some?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely have them all as one. I vaguely remember you being in favor of separate files in the past but I have no strong preference.



def generic_get_filename(
name: Union[MetaFile, str], is_meta: Optional[bool] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_meta is not actually used. consider dropping

def v1_get_filename(
name: Union[MetaFile, str],
meta_info: Dict[str, Any],
is_meta: Optional[bool] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work if i am interested in a file that unfortunately clashes with an internal reserved name? also, is_meta actually useful? what if we simply had another method that provided the contents of INFO file and it included everything you need?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps by setting is_meta to false actively somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me check on that. The idea of putting stuff in mfconf is that they won't clash with code.


def read_included_dist_info():
global _included_dist_info
global _included_dist_present
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Called to add custom files needed for this environment. This hook will be
called in the `MetaflowPackage` class where metaflow compiles the code package
tarball. This hook can return one of two things:
- a generator yielding a tuple of `(file_path, arcname)` to add files to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit funky, we shouldn't return two types of tuples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already addressed above with a question.

return None


class MFEnv:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can MetaflowPackage instead be the way that rest of the code accesses functionality here? will make it easier for all the decorator implementations which already have access to the package object to get access to info files etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading through the code, MFEnv seems like an internal implementation detail for MetaflowPackage. If we can keep it as such throughout the code, that will be great!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say it's a implementation detail but more a part of it but yes, I can move some of those methods back out to MetaflowPackage. That is not a problem.

@@ -159,11 +159,11 @@ def runtime_init(self, flow, graph, package, run_id):
os.path.join(self.metaflow_dir.name, "metaflow"),
)

info = os.path.join(get_metaflow_root(), os.path.basename(INFO_FILE))
info = MFEnv.get_filename(MetaFile.INFO_FILE)
Copy link
Collaborator

@savingoyal savingoyal May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to do package.get_info_file_path() so that we don't have to bother decorator authors to worry about what is MFEnv or MetaFile or InfoFile

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to specialize too much I guess but I don't mind having separate utility functions get_info_file_path, get_dist_file_path etc if you prfer that. I was trying to make it easier to add files later.

@@ -159,11 +159,11 @@ def runtime_init(self, flow, graph, package, run_id):
os.path.join(self.metaflow_dir.name, "metaflow"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a future TODO: the path mangling here is something that the package module can handle by default. the same gymnastics are needed every single time for similar decorators.

@@ -93,6 +93,7 @@ def skip_metaflow_dependencies():
return skip_pkgs

def sync_uv_project(datastore_type):
# TODO: NEED TO POINT UV TO FILES IN THE CONFIG DIR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this needs to be addressed before we merge this PR? cc @saikonen

@@ -9,7 +9,6 @@
from itertools import takewhile
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a different PR, cc @madhur-ob

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is a change I did. What was this supposed to do?

MFENV_DIR = (
".mfenv" # Directory containing "system" code (metaflow and user dependencies)
)
MFCONF_DIR = ".mfconf" # Directory containing Metaflow's configuration files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider changing the name to something else since it is neither metaflow config not the new config feature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can have MFCODE_DIR and MFINFO_DIR ?

".mfenv" # Directory containing "system" code (metaflow and user dependencies)
)
MFCONF_DIR = ".mfconf" # Directory containing Metaflow's configuration files
MFENV_MARKER = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this be just in the info file? will simplify things quite a bit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would make it less flexible. The marker tells you how to access the info file. The info file is not in the same place already.

CODE_MODULE = 0x502
CODE_METAFLOW = 0x504
CONFIG_FILE = 0x601
CONFIG_CONTENT = 0x202
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to distinguish between all of them? wouldn't just user code, system code, info be sufficient minimally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to check. We need to distinguish between files and content I think in some of the things. I can check if I can get rid of that. I think we also nee to know something is a module to extract the dist info.


@classmethod
def _extract_archive_mfenv_info(
cls, archive: Any, packaging_backend=TarPackagingBackend
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any methods that need access to archive can be safely moved under MetaflowPackage?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check

)


class PackagedDistribution(metadata.Distribution):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider moving this to a different file so that we can avoid an import from .mfnv in cli.py

_mappings = {}

@classmethod
def get_filename(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair to assume that some methods work against local filesystem and some against the code package?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I've tried to name them differently.

return None


def read_included_dist_info():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider renaming to read_dist_info()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants