-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter out external feature columns in consistency job comparison #599
base: main
Are you sure you want to change the base?
Conversation
@hzding621 this seems legit. Could you take a look? Then we could fix the conflicts |
// External parts / contextual features don't get logged in the online data but do appear in the comparison table. | ||
// We need to remove them from the comparison df otherwise the comparison metric computation will fail. | ||
// We need to prepend `ext_contextual_` to the feature name since that's how it appears in the comparison df. | ||
val externalFeatureColumns = joinConf.getExternalFeatureCols.map(col => s"ext_contextual_$col") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the column name logic seems wrong. getExternalFeatureCols
currently has no usages in the codebase and doesn't handle prefix correctly.
ext_contextual_
is the prefix for only contextual features which is a special case of external features, the general case external features has prefix based on its source name.
So instead we should use:
Option(joinConf.onlineExternalParts).toSeq.flatMap(_.toScala.map(_.fullName))
@@ -88,10 +88,18 @@ class ConsistencyJob(session: SparkSession, joinConf: Join, endDate: String) ext | |||
if (unfilledRanges.isEmpty) return null | |||
val allMetrics = unfilledRanges.map { unfilled => | |||
val comparisonDf = tableUtils.sql(unfilled.genScanQuery(null, joinConf.metaData.comparisonTable)) | |||
// External parts / contextual features don't get logged in the online data but do appear in the comparison table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the root cause description might be wrong.
External / contextual parts do get logged in the online data and show up in the log table. They also show up in the comparison table but the values will be NULL.
However, currently CompareBaseJob.compare
will only compare columns that appear both on the left and on the right. In the current code, we pass in loggedDfNoExternalCols
to CompareBaseJob.compare
, and therefore already remove the external / contextual parts. So IIUC the current code shouldn't fail for this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plz see my other comments
Summary
Filters out contextual features from external parts from the comparison dataframe before computing consistency metrics with the logged data.
Why / Goal
The logged data from the online chronon system does not include the contextual features and the consistency job will fail if you don't filter out these features first.
Test Plan
Checklist
Reviewers