Pandera
Recently I have been using pandera at
$WORK
.
It is a package that allows one to validate and test against dataframe-like objects.
The following is a quick reminder for myself how I have been using it.
Given a pandas dataframe called df
, you can infer a schema as follows.
import pandera as pa
schema = pa.infer_schema(df)
print(schema.to_script())
This schema is a structure of the dataframe you have provided.
Invariably I have to go back and update this schema, as it will have picked up specific qualities of the particular dataframe I provided, rather than the generic structure.
I have found this process to be very useful as it forces you to think deeply about what you would like to validate against.
Once corrected I take this schema and add it to the repo I'm working in. I usually have been having one module per schema. I also have been adding the following function to each module (even though I know there is a better way of doing this):
def validate(df, logger):
# Returns the df if validated else None
try:
return schema.validate(df)
except pa.errors.SchemaError as e:
logger.error(f"Failed to validate: {e}")
return None