Joe Gurr

Pandera

Recently I have been using pandera at $WORK.

It is a package for validating dataframe-like objects against a schema, and for testing code that works with them.

The following is a quick reminder to myself of how I have been using it.

Given a pandas dataframe called df, you can infer a schema as follows.

import pandera as pa

schema = pa.infer_schema(df)

print(schema.to_script())

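The inferred schema describes the structure of the dataframe you provided: its columns, their dtypes, and any checks pandera derived from the values. Printing schema.to_script() gives you that schema as Python code you can paste into a module.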

Invariably I have to go back and update this schema, because it will have picked up properties specific to the particular dataframe I provided (for example, value bounds taken from the sample) rather than the generic structure.

I have found this process to be very useful as it forces you to think deeply about what you would like to validate against.

Once it is corrected, I take this schema and add it to the repo I'm working in. I usually keep one module per schema. I have also been adding the following function to each module (even though I know there is a better way of doing this):

def validate(df, logger):
    # Returns the df if validated else None
    try:
        return schema.validate(df)
    except pa.errors.SchemaError as e:
        logger.error(f"Failed to validate: {e}")
        return None