How to validate data by running a Checkpoint
This guide will help you Validate your data (that is, apply an Expectation Suite to a Batch) by running a Checkpoint, the primary means for validating data in a production deployment of Great Expectations.
The best way to Validate data with Great Expectations is using a Checkpoint. Checkpoints identify which Expectation Suites (collections of verifiable assertions about data) to run against which Data Asset (a collection of records within a Datasource, usually named after the underlying data system and sliced to correspond to a desired specification) and Batch (a selection of records from a Data Asset, described by a Batch Request), and which Actions (Python classes with a run method that takes a Validation Result and does something with it) to take based on the results of those tests.
Succinctly: Checkpoints are used to test your data and take action based on the results.
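To make those three ingredients concrete, the sketch below writes out a Checkpoint configuration as a plain Python dictionary. It is an illustration under assumptions, not the definitive schema: the key names (`validations`, `action_list`) and the Action class names reflect the Checkpoint configuration format commonly shown for the 0.16 series, and the datasource, asset, suite, and Checkpoint names are borrowed from the examples later in this guide. Check the Checkpoint reference for your installed version before relying on it.

```python
# Illustrative Checkpoint configuration expressed as a plain dictionary.
# Key and class names are assumptions based on the 0.16-series config format.
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1.0,
    "class_name": "Checkpoint",
    # What to validate: each entry pairs a Batch Request with an Expectation Suite.
    "validations": [
        {
            "batch_request": {
                "datasource_name": "taxi_source",
                "data_asset_name": "yellow_tripdata",
            },
            "expectation_suite_name": "yellow_tripdata_suite",
        }
    ],
    # What to do with the results: Actions run after validation completes.
    "action_list": [
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"},
        },
    ],
}

# Summarize which suite runs against which asset.
for validation in checkpoint_config["validations"]:
    asset = validation["batch_request"]["data_asset_name"]
    suite = validation["expectation_suite_name"]
    print(f"{suite} -> {asset}")
```

The dictionary only describes the Checkpoint; nothing is validated until the Checkpoint is run.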
Prerequisites
- Completion of the Quickstart guide.
- A working installation of Great Expectations.
- A configured Data Context.
- A configured Expectation Suite.
- A configured Checkpoint.
You can run the Checkpoint from the CLI (Command Line Interface) in a Terminal shell or using Python.
- Python
- Terminal
If you already have created and saved a Checkpoint, then the following code snippet will retrieve it from your context and run it:
import sys

# Assumes `context` is an existing Data Context, e.g. context = gx.get_context()
result = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    batch_request={
        "datasource_name": "taxi_source",
        "data_asset_name": "yellow_tripdata",
    },
    run_name=None,
)

# Exit with a non-zero status so that calling scripts can detect the failure
if not result["success"]:
    print("Validation failed!")
    sys.exit(1)

print("Validation succeeded!")
If you do not have a Checkpoint, the prerequisite guides listed above will take you through the necessary steps. Alternatively, the concise example below shows how to connect to data, create an Expectation Suite using a Validator, and create a Checkpoint, saving everything to the Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components) along the way.
# setup
import sys

import great_expectations as gx

context = gx.get_context()

# starting from scratch, we add a datasource and asset
# (data_directory must be set to the folder that holds the sample CSV files)
datasource = context.sources.add_pandas_filesystem(
    name="taxi_source", base_directory=data_directory
)
asset = datasource.add_csv_asset(
    "yellow_tripdata",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
    order_by=["-year", "month"],
)

# create an expectation suite, then use a validator to add expectations to it
context.add_expectation_suite("yellow_tripdata_suite")
validator = context.get_validator(
    datasource_name="taxi_source",
    data_asset_name="yellow_tripdata",
    expectation_suite_name="yellow_tripdata_suite",
)
validator.expect_column_values_to_not_be_null("pickup_datetime")
# persist the expectations so the checkpoint does not run an empty suite
validator.save_expectation_suite(discard_failed_expectations=False)

# create a checkpoint
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="my_checkpoint",
    data_context=context,
    expectation_suite_name="yellow_tripdata_suite",
)

# add (save) the checkpoint to the data context
context.add_checkpoint(checkpoint=checkpoint)

# confirm that the checkpoint can be retrieved by name
cp = context.get_checkpoint(name="my_checkpoint")
assert cp.name == "my_checkpoint"
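The `batching_regex` and `order_by` arguments in the example above decide which files become Batches and in what order. The following sketch reproduces that selection with Python's standard `re` module only; it does not call Great Expectations. The file names are hypothetical, and the reading of `order_by=["-year", "month"]` as "year descending, then month ascending" is an assumption about the library's behavior, not a statement of its internals.

```python
import re

# Same pattern as the batching_regex above, with the dot before "csv" escaped.
BATCHING_REGEX = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

# Hypothetical file names, used only for illustration.
file_names = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2020-01.csv",
    "yellow_tripdata_sample_2019-02.csv",
    "README.md",  # does not match, so it yields no batch
]


def batch_keys(names):
    """Return (year, month) pairs for matching names, newest year first."""
    keys = []
    for name in names:
        match = BATCHING_REGEX.fullmatch(name)
        if match:
            keys.append((int(match["year"]), int(match["month"])))
    # Emulates order_by=["-year", "month"]: year descending, then month ascending.
    return sorted(keys, key=lambda key: (-key[0], key[1]))


ordered = batch_keys(file_names)
print(ordered)  # [(2020, 1), (2019, 1), (2019, 2)]
```

The named groups `year` and `month` are what make each matching file individually addressable, which is why the pattern uses `(?P<name>...)` rather than plain groups.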
If you have already created and saved a Checkpoint, then you can run the Checkpoint using the CLI.
great_expectations checkpoint run my_checkpoint
Additional notes
This command returns POSIX status codes and prints messages as follows:
+--------------------------------+-----------------+-----------------------+
| **Situation**                  | **Return code** | **Message**           |
+--------------------------------+-----------------+-----------------------+
| all validations passed         | 0               | Validation succeeded! |
+--------------------------------+-----------------+-----------------------+
| one or more validations failed | 1               | Validation failed!    |
+--------------------------------+-----------------+-----------------------+
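Because the outcome is reported through the return code, a scheduler or CI job can branch on it. The sketch below is one way to wrap the CLI call from Python using the standard `subprocess` module. The helper names `describe_return_code` and `run_checkpoint_cli`, the injectable `runner` parameter, and the fallback message for unlisted codes are all hypothetical and are not part of Great Expectations.

```python
import subprocess

# Messages from the table above, keyed by return code.
MESSAGES = {0: "Validation succeeded!", 1: "Validation failed!"}


def describe_return_code(code):
    """Map a CLI return code to the message listed in the table."""
    # Codes not listed in the table get a generic fallback (an assumption).
    return MESSAGES.get(code, f"Unexpected return code: {code}")


def run_checkpoint_cli(checkpoint_name, runner=subprocess.run):
    """Run the checkpoint through the CLI and return its return code.

    `runner` is injectable so the function can be exercised without
    Great Expectations installed.
    """
    completed = runner(["great_expectations", "checkpoint", "run", checkpoint_name])
    return completed.returncode
```

In a pipeline script, `raise SystemExit(run_checkpoint_cli("my_checkpoint"))` would pass the same status on to the calling process, so a failed validation stops the job.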