How to use auto-initializing Expectations
This guide walks you through using auto-initializing Expectations to automate parameter estimation when you create Expectations interactively with a Batch or Batches that have been loaded into a Validator.
Steps
Setup
This guide assumes that you are creating and editing expectations in a Jupyter Notebook. This process is covered in the guide: How to create and edit expectations with instant feedback from a sample batch of data.
Additionally, this guide assumes that you are using a multi-batch Batch Request to provide your sample data. (Auto-initializing Expectations will work when run on a single Batch, but they really shine when run on multiple Batches that would otherwise have needed to be processed individually if a manual approach were taken.)
1. Determine if your Expectation is auto-initializing
Not all Expectations are auto-initializing. To be auto-initializing, an Expectation must have parameters that can be estimated. For example, ExpectColumnToExist only takes in a Domain (the column name) and checks whether that column name is in the list of names in the table's metadata. Because it has no parameters to estimate, it would not work under the auto-initializing framework.
Expectations that would work under the auto-initializing framework include those that have numeric ranges, such as ExpectColumnMeanToBeBetween, ExpectColumnMaxToBeBetween, and ExpectColumnSumToBeBetween.
To check whether the Expectation you are interested in works under the auto-initializing framework, run the is_expectation_self_initializing() method of the Expectation class.
For example:
from great_expectations.expectations.expectation import Expectation
Expectation.is_expectation_self_initializing(name="expect_column_to_exist")
will return False and print the message:
The Expectation expect_column_to_exist is not able to be auto-initialized.
However, the command:
Expectation.is_expectation_self_initializing(name="expect_column_mean_to_be_between")
will return True and print the message:
The Expectation expect_column_mean_to_be_between is able to be auto-initialized. Please run by using the auto=True parameter.
For the purposes of this guide, we will be using expect_column_mean_to_be_between as our example Expectation.
2. Run the expectation with auto=True
Say you are interested in constructing an Expectation that captures the average distance of taxi trips across all of 2018. You have a Datasource that provides 12 Batches (one for each month of the year) and you know that expect_column_mean_to_be_between is the Expectation you want to implement.
The manual way
The Expectation expect_column_mean_to_be_between() has the following parameters:
- column (str): The column name.
- min_value (float or None): The minimum value for the column mean.
- max_value (float or None): The maximum value for the column mean.
- strict_min (boolean): If True, the column mean must be strictly larger than min_value, default=False
- strict_max (boolean): If True, the column mean must be strictly smaller than max_value, default=False
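To make the strict_min and strict_max flags concrete, here is a minimal plain-Python sketch of the comparison the parameters describe. It is an illustration only, not the library's implementation: the function name column_mean_is_between is hypothetical, and treating a None bound as "no limit on that side" is an assumption.

```python
from statistics import mean
from typing import Optional, Sequence


def column_mean_is_between(
    values: Sequence[float],
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    strict_min: bool = False,
    strict_max: bool = False,
) -> bool:
    """Hypothetical helper mirroring the parameter list above (illustration only)."""
    column_mean = mean(values)
    if min_value is not None:
        # With strict_min=True, a mean exactly equal to min_value fails.
        if column_mean < min_value or (strict_min and column_mean == min_value):
            return False
    if max_value is not None:
        # With strict_max=True, a mean exactly equal to max_value fails.
        if column_mean > max_value or (strict_max and column_mean == max_value):
            return False
    # Assumption: a bound left as None places no limit on that side.
    return True
```

With the default flags the bounds are inclusive, so a column whose mean equals min_value passes; setting strict_min=True makes that same column fail.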
Without the auto-initialization framework, you would have to determine min_value and max_value for your series of 12 Batches yourself: calculate the mean value for each Batch, then use those calculated means to choose the min_value and max_value parameters to pass to your Expectation. Although not difficult, this would be a monotonous and time-consuming task.
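The manual calculation described above can be sketched in plain Python. This is a sketch under stated assumptions: the helper estimate_mean_bounds is hypothetical, the sample values are made up (they are not the 2018 taxi data), and taking the smallest and largest per-batch mean is just one simple way to pick the bounds. The estimator used by auto=True may choose its values differently.

```python
from statistics import mean
from typing import Dict, List, Tuple


def estimate_mean_bounds(batches: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (min_value, max_value) as the smallest and largest per-batch mean."""
    batch_means = [mean(values) for values in batches.values()]
    return min(batch_means), max(batch_means)


# Made-up trip distances standing in for monthly Batches (illustration only).
sample_batches = {
    "2018-01": [2.0, 4.0],
    "2018-02": [2.5, 3.0],
    "2018-03": [3.0, 3.5],
}

min_value, max_value = estimate_mean_bounds(sample_batches)
print(min_value, max_value)
```

The resulting pair would then be passed by hand as min_value and max_value, and the whole calculation repeated whenever the set of Batches changes.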
Using auto=True
Auto-initializing Expectations automate this sort of calculation across Batches. To perform the same calculation described above (the range of means across the 12 Batches in the 2018 taxi data), the only thing you need to do is run the Expectation with auto=True:
expectation_result = validator.expect_column_mean_to_be_between(
column="trip_distance", auto=True
)
Now the Expectation will calculate the min_value (2.83) and max_value (3.06) using all of the Batches that are loaded into the Validator. In our case, that means all 12 Batches associated with the 2018 taxi data.
3. Save your Expectation with the calculated values
Now that the Expectation's upper and lower bounds have been estimated from the Batches, you can save your Expectation Suite and move on.
validator.save_expectation_suite(discard_failed_expectations=False)
Additional information
To view the full scripts that were used in this page, see them on GitHub: