Compare two tables with the Onboarding Data Assistant
In this guide, you will use a Data Assistant (a utility that asks questions about your data, gathering information to describe what is observed, and then presents Metrics and proposes Expectations based on the answers) to create an Expectation Suite (a collection of verifiable assertions about data) that can be used to gauge whether two tables are identical. This workflow can be used, for example, to validate migrated data.
Prerequisites
- Completion of the Quickstart guide.
- A working installation of Great Expectations.
- A minimum of two configured Datasources and Assets
- A basic understanding of how to configure Expectations in Great Expectations
- Completion of the Data Assistants overview
Set-Up
In this workflow, we will use the OnboardingDataAssistant
to profile a BatchRequest (provided to a Datasource in order to create a Batch) representing our source data, and then validate the resulting suite against a second BatchRequest
representing our second set of data.
To begin, we'll need to import Great Expectations and instantiate our Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components):
import great_expectations as gx
context = gx.get_context()
Depending on your use case, workflow, and directory structure, you may need to update your context root directory as follows:
context = gx.get_context(
    context_root_dir='/my/context/root/directory/great_expectations'
)
Create Batch Requests
In order to profile our first table and validate our second table, we need to set up our Batch Requests pointing to each set of data.
In this guide, we will use a MySQL Data Source (which provides a standard API for accessing and interacting with data from a wide variety of source systems) as our source data -- the data we trust to be correct. Here, mysql_datasource refers to the MySQL Data Source you configured as a prerequisite.
mysql_batch_request = mysql_datasource.get_asset(
"mysql_taxi_data"
).build_batch_request()
From this data, we will create an Expectation Suite and use that suite to validate our second table, pulled from a PostgreSQL Data Source (pg_datasource below refers to the PostgreSQL Data Source you configured as a prerequisite).
pg_batch_request = pg_datasource.get_asset("postgres_taxi_data").build_batch_request()
Profile source data
We can now use the OnboardingDataAssistant
to profile our MySQL data defined in the mysql_batch_request
above.
data_assistant_result = context.assistants.onboarding.run(
batch_request=mysql_batch_request,
exclude_column_names=[
"VendorID",
"improvement_surcharge",
"tip_amount",
"extra",
"fare_amount",
"payment_type",
"trip_distance",
"total_amount",
],
)
And use the results from the Data Assistant to build and save an Expectation Suite:
expectation_suite_name = "compare_two_tables"
expectation_suite = data_assistant_result.get_expectation_suite(
expectation_suite_name=expectation_suite_name
)
context.add_or_update_expectation_suite(expectation_suite=expectation_suite)
Why exclude_column_names?
Some SQL dialects handle data types in different ways, and this can cause precision mismatches on some numbers.
In our hypothetical use case these inconsistencies are tolerated, and therefore Expectations are not set against the columns likely to generate the errors.
This is an example of how an Expectation Suite created by the Data Assistant can be customized. For more on these configurations, see our guide on the OnboardingDataAssistant.
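To make the effect of exclude_column_names concrete, here is a minimal pure-Python sketch. The column list is an assumption based on the NYC taxi dataset used above, not pulled from a live Data Source; the real filtering happens inside the Data Assistant.

```python
# Hypothetical column list for the taxi table; these names are assumptions
# based on the NYC taxi dataset, not read from a live Data Source.
all_columns = [
    "VendorID", "passenger_count", "trip_distance", "payment_type",
    "fare_amount", "extra", "tip_amount", "improvement_surcharge",
    "total_amount", "congestion_surcharge",
]

# The columns excluded in the run above, because dialect-specific numeric
# handling can cause precision mismatches between MySQL and PostgreSQL.
exclude_column_names = [
    "VendorID", "improvement_surcharge", "tip_amount", "extra",
    "fare_amount", "payment_type", "trip_distance", "total_amount",
]

# The Data Assistant only proposes Expectations for the remaining columns.
profiled_columns = [c for c in all_columns if c not in exclude_column_names]
print(profiled_columns)  # → ['passenger_count', 'congestion_surcharge']
```

Excluding a column here does not delete it from the table; it simply means no Expectations are generated for it, so precision differences in those columns cannot fail the comparison.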
Checkpoint Set-Up
Before we can validate our second table, we need to define a Checkpoint (the primary means for validating data in a production deployment of Great Expectations).
We will pass both the pg_batch_request
and the Expectation Suite defined above to this checkpoint.
checkpoint = context.add_or_update_checkpoint(
name="comparison_checkpoint",
validations=[
{
"batch_request": pg_batch_request,
"expectation_suite_name": expectation_suite_name,
}
],
)
Validation
Finally, we can use our Checkpoint to validate that our two tables are identical:
checkpoint_result = checkpoint.run()
If we now inspect the results of this Checkpoint (checkpoint_result["success"]
), we can see that our Validation was successful!
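A minimal sketch of acting on that result is below; the dictionary is a stand-in for the real CheckpointResult object (of which only dictionary-style access to "success" is shown), so that the branching logic is clear.

```python
# Stand-in for the CheckpointResult returned by checkpoint.run(); the real
# object also supports dictionary-style access to "success".
checkpoint_result = {"success": True}

if checkpoint_result["success"]:
    message = "Validation passed: the two tables match the Expectation Suite."
else:
    message = "Validation failed: review Data Docs for details."
print(message)
```

In a production pipeline, this overall pass/fail flag is typically what you would gate a migration sign-off or downstream job on.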
By default, the Checkpoint above also updates your Data Docs, allowing you to further inspect the results of this workflow.
Congratulations!
🎉 You've just compared two tables across Datasources! 🎉