Compare two tables with the Onboarding Data Assistant
In this guide, you will use a Data Assistant (a utility that asks questions about your data, gathering information to describe what is observed, and then presents Metrics and proposes Expectations based on the answers) to create an Expectation Suite (a collection of verifiable assertions about data) that can be used to gauge whether two tables are identical. This workflow can be used, for example, to validate migrated data.
Prerequisites
- Completion of the Quickstart guide.
- A working installation of Great Expectations.
- A minimum of two configured Datasources and Assets
- A basic understanding of how to configure Expectations in Great Expectations
- Completion of the Data Assistants overview
Set-Up
In this workflow, we will use the OnboardingDataAssistant
to profile a BatchRequest (provided to a Datasource in order to create a Batch) representing our source data, and then validate the resulting suite against a second BatchRequest
representing our second set of data.
To begin, we'll need to import Great Expectations and instantiate our Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components):
import great_expectations as gx
context = gx.get_context()
Depending on your use case, workflow, and directory structure, you may need to update your context root directory as follows:
context = gx.get_context(
    context_root_dir='/my/context/root/directory/great_expectations'
)
Create Batch Requests
In order to profile our first table and validate our second table, we need to set up our Batch Requests pointing to each set of data.
In this guide, we will use a MySQL Data Source (which provides a standard API for accessing and interacting with data from a wide variety of source systems) as our source data -- the data we trust to be correct. Here, mysql_datasource refers to the MySQL Data Source you configured as a prerequisite.
mysql_batch_request = mysql_datasource.get_asset(
"mysql_taxi_data"
).build_batch_request()
From this data, we will create an Expectation Suite and use that suite to validate our second table, pulled from a PostgreSQL Data Source (pg_datasource below refers to the PostgreSQL Data Source you configured as a prerequisite).
pg_batch_request = pg_datasource.get_asset("postgres_taxi_data").build_batch_request()
Profile source data
We can now use the OnboardingDataAssistant
to profile our MySQL data defined in the mysql_batch_request
above.
data_assistant_result = context.assistants.onboarding.run(
batch_request=mysql_batch_request,
exclude_column_names=[
"VendorID",
"improvement_surcharge",
"tip_amount",
"extra",
"fare_amount",
"payment_type",
"trip_distance",
"total_amount",
],
)
And use the results from the Data Assistant to build and save an Expectation Suite:
expectation_suite_name = "compare_two_tables"
expectation_suite = data_assistant_result.get_expectation_suite(
expectation_suite_name=expectation_suite_name
)
context.add_or_update_expectation_suite(expectation_suite=expectation_suite)
Why exclude_column_names?
Some SQL dialects handle data types in different ways, and this can cause precision mismatches on some numbers.
In our hypothetical use case these inconsistencies are tolerated, and therefore Expectations are not set against the columns likely to generate the errors.
This is an example of how an Expectation Suite created by the Data Assistant can be customized. For more on these configurations, see our guide on the OnboardingDataAssistant.
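To make the effect of exclude_column_names concrete, here is a minimal pure-Python sketch. The column list is an assumption based on the NYC taxi dataset used above, not pulled from a live Data Source; the real filtering happens inside the Data Assistant.

```python
# Hypothetical column list for the taxi table; these names are assumptions
# based on the NYC taxi dataset, not read from a live Data Source.
all_columns = [
    "VendorID", "passenger_count", "trip_distance", "payment_type",
    "fare_amount", "extra", "tip_amount", "improvement_surcharge",
    "total_amount", "congestion_surcharge",
]

# The columns excluded in the run above, because dialect-specific numeric
# handling can cause precision mismatches between MySQL and PostgreSQL.
exclude_column_names = [
    "VendorID", "improvement_surcharge", "tip_amount", "extra",
    "fare_amount", "payment_type", "trip_distance", "total_amount",
]

# The Data Assistant only proposes Expectations for the remaining columns.
profiled_columns = [c for c in all_columns if c not in exclude_column_names]
print(profiled_columns)  # → ['passenger_count', 'congestion_surcharge']
```

Excluding a column here does not delete it from the table; it simply means no Expectations are generated for it, so precision differences in those columns cannot fail the comparison.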
Checkpoint Set-Up
Before we can validate our second table, we need to define a Checkpoint (the primary means for validating data in a production deployment of Great Expectations).
We will pass both the pg_batch_request
and the Expectation Suite defined above to this checkpoint.
checkpoint = context.add_or_update_checkpoint(
name="comparison_checkpoint",
validations=[
{
"batch_request": pg_batch_request,
"expectation_suite_name": expectation_suite_name,
}
],
)
Validation
Finally, we can use our Checkpoint to validate that our two tables are identical:
checkpoint_result = checkpoint.run()
If we now inspect the results of this Checkpoint (checkpoint_result["success"]
), we can see that our Validation was successful!
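A minimal sketch of acting on that result is below; the dictionary is a stand-in for the real CheckpointResult object (of which only dictionary-style access to "success" is shown), so that the branching logic is clear.

```python
# Stand-in for the CheckpointResult returned by checkpoint.run(); the real
# object also supports dictionary-style access to "success".
checkpoint_result = {"success": True}

if checkpoint_result["success"]:
    message = "Validation passed: the two tables match the Expectation Suite."
else:
    message = "Validation failed: review Data Docs for details."
print(message)
```

In a production pipeline, this overall pass/fail flag is typically what you would gate a migration sign-off or downstream job on.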
By default, the Checkpoint above also updates your Data Docs, allowing you to further inspect the results of this workflow.
Congratulations!
🎉 You've just compared two tables across Datasources! 🎉