How to configure a RuntimeDataConnector
This guide demonstrates how to configure a RuntimeDataConnector and only applies to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and validate datasets.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory dataframe, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
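As a minimal sketch of that relationship (the identifier names run_id and pipeline_stage are hypothetical examples, not part of this guide's configuration): the names declared under batch_identifiers in the connector's configuration are exactly the keys a RuntimeBatchRequest must supply at runtime.

```python
# Hypothetical RuntimeDataConnector configuration declaring two identifier names.
data_connector_config = {
    "class_name": "RuntimeDataConnector",
    "batch_identifiers": ["run_id", "pipeline_stage"],
}

# At runtime, the batch request must provide a value for each declared name,
# e.g. the run_id of the Airflow DAG run that produced the data.
batch_identifiers = {
    "run_id": "2021-05-01T00:00:00+00:00",
    "pipeline_stage": "validation",
}

# The runtime keys must match the configured identifier names exactly.
assert set(batch_identifiers) == set(data_connector_config["batch_identifiers"])
```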
Steps
1. Instantiate your project's DataContext
Import the necessary packages and modules, then instantiate your Data Context:

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel import yaml

context = ge.get_context()
2. Set up a Datasource
All of the examples below assume you're testing configuration using something like:

YAML:

datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)

Python:

datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "<DATACONNECTOR NAME GOES HERE>": "<DATACONNECTOR CONFIGURATION GOES HERE>",
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
If you're not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config
3. Add a RuntimeDataConnector to a Datasource configuration
This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:

YAML:

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""

Python:

datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
Once the RuntimeDataConnector is configured, you can add your Datasource using:
context.add_datasource(**datasource_config)
Example 1: RuntimeDataConnector for access to file-system data:
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
Next, you would pass that request into context.get_validator:
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)
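With a Validator in hand, you can define Expectations interactively against the loaded Batch. A hedged sketch only: the column name passenger_count is an illustrative assumption, not part of this guide's configuration.

```python
# Continuing from the validator obtained above.
# `passenger_count` is a hypothetical column name used for illustration.
validator.expect_column_values_to_not_be_null(column="passenger_count")

# Persist the resulting Expectation Suite when you are done.
validator.save_expectation_suite(discard_failed_expectations=False)
```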
Example 2: RuntimeDataConnector that uses an in-memory DataFrame
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:
import pandas as pd

path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
Next, you would pass that request into context.get_validator:

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
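The introduction notes that a RuntimeBatchRequest can also wrap a SQL query. The Datasource configured in this guide uses a PandasExecutionEngine, so the following is a hedged sketch only: it assumes a Datasource backed by a SqlAlchemyExecutionEngine instead (connection details not shown), and the table name is hypothetical.

```python
from great_expectations.core.batch import RuntimeBatchRequest

# Sketch only: assumes "taxi_datasource" is backed by a SqlAlchemyExecutionEngine,
# not the PandasExecutionEngine configured above. The table name is hypothetical.
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",
    runtime_parameters={"query": "SELECT * FROM taxi_trips LIMIT 1000"},  # SQL query instead of a path or DataFrame
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```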
Additional Notes
To view the full script used in this page, see it on GitHub: