SparkDFExecutionEngine
- class great_expectations.execution_engine.SparkDFExecutionEngine(*args, persist=True, spark_config=None, force_reuse_spark_context=True, **kwargs)#
SparkDFExecutionEngine instantiates the ExecutionEngine API to support computations using Spark platform.
This class holds an attribute spark_df which is a spark.sql.DataFrame.
Constructor builds a SparkDFExecutionEngine, using provided configuration parameters.
- Parameters
*args – Positional arguments for configuring SparkDFExecutionEngine
persist – If True (default), then creation of the Spark DataFrame is done outside this class
spark_config – Dictionary of Spark configuration options
force_reuse_spark_context – If True then utilize existing SparkSession if it exists and is active
**kwargs – Keyword arguments for configuring SparkDFExecutionEngine
For example:
name: str = "great_expectations-ee-config"
spark_config: Dict[str, str] = {
"spark.app.name": name,
"spark.sql.catalogImplementation": "hive",
"spark.executor.memory": "512m",
}
execution_engine = SparkDFExecutionEngine(spark_config=spark_config)
spark_session: SparkSession = execution_engine.spark- get_compute_domain(domain_kwargs: dict, domain_type: Union[str, great_expectations.core.metric_domain_types.MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None) Tuple[pyspark.DataFrame, dict, dict] #
Uses a DataFrame and Domain kwargs (which include a row condition and a condition parser) to obtain and/or query a Batch of data.
Returns in the format of a Spark DataFrame along with Domain arguments required for computing. If the Domain is a single column, this is added to 'accessor Domain kwargs' and used for later access.
- Parameters
domain_kwargs (dict) – a dictionary consisting of the Domain kwargs specifying which data to obtain
domain_type (str or MetricDomainTypes) – an Enum value indicating which metric Domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute Domain but should be ignored when describing the Domain and simply transferred with their associated values into accessor_domain_kwargs.
- Returns
a DataFrame (the data on which to compute)
a dictionary of compute_domain_kwargs, describing the DataFrame
a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the Domain within the compute domain
- Return type
A tuple including
- get_domain_records(domain_kwargs: dict) pyspark.DataFrame #
Uses the given Domain kwargs (which include row_condition, condition_parser, and ignore_row_if directives) to obtain and/or query a batch.
- Parameters
domain_kwargs (dict) –
- Returns
A DataFrame (the data on which to compute returned in the format of a Spark DataFrame)