How to quickly explore Expectations in a notebook
Building ExpectationsA verifiable assertion about data. as you conduct exploratory data analysis is a great way to ensure that your insights about data processes and pipelines remain part of your team's knowledge.
This guide will help you quickly get a taste of Great Expectations, without even setting up a Data ContextThe primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components.. All you need is a notebook and some data.
- Installed Great Expectations (e.g.
pip install great_expectations
) - Have access to a notebook (e.g.
jupyter notebook
,jupyter lab
, etc.) - Obtained a sample of data to use for exploration
Unlike most how-to guides, these instructions do not assume that you have already configured a Data Context by running great_expectations init
. Once you're comfortable with these basic concepts, you will almost certainly want to unlock the full power of Great Expectations by configuring a Data Context. Please check out the instructions in the Getting started tutorial when you're ready to start.
Steps
All of these steps take place within your notebook:
1. Import Great Expectations.
import great_expectations as ge
2. Load some data.
The simplest way to do this is with read_csv
.
my_df = ge.read_csv("my_data_directory/titanic.csv")
This method behaves exactly the same as pandas.read_csv
, so you can add parameters to parse your file:
my_df = ge.read_csv(
"my_data_directory/my_messy_data.csv",
sep="\t",
skiprows=3
)
Similarly wrapped versions of other pandas methods (read_excel
, read_table
, read_parquet
, read_pickle
, read_json
, etc.) are also available. Please see the great_expectations.utils
module for details.
If you wish to load data from somewhere else (e.g. from a SQL database or blob store), please fetch a copy of the data locally. Alternatively, you can configure a Data Context with Datasources, which will allow you to take advantage of more of Great Expectations' advanced features.
As alternatives, if you have already instantiated :
- a
pandas.Dataframe
, you can usefrom_pandas
:
my_df = ge.from_pandas(
my_pandas_dataframe
)
This method will convert your boring old pandas DataFrame
into a new and exciting great_expectations PandasDataset
. The two classes are absolutely identical, except that PandasDataset
has access to Great Expectations' methods.
- a
Spark DataFrame
, you can useSparkDFDataset
:
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
my_df = SparkDFDataset(my_spark_dataframe)
This method will create an object with access to Great Expectations' methods, such as ProfilingResultsPageRenderer
.
3. Explore your data and add Expectations
Each of the methods in step 1 will produce my_df
, a PandasDataset
. PandasDataset
is a subclass of pandas.DataFrame
, which means that you can use all of pandas' normal methods on it.
my_df.head()
my_df.Sex.value_counts()
my_df[my_df.Sex=="male"].head()
# etc., etc.
In addition, my_df
has access to a wide array of Expectations. You can see the full list in the Expectation Gallery. By convention, every ExpectationA verifiable assertion about data. method name starts with the name expect_...
, so you can quickly access the full list with tab-based autocomplete:
When you invoke an Expectation, it will immediately be ValidatedThe act of applying an Expectation Suite to a Batch. against your data. The returned object will contain the result and a list of unexpected values. This instant feedback helps you zero in on unexpected data very quickly, taking a lot of the guesswork out of data exploration.
Hint: it's common to encounter data issues where most cases match, but you can't guarantee 100% adherence. In these cases, consider using a mostly
parameter. This parameter is an option for all Expectations that are applied on a row-by-row basis, and allows you to control the level of wiggle room you want built into your data Validation.
Note how success
switches from false
to true
once mostly=.99
is added.
4. Review your Expectations.
As you run Expectations in your notebook, my_df
will build up a running list of Expectations. By default, Great Expectations will recognize and replace duplicate Expectations, so that only the most recent version is stored. (See "Determining duplicate results" below for details.)
You can get the config file for your Expectations by running:
my_df.get_expectation_suite()
which will return an Expectation SuiteA collection of verifiable assertions about data. object.
By default, get_expectation_suite()
only returns Expectations with success=True
on their most recent Validation. You can override this behavior with:
my_df.get_expectation_suite(discard_failed_expectations=False)
5. Save your Expectation Suite.
Expectation Suites can be serialized as JSON objects, so you can save your Expectation Suite like this:
import json
with open( "my_expectation_file.json", "w") as my_file:
my_file.write(
json.dumps(my_df.get_expectation_suite().to_json_dict())
)
As you develop more Expectation Suites, you'll probably want some kind of system for naming and organizing them, not to mention matching them up with data, validating them, and keeping track of Validation ResultsGenerated when data is Validated against an Expectation or Expectation Suite..
When you get to this stage, we recommend following the getting started tutorial to set up a Data Context. You can get through the basics in less than half an hour, and setting up a Data Context will unlock many additional power tools within Great Expectations.
Additional notes
Adding notes and metadata
You can also add notes and structured metadata to Expectations:
>> my_df.expect_column_values_to_match_regex(
"Name",
"^[A-Za-z\, \(\)\']+$",
meta = {
"notes": {
"content": [ "A simple experimental regex for name matching." ],
"format": "markdown",
"source": "max@company.com"
}
)
Determining duplicate results
As a general rule,
If a given Expectation has no
column
parameters, it will replace another Expectation(s) of the same type.Example:
expect_table_row_count_to_equal(100)
will overwrite
expect_table_row_count_to_equal(200)
If a given Expectation has one or more
column
parameters, it will replace another Expectation(s) of the same type with the same column parameter(s).Example:
expect_column_values_to_be_between(
column="percent_agree",
min_value=0,
max_value=100,
)will overwrite
expect_column_values_to_be_between(
column="percent_agree",
min_value=10,
max_value=90,
)or
expect_column_values_to_be_between(
column="percent_agree",
min_value=0,
max_value=100,
mostly=.80,
)but not
expect_column_values_to_be_between(
column="percent_agreement",
min_value=0,
max_value=100,
mostly=.80,
)and not
expect_column_mean_to_be_between(
column="percent",
min_value=65,
max_value=75,
)