Initialize a Data Context
- You need a Python environment where you can install Great Expectations and other dependencies, e.g. a virtual environment.
Set up your machine for the tutorial
For this tutorial, we will use a simplified version of the NYC taxi ride data.
Clone the ge_tutorials repository to download the data and directories with the final versions of the tutorial, which you can use for reference:
git clone https://github.com/superconductive/ge_tutorials
cd ge_tutorials
The repository you cloned contains several directories with final versions for our tutorials. The final version for this tutorial is located in the getting_started_tutorial_final_v3_api/ folder. You can use the final version as a reference or to explore a complete deployment of Great Expectations, but you do not need it for this tutorial.
Install Great Expectations and dependencies
Great Expectations requires Python 3 and can be installed using pip. If you haven’t already, install Great Expectations by running:
pip install great_expectations
You can confirm that the installation worked by running:
great_expectations --version
This should return something like:
great_expectations, version 0.13.43
For detailed installation instructions, see How to install Great Expectations locally.
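The same check can be made from inside Python, which is handy when you are unsure whether the CLI on your PATH belongs to the active virtual environment. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical and not part of Great Expectations.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for this tutorial; it only reads package metadata
    from the current Python environment and does not import the package.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("great_expectations")
    if found is None:
        print("great_expectations is not installed in this environment")
    else:
        print(f"great_expectations, version {found}")
```

If this reports that the package is missing while the great_expectations command still works, the CLI is most likely coming from a different Python environment than the one you are running the script in.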
Other deployment patterns
This tutorial deploys Great Expectations locally. Note that other options (e.g. running Great Expectations on an EMR Cluster) are also available. You can find more information in the Reference Architectures section of the documentation.
Create a Data Context
In Great Expectations, your Data Context manages your project configuration, so let’s go and create a Data Context for our tutorial project!
When you installed Great Expectations, you also installed the Great Expectations command line interface (CLI). It provides helpful utilities for deploying and configuring Data Contexts, plus a few other convenience methods.
To initialize your Great Expectations deployment for the project, run this command in the terminal from the ge_tutorials/ directory:
great_expectations init
You should see this:
Using v3 (Batch Request) API
___ _ ___ _ _ _
/ __|_ _ ___ __ _| |_ | __|_ ___ __ ___ __| |_ __ _| |_(_)___ _ _ ___
| (_ | '_/ -_) _` | _| | _|\ \ / '_ \/ -_) _| _/ _` | _| / _ \ ' \(_-<
\___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
|_|
~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.
Great Expectations will create a new directory with the following structure:
great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- data_docs
        |-- validations
OK to proceed? [Y/n]: <press Enter>
About the great_expectations/ directory structure
After running the init command, your great_expectations/ directory will contain all of the important components of a local Great Expectations deployment. This is what the directory structure looks like:
- great_expectations.yml contains the main configuration of your deployment.
- The expectations/ directory stores all your Expectations as JSON files. If you want to store them somewhere else, you can change that later.
- The plugins/ directory holds code for any custom plugins you develop as part of your deployment.
- The uncommitted/ directory contains files that shouldn’t live in version control. It has a .gitignore configured to exclude all its contents from version control. The main contents of the directory are:
  - uncommitted/config_variables.yml, which holds sensitive information, such as database credentials and other secrets.
  - uncommitted/data_docs, which contains Data Docs generated from Expectations, Validation Results, and other metadata.
  - uncommitted/validations, which holds Validation Results generated by Great Expectations.
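To confirm that the init command produced the layout described above, you can compare the directory on disk against the tree printed by the CLI. The sketch below uses only the standard library; the names EXPECTED_ENTRIES and missing_entries are hypothetical, and the list simply mirrors the tree shown in the init output. It assumes every entry exists right after initialization, which may not hold for every version.

```python
from pathlib import Path
from typing import List

# Entries copied from the directory tree printed by `great_expectations init`.
# Assumption: all of them are present immediately after initialization.
EXPECTED_ENTRIES = [
    "great_expectations.yml",
    "expectations",
    "checkpoints",
    "plugins",
    ".gitignore",
    "uncommitted",
    "uncommitted/config_variables.yml",
    "uncommitted/data_docs",
    "uncommitted/validations",
]


def missing_entries(root: str = "great_expectations") -> List[str]:
    """Return the expected entries that do not exist under `root`, in list order."""
    base = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (base / entry).exists()]


if __name__ == "__main__":
    # Run from the ge_tutorials/ directory, where `great_expectations init` was executed.
    missing = missing_entries()
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```

An empty result means the scaffold matches the tree; any names that are printed point to the part of the deployment worth inspecting first.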