Version: 0.17.19

Connect to filesystem source data

Use the information provided here to connect to source data stored on Amazon S3, Google Cloud Storage (GCS), Microsoft Azure Blob Storage, or local filesystems. Great Expectations (GX) uses the term source data when referring to data in its original format, and the term source data system when referring to the storage location for source data.

Amazon S3 source data

Connect to source data on Amazon S3.

The following examples connect to .csv data. However, GX supports most of the Pandas read methods.

Prerequisites

  • A working installation of GX with the optional S3 dependencies (boto3) installed.
  • Access to data on an Amazon S3 bucket.

Import GX and instantiate a Data Context

Run the following Python code to import GX and instantiate a Data Context:

import great_expectations as gx

context = gx.get_context()

Create a Data Source

The following information is required when you create an Amazon S3 Data Source:

  • name: The Data Source name. In the following examples, this is "my_s3_datasource".

  • bucket_name: The Amazon S3 bucket name.

  • boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.

  1. Run the following Python code to define datasource_name, bucket_name, and boto3_options:

    datasource_name = "my_s3_datasource"
    bucket_name = "my_bucket"
    boto3_options = {}

    Additional options for boto3_options

    The parameter boto3_options allows you to pass the following information:

    • endpoint_url: Specifies an S3 endpoint. You can use an environment variable such as "${S3_ENDPOINT}" to include this value securely in your code; the string "${S3_ENDPOINT}" is replaced with the value of the corresponding environment variable at runtime.
    • region_name: Your AWS region name.
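
    For example, to point the Data Source at a custom endpoint in a specific region (a minimal sketch; the S3_ENDPOINT variable and the region value are illustrative):

    boto3_options = {
        "endpoint_url": "${S3_ENDPOINT}",  # replaced with the S3_ENDPOINT environment variable
        "region_name": "us-east-1",  # illustrative AWS region
    }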
  2. Run the following Python code to pass datasource_name, bucket_name, and boto3_options as parameters when you create your Data Source:

    datasource = context.sources.add_pandas_s3(
        name=datasource_name, bucket=bucket_name, boto3_options=boto3_options
    )
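
    Optionally, verify that the Data Source is registered with your Data Context. This sanity check uses get_datasource, a standard Data Context method:

    # Optional: retrieve the new Data Source from the context to confirm registration.
    print(context.get_datasource(datasource_name))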

Add data to the Data Source as a Data Asset

Run the following Python code:

asset_name = "my_taxi_data_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
data_asset = datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix
)

Your Data Asset will connect to all files that match the regex that you provide. Each matched file will become a Batch inside your Data Asset.

For example, say your S3 bucket contains the following files:

  • "yellow_tripdata_sample_2021-11.csv"
  • "yellow_tripdata_sample_2021-12.csv"
  • "yellow_tripdata_sample_2023-01.csv"

If you define a Data Asset using the full file name with no regex groups, such as "yellow_tripdata_sample_2023-01\.csv", your Data Asset will contain only one Batch, corresponding to that file.

However, if you define a partial file name with a regex group, such as "yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv", your Data Asset will contain three Batches, one for each matched file. You can then use the keys year and month to specify exactly which file you want to request from the available Batches.
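
For example, here is a minimal sketch of requesting a single Batch by those keys, using the data_asset defined above; build_batch_request and get_batch_list_from_batch_request are methods on the Data Asset:

# Request only the Batch for November 2021 by constraining the
# "year" and "month" regex group keys.
batch_request = data_asset.build_batch_request(options={"year": "2021", "month": "11"})
batches = data_asset.get_batch_list_from_batch_request(batch_request)
print(len(batches))  # 1: only yellow_tripdata_sample_2021-11.csv matches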

Next steps

For more information about storing credentials for use with GX, see How to configure credentials.