The replica.yml File

Everything needed to build a SnowShu replica is configured in a single file.

Note

The replica.yml file can be given any valid .yml file name; by default SnowShu looks for a file named replica.yml in the execution directory. To specify a different file name, use the --replica-file flag.

Sample replica.yml File

Your initial replica file will look something like this (hint: you can run snowshu init to get a generated sample file):

version: "1"
credpath: "/snowshu/credentials.yml"
name: development-replica
short_description: Used for end-to-end testing
long_description: This replica includes our full test suite with simple sampling.
threads: 15
target:
  adapter: postgres
  adapter_args:
    pg_extensions:
    - uuid-ossp
source:
  profile: default
  sampling: default
  include_outliers: True
  general_relations:
    databases:
    - pattern: SNOWSHU_DEVELOPMENT
      schemas:
      - pattern: '.*'
        relations:
        - '^(?!.+_VIEW).+$'
  specified_relations:
  - database: SNOWSHU_DEVELOPMENT
    schema: SOURCE_SYSTEM
    relation: ORDERS
    unsampled: True
  - database: SNOWSHU_DEVELOPMENT
    schema: SOURCE_SYSTEM
    relation: '^ORDER_ITEMS$'
    relationships:
      bidirectional:
      - local_attribute: PRODUCT_ID
        database: ''
        schema: ''
        relation: PRODUCTS
        remote_attribute: ID
      directional:
      - local_attribute: ORDER_ID
        database: SNOWSHU_DEVELOPMENT
        schema: SOURCE_SYSTEM
        relation: ORDERS
        remote_attribute: ID
  - database: SNOWSHU_DEVELOPMENT
    schema: EXTERNAL_DATA
    relation: SOCIAL_USERS_IMPORT
    sampling:
      default:
        margin_of_error: 0.05
        confidence: 0.95
        min_sample_size: 300

This file tells SnowShu all kinds of things, including:

  • which relations (tables, views etc) to sample

  • where relationships exist between relations

  • what type of target replica to use

  • how to go about sampling

Anatomy of replica.yml

The file consists of two primary sections: the header and the source.

Source

The source section of the replica.yml file is “where the magic happens”. This section is composed of three parts:

  • the overall source settings

  • the general sampling configuration

  • the specified sampling configurations

Overall Source Settings

In our example, this portion of the source directive would be the overall source settings:

...
source:
  profile: default
  sampling: default
  include_outliers: True

The components of the overall source settings, dissected:

  • profile (Required) is the name of the profile found in credentials.yml to execute with. In this example we are using a profile named “default”.

  • sampling (Required) is the name of the sampling method to be used. A sampling combines both the number of records sampled and the way in which they are selected. Current sampling options are default (Bernoulli sampling with Cochran’s sizing) or brute_force (a fixed percentage with Bernoulli sampling).

  • include_outliers (Optional) determines if SnowShu should look for records that do not respect specified relationships, and ensure they are included in the sample. Defaults to False.

  • max_number_of_outliers (Optional) specifies the maximum number of outliers to include when they are found. This helps keep a bad relationship (such as an incorrect assumption on a trillion row table) from exploding the replica. Default is 100.
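For instance, an overall source settings block that uses the optional outlier flags might look like this (the cap value shown here is arbitrary, not a recommendation):

...
source:
  profile: default
  sampling: default
  include_outliers: True        # keep records that violate declared relationships
  max_number_of_outliers: 500   # but cap how many of them end up in the replica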

General Sampling Configuration

With your overall source settings configured, you can set your general sampling configuration. The general sampling is the most broad (and least configurable) data sampling hierarchy. For example:

...
general_relations:
  databases:
  - pattern: SNOWSHU_DEVELOPMENT
    schemas:
    - pattern: '.*'
      relations:
      - '^(?!.+_VIEW).+$'

General relations accepts a nested structure of database->schema(s)->relation(s). The configuration accepts both plain text relation names and regex strings (python re syntax). For example, the pattern above matches all relations (tables and views) in the database SNOWSHU_DEVELOPMENT, in any schema, whose name does not contain “_VIEW”.

This nested pattern of relations follows all the specs outlined in the Overall Source Settings.
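As an illustration, a general sampling block can freely mix plain names and regex strings; in this sketch the STG_ pattern is hypothetical, while ORDERS and PRODUCTS are plain relation names:

...
general_relations:
  databases:
  - pattern: SNOWSHU_DEVELOPMENT
    schemas:
    - pattern: SOURCE_SYSTEM
      relations:
      - ORDERS           # plain relation name
      - PRODUCTS         # plain relation name
      - '^STG_.*$'       # regex: any relation whose name starts with STG_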

Specified Sampling Configuration

The specified sampling configurations are the most… specific. If a relation appears in both the general sampling configuration and a specified sampling configuration, the specified sampling will win out. They are also evaluated top-down, so a relation appearing in more than one specified configuration will have either the cumulative value (for relationships) or the last value (for flags).

Specified relations look like this:

...
specified_relations:
 - database: SNOWSHU_DEVELOPMENT
   schema: SOURCE_SYSTEM
   relation: ORDERS
   unsampled: True
 - database: SNOWSHU_DEVELOPMENT
   schema: SOURCE_SYSTEM
   relation: '^ORDER_ITEMS$'
   relationships:
     bidirectional:
     - local_attribute: PRODUCT_ID
       database: ''
       schema: ''
       relation: PRODUCTS
       remote_attribute: ID
     directional:
     - local_attribute: ORDER_ID
       database: SNOWSHU_DEVELOPMENT
       schema: SOURCE_SYSTEM
       relation: ORDERS
       remote_attribute: ID
 - database: SNOWSHU_DEVELOPMENT
   schema: EXTERNAL_DATA
   relation: SOCIAL_USERS_IMPORT
   sampling:
     default:
       margin_of_error: 0.05
       confidence: 0.95
       min_sample_size: 300

Each specified relation must have the following:

  • database (Required) is the name or valid regex for the specified relation database.

  • schema (Required) is the name or valid regex for the specified relation schema.

  • relation (Required) is the name or valid regex for the specified relation.

Note

Specified relations can represent one or many relations, depending on the pattern provided.

They can then contain one or more of these options:

  • unsampled (Optional) tells SnowShu to pull the entire relation. Good for tiny reference tables, very bad for big stores of data.

  • sampling (Optional) allows you to override the higher-level sampling configuration and set specifics for the matched relations (see the sketch below).
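As a sketch of how these options interact with the top-down evaluation described above (the overlapping patterns here are contrived for illustration):

...
specified_relations:
 - database: SNOWSHU_DEVELOPMENT
   schema: SOURCE_SYSTEM
   relation: '^ORDER.*$'         # one entry can match many relations (ORDERS, ORDER_ITEMS, ...)
   sampling:
     default:
       margin_of_error: 0.05
       confidence: 0.95
       min_sample_size: 300
 - database: SNOWSHU_DEVELOPMENT
   schema: SOURCE_SYSTEM
   relation: ORDERS              # ORDERS also matched the entry above
   unsampled: True               # flags take the last value, so ORDERS is pulled unsampled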

The primary use of specified relations is to create relationships. This is accomplished through the relationships directive of a specified relation.

A Relationships Primer

One of the more gnarly parts of generating sample data for testing is the issue of referential integrity. Say you have a table USERS, and another table ORDERS with a user_id column. In the full data set, every row of ORDERS has a valid user_id from the USERS table - so you can test your software by checking that your final output of ORDERS only contains user_id values that can be found in USERS. Once we sample, however, this is no longer guaranteed. Not every row sampled from one table will have a matching row in the sample of the other - and this breaks our tests.

SnowShu handles this complexity by enforcing relationships.

A directional relationship is one where the records of one table (ORDERS in the example above) must have referential integrity to another (USERS). A bidirectional relationship is one where both tables must have referential integrity to each other (ie USER_ADDRESSES and USERS must only contain references that exist in each other).

Specified relations can have more than one of each type of relationship. For each relationship the following must be defined:

  • database (Required) is the name or valid regex for the database that the specified relation will have a relationship with, or a blank string (more on that below).

  • schema (Required) is the name or valid regex for the schema that the specified relation will have a relationship with, or a blank string (more on that below).

  • relation (Required) is the name or valid regex for the relation that the specified relation will have a relationship with, or a blank string (more on that below).

  • local_attribute (Required) is the name of the column in the specified relation that has an fkey relationship. Cannot be regex, needs to be the actual column name.

  • remote_attribute (Required) is the name of the column in the relation that the specified relation has an fkey relationship with. Cannot be regex, needs to be the actual column name.

So in this example:

...
 - database: SNOWSHU_DEVELOPMENT
   schema: SOURCE_SYSTEM
   relation: ORDER_ITEMS
   relationships:
     bidirectional:
     - local_attribute: PRODUCT_ID
       database: ''
       schema: ''
       relation: PRODUCTS
       remote_attribute: ID
     directional:
     - local_attribute: ORDER_ID
       database: SNOWSHU_DEVELOPMENT
       schema: SOURCE_SYSTEM
       relation: ORDERS
       remote_attribute: ID

The specified relation is SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS. When SnowShu builds this replica:

  • All the records in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS will be records with a product_id found in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.PRODUCTS.

  • All the records in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.PRODUCTS will be records with an id found in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS.

  • All the records in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS will be records with an order_id found in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDERS.

  • The records in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDERS may be records with an id not found in SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS.

A note on empty strings in relationships: when specifying a relationship, SnowShu interprets an empty string in the database or schema as an instruction to inherit that value from the specified relation under test. For example:

...
 - database: '(hamburger|hotdog)'
   schema: '(socks|shoes)'
   relation: giraffes
   relationships:
     bidirectional:
     - local_attribute: id
       database: ''
       schema: ''
       relation: condiments
       remote_attribute: giraffe_id

This will evaluate to:

  • hamburger.socks.giraffes will be related to hamburger.socks.condiments

  • hotdog.socks.giraffes will be related to hotdog.socks.condiments

  • hamburger.shoes.giraffes will be related to hamburger.shoes.condiments

etc etc.

Case (In)Sensitivity In Relations

Important

TLDR; In SnowShu replica files, identifiers are case insensitive unless:

  • they are mixed case (ie CamelCase)

  • they contain a space (ie Space Case)

  • they are specified by a regex string

  • the global option preserve_case: True is set.

SQL casing is simple, until it is complex. A general interpretation of the spec is that identifiers (such as table names, schema names and column names) should behave in a case-insensitive way; that is to say that USER_TABLE and user_table should both query the same table when written in SQL.

Most databases accomplish this case insensitivity by “folding”, or selecting a case and casting all identifiers to that case. The challenge is that not all databases fold in the same direction. The debate of which way databases should fold is not one we will have here (the spec calls for uppercase, but that is not universally adopted).

SnowShu does the best it can to interpret your “intentions”. If you specify USERS or users (all one case), it will read that as case insensitive and grab either version in the source database. It will use the native default casing in the target database, so you can continue to write either form in your code without using double quotes.

In situations where you specify a mixed casing like Users, SnowShu interprets this as intentional and will preserve the case. This means you will need to wrap the identifier in double quotes when querying for it. The same is true when the identifier contains a space.

Regex strings are interpreted exactly as-is. So if you want a case-insensitive regex string, you need to set that in the regex (ie (?i)).
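Putting that together, a relations list might behave like this (the relation names here are purely illustrative):

...
general_relations:
  databases:
  - pattern: SNOWSHU_DEVELOPMENT
    schemas:
    - pattern: SOURCE_SYSTEM
      relations:
      - USERS             # single case: case insensitive, no quoting needed in the target
      - Users             # mixed case: case is preserved, query it as "Users"
      - User Addresses    # contains a space: case is preserved and quoting is required
      - '(?i)^users$'     # regex: interpreted as-is, so (?i) is needed for case insensitivity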

You can also force the native source case to persist all the way to the target. This is great if your entire source is full of mixed cases and spaces, but is otherwise a generally bad idea. Set this flag in the Overall Source Settings with preserve_case: True.
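If you do need it, the flag sits alongside the other overall source settings; a minimal sketch:

...
source:
  profile: default
  sampling: default
  preserve_case: True   # identifiers keep their source casing all the way into the target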

Warning

It is usually a very bad idea to preserve case. SQL architectures generally depend heavily on the case-insensitive nature of the language, and breaking this means every single identifier will likely need to be quoted in code and queries.