The replica.yml File¶
Building a SnowShu replica is essentially configured in a single file.
Note
The replica.yml file can be named any valid .yml
file, by default SnowShu will look for a replica.yml
file in the execution directory. To specify the file name use the --replica-file
flag.
Sample replica.yml File¶
Your initial replica file will look something like this (hint: you can run snowshu init
to get a generated sample file):
version: "1"
credpath: "/snowshu/credentials.yml"
name: development-replica
short_description: Used for end-to-end testing
long_description: This replica includes our full test suite with simple sampling.
threads: 15
target:
adapter: postgres
adapter_args:
pg_extensions:
- uuid-ossp
source:
profile: default
sampling: default
include_outliers: True
general_relations:
databases:
- pattern: SNOWSHU_DEVELOPMENT
schemas:
- pattern: '.*'
relations:
- '^(?!.+_VIEW).+$'
specified_relations:
- database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDERS
unsampled: True
- database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: '^ORDER_ITEMS$'
relationships:
bidirectional:
- local_attribute: PRODUCT_ID
database: ''
schema: ''
relation: PRODUCTS
remote_attribute: ID
directional:
- local_attribute: ORDER_ID
database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDERS
remote_attribute: ID
- database: SNOWSHU_DEVELOPMENT
schema: EXTERNAL_DATA
relation: SOCIAL_USERS_IMPORT
sampling:
default:
margin_of_error: 0.05
confidence: 0.95
min_sample_size: 300
This file tells SnowShu all kinds of things, including: - which relations (tables, views etc) to sample - where relationships exist between relations - what type of target replica to use - how to go about sampling
Anatomy of replica.yml¶
The file consists of 2 primary sections, the header and the source.
Header¶
The header section of the replica.yml
file is basically everything that is not part of the source
directive.
In our example, this would be the header:
version: "1"
credpath: "/snowshu/credentials.yml"
name: development-replica
short_description: Used for end-to-end testing
long_description: This replica includes our full test suite with simple sampling.
threads: 15
target:
adapter: postgres
adapter_args:
pg_extensions:
- uuid-ossp
Let’s disect each of the components:
version (Required) is the replica file version, and tells SnowShu how to consume this file. Currently it should always be set to
1
.credpath (Required) is the path to a valid
credentials.yml
file (where the source database secrets are kept). Can be relative or absolute.name (Required) will translate to the final name of the replica to be generated. The name should be short and distinctive.
short_description (Optional) tells users a little bit about the replica you are creating.
long_description (Optional) provides users with a detailed explanation of the replica you are creating.
threads (Optional) tells SnowShu the max number of threads that can be used when multiprocessing. When not set SnowShu may run much slower :(.
target (Required) Specifies the adapter to use when creating a replica.
adapter (Required) For Snowflake, BigQuery and Redshift this should be
postgres
.adapter_args (Optional) Some targets may require additional configuration, especially when emulating a different source type. These keys and values are specific to the target type. Currently, only pg_extensions is supported.
Source¶
The source section of the replica.yml
file is “where the magic happens”. This section is comprised of 3 parts:
- the overall source settings
- the general sampling configuration
- the specified sampling configurations
Overall Source Settings¶
In our example, this portion of the source directive would be the overall source settings:
...
source:
profile: default
sampling: default
include_outliers: True
The components of the overall source settings, disected:
profile (Required) is the name of the profile found in
credentials.yml
to execute with. In this example we are using a profile named “default”.sampling (Required) is the name of the sampling method to be used. Samplings combine both the number of records sampled and the way in which they are selected. Current sampling options are
default
(uses Bernoulli sampling and Cochran’s sizing), orbrute_force
(Uses a fixed % and Bernoulli).include_outliers (Optional) determines if SnowShu should look for records that do not respect specified relationships, and ensure they are included in the sample. Defaults to False.
max_number_of_outliers (Optional) specifies the maximum number of outliers to include when they are found. This helps keep a bad relationship (such as an incorrect assumption on a trillion row table) from exploding the replica. Default is 100.
General Sampling Configuration¶
With your overall source settings configured, you can set your general sampling configuration. The general sampling is the most broad (and least configurable) data sampling hierarchy. For example:
...
general_relations:
databases:
- pattern: SNOWSHU_DEVELOPMENT
schemas:
- pattern: '.*'
relations:
- '^(?!.+_VIEW).+$'
General relations accepts a nested structure of database->schema(s)->relation(s). The configuration accepts both plain text relation names and regex strings (python re syntax).
For example, the pattern above matches all relations (tables and views) in the database SNOWSHU_DEVELOPMENT
in any schema, where the name does not end in “VIEW” (or “view”, “vIew” etc).
This nested pattern of relations follows all the specs outlined in the Overall Source Settings.
Specified Sampling Configuration¶
The specified sampling configurations are the most… specific. If a relation appears in both the general sampling configuation and a specified sampling configuration, the specified sampling will win out. They are also evaluated top-down, so a relation appearing in more than one specified configuration will have either the cumulative value (for relationships) or the last value (for flags).
Specified relations look like this:
...
specified_relations:
- database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDERS
unsampled: True
- database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: '^ORDER_ITEMS$'
relationships:
bidirectional:
- local_attribute: PRODUCT_ID
database: ''
schema: ''
relation: PRODUCTS
remote_attribute: ID
directional:
- local_attribute: ORDER_ID
database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDERS
remote_attribute: ID
- database: SNOWSHU_DEVELOPMENT
schema: EXTERNAL_DATA
relation: SOCIAL_USERS_IMPORT
sampling:
default:
margin_of_error: 0.05
confidence: 0.95
min_sample_size: 300
Each specified relation must have the following:
database (Required) is the name or valid regex for the specified relation database.
schema (Required) is the name or valid regex for the specified relation schema.
relation (Required) is the name or valid regex for the specified relation.
Note
specified relations can represent one or many many relations, based on the pattern provided.
They can then contain one or more of these options: - unsampled (Optional) tells SnowShu to pull the entire relation. Good for tiny reference tables, very bad for big stores of data. - sampling (Optional) allows you to override the higher-level configuration and set specifics for that sampling.
The primary use of specified relations is to create relationships. This is accomplished through the relationships
directive of a specified relation.
A Relationships Primer¶
One of the more gnarly parts of generating sample data for testing is the issue of referential integrity.. Say you have a table, say USERS
, and another table ORDERS
with a column user_id
in it. In the full data set, every row of ORDERS
will have a valid user_id
from the USERS
table - and you can test your software by checking to make sure your final output of ORDERS
has a valid user_id
that can be found in USERS
. However, when we sample this is no longer the case. Not all the rows selected by the sample from one table can be referenced by the other - and this breaks our tests.
SnowShu handles this complexity by enforcing relationships.
a directional relationship is where the records for one table (ORDERS
in the example above) must have referential integrity to another (USERS
).
a bidirectional relationship is where both tables must have referential integrity to each other (ie USER_ADDRESSES
and USERS
must only have references that exist in each other).
Specified relations can have more than one of each type of relationship. For each relationship the following must be defined:
database (Required) is the name or valid regex for the database that the specified relation will have a relationship with, or a blank string (more on that below).
schema (Required) is the name or valid regex for the schema that the specified relation will have a relationship with, or a blank string (more on that below).
relation (Required) is the name or valid regex for the relation that the specified relation will have a relationship with, or a blank string (more on that below).
local_attribute (Required) is the name of the column in the specified relation that has an fkey relationship. Cannot be regex, needs to be the actual column name.
remote_attribute (Required) is the name of the column in the relation that the specified relation has an fkey relationship with. Cannot be regex, needs to be the actual column name.
So in this example:
...
- database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDER_ITEMS
relationships:
bidirectional:
- local_attribute: PRODUCT_ID
database: ''
schema: ''
relation: PRODUCTS
remote_attribute: ID
directional:
- local_attribute: ORDER_ID
database: SNOWSHU_DEVELOPMENT
schema: SOURCE_SYSTEM
relation: ORDERS
remote_attribute: ID
The specified relation is SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS
. When SnowShu builds this replica:
All the records in
SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS
will be records with aproduct_id
found inSNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.PRODUCTS
.All the records in
SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.PRODUCTS
will be records with anid
found inSNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS
.All the records in
SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS
will be records with anorder_id
found inSNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDERS
.The records in
SNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDERS
may be records with anid
not found inSNOWSHU_DEVELOPMENT.SOURCE_SYSTEM.ORDER_ITEMS
.
A note on empty strings in relationships: When specifying a relationship, SnowShu will interpret empty strings in the database or schema to inherit from the specified relation under test. For example:
...
- database: '[hamburger|hotdog]'
schema: '[socks|shoes]'
relation: giraffes
relationships:
bidirectional:
- local_attribute: id
database: ''
schema: ''
relation: condiments
remote_attribute: giraffe_id
This will evaluate to:
hamburger.socks.giraffes
will be related tohamburger.socks.condiments
hotdog.socks.giraffes
will be related tohotdog.socks.condiments
hamburger.shoes.giraffes
will be related tohamburger.shoes.condiments
etc etc.
Case (In)Sensitivity In Relations¶
Important
TLDR; In SnowShu replica files, identifiers are case insensitive unless:
they are mixed case (ie
CamelCase
)they contain a space (ie
Space Case
)they are specified by a regex string
the global option
preserve_case:True
is set.
SQL casing is simple, until it is complex. A general interpretation of the spec is that identifiers (such as table names, schema names and column names) should behave in a case-insensitive way; that is to say that USER_TABLE
and user_table
should both query the same table when written in SQL.
Most databases accomplish this case insensitivity by “folding”, or selecting a case and casting all identifiers to that case. The challenge is that not all databases fold in the same direction. The debate of which way databases should fold is not one we will have here (the spec calls for uppercase, but that is not universally adopted).
SnowShu does the best it can to interpret your “intentions”. If you specify USERS
or users
, (all one case) it will read that as case insensitive and grab either version in the source database. It will use the native default casing in the target database, so you can continue to write either form in your code without using double quotes.
In situtations where you specified a mixed casing like Users
, SnowShu interprets this as intentional and will preserve the case. This means you will need to wrap the identifier in double quotes when querying for it. This is also true for situations where a space is included in the identifier.
Regex strings are interpreted exactly as-is. So if you want a case-insensitive regex string, you need to set that in the regex (ie (?i)
).
You can also force the native source case to persist all the way to the target. This is great if your entire source is full of mixed cases and spaces, but is otherwise a generally bad idea.
Set this flag in the Overall Source Settings with preserve_case: True
.
Warning
It is usually a very bad idea to preserve case. SQL architectures generally depend heavily on the case-insensitive nature of the language, and breaking this means every single indentifier will likely need to be quoted in code and queries.