When creating a data feed, you can choose the destination where the data will be ingested.
The available destinations depend on the data warehouse configured for your Peak organization. For example, Redshift users can choose between S3 and Redshift, while Snowflake users can choose between S3 and Snowflake.
This article explains how to configure the feed destination for both Snowflake and Redshift data warehouses.
TABLE OF CONTENTS
- Checking your data warehouse type
- Snowflake data warehouses
- Redshift data warehouses
- Failed row threshold
- Schema Evolution
- Changing the data type of a schema
Checking your data warehouse type
The options available at the Destination stage depend on the type of data warehouse your organization is using.
For more information about the types of data warehouses Peak can use, see the Data Bridge articles.
To check your data warehouse type:
Go to Dock > Data Bridge.
The Data Bridge screen appears. Go to the Data Warehouse section and check which type of Storage is configured.
Snowflake data warehouses
If your organization uses a Snowflake data warehouse, the destination can be S3 (spark processing) or Snowflake, but not both.
S3 (spark processing)
This is Amazon S3 data storage.
Peak uses Apache Spark to process large, unstructured (CSV) datasets on Amazon S3.
This data is processed as an external table during the ingestion process, ready for you to query in SQL Explorer.
For more information on using SQL Explorer, see Querying a Snowflake data warehouse using SQL Explorer.
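As a brief, hypothetical example, once the external table has been created by a feed run you could preview it in SQL Explorer with a query like the one below. The table name is an assumption and depends on how your feed is configured.

```sql
-- Hypothetical table name: replace with the external table created by your feed.
SELECT *
FROM my_feed_external_table
LIMIT 100;
```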
Snowflake
Choosing Snowflake as the destination allows you to bring data into the Snowflake data warehouse that is set up for your organization. Snowflake is a SQL-based database, so you can use SQL to query and analyze the ingested data.
After data ingestion, the following audit columns are added to the ingested table by default:
- PEAKAUDITCREATEDAT: The time when the record was created in the table.
- PEAKAUDITFILENAME: The path of the raw file in the data lake that contains this record.
- PEAKAUDITREQUESTID: The feed run ID, identifying the specific feed run in which the record was ingested.
- PEAKAUDITUPDATECOUNTER: For the upsert load type, this counter is incremented each time the record is updated.
- PEAKAUDITUPDATEDAT: For the upsert load type, the date when the record was last updated.
For more information on how Snowflake data warehouses work with Peak, see Using your Snowflake data warehouse with Peak.
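As a sketch of how these audit columns can be used in SQL Explorer, the query below (with a hypothetical table name and a placeholder feed run ID) lists the records ingested by a specific feed run:

```sql
-- Hypothetical table name; the PEAKAUDIT* columns are added during ingestion.
SELECT PEAKAUDITREQUESTID,
       PEAKAUDITCREATEDAT,
       PEAKAUDITFILENAME
FROM my_ingested_table
WHERE PEAKAUDITREQUESTID = '<feed-run-id>'  -- replace with an actual feed run ID
ORDER BY PEAKAUDITCREATEDAT DESC;
```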
Redshift data warehouses
For Redshift data warehouses, the destination can be S3 (spark processing), Redshift, or both.
S3 (spark processing)
This is Amazon S3 data storage.
Peak uses Apache Spark to process large, unstructured (CSV) datasets on Amazon S3.
This data is made available through Redshift Spectrum during the ingestion process, ready for you to query in SQL Explorer.
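As a hedged sketch, a Spectrum-backed table can then be queried in SQL Explorer like any other table; the schema and table names below are hypothetical and depend on your feed configuration.

```sql
-- Hypothetical external schema and table created for the feed data.
SELECT COUNT(*) AS row_count
FROM spectrum_schema.my_feed_table;
```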
Redshift
Data stored in Redshift can be queried using SQL, which makes it possible to run frequent aggregations on very large datasets.
Redshift is a relational database, so any data fed into it must map exactly, column by column.
Any failed rows are flagged and written to a separate table.
After data ingestion, the same audit columns described in the Snowflake section above are added to the ingested table by default.
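Because the ingested table supports standard SQL, aggregations are straightforward; the sketch below uses hypothetical table and column names.

```sql
-- Hypothetical table and columns, shown only to illustrate a typical aggregation.
SELECT order_date,
       SUM(order_value) AS total_value,
       COUNT(*)         AS order_count
FROM my_ingested_orders
GROUP BY order_date
ORDER BY order_date;
```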
Failed row threshold
- The failed row threshold is the number of failed rows that can be tolerated before the feed is marked as failed.
- The threshold should reflect the total number of rows being written to the table and the proportion of failures you can accept before the quality of the data is considered compromised.
- For the upsert load type, the acceptable error count is applied per file.
- For the incremental, truncate, and insert load types, the acceptable error count is the aggregate count of errors across all files.
- For the upsert load type, if some files fail (because the number of failed rows meets or exceeds the threshold) while other files are ingested successfully, the feed is marked as Partially ingested.
- Error messages for failed rows can be downloaded from the feed logs.
- The feed is marked as failed if all rows in a file contain corrupted data.
- Only positive integer values are allowed for the threshold.
| Property | Snowflake Destination | Redshift Destination |
|---|---|---|
| Default threshold | 0 | 1000 |
| Maximum allowed threshold | 100000 | 100000 |
| Threshold in Database Connector | Not allowed to edit (default is 0) | Allowed to edit |
| Threshold in REST API, Webhook, and Braze Currents connectors | Not available | Not available |
Schema Evolution
Schema evolution is handled when a column is added to or removed from the data during a feed run.
New column added
When a new column is identified during a feed run, it is added to the destination table with the data type 'string', and its value in all previously ingested records is set to NULL.
Existing column deleted
When a column is found to have been removed during a feed run, it is not deleted from the destination table; instead, its value is set to NULL in all newly ingested records.
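Conceptually, the effect of a new column appearing is similar to the statement below. This is only an illustrative sketch with hypothetical names; the platform applies the change automatically during ingestion.

```sql
-- Illustrative sketch only: Peak handles schema evolution automatically.
-- A newly detected column is added with a string data type, and existing
-- records simply contain NULL in that column.
ALTER TABLE my_ingested_table
    ADD COLUMN new_source_column VARCHAR;
```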
Changing the data type of a schema
When specifying the destination (for both Snowflake and Redshift), you can change the data types in your schema.
This function is available for all connectors apart from the Webhook and Braze Currents connectors.
Choose the required column name or field name and click the dropdown icon next to the Suggested Data Type.
The following data types are available:
JSON
TIMESTAMP
INTEGER
DATE
NUMERIC
STRING
BOOLEAN
In the current release, TIMESTAMPTZ is not supported. Any data in this format will be ingested as a string by default.
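If you need timezone-aware timestamps downstream, one option is to convert the string column yourself at query time. The sketch below assumes a Snowflake warehouse and uses hypothetical table and column names.

```sql
-- Hypothetical example: converting a string column that holds timezone-aware
-- timestamps into a TIMESTAMP_TZ value at query time (Snowflake syntax).
SELECT TRY_TO_TIMESTAMP_TZ(event_time_raw) AS event_time
FROM my_ingested_events;
```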