When creating a data feed, you can choose the destination where the data will be ingested.

The available destinations depend on the data warehouse configured for your Peak organization. For example, Redshift users can choose between S3 and Redshift, while Snowflake users can choose between S3 and Snowflake.


This article explains how to configure the feed destination for both Snowflake and Redshift data warehouses.



Checking your data warehouse type


The options available at the Destination stage differ depending on the type of data warehouse your organization uses.

For more information about the types of data warehouses Peak can use, see the Data Bridge articles.

To check your data warehouse type:

  1. Go to Dock > Data Bridge.
    The Data Bridge screen appears.

  2. Go to the Data Warehouse section and see which type of Storage is configured.




Snowflake data warehouses

If your organization uses a Snowflake data warehouse, the destination can be S3 (Spark processing) or Snowflake, but not both.



S3 (Spark processing)


This is Amazon S3 data storage.

Peak uses Apache Spark to process large, unstructured (CSV) datasets on Amazon S3.

This data is made available as an external table during ingestion, ready for you to query in SQL Explorer.
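
For example, the external table can be queried with any Snowflake client as well as SQL Explorer. The sketch below uses the Snowflake Python connector; the connection details and the table name are placeholders, not values supplied by Peak.

import snowflake.connector  # pip install snowflake-connector-python

# Minimal sketch: all connection details and the table name are placeholders.
conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="my_warehouse",  # placeholder
    database="my_database",    # placeholder
    schema="my_schema",        # placeholder
)
try:
    cur = conn.cursor()
    # Query the external table created during ingestion (hypothetical name).
    cur.execute("SELECT * FROM MY_FEED_EXTERNAL_TABLE LIMIT 10")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()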


For more information on using SQL Explorer, see Querying a Snowflake data warehouse using SQL Explorer.


Snowflake

Choosing Snowflake as the destination lets you bring data into the Snowflake data warehouse that is set up for your organization. Snowflake is a SQL database, so you can use SQL to query and analyze the data.


After data ingestion, the following audit columns are added to the ingested table by default:

  • PEAKAUDITCREATEDAT
    The time when the record was created in the table.

  • PEAKAUDITFILENAME
    The path of the raw file in the data lake that contains this record.

  • PEAKAUDITREQUESTID
    The feed run ID, which identifies the specific feed run in which the record was ingested.

  • PEAKAUDITUPDATECOUNTER
    For the upsert load type, this counter is incremented each time the record is updated.

  • PEAKAUDITUPDATEDAT
    For the upsert load type, the date and time when the record was last updated.
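
As an illustration, the query below (shown as a Python string, with a placeholder table name and feed run ID) uses these audit columns to summarise a single feed run. It can be run in SQL Explorer or through a connection such as the one sketched above.

# Hypothetical query using the audit columns; the table name and feed run ID
# are placeholders.
feed_run_summary_sql = """
SELECT PEAKAUDITREQUESTID,
       COUNT(*)                AS records_ingested,
       MIN(PEAKAUDITCREATEDAT) AS first_record_created_at,
       MAX(PEAKAUDITUPDATEDAT) AS last_record_updated_at
FROM   MY_INGESTED_TABLE
WHERE  PEAKAUDITREQUESTID = '<feed-run-id>'
GROUP  BY PEAKAUDITREQUESTID
"""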


For more information on how Snowflake data warehouses work with Peak, see Using your Snowflake data warehouse with Peak.



Redshift data warehouses

For Redshift data warehouses, the destination can be S3 (Spark processing), Redshift, or both.


S3 (Spark processing)

This is Amazon S3 data storage.

Peak uses Apache Spark to process large, unstructured (CSV) datasets on Amazon S3.

This data is made available through Redshift Spectrum during ingestion, ready for you to query in SQL Explorer.
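
As a sketch, the Spectrum table can also be queried from Python; the same SELECT can be run in SQL Explorer. The connection details and the external schema and table names below are placeholders, not values supplied by Peak.

import psycopg2  # pip install psycopg2-binary

# Minimal sketch: host, credentials and object names are placeholders.
conn = psycopg2.connect(
    host="my-cluster.example.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="my_database",    # placeholder
    user="my_user",          # placeholder
    password="my_password",  # placeholder
)
try:
    cur = conn.cursor()
    # Query the Spectrum (external) table created during ingestion.
    cur.execute("SELECT * FROM my_external_schema.my_feed_table LIMIT 10")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()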


The S3 (Spark processing) destination is only available if your tenant has an active Glue Catalog configured. If the destination is disabled, configure the Glue Catalog first.


Redshift

Data stored in Redshift can be queried using SQL. This makes it possible to run frequent aggregations on very large datasets.

Redshift is a relational database, so any data fed into it must map exactly, column by column.

Any failed rows are flagged and written to a separate table.


After data ingestion, the following audit columns are added to the ingested table by default:

  • peakauditcreatedat
    The time when the record was created in the table.
  • peakauditrequestid
    The feed run ID, which identifies the specific feed run in which the record was ingested.
  • peakauditupdatecounter
    For the upsert load type, this counter is incremented each time the record is updated.
  • peakauditupdatedat
    For the upsert load type, the date and time when the record was last updated.
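
For example, for an upsert feed these columns can be used to find records that were updated recently. The query below is a sketch (shown as a Python string) with a placeholder table name and an arbitrary 7-day window; it can be run in SQL Explorer or through a connection like the one sketched above.

# Hypothetical query using the Redshift audit columns; the table name is a
# placeholder and the time window is arbitrary.
recently_updated_sql = """
SELECT *
FROM   my_ingested_table
WHERE  peakauditupdatedat IS NOT NULL
AND    peakauditupdatedat >= DATEADD(day, -7, GETDATE())
ORDER  BY peakauditupdatedat DESC
"""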

Failed row threshold

  • This is the number of acceptable failed rows before the feed is marked as failed.
  • The threshold should reflect the total number of rows being written to the table and the proportion of failures you can accept before the quality of the data is considered compromised.
  • For the upsert load type, the acceptable error count is applied per file.
  • For the incremental, truncate and insert load types, the acceptable error count is the aggregate count of errors across all files.
  • For the upsert load type, if some files fail because their failed row count reaches the threshold while other files are ingested successfully, the feed is marked as Partially ingested (see the sketch after the table below).
  • Error messages for failed rows can be downloaded from the feed logs.
  • The feed is marked as failed if all rows in a file contain corrupted data.
  • Only positive integer values are allowed for the threshold.

Property | Snowflake Destination | Redshift Destination
Default Threshold | 0 | 1000
Maximum allowed threshold | 100000 | 100000
Threshold in Database Connector | Not allowed to edit (Default is 0) | Allowed to edit
Threshold in REST API, Webhook, Braze Currents connectors | Not Available | Not Available
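
The following is a conceptual sketch, in Python, of how the threshold is applied per load type. It is not Peak's implementation; the function name, inputs and status strings are illustrative, the exact boundary behaviour is simplified, and the corrupted-data rule is not modelled.

# Conceptual sketch of the failed row threshold rules described above.
# This is NOT Peak's implementation; names and statuses are illustrative.

def evaluate_feed_run(load_type, failed_rows_per_file, threshold):
    """Return an illustrative feed status.

    failed_rows_per_file: one failed-row count per ingested file.
    """
    if load_type == "upsert":
        # Upsert: the threshold is applied per file.
        failed_files = [n for n in failed_rows_per_file if n >= threshold]
        if not failed_files:
            return "Ingested"
        if len(failed_files) == len(failed_rows_per_file):
            return "Failed"
        return "Partially ingested"
    # Incremental, truncate and insert: the threshold is applied to the
    # aggregate failed-row count across all files.
    return "Failed" if sum(failed_rows_per_file) >= threshold else "Ingested"

# One file reaches the threshold, the others are fine -> Partially ingested.
print(evaluate_feed_run("upsert", [5, 1200, 0], threshold=1000))

# Aggregate failures (400 + 700) reach the threshold -> Failed.
print(evaluate_feed_run("incremental", [400, 700], threshold=1000))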



Schema Evolution

Schema evolution is handled when a column is added to, or removed from, the data in a feed run.

For CSV files, please ensure that all the files in a feed run maintain a consistent schema, i.e. the same columns are present in each file. An inconsistent schema across files in the same feed run will cause the ingestion to fail.

For NDJSON files, we support files with different schemas, as long as each individual file has a consistent, unified schema.


New column added

When a new column is identified during a feed run, it is added to the ingested table with the data type 'string', and its value in all previous records is set to NULL.


Existing column deleted

When a column is identified as deleted during a feed run, it is not removed from the destination table; instead, its value is set to NULL in the newly ingested records.
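
The following is a conceptual sketch, in Python with in-memory rows and hypothetical column names, of the behaviour described above. It is not Peak's implementation; in Peak these changes are applied to the destination table itself, with None standing in here for NULL.

# Conceptual sketch of the schema evolution behaviour described above.
# Column names and rows are hypothetical; None stands in for NULL.

def evolve_schema(table_columns, table_rows, incoming_columns, incoming_rows):
    # New columns are appended (ingested as strings) ...
    merged_columns = list(table_columns) + [
        c for c in incoming_columns if c not in table_columns
    ]
    # ... existing rows get NULL for any new column, and new rows get NULL
    # for any column that has disappeared from the incoming files.
    old = [{c: row.get(c) for c in merged_columns} for row in table_rows]
    new = [{c: row.get(c) for c in merged_columns} for row in incoming_rows]
    return merged_columns, old + new

# Feed run 2 adds a 'channel' column; the record from feed run 1 gets NULL for it.
columns, rows = evolve_schema(
    ["order_id", "amount"],
    [{"order_id": "1", "amount": "10.00"}],
    ["order_id", "amount", "channel"],
    [{"order_id": "2", "amount": "12.50", "channel": "web"}],
)
print(columns)  # ['order_id', 'amount', 'channel']
print(rows)     # the first row has 'channel': None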


Changing the data type of a schema

When specifying the destination, for both Snowflake and Redshift, you can change the data types in your schema.

This function is available for all connectors apart from the Webhook and Braze Currents connectors.

Choose the required column name or field name and click the dropdown icon next to the Suggested Data Type.
The following data types are available:

  • JSON

  • TIMESTAMP

  • INTEGER

  • DATE

  • NUMERIC

  • STRING

  • BOOLEAN

In the current release, TIMESTAMPTZ is not supported. 
Any data in this format will be ingested as a string by default.
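
If you need such a value back as a timestamp, you can cast the ingested string inside the warehouse. The statements below are sketches with placeholder table and column names; TO_TIMESTAMP_TZ is a Snowflake function and ::timestamptz a Redshift cast, and you should check that your data matches the format they expect.

# Hypothetical casts of a timestamp-with-timezone value that was ingested
# as a string; table and column names are placeholders.

snowflake_cast_sql = """
SELECT TO_TIMESTAMP_TZ(MY_TIMESTAMP_COLUMN) AS my_timestamp
FROM   MY_INGESTED_TABLE
"""

redshift_cast_sql = """
SELECT my_timestamp_column::timestamptz AS my_timestamp
FROM   my_ingested_table
"""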