Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Amazon S3 connector, available on the Peak platform, is built to sync data from an Amazon S3 bucket to the Peak platform environment. 

This guide describes why and how to use this Amazon S3 connector.


Why use the Amazon S3 Connector?

  • Data Accessibility
    S3 is a widely used and scalable cloud storage service. Many organizations store their data in S3 buckets. The S3 connector allows the Peak platform to access and utilize this data directly, enhancing data accessibility. The connector offers a consistent, maintenance-free interface to access Amazon S3 data. All the complexities of the Amazon S3 integration are managed within the connector.
  • Codeless and Seamless Integration
    The S3 connector streamlines the integration process, making it easier for users of the Peak platform to incorporate their S3-stored data without complex manual steps. It offers codeless access to live Amazon S3 data to technical and non-technical users alike. Any user can easily configure an Amazon S3 connector feed to access their S3-stored data.
  • Automation
    This connector enables the automation of data retrieval processes. This is particularly valuable for users who want to eliminate manual data transfer tasks. Automation enhances efficiency and reduces the risk of errors associated with manual processes.
  • Data Security
    The S3 connector utilises interface endpoints for transferring the data. The use of S3 interface endpoints enhances security by ensuring that traffic between your VPC and S3 remains within the AWS network, avoiding exposure to the public internet. This helps protect sensitive data and reduces the attack surface.
  • Flexibility to pull data at your own convenience
    The S3 connector gives users the flexibility to pull data at their own convenience. Data can be pulled in at scheduled intervals or triggered via webhooks. This ensures that the Peak platform has the most up-to-date information, providing users with accurate and timely insights.


How to use the Amazon S3 Connector?

Users can create data feeds that use the Amazon S3 Connector to sync their S3-stored data with the Peak platform.

For more information on data feeds, see the Data Sources overview.


Prerequisites

To connect your Amazon S3 bucket to the Peak platform, you need:

  • An S3 bucket containing files with supported file types and encodings.
  • An IAM role to provide the Peak AWS account access to this S3 bucket. More details on this in the Connection section.


Process Overview

To create an Amazon S3 feed, go to Dock > Data Sources and click the ADD FEED button.

Four stages need to be completed when adding an Amazon S3 feed.

  • Connection
    This step lets you configure the Amazon S3 bucket that stores your data.
  • Import Configuration
    This step lets you specify the details of the files that have to be fetched. It also provides options to select the load type and feed name.
  • Destination
    You can choose the destination for the data in this step.
  • Trigger
    This final step lets you specify the trigger for your feed and also add watchers if required.


Connection

You configure your Amazon S3 data lake to work with the Peak platform during this process.

You will need to create an IAM role in your AWS account so that Peak can connect to your S3 bucket. The Peak platform generates the IAM policy that you will need to use when creating the IAM role.

IAM (Identity and Access Management) roles are used to control access to AWS services, including Amazon S3. IAM roles can provide temporary security credentials that applications assume when they need to access S3. These temporary credentials are more secure than long-term access keys, as they have a limited lifespan and are automatically rotated.

IAM roles enable cross-account access, allowing entities from one AWS account (in this case, Peak's AWS account) to assume a role in another account (in this case, your AWS account where the S3 bucket is present). This is useful in scenarios where resources in one AWS account need to access S3 buckets in a different account.
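
For illustration, here is a minimal sketch, assuming boto3 and hypothetical names, of how such a cross-account role could be created. The account ID shown is only a placeholder and the role name is made up; always use the trust relationship and values generated by the Peak platform.

import json
import boto3

PEAK_ACCOUNT_ID = "111111111111"  # placeholder only; use the account ID from the generated policy

# Trust policy letting the (placeholder) Peak account assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{PEAK_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="peak-s3-connector-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Allows the Peak platform to read from the S3 bucket",
)
print(role["Role"]["Arn"])  # this ARN is what gets pasted into the IAM Role ARN field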

To use a preconfigured connection:

  1. At the Connection stage, from the Select Connection dropdown, choose the required connection.
    The dropdown will be empty if no connections have been configured previously.
  2. Click NEXT to move to the next step.



To create a new connection:

  1. At the Connection stage, click NEW CONNECTION.



  2. Enter the required connection parameters.
    1. Connection Name
      Enter a connection name.
    2. Datalake Region
      Choose the region where the Amazon S3 bucket is physically storing the data.

      You can check the region of your bucket in the properties of the S3 bucket in the AWS console.
    3. Bucket Name
      Enter your Amazon S3 bucket name.
    4. Root Path
      Enter the directory path for which you want to provide read access. 

      If you want to provide access to the entire bucket, please enter /. If you want to provide access to a particular folder, please enter the folder path followed by a slash.




    5. After entering the above-mentioned details, click GENERATE POLICY.

      This generates an AWS Identity and Access Management (IAM) policy so that Peak can access your S3 bucket. The generated policy asks for List and Get permissions for the objects present in the provided path. We follow the principle of least privilege, ensuring that we ask for only the required permissions. A hedged sketch of what such a policy can look like is shown after this procedure.



    6. After the policy has been generated, go to the AWS IAM console.

    7. Create an IAM role in your AWS account and add Peak as a trusted entity.
      For more information, see AWS IAM Configuration.
      If you want to re-use an existing IAM role, you need to edit that role's trust policy to trust the Peak AWS account.
    8. Once the IAM role is created in your account, copy the IAM role ARN and paste it into the IAM Role ARN field.
      Peak will use this IAM role to connect to your Amazon S3 bucket.
  3. Click TEST to test the connection.
    If successful, click SAVE to save the connection.
    If it is unsuccessful, check your connection details and try again.

  4. Select the newly created connection and click NEXT to move to the next step.
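
For reference, below is a hedged sketch of the kind of least-privilege read policy described in the GENERATE POLICY step, attached to the role as an inline policy using boto3. The bucket name, root path, role name, and policy name are placeholders; always use the policy generated by the Peak platform rather than this example.

import json
import boto3

BUCKET = "my-example-bucket"  # placeholder bucket name
ROOT_PATH = "exports/"        # placeholder root path

# List and Get permissions scoped to the provided path, per the principle of least privilege
read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": [f"{ROOT_PATH}*"]}},
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{ROOT_PATH}*",
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="peak-s3-connector-role",  # the cross-account role from the earlier sketch
    PolicyName="peak-s3-read-access",   # hypothetical policy name
    PolicyDocument=json.dumps(read_policy),
)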

Users have the option to change any of the details, such as the bucket name, region, or root path, by clicking UPDATE. The policy will need to be generated again after updating these details.


Connections, once created, can be edited later. 

All the connection details except the connection name are editable.


To edit an existing connection:


1. Select the connection that you want to edit.
2. Click the edit icon next to the connections drop-down.



3. Update the details that need to be edited. Click UPDATE if you want to edit the bucket name, root path, or bucket region.



4. Test the connection once you have updated all the details.

5. If the test is successful, you can save the connection.

Import Configuration

Once a connection has been established and the path for the files has been entered, the feed has to be configured so that data is updated and filtered most suitably.

Amazon S3 applies server-side encryption with Amazon S3 managed keys (SSE-S3) as the base level of encryption for every bucket in Amazon S3. However, you can choose to configure buckets to use server-side encryption with AWS KMS (Key Management Service) keys instead. Please note that the S3 connector does not support customer-managed KMS keys.
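
If you want to confirm how your bucket is encrypted before configuring the feed, a small sketch like the following (assuming boto3 and a placeholder bucket name) can be used to check the bucket's default encryption settings:

import boto3

s3 = boto3.client("s3")
config = s3.get_bucket_encryption(Bucket="my-example-bucket")  # placeholder bucket name
rule = config["ServerSideEncryptionConfiguration"]["Rules"][0]
default = rule["ApplyServerSideEncryptionByDefault"]
print(default["SSEAlgorithm"])        # "AES256" means SSE-S3, "aws:kms" means KMS keys
print(default.get("KMSMasterKeyID"))  # present when a specific KMS key is configured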


1. Enter the folder path.
This field is pre-populated with the root path configured for the connection selected in the previous step. You can add the complete path of the folder where your source files reside. We examine files under the specified folder and all of its nested subfolders for files we can sync (a sketch of this discovery logic is shown after this procedure).


It is recommended that the folder contains only the files that are to be ingested.




Multiple feeds can be created using the same folder path. This allows you to slice and dice the same data in any way you'd like.


2. Select the file type.

You can choose the type for the files that are present in the source folder. More details on supported file types are available in the Supported File Types section below.

We interpret every file we examine as the file type you select for the entire feed process. Please make sure everything we sync as part of a feed has the same file type.


3. After specifying the folder path and selecting the file type, click LOAD PREVIEW to load a preview of any one file present in the specified folder of the chosen file type.

Loading the preview will fail if the provided folder doesn't exist or doesn't contain any files of the chosen file type. It can also fail if the connector is not able to access the provided folder. Please verify that the ACLs of the objects in the provided folder adhere to the requirements.

We recommend disabling Access Control Lists (ACLs) on each S3 bucket so that the bucket contents are controlled by the bucket's access control settings and not the original file owner's settings. If ACLs are not disabled for your bucket, please make sure that the object ACLs are configured to provide read access to Peak. For more information, see Access control list (ACL) overview.









4. If your chosen file type is CSV, you have to choose the separator from the list or provide a custom separator.



5. You can choose to provide a historical date that will be used to decide the initial discovery timestamp for the files. 

The historical date should be provided in the YYYY-MM-DD format. The provided date will be considered in UTC. Any files that have been uploaded to the provided folder path before this date will not be part of the ingestion process.

6. Select the load type for the feed. For more information on load types, see Data feed load types.
7. Provide a name for the feed.
The name should be meaningful.
Only use alphanumeric characters and underscores.
It must start with a letter.
It must not end with an underscore.
Use up to 50 characters.
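
To make the import behaviour more concrete, the following is a rough sketch, assuming boto3 and placeholder values, of the kind of discovery the feed performs: listing objects under the folder path (including nested subfolders), keeping only the chosen file type, and ignoring files uploaded before the historical date. It is an illustration only, not the connector's actual implementation.

from datetime import datetime, timezone
import boto3

BUCKET = "my-example-bucket"     # placeholder bucket name
FOLDER_PATH = "exports/orders/"  # placeholder folder path
HISTORICAL_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)  # YYYY-MM-DD, interpreted in UTC

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

matching = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=FOLDER_PATH):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith((".csv", ".csv.gz")):  # the file type chosen for this feed
            continue
        if obj["LastModified"] < HISTORICAL_DATE:          # uploaded before the historical date
            continue
        matching.append(obj["Key"])

print(f"{len(matching)} files would be considered for ingestion")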


Destination

  • The Destination stage enables you to choose where your data will be ingested.
  • For Redshift tenants, you can choose multiple destinations, including Redshift and S3 (external table). For Snowflake tenants, you can select either Snowflake or S3 (external table).
    For more information, see Choosing a Destination for a data connector.
  • Data types in the generated schema can be modified if needed.


Triggers

For a guide to setting triggers for your data feeds, see How to create a trigger for a data feed.


Supported File Types

All files should use UTF-8 encoding. The S3 connector supports the following three file types:

  • CSV
    Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.
    Users can choose the separator used in the files from the list (comma, tab, or pipe) or provide a custom separator. TSV and PSV files are considered part of the CSV file type.
  • JSON
    JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).
    The top-level element is expected to be an array of JSON objects (illustrated in the sketch at the end of this section).
  • NDJSON
    NDJSON (newline-delimited JSON) is a collection of JSON objects separated by newline characters. The .json extension is expected for NDJSON files.


    Compressed versions of these file types (csv.gz, json.gz) are also considered for ingestion purposes.
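
To illustrate the difference between the JSON and NDJSON structures described above, here is a small self-contained Python sketch; the sample records are made up:

import json

# JSON: the top-level element is an array of objects
json_content = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
json_records = json.loads(json_content)

# NDJSON: one JSON object per line, separated by newline characters
ndjson_content = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
ndjson_records = [json.loads(line) for line in ndjson_content.splitlines()]

assert json_records == ndjson_records  # same data, different layout on disk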