Data Import

The Data Import feature allows you to send event data stored in your cloud storage (AWS S3 or GCP GCS) to Hackle. Data Import runs on a daily basis.

Supported Cloud Storage

Cloud | Storage  | Supported?
AWS   | S3       | Yes
AWS   | Redshift | Not yet
GCP   | GCS      | Yes
GCP   | BigQuery | Not yet

Preparation

The following tasks must be completed before data can be imported.

  • Create a storage bucket to store the event data (AWS S3, GCP GCS, etc.).
  • Create a key with access to that storage and grant it the required permissions.
  • Process the event data into a standardized format and store it by day (e.g. 2023-01-01, 2023-01-02, etc.).

Generating and authorizing keys: GCP GCS

For GCP GCS, you can generate a key by referring to GCP IAM > Generating and Managing Service Account Keys.

The following permissions are required for the key that accesses GCS.

storage.buckets.get
storage.objects.get
storage.objects.create
storage.objects.delete
storage.objects.list
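
If you want to verify the key and its permissions before sharing them with Hackle, a minimal sketch along these lines, using the google-cloud-storage Python client, exercises each permission. The key file name, bucket, and prefix below are placeholders, not values defined in this guide.

# Minimal access check with the google-cloud-storage client.
# "hackle-import-key.json", "customer-data-hackle", and "test/prefix-custom"
# are placeholders; substitute your own key file, bucket, and prefix.
from google.cloud import storage

client = storage.Client.from_service_account_json("hackle-import-key.json")

# storage.buckets.get
bucket = client.get_bucket("customer-data-hackle")

# storage.objects.list / storage.objects.get
blobs = list(client.list_blobs(bucket, prefix="test/prefix-custom/"))
print(len(blobs), "objects visible under the prefix")

# storage.objects.create / storage.objects.delete
probe = bucket.blob("test/prefix-custom/_access_probe")
probe.upload_from_string(b"")
probe.delete()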

Generating and authorizing keys: AWS S3

For AWS S3, you can refer to the following documents to create a key and grant the necessary permissions.

  1. Create an AWS IAM user by following the documentation in AWS Docs: Create an IAM User.
  2. Follow AWS Docs: Creating an IAM Policy to create a policy containing the statements shown in the code below, then attach the policy to the IAM user created in the previous step.
  3. Follow AWS Docs: Creating an IAM Key to create an access key for the user.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
              "s3:GetObject",
              "s3:GetObjectVersion",
              "s3:DeleteObject",
              "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<bucket>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "<prefix>/*",
                        "<prefix>/",
                        "<prefix>"                      
                    ]
                }
            }
        }
    ]
}
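
Likewise, the access key can be sanity-checked against the bucket and prefix with a short boto3 sketch before you hand it over. The credentials, bucket, and prefix below are placeholders.

# Minimal access check with boto3. Credentials, bucket, and prefix are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",           # access key created in step 3
    aws_secret_access_key="<secret>",
)

# s3:ListBucket (the Prefix satisfies the s3:prefix condition in the policy)
resp = s3.list_objects_v2(Bucket="<bucket>", Prefix="<prefix>/")
print(resp.get("KeyCount", 0), "objects under the prefix")

# s3:PutObject / s3:DeleteObject
s3.put_object(Bucket="<bucket>", Key="<prefix>/_access_probe", Body=b"")
s3.delete_object(Bucket="<bucket>", Key="<prefix>/_access_probe")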

Data Import Format

Data Import currently supports the Apache Parquet format. The event data must be processed and stored with the schema described below.

Insert ID (insert_id)
  • Type: STRING
  • Example: 8fb8e088-9245-4fce-bb87-7e09d9917ed6
  • Description: UUID value used to check for duplicate events.

Event Key (event_key)
  • Type: STRING
  • Example: purchase
  • Description: Name of the event.

Client Timestamp (ts)
  • Type: TIMESTAMP
  • Example: 2023-01-01 00:01:02.333 (UTC)
  • Description: Timestamp based on UTC, truncated below milliseconds.

Metric Value (metric_value)
  • Type: DECIMAL(24, 6)
  • Example: 0.0
  • Description: Used for value computation in analysis and experiments. (Store '0.0' if not needed.)

Identifiers (identifiers)
  • Type: Map<String, String>
  • Example: { "id": "8fb8e088-9245-4fce-bb87-7e09d9917ed6", "device_id": "89ABCDEF-01234567-89ABCDEF", "user_id": "49591", "session_id": "1659710029.4.1.1659710504.0" }
  • Description: Map containing user identifiers.
    - (Optional) 'user_id': login user identifier (value corresponding to userId when sending via the Hackle SDK)
    - (Required) 'id': device identifier (value corresponding to id when sending via the Hackle SDK)
    - (Required) 'device_id': device identifier (value corresponding to deviceId when sending via the Hackle SDK)
    - (Optional, loaded when using GA) 'ga_session_id', 'ga_device_id'
    Identifier keys are stored in lowercase.

Event Properties (event_properties)
  • Type: Map<String, String>
  • Example: { "product_id": "33537", "product_category": "LEISURE", "order_id": "291994100" }
  • Description: Properties that contain event information. Property keys are stored in lowercase.

User Properties (user_properties)
  • Type: Map<String, String>
  • Example: { "grade": "GOLD", "date_signed": "2022-07-01", "date_recent": "2023-01-17" }
  • Description: Properties that contain user information. Property keys are stored in lowercase.

Platform Properties (platform_properties)
  • Type: Map<String, String>
  • Example (Android):
    {
      "osname": "Android",
      "appversion": "6.9.0",
      "language": "ko",
      "osversion": "12",
      "devicevendor": "samsung",
      "versionname": "6.77.0-DEBUG",
      "platform": "Mobile",
      "devicemodel": "SM-S908N"
    }
  • Example (iOS):
    {
      "osname": "iOS",
      "appversion": "6.9.3",
      "language": "ko-KR",
      "osversion": "16.0.2",
      "devicevendor": "Apple",
      "versionname": "6.77.0",
      "platform": "Mobile",
      "devicemodel": "iPhone14,2"
    }
  • Description: Properties that contain platform information. Property keys are stored in lowercase.
    - (Required) osname (Android, iOS)
    - (Required) version

Below is a summary of the schema described above.

root
 |-- ts: timestamp (nullable = false)
 |-- event_key: string (nullable = false)
 |-- identifiers: map<string, string> (nullable = false)
 |-- insert_id: string (nullable = false)
 |-- metric_value: decimal(24,6) (nullable = false)
 |-- user_properties: map<string, string> (nullable = false)
 |-- event_properties: map<string, string> (nullable = false)
 |-- platform_properties: map<string, string> (nullable = false)
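
As a rough illustration, and not part of any official pipeline, the following pyarrow sketch writes a single event row in this schema. The values reuse the examples shown above, and the output file name is arbitrary.

# Minimal sketch: write one event row with the schema above using pyarrow.
from datetime import datetime
from decimal import Decimal

import pyarrow as pa
import pyarrow.parquet as pq

fields = {
    "ts": pa.timestamp("ms"),
    "event_key": pa.string(),
    "identifiers": pa.map_(pa.string(), pa.string()),
    "insert_id": pa.string(),
    "metric_value": pa.decimal128(24, 6),
    "user_properties": pa.map_(pa.string(), pa.string()),
    "event_properties": pa.map_(pa.string(), pa.string()),
    "platform_properties": pa.map_(pa.string(), pa.string()),
}
schema = pa.schema([pa.field(name, typ, nullable=False) for name, typ in fields.items()])

row = {
    "ts": [datetime(2023, 1, 1, 0, 1, 2, 333000)],
    "event_key": ["purchase"],
    "identifiers": [[("id", "8fb8e088-9245-4fce-bb87-7e09d9917ed6"),
                     ("device_id", "89ABCDEF-01234567-89ABCDEF"),
                     ("user_id", "49591")]],
    "insert_id": ["8fb8e088-9245-4fce-bb87-7e09d9917ed6"],
    "metric_value": [Decimal("0.0")],
    "user_properties": [[("grade", "GOLD")]],
    "event_properties": [[("product_id", "33537")]],
    "platform_properties": [[("osname", "Android"), ("osversion", "12")]],
}

table = pa.Table.from_pydict(row, schema=schema)
pq.write_table(table, "000000000000.parquet")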

Processing for Data Import

Process the data according to the Parquet format described above and store it in the bucket by day.

  • When data processing for a day is complete, create a 0-byte '_SUCCESS' (signal) file in that day's partition.
  • Data processing covers D-1 data. For example, a data import that runs on January 2nd processes the data for January 1st.

The following is an example of the saved partitions and files.

# 2023-01-01 data
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-01/_SUCCESS
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-01/000000000000.parquet
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-01/000000000001.parquet

# 2023-01-02 data
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-02/_SUCCESS
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-02/000000000000.parquet
gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-02/000000000001.parquet
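
To tie the pieces together, here is a minimal sketch of the daily flow, assuming GCS and the google-cloud-storage client: upload the D-1 Parquet files to the day's partition, then write the 0-byte _SUCCESS file. The bucket and prefix reuse the example above; the key file and local file names are placeholders.

# Minimal sketch of the daily flow: upload D-1 data, then the 0-byte _SUCCESS file.
# Bucket/prefix reuse the example above; key file and local file names are placeholders.
from datetime import date, timedelta

from google.cloud import storage

BUCKET = "customer-data-hackle"
PREFIX = "test/prefix-custom"

target_dt = date.today() - timedelta(days=1)      # D-1: a run on Jan 2 processes Jan 1
partition = f"{PREFIX}/dt={target_dt:%Y-%m-%d}"

client = storage.Client.from_service_account_json("hackle-import-key.json")
bucket = client.bucket(BUCKET)

# Upload the processed Parquet file(s) for the day.
bucket.blob(f"{partition}/000000000000.parquet").upload_from_filename("000000000000.parquet")

# Write the 0-byte _SUCCESS signal file only after all Parquet files are uploaded.
bucket.blob(f"{partition}/_SUCCESS").upload_from_string(b"")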

How to Request Data Import

Please contact the Hackle team to request a data import. The following information is required to import the data.

  • The key authorized for access (created in the steps above)
  • The bucket name (AWS S3 or GCP GCS) and the partition path in the bucket where the data will be loaded (e.g. 'gcs://customer-data-hackle/test/prefix-custom/dt=2023-01-01')
  • Data loading time (e.g. loading completed before 13:00 KST)