Guide: Set up GCP
This guide helps you set up your Google Cloud Platform (GCP) project on DataStori.
DataStori integrates with your GCP project using IAM Service Accounts and uses Google Cloud Run Jobs to run pipeline code.
Information and Resources Checklist
Before you begin, please have the following information and resources from your GCP account at hand.
Networking
- VPC Network Name: The VPC where you want to run your code.
- Subnet Name: The subnet where you want to run your code.
- VPC Firewall Rules: Ensure that the VPC firewall rules allow the necessary egress traffic; access control is managed through network tags. You can look up the network and subnet names with the gcloud commands below.
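To find these values, you can list your networks and subnets from the gcloud CLI. A minimal sketch, with `my-project` and `my-vpc` as placeholders for your project ID and network name:

```
# List VPC networks in the project.
gcloud compute networks list --project=my-project

# List subnets (with their regions) belonging to the chosen VPC.
gcloud compute networks subnets list \
    --project=my-project \
    --filter="network:my-vpc"
```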
Services
- Cloud Run: DataStori spins up Cloud Run Jobs to run pipelines and write the output to Google Cloud Storage. Please ensure that the Cloud Run API is enabled in your project.
- Google Cloud Storage (GCS): Pipeline output is stored here. Please be ready with the name of the bucket where you want the data to be stored.
- RDBMS (optional): Connection details of any relational database where pipeline output will be written in addition to Google Cloud Storage.
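Both of the GCP prerequisites above can be checked from the gcloud CLI. A sketch, assuming placeholder project ID `my-project` and bucket `my-datastori-bucket`:

```
# Enable the Cloud Run API (a no-op if it is already enabled).
gcloud services enable run.googleapis.com --project=my-project

# Confirm the target bucket exists and note its region.
gcloud storage buckets describe gs://my-datastori-bucket \
    --format="value(location)"
```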
IAM / Service Accounts
We will create two service accounts: one for the Cloud Run Job to run as (workload identity), and one for DataStori to impersonate in order to manage the jobs.
Step 1: Create a Service Account for the Job
This service account is used by the Cloud Run Job to access GCS.
- Navigate to IAM & Admin > Service Accounts > Create Service Account.
- Name it `datastori-job-runner-sa`.
- Grant this service account the Storage Object Admin (`roles/storage.objectAdmin`) role so that it can write to your GCS bucket.
- Click on Done and write down the email ID of this service account.
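Equivalently, this step can be scripted with gcloud. A sketch assuming placeholder project ID `my-project` and bucket `my-datastori-bucket`; granting the role on the bucket itself (rather than project-wide) keeps the grant as narrow as possible:

```
# Create the service account the Cloud Run Job will run as.
gcloud iam service-accounts create datastori-job-runner-sa \
    --project=my-project \
    --display-name="DataStori Job Runner"

# Grant it Storage Object Admin on the target bucket so it can write output.
gcloud storage buckets add-iam-policy-binding gs://my-datastori-bucket \
    --member="serviceAccount:datastori-job-runner-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
```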
Step 2: Create a Service Account for DataStori
DataStori's infrastructure will authenticate as this service account.
- From IAM & Admin > Service Accounts > Create Service Account, create a service account named `datastori-integration-sa`.
- Do not grant this service account any roles directly in this step.
- Click on Done and write down the email ID of this service account.
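The same step as a gcloud sketch (`my-project` is a placeholder):

```
# Create the service account DataStori will impersonate. No roles are
# granted here; permissions are bound in Steps 3 and 4.
gcloud iam service-accounts create datastori-integration-sa \
    --project=my-project \
    --display-name="DataStori Integration"

# Note down the resulting email ID:
# datastori-integration-sa@my-project.iam.gserviceaccount.com
```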
Step 3: Create a Custom Role
This role contains the specific permissions DataStori needs to manage Cloud Run Jobs and use the job runner service account.
- Navigate to IAM & Admin > Roles > Create Role.
- Give it a title, for example "DataStori Job Manager".
- Add the following permissions:
* `run.jobs.run`
* `run.jobs.get`
* `run.jobs.list`
* `run.executions.get`
* `run.executions.list`
* `run.executions.delete`
* `iam.serviceAccounts.actAs` (allows passing the runner SA to the job)
- Click on Create.
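If you prefer the CLI, the custom role can be created in a single command. A sketch assuming a placeholder role ID `datastoriJobManager` and project ID `my-project`:

```
# Create a project-level custom role with the permissions listed above.
gcloud iam roles create datastoriJobManager \
    --project=my-project \
    --title="DataStori Job Manager" \
    --permissions="run.jobs.run,run.jobs.get,run.jobs.list,run.executions.get,run.executions.list,run.executions.delete,iam.serviceAccounts.actAs"
```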
Step 4: Bind Permissions
- At the Project Level: Go to IAM & Admin > IAM. Grant
datastori-integration-sathe custom "DataStori Job Manager" role you just created. - In the Job Runner Service Account: Go to the
datastori-job-runner-sayou created in Step 1. Go to the Permissions tab. Grant thedatastori-integration-sathe Service Account User (roles/iam.serviceAccountUser) role. This allows the integration account to "act as" the runner account. - Allow DataStori to Impersonate: Go to
datastori-integration-sa. In the Permissions tab, grant the DataStori main service account principal the Service Account Token Creator (roles/iam.serviceAccountTokenCreator) role. Write to ishan@datastori.io for DataStori's principal service account email ID.
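The three grants, expressed as a gcloud sketch. The project ID and email IDs below are placeholders, and `DATASTORI_PRINCIPAL` stands for the DataStori main service account email you receive from ishan@datastori.io:

```
# 1. Project level: grant the integration SA the custom role from Step 3.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:datastori-integration-sa@my-project.iam.gserviceaccount.com" \
    --role="projects/my-project/roles/datastoriJobManager"

# 2. On the runner SA: let the integration SA act as it.
gcloud iam service-accounts add-iam-policy-binding \
    datastori-job-runner-sa@my-project.iam.gserviceaccount.com \
    --member="serviceAccount:datastori-integration-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"

# 3. On the integration SA: let DataStori's principal mint tokens for it.
gcloud iam service-accounts add-iam-policy-binding \
    datastori-integration-sa@my-project.iam.gserviceaccount.com \
    --member="serviceAccount:DATASTORI_PRINCIPAL" \
    --role="roles/iam.serviceAccountTokenCreator"
```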
Logging (Optional)
By default, DataStori writes pipeline logs to Google Cloud Logging. If you want to customize the logging destination, please share the details of your log sink configuration with ishan@datastori.io.
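For reference, an existing sink's configuration can be retrieved with gcloud (`my-sink` and `my-project` are placeholders):

```
# Print the destination and filter of an existing log sink.
gcloud logging sinks describe my-sink --project=my-project
```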
Summary
Please be ready with the following information to complete the GCP infrastructure setup.
- GCP Project ID
- VPC Network Name
- Subnet Name
- GCS Bucket Name
- GCS Bucket Region
- Email ID of the `datastori-integration-sa` (service account for DataStori)
- Email ID of the `datastori-job-runner-sa` (service account for the job)
- Cloud Logging sink details (optional)