<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Data – IMG.LY Blog</title><description>Posts tagged Data on the IMG.LY blog.</description><link>https://img.ly/blog/tag/data/</link><language>en-us</language><image><url>https://img.ly/apple-touch-icon.png</url><title>Data – IMG.LY Blog</title><link>https://img.ly/blog/tag/data/</link></image><atom:link href="https://img.ly/blog/tag/data/rss.xml" rel="self" type="application/rss+xml"/><generator>Astro</generator><lastBuildDate>Fri, 19 Jun 2026 11:26:07 GMT</lastBuildDate><ttl>60</ttl><item><title>How to Load Stripe Data into Google BigQuery</title><link>https://img.ly/blog/how-to-load-stripe-data-into-google-bigquery/</link><guid isPermaLink="true">https://img.ly/blog/how-to-load-stripe-data-into-google-bigquery/</guid><description>Discover how IMG.LY leverages Stripe&apos;s Data Pipeline to seamlessly transfer data into Google BigQuery using Google Cloud Functions.</description><pubDate>Thu, 18 Jul 2024 10:04:26 GMT</pubDate><content:encoded>&lt;p&gt;At IMG.LY, we recognize that leveraging data is essential for driving innovation and growth. To optimize our data for reporting, we consolidate multiple data sources, including Stripe billing and financial data, into Google BigQuery.&lt;/p&gt;
&lt;p&gt;IMG.LY is the leading provider of creative editing SDKs for &lt;a href=&quot;https://img.ly/products/video-sdk/?utm_source=imgly&amp;#x26;utm_medium=blog&amp;#x26;utm_campaign=stripebigquery&quot;&gt;video&lt;/a&gt;, &lt;a href=&quot;https://img.ly/products/photo-sdk/?utm_source=imgly&amp;#x26;utm_medium=blog&amp;#x26;utm_campaign=stripebigquery&quot;&gt;photo&lt;/a&gt;, and &lt;a href=&quot;https://img.ly/products/creative-sdk/?utm_source=imgly&amp;#x26;utm_medium=blog&amp;#x26;utm_campaign=stripebigquery&quot;&gt;design templates&lt;/a&gt;. While this article may not directly relate to media creation, we believe in empowering developers through knowledge sharing. Let’s dive in.&lt;/p&gt;
&lt;p&gt;Until now, we’ve relied on &lt;a href=&quot;https://www.fivetran.com&quot;&gt;Fivetran&lt;/a&gt; to fetch our data from Stripe and store it in Google BigQuery. Fivetran uses Stripe’s API, calling each endpoint, iterating over all resources, and storing the results in BigQuery (or any other supported data warehouse). While this generally works well, issues can arise. For instance, we sometimes create Stripe Subscriptions using inline pricing with &lt;a href=&quot;https://docs.stripe.com/api/subscription_items/create#create_subscription_item-price_data&quot;&gt;the &lt;code&gt;price_data&lt;/code&gt; parameter&lt;/a&gt;. This generates a new &lt;code&gt;Price&lt;/code&gt; object in Stripe on the fly and immediately sets it to &lt;code&gt;active: false&lt;/code&gt;. Consequently, the &lt;code&gt;Price&lt;/code&gt; object is not returned by the Stripe API’s price endpoint, leading to missing data in our warehouse. Fivetran’s support was exceptional and resolved the issue within a day, but the incident highlighted a potential flaw in relying solely on ETL services for data extraction.&lt;/p&gt;
&lt;p&gt;Recently, Stripe introduced &lt;a href=&quot;https://stripe.com/data-pipeline&quot;&gt;Data Pipeline&lt;/a&gt;, its own service for transferring Stripe data into a data warehouse. This ensures complete, reliable data without needing a third-party service to read Stripe’s API. Additionally, you can receive test environment data and access several tables not available via the API. For a comprehensive summary of the available data, &lt;a href=&quot;https://dashboard.stripe.com/stripe-schema&quot;&gt;refer to Stripe’s official data schema&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Currently, Stripe supports only Snowflake and Amazon Redshift as data warehouses. However, they’ve recently added the option to &lt;a href=&quot;https://docs.stripe.com/stripe-data/access-data-in-warehouse/cloud-storage/google-cloud-storage&quot;&gt;deliver data as Parquet files into Google Cloud Storage (GCS)&lt;/a&gt;. The next step for us was to import this data into Google BigQuery.&lt;/p&gt;
&lt;h2 id=&quot;setting-up-stripe-data-pipeline-with-google-cloud-storage&quot;&gt;Setting Up Stripe Data Pipeline with Google Cloud Storage&lt;/h2&gt;
&lt;p&gt;Stripe is renowned for its excellent developer experience, and this beta feature is no exception. Enabling it within the Stripe Dashboard is quick, and the &lt;a href=&quot;https://docs.stripe.com/stripe-data/access-data-in-warehouse/cloud-storage/google-cloud-storage&quot;&gt;documentation&lt;/a&gt; is straightforward. After following the instructions and enabling the feature, it takes a while for data to appear in GCS. Once available, a complete data dump is provided every 6 hours, structured as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At the root level, Stripe creates a folder representing the date and time of the latest transfer, e.g., &lt;code&gt;2024071600&lt;/code&gt; (&lt;code&gt;YYYYMMDDHH&lt;/code&gt;), representing the 12 am push on July 16, 2024.&lt;/li&gt;
&lt;li&gt;One level deeper, there are two folders: &lt;code&gt;livemode&lt;/code&gt; and &lt;code&gt;testmode&lt;/code&gt;, representing live and test data, respectively.&lt;/li&gt;
&lt;li&gt;Each of these two folders contains one folder per data table, e.g., &lt;code&gt;subscriptions&lt;/code&gt; or &lt;code&gt;invoices&lt;/code&gt;. Additionally, a &lt;code&gt;coreapi_SUCCESS&lt;/code&gt; file indicates that the transfer to your GCS bucket has completed and the data is ready for consumption.&lt;/li&gt;
&lt;li&gt;Within the table folders are several Parquet files containing the actual data for each table.&lt;/li&gt;
&lt;/ul&gt;
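&lt;p&gt;To make the layout concrete, here is a minimal sketch that splits a GCS object name into the parts described above. The helper names &lt;code&gt;parseObjectName&lt;/code&gt; and &lt;code&gt;folderToDate&lt;/code&gt; and the sample file name are hypothetical, and reading the folder name as UTC is an assumption.&lt;/p&gt;

```javascript
// Hypothetical helpers; the layout follows the list above:
//   YYYYMMDDHH/livemode|testmode/TABLE/FILE
//   YYYYMMDDHH/livemode|testmode/coreapi_SUCCESS
function parseObjectName(name) {
  const parts = name.split('/');
  const [folder, mode, third, file] = parts;
  // The root folder must be a ten-digit timestamp (YYYYMMDDHH).
  if (!/^\d{10}$/.test(folder || '')) return null;
  if (!['livemode', 'testmode'].includes(mode)) return null;
  if (parts.length === 3) {
    // Only the success marker lives directly inside a mode folder.
    if (third === 'coreapi_SUCCESS') return { folder, mode, kind: 'success' };
    return null;
  }
  if (parts.length === 4) {
    // Anything one level deeper is treated as a data file of table `third`.
    return { folder, mode, kind: 'data', table: third, file };
  }
  return null;
}

// Convert a folder name such as "2024071600" into a Date.
// Assumption: the folder name is expressed in UTC.
function folderToDate(folder) {
  const year = Number(folder.slice(0, 4));
  const month = Number(folder.slice(4, 6));
  const day = Number(folder.slice(6, 8));
  const hour = Number(folder.slice(8, 10));
  return new Date(Date.UTC(year, month - 1, day, hour));
}

console.log(parseObjectName('2024071600/livemode/invoices/part-0.parquet'));
console.log(folderToDate('2024071600').toISOString()); // 2024-07-16T00:00:00.000Z
```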
&lt;h2 id=&quot;loading-the-data-from-google-cloud-storage-into-google-bigquery&quot;&gt;Loading the Data from Google Cloud Storage into Google BigQuery&lt;/h2&gt;
&lt;p&gt;There are multiple ways to transfer data from GCS to BigQuery. We opted for the following approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Using Google Cloud Scheduler to publish a message to Google Pub/Sub every 6 hours at 1 am, 7 am, 1 pm, and 7 pm.&lt;/li&gt;
&lt;li&gt;Creating a Google Cloud Function that listens for new messages on the above Pub/Sub topic. When a message is received, it triggers a Node.js script that loads the most recent data from GCS into BigQuery and deletes it from GCS.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s delve into the details.&lt;/p&gt;
&lt;h3 id=&quot;create-a-google-cloud-scheduler-job&quot;&gt;Create a Google Cloud Scheduler Job&lt;/h3&gt;
&lt;p&gt;First, create a new Cloud Scheduler job &lt;a href=&quot;https://console.cloud.google.com/cloudscheduler/jobs/new&quot;&gt;here&lt;/a&gt; with the following configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Choose a name for this job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Region&lt;/strong&gt;: The region is not crucial for this task; we used &lt;code&gt;europe-west3&lt;/code&gt; since most of our services are in Germany.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frequency&lt;/strong&gt;: We want the job to run every 6 hours at 1 am, 7 am, 1 pm, and 7 pm. Stripe publishes data every 6 hours, but it takes time to transfer it to GCS. We chose 1 hour later than Stripe’s push time, so our value is &lt;code&gt;0 1,7,13,19 * * *&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timezone&lt;/strong&gt;: Choose ‘Coordinated Universal Time (UTC)’.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Target type&lt;/strong&gt;: Choose ‘Pub/Sub’.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Select a Cloud Pub/Sub topic&lt;/strong&gt;: Select or create a new Pub/Sub topic using the default configuration. This is used to trigger the Cloud Function.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Message body&lt;/strong&gt;: Our function ignores the contents of the message, so this value doesn’t matter. We opted for a simple &lt;code&gt;load&lt;/code&gt; string.&lt;/li&gt;
&lt;/ul&gt;
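&lt;p&gt;The frequency value follows directly from Stripe’s push hours plus the one-hour delay. As a small illustration, the sketch below derives it; the function name &lt;code&gt;cronForHours&lt;/code&gt; is hypothetical, and the push hours are inferred from the 6-hour schedule described above.&lt;/p&gt;

```javascript
// Hours (UTC) at which Stripe pushes data, inferred from the 6-hour schedule.
const stripePushHours = [0, 6, 12, 18];
// How long we wait after each push before loading the data.
const delayHours = 1;

// Build a cron expression that fires at minute 0 of each shifted hour.
function cronForHours(hours, delay) {
  const shifted = hours.map((h) => (h + delay) % 24).sort((a, b) => a - b);
  return `0 ${shifted.join(',')} * * *`;
}

console.log(cronForHours(stripePushHours, delayHours)); // 0 1,7,13,19 * * *
```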
&lt;p&gt;&lt;img alt=&quot;&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; sizes=&quot;(min-width: 566px) 566px, 100vw&quot; data-astro-image=&quot;constrained&quot; data-astro-image-pos=&quot;center&quot; width=&quot;566&quot; height=&quot;1240&quot; src=&quot;https://img.ly/_astro/Screenshot-2024-07-16-at-11.46.22_1dcdSQ.webp&quot; srcset=&quot;/_astro/Screenshot-2024-07-16-at-11.46.22_1dcdSQ.webp 566w&quot;&gt;&lt;/p&gt;
&lt;p&gt;Finally, click ‘Create’ to set up the scheduler. Now, a message is published to the selected Pub/Sub topic every 6 hours. Next, we need to respond to this message.&lt;/p&gt;
&lt;h3 id=&quot;create-a-google-cloud-function&quot;&gt;Create a Google Cloud Function&lt;/h3&gt;
&lt;p&gt;Create a Google Cloud Function triggered by Pub/Sub &lt;a href=&quot;https://console.cloud.google.com/functions/add&quot;&gt;here&lt;/a&gt; with the following configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Choose ‘2nd gen’.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Function name&lt;/strong&gt;: Choose a name for this Cloud Function.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Region&lt;/strong&gt;: Select the region for the function; we used &lt;code&gt;europe-west3&lt;/code&gt;, as for most of our services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigger type&lt;/strong&gt;: Choose ‘Cloud Pub/Sub’.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud Pub/Sub topic&lt;/strong&gt;: Select the Pub/Sub topic created in the previous step.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adjust the ‘Runtime, build, connections and security settings’ based on your Cloud setup and the required processing power for Stripe data. Generally, the following settings work well:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory allocated&lt;/strong&gt;: ‘512 MiB’&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: ‘1’&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeout&lt;/strong&gt;: ‘540’ (seconds)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum number of instances&lt;/strong&gt;: ‘0’ (to ensure the function shuts down when not in use)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maximum number of instances&lt;/strong&gt;: ‘1’&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service account&lt;/strong&gt;: Use or create a service account with permissions to access the GCS bucket where Stripe data is stored and the BigQuery datasets to load the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ingress settings&lt;/strong&gt;: Choose ‘Allow internal traffic only’.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt=&quot;&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot; sizes=&quot;(min-width: 554px) 554px, 100vw&quot; data-astro-image=&quot;constrained&quot; data-astro-image-pos=&quot;center&quot; width=&quot;554&quot; height=&quot;1286&quot; src=&quot;https://img.ly/_astro/Screenshot-2024-07-16-at-11.56.55_1Iq5S7.webp&quot; srcset=&quot;/_astro/Screenshot-2024-07-16-at-11.56.55_1Iq5S7.webp 554w&quot;&gt;&lt;/p&gt;
&lt;p&gt;Click ‘Next’ to provide the function’s code. Select:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Runtime&lt;/strong&gt;: ‘Node.js 20’&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source code&lt;/strong&gt;: ‘Inline Editor’&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entry point&lt;/strong&gt;: &lt;code&gt;loadStripeData&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
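&lt;p&gt;In the &lt;code&gt;package.json&lt;/code&gt;, add the BigQuery and Cloud Storage Node.js packages (&lt;code&gt;@google-cloud/bigquery&lt;/code&gt; and &lt;code&gt;@google-cloud/storage&lt;/code&gt;) as dependencies.&lt;/p&gt;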

&lt;p&gt;In the &lt;code&gt;index.js&lt;/code&gt;, add the function code and expose it under the entry point name &lt;code&gt;loadStripeData&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This script does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For each environment (live and test), it searches for the latest folder containing a &lt;code&gt;coreapi_SUCCESS&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;For each table, it groups all related Parquet files and loads them into the BigQuery table using &lt;code&gt;WRITE_TRUNCATE&lt;/code&gt;, which overwrites existing data. Note that the location is specified as &lt;code&gt;EU&lt;/code&gt;, matching our BigQuery dataset and GCS bucket location. Adjust this parameter if your data is elsewhere.&lt;/li&gt;
&lt;li&gt;If all files for an environment are loaded without errors, the files are deleted from GCS. This step is optional; if you prefer to keep a backup, you can omit this part.&lt;/li&gt;
&lt;/ol&gt;
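&lt;p&gt;The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production script: the helper names (&lt;code&gt;latestCompleteFolder&lt;/code&gt;, &lt;code&gt;groupFilesByTable&lt;/code&gt;, &lt;code&gt;loadEnvironment&lt;/code&gt;) are hypothetical, &lt;code&gt;bucket&lt;/code&gt; is assumed to be a &lt;code&gt;Bucket&lt;/code&gt; from &lt;code&gt;@google-cloud/storage&lt;/code&gt;, and &lt;code&gt;dataset&lt;/code&gt; a &lt;code&gt;Dataset&lt;/code&gt; from &lt;code&gt;@google-cloud/bigquery&lt;/code&gt;. Wiring the helper to the Pub/Sub trigger is left out.&lt;/p&gt;

```javascript
// Step 1: newest folder for a mode that contains a coreapi_SUCCESS marker.
function latestCompleteFolder(objectNames, mode) {
  const marker = `/${mode}/coreapi_SUCCESS`;
  const folders = objectNames
    .filter((name) => name.endsWith(marker))
    .map((name) => name.split('/')[0])
    .sort(); // YYYYMMDDHH sorts chronologically as a plain string
  return folders.length === 0 ? null : folders[folders.length - 1];
}

// Step 2: group the data files of that folder by table name.
function groupFilesByTable(objectNames, folder, mode) {
  const prefix = `${folder}/${mode}/`;
  const groups = {};
  for (const name of objectNames) {
    if (!name.startsWith(prefix)) continue;
    const rest = name.slice(prefix.length).split('/');
    if (rest.length !== 2) continue; // skips the coreapi_SUCCESS marker
    const [table] = rest;
    if (!groups[table]) groups[table] = [];
    groups[table].push(name);
  }
  return groups;
}

// Load options matching the description above; adjust `location` as needed.
const loadOptions = {
  sourceFormat: 'PARQUET',
  writeDisposition: 'WRITE_TRUNCATE',
  location: 'EU',
};

// Runs steps 1 to 3 for one environment ('livemode' or 'testmode').
async function loadEnvironment(bucket, dataset, mode) {
  const [files] = await bucket.getFiles();
  const names = files.map((file) => file.name);
  const folder = latestCompleteFolder(names, mode);
  if (folder === null) return null; // nothing complete to load yet
  const groups = groupFilesByTable(names, folder, mode);
  for (const [table, tableFiles] of Object.entries(groups)) {
    const sources = tableFiles.map((name) => bucket.file(name));
    // Depending on the client version, you may also want to inspect the
    // returned job's status for errors before continuing.
    await dataset.table(table).load(sources, loadOptions);
  }
  // Step 3 (optional): reached only if no load threw an error.
  await bucket.deleteFiles({ prefix: `${folder}/${mode}/` });
  return folder;
}
```

&lt;p&gt;In this sketch, the function’s handler would call &lt;code&gt;loadEnvironment&lt;/code&gt; twice, once with &lt;code&gt;livemode&lt;/code&gt; and the live dataset and once with &lt;code&gt;testmode&lt;/code&gt; and the test dataset. Keeping the folder selection and grouping as pure functions makes them easy to check without touching GCS or BigQuery.&lt;/p&gt;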
&lt;p&gt;Click ‘Deploy’ to deploy your Cloud Function.&lt;/p&gt;
&lt;h3 id=&quot;create-bigquery-datasets&quot;&gt;Create BigQuery Datasets&lt;/h3&gt;
&lt;p&gt;The final step is to create two datasets in Google BigQuery: one for live data and one for test data. Open &lt;a href=&quot;https://console.cloud.google.com/bigquery&quot;&gt;Google BigQuery&lt;/a&gt;, click on the three dots next to your project’s name, and select ‘Create dataset’. Enter a name and choose a location matching your GCS bucket’s location. Repeat this process for the test dataset.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Following these steps will ensure your Stripe data is imported into BigQuery and automatically updated every 6 hours. However, as Data Pipeline for GCS is still in beta, there are some limitations. For example, the schema of the Parquet files lacks type annotations for timestamps, so all timestamps in BigQuery are represented as &lt;code&gt;INTEGER&lt;/code&gt; instead of &lt;code&gt;TIMESTAMP&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, some tables, such as &lt;code&gt;subscription_item_change_events&lt;/code&gt;, are not currently transferred when syncing with Google Cloud Storage, although this issue is expected to be resolved soon. Meanwhile, we continue to use Fivetran in conjunction with the above method to sync Stripe data to Google BigQuery and plan to fully migrate to Data Pipeline once it exits the beta phase.&lt;/p&gt;
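&lt;p&gt;Until the Parquet schema carries timestamp annotations, the &lt;code&gt;INTEGER&lt;/code&gt; columns have to be converted when they are read. The sketch below assumes the values are Unix timestamps in seconds, as in Stripe’s API, which is worth verifying against your own data; the function name &lt;code&gt;stripeTimestampToDate&lt;/code&gt; is hypothetical. In SQL, BigQuery’s &lt;code&gt;TIMESTAMP_SECONDS&lt;/code&gt; function performs the same conversion.&lt;/p&gt;

```javascript
// Assumption: the INTEGER value is a Unix timestamp in seconds.
function stripeTimestampToDate(seconds) {
  // JavaScript dates are based on milliseconds since the epoch.
  return new Date(seconds * 1000);
}

console.log(stripeTimestampToDate(1721088000).toISOString()); // 2024-07-16T00:00:00.000Z
```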
&lt;p&gt;&lt;strong&gt;Thank you for reading!&lt;/strong&gt;&lt;/p&gt;
</content:encoded><dc:creator>Sascha</dc:creator><media:content url="https://blog.img.ly/2024/07/stripe-bigquery-how-to.jpg" medium="image"/><category>How-To</category><category>Business Intelligence</category><category>Data</category><category>Cloud</category><category>Insights</category></item></channel></rss>