April 18, 2024 · 7 minute read
Why you should move your ETL stack to Modal

ETL diagram

ETL (Extract, Transform, Load) is the process of moving data from point A to point B.

Most commonly, ETL means moving data from some source system (e.g. a production database, Slack API) into an analytical data warehouse (e.g. Snowflake) where the data is easier to combine and analyze. Most data teams use a vendor like Fivetran or an orchestration platform like Airflow to do this.

Modal is a great solution for ETL if you are primarily looking for:

  • Cost savings on large-scale data transfers. Modal’s usage-based pricing means you pay for how much compute you use, not how many rows you sync, making it a far more cost-effective option for moving large amounts of data.
  • An easy and flexible platform for custom code. Orchestration platforms like Airflow are notoriously difficult to set up and are overkill for the 95% of data jobs that only need a simple cron-like schedule. Modal is the easiest way to get those kinds of custom ETL jobs running (a minimal sketch follows this list).
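
To make that concrete, here is a minimal sketch of a scheduled job on Modal (the app name, cron expression, and function body are placeholders; the examples below flesh this pattern out):

import modal

stub = modal.Stub("my-etl-app")  # placeholder app name

# once deployed with `modal deploy`, this runs every day at 6am UTC
@stub.function(schedule=modal.Cron("0 6 * * *"))
def nightly_etl():
    # extract / transform / load logic goes here
    ...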

In this post, I’ll walk through two examples, inspired by how we do our own internal analytics, that clearly show the cost and flexibility advantages of using Modal for ETL.

Example 1: Copy 12m ClickHouse rows to Snowflake at 0.01% of the cost of Fivetran

We use ClickHouse to serve metrics on resource usage and run time for our customers’ jobs. We’d like to move this data into Snowflake so that we can combine it with other information we have on our customers and answer questions like “what is the conversion rate from creating a Modal workspace to using 1 hour of compute?”.

First, we extract from ClickHouse using their native Python connector:

import os  # credentials below are injected as environment variables by the Modal Secrets

def extract_from_clickhouse(date):
    import clickhouse_connect

    # pull one day of per-minute usage metrics
    query = f"""
        select
            timestamp_minute,
            workspace_id,
            billing_type,
            cpu_ns / 3600e9 as cpu_hr,
            mem_ns / 3600e9 as mem_hr,
            gpu_ns / 3600e9 as gpu_hr
        from metrics
        prewhere toDate(timestamp_minute) == '{date}'
    """

    client = clickhouse_connect.get_client(
        host=os.environ["CLICKHOUSE_HOST"],
        port=int(os.environ["CLICKHOUSE_PORT"]),
        username="default",
        password=os.environ["CLICKHOUSE_PASSWORD"],
        secure=True,
    )
    result = client.query(query)
    print(f"Fetched ClickHouse data for {date}")
    return result.result_rows

This returns the query results as a list of tuples, where each tuple is a row. Then we batch load the results into Snowflake:

def load_to_snowflake(data: list[tuple], date):
    import snowflake.connector

    # connect with the credentials injected by the snowflake-secret Modal Secret
    conn = snowflake.connector.connect(
        user="snowflake_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
    )
    cursor = conn.cursor()

    target_table = 'USAGE_BY_MINUTE'
    batch_size = 10000
    insert_sql = f"""
    insert into CLICKHOUSE.{target_table} (timestamp_minute, workspace_id, billing_type, cpu_hr, mem_hr, gpu_hr, inserted_at)
    values (%s, %s, %s, %s, %s, %s, current_timestamp())
    """
    for i in range(0, len(data), batch_size):
        batch = data[i : i + batch_size]
        print(f"Loading batch {date}:{i}-{i+batch_size}")
        cursor.executemany(insert_sql, batch)
        conn.commit()

    # Close the cursor and connection
    cursor.close()
    conn.close()

    print("Data inserted successfully.")

Here we are using the Snowflake Python connector’s executemany method, which inserts the rows 10,000 at a time. We set the batch size to 10,000 because a single Snowflake insert statement is limited to 16,384 rows.

Now, we add some Modal 🪄magic🪄:

import modal

stub = modal.Stub("clickhouse-to-snowflake")  # app name is illustrative

@stub.function(
    secrets=[
        modal.Secret.from_name("snowflake-secret"),
        modal.Secret.from_name("clickhouse-prod"),
    ],
    timeout=3000,
)
def run_etl(date):
    results = extract_from_clickhouse(date)
    load_to_snowflake(results, date)

@stub.local_entrypoint()
def main():
    dates = [
        '2024-04-07',
        '2024-04-08',
        '2024-04-09',
        '2024-04-10',
        '2024-04-11',
    ]
    # fan out one run_etl job per date
    run_etl.for_each(dates)

We use @stub.function to execute run_etl in the cloud with the following parameters:

  • Database credentials as environment variables via Secrets
  • A timeout of 50 minutes (default is 5 minutes)

In main(), we kick off 5 run_etl jobs in parallel (one per date) using for_each, which greatly speeds up processing time.
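
Hardcoding five dates is fine for a one-off run, but for a longer backfill you could generate the date list programmatically. Here’s a minimal sketch (the entrypoint name, default start date, and window size are just placeholders):

from datetime import date, timedelta

@stub.local_entrypoint()
def backfill(start: str = "2024-04-01", days: int = 30):
    # build one ISO date string per day in the backfill window
    first = date.fromisoformat(start)
    dates = [(first + timedelta(days=i)).isoformat() for i in range(days)]
    run_etl.for_each(dates)

Assuming the script is saved as etl.py, something like modal run etl.py::backfill --start 2024-03-01 --days 31 would then fan out one job per day of March.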

Here are the statistics of an example run:

ETL example metrics

This job copied 12m rows from ClickHouse to Snowflake in 16 minutes using:

  • 5 CPUs for 16 minutes (≈ 1.33 CPU-hours): at $0.192 / CPU-hour, that comes out to $0.26
  • 4.4 GiB of memory for 16 minutes (≈ 1.17 GiB-hours): at $0.024 / GiB-hour, that comes out to $0.03

Even if Fivetran had a ClickHouse connector (it doesn’t at the time of this writing), syncing 12m rows would cost ~$3300. The total cost of this Modal job is $0.29 (0.01% of Fivetran).
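
As a quick sanity check on that math, plugging in the per-hour rates from the bullets above and the ~$3300 Fivetran estimate:

# cost of the Modal job: 5 CPUs and 4.4 GiB of memory for 16 minutes
cpu_cost = 5 * (16 / 60) * 0.192    # ≈ $0.26
mem_cost = 4.4 * (16 / 60) * 0.024  # ≈ $0.03
modal_cost = round(cpu_cost, 2) + round(mem_cost, 2)  # ≈ $0.29

print(f"{modal_cost / 3300:.4%} of the Fivetran estimate")  # ≈ 0.0088%, i.e. roughly 0.01%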

You could argue that the Modal job costs more in developer time compared to an ETL vendor. In my opinion (and hopefully yours too after reading the code snippets!), this example was quite simple; I’d estimate an analytics engineer could write this in less than a day and spend at most a few hours a month maintaining it.

And this is where the real cost savings come in: by making your engineers more productive. In this next example, we’ll show how easy it is to write your own custom data jobs on Modal.

Example 2: Enrich user data with the GitHub API

Most of our customers first sign up with just their GitHub username. However, we also want to know what company they work for, so we can see if they would be interested in our Team or Enterprise tier. One way to get that information is from a user’s GitHub profile:

ETL Github screenshot

The only thing we know about the ComfyUI creator...

First, we extract some user ids and their associated GitHub usernames from our data warehouse:

def get_usernames():
    import os

    import snowflake.connector

    conn = snowflake.connector.connect(
        user="snowflake_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
    )
    cursor = conn.cursor()
    # grab every user who signed up with a GitHub username
    q = """
    select
        id,
        github_username
    from user
    where github_username is not null
    """
    cursor.execute(q)
    df = cursor.fetch_pandas_all()
    print(f"Got {df.shape[0]} rows.")
    cursor.close()
    conn.close()
    return df

Then, we write a function to query the GitHub API for a user’s company:

def get_company(username):
    import os

    from github import Auth, Github, GithubException

    # authenticate with the personal access token stored in the Modal Secret
    auth = Auth.Token(os.environ['PAT'])

    g = Github(auth=auth)

    try:
        gh_user = g.get_user(username)
    except GithubException:
        print(f"Request for {username} failed, skipping.")
        return None

    return gh_user.company

Finally, we apply that function to our user data to get an enriched dataset with each user’s GitHub-listed company. To query the GitHub API, first create a personal access token and add it to Modal as a Secret (referenced below as kenny-github-secret and exposed to the code as the PAT environment variable):

def get_user_companies(df):
    print("Querying Github API...")
    df['company'] = df['GITHUB_USERNAME'].apply(get_company)
    return df

@stub.function(
    secrets=[
        modal.Secret.from_name("kenny-github-secret"),
        modal.Secret.from_name("snowflake-secret")
    ],
)
def main():
    users_df = get_usernames()
    enriched_df = get_user_companies(users_df)
    print(enriched_df.head())

Running this script gives us:

Got 100 rows.
Querying Github API...
Request for xxxx failed, skipping.
       ID GITHUB_USERNAME          company
0  us-abc        xxxxxxxx  Duke University
1  us-def      xxxxxxxxxx             None
2  us-ghi         xxxxxxx             None
3  us-jkl    xxxxxxxxxxxx             None
4  us-mno       xxxxxxxxx             None

Looks like we need to schedule some college tours, starting with Duke 🔵😈

Let’s say you want to schedule this to run every day. This is as simple as passing a schedule argument (a modal.Period or modal.Cron) to @stub.function:

@stub.function(
    secrets=[
        modal.Secret.from_name("kenny-github-secret"),
        modal.Secret.from_name("snowflake-secret")
    ],
    # run this cloud function every day at 6am UTC
    schedule=modal.Cron("0 6 * * *"),
)
def main():
    ...  # same body as above
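
If you don’t care about the exact time of day, a modal.Period schedule works just as well; the only change is the schedule argument:

@stub.function(
    secrets=[
        modal.Secret.from_name("kenny-github-secret"),
        modal.Secret.from_name("snowflake-secret")
    ],
    # run roughly once every 24 hours instead of at a fixed time of day
    schedule=modal.Period(days=1),
)
def main():
    ...  # same body as above

Either way, the schedule only takes effect once the app is deployed with modal deploy.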

The ETL vendors want you to be afraid of writing custom code, but hopefully this example shows you how easy it is to add your own custom logic to make simple, yet powerful data enrichments.

Conclusion (when not to use Modal for ETL)

Traditional ETL solutions are still quite powerful when it comes to:

  • Common connectors with small-to-medium data volumes: we still have a lot of respect for companies like Fivetran, who have really nailed the user experience for the most common ETL use cases, like syncing Zendesk tickets or a production Postgres read replica into Snowflake. The only criticism we have is the pricing model, especially for larger data volumes.
  • Long-running, business-critical, multi-stage pipelines: this is where you get real value from an orchestration platform like Airflow, e.g. function caching, partial retries, and granular observability metrics. For what it’s worth, Modal is also actively thinking about how to address some of these use cases better.

The data community is going through a sea change, where people are realizing that writing custom code is actually an asset and not a cost. It reduces your risk of vendor lock-in, expands your universe of data solutions, and is orders of magnitude cheaper. Powered by Modal, your ETL process can finally unlock the flexibility, speed, and cost savings necessary in the new modern data era.

