More about incrementals

It’s not all rainbows and butterflies when it comes to building incremental models. Imagine this: table_a has upstreams table_b, table_c, and table_d, all joined together. For any given row in the output table, how do you check if a record is considered “more recent” than the existing record, and needs to be updated?

Prerequisite: When did this table last refreshed?

Every table needs to have a commit_time column where it specifies the most recent insert/update datetime of this record.

Back to the table A,B,C,D example. We proceed to build table_a_tmp using table_b, table_c, and table_d. Since we have the commit_time of each row from each upstream table, the output of GREATEST(table_b.commit_time, table_c.commit_time, table_d.commit_time) will return the “most recent update” datetime of this particular row. If the output is more recent than the previous update datetime (checkpoint) of table_a, then it will be updated/inserted into table_a.

Of course I’m oversimplying the problem here. There’s still alot of possibility such as aggregates, window functions, deletions, etc. Maybe I’ll share the guide I wrote here sometime in the future. It still needs more incremental improvements 😆.

Ok I’m really sick of writing about incremental. Let’s stop here.

Culture of writing

Coming from a team with zero documentation, I can feel the power of documenting our works. I can’t emphasize how much has the docs written by my fellow SB-ers enabled me to work independently without disturbing others. I can simply search and read most of the info I need.

Not to mention, my boss said “documenting your tasks is part of actual work”. Give him a like.

Guess I’ll stop here today. Signing off.