At GSN Games, we’re obsessed with two things: measuring data and moving fast. We try to take new ideas from conception to production in 2 days. Traditional data warehousing solutions — you know, the ones that take 6 months to implement — would never work for us.
So here’s how we use Vertica and SQL to collect data and still move quickly.
We store all of our data in a single table, named events. The first four columns look perfectly normal:
user_id
event_type_id
event_time
game_id
The next 60 columns look a little odd:
attr1
attr2
attr3
...
attr60
“Wait, that’s the kind of database that Dilbert’s pointy-haired boss would create! YOU’RE DOING IT WRONG!”
Please hold all of your nerdrage until the end of this article.
“Ok, but I’m extremely skeptical right now.”
As was I when I initially saw this design. You'll come around.
The first 10 columns are for integers, the next 10 for floats, the next 10 for booleans, the next 10 for timestamps, and the last 20 are for strings.
event_type_id is a “discriminator column”: it tells us what kind of data is in each column in that row.
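For the curious, the table's shape looks roughly like this. The exact types and varchar widths below are illustrative rather than our production DDL, and fb_user_id isn't one of the "first four" columns but does show up in a query later in this post:

create table newapi.events (
    user_id        int,
    event_type_id  int,        -- discriminator: which kind of event this row is
    event_time     timestamp,
    game_id        int,
    fb_user_id     int,        -- not in the list above, but referenced in the example query below
    -- attr1 through attr10: integers
    attr1  int, attr2  int, attr3  int, attr4  int, attr5  int,
    attr6  int, attr7  int, attr8  int, attr9  int, attr10 int,
    -- attr11 through attr20: floats
    attr11 float, attr12 float, attr13 float, attr14 float, attr15 float,
    attr16 float, attr17 float, attr18 float, attr19 float, attr20 float,
    -- attr21 through attr30: booleans
    attr21 boolean, attr22 boolean, attr23 boolean, attr24 boolean, attr25 boolean,
    attr26 boolean, attr27 boolean, attr28 boolean, attr29 boolean, attr30 boolean,
    -- attr31 through attr40: timestamps
    attr31 timestamp, attr32 timestamp, attr33 timestamp, attr34 timestamp, attr35 timestamp,
    attr36 timestamp, attr37 timestamp, attr38 timestamp, attr39 timestamp, attr40 timestamp,
    -- attr41 through attr60: strings
    attr41 varchar(255), attr42 varchar(255), attr43 varchar(255), attr44 varchar(255), attr45 varchar(255),
    attr46 varchar(255), attr47 varchar(255), attr48 varchar(255), attr49 varchar(255), attr50 varchar(255),
    attr51 varchar(255), attr52 varchar(255), attr53 varchar(255), attr54 varchar(255), attr55 varchar(255),
    attr56 varchar(255), attr57 varchar(255), attr58 varchar(255), attr59 varchar(255), attr60 varchar(255)
);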
We have lots of different events that we record, and frequently add new ones. We keep track of what data is in what column with a big Google Spreadsheet that looks like this:
So for the PAYMENT_RECEIVED event, attr1 stores the amount of money the user paid us, but for the TOKENS_ADJUSTED event, attr1 stores the number of tokens won or lost in a game.
Note in the above screenshot how sparse the columns are. Most cells are unused (and therefore NULL). This would have terrible performance implications in a traditional, row-oriented database. But a column-oriented database has no problem with sparse columns.
We use this Google doc to easily generate SQL views for each event type. So to the casual observer, it looks like a normal RDBMS, with one table for each event, but under the hood it's one table to rule them all, forged by Sauron.
“What's S-A-U-R-O-N?”
Tough crowd. Moving on...
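Here's roughly what one of those generated views looks like. The event_type_id value and the column alias below are illustrative (cribbed from the PAYMENT_RECEIVED row of the spreadsheet), not the literal generated SQL:

create view newapi.payment_received as
select
    user_id,
    fb_user_id,
    event_time,
    game_id,
    attr1 as payment_amount    -- per the spreadsheet, attr1 holds the amount paid for this event type
from newapi.events
where event_type_id = 42;      -- illustrative id; each event type gets its own number

Analysts can then select from payment_received without ever knowing (or caring) that the attr columns exist.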
This schema allows our game developers and platform engineers to launch new features without involving the BI team. Our app servers write to a flat .csv file. So if a developer comes up with a new feature and wants to begin logging metrics, all they do is claim a new event_type_id in the spreadsheet, assign meanings to the attr columns they need, and start writing those events to the log.
The log files roll over every 10MB, at which point they are uploaded to S3 for ingestion by the data warehouse.
Every 15 minutes, the data warehouse downloads all the log files from S3 and uses the Vertica copy command to write them to the events table.
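The load itself is just a COPY statement per file, something like this (the path and options are illustrative; the real loader loops over whatever it pulled down from S3):

copy newapi.events
from '/data/staging/events_000123.csv'   -- illustrative path to one downloaded log file
delimiter ','
null ''                                   -- empty fields load as NULL, which is most of them
direct;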
A typical load will add 5,000,000 rows to the table, which usually takes about 45 seconds to complete.
Et voilà! An engineer has successfully started logging new data into Vertica, and I didn't have to do anything.
“Wow, you’re lazy.”
Correct! Moving on...
We're heavy users of SQL for ad-hoc analysis. Standardized reports are great, but you can't possibly anticipate everyone's questions in advance.
Even though the events table currently has close to 100 billion records, query operations are still blazingly fast. Here's a recent example:
A developer (let's call him "Shmeric") recently added a sweepstakes to our Facebook app Games by GSN. Every 24 hours, we give away a ton of tokens, and users can fill out a sweepstakes card to gain entry. He created a new event_type_id of 137 and decided to store the timestamp of the upcoming drawing in the attr31 column.
He then asked: "Do users submit more entries as the nightly drawing gets closer?" He expected that the number of entries per hour would be low when there were 24 hours until the next drawing, then steadily increase as the drawing got closer and closer. Here's the query:
select
    hr as "Hours Until Drawing",
    -- divide by the total number of entrants (the cumulative count at hr = -1) to get percentages
    1.0*cumulative/first_value(cumulative) over(order by hr desc) as "Cumulative Percent",
    1.0*entered/first_value(cumulative) over(order by hr desc) as "Hourly Percent"
from (
    select
        -- attr31 holds the drawing time, so this is negative until the drawing happens
        datediff('hour',attr31,event_time) as hr,
        count(distinct fb_user_id) as entered,
        -- running total of entrants, from 24 hours out up to the drawing
        sum(count(distinct fb_user_id)) over (order by datediff('hour',attr31,event_time) asc) as cumulative
    from newapi.events
    where event_type_id=137   -- the sweepstakes-entry event
        and datediff('hour',attr31,event_time) between -24 and -1
        and event_time between '2012-10-10' and date_trunc('day',sysdate)
    group by 1
) z
order by 1 asc
“You know that query could be optimized if you—”
I'm sure it could, but it ran in 300ms, so I don't particularly care.
Time: First fetch (24 rows): 314.714 ms. All rows formatted: 314.791 ms
Here's the output using the R plot function:
Interestingly, it turned out that the relationship was actually the opposite of what was predicted: the number of entries goes down as the drawing approaches (bottom left cell of the plot). Our team is currently at work tweaking the messaging to make it clearer to users how the sweepstakes works.
So in the above example, a developer built a feature, launched it, and measured it without requiring any new tables or load scripts.
Although the single table approach is unorthodox (and smells funny when you first encounter it), it has a number of advantages.
To be clear: this would never work in a traditional RDBMS. (What columns would you index? All of them?) But with Vertica, a query against tens of billions of records can return in under a second. Columnar databases rule.
Portman Wills leads the BI & Analytics team at GSN Digital.
He's always ready to talk about Vertica, data science, or last night's episode of The League. Drop him a line at pwills@gsn.com, connect on LinkedIn, or check out all of the current positions open at GSN.