
How we use Vertica at GSN

At GSN Games, we’re obsessed with two things: measuring data and moving fast. We try to take new ideas from conception to production in 2 days. Traditional data warehousing solutions — you know, the ones that take 6 months to implement — would never work for us.

So here’s how we use Vertica and SQL to collect data and still move quickly.

A very simple schema

We store all of our data in a single table, named events. The first four columns look perfectly normal: an event type, a timestamp, a user id, and so on.

The next 60 columns look a little odd: they're just generic, typed slots named attr1 through attr60.

“Wait, that’s the kind of database that Dilbert’s pointy-haired boss would create! YOU’RE DOING IT WRONG!”

Please hold all of your nerdrage until the end of this article.

“Ok, but I’m extremely skeptical right now.”

As was I when I initially saw this design. You'll come around.

The first 10 attr columns are for integers, the next 10 for floats, the next 10 for booleans, the next 10 for timestamps, and the last 20 are for strings.

event_type_id is a “discriminator column”: it tells us what kind of data is in each column in that row.
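To make that concrete, here's a heavily abbreviated sketch of the DDL, with one representative column from each block of attrs. The types on the leading columns are my guesses, and the real table spells out all 64 columns:

create table newapi.events (
    event_type_id  int,          -- discriminator: which kind of event this row holds
    event_time     timestamp,    -- when the event happened
    fb_user_id     int,          -- who it happened to (type is a guess)
    -- ...one more "normal" column, then the 60 generic slots:
    attr1   int,                 -- attr1  through attr10: integers
    attr11  float,               -- attr11 through attr20: floats
    attr21  boolean,             -- attr21 through attr30: booleans
    attr31  timestamp,           -- attr31 through attr40: timestamps
    attr41  varchar(255)         -- attr41 through attr60: strings
);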

We have lots of different events that we record, and frequently add new ones. We keep track of what data is in what column with a big Google Spreadsheet that looks like this:

A screenshot of one section of our master column definition Google Spreadsheet

So for the PAYMENT_RECEIVED event, attr1 stores the amount of money the user paid us, but for the TOKENS_ADJUSTED event, attr1 stores the number of tokens won or lost in a game.

Note in the above screenshot how sparse the columns are. Most cells are unused (and therefore NULL). This would have terrible performance implications in a traditional, row-oriented database. But a column-oriented database has no problem with sparse columns.

We use this Google doc to easily generate SQL views for each event type. So to the casual observer, it looks like a normal RDBMS, with one table for each event, but under the hood it's one table to rule them all, forged by Sauron.

“What's S-A-U-R-O-N?”

Tough crowd. Moving on...
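Anyway, here's roughly what one of those generated views looks like, using the PAYMENT_RECEIVED mapping from the spreadsheet above. The view name, the column alias, and the event_type_id value are made up for illustration:

create view payment_received as
select
    event_time,
    fb_user_id,
    attr1 as payment_amount      -- per the spreadsheet, attr1 holds the amount paid
from newapi.events
where event_type_id = 42;        -- hypothetical id for PAYMENT_RECEIVED

Analysts query payment_received like any other table and never have to think about the attr columns.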

Getting data in...

This schema allows our game developers and platform engineers to launch new features without involving the BI team. Our app servers write to a flat .csv file. So if a developer comes up with a new feature and wants to begin logging metrics, this is what they do:

  1. Create a new event type id
  2. Add a new row to the Google Doc
  3. Add some code to write lines to the standard log file

The log files roll over every 10MB, at which point they are uploaded to S3 for ingestion by the data warehouse.

Every 15 minutes, the data warehouse downloads all the log files from S3, and uses the Vertica copy command to write them to the events table. A typical load will add 5,000,000 rows to the table, which usually takes about 45 seconds to complete.
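The load itself is a single statement per batch; something along these lines, with illustrative paths and options rather than our exact script:

copy newapi.events
from '/data/s3_downloads/*.csv'  -- the batch of log files just pulled down from S3
delimiter ','
null ''
direct;                          -- write straight to ROS on disk, bypassing the WOS

Because there's only one table, there's only one load path to maintain, no matter how many event types developers dream up.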

Et voilà! An engineer has successfully started logging new data into Vertica, and I didn't have to do anything.

“Wow, you’re lazy.”

Correct! Moving on...

Getting information out...

We're heavy users of SQL for ad-hoc analysis. Standardized reports are great, but you can't possibly anticipate everyone's questions in advance.

Even though the events table currently has close to 100 billion records, query operations are still blazingly fast. Here's a recent example:

A developer (let's call him "Shmeric") recently added a sweepstakes to our Facebook app Games by GSN. Every 24 hours, we give away a ton of tokens, and users can fill out a sweepstakes card to gain entry. He created a new event_type_id of 137 and decided to store the timestamp of the upcoming drawing in the attr31 column.
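Following the view-generation pattern above, the generated view for this event just maps attr31 back to a readable name; something like this (the view and column names here are my own, not Shmeric's):

create view sweepstakes_entered as
select
    event_time,
    fb_user_id,
    attr31 as drawing_time       -- timestamp of the upcoming drawing
from newapi.events
where event_type_id = 137;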

He then asked: "Do users submit more entries as the nightly drawing gets closer?" He expected that the number of entries per hour would be low when there were 24 hours until the next drawing, then steadily increase as the drawing got closer and closer. Here's the query:


select
  hr as "Hours Until Drawing",
  -- first_value(...) over (order by hr desc) picks the running total at hr = -1,
  -- the full 24-hour figure, which both percentages are divided by
  1.0*cumulative/first_value(cumulative) over(order by hr desc) as "Cumulative Percent",
  1.0*entered/first_value(cumulative) over(order by hr desc) as "Hourly Percent"
from (
  select
    -- hours relative to the drawing time stored in attr31 (values run -24 to -1)
    datediff('hour',attr31,event_time) as hr,
    -- distinct entrants in each hour, plus a running total starting 24 hours out
    count(distinct fb_user_id) as entered,
    sum(count(distinct fb_user_id)) over (order by datediff('hour',attr31,event_time) asc) as cumulative
  from newapi.events
  where event_type_id=137
  and datediff('hour',attr31,event_time) between -24 and -1
  and event_time between '2012-10-10' and date_trunc('day',sysdate)
  group by 1
) z
order by 1 asc
		

“You know that query could be optimized if you—”

I'm sure it could, but it ran in 300ms, so I don't particularly care.

Time: First fetch (24 rows): 314.714 ms. All rows formatted: 314.791 ms

Here's the output using the R plot function:

Interestingly, it turned out that the relationship was actually the opposite of what was predicted: the number of entries goes down as the drawing approaches (bottom left cell of the plot). Our team is currently at work tweaking the messaging to make it clearer to users how the sweepstakes works.

Wrapping up

So in the above example, a developer built a feature, launched it, and measured it without requiring any new tables or load scripts.

Although the single table approach is unorthodox (and smells funny when you first encounter it), it has a number of advantages.

To be clear: this would never work in a traditional RDBMS. (What columns would you index? All of them?) But with Vertica, a query against tens of billions of records can return in under a second. Columnar databases rule.

Portman Wills leads the BI & Analytics team at GSN Digital.
He's always ready to talk about Vertica, data science, or last night's episode of The League. Drop him a line at pwills@gsn.com, connect on LinkedIn, or check out all of the current positions open at GSN.