Analyze Big Data & Build Applications with Google BigQuery

Hi, I'm Michael Manoochehri

  • Google Developer Relations Engineer: BigQuery
  • I <3 Big Data
    • All the social networks!
    • Google+, Twitter (@nTangledMichael), etc

In this talk...

  • What is Google BigQuery?
  • Where Did BigQuery Come From?
  • How does BigQuery Work?
  • More Complex Queries on Big Datasets
  • Summary and Q&A

Data @ Google Scale

  • 1998: Google 'founded' - index contained 26 million pages
  • 2000: Google's index reached the one billion mark
  • 2004: Google's index contained 4.28 billion web pages and 880 million images
  • 2008: Index reached 1 trillion unique URLs

Original "Google Storage"

Current Google Data Center

"Small Data" fits on one machine...

...Relational Database might work

Scale Becomes a Problem

When Data Sizes are Small

A traditional, relational backend on a web application stack might work...

  • MySQL for data collection, transformation, and analysis
  • Gigabyte and terabyte scale?
  • Collect Large Amounts of Data?
  • Ask Questions About your Data?

Dealing with Data

  • Datasets are growing with input from mobile applications, social media, and other distributed sources
  • There's growing pressure on app developers to deal with large data challenges
  • As data grows, automated processing and analysis become harder to scale

What is BigQuery?

BigQuery is designed to excel at 2 things...

  • Scale: Billions of rows! Terabytes of data!
  • Query speed: Seconds instead of minutes... or hours...


Life of a BigQuery

What can you do with BigQuery?

  • Create dashboards and reporting tools
  • Find patterns in large datasets
  • Create new analytics products
  • Business analysis tools
  • Lots of things we haven't thought of yet

QlikView US Natality Data viewer

Map Reduce Word Count


Map Reduce Word Count: Map Phase


Map Reduce Word Count: Shuffle Phase


Map Reduce Word Count: Reduce Phase
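
The three phases named on these slides can be sketched in plain Python. This is a single-process illustration only, not how BigQuery or a distributed MapReduce runs internally; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle_phase(map_phase(documents)))


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts["the"], counts["fox"])  # 3 2
```

In a real cluster the map and reduce calls run on many workers at once, and the shuffle moves data between machines; the data flow is the same.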


BigQuery Pricing

  • Charge for amount of data processed
    • $0.035 per GB processed
  • Charge for data stored in BQ as well
    • $0.12 per GB/month
  • 100 GB free query quota
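
A rough cost estimate using the prices on this slide might look like the sketch below. It assumes the free quota applies per month and is subtracted from the data processed before billing; the slide does not spell that out, and the function name is hypothetical.

```python
QUERY_PRICE_PER_GB = 0.035          # from the slide
STORAGE_PRICE_PER_GB_MONTH = 0.12   # from the slide
FREE_QUERY_GB = 100                 # assumed to reset monthly


def estimate_monthly_cost(gb_processed, gb_stored):
    """Return an estimated monthly bill in dollars, rounded to cents."""
    billable_gb = max(0, gb_processed - FREE_QUERY_GB)
    query_cost = billable_gb * QUERY_PRICE_PER_GB
    storage_cost = gb_stored * STORAGE_PRICE_PER_GB_MONTH
    return round(query_cost + storage_cost, 2)


# 1 TB of queries over 500 GB of stored data:
# (1000 - 100) * 0.035 + 500 * 0.12
print(estimate_monthly_cost(1000, 500))  # 91.5
```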

Massive Datasets: Right Technology for the Task

  • NoSQL Database: High Availability and Performance
  • Analysis tools: Optimized for aggregate queries
  • Ubiquitous Storage: Archiving, availability, staging data intermediates

Build end-to-end Data Pipelines w/ Google's Cloud

  • App Engine DataStore: Web Scale collection of user data streams: a non-relational datastore
  • Google Cloud Storage: Permanent archive of raw CSV data: cloud-based storage
  • Google BigQuery: Analysis of very large datasets
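
The middle step of this pipeline stages raw CSV data. As a minimal sketch (Python 3, standard library only), collected records could be serialized to CSV like this before being archived and loaded for analysis. The record fields and the helper name are hypothetical, and the actual upload and load steps are left out.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical user events as they might be collected by the web app.
events = [
    {"user_id": "u1", "event": "click", "timestamp": 1340000000},
    {"user_id": "u2", "event": "view", "timestamp": 1340000060},
]
csv_text = records_to_csv(events, ["user_id", "event", "timestamp"])
print(csv_text)
```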

App Engine MapReduce

  • Open Source MapReduce Library
  • Built on top of App Engine Task Queues, Blobstore, and the Pipeline API
  • Allows you to concentrate on data transformation, not implementation details
  • Provides input/output tools for many "sources and sinks"

App Engine MapReduce

from mapreduce import base_handler, mapreduce_pipeline

class MyPipeline(base_handler.PipelineBase):
  def run(self, parameter):
    output = yield mapreduce_pipeline.MapreducePipeline(
        "my_job",                                       # Job Name (placeholder)
        "main.map_function",                            # A Mapper Function
        "main.reduce_function",                         # A Reduce Function
        "mapreduce.input_readers.DatastoreInputReader", # Data Source
        "mapreduce.output_writers.FileOutputWriter",    # Data Sink
        mapper_params={},                               # Custom Parameters for Mapper
        reducer_params={},                              # Custom Parameters for Reducer
        shards=16)                                      # Workers per Job
    # AnotherPipeline stands for a follow-on pipeline defined elsewhere.
    yield AnotherPipeline(output)
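
The pipeline above only names "main.map_function" and "main.reduce_function". As a hedged sketch, they could be generators of the following shape: the mapper is called once per input record and yields (key, value) pairs, and the reducer is called once per key with the list of values. `LogEntry`, its fields, and `run_locally` are hypothetical stand-ins so the example runs without App Engine; in the real library the work is spread across shards.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a Datastore entity.
LogEntry = namedtuple("LogEntry", ["country", "bytes_sent"])


def map_function(entry):
    """Emit one (country, bytes) pair per record; values are strings."""
    yield entry.country, str(entry.bytes_sent)


def reduce_function(key, values):
    """Sum all values for a key and emit one CSV output line."""
    total = sum(int(v) for v in values)
    yield "%s,%d\n" % (key, total)


def run_locally(entries):
    """Single-process driver that mimics map, shuffle, and reduce."""
    grouped = defaultdict(list)
    for entry in entries:
        for key, value in map_function(entry):
            grouped[key].append(value)
    lines = []
    for key in sorted(grouped):
        lines.extend(reduce_function(key, grouped[key]))
    return lines


output = run_locally(
    [LogEntry("US", 120), LogEntry("DE", 80), LogEntry("US", 30)])
print(output)  # ['DE,80\n', 'US,150\n']
```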


<Thank You!>

Michael Manoochehri.