Title: Google 'BigQuery' Support for 'sparklyr'
Description: A 'sparklyr' extension package providing an integration with Google 'BigQuery'. It supports direct import/export where records are directly streamed from/to 'BigQuery'. In addition, data may be imported/exported via intermediate data extracts on Google 'Cloud Storage'.
Authors: Mirai Solutions GmbH [aut], Martin Studer [cre], Nicola Lambiase [ctb], Omer Demirel [ctb]
Maintainer: Martin Studer <[email protected]>
License: GPL-3 | file LICENSE
Version: 0.1.1
Built: 2024-11-03 03:55:31 UTC
Source: https://github.com/miraisolutions/sparkbq
Sets default values for several Google BigQuery related settings.
bigquery_defaults(billingProjectId, gcsBucket, datasetLocation = "US",
  serviceAccountKeyFile = NULL, type = "direct")
billingProjectId: Default Google Cloud Platform project ID for billing purposes. This is the project on whose behalf to perform BigQuery operations.
gcsBucket: Google Cloud Storage (GCS) bucket to use for storing temporary files. Temporary files are used when importing through BigQuery load jobs and exporting through BigQuery extraction jobs (i.e. when using data extracts such as Parquet, Avro, ORC, ...). The service account specified in serviceAccountKeyFile needs to be given appropriate rights to this bucket.
datasetLocation: Geographic location where newly created datasets should reside. "EU" or "US". Defaults to "US".
serviceAccountKeyFile: Google Cloud service account key file to use for authentication with Google Cloud services. The use of service accounts is highly recommended. Specifically, the service account will be used to interact with BigQuery and Google Cloud Storage (GCS). If not specified (the default), Google Application Default Credentials (ADC) will be used.
type: Default BigQuery import/export type to use. Options include "direct", "parquet", "avro", "orc", "json" and "csv". Defaults to "direct". Please note that only "direct" and "avro" are supported for both importing and exporting. The table below summarizes the supported type and import/export combinations:

type | Import | Export
---|---|---
"direct" | yes | yes
"avro" | yes | yes
"parquet" | no | yes
"orc" | no | yes
"json" | yes | no
"csv" | yes | no
A list of set options with previous values.
https://github.com/miraisolutions/spark-bigquery
https://cloud.google.com/bigquery/pricing
https://cloud.google.com/bigquery/docs/dataset-locations
https://cloud.google.com/bigquery/docs/authentication/service-account-file
https://cloud.google.com/docs/authentication/
https://cloud.google.com/bigquery/docs/authentication/
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
See also: spark_read_bigquery, spark_write_bigquery, default_billing_project_id, default_gcs_bucket, default_dataset_location.
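For illustration, here is a minimal sketch (not run) of setting the package-wide defaults and inspecting the return value; all angle-bracketed values are placeholders for your own resources, and the location and type shown are arbitrary example choices:

library(sparkbq)

# Set package-wide defaults once, before reading or writing any data.
# "<...>" values are placeholders, not real resources.
previous_opts <- bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "EU",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "avro")

# The return value is a list of the options that were set,
# together with their previous values.
str(previous_opts)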
Returns the default BigQuery import/export type. It defaults to "direct".
default_bigquery_type()
Returns the default Google BigQuery billing project ID.
default_billing_project_id()
Returns the default Google BigQuery dataset location. It defaults to "US".
default_dataset_location()
https://cloud.google.com/bigquery/docs/dataset-locations
Returns the default Google BigQuery GCS bucket.
default_gcs_bucket()
Returns the default service account key file to use.
default_service_account_key_file()
https://cloud.google.com/bigquery/docs/authentication/service-account-file
https://cloud.google.com/docs/authentication/
https://cloud.google.com/bigquery/docs/authentication/
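The getters above simply report the currently configured defaults. A short illustrative sketch; the printed values depend on whether bigquery_defaults() has been called beforehand:

library(sparkbq)

# Inspect the currently configured defaults.
default_billing_project_id()
default_gcs_bucket()
default_dataset_location()           # "US" unless overridden
default_bigquery_type()              # "direct" unless overridden
default_service_account_key_file()   # NULL means application default credentials are used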
This function reads data stored in a Google BigQuery table.
spark_read_bigquery(sc, name, billingProjectId = default_billing_project_id(),
  projectId = billingProjectId, datasetId = NULL, tableId = NULL,
  sqlQuery = NULL, type = default_bigquery_type(),
  gcsBucket = default_gcs_bucket(),
  serviceAccountKeyFile = default_service_account_key_file(),
  additionalParameters = NULL, memory = FALSE, ...)
sc: A spark_connection provided by sparklyr.
name: The name to assign to the newly generated table (see also spark_read_source).
billingProjectId: Google Cloud Platform project ID for billing purposes. This is the project on whose behalf to perform BigQuery operations. Defaults to default_billing_project_id().
projectId: Google Cloud Platform project ID of the BigQuery dataset. Defaults to billingProjectId.
datasetId: Google BigQuery dataset ID (may contain letters, numbers and underscores). Either both of datasetId and tableId or sqlQuery must be specified.
tableId: Google BigQuery table ID (may contain letters, numbers and underscores). Either both of datasetId and tableId or sqlQuery must be specified.
sqlQuery: Google BigQuery SQL query. Either both of datasetId and tableId or sqlQuery must be specified.
type: BigQuery import type to use. Options include "direct", "avro", "json" and "csv". Defaults to default_bigquery_type().
gcsBucket: Google Cloud Storage (GCS) bucket to use for storing temporary files. Temporary files are used when importing through BigQuery load jobs and exporting through BigQuery extraction jobs (i.e. when using data extracts such as Parquet, Avro, ORC, ...). The service account specified in serviceAccountKeyFile needs to be given appropriate rights to this bucket. Defaults to default_gcs_bucket().
serviceAccountKeyFile: Google Cloud service account key file to use for authentication with Google Cloud services. The use of service accounts is highly recommended. Specifically, the service account will be used to interact with BigQuery and Google Cloud Storage (GCS). Defaults to default_service_account_key_file().
additionalParameters: Additional spark-bigquery options. See https://github.com/miraisolutions/spark-bigquery for more information.
memory: Boolean; should the data be loaded eagerly into memory (i.e. should the resulting table be cached)? Defaults to FALSE.
...: Additional arguments passed to spark_read_source.
A tbl_spark, which provides a dplyr-compatible reference to a Spark DataFrame.
https://github.com/miraisolutions/spark-bigquery
https://cloud.google.com/bigquery/docs/datasets
https://cloud.google.com/bigquery/docs/tables
https://cloud.google.com/bigquery/docs/reference/standard-sql/
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
https://cloud.google.com/bigquery/pricing
https://cloud.google.com/bigquery/docs/dataset-locations
https://cloud.google.com/docs/authentication/
https://cloud.google.com/bigquery/docs/authentication/
See also: spark_read_source, spark_write_bigquery, bigquery_defaults.
Other Spark serialization routines: spark_write_bigquery.
## Not run:
config <- spark_config()
sc <- spark_connect(master = "local", config = config)

bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct")

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
shakespeare <- spark_read_bigquery(
  sc,
  name = "shakespeare",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare")
## End(Not run)
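Beyond a table-based import, the sqlQuery argument can be used instead of datasetId and tableId. The following sketch (not run) assumes the same connection sc and defaults as in the example above and queries the public shakespeare sample table; the closing dplyr pipeline is purely illustrative:

## Not run:
library(dplyr)

# Import the result of a BigQuery standard SQL query instead of a full table.
shakespeare_top <- spark_read_bigquery(
  sc,
  name = "shakespeare_top",
  sqlQuery = paste(
    "SELECT word, SUM(word_count) AS total",
    "FROM `bigquery-public-data.samples.shakespeare`",
    "GROUP BY word ORDER BY total DESC LIMIT 100"))

# The result is a tbl_spark, so dplyr verbs apply directly.
shakespeare_top %>% head(10) %>% collect()
## End(Not run)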
This function writes data to a Google BigQuery table.
spark_write_bigquery(data, billingProjectId = default_billing_project_id(),
  projectId = billingProjectId, datasetId, tableId,
  type = default_bigquery_type(), gcsBucket = default_gcs_bucket(),
  datasetLocation = default_dataset_location(),
  serviceAccountKeyFile = default_service_account_key_file(),
  additionalParameters = NULL, mode = "error", ...)
data: Spark DataFrame to write to Google BigQuery.
billingProjectId: Google Cloud Platform project ID for billing purposes. This is the project on whose behalf to perform BigQuery operations. Defaults to default_billing_project_id().
projectId: Google Cloud Platform project ID of the BigQuery dataset. Defaults to billingProjectId.
datasetId: Google BigQuery dataset ID (may contain letters, numbers and underscores).
tableId: Google BigQuery table ID (may contain letters, numbers and underscores).
type: BigQuery export type to use. Options include "direct", "parquet", "avro" and "orc". Defaults to default_bigquery_type().
gcsBucket: Google Cloud Storage (GCS) bucket to use for storing temporary files. Temporary files are used when importing through BigQuery load jobs and exporting through BigQuery extraction jobs (i.e. when using data extracts such as Parquet, Avro, ORC, ...). The service account specified in serviceAccountKeyFile needs to be given appropriate rights to this bucket. Defaults to default_gcs_bucket().
datasetLocation: Geographic location where newly created datasets should reside. "EU" or "US". Defaults to "US". Only needs to be specified if the dataset does not yet exist; it is ignored if the dataset already exists.
serviceAccountKeyFile: Google Cloud service account key file to use for authentication with Google Cloud services. The use of service accounts is highly recommended. Specifically, the service account will be used to interact with BigQuery and Google Cloud Storage (GCS). Defaults to default_service_account_key_file().
additionalParameters: Additional spark-bigquery options. See https://github.com/miraisolutions/spark-bigquery for more information.
mode: Specifies the behavior when the data or table already exists. One of "overwrite", "append", "ignore" or "error" (default).
...: Additional arguments passed to spark_write_source.
NULL. This is a side-effecting function.
https://github.com/miraisolutions/spark-bigquery
https://cloud.google.com/bigquery/docs/datasets
https://cloud.google.com/bigquery/docs/tables
https://cloud.google.com/bigquery/docs/reference/standard-sql/
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc
https://cloud.google.com/bigquery/pricing
https://cloud.google.com/bigquery/docs/dataset-locations
https://cloud.google.com/docs/authentication/
https://cloud.google.com/bigquery/docs/authentication/
See also: spark_write_source, spark_read_bigquery, bigquery_defaults.
Other Spark serialization routines: spark_read_bigquery.
## Not run:
config <- spark_config()
sc <- spark_connect(master = "local", config = config)

bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct")

# Copy mtcars to Spark
spark_mtcars <- dplyr::copy_to(sc, mtcars, "spark_mtcars", overwrite = TRUE)

spark_write_bigquery(
  data = spark_mtcars,
  datasetId = "<your_dataset_id>",
  tableId = "mtcars",
  mode = "overwrite")
## End(Not run)
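Building on the example above, the following sketch (not run) appends the same Spark DataFrame to an existing table, this time exporting via an Avro extract on the configured GCS bucket instead of direct streaming; "<your_dataset_id>" remains a placeholder:

## Not run:
# Append to an existing table using a Cloud Storage Avro extract.
spark_write_bigquery(
  data = spark_mtcars,
  datasetId = "<your_dataset_id>",
  tableId = "mtcars",
  type = "avro",
  mode = "append")
## End(Not run)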