1,658 questions
1 vote · 0 answers · 14 views
Error upgrading Dataproc Serverless version from 2.1 to 2.2
I changed the runtime version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org....
1 vote · 0 answers · 27 views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline in which an on-premise NiFi flow streams JSON files to a GCS bucket. I have 5 tables, each with its own path, generating around 140k objects per day. The bucket ...
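A common mitigation for listing costs in a setup like the one above is to batch more files per micro-batch and trigger less often, since each micro-batch lists the input prefixes (Class A operations) and each output file is a Class A write. A minimal sketch, assuming hypothetical paths and a predefined `table_schema`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nifi-gcs-stream").getOrCreate()

# Hypothetical schema; declaring it avoids Spark re-reading files to infer one.
table_schema = StructType([StructField("id", StringType())])

stream = (
    spark.readStream
    .format("json")
    .schema(table_schema)
    .option("maxFilesPerTrigger", 1000)   # pick up many files per micro-batch
    .load("gs://my-bucket/table1/")       # hypothetical path
)

query = (
    stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "gs://my-bucket/checkpoints/table1/")
    .trigger(processingTime="5 minutes")  # list the prefix less frequently
    .start("gs://my-bucket/output/table1/")
)
```

Fewer, larger output files also cut the per-object write operations downstream.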
0 votes · 0 answers · 38 views
Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
...
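For cross-project reads like this, the spark-bigquery-connector lets you fully qualify the table and set which project is billed via `parentProject`. A sketch, assuming the job runs in a second project called `gcp-project2` (hypothetical ID):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-project-bq").getOrCreate()

df = (
    spark.read.format("bigquery")
    # fully qualified table in the other project
    .option("table", "gcp-project1.my_dataset.my_table")
    # project billed for the read (where the Dataproc job runs)
    .option("parentProject", "gcp-project2")
    .load()
)
```

The Dataproc service account in the running project also needs BigQuery read permissions (e.g. BigQuery Data Viewer on the dataset in gcp-project1).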
0 votes · 0 answers · 12 views
How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?
I'm trying to find out how to set the temp and staging buckets in the DataprocCreateClusterOperator. I've searched all over the internet and didn't find a good answer.
import pendulum
from datetime import timedelta
...
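The Dataproc `ClusterConfig` exposes the staging and temp buckets as top-level `config_bucket` and `temp_bucket` fields, which can be passed through the operator's `cluster_config`. A sketch with hypothetical project, region, and bucket names:

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "config_bucket": "my-staging-bucket",  # staging bucket (hypothetical)
    "temp_bucket": "my-temp-bucket",       # temp bucket (hypothetical)
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=CLUSTER_CONFIG,
)
```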
1 vote · 0 answers · 27 views
Spark first read from Cloud Storage is very slow on Dataproc Serverless
I'm running a simple Spark job on Dataproc Serverless, and the first step involves reading some CSV files from Cloud Storage in a standard way:
dfOne = spark
    .read()
    .format("csv")
    .load(&...
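One frequent cause of a slow first CSV read is schema inference, which forces Spark to scan the files once just to determine column types before the actual read. Supplying the schema up front removes that extra pass. A PySpark sketch with a hypothetical schema and path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# Hypothetical schema for the CSV files
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

df_one = (
    spark.read
    .format("csv")
    .schema(schema)              # skips the schema-inference pass
    .option("header", "true")
    .load("gs://my-bucket/input/*.csv")   # hypothetical path
)
```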
1 vote · 1 answer · 37 views
Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
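For bulk loads on this scale, the spark-bigquery-connector's indirect write method (stage to GCS, then a single BigQuery load job) often outperforms row-oriented writes. A sketch in PySpark form for brevity, with hypothetical table and bucket names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write").getOrCreate()

# Hypothetical source table
df = spark.read.format("bigquery").option("table", "my_dataset.source").load()

(
    df.write.format("bigquery")
    .option("table", "my_dataset.my_table")          # hypothetical target
    .option("writeMethod", "indirect")               # stage in GCS + BQ load job
    .option("temporaryGcsBucket", "my-temp-bucket")  # required for indirect writes
    .mode("append")
    .save()
)
```

The same options apply from Scala via `df.write.format("bigquery").option(...)`.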
0 votes · 1 answer · 43 views
Pyspark performance problems when writing to table in Bigquery
I am new to the world of PySpark, and I am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read, recommendations, using ...
0 votes · 0 answers · 34 views
What is this error: "Initial job has not accepted any resources; check cluster UI to ensure that workers are registered and have sufficient resources"?
I have created a Dataproc cluster on Google Cloud.
Below is my cluster config:
Master node : Standard (1 master, N workers)
Machine type: n2-standard-2
Number of GPU: 0
Primary disk ...
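This error usually means the job is requesting more executor memory or cores than the workers can provide, so no executor ever starts. On small machines such as n2-standard-2 (2 vCPUs, 8 GB), one option is to submit with executor sizes that fit; a sketch with hypothetical cluster and job names:

```shell
gcloud dataproc jobs submit pyspark job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.executor.cores=1,spark.executor.memory=2g
```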
2 votes · 0 answers · 33 views
Delta compatibility problem with PySpark in Dataproc GKE env
I've created a Dataproc cluster on GKE with a custom image running PySpark 3.5.0, but I can't get it to work with Delta.
The custom image docker file is this:
FROM us-central1-docker.pkg.dev/cloud-...
2 votes · 0 answers · 22 views
How can I install html5lib on a Dataproc cluster?
I have a Dataproc pipeline that does web scraping and stores the data in GCP.
The task setup is something like this:
create_dataproc_cluster = DataprocCreateClusterOperator(
task_id='...
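One way to get extra Python packages onto cluster nodes is Dataproc's `pip-install.sh` initialization action, driven by the `PIP_PACKAGES` metadata key, passed through the operator's `cluster_config`. A sketch with hypothetical project, region, and cluster names:

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "gce_cluster_config": {
        # packages installed on every node at cluster creation
        "metadata": {"PIP_PACKAGES": "html5lib beautifulsoup4"},
    },
    "initialization_actions": [
        {
            "executable_file": (
                "gs://goog-dataproc-initialization-actions-us-central1/"
                "python/pip-install.sh"
            )
        }
    ],
}

create_dataproc_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="scraper-cluster",
    cluster_config=CLUSTER_CONFIG,
)
```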
3 votes · 0 answers · 215 views
spark_catalog requires a single-part namespace in dbt python incremental model
Description:
Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery.
It ...
2 votes · 1 answer · 101 views
How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config:
conf = (
...
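Dataproc 2.2 ships Spark 3.5, which removed `StructType.toAttributes`, and delta-core 2.4.0 only targets Spark 3.4; that mismatch is the typical cause of this `NoSuchMethodError`. For Spark 3.5 the Delta artifact was renamed to `delta-spark` (3.x line). A sketch of a matching session config:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-etl")
    # On Spark 3.5, use the delta-spark 3.x artifact, not delta-core 2.x
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```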
1 vote · 0 answers · 24 views
Create external Hive table for multiline Json
I am trying to create a Hive table for the given multiline JSON, but the actual result does not match the expected result.
Sample JSON file:
{
"name": "Adil Abro",
"...
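Hive's built-in JsonSerDe expects one JSON record per line, so multiline documents like the sample above parse incorrectly. One workaround is to flatten the data with Spark first and point the external table at the rewritten location. A sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-json").getOrCreate()

# multiLine=True lets Spark parse records that span several lines
df = (
    spark.read
    .option("multiLine", "true")
    .json("gs://my-bucket/multiline.json")   # hypothetical path
)

# Rewrite as line-delimited JSON (one record per line), which the
# JsonSerDe handles; create the external Hive table over this location.
df.write.mode("overwrite").json("gs://my-bucket/flattened/")
```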
1 vote · 1 answer · 67 views
Not able to set the Spark log properties programmatically via Python
I was trying to suppress the Spark logging by specifying my own log4j.properties file:
gcloud dataproc jobs submit spark \
--cluster test-dataproc-cluster \
--region europe-north1 \
--files gs://...
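If the goal is just to quiet Spark's logging from Python rather than ship a log4j.properties file, the log level can be set on the SparkContext after the session starts:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()

# Programmatic alternative to a custom log4j.properties file:
# suppresses INFO-level driver/executor log noise from this point on.
spark.sparkContext.setLogLevel("WARN")
```

For Dataproc job submission, `gcloud dataproc jobs submit` also accepts a `--driver-log-levels` flag (e.g. `root=WARN`) to control driver logging without touching the code.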
1 vote · 0 answers · 50 views
Bigtable read and write using Dataproc on Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable using a Dataproc Compute Engine cluster and a PySpark job with the spark-bigtable-connector. I took an example from the spark-bigtable repo and ...