1 vote
0 answers
14 views

Error upgrading Dataproc Serverless version from 2.1 to 2.2

I have changed the version of a Dataproc Serverless from 2.1 to 2.2 and now when I run it I get the following error: Exception in thread "main" java.util.ServiceConfigurationError: org....
Chaos's user avatar
  • 11
1 vote
0 answers
27 views

How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?

I'm running a data pipeline where a NiFi on-premise flow writes JSON files in streaming to a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
Puredepatata's user avatar
0 votes
0 answers
38 views

Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc

I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python. I have two GCP projects: gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table ...
Henry Xiloj Herrera's user avatar
0 votes
0 answers
12 views

How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?

I'm trying to find out how to set the temp and staging buckets in the DataprocOperator. I've searched all over the internet and didn't find a good answer. import pendulum from datetime import timedelta ...
GuilhermeMP's user avatar
1 vote
0 answers
27 views

Spark first read from Cloud Storage is very slow on Dataproc Serverless

I'm running a simple Spark job on Dataproc Serverless, and the first step involves reading some CSV files from Cloud Storage in a standard way: dfOne = spark .read() .format("csv") .load(&...
uzarkov's user avatar
  • 27
1 vote
1 answer
37 views

Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)

I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
Sekar Ramu's user avatar
0 votes
1 answer
43 views

PySpark performance problems when writing to a table in BigQuery

I am new to the world of PySpark, and I am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read, recommendations, using ...
aleretgub's user avatar
0 votes
0 answers
34 views

What is this error: Initial job has not accepted any resources; check cluster UI to ensure that workers are registered and have sufficient resources

I have created a Dataproc cluster on Google Cloud. Below is my cluster config: Master node : Standard (1 master, N workers) Machine type: n2-standard-2 Number of GPU: 0 Primary disk ...
amarjeet kushwaha's user avatar
2 votes
0 answers
33 views

Delta compatibility problem with PySpark in Dataproc GKE env

I've created a Dataproc cluster using GKE and a custom image with PySpark 3.5.0, but I can't get it to work with Delta. The custom image Dockerfile is this: FROM us-central1-docker.pkg.dev/cloud-...
Pedro's user avatar
  • 21
2 votes
0 answers
22 views

How can I install html5lib on a Dataproc cluster?

I have a Dataproc pipeline that I use for web scraping and storing data in GCP. The task setting is something like this: create_dataproc_cluster = DataprocCreateClusterOperator( task_id='...
Sara 's user avatar
  • 75
3 votes
0 answers
215 views

spark_catalog requires a single-part namespace in dbt python incremental model

Description: Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery. It ...
Carlos Veríssimo's user avatar
2 votes
1 answer
101 views

How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'

Running a simple ETL PySpark job on Dataproc 2.2 with job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are set to default. I have the following config: conf = ( ...
dbkoop's user avatar
  • 61
1 vote
0 answers
24 views

Create external Hive table for multiline Json

I am trying to create a Hive table for a given multiline JSON file, but the actual result does not match the expected result. Sample JSON file: { "name": "Adil Abro", "...
yac's user avatar
  • 11
1 vote
1 answer
67 views

Not able to set the Spark log properties programmatically via Python

I was trying to suppress the Spark logging by specifying my own log4j.properties file. gcloud dataproc jobs submit spark \ --cluster test-dataproc-cluster \ --region europe-north1 \ --files gs://...
vikrant rana's user avatar
  • 4,557
1 vote
0 answers
50 views

Bigtable read and write using Dataproc with Compute Engine results in "Key not found"

I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job using spark-bigtable-connector. I got an example from spark-bigtable repo and ...
Suga Raj's user avatar
  • 541
