1,658 questions
1 vote · 0 answers · 14 views
Error upgrading Dataproc Serverless version from 2.1 to 2.2
I changed the runtime version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org....
1 vote · 0 answers · 27 views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline in which an on-premise NiFi flow streams JSON files to a GCS bucket. I have 5 tables, each with its own path, generating around 140k objects per day. The bucket ...
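A common mitigation for listing costs in a setup like the one above is to batch more files per micro-batch and trigger less often, since each micro-batch lists the input prefixes (Class A operations) and each output file is a Class A write. A minimal sketch, assuming hypothetical paths and a predefined `table_schema`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nifi-gcs-stream").getOrCreate()

# Hypothetical schema; declaring it avoids Spark re-reading files to infer one.
table_schema = StructType([StructField("id", StringType())])

stream = (
    spark.readStream
    .format("json")
    .schema(table_schema)
    .option("maxFilesPerTrigger", 1000)   # pick up many files per micro-batch
    .load("gs://my-bucket/table1/")       # hypothetical path
)

query = (
    stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "gs://my-bucket/checkpoints/table1/")
    .trigger(processingTime="5 minutes")  # list the prefix less frequently
    .start("gs://my-bucket/output/table1/")
)
```

Fewer, larger output files also cut the per-object write operations downstream.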
0 votes · 0 answers · 38 views
Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
...
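For cross-project reads like this, the spark-bigquery-connector lets you fully qualify the table and set which project is billed via `parentProject`. A sketch, assuming the job runs in a second project called `gcp-project2` (hypothetical ID):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-project-bq").getOrCreate()

df = (
    spark.read.format("bigquery")
    # fully qualified table in the other project
    .option("table", "gcp-project1.my_dataset.my_table")
    # project billed for the read (where the Dataproc job runs)
    .option("parentProject", "gcp-project2")
    .load()
)
```

The Dataproc service account in the running project also needs BigQuery read permissions (e.g. BigQuery Data Viewer on the dataset in gcp-project1).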
0 votes · 0 answers · 12 views
How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?
I'm trying to find out how to set the temp and staging buckets in the DataprocCreateClusterOperator. I've searched all over the internet and didn't find a good answer.
import pendulum
from datetime import timedelta
...
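The Dataproc `ClusterConfig` exposes the staging and temp buckets as top-level `config_bucket` and `temp_bucket` fields, which can be passed through the operator's `cluster_config`. A sketch with hypothetical project, region, and bucket names:

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "config_bucket": "my-staging-bucket",  # staging bucket (hypothetical)
    "temp_bucket": "my-temp-bucket",       # temp bucket (hypothetical)
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=CLUSTER_CONFIG,
)
```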
1 vote · 0 answers · 27 views
Spark first read from Cloud Storage is very slow on Dataproc Serverless
I'm running a simple Spark job on Dataproc Serverless, and the first step involves reading some CSV files from Cloud Storage in a standard way:
dfOne = spark
    .read()
    .format("csv")
    .load(&...
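One frequent cause of a slow first CSV read is schema inference, which forces Spark to scan the files once just to determine column types before the actual read. Supplying the schema up front removes that extra pass. A PySpark sketch with a hypothetical schema and path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# Hypothetical schema for the CSV files
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

df_one = (
    spark.read
    .format("csv")
    .schema(schema)              # skips the schema-inference pass
    .option("header", "true")
    .load("gs://my-bucket/input/*.csv")   # hypothetical path
)
```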
1 vote · 1 answer · 37 views
Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
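For bulk loads on this scale, the spark-bigquery-connector's indirect write method (stage to GCS, then a single BigQuery load job) often outperforms row-oriented writes. A sketch in PySpark form for brevity, with hypothetical table and bucket names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write").getOrCreate()

# Hypothetical source table
df = spark.read.format("bigquery").option("table", "my_dataset.source").load()

(
    df.write.format("bigquery")
    .option("table", "my_dataset.my_table")          # hypothetical target
    .option("writeMethod", "indirect")               # stage in GCS + BQ load job
    .option("temporaryGcsBucket", "my-temp-bucket")  # required for indirect writes
    .mode("append")
    .save()
)
```

The same options apply from Scala via `df.write.format("bigquery").option(...)`.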
0 votes · 1 answer · 43 views
Pyspark performance problems when writing to table in Bigquery
I am new to the world of PySpark, and I am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read, recommendations, using ...
0 votes · 0 answers · 34 views
What is this error: "Initial job has not accepted any resources; check cluster UI to ensure that workers are registered and have sufficient resources"?
I have created a Dataproc cluster on Google Cloud.
Below is my cluster config:
Master node : Standard (1 master, N workers)
Machine type: n2-standard-2
Number of GPU: 0
Primary disk ...
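This error usually means the job is requesting more executor memory or cores than the workers can provide, so no executor ever starts. On small machines such as n2-standard-2 (2 vCPUs, 8 GB), one option is to submit with executor sizes that fit; a sketch with hypothetical cluster and job names:

```shell
gcloud dataproc jobs submit pyspark job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.executor.cores=1,spark.executor.memory=2g
```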
2 votes · 0 answers · 33 views
Delta compatibility problem with PySpark in Dataproc GKE env
I've created a Dataproc cluster on GKE with a custom image running PySpark 3.5.0, but I can't get it to work with Delta.
The custom image docker file is this:
FROM us-central1-docker.pkg.dev/cloud-...
2 votes · 0 answers · 22 views
How can I install html5lib on a Dataproc cluster?
I have a Dataproc pipeline that does web scraping and stores the data in GCP.
The task setup is something like this:
create_dataproc_cluster = DataprocCreateClusterOperator(
task_id='...
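One way to get extra Python packages onto cluster nodes is Dataproc's `pip-install.sh` initialization action, driven by the `PIP_PACKAGES` metadata key, passed through the operator's `cluster_config`. A sketch with hypothetical project, region, and cluster names:

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "gce_cluster_config": {
        # packages installed on every node at cluster creation
        "metadata": {"PIP_PACKAGES": "html5lib beautifulsoup4"},
    },
    "initialization_actions": [
        {
            "executable_file": (
                "gs://goog-dataproc-initialization-actions-us-central1/"
                "python/pip-install.sh"
            )
        }
    ],
}

create_dataproc_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="scraper-cluster",
    cluster_config=CLUSTER_CONFIG,
)
```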
3 votes · 0 answers · 215 views
spark_catalog requires a single-part namespace in dbt python incremental model
Description:
Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery.
It ...
2 votes · 1 answer · 101 views
How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config:
conf = (
...
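Dataproc 2.2 ships Spark 3.5, which removed `StructType.toAttributes`, and delta-core 2.4.0 only targets Spark 3.4; that mismatch is the typical cause of this `NoSuchMethodError`. For Spark 3.5 the Delta artifact was renamed to `delta-spark` (3.x line). A sketch of a matching session config:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-etl")
    # On Spark 3.5, use the delta-spark 3.x artifact, not delta-core 2.x
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```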
1 vote · 0 answers · 24 views
Create external Hive table for multiline Json
I am trying to create a Hive table for the given multiline JSON, but the actual result does not match the expected result.
Sample JSON file:
{
"name": "Adil Abro",
"...
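Hive's built-in JsonSerDe expects one JSON record per line, so multiline documents like the sample above parse incorrectly. One workaround is to flatten the data with Spark first and point the external table at the rewritten location. A sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-json").getOrCreate()

# multiLine=True lets Spark parse records that span several lines
df = (
    spark.read
    .option("multiLine", "true")
    .json("gs://my-bucket/multiline.json")   # hypothetical path
)

# Rewrite as line-delimited JSON (one record per line), which the
# JsonSerDe handles; create the external Hive table over this location.
df.write.mode("overwrite").json("gs://my-bucket/flattened/")
```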
1 vote · 1 answer · 67 views
Not able to set the Spark log properties programmatically via Python
I was trying to suppress the Spark logging by specifying my own log4j.properties file:
gcloud dataproc jobs submit spark \
--cluster test-dataproc-cluster \
--region europe-north1 \
--files gs://...
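If the goal is just to quiet Spark's logging from Python rather than ship a log4j.properties file, the log level can be set on the SparkContext after the session starts:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()

# Programmatic alternative to a custom log4j.properties file:
# suppresses INFO-level driver/executor log noise from this point on.
spark.sparkContext.setLogLevel("WARN")
```

For Dataproc job submission, `gcloud dataproc jobs submit` also accepts a `--driver-log-levels` flag (e.g. `root=WARN`) to control driver logging without touching the code.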
1 vote · 0 answers · 50 views
Bigtable read and write using Dataproc on Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable using a Dataproc Compute Engine cluster and a PySpark job with the spark-bigtable-connector. I took an example from the spark-bigtable repo and ...