83 votes
7 answers
81k views

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Cloud Dataflow to implement an ETL data warehouse solution. Looking into Google Cloud's offerings, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit ...
KosiB • 1,146
54 votes
7 answers
57k views

Google Cloud Platform: how to monitor memory usage of VM instances

I have recently performed a migration to Google Cloud Platform, and I really like it. However, I can't find a way to monitor the memory usage of the Dataproc VM instances. As you can see on the ...
Daniele B • 20.2k
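A sketch of the usual fix: Dataproc VMs do not report memory metrics to Cloud Monitoring out of the box, so an agent has to be installed on each instance. The repo-script URL and package name below follow the legacy Stackdriver monitoring agent install docs and may have changed (the newer Ops Agent is the currently recommended route):

```shell
# Sketch: install the monitoring agent on a Dataproc VM (Debian-based image).
# URL and package name per the legacy Stackdriver agent docs; may be outdated.
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
sudo apt-get install -y stackdriver-agent
sudo service stackdriver-agent start
```

To cover a whole cluster, this would typically go into an initialization action so new workers pick it up automatically.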
23 votes
4 answers
3k views

job failing with ERROR: gcloud crashed (AttributeError): 'bool' object has no attribute 'lower' [closed]

We noticed our jobs are failing with the below error on the dataproc cluster. ERROR: gcloud crashed (AttributeError): 'bool' object has no attribute 'lower' If you would like to report this issue, ...
Poorna Noothalapati
17 votes
2 answers
19k views

Error: permission denied on resource project when launching Dataproc cluster

I was successfully able to launch a dataproc cluster by manually creating one via gcloud dataproc clusters create.... However, when I try to launch one through a script (that automatically provisions ...
claudiadast
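A hedged sketch of the common cause: the script runs as a service account that lacks the Dataproc roles your user account has. Granting them looks like this (`PROJECT_ID` and the service-account email are placeholders):

```shell
# Sketch: give the provisioning service account the role needed to create
# and manage Dataproc clusters in the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataproc.editor"

# It often also needs permission to act as the cluster VMs' service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"
```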
16 votes
4 answers
14k views

Where is the Spark UI on Google Dataproc?

What port should I use to access the Spark UI on Google Dataproc? I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln. The firewall is properly configured.
BAR • 16.8k
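A sketch of the standard route: on Dataproc, Spark runs on YARN, so there is no standalone master on 7077. Running applications' UIs are reached through the YARN ResourceManager on the master node (port 8088; the Spark history server sits on 18080), and since the firewall normally blocks those ports, the documented approach is an SSH SOCKS tunnel (cluster name and zone below are placeholders):

```shell
# Sketch: open a SOCKS proxy through the cluster master, then point a browser
# configured to use localhost:1080 at http://my-cluster-m:8088 (ResourceManager,
# links through to live Spark UIs) or http://my-cluster-m:18080 (history server).
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N
```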
16 votes
2 answers
5k views

Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud logging? As explained in the Dataproc docs the output from the job driver (the master for a Spark job) is available ...
Thomas Oldervoll
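A hedged sketch: by default the driver output lands in the cluster's staging bucket on GCS, but there is a Dataproc cluster property (name taken from the Dataproc logging docs; it may differ across image versions) that also routes job driver logs into Cloud Logging:

```shell
# Sketch: create a cluster whose job driver output is sent to Cloud Logging.
# Cluster name and region are placeholders; the property name follows the
# Dataproc docs and may vary by image version.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
```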
13 votes
3 answers
10k views

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

For example, I have a folder: / - test.py - test.yml, and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py". In test.py, I want ...
lucemia • 6,547
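A minimal sketch of the usual answer: files passed with `--files` are staged into the working directory of the YARN containers, so they can be opened by bare filename (no path):

```python
# test.py -- minimal sketch. Assumes the job was submitted with
#   gcloud dataproc jobs submit pyspark --files=test.yml test.py
# Files passed via --files are staged into the current working directory
# of the driver and executors, so opening them by basename works.

def load_staged_file(name="test.yml"):
    # No directory component: the staged file sits in the container's
    # working directory, next to the driver script.
    with open(name) as f:
        return f.read()
```

From there, the contents can be fed to a YAML parser as usual.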
13 votes
4 answers
22k views

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I am getting the ...
Yaniv Donenfeld
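A sketch of the usual fix: Dataproc images ship the GCS connector, but a local machine does not, so the `gs://` filesystem has to be put on the classpath and registered by hand. The jar and keyfile paths below are placeholders; the class names come from Google's gcs-connector:

```shell
# Sketch: run the job locally with the GCS connector jar on the classpath
# and the gs:// scheme mapped to the connector's filesystem classes.
spark-submit \
  --jars /path/to/gcs-connector-hadoop2-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  my_job.py
```

The same `fs.gs.impl` / `fs.AbstractFileSystem.gs.impl` pairs can instead go into `core-site.xml` if you prefer not to pass them per job.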
12 votes
1 answer
14k views

How to install python packages in a Google Dataproc cluster

Is it possible to install Python packages in a Google Dataproc cluster after the cluster is created and running? I tried to use "pip install xxxxxxx" in the master command line but it does not seem ...
Pablo Brenner
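A hedged sketch of the two usual routes (cluster names, zone, region bucket token, and package list are all placeholders): on a running cluster, install per node with sudo over SSH; for new clusters, bake the packages in with the public pip-install initialization action:

```shell
# Sketch 1: on an already-running cluster, install on each node via SSH
# (repeat for every worker, not just the master).
gcloud compute ssh my-cluster-m --zone=us-central1-a \
  --command='sudo pip install numpy pandas'

# Sketch 2: for a new cluster, use the pip-install initialization action
# from the dataproc-initialization-actions repository.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
  --metadata='PIP_PACKAGES=numpy pandas'
```

The initialization-action route is the one that survives autoscaling: nodes added later run the same script automatically.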
12 votes
2 answers
10k views

Which HBase connector for Spark 2.0 should I use? [closed]

Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. The Spark 2.0 and the new DataSet API support is ...
ogen • 802
12 votes
1 answer
48k views

org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times

I am using Google Cloud Dataproc to run Spark jobs and my editor is Zeppelin. I was trying to write JSON data into a GCS bucket. It succeeded before when I tried a 10 MB file, but failed with a 10 GB file. My ...
wwwwan • 407
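A hedged sketch of a common first remedy: a task failing 4 times on a large write often means a few oversized partitions exhaust executor memory or local disk, so spreading the data over more, smaller partitions before the write is worth trying. Partition count and bucket paths below are placeholders to tune for the cluster:

```python
# Sketch: repartition before writing so no single task materializes a huge
# slice of the 10 GB dataset. Requires a live SparkSession on the cluster.
df = spark.read.json("gs://my-bucket/input/")
df.repartition(512).write.mode("overwrite").json("gs://my-bucket/output/")
```

If the failure persists, the next knobs are usually `spark.executor.memory` and `spark.executor.memoryOverhead` in the job's properties.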
11 votes
4 answers
27k views

spark.sql.crossJoin.enabled for Spark 2.x

I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a cartesian product. Since version 2.0.0 there has been a Spark configuration ...
Stijn • 459
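A sketch of how to flip that flag for a single Dataproc job rather than the whole cluster (script and cluster names are placeholders):

```shell
# Sketch: enable cartesian products for one submission via job properties.
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --properties=spark.sql.crossJoin.enabled=true
```

Inside the job it can equally be set with `spark.conf.set("spark.sql.crossJoin.enabled", "true")`, and later 2.x releases added an explicit `Dataset.crossJoin` that sidesteps the flag entirely.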
11 votes
1 answer
13k views

Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured. I receive the following error when running "spark-shell" from the shell (locally on the ...
habitats • 2,391
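A hedged sketch of how to override the generated YARN settings at cluster-creation time using Dataproc's prefixed `--properties` syntax (the memory values below are placeholders that must be sized to the machine type):

```shell
# Sketch: pin the YARN container memory limits and the Spark executor size
# instead of relying on Dataproc's automatic computation.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties='yarn:yarn.nodemanager.resource.memory-mb=12288,yarn:yarn.scheduler.maximum-allocation-mb=12288,spark:spark.executor.memory=4g'
```

The `yarn:` / `spark:` prefixes route each key into `yarn-site.xml` and `spark-defaults.conf` respectively.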
11 votes
2 answers
4k views

Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "native and automatic integrations with BigQuery". I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc ...
Graham Polley
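A hedged sketch using the spark-bigquery connector, which is the simpler of the integration routes (it postdates the oldest questions here); the table name is a placeholder, and the connector jar must be on the classpath, e.g. via `--jars` at submit time:

```python
# Sketch: read a BigQuery table into a Spark DataFrame via the
# spark-bigquery connector. Requires a cluster with the connector jar.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())
df.groupBy("some_column").count().show()
```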
11 votes
1 answer
2k views

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output format https://...
Luca Fiaschi • 3,205
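A sketch following the BigQuery Hadoop connector examples: each record comes back as a (key, JSON string) pair via `newAPIHadoopRDD`. Project and bucket names are placeholders; the class names come from Google's bigquery-connector, which Dataproc images ship:

```python
# Sketch: load a BigQuery table as an RDD of parsed rows. Requires a
# Dataproc cluster with the BigQuery Hadoop connector installed.
import json
from pyspark import SparkContext

sc = SparkContext()
conf = {
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-temp-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq_staging",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)
# Each value is the JSON serialization of one table row.
rows = table_rdd.map(lambda kv: json.loads(kv[1]))
```

The connector stages the table through the named GCS path, so that bucket must be writable by the cluster's service account.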
