1,658 questions

83 votes · 7 answers · 81k views
What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?
I am using Google Cloud Dataflow to implement an ETL data warehouse solution.
Looking into the Google Cloud offerings, it seems Dataproc can also do the same thing.
It also seems Dataproc is a little bit ...
54 votes · 7 answers · 57k views
Google Cloud Platform: how to monitor memory usage of VM instances
I have recently performed a migration to Google Cloud Platform, and I really like it.
However, I can't find a way to monitor the memory usage of the Dataproc VM instances. As you can see on the ...
23 votes · 4 answers · 3k views
job failing with ERROR: gcloud crashed (AttributeError): 'bool' object has no attribute 'lower' [closed]
We noticed our jobs are failing with the below error on the dataproc cluster.
ERROR: gcloud crashed (AttributeError): 'bool' object has no attribute 'lower'
If you would like to report this issue, ...
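For context, this AttributeError is an ordinary Python failure mode: a boolean reaches code that expects a string and calls `.lower()` on it. A minimal sketch (using a hypothetical `normalize_flag` helper, not gcloud's actual code) reproduces the same message:

```python
# Hypothetical flag parser: assumes the value is a string like "True" or "false".
def normalize_flag(value):
    return value.lower() == "true"

print(normalize_flag("True"))   # works as intended: True

try:
    # A real bool sneaks in, e.g. from a parsed YAML/JSON config value.
    normalize_flag(True)
except AttributeError as e:
    print(e)                    # 'bool' object has no attribute 'lower'
```

The usual workaround on the gcloud side is to quote the offending value so it stays a string.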
17 votes · 2 answers · 19k views
Error: permission denied on resource project when launching Dataproc cluster
I was successfully able to launch a dataproc cluster by manually creating one via gcloud dataproc clusters create.... However, when I try to launch one through a script (that automatically provisions ...
16 votes · 4 answers · 14k views
Where is the Spark UI on Google Dataproc?
What port should I use to access the Spark UI on Google Dataproc?
I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln.
Firewall is properly configured.
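One common approach here: on Dataproc, Spark runs on YARN, so the Spark UI is reached through the YARN ResourceManager on the master node (port 8088) rather than standalone port 4040/7077, and external access is typically blocked, so an SSH tunnel is used. A sketch with hypothetical cluster and zone names:

```shell
# Open a SOCKS proxy to the cluster's master node (name is <cluster>-m):
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N

# Then point a browser through the proxy, e.g.:
#   chrome --proxy-server="socks5://localhost:1080" http://my-cluster-m:8088
# and follow the ApplicationMaster link for the running Spark job.
```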
16 votes · 2 answers · 5k views
Output from Dataproc Spark job in Google Cloud Logging
Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available ...
13 votes · 3 answers · 10k views
When submitting a job with pyspark, how do I access static files uploaded with the --files argument?
For example, I have a folder:
/
- test.py
- test.yml
and the job is submitted to the Spark cluster with:
gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"
In test.py, I want ...
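A hedged sketch of how a staged file like this is usually reached (file names taken from the question; cluster name is hypothetical): files passed via --files are copied into the job's working directory on the driver, and `SparkFiles.get` resolves them on executors.

```shell
# Submit with the YAML staged alongside the driver:
gcloud dataproc jobs submit pyspark --cluster=my-cluster --files=test.yml test.py

# Inside test.py:
#   - on the driver, a relative open typically works:  open("test.yml")
#   - on executors, resolve via:  from pyspark import SparkFiles
#                                 SparkFiles.get("test.yml")
```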
13 votes · 4 answers · 22k views
"No Filesystem for Scheme: gs" when running spark job locally
I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder)
When running the job locally on my Mac machine, I am getting the ...
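The "No FileSystem for scheme: gs" error usually means the local Hadoop installation has no handler registered for gs:// URLs. A sketch of the usual fix, assuming the GCS connector jar is already on the classpath (the keyfile path is hypothetical):

```xml
<!-- core-site.xml: register the GCS connector's filesystem implementations -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile.json</value>
</property>
```

On Dataproc itself this is preconfigured, which is why the job only fails when run locally.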
12 votes · 1 answer · 14k views
How to install Python packages in a Google Dataproc cluster
Is it possible to install Python packages in a Google Dataproc cluster after the cluster is created and running?
I tried to use "pip install xxxxxxx" in the master command line but it does not seem ...
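Two approaches are common here. On an already-running cluster, the package must be installed on every node (SSH to each and run pip), because a pip install on the master alone does not reach the workers. For new clusters, an initialization action installs packages on all nodes at creation time. A sketch with hypothetical bucket and script names:

```shell
# install-packages.sh (uploaded to GCS) might contain:
#   #!/bin/bash
#   pip install pandas numpy

gcloud dataproc clusters create my-cluster \
    --initialization-actions=gs://my-bucket/install-packages.sh
```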
12 votes · 2 answers · 10k views
Which HBase connector for Spark 2.0 should I use? [closed]
Our stack is composed of Google Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector working with these versions.
The Spark 2.0 and the new DataSet API support is ...
12 votes · 1 answer · 48k views
org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times
I am using Google Cloud Dataproc to run a Spark job, and my editor is Zeppelin. I was trying to write JSON data into a GCS bucket. It succeeded when I tried a 10 MB file, but failed with a 10 GB file. My ...
11 votes · 4 answers · 27k views
spark.sql.crossJoin.enabled for Spark 2.x
I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a Cartesian product. Since version 2.0.0 there has been a Spark configuration ...
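The configuration in question, `spark.sql.crossJoin.enabled`, can be set either per session or at job submission. A sketch (job file name is hypothetical):

```shell
# Enable Cartesian products at submit time:
gcloud dataproc jobs submit pyspark my_job.py \
    --properties=spark.sql.crossJoin.enabled=true

# Or from inside a Spark 2.x session:
#   spark.conf.set("spark.sql.crossJoin.enabled", "true")
```

In Spark 2.1+, an explicit `crossJoin()` call on the DataFrame avoids the flag entirely.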
11 votes · 1 answer · 13k views
Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster
I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to Yarn being misconfigured.
I receive the following error when running "spark-shell" from the shell (locally on the ...
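When Dataproc's generated YARN/Spark memory settings don't fit, they can be overridden at launch rather than by editing the cluster's config files. A sketch (the values are illustrative, not recommendations):

```shell
# Override the autogenerated memory settings for one session:
spark-shell \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --conf spark.yarn.executor.memoryOverhead=512
```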
11 votes · 2 answers · 4k views
Dataproc + BigQuery examples - any available?
According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc ...
11 votes · 1 answer · 2k views
BigQuery connector for pyspark via Hadoop Input Format example
I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://...