Databricks Certified Developer for Apache Spark Scala Practice Test 3
Question 1 of 60
1. Question
You need to sort a DataFrame df that has some null values in column a. You want the null values to appear last, and the remaining rows to be ordered ascending by column a. Choose the right code block to achieve your goal.
It is not possible to sort when there are null values in the specified column.
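A minimal sketch of the nulls-last ascending sort being tested here (assuming the column is literally named a):
import org.apache.spark.sql.functions.col
// asc_nulls_last sorts ascending and pushes nulls to the end
val sorted = df.orderBy(col("a").asc_nulls_last)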
Question 2 of 60
2. Question
The code block shown below contains an error. The code block is intended to write a text file to the path. What should we add to part 1 in order to fix it?
val a = Array(1002, 3001, 4002, 2003, 2002, 3004, 1003, 4006)
val b = spark.createDataset(a).withColumn("a", col("value") % 1000).withColumn("b", col("value") % 1000)
_1_
df.write.text(my_file.txt)
Explanation: When you write a text file, you need to be sure to have only one string column; otherwise, the write will fail.
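For instance, a minimal sketch reusing the Dataset b built in the question (the file name is the one given there):
import org.apache.spark.sql.functions.col
// keep exactly one string column before writing as text
b.select(col("value").cast("string")).write.text("my_file.txt")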
Question 3 of 60
3. Question
The code block shown below should return a new DataFrame with a new column named casted whose value is the string equivalent of column a which is an integer column. This dataframe should contain all the previously existing columns from DataFrame df. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: df._1_(_2_, _3_)
Explanation: Read the questions and responses carefully! You will have many questions like this one; try to visualise it and write it down if it helps. There are always quotes around the column name, and you need to use .cast to cast a column.
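One plausible shape of the completed code block (a sketch, not necessarily the exam's exact wording):
import org.apache.spark.sql.functions.col
// blank 1 = withColumn, blank 2 = the new column name, blank 3 = the cast expression
val result = df.withColumn("casted", col("a").cast("string"))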
Question 4 of 60
4. Question
Given the code block below, run against a database test containing nulls, identify the error.
val my_udf = (s: String) => { if (s != null) s.length() else 0 }
spark.udf.register("strlen", my_udf)
spark.sql("select s from test where s is not null and strlen(s) > 1")
We need to create the function first and then pass it to udf.register
This WHERE clause does not guarantee the strlen UDF to be invoked after filtering out nulls, so we will have a null pointer exception.
Explanation: Spark SQL (including SQL and the DataFrame and Dataset APIs) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right short-circuiting semantics. To perform proper null checking, we recommend that you do either of the following: make the UDF itself null-aware and do null checking inside the UDF itself, or use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch.
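A sketch of the two recommended fixes, reusing the strlen UDF registered in the question:
// option 1: rely on a null-aware UDF (the question's UDF already returns 0 for null)
spark.sql("select s from test where strlen(s) > 1")
// option 2: guard the UDF call with IF so it is only invoked on non-null input
spark.sql("select s from test where if(s is not null, strlen(s), null) > 1")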
Question 5 of 60
5. Question
Given the following DataFrame:
val rawData = Seq(("A", 20), ("B", 30), ("C", 80))
val df = spark.createDataFrame(rawData).toDF("Letter", "Number")
We want to store the sum of all numbers in a variable result. Choose the correct code block in order to achieve this goal.
Explanation: Here is the explanation.
1) df.groupBy().sum() results in a DataFrame of type DataFrame[sum(Number): bigint]. If we show the resulting DataFrame, it would be:
+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+
2) df.groupBy().sum().collect() does a collect on the summed DataFrame. Remember that collect returns a list of rows (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.collect.html#pyspark.sql.DataFrame.collect). Here is the resulting list of rows: [Row(sum(Number)=130)]
3) df.groupBy().sum().collect()[0] gives us the first (and only) Row object: Row(sum(Number)=130)
4) df.groupBy().sum().collect()[0][0] extracts the value from that row, which for our example is 130. The fields in rows can be accessed a) like attributes (row.key) or b) like dictionary values (row[key]); in this example we access the field by position, hence the second [0] (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Row.html#pyspark.sql.Row).
Another way to do it is:
import pyspark.sql.functions as F
df.agg(F.sum("Number")).collect()[0][0]
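The walkthrough above uses the PySpark API; a rough Scala equivalent (a sketch, not necessarily the exam's exact answer) is:
import org.apache.spark.sql.functions.sum
// collect() returns Array[Row]; index into the first row, then the first field
val result = df.groupBy().sum().collect()(0)(0)
// or, with an explicit aggregation and a typed getter
val result2 = df.agg(sum("Number")).first().getLong(0)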
Question 6 of 60
6. Question
Which of the following code blocks returns a DataFrame with two new columns a and b from the existing column aSquared, where the values of a and b are half of the column aSquared?
Explanation: Familiarize yourself with the syntax of withColumn and withColumnRenamed.
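One plausible shape of the intended answer (a sketch, assuming df already contains the column aSquared):
import org.apache.spark.sql.functions.col
val result = df.withColumn("a", col("aSquared") / 2).withColumn("b", col("aSquared") / 2)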
Question 7 of 60
7. Question
What is the best description of a catalog?
Explanation: The highest level abstraction in Spark SQL is the Catalog. The Catalog is an abstraction for the storage of metadata about the data stored in your tables, as well as other helpful things like databases, tables, functions, and views.
Question 8 of 60
8. Question
At which stage does the Catalyst optimizer generate one or more physical plans?
Question 9 of 60
9. Question
The following statement will create a managed table:
dataframe.write.saveAsTable("unmanaged_my_table")
Explanation: One important note is the concept of managed versus unmanaged tables. Tables store two important pieces of information: the data within the tables as well as the data about the tables, that is, the metadata. You can have Spark manage the metadata for a set of files as well as for the data. When you define a table from files on disk, you are defining an unmanaged table. When you use saveAsTable on a DataFrame, you are creating a managed table for which Spark will keep track of all of the relevant information.
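A sketch of the two flavours (the table names and the path here are illustrative, not from the question):
// managed table: Spark manages both the data and the metadata
df.write.saveAsTable("my_managed_table")
// unmanaged (external) table: the data stays at the external location; Spark tracks only metadata
spark.sql("CREATE TABLE my_unmanaged_table (id BIGINT) USING parquet LOCATION '/tmp/some/path'")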
Question 10 of 60
10. Question
There is a temp view named my_view. If I want to query this view within Spark, which command should I choose?
Explanation: Global temp views are accessed via the global_temp prefix, and other tables and views are accessed without any prefix.
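For instance (a sketch; the view names are illustrative):
df.createOrReplaceTempView("my_view")
spark.sql("select * from my_view")                    // regular temp view: no prefix
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("select * from global_temp.my_global_view") // global temp view: global_temp prefix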
Question 11 of 60
11. Question
What happens at a stage boundary in Spark?
Explanation: At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage. Because they incur heavy disk and network I/O, stage boundaries can be expensive and should be avoided when possible.
Question 12 of 60
12. Question
Consider the following DataFrame:
val rawData = Seq(
  (1, 1000, "Apple", 0.76),
  (2, 1000, "Apple", 0.11),
  (1, 2000, "Orange", 0.98),
  (1, 3000, "Banana", 0.24),
  (2, 3000, "Banana", 0.99)
)
var df = spark.createDataFrame(rawData).toDF("UserKey", "ItemKey", "ItemName", "Score")
df = df.repartition(8)
And we apply this code block:
df.rdd.getNumPartitions
What will we see?
Explanation: If you don't specify a number of partitions, Spark normally tries to set the number of partitions automatically based on your cluster, but here we explicitly asked for 8 partitions after creating the DataFrame.
Question 13 of 60
13. Question
If we want to create the constant string 1 as a new column new_column in the DataFrame df, which code block should we select?
Explanation: The second argument of DataFrame.withColumn should be a Column, so you have to use a literal to add the constant value 1:
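A minimal sketch of the intended pattern:
import org.apache.spark.sql.functions.lit
// lit() wraps a constant into a Column; "1" keeps it a string, as the question asks
val result = df.withColumn("new_column", lit("1"))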
Question 14 of 60
14. Question
The code block shown below contains an error. Identify the error.
val squared = (s: Long) => { s * s }
spark.udf.register("square", squared)
spark.range(1, 20).createOrReplaceTempView("test")
spark.sql("select id, square(id) as id_squared from temp_test")
There is no column id created in the database.
There is no error in the code.
Explanation: You need to query the right table: the view is registered as test, but the query selects from temp_test. Read the questions carefully!
Question 15 of 60
15. Question
If Spark is running in cluster mode, which of the following statements about nodes is correct?
Explanation: In cluster mode, a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes.
Question 16 of 60
16. Question
The code block shown below intends to return a new DataFrame with column old renamed to new, but it contains an error. Identify the error.
df.withColumnRenamed(old, new)
You need to reverse the parameters and add quotes, so the correct code block is df.withColumnRenamed(new, old).
Explanation: You need to be really familiar with the syntax of withColumn and withColumnRenamed for the exam. Learn them very well.
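For reference, a sketch of the signature order (existing name first, new name second, both as strings):
val renamed = df.withColumnRenamed("old", "new")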
Question 17 of 60
17. Question
The code block shown below should return a new DataFrame with 25 percent of random records from DataFrame df, sampled with replacement. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: df._1_(_2_, _3_, _4_)
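A plausible way the blanks fill in (a sketch; the seed value is illustrative):
// blank 1 = sample, blank 2 = withReplacement, blank 3 = fraction, blank 4 = seed
val sampled = df.sample(true, 0.25, 42L)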
Question 18 of 60
18. Question
Which of the following is correct for the cluster manager?
Which of the following code blocks reads from a TSV file where values are separated with \t?
Explanation: With Spark 2.0+ we can use the CSV connector to read a TSV file.
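For instance (a sketch; the path is illustrative):
// the CSV reader handles any single-character separator, including tab
val tsv = spark.read.option("sep", "\t").csv("/path/to/file.tsv")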
Question 21 of 60
21. Question
How can we make sure that DataFrame df has 8 partitions, given that it currently has 4 partitions?
Explanation: The correct syntax is df.repartition(8); you cannot increase the number of partitions with df.coalesce(8).
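For instance (a sketch):
// repartition can increase or decrease the partition count (full shuffle);
// coalesce can only decrease it, so coalesce(8) on a 4-partition DataFrame has no effect
val df8 = df.repartition(8)
df8.rdd.getNumPartitions  // 8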
Question 22 of 60
22. Question
We have an unmanaged table my_table. If we run the code block below:
spark.sql("DROP TABLE IF EXISTS my_table")
What will happen to the data in my_table?
Explanation: If you are dropping an unmanaged table, no data will be removed, but you will no longer be able to refer to this data by the table name.
If Spark is running in client mode, which of the following statements is NOT correct?
Explanation: Client mode is nearly the same as cluster mode, except that the Spark driver remains on the client machine that submitted the application.
Question 25 of 60
25. Question
You need to transform a column named timestamp to a date format. Assume that the column timestamp is compatible with the date format. You have written the code block below, but it contains an error. Identify and fix it.
Question 26 of 60
26. Question
Choose the right code block to add a new column to the following schema.
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
Explanation: The correct syntax is schema.add("new_column", StringType(), True).
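The line above is the PySpark form; a rough Scala equivalent (a sketch, with an illustrative existing schema) would be:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(Seq(StructField("existing_column", StringType, nullable = true)))
// add returns a new StructType with the extra field appended
val newSchema = schema.add("new_column", StringType, nullable = true)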
Question 27 of 60
27. Question
Which property is used to allocate jobs to different resource pools to achieve resource scheduling within an application?
Explanation: If you would like to run multiple Spark applications on the same cluster, Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application can give resources back to the cluster if they are no longer used, and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster. This feature is disabled by default and available on all coarse-grained cluster managers, that is, standalone mode, YARN mode, and Mesos coarse-grained mode.
There are two requirements for using this feature. First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application. The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them.
The Spark Fair Scheduler specifies resource pools and allocates jobs to different resource pools to achieve resource scheduling within an application. In this way, the computing resources are used effectively and the runtime of jobs is balanced, ensuring that subsequently-submitted jobs are not affected by over-loaded jobs.
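A sketch of how jobs get assigned to a pool under the fair scheduler (the pool name is illustrative):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")  // enable the fair scheduler
  .getOrCreate()
// spark.scheduler.pool is the per-thread local property that routes jobs to a pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")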
Question 28 of 60
28. Question
Given the following DataFrame:
val rawData = Seq(
  (1, 1000, "Apple", 0.76),
  (2, 1000, "Apple", 0.11),
  (1, 2000, "Orange", 0.98),
  (1, 3000, "Banana", 0.24),
  (2, 3000, "Banana", 0.99)
)
var df = spark.createDataFrame(rawData).toDF("UserKey", "ItemKey", "ItemName", "Score")
df = df.repartition(8)
We execute the following code block:
df.write.mode("overwrite").option("compression", "snappy").save("path")
Choose the correct number of files after a successful write operation.
Explanation: We control the parallelism of the files that we write by controlling the partitions prior to writing; therefore, the number of partitions before writing equals the number of files created by the write operation. If you don't specify a number of partitions, Spark normally tries to set it automatically based on your cluster, but here we explicitly asked for 8 partitions after creating the DataFrame.
Question 29 of 60
29. Question
Which of the following is NOT a useful use case for Spark?
Explanation: It is preferable to process big data sets distributed across a cluster in parallel with Spark.
Explanation: The dense_rank() window function is used to get the rank of rows within a window partition without any gaps. It is similar to the rank() function, the difference being that rank() leaves gaps in the ranking when there are ties.
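A sketch of both functions over the same window (the column names here are illustrative, not from the original question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank}
val w = Window.partitionBy("department").orderBy(col("salary").desc)
df.withColumn("rank", rank().over(w))             // 1, 2, 2, 4, ... (gaps after ties)
  .withColumn("dense_rank", dense_rank().over(w)) // 1, 2, 2, 3, ... (no gaps)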
Question 31 of 60
31. Question
Which of the following code blocks merges two DataFrames df1 and df2?
Explanation: DataFrames are immutable. This means users cannot append to DataFrames, because that would be changing them. To append to a DataFrame, you must union the original DataFrame with the new DataFrame.
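For instance (a sketch; both DataFrames are assumed to share the same schema):
// union returns a new DataFrame containing the rows of both inputs
val merged = df1.union(df2)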
Question 32 of 60
32. Question
Which of the following three operations are classified as wide transformations? Choose 3 answers:
Question 33 of 60
33. Question
Which of the following transformations is not evaluated lazily?
Explanation: All transformations are evaluated lazily in Spark.
Question 34 of 60
34. Question
Given the following statements regarding caching:
1: The default storage level for a DataFrame is StorageLevel.MEMORY
2: The DataFrame class does have an unpersist() operation
3: The persist() method needs an action to load data from its source to materialize the DataFrame in cache
Which one is NOT TRUE?
Question 35 of 60
35. Question
Which of the following operations can be used to create a new DataFrame with only the column a from the existing DataFrame df?
Explanation: The correct answer here is to select one column: we just select the one column and get a new DataFrame containing it.
Question 36 of 60
36. Question
Choose the correct code block to unpersist a table named table.
Explanation: The correct usage is spark.catalog.uncacheTable(tableName). To remove the data from the cache, just call spark.sql("uncache table table_name") or spark.catalog.uncacheTable("table_name"). To unpersist a DataFrame, use df.unpersist(). Another thing to remember is that when using DataFrame.persist(), data on disk is always serialized.
Question 37 of 60
37. Question
Choose the valid execution modes from the following responses.
Explanation: An execution mode gives you the power to determine where the aforementioned resources are physically located when you go to run your application. You have three modes to choose from: cluster mode, client mode, and local mode. Standalone is one of the cluster manager types, not an execution mode.
Question 38 of 60
38. Question
Choose the equivalent code block to:
df.filter(col("count") < 2)
where df is a valid DataFrame which has a column named count.
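Some equivalent formulations (a sketch; any of these could be the intended answer):
import org.apache.spark.sql.functions.col
df.where(col("count") < 2)  // where is an alias of filter
df.filter("count < 2")      // SQL-style condition string
df.where("count < 2")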
Question 39 of 60
39. Question
If we want to store the RDD as serialized Java objects in the JVM and, if the RDD does not fit in memory, store the partitions that don't fit on disk and read them from there when they're needed, which storage level do we need to choose?
Your manager gave you a task to remove sensitive data. Choose the correct code block below to remove name and city from the DataFrame.
val data = Seq(
  ("Josh", "Berlin", 25, "M"),
  ("Adam", "Paris", 34, "M")
)
val df = spark.createDataFrame(data).toDF("name", "city", "age", "gender")
Explanation: The correct usage of drop is the following:
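A minimal sketch of the intended answer:
// drop accepts multiple column names and returns a new DataFrame without them
val cleaned = df.drop("name", "city")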
Question 42 of 60
42. Question
Select the code block which counts the distinct number of quantity values for each invoiceNo in the DataFrame df.
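One plausible shape of the intended answer (a sketch using the column names from the question):
import org.apache.spark.sql.functions.countDistinct
df.groupBy("invoiceNo").agg(countDistinct("quantity"))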
Question 43 of 60
43. Question
Which of the following code blocks changes the parquet file content, given that a file already exists with the name we want to write?
Explanation: Parquet is the default file format. If you don't include the format() method, the DataFrame will still be saved as a Parquet file. If we don't include mode overwrite, our application will crash, since a file already exists with the same name.
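For instance (a sketch; the output path is illustrative):
// overwrite replaces the existing output; parquet is the default format, so format("parquet") is optional
df.write.mode("overwrite").parquet("/path/to/output")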
Question 44 of 60
44. Question
Which of the following are true for the driver?
Explanation: The driver is the machine in which the application runs. It is responsible for three main things: 1) maintaining information about the Spark application, 2) responding to the user's program, and 3) analyzing, distributing, and scheduling work across the executors.
Question 45 of 60
45. Question
Which of the following three DataFrame operations are classified as actions? Choose 3 answers:
Question 46 of 60
46. Question
We want to drop any rows that have a null value. Choose the correct order of the following fragments to achieve this goal.
1. df
2. drop
3. .na
4. .drop("any")
5. .dropna("any")
Explanation: The correct syntax is df.na.drop("any"). Example:
import org.apache.spark.sql.functions._
val rawData = Seq(("A", "20"), ("B", "30"), ("C", null))
val df = spark.createDataFrame(rawData).toDF("Letter", "Number")
val df2 = df.na.drop("any")
df2.show()
Result:
+------+------+
|Letter|Number|
+------+------+
|     A|    20|
|     B|    30|
+------+------+
Question 47 of 60
47. Question
Which of the following DataFrame operations is classified as a narrow transformation?
Which of the following statements about Spark accumulator variables is true?
Explanation: You need to name the accumulator in order to see it in the Spark UI. For accumulator updates performed inside actions, restarted tasks will not update the value in case of a failure. In transformations, each task's update can be applied more than once if tasks or job stages are re-executed. Accumulators provide a mutable variable that a Spark cluster can safely update on a per-row basis. You can define your own custom accumulator class by extending org.apache.spark.util.AccumulatorV2 in Java or Scala or pyspark.AccumulatorParam in Python.
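A minimal named-accumulator sketch:
// a named accumulator shows up in the Spark UI under the stage details
val acc = spark.sparkContext.longAccumulator("my_counter")
spark.range(10).foreach(_ => acc.add(1))
println(acc.value)  // 10, since foreach is an action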
Question 50 of 60
50. Question
When joining two DataFrames, we need to evaluate the keys in both DataFrames or tables and include all rows from the right DataFrame, as well as any rows in the left DataFrame that have a match in the right DataFrame; if there is no equivalent row in the left DataFrame, we want to insert null. Which join type should we select?
df1.join(person, joinExpression, joinType)
Explanation: The correct answer is joinType = "right_outer". For example: df1.join(person, joinExpression, "right_outer").show()
Question 51 of 60
51. Question
Which of the following statements is NOT true for broadcast variables?
Explanation: Broadcast variables are a way you can share an immutable value efficiently around the cluster without encapsulating that variable in a function closure. The normal way to use a variable in your driver node inside your tasks is to simply reference it in your function closures (e.g., in a map operation), but this can be inefficient, especially for large variables such as a lookup table or a machine learning model. The reason for this is that when you use a variable in a closure, it must be deserialized on the worker nodes many times (one per task).
Question 52 of 60
52. Question
Which of the following describes the relationship between cluster managers and worker nodes?
Question 53 of 60
53. Question
Given an instance of SparkSession named spark, and the following DataFrame named df:
val simpleData = Seq(
  ("James", "Sales", "NY", 90000, 34, 10000),
  ("Michael", "Sales", "NY", 86000, 56, 20000),
  ("Robert", "Sales", "CA", 81000, 30, 23000),
  ("Maria", "Finance", "CA", 90000, 24, 23000),
  ("Raman", "Finance", "CA", 99000, 40, 24000),
  ("Scott", "Finance", "NY", 83000, 36, 19000),
  ("Jen", "Finance", "NY", 79000, 53, 15000),
  ("Jeff", "Marketing", "CA", 80000, 25, 18000),
  ("Kumar", "Marketing", "NY", 91000, 50, 21000)
)
val df = spark.createDataFrame(simpleData).toDF("employee_name", "department", "state", "salary", "age", "bonus")
Choose the right code block which will produce the following result:
+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|     Sales|     257000|
|   Finance|     351000|
| Marketing|     171000|
+----------+-----------+
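One plausible shape of the intended answer (a sketch):
import org.apache.spark.sql.functions.sum
df.groupBy("department").agg(sum("salary")).show()
// equivalently: df.groupBy("department").sum("salary").show()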
Question 54 of 60
54. Question
For the following DataFrame, if we want to fully cache the DataFrame immediately, what code block should replace (x)?
val a = Array(1002, 3001, 4002, 2003, 2002, 3004, 1003, 4006)
val df = spark.createDataset(a).withColumn("x", col("value") % 1000)
df.cache()
(x)
Explanation: When you use cache() or persist(), the DataFrame is not fully cached until you invoke an action that goes through every record (e.g., count()). If you use an action like take(1), only one partition will be cached, because Catalyst realizes that you do not need to compute all the partitions just to retrieve one record.
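So the sketch of the answer is an action that touches every partition:
df.count()  // scans all partitions, materializing the full cache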
Question 55 of 60
55. Question
What is the first thing to try if garbage collection is a problem?
Explanation: JVM garbage collection can be a problem when you have large churn in terms of the RDDs stored by your program. When Java needs to evict old objects to make room for new ones, it will need to trace through all your Java objects and find the unused ones. The main point to remember here is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (e.g. an array of Ints instead of a LinkedList) greatly lowers this cost. https://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
Question 56 of 60
56. Question
Let's suppose that we have a DataFrame df with a column today which has the format YYYY-MM-DD. You want to add a new column week_later to this DataFrame and you want its value to be one week after the column today. Select the correct code block.
Explanation: date_sub and date_add are functions that live in the package org.apache.spark.sql.functions.
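One plausible shape of the intended answer (a sketch):
import org.apache.spark.sql.functions.{col, date_add}
val result = df.withColumn("week_later", date_add(col("today"), 7))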
Question 57 of 60
57. Question
The code block below intends to join df1 with df2 with an inner join, but it contains an error. Identify the error.
d1.join(d2, d1.col(id) === df2.col(id), inner)
Explanation: The general form is df1.join(df2, joinExpression, joinType).
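A corrected version might look like this (a sketch; the main fixes shown are quoting the column name and passing the join type as a string):
val joined = df1.join(df2, df1.col("id") === df2.col("id"), "inner")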
Question 58 of 60
58. Question
What will cause a full shuffle, knowing that DataFrame df has 2 partitions?
Explanation: The coalesce function avoids a full shuffle: if it's known that the number of partitions is decreasing, the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that we keep. Coalesce cannot be used to increase the number of partitions.
Question 59 of 60
59. Question
Choose the correct code block to broadcast dfA and join it with dfB.
Explanation: There are other syntaxes to broadcast dfA as well, but for this example you need to wrap it in the broadcast function. Also, the order of the join is important, as you can see.
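One plausible shape of the intended answer (a sketch; the join key id is an assumption, not from the question):
import org.apache.spark.sql.functions.broadcast
// broadcast() hints that dfA is small enough to ship to every executor
val joined = dfB.join(broadcast(dfA), dfB("id") === dfA("id"))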
Question 60 of 60
60. Question
Which of the following describes a stage best?
User program built on Spark. Consists of a driver program and executors on the cluster.