CCA Spark and Hadoop Developer (CCA175) Exam Questions Total Questions: 118 – 2 Mock Exams
Practice Set 1
Question 1 of 65
Instructions:
Convert the snappy compressed Avro data files stored at HDFS location /user/cloudera/practice1/q1 into Parquet files.

Output Requirement:
Result should be saved in /user/cloudera/practice1/q1/output/
Output should consist of only order_id, order_status
Output files should be saved as Parquet with gzip compression.

Important Information & Tips
1. Look at the answer only once you have solved the question.
2. The actual exam is hands-on and no answers are provided there; the answer below is only a guide.
3. To work with Avro files on the local Cloudera VM, open pyspark using
pyspark --packages org.apache.spark:spark-avro_2.12:2.4.7
The exam environment is already set up with the Avro library, so in the exam you open the shell without --packages.
4. By default, Parquet files are written with snappy compression, so gzip has to be set explicitly.

Solution

spark. \
read. \
format("avro"). \
load("/user/cloudera/practice1/q1"). \
select("order_id","order_status"). \
write. \
mode("overwrite"). \
option("compression","gzip"). \
parquet("/user/cloudera/practice1/q1/output/")
Question 2 of 65
Instructions:
Join the comma separated files located at the HDFS locations below to find customers who have placed more than 4 orders.
/user/cloudera/practice1/q2/orders
/user/cloudera/practice1/q2/customers

Input Schema
Schema for customer file: customer_id,customer_fname,…
Schema for order file: order_id,order_date,order_customer_id,order_status

Output Requirement:
Order status should be COMPLETE
Output should have customer_id, customer_fname, orders_count
Save the results in JSON format.
Result should be ordered by count of orders in ascending fashion.
Result should be saved in /user/cloudera/practice1/q2/output

Solution Using DataFrame

from pyspark.sql.functions import col

cusDS = spark. \
read. \
format("csv"). \
load("/user/cloudera/practice1/q2/customers"). \
select(col("_c0").alias("customer_id"),col("_c1").alias("customer_fname"))

spark. \
read. \
format("csv"). \
load("/user/cloudera/practice1/q2/orders"). \
select(col("_c2").alias("customer_id"),col("_c3").alias("status")). \
filter("status=='COMPLETE'"). \
groupBy("customer_id"). \
count(). \
filter("count > 4"). \
withColumnRenamed("count","orders_count"). \
join(cusDS,"customer_id"). \
sort("orders_count"). \
write. \
mode("overwrite"). \
json("/user/cloudera/practice1/q2/output")

Solution Using Spark SQL

from pyspark.sql.functions import col

cusDS = spark. \
read. \
format("csv"). \
load("/user/cloudera/practice1/q2/customers"). \
select(col("_c0").alias("customer_id"),col("_c1").alias("customer_fname"))

spark. \
read. \
format("csv"). \
load("/user/cloudera/practice1/q2/orders"). \
select(col("_c2").alias("customer_id"),col("_c3").alias("status")). \
createOrReplaceTempView("orders")

spark. \
sql("select customer_id,count(*) as orders_count from orders group by customer_id having count(*)>4"). \
join(cusDS,"customer_id"). \
sort("orders_count"). \
write. \
mode("overwrite"). \
json("/user/cloudera/practice1/q2/output")
Question 3 of 65
Instructions:
From the provided Parquet files located at the HDFS location below, get the maximum product_price in each product_category.
/user/cloudera/practice1/q3

Output Requirement:
Final output should be saved in the HDFS location /user/cloudera/practice1/q3/output
Final output should have product_category_id and max_price separated by a pipe delimiter
Output should be saved in text format with gzip compression
Output should be ordered by maximum price descending.
Output should be stored in a single file.

Solution

spark. \
read. \
parquet("/user/cloudera/practice1/q3"). \
createOrReplaceTempView("product")

spark. \
sql("select CONCAT(product_category_id,'|',max(product_price)) from product group by product_category_id order by max(product_price) desc"). \
coalesce(1). \
write. \
mode("overwrite"). \
option("compression","gzip"). \
format("text"). \
save("/user/cloudera/practice1/q3/output")
Question 4 of 65
Instructions:
Provided customer tab delimited files at the HDFS location below, find all customers that live in 'Caguas' city.
Input folder is /user/cloudera/practice1/q4

Output Requirement:
Result should be saved in /user/cloudera/practice1/q4/output
Output files should be saved in Avro format with deflate compression.
Output schema should be customer_id: Integer, customer_name: String, customer_city: String

Important Information:
You will not be provided with any answer choices in the actual exam; the answer below is only a guide.
You can check Avro file meta information using avro-tools to verify the compression. Copy the part file to your local machine and run:
avro-tools getmeta <part-file>.avro

Solution

from pyspark.sql.functions import col
from pyspark.sql.types import *

spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice1/q4"). \
select(col("_c0").cast("int").alias("customer_id"),col("_c1").alias("customer_name"),col("_c2").alias("customer_city")). \
filter("customer_city='Caguas'"). \
write. \
mode("overwrite"). \
option("compression","deflate"). \
format("avro"). \
save("/user/cloudera/practice1/q4/output")
Question 5 of 65
Instructions:
Convert the Avro data files stored at the HDFS location below into a tab delimited file.
/user/cloudera/practice1/q5

Output Requirement:
Result should be saved in /user/cloudera/practice1/q5/output
Output files should be saved as tab delimited files with bzip2 compression.
Output should consist of customer_id, customer_fname (only first three letters), customer_lname

Sample Output:
21    And    Smith
111   Mar    Jons

Solution

from pyspark.sql.functions import substring, col

spark. \
read. \
format("avro"). \
load("/user/cloudera/practice1/q5"). \
withColumn("customer_fname", substring(col("customer_fname"), 0, 3)). \
write. \
mode("overwrite"). \
option("sep","\t"). \
option("compression","bzip2"). \
csv("/user/cloudera/practice1/q5/output")
Question 6 of 65
Instructions:
Provided orders data in Parquet format, find all PENDING_PAYMENT orders in March 2014. order_date is in unix_timestamp format.
Input files are stored at HDFS location /user/cloudera/practice1/q7

Output Requirement:
Output should be the date and the total pending orders for that date.
order_date should be in yyyy-MM-dd format
Output should be saved at HDFS location /user/cloudera/practice1/q7/output
Output should be in JSON file format.

Please note: in the solution we divide order_date by 1000 because of an issue in the dataset (the timestamps are in milliseconds). This might not be needed in the final exam.

Solution

from pyspark.sql.functions import from_unixtime,col,to_date

spark. \
read. \
format("parquet"). \
load("/user/cloudera/practice1/q7"). \
withColumn("order_date",to_date(from_unixtime(col("order_date")/1000))). \
filter("order_date LIKE '2014-03%' and order_status='PENDING_PAYMENT'"). \
groupBy("order_date").count(). \
withColumnRenamed("count","pending_orders"). \
coalesce(4). \
write. \
mode("overwrite"). \
format("json"). \
save("/user/cloudera/practice1/q7/output")
Question 7 of 65
Instructions:
Find the total number of orders placed by each customer in the year 2013. Order status should be COMPLETE. order_date is in unix_timestamp format.
Input customer and order files are stored as Avro files at the HDFS locations below.
/user/cloudera/practice1/q8/orders
/user/cloudera/practice1/q8/customers

Output Requirement:
Output should be stored in a hive table named "customer_order"
Hive table should have three columns: customer_fname, customer_lname and count.
Hive table should be partitioned by customer_state.

Solution

spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

from pyspark.sql.functions import from_unixtime,col,to_date

# Create the customer DataFrame
cus = spark. \
read. \
format("avro"). \
load("/user/cloudera/practice1/q8/customers")

spark. \
read. \
format("avro"). \
load("/user/cloudera/practice1/q8/orders"). \
withColumn("order_date",to_date(from_unixtime(col("order_date")/1000))). \
filter("order_date LIKE '2013%' and order_status='COMPLETE'"). \
groupBy("order_customer_id").count(). \
join(cus, col("order_customer_id") == col("customer_id")). \
select("customer_fname","customer_lname","count","customer_state"). \
write. \
partitionBy("customer_state"). \
format("hive"). \
saveAsTable("customer_order")
Question 8 of 65
Instructions:
Get all customers who have placed orders of amount more than 200.
Input files are tab delimited files placed at the HDFS locations below:
/user/cloudera/practice2/q1/customers
/user/cloudera/practice2/q1/orders
/user/cloudera/practice2/q1/order_items

Schema for Input Files
Customers: Customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
Orders: Order_id,order_date,order_customer_id,order_status
Order_Items: Order_item_id,Order_item_order_id,order_item_product_id,Order_item_quantity,Order_item_subtotal,Order_item_product_price

Output Requirements:
Output should be placed in the HDFS location /user/cloudera/practice2/q1/output
Output files should be comma separated
Output should have customer_fname, customer_lname, customer_city, order_amount.
The header fname,lname,city,order_amount should be added to the output.

Solution

from pyspark.sql.functions import col

custDF = spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice2/q1/customers"). \
select(col("_c0").alias("custId"),col("_c1").alias("fname"),col("_c2").alias("lname"),col("_c6").alias("city"))

ordDF = spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice2/q1/orders"). \
select(col("_c0").alias("ordId"),col("_c2").alias("custId"))

spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice2/q1/order_items"). \
select(col("_c1").alias("ordId"),col("_c4").alias("order_item_subtotal").cast("double")). \
groupBy(col("ordId")). \
sum("order_item_subtotal"). \
withColumnRenamed("sum(order_item_subtotal)","order_amount"). \
join(ordDF,"ordId").where("order_amount > 200"). \
join(custDF,"custId"). \
select("fname","lname","city","order_amount"). \
write. \
mode("overwrite"). \
option("header","true"). \
csv("/user/cloudera/practice2/q1/output")
Question 9 of 65
Instructions:
Get customers from the metastore table named "customers_hive" whose customer_fname is like "Rich" and save the results in HDFS.

Output Requirement:
Output should be saved in a hive table named "customer_result"
Hive table should have two columns – customer_name & customer_address
customer_name – customer_fname concatenated with customer_lname with a tab
customer_address – customer_street (only first 10 letters) concatenated with customer_city & customer_state with a colon
customer_name should be in upper case
customer_street should be in lower case & customer_state in upper case

Sample Output
Customer_name        Customer_address
DONNA SMITH          clear nect:Caguas:PR
MARY PEREJ           golden orc:Moreno Valley:CA

Solution

spark. \
sql("select concat_ws('\t',upper(customer_fname),upper(customer_lname)) as customer_name,concat_ws(':',lower(substr(customer_street,0,10)),customer_city,upper(customer_state)) as customer_address from customers_hive where customer_fname like '%Rich%'"). \
write. \
mode("overwrite"). \
format("hive"). \
saveAsTable("customer_result")
Question 10 of 65
Instructions:
Provided a pipe delimited file, get the total number of customers in each state whose first name starts with 'M' and save the results in HDFS.
Input folder: /user/cloudera/practice2/q3

Output Requirement:
Result should be saved in a hive table "customer_m"
File format should be Parquet with gzip compression.
Output should have the state name followed by the total number of customers in that state.
Output Schema
state: String
customers_count: String

Solution

from pyspark.sql.functions import col

spark. \
read. \
option("sep","|"). \
csv("/user/cloudera/practice2/q3"). \
select(col("_c1").alias("fname"),col("_c2").alias("state")). \
where("fname like 'M%'"). \
groupBy("state").count(). \
withColumnRenamed("count","customers_count"). \
write. \
mode("overwrite"). \
option("compression","gzip"). \
option("fileFormat","parquet"). \
format("hive"). \
saveAsTable("customer_m")
Question 11 of 65
Instructions:
Provided a metastore table named product_ranked_new consisting of product details, find the most expensive product in each category.

Output Requirement:
Output should have product_category_id, product_name, product_price, rank.
Output should be saved in /user/cloudera/practice2/q4/output/
Output should be saved as a pipe delimited text file
Output should be saved in 4 files

Solution

from pyspark.sql.functions import col,concat_ws

spark. \
sql("select p.product_category_id,p.product_name,p.product_price,dense_rank() over (partition by p.product_category_id order by p.product_price desc) as product_price_rank from product_ranked_new p"). \
filter("product_price_rank=1"). \
withColumn("final_description",concat_ws("|",col("product_category_id"),col("product_name"),col("product_price"),col("product_price_rank"))). \
select("final_description"). \
coalesce(4). \
write. \
mode("overwrite"). \
format("text"). \
save("/user/cloudera/practice2/q4/output")
Question 12 of 65
Instructions:
Fetch all pending orders from the Parquet files stored at HDFS location /user/cloudera/practice2/q5

Output Requirement:
Result should be saved in /user/cloudera/practice2/q5/output
Output files should be saved in ORC format
Output files should be saved with the lzo compression codec

Solution

spark. \
read. \
parquet("/user/cloudera/practice2/q5"). \
filter("order_status like '%PENDING%'"). \
write. \
mode("overwrite"). \
option("compression","lzo"). \
format("orc"). \
save("/user/cloudera/practice2/q5/output")
Question 13 of 65
Instructions:
Provided a tab delimited file at HDFS location /user/cloudera/practice2/q6, save only the first 4 fields in the result.

Output Requirement:
Result should be saved in /user/cloudera/practice2/q6/output
Output files should be saved in text format.
Output should be saved as a pipe delimited file in HDFS

Solution

from pyspark.sql.functions import col,concat_ws

spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice2/q6"). \
withColumn("final_description",concat_ws("|",col("_c0"),col("_c1"),col("_c2"),col("_c3"))). \
select("final_description"). \
write. \
format("text"). \
save("/user/cloudera/practice2/q6/output")
Question 14 of 65
Instructions:
Find the top 10 products which have made the highest revenue. Products and order_items data are placed in the HDFS directories
/user/cloudera/practice2/q7/products
/user/cloudera/practice2/q7/order_items

Output Requirement:
Output should have product_id and revenue separated with ':'
Output should be saved in /user/cloudera/practice2/q7/output

Solution

from pyspark.sql.functions import col

prodDF = spark. \
read. \
csv("/user/cloudera/practice2/q7/products/"). \
select(col("_c0").alias("prodId"),col("_c2").alias("name"))

spark. \
read. \
csv("/user/cloudera/practice2/q7/order_items/"). \
select(col("_c2").alias("prodId"),col("_c4").alias("order_total").cast("double")). \
join(prodDF,"prodId"). \
createOrReplaceTempView("joined")

spark. \
sql("select concat(prodId,':',sum(order_total)) from joined group by prodId order by sum(order_total) desc limit 10"). \
write. \
mode("overwrite"). \
format("text"). \
save("/user/cloudera/practice2/q7/output")
Question 15 of 65
Instructions:
Find all customers that live in 'Brownsville' city and save the result into HDFS.
Input folder is /user/cloudera/practice2/q8

Output Requirement:
Output should be saved in /user/cloudera/practice2/q8/output
Output files should be saved in JSON format

Sample Output
{"custId":"9733","name":"Marie","city":"Brownsville"}
{"custId":"9761","name":"Helen","city":"Brownsville"}

Solution

from pyspark.sql.functions import col

spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice2/q8"). \
select(col("_c0").alias("custId"),col("_c1").alias("name"),col("_c2").alias("city")). \
filter("city='Brownsville'"). \
write. \
mode("overwrite"). \
json("/user/cloudera/practice2/q8/output")
Question 16 of 65
Instructions:
From the provided Avro files at the HDFS locations below, find customers who have not placed any order in March 2013.
/user/cloudera/practice3/q1/orders
/user/cloudera/practice3/q1/customers

Output Requirement:
Output should be stored in JSON format at the HDFS location /user/cloudera/practice3/q1/output
Output should have two fields:
name – customer_fname:customer_lname
place – customer_city:customer_state

Sample Output
{"name":"Melissa:Palmer","place":"Caguas:PR"}
{"name":"Mary:Rush","place":"Painesville:OH"}
{"name":"Thomas:Smith","place":"Caguas:PR"}

Solution

from pyspark.sql.functions import from_unixtime,col,to_date,concat,lit

custDF = spark. \
read. \
format("avro"). \
load("/user/cloudera/practice3/q1/customers")

ordDF = spark. \
read. \
format("avro"). \
load("/user/cloudera/practice3/q1/orders"). \
withColumn("order_date",to_date(from_unixtime(col("order_date")/1000))). \
filter("order_date LIKE '2013-03%'")

custDF. \
join(ordDF, col("customer_id") == col("order_customer_id"), "left_anti"). \
select( \
concat(col("customer_fname"), lit(":"), col("customer_lname")).alias("name"), \
concat(col("customer_city"), lit(":"), col("customer_state")).alias("place")). \
write. \
mode("overwrite"). \
json("/user/cloudera/practice3/q1/output")
Question 17 of 65
Instructions:
Get the count of customers in each city who have placed orders of amount more than 100 and whose order status is not PENDING.
Input files are tab delimited files placed at the HDFS locations below:
/user/cloudera/practice3/q3/customers
/user/cloudera/practice3/q3/orders
/user/cloudera/practice3/q3/order_items

Input Schema
Customers: Customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
Orders: Order_id,order_date,order_customer_id,order_status
Order_Items: Order_item_id,Order_item_order_id,order_item_product_id,Order_item_quantity,Order_item_subtotal,Order_item_product_price

Output Requirements:
Output should be placed in the HDFS location /user/cloudera/practice3/q3/output
Output files should be tab separated with deflate compression.
Output should have the customer city followed by the customer count

Sample Output
Elmhurst      187
Asheboro      30
Douglasville  27

Solution

from pyspark.sql.functions import col

custDF = spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice3/q3/customers"). \
select(col("_c0").alias("custId"),col("_c1").alias("fname"),col("_c2").alias("lname"),col("_c6").alias("city"))

ordDF = spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice3/q3/orders"). \
select(col("_c0").alias("ordId"),col("_c2").alias("custId"),col("_c3").alias("status"))

itemDF = spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice3/q3/order_items"). \
select(col("_c1").alias("ordId"),col("_c4").alias("price").cast("double"))

# Join ordDF with itemDF to get all non-pending orders whose subtotal > 100
ordJoinDF = ordDF. \
filter("status not like '%PENDING%'"). \
join(itemDF,"ordId"). \
where("price > 100")

# Join custDF and write the output to HDFS
custDF. \
join(ordJoinDF,"custId"). \
groupBy("city").count(). \
write. \
mode("overwrite"). \
option("sep","\t"). \
option("compression","deflate"). \
format("csv"). \
save("/user/cloudera/practice3/q3/output")
Question 18 of 65
Instructions:
Input files are provided at the HDFS location below. Save the data to HDFS as a sequence file with no compression.
/user/cloudera/practice3/q4

Output Requirement:
Result should be saved at /user/cloudera/practice3/q4/output
Output fields should be separated by a pipe delimiter.
Key should be order_id; value should be all fields separated by a pipe.
order_date should be stored in MM/dd/yyyy format

Solution

from pyspark.sql.functions import from_unixtime,col,to_date,date_format

spark. \
read. \
parquet("/user/cloudera/practice3/q4"). \
withColumn("order_date",date_format(to_date(from_unixtime(col("order_date")/1000)),"MM/dd/yyyy")). \
rdd. \
map(lambda x: (str(x[0]), "|".join([str(f) for f in x]))). \
saveAsSequenceFile("/user/cloudera/practice3/q4/output")
Question 19 of 65
Instructions:
Using the product_ranked_new metastore table, find the top 5 most expensive products within each category.

Output Requirement:
Output should have product_category_id, product_name, product_price.
Result should be saved in /user/cloudera/practice2/q5/output
Output should be saved as a pipe delimited file.

Solution

from pyspark.sql.functions import col,concat_ws

spark. \
sql("select p.product_category_id,p.product_name,p.product_price,dense_rank() over (partition by p.product_category_id order by p.product_price desc) as product_price_rank from product_ranked_new p"). \
filter("product_price_rank <= 5"). \
drop("product_price_rank"). \
withColumn("finalCol",concat_ws("|",col("product_category_id"),col("product_name"),col("product_price"))). \
select("finalCol"). \
write. \
mode("overwrite"). \
format("text"). \
save("/user/cloudera/practice2/q5/output")
Question 20 of 65
Instructions:
Input files are provided at the HDFS location /user/cloudera/practice3/q6
Append the first three characters of the first name to the first two characters of the last name with a colon.

Output Requirement:
Output should be saved as an XML file with rootTag persons and rowTag person.
Output should be saved at the HDFS location /user/cloudera/practice3/q6/output

Sample Output (element structure reconstructed; the source sample lost its tags)
<persons>
  <person>
    <fullname>Jac:Ja</fullname>
  </person>
  <person>
    <fullname>Mar:Sm</fullname>
  </person>
  ...
</persons>

Important Information
You can use the Databricks spark-xml library for XML processing. Open pyspark with:
pyspark --packages com.databricks:spark-xml_2.12:0.10.0
In the exam you open pyspark without --packages, as all required libraries are preconfigured.

Solution

from pyspark.sql.functions import substring,lit,col,concat

spark. \
read. \
csv("/user/cloudera/practice3/q6"). \
select( \
substring(col("_c1"), 0, 3).alias("fname"), \
substring(col("_c2"), 0, 2).alias("lname") \
). \
select( \
concat(col("fname"),lit(":"),col("lname")).alias("fullname") \
). \
write. \
mode("overwrite"). \
format("com.databricks.spark.xml"). \
option("rootTag", "persons"). \
option("rowTag", "person"). \
save("/user/cloudera/practice3/q6/output")
Question 21 of 65
Instructions:
Convert the snappy compressed Avro data files stored at HDFS location /user/cloudera/practice4/q1 into Parquet files.

Output Requirement:
Result should be saved in /user/cloudera/practice4/q1/output/
Output should consist of only order_id, order_status, order_date
Output files should be saved as Parquet with gzip compression.
order_date should be in yyyy/MM/dd format.
Output schema
|-- ordId: integer (nullable = true)
|-- ordStatus: string (nullable = true)
|-- ordDate: string (nullable = true)
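No worked solution accompanies this question in the source; the sketch below follows the pattern of Question 1, assuming order_date is a unix timestamp in milliseconds as in the earlier questions (hence the division by 1000):

from pyspark.sql.functions import col, from_unixtime, to_date, date_format

# Read the Avro input, rename/cast to the required schema, and write gzip Parquet
spark. \
read. \
format("avro"). \
load("/user/cloudera/practice4/q1"). \
select( \
col("order_id").cast("int").alias("ordId"), \
col("order_status").alias("ordStatus"), \
date_format(to_date(from_unixtime(col("order_date")/1000)), "yyyy/MM/dd").alias("ordDate")). \
write. \
mode("overwrite"). \
option("compression","gzip"). \
parquet("/user/cloudera/practice4/q1/output/")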
Question 22 of 65
Instructions:
Provided a tab delimited file at HDFS location /user/cloudera/practice4/q2, save only the first 4 fields in the result as a pipe delimited file in HDFS.

Output Requirement:
Result should be saved in /user/cloudera/practice4/q2/output
Output files should be saved in ORC format.
Output should be compressed in snappy format
Output Schema
|-- customer_id: integer (nullable = true)
|-- customer_fname: string (nullable = true)
|-- customer_lname: string (nullable = true)
|-- customer_password: string (nullable = true)
Question 23 of 65
Instructions:
Get customers from the metastore table named "customers_p4_q3" whose fname is like "Rich".

Output Requirement:
Output should be stored in a hive table named "customers_p4_q3_output"
Output should be stored in Avro format
Hive table has one column – customer_description
customer_description – customer_fname and customer_lname separated by a tab, followed by customer_city separated by a colon.
customer_fname & customer_lname should be in lower case & customer_city should be in upper case

Sample Output
richard  plaza:FRANCISCO
rich     smith:CHICAGO
Question 24 of 65
Instructions:
Join the comma separated files located at the HDFS locations below to find customers who have placed more than 4 orders.
/user/cloudera/practice4/q4/orders
/user/cloudera/practice4/q4/customers

Schema for customer file:
customer_id,customer_fname,customer_lname,customer_username,customer_password,customer_street,customer_city,customer_state,customer_zipcode
Schema for order file:
order_id,order_date,order_customer_id,order_status

Output Requirement:
Order status should be COMPLETE
Output should have customer_fname and customer_lname in upper case concatenated with a pipe
Output should have customer_orders_count
Save the results in JSON format.
Result should be sorted by customer_orders_count in ascending fashion.
Result should be saved in /user/cloudera/practice4/q4/output
Question 25 of 65
Instructions:
Find the top 5 products which have made the highest revenue. Products and order_items data are placed in the HDFS directories
/user/cloudera/practice4/q5/order_items
/user/cloudera/practice4/q5/products

Output Requirement:
Output should have product_id and revenue separated with ':'
Output should be saved in /user/cloudera/practice4/q5/output
Output should be saved in text format
Question 26 of 65
Instructions:
Find all customers that live in 'Caguas' city and convert the Avro data files stored at the HDFS location below into a tab delimited file.
/user/cloudera/practice4/q4

Schema for customer file:
customer_id,customer_fname,customer_lname,customer_username,customer_password,customer_street,customer_city,customer_state,customer_zipcode

Output Requirement:
Result should be saved in /user/cloudera/practice4/q6/output
Output files should be saved in tab delimited format with a header
Output should be saved with bzip2 compression.
Output should consist of customer_id and name (first two letters of customer_fname + first two letters of customer_lname)

Sample Output:
id    name
21    An+Sm
111   Ma+Jo
Question 27 of 65
Instructions:
Find all PENDING_PAYMENT orders in March 2014. order_date is in unix_timestamp format.
Input files are Parquet files stored at HDFS location /user/cloudera/practice4/q7

Output Requirement:
Output should be the date and the total pending orders for that date.
Output should be saved at the HDFS location /user/cloudera/practice4/q7/output
Output should be saved in ORC format.
order_date should be in dd/MM/yyyy format
Question 28 of 65
Instructions:
Provided a pipe delimited file, get the total number of customers in each state whose first name starts with 'M' and save the results in HDFS.
Input folder: /user/cloudera/practice4/q8

Output Requirement:
Result should be saved in a hive table "customer_p4_q8"
Output should be saved in Parquet format
Output should have the state name followed by the total number of customers in that state.
Output Schema
customer_state: String
customer_count: long
Question 29 of 65
Instructions:
Input files are provided at the HDFS location below.
/user/cloudera/practice4/q9
Save the data to HDFS as a sequence file with no compression.

Output Requirement:
Result should be saved at /user/cloudera/practice4/q9/output
Output fields should be separated by a pipe delimiter.
Key should be order_id; value should be all fields separated by a pipe.
Question 30 of 65
Instructions:
Provided a tab delimited file at HDFS location /user/cloudera/practice4/q2, save only the first 4 fields in the result as a pipe delimited file in HDFS.

Output Requirement:
Result should be saved in /user/cloudera/practice4/q2/output
Output files should be saved in ORC format.
Output should be compressed in snappy format
Output Schema
|-- customer_id: integer (nullable = true)
|-- customer_fname: string (nullable = true)
|-- customer_lname: string (nullable = true)
|-- customer_password: string (nullable = true)

Solution

from pyspark.sql.functions import col

spark. \
read. \
option("sep","\t"). \
csv("/user/cloudera/practice4/q2"). \
select(col("_c0").cast("int").alias("customer_id"),col("_c1").alias("customer_fname"),col("_c2").alias("customer_lname"),col("_c3").alias("customer_password")). \
write. \
mode("overwrite"). \
option("compression","snappy"). \
format("orc"). \
save("/user/cloudera/practice4/q2/output")
Question 31 of 65
Instructions:
Get customers from the metastore table named "customers_p4_q3" whose fname is like "Rich".

Output Requirement:
Output should be stored in a hive table named "customers_p4_q3_output"
Output should be stored in Avro format
Hive table has one column – customer_description
customer_description – customer_fname and customer_lname separated by a tab, followed by customer_city separated by a colon.
customer_fname & customer_lname should be in lower case & customer_city should be in upper case

Sample Output
richard  plaza:FRANCISCO
rich     smith:CHICAGO

Solution

spark. \
sql("select concat(lower(customer_fname),'\t',lower(customer_lname),':',upper(customer_city)) as customer_description from customers_p4_q3 where customer_fname like '%Rich%'"). \
write. \
mode("overwrite"). \
option("fileFormat","avro"). \
format("hive"). \
saveAsTable("customers_p4_q3_output")
Question 32 of 65
Instructions:
Join the comma separated files located at the HDFS locations below to find customers who have placed more than 4 orders.
/user/cloudera/practice4/q4/orders
/user/cloudera/practice4/q4/customers

Schema for customer file: customer_id,customer_fname,…
Schema for order file: order_id,order_date,order_customer_id,order_status

Output Requirement:
Order status should be COMPLETE
Output should have customer_name and customer_orders_count
customer_name should be customer_fname and customer_lname in upper case concatenated with a pipe
Save the results in JSON format.
Result should be sorted by customer_orders_count in ascending fashion.
Result should be saved in /user/cloudera/practice4/q4/output

Solution

from pyspark.sql.functions import col,concat,upper,lit

cusDS = spark. \
read. \
format("csv"). \
load("/user/cloudera/practice4/q4/customers"). \
select(col("_c0").alias("customer_id"),col("_c1").alias("customer_fname"),col("_c2").alias("customer_lname"))

spark. \
read. \
format("csv"). \
load("/user/cloudera/practice4/q4/orders"). \
select(col("_c2").alias("customer_id"),col("_c3").alias("status")). \
filter("status=='COMPLETE'"). \
groupBy("customer_id"). \
count().where("count > 4"). \
withColumnRenamed("count","customer_orders_count"). \
join(cusDS,"customer_id"). \
sort("customer_orders_count"). \
select(concat( \
upper(col("customer_fname")), \
lit("|"), \
upper(col("customer_lname"))).alias("customer_name"), \
col("customer_orders_count") \
). \
write. \
mode("overwrite"). \
json("/user/cloudera/practice4/q4/output")
Question 33 of 65
Instructions:
Find the top 5 products which have made the highest revenue. Products and order_items data are placed in the HDFS directories
/user/cloudera/practice4/q5/order_items
/user/cloudera/practice4/q5/products

Output Requirement:
Output should have product_id and revenue separated with ':'
Output should be saved in /user/cloudera/practice4/q5/output
Output should be saved in text format

Solution

from pyspark.sql.functions import col

prodDF = spark. \
read. \
csv("/user/cloudera/practice4/q5/products/"). \
select(col("_c0").alias("prodId"),col("_c2").alias("name"))

spark. \
read. \
csv("/user/cloudera/practice4/q5/order_items/"). \
select(col("_c2").alias("prodId"),col("_c4").alias("order_total").cast("double")). \
join(prodDF,"prodId"). \
createOrReplaceTempView("joined")

spark. \
sql("select concat(prodId,':',sum(order_total)) from joined group by prodId order by sum(order_total) desc limit 5"). \
write. \
mode("overwrite"). \
format("text"). \
save("/user/cloudera/practice4/q5/output")
Question 34 of 65
Instructions:
Find all customers that live in 'Caguas' city and convert the Avro data files stored at the HDFS location below into a tab delimited file.
/user/cloudera/practice4/q4

Schema for customer file:
customer_id,customer_fname,customer_lname,customer_username,customer_password,customer_street,customer_city,customer_state,customer_zipcode

Output Requirement:
Result should be saved in /user/cloudera/practice4/q6/output
Output files should be saved in tab delimited format with a header
Output should be saved with bzip2 compression.
Output should consist of customer_id and name (first two letters of customer_fname + first two letters of customer_lname)

Sample Output:
id    name
21    An+Sm
111   Ma+Jo

Solution

from pyspark.sql.functions import col,substring,lit,concat

spark. \
read. \
format("avro"). \
load("/user/cloudera/practice4/q6"). \
filter("customer_city='Caguas'"). \
withColumn("customer_fname", substring(col("customer_fname"), 0, 2)). \
withColumn("customer_lname", substring(col("customer_lname"), 0, 2)). \
withColumn("name",concat(col("customer_fname"),lit("+"),col("customer_lname"))). \
withColumnRenamed("customer_id","id"). \
select("id","name"). \
write. \
mode("overwrite"). \
option("header","true"). \
option("sep","\t"). \
option("compression","bzip2"). \
csv("/user/cloudera/practice4/q6/output")
Question 35 of 65
Instructions:
Find all PENDING_PAYMENT orders in March 2014. order_date is in unix_timestamp format.
Input files are Parquet files stored at HDFS location /user/cloudera/practice4/q7

Output Requirement:
Output should be the date and the total pending orders for that date.
Output should be saved at the HDFS location /user/cloudera/practice4/q7/output
Output should be saved in ORC format.
order_date should be in dd/MM/yyyy format

Solution

from pyspark.sql.functions import from_unixtime,col,to_date,date_format

spark. \
read. \
format("parquet"). \
load("/user/cloudera/practice4/q7"). \
withColumn("order_date",date_format(to_date(from_unixtime(col("order_date")/1000)),"dd/MM/yyyy")). \
filter("order_date LIKE '%03/2014%' and order_status='PENDING_PAYMENT'"). \
groupBy(col("order_date")).count(). \
write. \
mode("overwrite"). \
format("orc"). \
save("/user/cloudera/practice4/q7/output")
Question 36 of 65
Instructions:
Provided a pipe delimited file, get the total number of customers in each state whose first name starts with 'M' and save the results in HDFS.
Input folder: /user/cloudera/practice4/q8

Output Requirement:
Result should be saved in a hive table "customer_p4_q8"
Output should be saved in Parquet format
Output should have the state name followed by the total number of customers in that state.
Output Schema
customer_state: String
customer_count: long

Solution

from pyspark.sql.functions import col

spark. \
read. \
option("sep","|"). \
csv("/user/cloudera/practice4/q8"). \
select(col("_c1").alias("fname"),col("_c2").alias("customer_state")). \
where("fname like 'M%'"). \
groupBy("customer_state").count(). \
withColumnRenamed("count","customer_count"). \
write. \
mode("overwrite"). \
option("fileFormat","parquet"). \
format("hive"). \
saveAsTable("customer_p4_q8")
Question 37 of 65
Instructions:
Input files are provided at the HDFS location below.
/user/cloudera/practice4/q9
Save the data to HDFS as a sequence file with no compression.

Output Requirement:
Result should be saved at /user/cloudera/practice4/q9/output
Output fields should be separated by a pipe delimiter.
Key should be order_id; value should be all fields separated by a pipe.

Solution

spark. \
read. \
parquet("/user/cloudera/practice4/q9"). \
rdd. \
map(lambda x: (str(x[0]), "|".join([str(f) for f in x]))). \
saveAsSequenceFile("/user/cloudera/practice4/q9/output")
Question 38 of 65
Instructions:
Input files are provided at the HDFS location below.
/user/cloudera/practice4/q9
Save the data to HDFS as a sequence file with no compression.

Output Requirement:
Result should be saved at /user/cloudera/practice4/q9/output
Output fields should be separated by a pipe delimiter.
Key should be order_id; value should be all fields separated by a pipe.

Solution

spark. \
read. \
parquet("/user/cloudera/practice4/q9"). \
rdd. \
map(lambda x: (str(x[0]), "|".join([str(f) for f in x]))). \
saveAsSequenceFile("/user/cloudera/practice4/q9/output")
Question 39 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the orders table into HDFS.

Data Description
A MySQL instance is running on the ip-172-31-20-247 node with the orders table.
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: orders
Username: admin
Password: admin

Output Requirements
Place the files in the HDFS location /user/sqoop/prblm1/solution/
Use a text format
Load every order record

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
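No worked command is included in the source; a minimal Sqoop sketch under the stated connection details (the JDBC host in the URL is an assumption based on the node name):

# Full-table text import of orders into the target directory
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table orders \
  --as-textfile \
  --target-dir /user/sqoop/prblm1/solution/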
Question 40 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the employee table into HDFS. Not all of the MySQL data is required.

Data Description
A MySQL instance is running on the ip-172-31-20-247 node with the employee table.
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: employee
Username: admin
Password: admin

Output Requirements
Place the files in the HDFS location /user/sqoop/prblm2/solution/
Use a text format with a comma as the columnar delimiter
Load every employee record
Only load columns that contain employee id, first name, last name and birthday data

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
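A sketch of the import; the exact column names in the employee table are not given in the problem, so the names below are assumptions to be verified (e.g. with sqoop eval) before running:

# Import only the four required columns as comma delimited text
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table employee \
  --columns "employee_id,first_name,last_name,birthday" \
  --fields-terminated-by ',' \
  --as-textfile \
  --target-dir /user/sqoop/prblm2/solution/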
Question 41 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the departments table into HDFS. Not all of the MySQL data is required.

Data Description
A MySQL instance is running on the ip-172-31-20-247 node with the departments table.
MySQL database information:
Installation: on the cluster node gateway
Database name: retail_db
Table name: departments
Username: admin
Password: admin

Output Requirements
Place the files in the HDFS location /user/sqoop/prblm3/solution/
The output directory should only contain 2 files
Only load records with department_id between 1 and 22
Only import columns that contain department_id and department_name

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Explanation: As mentioned in the question, the output directory should have 2 files, so --split-by department_id is used along with two mappers (-m 2).
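A sketch consistent with the explanation above, using the connection details from the data description:

# Two mappers split on department_id give exactly two output files
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table departments \
  --columns "department_id,department_name" \
  --where "department_id between 1 and 22" \
  --split-by department_id \
  -m 2 \
  --target-dir /user/sqoop/prblm3/solution/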
Question 42 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import all the tables from retail_db into HDFS.

Data Description
A MySQL instance is running on the ip-172-31-20-247 node.
MySQL database information:
Installation: on the cluster node gateway
Database name: retail_db
Username: admin
Password: admin

Output Requirements
Place the files in the warehouse location prblm4
The files should use the Parquet file format with snappy compression

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Explanation: While importing all the tables, --warehouse-dir is used instead of --target-dir.
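A sketch using import-all-tables with the warehouse directory from the requirements; note that depending on the Sqoop version, Parquet imports may already default to snappy, so the codec flag may be redundant:

# Import every retail_db table as snappy-compressed Parquet under prblm4
sqoop import-all-tables \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --warehouse-dir prblm4 \
  --as-parquetfile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec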
Question 43 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the categories table into HDFS.

Data Description
A MySQL instance is running on the ip-172-31-20-247 node with the categories table.
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: categories
Username: admin
Password: admin

Output Requirements
Place the files in the HDFS location /user/sqoop/prblm5/solution/
Use the pipe '|' as the columnar delimiter
While importing, replace any null value found with -999 (string columns) and -8989 (non-string columns)

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Explanation: To replace null values for string and non-string columns during the import, the Sqoop parameters --null-string -999 and --null-non-string -8989 are used.
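A sketch combining the pipe delimiter and the null-handling flags from the explanation above:

# Pipe-delimited import with null substitution for string and non-string columns
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table categories \
  --fields-terminated-by '|' \
  --null-string '-999' \
  --null-non-string '-8989' \
  --target-dir /user/sqoop/prblm5/solution/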
Question 44 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the customers table into the metastore table customers_hive.

Data Description
A MySQL instance is running on the node ip-172-31-20-247 with the customers table.
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: customers
Username: admin
Password: admin

Output Requirements
Load the data into the metastore table customers_hive in the default database

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
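No command is given in the source; a minimal hive-import sketch:

# Import customers straight into the Hive metastore table customers_hive
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table customers \
  --hive-import \
  --hive-table customers_hive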
Question 45 of 65
Instructions
Connect to the MySQL database on the cluster using Sqoop and import data from the orders table into HDFS. Not all of the MySQL data is required.

Data Description
A MySQL instance is running on the node ip-172-31-20-247 with the orders table.
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: orders
Username: admin
Password: admin

Output Requirements
Place the files in the HDFS location /user/sqoop/prblm7/solution/
Fetch all the pending orders from the table
The files should use the Avro file format with snappy compression

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Explanation: Sqoop uses Avro 1.8.0 while other Hadoop components use Avro 1.7.5 or 1.7.4. When using avrodatafile, add the following property right after 'import': -Dmapreduce.job.user.classpath.first=true
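A sketch incorporating the property from the explanation above; "pending" is assumed here to mean order_status = 'PENDING' (widen the --where clause if PENDING_PAYMENT should also be included):

# Snappy-compressed Avro import of pending orders only
sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table orders \
  --where "order_status = 'PENDING'" \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /user/sqoop/prblm7/solution/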
Question 46 of 65
Instructions
Customer data needs to be exported from HDFS to a MySQL database table.

Data Description
The data files are in the HDFS directory /user/sqoop/prblm8/data/customer
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: solution
Username: admin
Password: admin

Output Requirements
Export all of the customer data from HDFS into the MySQL solution table
The table already exists in the MySQL database but currently has no rows

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
(This info will not be given in the exam.) Info: you have to analyze the field terminators before exporting; the field terminator for this problem is tab.
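A sketch using the tab field terminator noted in the problem:

# Export the tab-delimited HDFS data into the existing MySQL table
sqoop export \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table solution \
  --export-dir /user/sqoop/prblm8/data/customer \
  --input-fields-terminated-by '\t'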
Question 47 of 65
Instructions
Department data needs to be exported from HDFS to a MySQL database table.

Data Description
Create a csv file named departments.csv with the following content and upload the file to the HDFS location /user/sqoop/prblm9/data/
2 ,fitness
3 ,footwear
12,fathematics
13,fcience
14,engineering
1000,management
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: departments
Username: admin
Password: admin

Output Requirements
Export all of the department data from HDFS into MySQL.
During the upload, make sure existing departments are just updated and new departments are inserted.

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
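An upsert-style sketch; with --update-key and --update-mode allowinsert, rows with an existing department_id are updated and the rest are inserted (support for allowinsert depends on the connector, so verify on the exam cluster):

# Upsert export: update existing departments, insert new ones
sqoop export \
  --connect jdbc:mysql://ip-172-31-20-247/retail_db \
  --username admin \
  --password admin \
  --table departments \
  --export-dir /user/sqoop/prblm9/data/departments.csv \
  --update-key department_id \
  --update-mode allowinsert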
Question 48 of 65
Instructions
Download the 3 files from the Dropbox location
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
file1.txt
file2.txt
file3.txt

Data Description
The files are documents with text written in paragraphs. There is no columnar data in them.
Filtered words: "a","the","an","as","a","with","this","these","is","are","in","for","to","and","The","of"

Output Requirement
Store the result in the HDFS location /result/location/hdfs/comp_result/
Save the results with gzip compression
Load all three files and do a word count, but exclude the filtered words
The results should be sorted by word count in reverse order

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
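No solution is given in the source; a minimal PySpark sketch, assuming the three files have been uploaded to a single HDFS directory (/user/cloudera/wordcount/ is a hypothetical path):

# Word count over all files, excluding the filtered words,
# sorted by count descending, written with gzip compression
stop_words = {"a","the","an","as","with","this","these","is","are","in","for","to","and","The","of"}

sc.textFile("/user/cloudera/wordcount/"). \
flatMap(lambda line: line.split()). \
filter(lambda w: w not in stop_words). \
map(lambda w: (w, 1)). \
reduceByKey(lambda a, b: a + b). \
sortBy(lambda kv: kv[1], ascending=False). \
map(lambda kv: "%s,%d" % (kv[0], kv[1])). \
saveAsTextFile("/result/location/hdfs/comp_result/",
               compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")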
Question 49 of 65
Instructions
You have been given the list below in (name, sex, cost) format.

Data Description
List format:
List(("Deepak","male",4000), ("Deepak","male",2000), ("Deepika","female",2000), ("Deepak","female",2000), ("Deepak","male",1000), ("Neeta","female",2000))

Output Requirement
Store the result in the HDFS location /result/location/hdfs/emplist/
Perform the sum of cost for each combination of name and sex

Sample result format
Deepak, male, 7000
Neeta, female, 2000

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
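A PySpark sketch of the aggregation (the data description shows a Scala List; the same records are used here as a Python list):

# Sum the cost per (name, sex) pair and save as text
data = [("Deepak","male",4000), ("Deepak","male",2000), ("Deepika","female",2000),
        ("Deepak","female",2000), ("Deepak","male",1000), ("Neeta","female",2000)]

sc.parallelize(data). \
map(lambda x: ((x[0], x[1]), x[2])). \
reduceByKey(lambda a, b: a + b). \
map(lambda kv: "%s, %s, %d" % (kv[0][0], kv[0][1], kv[1])). \
saveAsTextFile("/result/location/hdfs/emplist/")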
Question 50 of 65
Instructions
Download the sales.txt file from Dropbox
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0

Data Description
Sales data is stored in a comma delimited file.
Column        Type
Department    string
Designation   string
costToCompany int
State         string

Output Requirements
Results should be saved in a csv file along with the header in the location /result/location/hdfs/
Produce the output grouped by Department, Designation, State with additional columns sum(costToCompany) and TotalEmployeeCount

Sample Results
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
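No solution accompanies this one; a sketch assuming sales.txt has been uploaded to HDFS at /user/cloudera/sales.txt (a hypothetical path):

from pyspark.sql.functions import col, count, sum as sum_

# Read the comma delimited sales data and aggregate per Dept/Desg/state
spark. \
read. \
csv("/user/cloudera/sales.txt"). \
select(col("_c0").alias("Dept"), col("_c1").alias("Desg"),
       col("_c2").cast("int").alias("cost"), col("_c3").alias("state")). \
groupBy("Dept", "Desg", "state"). \
agg(count("*").alias("empCount"), sum_("cost").alias("totalCost")). \
write. \
mode("overwrite"). \
option("header", "true"). \
csv("/result/location/hdfs/")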
Question 51 of 65
Instructions
Download the user.csv file from Dropbox
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0

Data Description
Records are stored in a comma delimited format.
Column  Type
Id      String
Topic   String
Hits    Int

Output Requirements
Store the results in the HDFS location /result/location/local/hdfs/usermap
Results should be saved as a text file
Results should not contain the file header, and the data should be stored in the format
Map(id -> om, topic -> scala, hits -> 120)

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Question 52 of 65
Instructions
Download the 2 files from the Dropbox location
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
EmployeeName.csv
EmployeeSalary.csv

Data Description
Records are stored in a comma delimited format.
Employee Name
Column  Type
Id      int
name    String
Employee Salary
Column  Type
Id      int
salary  String

Output Requirements
Store the results in the HDFS location /result/location/local/hdfs/employeesalary
Save the data in multiple files grouped by salary, i.e. each file will have the names of employees with the same salary. Make sure the file name includes the salary as well.

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
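One way to sketch this is with a partitioned write, assuming the two csv files have been uploaded to HDFS (the input paths below are hypothetical); partitionBy creates one salary=<value> directory per distinct salary, which puts the salary into the output path name:

from pyspark.sql.functions import col

names = spark.read.csv("/user/cloudera/EmployeeName.csv"). \
select(col("_c0").alias("id"), col("_c1").alias("name"))

salaries = spark.read.csv("/user/cloudera/EmployeeSalary.csv"). \
select(col("_c0").alias("id"), col("_c1").alias("salary"))

# Join on id and write one partition per distinct salary
names.join(salaries, "id"). \
select("name", "salary"). \
write. \
mode("overwrite"). \
partitionBy("salary"). \
csv("/result/location/local/hdfs/employeesalary")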
Question 53 of 65
Instructions
Download the 2 files from the Dropbox location
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
technology.txt
salary.txt

Data Description
Records are stored in a comma delimited format.
Technology
Column      Type
Fname       String
Lname       String
Technology  string
Salary
Column      Type
Fname       String
Lname       String
Salary      int

Output Requirements
Store the results in the HDFS location /result/location/hdfs/techiesalary/
Results should be saved as a text file
The result should be in the format Fname,Lname,Technology,Salary

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
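A sketch joining the two files on first and last name (the input paths are hypothetical upload locations):

from pyspark.sql.functions import col

tech = spark.read.csv("/user/cloudera/technology.txt"). \
select(col("_c0").alias("fname"), col("_c1").alias("lname"), col("_c2").alias("technology"))

sal = spark.read.csv("/user/cloudera/salary.txt"). \
select(col("_c0").alias("fname"), col("_c1").alias("lname"), col("_c2").cast("int").alias("salary"))

# Join on (fname, lname) and write comma delimited text
tech.join(sal, ["fname", "lname"]). \
select("fname", "lname", "technology", "salary"). \
write. \
mode("overwrite"). \
csv("/result/location/hdfs/techiesalary/")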
Question 54 of 65
Instructions
The data is available in the MySQL instance running on node ip-172-31-20-247, in the products table of the retail_db database.

Data Description
MySQL database information:
Installation: on the cluster node ip-172-31-20-247
Database name: retail_db
Table name: products
Username: admin
Password: admin

Output Requirements
Store the results in the HDFS location /res/loca/hdfs/prod/
Sort the products data by product price per category
Loaded records should be in the format (product_category_id, (product_id,product_name,product_price))

End of Problem
{Download the product_39.txt file for the problem statement if there are issues sqooping the data.}
Download the files from Dropbox
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
The actual exam is hands-on; no multiple choice answers are provided.
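A sketch assuming the products table was first sqooped (or product_39.txt uploaded) to a comma delimited file at /user/cloudera/products (a hypothetical path), with the standard retail_db products layout product_id,product_category_id,product_name,product_description,product_price,product_image:

# Key by category, sort by (category, price), emit the required tuple format
sc.textFile("/user/cloudera/products"). \
map(lambda line: line.split(",")). \
filter(lambda p: p[4] != ""). \
map(lambda p: (int(p[1]), (p[0], p[2], float(p[4])))). \
sortBy(lambda kv: (kv[0], kv[1][2])). \
saveAsTextFile("/res/loca/hdfs/prod/")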
Question 55 of 65
Instructions
Download the sales.txt file from Dropbox
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0

Data Description
Sales data is stored in a comma delimited file.
Column        Type
Department    string
Designation   string
costToCompany int
State         string

Output Requirements
Results should be saved in a csv file along with the header in the location /result/location/hdfs/
Produce the output grouped by Department, Designation, State with additional columns sum(costToCompany) and TotalEmployeeCount

Sample Results
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
Question 56 of 65
Instructions
Download patients.csv from the Dropbox location
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0

Data Description
Records are stored in a comma delimited format.
Patients
Column         Type
patientID      int
name           String
dateOfBirth    String
lastVisitDate  string

Output Requirements
List patients whose last visit was more than 60 days ago
Store the results in the HDFS location /result/location/local/hdfs/patients3/
Store the results in JSON format
Select patients who are 18 years old or younger
Store the results in the HDFS location /result/location/local/hdfs/patients4/
Store the results in csv format with pipe | as the delimiter

Solution

val patients = sc.textFile("/file/location/local/hdfs/patients.csv")

case class PATIENTS(patientId:Integer, name:String, dateOfBirth:String, lastVisitDate:String)

val patDF = patients.map(x => x.split(",")).map(x => PATIENTS(x(0).toInt,x(1),x(2),x(3))).toDF()
patDF.createOrReplaceTempView("patients")

val result3 = spark.sql("""Select * from patients WHERE datediff(current_timestamp(),TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate,'yyyy-MM-dd') as TIMESTAMP))) > 60""")
val result4 = spark.sql("""Select * from patients WHERE datediff(current_timestamp(),TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth,'yyyy-MM-dd') as TIMESTAMP)))/365 <= 18""")

result3.write.json("/result/location/local/hdfs/patients3/")
result4.write.format("csv").option("delimiter","|").save("/result/location/local/hdfs/patients4/")
Question 57 of 65
Instructions
Your task is to retrieve billing records that have a large charge associated with them and store those records in compressed Parquet files.

Data Description
There are billing records stored in a metastore table called billing in the problem database.
Column  Type
id      int
charge  float
code    string
tstamp  string

Output Requirement
Store the result data in the HDFS location /user/cert/problem/solut/
Only retrieve billing records that have a charge greater than $10.00
Save the results in Parquet format with gzip compression
The Parquet file schema should be the same as the input billing table

Sample Results
5103830  19.41  X1    03/29/15 18:39:34
5456102  13.77  GA18  03/29/15 18:40:33

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
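No solution is given; a minimal sketch reading the table straight from the metastore:

# Filter billing records with charge > 10.00 and write gzip Parquet,
# keeping the table's original schema
spark. \
table("problem.billing"). \
filter("charge > 10.00"). \
write. \
mode("overwrite"). \
option("compression", "gzip"). \
parquet("/user/cert/problem/solut/")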
Question 58 of 65
Instructions
Download the customer.txt file available at the Dropbox location
https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
The customer data needs to be converted into a new file format for query performance improvement.

Data Description
Customer records are stored in the HDFS location /user/spark/problem/data/customer/

Output Requirement
Save the result data in the HDFS location /user/spark/problem/solution
The data needs to be stored in snappy compressed Parquet files
The output should have the listed columns
Column  Type
id      string
fname   string
lname   string
street  string
city    string
state   string
zip     string

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
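A sketch assuming customer.txt is comma delimited with the seven columns in the listed order (the delimiter and column positions are assumptions to verify against the actual file):

from pyspark.sql.functions import col

# Rename the positional csv columns to the required names, write snappy Parquet
spark. \
read. \
csv("/user/spark/problem/data/customer/"). \
select(col("_c0").alias("id"), col("_c1").alias("fname"), col("_c2").alias("lname"),
       col("_c3").alias("street"), col("_c4").alias("city"), col("_c5").alias("state"),
       col("_c6").alias("zip")). \
write. \
mode("overwrite"). \
option("compression", "snappy"). \
parquet("/user/spark/problem/solution")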
Question 59 of 65
Instructions
Calculate how many customers live in each city of the country.

Data Description
Records are in the HDFS location /user/spark/problem/data/cust/
The file contains many columns, including address information.
Column    Type
id        string
fname     string
lname     string
address   string
city      string
state     string
Zip code  int

Output Requirement
Save the result data in the HDFS location /user/spark/problem/solution/cust
Use text format with tab as the column delimiter
The result should contain a single entry for each city
Output the city, state and total number of customers in that city

Sample Results
St. Paul     MN  48
Minneapolis  MC  91

End of Problem
The actual exam is hands-on; no multiple choice answers are provided.
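A sketch assuming the input is comma delimited with the columns in the listed order (delimiter and positions are assumptions):

from pyspark.sql.functions import col

# Count customers per (city, state) and write tab separated text
spark. \
read. \
csv("/user/spark/problem/data/cust/"). \
select(col("_c4").alias("city"), col("_c5").alias("state")). \
groupBy("city", "state"). \
count(). \
write. \
mode("overwrite"). \
option("sep", "\t"). \
csv("/user/spark/problem/solution/cust")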
Correct
Incorrect
Unattempted
Question 60 of 65
60. Question
Instructions
Download the 3 files from the dropbox location https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
product_71.csv
supplier_71.csv
product_supplier.csv
Data Description
Records are stored in a comma delimited format in each of the three files
Product column format
productID int
productCode String
name string
quantity int
price float
supplierid int
Supplier column format
supplierid int
name string
phone string
Product Supplier column format
productID int
supplierID int
Output Requirements
Store the results in the HDFS location /result/location/local/hdfs/prdsupplier
Results should be saved as a text file
Results should contain product name, price and supplier name where the product price is less than 0.6
End of Problem
Solution
val products = sc.textFile("/file/location/local/hdfs/product_71.csv")
val supplier = sc.textFile("/file/location/local/hdfs/supplier_71.csv")
val prdsup = sc.textFile("/file/location/local/hdfs/product_supplier.csv")
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float, supplierid: Integer)
case class Supplier(supplierid: Integer, name: String, phone: String)
case class PrdSup(productid: Integer, supplierid: Integer)
val prdDF = products.map(x => x.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt)).toDF()
val supDF = supplier.map(x => x.split(",")).map(p => Supplier(p(0).toInt, p(1), p(2))).toDF()
val prdsupDF = prdsup.map(x => x.split(",")).map(p => PrdSup(p(0).toInt, p(1).toInt)).toDF()
prdDF.createOrReplaceTempView("products")
supDF.createOrReplaceTempView("supplier")
prdsupDF.createOrReplaceTempView("prdsupplier")
// The product file already carries supplierid, so the join can bypass the prdsupplier link table
val result = spark.sql("""SELECT products.name, products.price, supplier.name AS sup_name FROM products JOIN supplier ON products.supplierid = supplier.supplierid WHERE price < 0.6""")
result.rdd.map(x => (x(0), x(1), x(2))).saveAsTextFile("/result/location/local/hdfs/prdsupplier/")
Correct
Answer provided in the problem statement above
Incorrect
Answer provided in the problem statement above
Unattempted
Answer provided in the problem statement above
Question 61 of 65
61. Question
Instructions
For security purposes, your company needs to refer to employees on particular forms without using their full name.
Data Description
Employee records are in HDFS location /user/spark/problem/data/employee/ in Parquet file format
Output Requirement
Save the results in HDFS directory /user/spark/problem/solution/employee/
Snappy compression should be used when storing the data
Create a column called 'alias' by taking the first letter of the first name and appending the last name
Write out the employee id, fname, lname and alias
Sample Results
7500000 Matthew Holmes MHolmes
7499998 Brittany Hewitt Bhewitt
End of Problem
The actual exam is a hands-on exam and no multiple choice answers are provided.
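No official solution accompanies this question. A minimal sketch of one possible approach; the output is assumed to be Parquet (matching the input), and the column name "id" for the employee id is an assumption:
// Build the alias from the first letter of fname plus lname, then write Snappy-compressed Parquet
import org.apache.spark.sql.functions._
val emp = spark.read.parquet("/user/spark/problem/data/employee/")
emp.select(col("id"), col("fname"), col("lname"), concat(substring(col("fname"), 1, 1), col("lname")).alias("alias")).write.mode("overwrite").option("compression", "snappy").parquet("/user/spark/problem/solution/employee/")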
Correct
Incorrect
Unattempted
Question 62 of 65
62. Question
Instructions
Download the 3 files from the dropbox location https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
EmployeeManager.csv
EmployeeName.csv
EmployeeSalary.csv
Data Description
Records are stored in a comma delimited format in each of the three files
Employee Manager column format (id, manager_name)
Employee Name column format (id, name)
Employee Salary column format (id, salary)
Output Requirements
Store the results in the HDFS location /result/location/hdfs/employeemanager/
Results should be saved as a text file
Generate a joined output as id,name,salary,managerName
End of Problem
The actual exam is a hands-on exam and no multiple choice answers are provided.
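No official solution accompanies this question. A minimal sketch using pair-RDD joins; the input paths under /file/location/local/hdfs/ are hypothetical, following the convention of the other solutions in this set:
// Key each file by employee id, join the three pair RDDs, and format the output as id,name,salary,managerName
val names = sc.textFile("/file/location/local/hdfs/EmployeeName.csv").map(_.split(",")).map(e => (e(0), e(1)))
val salaries = sc.textFile("/file/location/local/hdfs/EmployeeSalary.csv").map(_.split(",")).map(e => (e(0), e(1)))
val managers = sc.textFile("/file/location/local/hdfs/EmployeeManager.csv").map(_.split(",")).map(e => (e(0), e(1)))
names.join(salaries).join(managers).map { case (id, ((name, salary), manager)) => s"$id,$name,$salary,$manager" }.saveAsTextFile("/result/location/hdfs/employeemanager/")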
Correct
Incorrect
Unattempted
Question 63 of 65
63. Question
Instructions
Loudacre Telecom is concerned about mobile phone overheating. Sensor data is recorded from each device on a regular basis. Find the AVERAGE temperature for each phone model.
Data Description
Phone sensor data are stored in HDFS location /user/spark/problem/data/sensor/
Column Type
Timestamp int
Customer ID int
Phone ID string
Phone Model string
Latitude float
Longitude float
Firmware Ver int
Bluetooth Status int
GPS Status int
Wifi Status int
Battery Left float
Temperature float
Signal Strength float
Output Requirements
Save the result data in the HDFS location /user/spark/problem/solution/sensor/
The results should be in comma-delimited text format
Store the phone model and average temperature
Sample Results
SMGRZ 41.3556451744487
VVON7 44.2236067977499
End of Problem
The actual exam is a hands-on exam and no multiple choice answers are provided.
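No official solution accompanies this question. A minimal sketch of one possible approach; the assumption that the sensor records are comma delimited is not stated in the problem:
// Phone model is the 4th field (index 3) and temperature the 12th field (index 11) in the schema above
import org.apache.spark.sql.functions._
val sensor = sc.textFile("/user/spark/problem/data/sensor/").map(_.split(",")).map(s => (s(3), s(11).toFloat)).toDF("model", "temperature")
sensor.groupBy("model").agg(avg("temperature")).rdd.map(r => s"${r(0)},${r(1)}").saveAsTextFile("/user/spark/problem/solution/sensor/")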
Correct
Incorrect
Unattempted
Question 64 of 65
64. Question
Instructions
Create a report of bills owed by customers
Data Description
Customer data is stored in HDFS location /local/hdfs/customer/ in a tab delimited file
Columns: id, fname, lname, address, city, state, zip
Billing data is stored in HDFS location /local/hdfs/billing/ in a tab delimited file. custid is the foreign key to the customer that owns the bill
Columns: custid, amount, code, billdate
Output Requirements
Store the results in the HDFS location /solut/hdfs/custbil/
Results should be saved as a text file with tab as the delimiter
The first column should be the customer full name (first, space, last)
The second column should be the amount of money of a single billing transaction
End of Problem
Solution
val f1 = sc.textFile("/local/hdfs/customer/")
val f2 = sc.textFile("/local/hdfs/billing/")
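The provided solution stops after loading the two files. A minimal sketch of the remaining steps, keyed on the customer id, might look like this:
// Key customers by id with "fname lname" as the value, key bills by custid, join, and write name<TAB>amount
val cust = f1.map(_.split("\t")).map(c => (c(0), c(1) + " " + c(2)))
val bill = f2.map(_.split("\t")).map(b => (b(0), b(1)))
cust.join(bill).map { case (_, (fullName, amount)) => s"$fullName\t$amount" }.saveAsTextFile("/solut/hdfs/custbil/")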
Correct
Refer to the solution in the problem statement; due to lack of space, the full answer cannot fit in the options section.
Incorrect
Refer to the solution in the problem statement; due to lack of space, the full answer cannot fit in the options section.
Unattempted
Refer to the solution in the problem statement; due to lack of space, the full answer cannot fit in the options section.
Question 65 of 65
65. Question
Instructions
Download the patients file from the dropbox location https://www.dropbox.com/sh/x5froxn3m0cn8nd/AABs9BEfvRGM8a6wEHtB7CUFa?dl=0
Data Description
Records are stored in a comma delimited format
Patients
Column Type
patientID int
name String
dateOfBirth String
lastVisitDate String
Output Requirements
Find all the patients whose lastVisitDate falls between '2012-09-15' and the current time
Store the results in the HDFS location /result/location/local/hdfs/patients/
Store the results in json format with gzip compression
Find all the patients who were born in 2011
Store the results in the HDFS location /result/location/local/hdfs/patients1/
Store the results in text format
Find the age of all the patients
Store the results in the HDFS location /result/location/local/hdfs/patients2/
Store the results in parquet format
End of Problem
Solution
val patients = sc.textFile("/file/location/local/hdfs/patients.csv")
case class Patients(patientId: Integer, name: String, dateOfBirth: String, lastVisitDate: String)
val patDF = patients.map(x => x.split(",")).map(x => Patients(x(0).toInt, x(1), x(2), x(3))).toDF()
patDF.createOrReplaceTempView("patients")
// Last visit between 2012-09-15 and now
val result = spark.sql("""SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2012-09-15' AND current_timestamp() ORDER BY lastVisitDate""")
// Born in 2011
val result1 = spark.sql("""SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2011-01-01' AND '2011-12-31' ORDER BY lastVisitDate""")
// Age in years approximated as days since birth / 365
val result2 = spark.sql("""SELECT name, dateOfBirth, datediff(current_timestamp(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) / 365 AS age FROM patients""")
result.write.option("compression", "gzip").format("json").save("/result/location/local/hdfs/patients/")
result1.rdd.saveAsTextFile("/result/location/local/hdfs/patients1/")
result2.write.format("parquet").save("/result/location/local/hdfs/patients2/")
Correct
Solution Provided in the problem statement
Incorrect
Solution Provided in the problem statement
Unattempted
Solution Provided in the problem statement