Top 160 Spark Questions and Answers for Job Interview
1. Tell us something about Shark.
Answer: Most data users know only SQL for database management and are not comfortable with other programming languages. Shark is a tool developed specifically for such users. It lets them access Scala MLlib capabilities through a Hive-like SQL interface. In essence, Shark helps users run Hive on Spark while remaining compatible with the Hive metastore, queries, and data.
2. Mention some scenarios where Spark has been found to perform better than Hadoop in processing.
Answer: There are a number of scenarios where Spark has been found to outperform Hadoop:
• Sensor Data Processing – Apache Spark’s in-memory computing works best here, as data has to be retrieved and combined from different sources.
• Real-time querying – Spark is usually preferred over Hadoop for real-time querying of data.
• Stream Processing – Apache Spark is well suited to processing logs and detecting fraud in live streams for alerts.
3. Do you know anything about Sparse Vector?
Answer: A sparse vector has two parallel arrays –
• One for indices
• One for values
These vectors are used for storing non-zero entries to save space.
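The two-array layout can be sketched in plain Python. This is only an illustration of the idea, not Spark's own vector class; the `SparseVector` name, its fields, and its methods below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseVector:
    """Hypothetical sparse vector: two parallel arrays plus the logical size."""
    size: int             # logical length of the vector
    indices: List[int]    # positions of the non-zero entries
    values: List[float]   # non-zero values, aligned with `indices`

    def to_dense(self) -> List[float]:
        # Expand to a full list, filling the unstored positions with zeros.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, other: List[float]) -> float:
        # Only the stored (non-zero) entries contribute to the dot product.
        return sum(v * other[i] for i, v in zip(self.indices, self.values))


vec = SparseVector(size=6, indices=[1, 4], values=[3.0, 5.0])
dense = vec.to_dense()
product = vec.dot([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Here only two values are stored for a vector of length six, which is where the space saving comes from.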
4. Mention some points about RDD.
Answer: RDD is the acronym for Resilient Distributed Dataset. RDDs are the abstraction Apache Spark uses to represent data coming into the system in object format. They are used for in-memory computations on large clusters in a fault-tolerant manner. An RDD is a read-only, partitioned collection of records with two key properties –
• Immutable – No RDD can be altered.
• Resilient – If a node holding a partition fails, another node automatically takes over that data.
5. Do you know anything about transformations and actions with respect to Resilient Distributed Datasets?
Answer: Transformations are functions that are executed on demand to produce a new RDD. Transformations are followed by actions. Examples of transformations include map, filter, and reduceByKey.
Actions return the results of RDD computations and transformations. After an action is performed, the data from the RDD is returned to the local machine. Reduce, collect, first, and take are some examples of actions.
6. List the languages supported by Apache Spark for developing any big data applications.
Answer: The languages supported by Apache Spark for developing big data applications are Java, Scala, Python, and R.
7. Is there an option for a user to use Spark to access and investigate any external data stored in Cassandra databases?
Answer: Yes, it is possible to use Spark Cassandra Connector to analyze and access external data stored in Cassandra databases.
8. Can a user use Apache Spark on Apache Mesos?
Answer: Yes, Apache Spark can be executed on hardware clusters managed by Apache Mesos. This is one of the features that make Apache Spark popular.
9. Mention something about all the different cluster managers available in Apache Spark.
Answer: The 3 different cluster managers supported in Apache Spark are:
• Standalone deployments – These are well suited for new deployments and are very easy to set up.
• Apache Mesos – This has rich resource scheduling capabilities and is well suited to running Spark along with other applications. It is especially advantageous when numerous users run interactive shells, mainly because it scales down the CPU allocation between commands.
• YARN – This is the cluster manager responsible for resource management in Hadoop.
10. Can Spark be connected to Apache Mesos by a user?
Answer: Yes, a user can connect Spark to Apache Mesos in either of the following ways –
• Configure the Spark driver program to connect to Mesos, and place the Spark binary package in a location accessible by Mesos.
• Alternatively, install Apache Spark in the same location as Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it has been installed.
11. Can the user minimize data transfers when working with Spark? If yes, how?
Answer: Yes, the user can minimize data transfers while working with Spark. Minimizing data transfers and avoiding shuffles helps the user write Spark programs that run quickly and reliably. The various ways in which data transfers can be reduced when working with Apache Spark are:
• Using Broadcast Variables – Broadcast variables are designed to enhance the efficiency of joins between small and large RDDs.
• Avoiding shuffles – The most common way to minimize data transfers is to avoid ByKey operations, repartition, or any other operation that triggers a shuffle.
• Using Accumulators – Accumulators help the user update the values of variables in parallel while the program executes.
12. Does a user need broadcast variables while working with Apache Spark? If yes, why?
Answer: Broadcast variables are read-only variables that are kept in the memory cache on every machine. When working with Spark, the user needs broadcast variables to avoid sending a copy of a variable with every task, so that data can be processed faster. Broadcast variables also help to store a lookup table in memory, which makes retrieval more efficient than an RDD lookup().
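The lookup-table idea can be sketched in plain Python, without Spark. In this hypothetical example the nested lists stand in for the partitions of a large RDD, and the dictionary stands in for a small table that would be broadcast once per machine; the names `country_names`, `partitions`, and `enrich` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Small lookup table: the kind of data one would broadcast once per machine.
country_names: Dict[str, str] = {"IN": "India", "US": "United States"}

# Larger dataset split into partitions (plain lists stand in for RDD partitions).
partitions: List[List[Tuple[str, int]]] = [
    [("IN", 10), ("US", 20)],
    [("US", 5), ("FR", 7)],
]


def enrich(partition: List[Tuple[str, int]], lookup: Dict[str, str]) -> List[Tuple[str, int]]:
    # Each task reads the shared, read-only lookup instead of moving the
    # large dataset around to join it with the small one.
    return [(lookup.get(code, "unknown"), amount) for code, amount in partition]


enriched = [enrich(p, country_names) for p in partitions]
```

The large dataset never leaves its partitions; only the small table is shared, which is the reason such a join avoids a shuffle.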
13. Can a user execute Spark and Mesos alongside Hadoop?
Answer: Yes, it is possible to run Spark and Mesos with Hadoop by launching each of them as a separate service on the machines. Apache Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
14. Do you know anything about the lineage graph?
Answer: Every RDD in Spark depends on one or more other RDDs. The representation of these dependencies between RDDs is called the lineage graph. The information in the lineage graph is used to compute each RDD on demand, so that whenever part of a persistent RDD is lost, the lost data can be recovered using the lineage graph.
15. Can a user trigger automatic clean-ups in Spark in order to handle accumulated metadata?
Answer: Yes, the user can trigger automatic clean-ups by setting the parameter ‘spark.cleaner.ttl’. Alternatively, the user can achieve the same by dividing the long-running jobs into different batches and writing all intermediary results to the disk.
16. What do you know about the major libraries that constitute the Spark Ecosystem?
Answer: The following are the major libraries that make up a bulk of the Spark Ecosystem:
• Spark Streaming – This library is generally used to process real-time streaming data.
• Spark MLlib – This is the machine learning library in Spark and is commonly used for learning algorithms like clustering, regression, classification, etc.
• Spark SQL – This library helps to execute SQL-like queries on Spark data using standard visualization or BI tools.
• Spark GraphX – This is the Spark API for graph-parallel computations, along with basic operators like joinVertices, subgraph, aggregateMessages, etc.
17. Mention the benefits of using Spark with Apache Mesos.
Answer: Spark when used with Apache Mesos renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and any other big data framework.
18. Mention the importance of Sliding Window operation.
Answer: In networking, a sliding window controls the transmission of data packets between computers. In Spark, the Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, all the RDDs that fall within that window are combined and operated upon to produce the RDDs of the windowed DStream.
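A windowed computation can be sketched in plain Python. This is a simplified illustration rather than the Spark Streaming API: each inner list stands in for the RDD of one batch, and the function name `windowed_sums` and its `window` and `slide` parameters (both counted in batches) are assumptions made for the example.

```python
from typing import List


def windowed_sums(batches: List[List[int]], window: int, slide: int) -> List[int]:
    """Sum the elements of each window of `window` batches, advancing by `slide`."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        in_window = batches[end - window:end]   # batches that fall inside this window
        results.append(sum(sum(batch) for batch in in_window))
    return results


batches = [[1, 2], [3], [4, 5], [6], [7]]
sums = windowed_sums(batches, window=3, slide=2)
```

With a window of three batches and a slide of two, the first window covers batches 1 to 3 and the second covers batches 3 to 5, so batch 3 contributes to both results.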
19. What do you know about a DStream?
Answer: A Discretized Stream, generally known as a DStream, is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
All kinds of DStreams have two operations –
• Output operations that write data to an external system
• Transformations that produce a new DStream
20. While running Spark applications, is it necessary for the user to install Spark on all the nodes of a YARN cluster?
Answer: No. One of the most striking features of Spark is that it does not need to be installed on every node when running a job under YARN or Mesos. This is because Spark can execute on top of YARN or Mesos clusters without causing any change to the cluster.
21. Tell us something about the Catalyst framework.
Answer: Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by applying new optimizations in order to build a faster processing system.
22. Can you mention the companies that use Apache Spark in their respective production?
Answer: Some of the companies that make use of the Apache Spark in their production are
• Open Table
23. Do you know about the Spark library that allows reliable file sharing at memory speed across different cluster frameworks?
Answer: Tachyon is the library that allows reliable file sharing at memory speed across different cluster frameworks.
24. Do you know anything about BlinkDB? Why is it used?
Answer: BlinkDB is a query engine for executing interactive SQL queries on large volumes of data. It returns query results annotated with meaningful error bars, which helps users trade off query accuracy against response time.
25. Differentiate between Hadoop and Spark with respect to ease of use.
Answer: Hadoop MapReduce requires the user to program in Java, which is difficult. Pig and Hive were developed to make this considerably easier, but learning the syntax of Pig and Hive takes a lot of time. Spark has interactive APIs for different languages like Java, Python, and Scala, and also includes Spark SQL. This makes it comparatively easier to use than Hadoop.
26. Mention the common mistakes that developers usually commit when running Spark applications.
Answer: The common mistakes that developers usually commit when running Spark applications:
• Hitting a web service several times by using multiple clusters.
• Running everything on the local node instead of distributing it.
27. Mention the advantages of a Parquet file.
Answer: Parquet is a columnar file format that helps the user to –
• Consume less space
• Limit I/O operations
• Fetch only the required columns
28. Mention the various data sources available in SparkSQL.
Answer: The various data sources available in SparkSQL are:
• Parquet file
• Hive tables
• JSON Datasets
29. How can a user execute Spark using Hadoop?
Answer: Spark has its own cluster management for computation and mainly uses Hadoop for storage.
30. Mention the features of Apache Spark that make it so popular.
Answer: The features of Apache Spark that make it so popular are:
• Apache Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.
• Apache Spark offers good performance gains, as it can run an application in a Hadoop cluster up to ten times faster on disk and up to 100 times faster in memory.
• Apache Spark has built-in APIs in multiple languages like Java, Scala, Python and R.
31. Tell us something about the Pair RDD.
Answer: Special operations can be performed on RDDs whose elements are key/value pairs. Such RDDs are called Pair RDDs. Pair RDDs allow users to operate on each key in parallel. They have a reduceByKey() method that aggregates data for each key, and a join() method that combines different RDDs based on the elements that have the same key.
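The behaviour of the two methods can be sketched on ordinary Python lists of key/value tuples. This only mimics the semantics on a single machine; `reduce_by_key` and `join` below are hypothetical helper functions, not Spark calls.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Pair = Tuple[str, Any]


def reduce_by_key(pairs: List[Pair], func: Callable[[Any, Any], Any]) -> Dict[str, Any]:
    # Merge all the values that share a key using `func`.
    merged: Dict[str, Any] = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def join(left: List[Pair], right: List[Pair]) -> List[Tuple[str, Tuple[Any, Any]]]:
    # Pair up elements from both sides that have the same key (inner join).
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


sales = [("apple", 2), ("pear", 1), ("apple", 3)]
prices = [("apple", 0.5), ("kiwi", 0.9)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(sales, prices)
```

Keys that appear on only one side ("pear" and "kiwi" here) are dropped by the join, which matches inner-join behaviour.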
32. Between the Hadoop MapReduce and Apache Spark, which must be used for a project?
Answer: The choice depends on the given project scenario. Spark uses memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark depends on the requirements of the project and the budget of the organization.
33. Mention the different types of transformations on DStreams.
Answer: The different types of transformations on DStreams are:
• Stateful Transformations – Processing of the batch depends on the intermediary results of the previous batch.
o Examples – Transformations that depend on sliding windows.
• Stateless Transformations – Processing of the batch does not depend on the output of the previous batch.
o Examples – map(), reduceByKey(), filter().
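The difference between the two kinds can be sketched in plain Python, with each inner list standing in for one batch of a stream. The function name `running_counts` is hypothetical, and the code only illustrates the idea rather than the DStream API.

```python
from typing import Dict, List

batches = [["a", "b"], ["a"], ["b", "b"]]

# Stateless: every batch is processed on its own, with no memory of earlier batches.
upper = [[word.upper() for word in batch] for batch in batches]


def running_counts(stream: List[List[str]]) -> List[Dict[str, int]]:
    # Stateful: the counts carried over from earlier batches feed into each new batch.
    state: Dict[str, int] = {}
    snapshots = []
    for batch in stream:
        for word in batch:
            state[word] = state.get(word, 0) + 1
        snapshots.append(dict(state))   # copy of the state after this batch
    return snapshots


counts = running_counts(batches)
```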
34. Explain about the popular use cases of Apache Spark.
Answer: Apache Spark is mainly used for:
• Iterative machine learning.
• Interactive data analytics and processing.
• Stream processing
• Sensor data processing
35. Can Apache Spark be used for Reinforcement learning?
Answer: No, a user cannot use Apache Spark for Reinforcement Learning. The Apache Spark works well for simple machine learning algorithms like clustering, regression, and classification.
36. What do you know about the Spark Core?
Answer: Spark Core is one of the features of Spark. It has all the basic functionalities of Spark, such as – interacting with storage systems, memory management, fault recovery, scheduling tasks, etc.
37. Can the user remove the elements with a key present in another RDD?
Answer: The user can remove the elements with a key present in another RDD by using the subtractByKey() function.
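The effect of subtractByKey() can be sketched on plain Python lists of key/value tuples. The helper `subtract_by_key` is a hypothetical single-machine stand-in that only mirrors the semantics.

```python
from typing import Any, List, Tuple

Pair = Tuple[str, Any]


def subtract_by_key(left: List[Pair], right: List[Pair]) -> List[Pair]:
    # Keep the pairs from `left` whose key does not appear anywhere in `right`.
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


users = [("alice", 1), ("bob", 2), ("carol", 3)]
blocked = [("bob", "spam")]
remaining = subtract_by_key(users, blocked)
```

Only the keys of the second collection matter; its values are ignored.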
38. Differentiate between persist() and cache() methods.
Answer: The persist() method allows the user to specify the storage level, whereas the cache() method uses the default storage level (MEMORY_ONLY for RDDs).
39. Tell us about the various levels of persistence in Apache Spark.
Answer: Apache Spark automatically persists the intermediary data from a number of shuffle operations. However, it is often recommended that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are –
• MEMORY_ONLY
• MEMORY_ONLY_SER
• MEMORY_AND_DISK
• MEMORY_AND_DISK_SER
• DISK_ONLY
• OFF_HEAP
40. How does Spark handle monitoring and logging in Standalone mode?
Answer: Spark provides a web-based user interface for monitoring the cluster in standalone mode, which shows cluster and job statistics. The log output for each job is written to the working directory of the slave nodes.
41. Can the Apache Spark provide checkpointing to the user?
Answer: Apache Spark uses lineage graphs to recover RDDs from a failure. However, this is time-consuming if the RDDs have long lineage chains. Spark provides an API for checkpointing, i.e. a REPLICATE flag to persist. The user decides which data to checkpoint. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
42. How can the user launch Spark jobs within Hadoop MapReduce?
Answer: By using SIMR (Spark in MapReduce), users are able to execute any Spark job inside MapReduce without needing admin rights.
43. How is Spark able to use Akka?
Answer: Spark uses Akka for scheduling. After registering, the workers request tasks from the master, and the master assigns the tasks. Spark uses Akka for this messaging between the workers and the master.
44. How is the user able to achieve high availability in Apache Spark?
Answer: The user is able to achieve high availability in Apache Spark by applying the given methods:
• By implementing single-node recovery with the local file system.
• By using the StandBy Masters with the Apache ZooKeeper.
45. How does Apache Spark achieve fault tolerance?
Answer: The data storage model in Apache Spark is based on RDDs. The RDDs help achieve fault tolerance through lineage graphs. The RDD has been designed to always store information on how to build from other datasets. If any partition of an RDD is lost due to failure, the lineage helps to build only that particular lost partition.
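Lineage-based recovery can be sketched in plain Python. The `LineageDataset` class below is hypothetical and far simpler than a real RDD: each dataset remembers only its parent and the function applied to it, which is enough to rebuild a single lost partition on demand.

```python
from typing import Callable, Dict, List, Optional


class LineageDataset:
    """Hypothetical dataset that remembers how it was built from its parent."""

    def __init__(self, source: Optional[List[List[int]]] = None,
                 parent: Optional["LineageDataset"] = None,
                 func: Optional[Callable[[int], int]] = None) -> None:
        self.source = source                    # partitions of the base data, if any
        self.parent = parent                    # lineage: the dataset this one derives from
        self.func = func                        # lineage: the function applied to the parent
        self.cache: Dict[int, List[int]] = {}   # partitions computed so far

    def map(self, func: Callable[[int], int]) -> "LineageDataset":
        return LineageDataset(parent=self, func=func)

    def compute(self, index: int) -> List[int]:
        # Rebuild only the requested partition, walking up the lineage if needed.
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            result = list(self.source[index])
        else:
            result = [self.func(x) for x in self.parent.compute(index)]
        self.cache[index] = result
        return result


base = LineageDataset(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

first = squared.compute(1)       # partition 1 is computed and cached
squared.cache.pop(1)             # simulate losing that partition
recovered = squared.compute(1)   # rebuilt from the lineage; partition 0 is untouched
```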
46. Explain the core components of a distributed Spark application
Answer: The core components of any distributed Spark application are as follows:
• Executor – It consists of the worker processes that run the individual tasks of a Spark job.
• Driver- This consists of the process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
• Cluster Manager- This is a pluggable component in Spark which is used to launch Executors and Drivers. The cluster manager allows the Spark to run with external managers like Apache Mesos or YARN in the background.
47. Do you know anything about Lazy Evaluation?
Answer: When Spark is instructed to operate on a given dataset, it records the instructions but does nothing with them until the user asks for the final result.
When a transformation like map() is called on an RDD, Spark does not perform the operation immediately. Transformations in Spark are not evaluated until the user performs an action. This helps to optimize the overall data processing workflow.
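Lazy evaluation can be sketched in plain Python. The `LazyDataset` class below is a hypothetical, single-machine illustration: its map() and filter() only record the requested step, and nothing runs until collect() is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Hypothetical dataset that records transformations and runs them only on an action."""

    def __init__(self, data: Iterable[Any], steps: Tuple = ()) -> None:
        self._data = data
        self._steps = tuple(steps)

    def map(self, func: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: just note the step and return a new dataset.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: only now are the recorded steps actually executed.
        result = list(self._data)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)   # record every time the function really runs
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still zero: nothing has executed yet
result = pipeline.collect()
```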
48. What do you know about a worker node?
Answer: A worker node is a node that can run the Spark application code in a cluster. A worker node can have more than one worker process, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
49. Tell us something about SchemaRDD.
Answer: An RDD that comprises row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column is known as a SchemaRDD.
50. Mention the disadvantages of using Apache Spark over Hadoop MapReduce.
Answer: Apache Spark does not perform very well for compute-intensive jobs and consumes a large amount of system resources. Its in-memory capability is a major barrier to cost-efficient processing of big data. Spark also does not have its own file management system, so it needs to be integrated with a cloud-based data platform or with Apache Hadoop.
51. Does the user need to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN?
Answer: No, it is not necessary for the user to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN because Apache Spark runs on top of YARN.
52. Do you know anything about the Executor Memory in a Spark application?
Answer: Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is known as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is a measure of how much of the worker node's memory the application uses.
53. What has the Spark Engine been designed to accomplish?
Answer: The Spark engine has been designed to schedule, distribute, and monitor data applications across the Spark cluster.
54. Apache Spark is good at low-latency workloads like graph processing and machine learning. Elaborate on the reasons behind this.
Answer: Apache Spark stores data in memory for faster model building and training. Machine learning algorithms require multiple iterations to produce an optimal model. Similarly, graph algorithms traverse all the nodes and edges. Workloads like these, which need multiple iterations, see increased performance when the data is kept in memory. Less disk access and controlled network traffic make a significant difference when there is a lot of data to be processed.
55. Does the user have to start Hadoop to run any Apache Spark Application?
Answer: No, starting Hadoop is not mandatory for running a Spark application. As there is no separate storage in Apache Spark, it commonly uses Hadoop HDFS, but this is not required. The data can also be stored in a local file system, loaded from there, and processed.
56. Mention the default level of parallelism in Apache Spark.
Answer: If the user does not explicitly state the level of parallelism, then the number of partitions is taken as the default level of parallelism in Apache Spark.
57. Do you know anything about the common workflow of a Spark program?
Answer: Yes, the following is the common workflow of the Spark Program:
• The first step in a Spark program is the creation of the input RDDs from external data.
• Next, RDD transformations like filter() are used to create new transformed RDDs based on the business logic.
• The persist() method is used for any intermediate RDDs which might have to be reused in the future.
• Finally, RDD actions like first() and count() are launched to begin parallel computation, which Spark then optimizes and executes.
58. In a Spark program, how can the user identify whether a given operation is a Transformation or Action?
Answer: One can identify the operation based on the return type –
• The operation is an Action if the return type is anything other than RDD.
• The operation is a Transformation if the return type is an RDD.
59. What is a common mistake any Apache Spark developer usually makes while working with Spark?
Answer: Some of the common mistakes that Apache Spark developers make while working with Spark are:
• Not maintaining the required size of shuffle blocks.
• Mismanaging directed acyclic graphs (DAGs).
60. Differentiate between Spark SQL and Hive.
Answer: The following are the differences between Spark SQL and Hive:
• Any Hive query can easily be executed in Spark SQL but vice-versa is not true.
• Spark SQL is faster than Hive.
• It is not compulsory to create a metastore in Spark SQL, but it is compulsory to create a metastore in Hive.
• Spark SQL is a library while Hive is a framework.
• Spark SQL automatically deduces the schema while in Hive, the schema needs to be explicitly declared.
61. Mention the sources from where Spark streaming component can process real-time data.
Answer: Apache Flume, Apache Kafka, and Amazon Kinesis are the sources usually used with the Spark Streaming component to process real-time data.
62. What are the companies that are currently using Spark Streaming?
Answer: Uber, Netflix, Pinterest are some of the companies that are currently making use of Spark Streaming.
63. What is the bottom layer of abstraction in the Spark Streaming API?
Answer: DStream is the bottom layer of abstraction in the Spark Streaming API.
64. Do you know anything about receivers in Spark Streaming?
Answer: Receivers are special entities in Spark Streaming that consume data from various data sources and move them accordingly to Apache Spark. Receivers are usually created by streaming contexts as long-running tasks on various executors and scheduled to operate in a Round-Robin manner with each receiver taking a single core.
65. How is the user supposed to calculate the number of executors required to do real-time processing using Apache Spark? What factors need to be considered for deciding on the number of nodes for real-time processing?
Answer: The number of nodes can be calculated by benchmarking the hardware. While doing so, one must also consider factors such as optimal throughput (network speed), memory usage, the execution framework being used (YARN, Standalone, or Mesos), and the other jobs running within that execution framework along with Spark.
66. Differentiate between Spark Transform in DStream and Map.
Answer: The transform() function in Spark Streaming allows developers to use Apache Spark transformations on the underlying RDDs of the stream.
The map() function performs an element-to-element transformation and can be implemented using the transform() function. The map() method works on the elements of a DStream, while the transform() method allows developers to work with the RDDs of the DStream. In short, map() is an elementary transformation whereas transform() is an RDD transformation.
67. What is Apache Spark?
Answer: The Apache Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, can run standalone or in the cloud and is also capable of accessing diverse data sources including HDFS, HBase, Cassandra, and others.
68. Explain some of the key features of Spark.
Answer: Spark has become a favorite tool among developers due to the following features:
• Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
• Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing
• Allows Integration with Hadoop and files included in HDFS.
• Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter.
69. Do you know anything about RDD?
Answer: RDD is the acronym for Resilient Distributed Dataset. It is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDD:
• Parallelized Collections: Existing collections distributed so that their elements run in parallel with one another.
• Hadoop datasets: These perform a function on each file record in HDFS or another storage system.
70. Define Partitions.
Answer: A partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Everything in Spark is a partitioned RDD.
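Partitioning can be sketched in plain Python. The helper `split_into_partitions` below is hypothetical and uses a simple round-robin split; it only illustrates how a dataset divided into logical units can be processed piece by piece and the partial results combined.

```python
from typing import List


def split_into_partitions(records: List[int], num_partitions: int) -> List[List[int]]:
    # Round-robin split: record i goes to partition i % num_partitions.
    return [records[i::num_partitions] for i in range(num_partitions)]


records = list(range(10))
partitions = split_into_partitions(records, 3)

# Each partition can be processed independently (in a cluster, in parallel),
# and the partial results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
```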
71. What are the operations supported by RDD?
Answer: An RDD supports only the following two kinds of operations:
• Transformations
• Actions
72. What do you understand by Transformations in Spark?
Answer: Transformations are functions applied on an RDD, resulting in another RDD. A transformation does not execute until an action occurs. The map() and filter() methods are examples of transformations. The map() method applies the function passed to it to each element of the RDD and results in another RDD. The filter() method creates a new RDD by selecting the elements of the current RDD that pass the function argument.
73. Elaborate on the concept of Actions.
Answer: An action in Spark brings data from an RDD back to the local machine. Executing an action triggers all the previously created transformations. The reduce() method is an action that applies the function passed to it again and again until only one value is left. The take(n) action returns the first n values from the RDD to the local node.
74. Mention the functions of SparkCore.
Answer: The SparkCore acts as the base engine and performs a number of functions such as:
• Memory management
• Job scheduling
• Monitoring jobs
• Interaction with storage systems
75. What do you know about RDD Lineage?
Answer: Spark does not support data replication in memory. If any data is lost, it is automatically rebuilt using RDD lineage. RDD lineage is the process that reconstructs lost data partitions; an RDD always remembers how it was built from other datasets.
76. What do you know about Spark Driver?
Answer: Spark Driver is the program that runs on the master node of the machine and is used to declare transformations and Actions on data RDDs. The driver in Spark creates SparkContext, connected to a given Spark Master. The driver also delivers the RDD graphs to the Spark Master, where the standalone cluster manager runs.
77. What do you know about Hive on Spark?
Answer: Hive contains significant support for Apache Spark. Hive execution is configured to use Spark with the following commands:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
78. Name some of the commonly-used Spark Ecosystems.
Answer: These are some of the commonly-used Spark Ecosystems:
• Spark SQL (Shark) for developers.
• SparkR to promote R Programming in Spark engine.
• GraphX for generating and computing graphs.
• MLlib (Machine Learning Algorithms).
• Spark Streaming for processing live data streams.
79. Do you know anything about Spark Streaming?
Answer: Spark Streaming is an extension to the Spark API that allows processing of live data streams. Data is procured from different sources like Flume and HDFS, streamed, and finally pushed out to file systems, live dashboards, and databases. This is similar to batch processing, as the input stream is divided into batches.
80. Tell us something about GraphX.
Answer: Spark uses the tool, GraphX for graph processing and to build and transform interactive graphs. The GraphX component enables programmers to study structured data at scale.
81. Why is the MLlib required?
Answer: MLlib is an accessible machine learning library provided within Spark. It makes machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.
82. What do you know about Spark SQL?
Answer: Spark SQL, earlier known as Shark, is a module introduced in Spark for working with structured data. It lets Spark execute relational SQL queries on the data. At its core is a special kind of RDD called the SchemaRDD. A SchemaRDD is composed of row objects and schema objects that define the data type of each column in the row, and is similar to a table in a relational database.
83. Tell us something about a Parquet file.
Answer: Parquet is a columnar file format supported by many data processing systems. Spark SQL performs both read and write operations on Parquet files.
84. What are the file systems supported by Spark?
Answer: The following are the file systems supported by Spark:
• Hadoop Distributed File System (HDFS).
• Local File system.
85. Elaborate on YARN.
Answer: As in Hadoop, YARN is one of the key features available to Spark. It provides a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
86. List the functions of Spark SQL.
Answer: The Spark SQL is capable of accomplishing the following functions:
• Loading data from a variety of structured sources.
• Providing integration between SQL and regular Python/Java/Scala code, along with the ability to join RDDs and SQL tables, and expose custom functions in SQL.
• Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
87. Mention the benefits of using Spark as compared to MapReduce.
Answer: Spark has a number of advantages as compared to MapReduce:
• Spark processes data roughly 10 to 100 times faster than Hadoop MapReduce because it works in memory, whereas MapReduce uses persistent storage for every data processing task.
• Spark provides built-in libraries that run multiple kinds of workload on the same core engine: batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce, however, only supports batch processing.
• Hadoop MapReduce is highly disk-dependent, while Spark promotes caching and in-memory data storage.
• Spark can perform iterative computation efficiently by reusing in-memory data across iterations, whereas Hadoop MapReduce has no built-in support for iterative computing.
88. Is there any benefit of learning Hadoop MapReduce?
Answer: Yes, it is worth learning Hadoop MapReduce. It is a paradigm used by many big data tools, including Spark, and it becomes increasingly relevant as the data grows bigger. Tools like Pig and Hive convert their queries into MapReduce phases, so understanding MapReduce helps to optimize those queries.
89. What do you know about the Spark Executor?
Answer: When the SparkContext connects to a cluster manager, it acquires Executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The SparkContext sends the final tasks to the Executors for execution.
90. Name the types of Cluster Managers present in Spark.
Answer: Spark supports three major types of Cluster Managers:
• Standalone: A basic cluster manager that is included with Spark and makes it easy to set up a cluster.
• Apache Mesos: A general-purpose cluster manager that can also run Hadoop MapReduce and other applications.
• YARN: The resource manager of Hadoop.
91. Elaborate on the worker node.
Answer: The Worker node refers to any node that can run the application code in a cluster.
92. Do you know anything about the PageRank?
Answer: PageRank measures the importance of each vertex in a graph and is one of the notable algorithms available in GraphX, the graph component of Spark.
93. Does the user need to install Spark on all nodes of Yarn cluster while running Spark on Yarn?
Answer: No. Spark runs on top of YARN, so it does not need to be installed on every node of the YARN cluster.
94. Tell us about some of the demerits of using Spark.
Answer: Spark uses more storage space, memory in particular, than Hadoop MapReduce, and this can cause problems when resources are limited. Developers need to be careful while running their applications in Spark, and must make sure that the work is evenly distributed across the nodes of the cluster.
95. How can a user create RDD?
Answer: Spark provides two methods to create RDD:
• By parallelizing a collection in your Driver program.
This makes use of SparkContext’s ‘parallelize’ method:
val IntellipaatData = Array(2,4,6,8,10)
val distIntellipaatData = sc.parallelize(IntellipaatData)
• By loading an external dataset from external storage like HDFS, shared file system.
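To make the first method more concrete, here is a minimal plain-Python sketch of what parallelizing a collection means conceptually: a local collection is cut into partitions that workers could process independently. It does not use Spark. The helper name slice_collection and the even-slicing rule are assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_collection(data: Sequence[T], num_slices: int) -> List[List[T]]:
    """Split a local collection into num_slices contiguous partitions.

    Hypothetical helper: a conceptual stand-in for parallelizing a
    collection, not a Spark function.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(data)
    partitions = []
    for i in range(num_slices):
        start = (i * n) // num_slices
        end = ((i + 1) * n) // num_slices
        partitions.append(list(data[start:end]))
    return partitions


partitions = slice_collection([2, 4, 6, 8, 10], 2)
# Each partition could now be handed to a different worker.
partition_sums = [sum(p) for p in partitions]
```

In a real Spark program the driver only describes the collection; the cluster decides where each partition is processed.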
96. Tell us about the key features of Apache Spark.
Answer: Apache Spark has the following key features:
• Hadoop Integration: Apache Spark provides compatibility with Hadoop. Spark has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.
• Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It is able to achieve this speed through controlled partitioning and deftly manages data using partitions that help to parallelize distributed data processing with minimal network traffic.
• Real-Time Computation: Spark’s computation is real-time and has less latency because of its in-memory computation. It is designed for massive scalability and supports several computational models.
• Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL.
• Machine Learning: Spark’s MLlib is the machine learning component which is used for big data processing. It eliminates the need to use multiple tools for processing and for machine learning.
• Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R and hence, Spark code can be written in any of these four languages. It provides a shell in Scala and Python.
• Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary, which is one of the factors contributing to its speed. Spark adds transformations to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
97. What are the languages supported by Apache Spark? Which is the most popular language?
Answer: Apache Spark supports the following four languages: Scala, Java, Python, and R.
Scala and Python have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most used language among them because Spark itself is written in Scala.
98. Explain the concept of Resilient Distributed Dataset (RDD).
Answer: RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. There are two types of RDD:
• Parallelized Collections: RDDs created from an existing collection in the driver program, whose elements are then processed in parallel.
• Hadoop Datasets: RDDs created from files in HDFS or other storage systems, where functions are applied to each file record.
RDDs hold partitions of data that are kept in memory and distributed across many nodes. They are lazily evaluated, which helps Spark operate at a faster speed.
99. How do we create RDDs in Spark?
Answer: Spark provides two methods to create RDD:
• By parallelizing a collection in your Driver program. This method makes use of SparkContext’s ‘parallelize’ method, as in the following piece of code:
val DataArray = Array(2,4,6,8,10)
val DataRDD = sc.parallelize(DataArray)
• By loading an external dataset from external storage like HDFS, HBase, shared file system.
100. What do you know about Executor Memory in any Spark application?
Answer: Every Spark application has a fixed heap size and a fixed number of cores for each Spark Executor. This heap size is what is referred to as the Spark executor memory, and it is controlled with the spark.executor.memory property or the --executor-memory flag. A Spark application typically has one Executor on each worker node. The executor memory is essentially a measure of how much memory of the worker node the application will utilize.
101. Define Partitions in Apache Spark.
Answer: A partition is a smaller, logical division of data: a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data in order to speed up processing.
Spark manages data using partitions that help to parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
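As a rough illustration of how keyed records can be spread over partitions, the plain-Python sketch below assigns each record to a partition by taking its integer key modulo the number of partitions. It is a simplified model, not Spark code. The function name hash_partition and the sample records are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str]


def hash_partition(records: Iterable[Record], num_partitions: int) -> Dict[int, List[Record]]:
    """Assign each (key, value) record to a partition by key modulo."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return dict(partitions)


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
by_partition = hash_partition(records, 2)
# Records that share a key always land in the same partition, so
# per-key work can proceed without moving data between partitions.
partition_sizes = {index: len(items) for index, items in by_partition.items()}
```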
102. What operations does the RDD support?
Answer: An RDD is a distributed collection of objects. Each RDD is divided into multiple partitions, and each of these partitions can reside in memory or be stored on the disk of different machines in a cluster. RDDs are immutable data structures, i.e., they can only be read. The user can’t change the original RDD, but can transform it into a different RDD with all the required changes.
RDDs support two types of operations: transformations and actions.
• Transformations: Transformations, such as map() and reduceByKey(), create a new RDD from an existing RDD. They are executed lazily, i.e., only when an action demands their result.
• Actions: Actions return final results of RDD computations. Actions trigger execution using lineage graph to load data into original RDD, execute all intermediate transformations and return final results to Driver program or write result to the file system.
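The split between transformations and actions can be modelled in a few lines of plain Python. The sketch below is a toy stand-in, not the Spark API: the class LazyDataset and its methods are hypothetical names. Transformations only record a step in the lineage, and an action replays the recorded steps to produce a result.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD: immutable, with lazily recorded steps."""

    def __init__(self, source: Iterable[Any], lineage: tuple = ()):
        self._source = list(source)
        self._lineage = lineage  # tuple of ("map" | "filter", function)

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    # Actions: replay the lineage and return a result to the caller.
    def collect(self) -> List[Any]:
        items = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return list(items)

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return reduce(fn, self.collect())


calls = []


def double(x: int) -> int:
    calls.append(x)  # side effect that shows when work really happens
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = evens_doubled.collect()
total = evens_doubled.reduce(lambda a, b: a + b)
```

Note that the second action recomputes the whole chain in this model, which is the behaviour that caching or persisting a dataset is meant to avoid.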
103. Do you know anything about Transformations in Spark?
Answer: Transformations are functions applied to an RDD that result in another RDD. A transformation does not execute until an action occurs, because all transformations are lazily evaluated.
Observe the given piece of code:
val rawData=sc.textFile("path to/movies.txt")
Here, rawData is an RDD created from a text file. Applying a transformation such as map() to it yields a new RDD, moviesData, and nothing is computed until an action is called.
104. What do you know about Actions in Spark?
Answer: An Action in Spark brings data from an RDD back to the local machine. Executing an Action evaluates all previously created transformations. Actions trigger execution using the lineage graph to load data into the original RDD, carry out all intermediate transformations, and return the final results to the Driver program or write them to the file system.
The reduce() method is an action that applies the passed function again and again until one value is left. The take(n) action brings the first n values from the RDD to the local node.
For example, calling the saveAsTextFile() action on the moviesData RDD saves it under a path called MoviesData.txt.
105. Mention the functions provided within the SparkCore.
Answer: Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
SparkCore performs various important functions like:
• Memory management
• Monitoring jobs
• Job scheduling
• Interaction with storage systems
Additional libraries built atop the core support diverse workloads such as streaming, SQL, and machine learning. The core itself remains responsible for:
• Memory management and fault recovery
• Scheduling, distributing and monitoring jobs on a cluster
• Interacting with storage systems
106. Elaborate on the concept of Pair RDD.
Answer: The Apache Spark defines PairRDD functions class as:
class PairRDDFunctions[K, V] extends Logging with HadoopMapReduceUtil with Serializable
A number of special operations can be performed on RDDs of key/value pairs, and such RDDs are called Pair RDDs. Pair RDDs allow users to operate on each key in parallel. They have a reduceByKey() method that aggregates data for each key, and a join() method that combines different RDDs based on the elements having the same key.
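The semantics of these two operations can be sketched on ordinary lists of (key, value) tuples. The plain-Python functions below, reduce_by_key and join, are illustrative stand-ins written for this example and run on a single machine. They mirror what the Pair RDD methods compute, not how Spark distributes the work.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, Any]


def reduce_by_key(pairs: Iterable[Pair], fn: Callable[[Any, Any], Any]) -> Dict[str, Any]:
    """Merge all values that share a key using fn."""
    merged: Dict[str, Any] = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


def join(left: Iterable[Pair], right: Iterable[Pair]) -> List[Tuple[str, Tuple[Any, Any]]]:
    """Inner join: pair up values from both sides that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


# Made-up sample data for the illustration.
sales = [("apples", 3), ("pears", 1), ("apples", 4), ("kiwis", 2)]
prices = [("apples", 0.5), ("pears", 0.8)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(sorted(totals.items()), prices)
```

The key "kiwis" has no price, so it is dropped by the inner join.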
107. What are the components of the Spark Ecosystem?
Answer: The following are the widely-known components of the Spark Ecosystem:
• Spark Core: This is the base engine for large-scale parallel and distributed data processing.
• MLlib: This part of the Spark Ecosystem is used to perform machine learning in Apache Spark.
• Spark SQL: This part integrates the relational processing with Spark’s functional programming API.
• GraphX: This part deals with graphs and graph-parallel computation.
• Spark Streaming: This is used for processing real-time streaming data.
108. How can Streaming be implemented in Spark?
Answer: The Spark Streaming component of the Spark Ecosystem is used for processing real-time streaming data. It enables high-throughput, fault-tolerant processing of live data streams.
The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data. Data arrives from sources such as Flume and HDFS, is processed as a stream, and is pushed out to file systems, live dashboards, and databases. The processing resembles batch processing in that the input stream is divided into small batches.
109. Is there a special API required for implementing graphs in Spark?
Answer: GraphX is the Spark API for graphs and graph-parallel computation. At a high level, it extends the Spark RDD abstraction with the Resilient Distributed Property Graph: a directed multigraph with user-defined properties attached to each vertex and edge. Because it is a multigraph, there can be multiple parallel edges, which allows multiple relationships between the same pair of vertices.
To support graph computation, GraphX exposes a set of fundamental operators and also includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
110. What do you know about the PageRank in GraphX?
Answer: PageRank measures the importance of each vertex in a graph, assuming that an edge from u to v represents an endorsement of v’s importance by u. GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows these algorithms to be called directly as methods on the Graph.
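The difference between the static and dynamic variants is easy to see in a small single-machine sketch. The plain-Python code below uses a common textbook update rule with a reset probability of 0.15. It is an illustration only: the function names, the tiny graph, and the update rule are assumptions, and GraphX's own implementation may differ in details such as normalization.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def _step(ranks: Dict[str, float], edges: List[Edge], reset_prob: float) -> Dict[str, float]:
    """One PageRank update: each vertex shares its rank along its out-edges."""
    out_degree: Dict[str, int] = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    incoming = {v: 0.0 for v in ranks}
    for src, dst in edges:
        incoming[dst] += ranks[src] / out_degree[src]
    return {v: reset_prob + (1.0 - reset_prob) * incoming[v] for v in ranks}


def static_page_rank(vertices, edges, num_iter, reset_prob=0.15):
    """Run a fixed number of iterations."""
    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iter):
        ranks = _step(ranks, edges, reset_prob)
    return ranks


def dynamic_page_rank(vertices, edges, tol, reset_prob=0.15, max_iter=1000):
    """Iterate until no rank changes by more than tol."""
    ranks = {v: 1.0 for v in vertices}
    iterations = 0
    while iterations < max_iter:
        new_ranks = _step(ranks, edges, reset_prob)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        iterations += 1
        if delta <= tol:
            break
    return ranks, iterations


vertices = ["a", "b", "c"]
edges = [("a", "c"), ("b", "c"), ("c", "a")]

one_step = static_page_rank(vertices, edges, num_iter=1)
converged, iterations = dynamic_page_rank(vertices, edges, tol=1e-6)
```

Vertex c receives edges from both a and b, so it ends up with the highest rank, while b, which nobody points to, keeps only the reset probability.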
111. How can machine learning be implemented in Spark?
Answer: MLlib, a component of the Spark Ecosystem, is the scalable machine learning library provided by Spark. It makes machine learning easy with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
112. How does Spark implement SQL?
Answer: Spark SQL is the Spark module that integrates relational processing with Spark’s functional programming API. It supports querying data via SQL or the Hive Query Language (HQL). It provides support for various data sources and makes it possible to weave SQL queries together with code transformations, which results in a very powerful tool.
The following are the four libraries of Spark SQL.
• Data Source API
• DataFrame API
• Interpreter & Optimizer
• SQL Service
113. What do you know about the Parquet file in Spark?
Answer: Parquet is a columnar file format supported by many data processing systems. Spark SQL can both read and write Parquet files.
The advantages of columnar storage are as follows:
• Columnar storage limits IO operations.
• It gives better-summarized data and follows type-specific encoding.
• Columnar storage consumes less space.
• It can fetch specific columns that you need to access.
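The advantages above can be illustrated with ordinary Python data structures. The sketch below contrasts a row layout with a column layout and applies run-length encoding to a repetitive column. It is a conceptual model only: the sample rows and helper names are invented, and the code says nothing about how Parquet files are laid out on disk.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple

# Made-up sample rows for the illustration.
rows = [
    {"title": "Alien", "country": "US", "year": 1979},
    {"title": "Heat", "country": "US", "year": 1995},
    {"title": "Fargo", "country": "US", "year": 1996},
    {"title": "Snatch", "country": "UK", "year": 2000},
]


def to_columnar(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of rows into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columnar(rows)

# A row layout has to scan every field of every row to reach the years.
values_read_row_layout = sum(len(row) for row in rows)
# A column layout only touches the one column the query needs.
values_read_column_layout = len(columns["year"])

average_year = sum(columns["year"]) / len(columns["year"])
encoded_country = run_length_encode(columns["country"])
```

Because all values of a column share one type and often repeat, a column is a natural unit for compact, type-specific encodings such as the run-length example here.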
114. How can Apache Spark be used along with Hadoop?
Answer: Apache Spark is compatible with Hadoop, and the two work well together: Spark’s processing engine can take advantage of Hadoop’s HDFS and YARN. Hadoop components can be used along with Spark in the following ways:
• HDFS: Spark can run on top of the HDFS to leverage the distributed replicated storage.
• Batch & Real-Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
• YARN: Spark applications can also be run on YARN (Hadoop NextGen).
• MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
115. What do you know about the RDD Lineage?
Answer: Spark does not replicate data in memory by default, so if data is lost it has to be rebuilt using RDD lineage. RDD lineage is the record of the transformations that produced an RDD, and Spark uses it to reconstruct lost data partitions.
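A minimal plain-Python model shows why lineage is enough to recover a lost partition: as long as the source data and the list of recorded steps survive, any partition can be rebuilt on demand. The class LineageDataset and its methods are hypothetical names for this sketch and are not part of Spark.

```python
from typing import Any, Callable, Dict, List, Sequence


class LineageDataset:
    """Toy dataset that remembers how each partition was derived."""

    def __init__(self, source_partitions: Sequence[List[Any]], steps: tuple = ()):
        self._source = source_partitions  # stands in for durable input data
        self._steps = steps               # the lineage: functions applied in order
        self._cache: Dict[int, List[Any]] = {}
        self.computations = 0

    def map(self, fn: Callable[[Any], Any]) -> "LineageDataset":
        return LineageDataset(self._source, self._steps + (fn,))

    def partition(self, index: int) -> List[Any]:
        """Return a partition, rebuilding it from lineage if it is missing."""
        if index not in self._cache:
            items = list(self._source[index])
            for fn in self._steps:
                items = [fn(x) for x in items]
            self._cache[index] = items
            self.computations += 1
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        """Simulate the failure of the node that held this partition."""
        self._cache.pop(index, None)


dataset = LineageDataset([[1, 2], [3, 4]]).map(lambda x: x + 1).map(lambda x: x * 10)
before = [dataset.partition(0), dataset.partition(1)]
dataset.lose_partition(1)
rebuilt = dataset.partition(1)  # recomputed from the source and the two steps
```

Only the lost partition is recomputed; partition 0 is still served from the cache.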
116. What do you know about Spark Driver?
Answer: The Spark Driver is the program that runs on the master node and declares transformations and actions on RDDs of data. The driver creates the SparkContext, which connects to a given Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
117. What do you know about file systems supported by Spark?
Answer: The following file systems are supported by Spark:
• Hadoop Distributed File System (HDFS).
• Local File system.
• Amazon S3
118. Mention the functionalities of Spark SQL.
Answer: The Spark SQL is capable of:
• Loading data from a variety of structured sources.
• Querying data using SQL statements inside a Spark program and from external tools that connect to Spark SQL through standard database connectors such as JDBC and ODBC.
• Providing complete integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and exposing custom functions in SQL.
119. Name the different types of Cluster Managers in Spark.
Answer: The Spark framework supports three types of Cluster Managers:
• Standalone
• Apache Mesos
• YARN
120. How much do you know about the worker node?
Answer: Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes. The worker node is the slave node. The Master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on the resource availability, the Master schedules tasks.
121. What are some of the demerits of Apache Spark?
Answer: The following are some of the demerits of using the Apache Spark:
• Developers need to be careful while running their applications in Spark because it consumes a large amount of memory compared to Hadoop.
• Spark’s in-memory capability can become a bottleneck when it comes to cost-efficient processing of big data.
• Instead of running everything on a single node, the work must be distributed across multiple nodes of the cluster.
122. Can Spark outperform Hadoop in processing at any time?
Answer: The following cases can be considered to understand how Spark outperforms Hadoop:
• Real-Time Processing: Spark is preferred over Hadoop for real-time querying of data for areas such as Stock Market Analysis, Banking, Healthcare, Telecommunications, etc.
• Stream Processing: For processing of logs and detection of frauds in live streams for alerts, Apache Spark is generally used.
• Sensor Data Processing: Apache Spark’s In-memory computing is used to retrieve data and combine all the data retrieved from different sources.
• Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-sized datasets.
123. Can the user use Spark to access and analyze data stored in any Cassandra databases?
Answer: Yes, this is possible with the Spark Cassandra Connector, which connects Spark to a Cassandra cluster and must be added to the Spark project. In such a setup, a Spark executor talks to a local Cassandra node and queries only for local data. This makes queries faster by reducing the network traffic between Spark executors (which process the data) and Cassandra nodes (where the data lives).
124. Is it possible for the user to run Apache Spark on Apache Mesos?
Answer: Yes, Apache Spark can run on hardware clusters managed by Apache Mesos. When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Mesos determines which machine handles which task, and because it takes other frameworks into account when scheduling these many short-lived tasks, multiple frameworks can coexist on the same cluster without resorting to a static partitioning of resources.
125. How can Spark be connected to Apache Mesos?
Answer: A user can connect Spark to Apache Mesos. The following steps need to be followed to do so:
• Configure the Spark driver program to connect to Apache Mesos.
• The Spark binary package should be in the location accessible by Apache Mesos.
• Finally, the user needs to install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
126. How can the user minimize data transfers while working with Spark?
Answer: The user needs to minimize data transfers and avoid shuffling to write Spark programs that run in a fast and reliable manner. There are a number of ways to minimize data transfers in Apache Spark:
• Using Broadcast Variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.
• Using Accumulators – Accumulators help to update the values of variables in parallel while executing.
The most common way to minimize data transfer is to avoid ByKey operations, repartition, and any other operations that trigger shuffles.
127. What do you know about broadcast variables?
Answer: Broadcast variables have been designed to enable the programmer to keep a read-only variable cached on each machine instead of shipping a copy of it with the concerned tasks. These variables can be used to give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using some efficient broadcast algorithms to reduce the communication cost.
128. Explain the accumulators present in Apache Spark.
Answer: Accumulators are variables that are only “added” to through an associative and commutative operation, and they are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators.
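The requirement that the operation be associative and commutative matters because partial results from different tasks can arrive in any order. The plain-Python sketch below models this with a hypothetical SumAccumulator class: each task counts blank lines in its own partition, and the partial counts are merged in a shuffled order. It is a model of the idea, not Spark's accumulator API.

```python
import random
from typing import List


class SumAccumulator:
    """Toy accumulator: tasks may only add to it, never read it."""

    def __init__(self, value: int = 0):
        self.value = value

    def add(self, amount: int) -> None:
        self.value += amount

    def merge(self, other: "SumAccumulator") -> None:
        self.value += other.value


def count_blank_lines(partition: List[str]) -> SumAccumulator:
    """One task: count the blank lines of a single partition."""
    local = SumAccumulator()
    for line in partition:
        if not line.strip():
            local.add(1)
    return local


# Made-up sample partitions for the illustration.
partitions = [["spark", "", "rdd"], ["", "  "], ["dag"]]
task_results = [count_blank_lines(p) for p in partitions]

# Merge the partial results in an arbitrary order, as a driver might
# receive them; addition makes the total independent of that order.
random.Random(0).shuffle(task_results)
blank_lines = SumAccumulator()
for partial in task_results:
    blank_lines.merge(partial)
```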
129. Why does a user need broadcast variables when working with Apache Spark?
Answer: Broadcast variables are read-only variables held in an in-memory cache on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables also help to store a lookup table in memory, which improves retrieval efficiency compared to an RDD lookup().
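The saving can be made visible with a small plain-Python model in which a lookup table is copied once per machine instead of once per task. Everything in the sketch (the Worker class, the machine names, and the sample data) is invented for illustration and does not use Spark's broadcast API.

```python
from typing import Dict, List, Tuple

# Small lookup table that every task needs (made-up sample data).
country_names = {"US": "United States", "IN": "India", "DE": "Germany"}

# Four partitions of (user, country code) records spread over two machines.
partitions: List[List[Tuple[str, str]]] = [
    [("alice", "US"), ("bob", "IN")],
    [("carol", "DE")],
    [("dave", "US")],
    [("erin", "IN")],
]
machine_to_partitions = {"machine-0": [0, 1], "machine-1": [2, 3]}


class Worker:
    """Toy worker that caches one read-only copy of the lookup table."""

    def __init__(self, lookup: Dict[str, str]):
        self.lookup = dict(lookup)  # the single copy shipped to this machine

    def run_task(self, partition: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
        return [(user, self.lookup[code]) for user, code in partition]


workers = {name: Worker(country_names) for name in machine_to_partitions}

results: List[Tuple[str, str]] = []
for name, indices in machine_to_partitions.items():
    for index in indices:
        results.extend(workers[name].run_task(partitions[index]))

copies_with_broadcast = len(workers)        # one copy per machine
copies_without_broadcast = len(partitions)  # one copy per task
```

The gap between the two counts grows with the number of tasks per machine, which is why broadcasting pays off for large lookup tables.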
130. How can the user trigger automatic clean-ups in Spark to handle accumulated metadata?
Answer: The user can trigger clean-ups by setting the parameter ‘spark.cleaner.ttl’, or by dividing long-running jobs into batches and writing the intermediary results to disk.
131. What is the significance of the Sliding Window operation in Apache Spark?
Answer: In networking, a sliding window controls the transmission of data packets between computers. In Spark Streaming, the term refers to windowed computations, in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.
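A windowed computation can be sketched in plain Python by treating a stream as a list of batches. In the hypothetical helper below, window_length and slide_interval are both measured in batches, and only complete windows are emitted; these are simplifying assumptions for the illustration, not a description of Spark Streaming's exact behaviour.

```python
from typing import Any, List


def windowed(batches: List[List[Any]], window_length: int, slide_interval: int) -> List[List[Any]]:
    """Combine the batches that fall inside each position of a sliding window."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        in_window = batches[end - window_length:end]
        windows.append([item for batch in in_window for item in batch])
    return windows


# Five batches of made-up readings, one batch per interval.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]

windows = windowed(batches, window_length=3, slide_interval=2)
window_sums = [sum(window) for window in windows]
```

With a window of three batches that slides by two, the third batch belongs to both windows, so its values are counted in both sums.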
132. What do you know about the DStream in Apache Spark?
Answer: DStream is short for Discretized Stream, the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, which is either received from a data source or generated by transforming an input stream. Internally, a DStream is represented by a continuous series of RDDs, and each RDD contains data from a particular interval. Any operation applied to a DStream translates to operations on the underlying RDDs. DStreams can be created from a number of different sources like Apache Kafka, HDFS, and Apache Flume.
All DStreams basically have two operations:
• Transformations that produce a new DStream.
• Output operations that write data to an external system.
There are a number of DStream transformations possible in Spark Streaming. For example, the filter() method returns a new DStream by selecting only the records of the source DStream on which the function returns true.
133. Can you explain Caching in Spark Streaming?
Answer: DStreams allow developers to cache or persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. The persist() method is used to achieve this on a DStream. For input streams that receive data over the network, the default persistence level replicates the data to two nodes for fault tolerance.
134. When running Spark applications, is it necessary for the user to install Spark on all the nodes of YARN cluster?
Answer: No, Spark does not need to be installed on every node when running a job under YARN or Apache Mesos, because Spark executes on top of YARN or Mesos clusters without requiring any change to the cluster.
135. What are the various data sources available within the Spark SQL?
Answer: The Parquet file, JSON datasets, and Hive tables are the data sources available within the Spark SQL.
136. What are the different levels of persistence present within the Apache Spark?
Answer: Apache Spark automatically persists the intermediary data from various shuffle operations. However, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
• DISK_ONLY: It stores the RDD partitions only on disk.
• MEMORY_ONLY: This stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit inside the memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
• OFF_HEAP: This is similar to the MEMORY_ONLY_SER but stores the data in an off-heap memory.
• MEMORY_ONLY_SER: It stores the RDD as serialized Java objects, i.e., one byte array per partition.
• MEMORY_AND_DISK_SER: This is similar to MEMORY_ONLY_SER, but it spills partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
• MEMORY_AND_DISK: This stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, it stores the partitions that don’t fit on disk and reads them from there when they’re needed.
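The practical difference between these levels is what happens to a partition that does not fit in memory. The plain-Python sketch below models three of the levels with a hypothetical PartitionStore class that has room for a fixed number of partitions in memory. It ignores serialization, replication, and eviction, so it is a simplified model and not Spark's storage layer.

```python
from typing import Callable, Dict, List


class PartitionStore:
    """Toy cache for the partitions of one dataset under a storage level."""

    def __init__(self, level: str, memory_slots: int, compute: Callable[[int], List[int]]):
        self.level = level
        self.memory_slots = memory_slots  # how many partitions fit in memory
        self.compute = compute            # rebuilds a partition from lineage
        self.memory: Dict[int, List[int]] = {}
        self.disk: Dict[int, List[int]] = {}
        self.computations = 0

    def get(self, index: int) -> List[int]:
        if index in self.memory:
            return self.memory[index]
        if index in self.disk:
            return self.disk[index]
        data = self.compute(index)
        self.computations += 1
        uses_memory = self.level in ("MEMORY_ONLY", "MEMORY_AND_DISK")
        uses_disk = self.level in ("MEMORY_AND_DISK", "DISK_ONLY")
        if uses_memory and len(self.memory) < self.memory_slots:
            self.memory[index] = data
        elif uses_disk:
            self.disk[index] = data
        # MEMORY_ONLY with no room left: nothing is stored, so the
        # partition is recomputed the next time it is requested.
        return data


def build_partition(index: int) -> List[int]:
    return [index * 10 + offset for offset in range(3)]


def read_twice(level: str) -> PartitionStore:
    """Read two partitions twice with room for only one in memory."""
    store = PartitionStore(level, memory_slots=1, compute=build_partition)
    for _ in range(2):
        for index in range(2):
            store.get(index)
    return store


memory_only = read_twice("MEMORY_ONLY")
memory_and_disk = read_twice("MEMORY_AND_DISK")
disk_only = read_twice("DISK_ONLY")
```

Under MEMORY_ONLY the partition that did not fit is computed again on the second pass, while the two disk-backed levels compute each partition only once.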
137. Does Apache Spark need to be provided with checkpoints?
Answer: Checkpoints are useful for applications that have to run around the clock and must be resilient to failures unrelated to the application logic. Lineage graphs help in recovering RDDs after a failure, but this process is time-consuming if the RDDs have long lineage chains.
Spark has an API for checkpointing, i.e., a REPLICATE flag to persist. Checkpoints are particularly useful when the lineage graphs are long and have wide dependencies.
138. How does Spark make use of Akka?
Answer: Spark versions before 2.0 use Akka, mainly for scheduling. All workers request a task from the Master after registering, and the Master simply assigns the task. Akka also handles the messaging between the workers and the Master.
139. What do you understand by SchemaRDD in Apache Spark RDD?
Answer: The SchemaRDD is an RDD that consists of row objects, i.e., wrappers around the basic string or integer arrays, with schema information about the type of data in each column.
The SchemaRDD also provides direct relational query interface functions that are realized through the SparkSQL.
It was officially renamed to the DataFrame API in Spark 1.3.
140. How is Spark SQL different from HQL and SQL?
Answer: Spark SQL is a special component on top of the Spark Core engine that supports both SQL and the Hive Query Language (HQL) without changing any syntax. It is possible to join a SQL table and an HQL table in Spark SQL.
141. Explain a scenario where the user can make use of Spark Streaming.
Answer: The user can stream data in real-time onto the Spark program with the help of Spark Streaming.
The Twitter Sentiment Analysis is a real-life application of Spark Streaming. Trending Topics can be used to create campaigns and attract a wider audience. It also helps in:
• Crisis management
• Service adjusting
• Target marketing
142. Tell us something about the Spark Data Frame.
Answer: A Spark Data Frame is a Data set of Rows organized into named columns, which closely resembles a table in a relational database.
143. How does the user create the Spark Data Frame?
Answer: The Spark Data Frame can be created by the user by reading the data from multiple sources such as files, HDFS, and Hive.
144. What do you know about the Spark Data Set?
Answer: A Spark Data Set is a strongly typed collection of data on which the user can perform RDD-style operations such as Transformations and Actions. It can be thought of as a typed RDD.
145. How does the user execute the Apache Spark Job?
Answer: The user can run a Spark job with spark-submit inside a Unix shell script and schedule that shell script with Unix crontab.
The same can also be achieved interactively with the Scala REPL, launched using the spark-shell command.
146. How does the user debug the Apache Spark Job in Production?
Answer: The user can debug any Apache Spark Job in Production using any one of the three options:
• Using Spark UI / History URL
• Using Driver Machine Logs
• Using Executors Logs
One can also use Cluster URL with YARN to find the job which is run most recently by searching with the specific app name.
147. Which Language is generally used to develop Apache Spark Programs?
Answer: The following languages are generally used to develop Apache Spark programs: Scala, Java, Python, and R.
148. What do you know about the term ‘Immutable’?
Answer: The term ‘Immutable’ is used to indicate that once an entity has been created and assigned a value, no modification can be made to it.
RDDs in Spark are immutable by default, i.e., no updates or modifications are allowed; a transformation produces a new RDD instead.
149. What do you mean by Lazy evaluated?
Answer: Lazy evaluation means that when a program defines a series of operations, Spark does not have to evaluate them immediately. Transformations are only recorded, and an action acts as the trigger that causes them to be computed.
150. What is Cacheable?
Answer: Cacheable refers to keeping data in memory for computation instead of going to the disk. This lets Spark access the data up to 100 times faster than Hadoop.
151. What is the Spark engine responsible for?
Answer: The Spark engine is mainly responsible for:
• Scheduling of data
• Distribution of data
• Constant Monitoring of the application across the cluster
152. What are the common Spark Ecosystems?
Answer: The given are the common Spark Ecosystems:
• Spark SQL (formerly Shark) for SQL developers.
• Spark Streaming for streaming data.
• MLLib for machine learning algorithms.
• GraphX for Graph computation.
• SparkR to execute R on Spark engine.
• BlinkDB enables interactive queries over enormous data.
Of these, GraphX, SparkR, and BlinkDB are in the incubation stage.
153. What are Partitions?
Answer: A partition is a logical division of the data. The data is divided logically into small chunks so that it can be processed in parallel, which supports scalability and speeds up processing. Input data, intermediate data, and output data are all partitioned RDDs.
154. How does Spark partition the data?
Answer: Spark borrows the map-reduce API’s approach to partitioning data. The user can set the number of partitions explicitly in the input format.
By default, the partition size matches the HDFS block size for best performance, but the partition size can be changed, much like a split in MapReduce.
155. How does Spark store the data?
Answer: Spark is basically a processing engine without any storage engine. Data can be easily retrieved from any storage engine like HDFS, S3 or any other data resources.
156. Is it mandatory for the user to start Hadoop to run any Spark application?
Answer: No, it is not mandatory for the user to start Hadoop to run any Spark application.
Spark has no storage layer of its own, so when Hadoop is not running it can use the local file system to store the data. Data can be loaded from the local file system and processed there. Therefore, Hadoop or HDFS is not obligatory for running a Spark application.
157. Do you know anything about SparkContext?
Answer: To create RDDs, a program first creates a SparkContext object, which connects to the Spark cluster. The SparkContext tells Spark how to access the cluster. It is built from a SparkConf object, which holds the configuration of the application.
158. What are the various SparkCore functionalities that a user should know?
Answer: SparkCore is the base engine of the Apache Spark framework. The primary functionalities of Spark can be listed as below:
• Memory management
• Fault tolerance
• Monitoring jobs
• Interacting with storage systems
159. Differentiate between SparkSQL and other query languages such as HQL and SQL.
Answer: SparkSQL is a special component on top of the SparkCore engine that supports both SQL and the Hive Query Language (HQL) without any change to their syntax.
160. What do you know about the File System API?
Answer: The File System API is used for reading data from a number of different storage devices such as the HDFS, S3 or local FileSystem. Spark is designed in such a way that it uses the File System API for reading data from different storage engines.
Apache Spark is a fast and general-purpose cluster computing system, and it has become one of the most popular unified analytics frameworks in the analytics space. With the ever-increasing focus on data analytics, machine learning, and deep learning, Apache Spark is expected to be in huge demand in the coming years. The 160 frequently asked Apache Spark interview questions given above will help you to succeed in your next job interview.