Apache Spark is a platform for processing big data through streaming

Spark on Cloudera | 1 Bit Entropy

Apache Spark is a platform for processing big data through streaming.  Streaming can be much faster than disk-based processing offered by traditional Hadoop installations.  Here’s what Cloudera has to say about Spark.

Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, and real-time insights.

Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.

Performance: Spark is 10-100x faster than MapReduce for the iterative algorithms that analysts and data scientists often use. The performance benefits materialize both in memory and on disk.

Language support: Spark supports Java, Scala, and Python.  It is not necessary to write “map” and “reduce” operators.

Integration: Spark is integrated with CDH, can read any data in HDFS, and is deployed through Cloudera Manager.

Features: API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, joining streaming data to historical data.

Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.

Fast Analytics and Stream Processing

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data. Cloudera offers commercial support for Spark with Cloudera Enterprise.

Fast, Powerful Data Processing

For analysts and data scientists who rely on iterative algorithms (e.g. clustering/classification), Spark is 10-100x faster than MapReduce, delivering faster time to insight on more data and resulting in better business decisions and user outcomes.

Spark is:

  • Fast: Data processing up to 100x faster than MapReduce, both in-memory and on disk
  • Powerful: Write sophisticated parallel applications quickly in Java, Scala, or Python without having to think in terms of only “map” and “reduce” operators
  • Integrated: Spark is deeply integrated with CDH, able to read any data in HDFS, and deployed through Cloudera Manager

Easy, Real-Time Stream Processing

Spark Streaming extends Spark with an API for working with streams, providing exactly-once semantics and full fault tolerance for mission-critical environments. With common code across your batch and streaming applications, you can build sophisticated unified analytic applications quickly and easily.

Spark Streaming is:

  • Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming lets you rapidly develop streaming applications
  • Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark Streaming recovers lost work and delivers exactly-once semantics out of the box with no extra code or configuration
  • Integrated: Reuse the same code for batch and stream processing, even joining streaming data to historical data
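
The "same code for batch and stream processing" point can be illustrated with a small sketch. This is plain Python, not Spark's API: it assumes a micro-batch model in which a stream is handled as a series of small batches, and every name in it (`HISTORICAL_CATEGORIES`, `sales_by_category`, `micro_batches`, the sample events) is hypothetical. One function joins incoming events to historical data, and that same function is used for the full batch and for each micro-batch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical historical data (item -> category), standing in for data already stored in HDFS.
HISTORICAL_CATEGORIES: Dict[str, str] = {
    "shoes": "apparel",
    "socks": "apparel",
    "phone": "electronics",
}


def sales_by_category(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Shared logic: join (item, quantity) events to historical categories and total them."""
    totals: Counter = Counter()
    for item, quantity in events:
        category = HISTORICAL_CATEGORIES.get(item, "unknown")
        totals[category] += quantity
    return dict(totals)


def micro_batches(events: List[Tuple[str, int]], size: int) -> Iterable[List[Tuple[str, int]]]:
    """Split an event sequence into fixed-size micro-batches, in arrival order."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


# Hypothetical sample events: (item, quantity sold).
events = [("shoes", 2), ("phone", 1), ("socks", 3), ("tablet", 1)]

# Batch mode: process the whole data set at once.
batch_result = sales_by_category(events)

# Streaming mode: apply the same function to each micro-batch and merge running totals.
running: Counter = Counter()
for batch in micro_batches(events, size=2):
    running.update(sales_by_category(batch))
stream_result = dict(running)

print(batch_result)
print(stream_result)
```

Because the aggregation lives in one function, the batch and streaming paths cannot drift apart. In a real deployment the loop over micro-batches would be handled by the streaming framework rather than written by hand.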

Unified Analytics with Cloudera’s Enterprise Data Hub

Organizations need to use more data and more types of data to increase their competitive edge and reduce costs. Their use of data typically spans multiple use cases: reporting on what has happened, deep interactive analysis and data mining to discover why things are happening, and increasingly sophisticated applications to deliver real-time insights to decision makers.

Web Security

  • Faster Decisions (Interactive): Why is my website slow?
  • Better Decisions (Batch): What are the common causes of performance issues?
  • Real-Time Action (Streaming and Applications): How can I detect and block malicious attacks in real time?

Retail

  • Faster Decisions (Interactive): What are our top selling items across channels?
  • Better Decisions (Batch): What products and services do customers buy together?
  • Real-Time Action (Streaming and Applications): How can I deliver relevant promotions to buyers at the point of sale?

Financial Services

  • Faster Decisions (Interactive): Who opened multiple accounts in the past 6 months?
  • Better Decisions (Batch): What are the leading indicators of fraudulent activity?
  • Real-Time Action (Streaming and Applications): How can I protect my customers from identity theft in real time?


With Cloudera’s enterprise data hub including Spark, you can implement powerful end-to-end analytic workflows, comprising batch data processing, interactive query, deep data mining, and real-time applications all from a single common platform. No need to maintain separate systems – with separate data, metadata, security, management – that quickly lead to complexity and cost.

Cloudera Enterprise Data Hub

Spark enables faster batch processing, analytics, and stream processing on Hadoop.
