What is Apache Spark? The big data analytics platform explained

From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, Facebook, IBM, and Microsoft.

Out of the box, Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in your cluster. However, it’s more likely you’ll want to take advantage of a resource or cluster management system to take care of allocating workers on demand for you. In the enterprise, this will normally mean running on Hadoop YARN (this is how the Cloudera and Hortonworks distributions run Spark jobs), but Apache Spark can also run on Apache Mesos, while work is progressing on adding native support for Kubernetes.
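
The examples later in this article assume an active SparkSession. As a minimal sketch (the application name and local master are illustrative placeholders; on a real cluster the master is normally supplied by spark-submit or the cluster manager rather than hard-coded), one can be created like this:

import org.apache.spark.sql.SparkSession

// Local session for experimentation; swap the master for "yarn" or a Mesos URL
// when running under a cluster manager.
val sparkSession = SparkSession.builder()
  .appName("SparkIntro")
  .master("local[*]")
  .getOrCreate()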

If you’re after a managed solution, then Apache Spark can be found as part of Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight. Databricks, the company that employs the founders of Apache Spark, also offers the Databricks Unified Analytics Platform, which is a comprehensive managed service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and optimized cloud I/O performance over a standard Apache Spark distribution.

Spark vs. Hadoop

It is worth pointing out that comparing Apache Spark with Apache Hadoop is a bit of a mismatch. These days, Spark is included in most Hadoop distributions. But thanks to two big advantages, Spark has become the framework of choice for processing big data, overtaking the old MapReduce paradigm that brought Hadoop to prominence.

The first advantage is speed. Spark's in-memory data engine means that, in certain situations, it can run tasks up to one hundred times faster than MapReduce, particularly when compared with multi-stage jobs that have to write state back out to disk between stages. Even when an Apache Spark job's data does not fit entirely in memory, it still tends to be around 10 times faster than MapReduce.

The second advantage is the developer-friendly Spark API. As important as Spark's speedup is, one could argue that the friendliness of the Spark API matters even more.

Spark Core

In comparison to MapReduce and other Apache Hadoop components, the Apache Spark API is very friendly to developers, hiding much of the complexity of a distributed processing engine behind simple method calls. The canonical example of this is how almost 50 lines of MapReduce code to count words in a document can be reduced to just a few lines of Apache Spark (here shown in Scala):

val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
val counts = textFile.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")

By providing bindings to popular languages for data analysis like Python and R, as well as the more enterprise-friendly Java and Scala, Apache Spark allows everybody from application developers to data scientists to harness its scalability and speed in an accessible manner.

Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing.

RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
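
For instance, a handful of RDD transformations and actions can be chained together and will execute in parallel across the cluster (a minimal sketch; the HDFS path is a placeholder and sparkSession is assumed to be an active SparkSession):

val lines = sparkSession.sparkContext.textFile("hdfs:///tmp/words")      // build an RDD from a text file
val longLines = lines.filter(_.length > 80)                              // keep only long lines
val sampled = longLines.sample(withReplacement = false, fraction = 0.1)  // take a 10% sample
val totalChars = sampled.map(_.length).reduce(_ + _)                     // aggregate across partitions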

Spark runs in a distributed fashion by combining a driver core process that splits a Spark application into tasks and distributes them among many executor processes that do the work. These executors can be scaled up and down as required for the application’s needs.

Spark SQL

Originally known as Shark, Spark SQL has become more and more important to the Apache Spark project. It is likely the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.

Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem.
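
Reading and writing those formats follows one uniform pattern. The sketch below loads a hypothetical JSON file into the citiesDF dataframe used in the following examples and writes it back out as Parquet (the paths are placeholders, and spark is assumed to be an active SparkSession):

val citiesDF = spark.read.json("hdfs:///tmp/cities.json")                // infer the schema from JSON
citiesDF.write.mode("overwrite").parquet("hdfs:///tmp/cities_parquet")   // write columnar Parquet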

Selecting some columns from a dataframe is as simple as this line:

citiesDF.select("name", "pop")

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView("cities")
spark.sql("SELECT name, pop FROM cities")

Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. In the Apache Spark 2.x era, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) is the recommended approach for development. The RDD interface is still available, but is recommended only if you have needs that cannot be encapsulated within the Spark SQL paradigm.
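
A dataset layers compile-time types on top of the same engine. As a minimal sketch (the City case class and its fields are hypothetical, matching the citiesDF example above):

case class City(name: String, pop: Long)

import spark.implicits._
val citiesDS = citiesDF.as[City]          // typed view of the dataframe, checked at compile time
citiesDS.filter(_.pop > 1000000).show()   // plain Scala predicate over typed rows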

Spark MLlib

Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.
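
As an illustration, a pipeline that indexes a label column, assembles feature columns into a vector, and trains a random forest might look like the following sketch (the trainingDF dataframe, its column names, and the output path are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")
val forest = new RandomForestClassifier()
  .setLabelCol("labelIdx")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, forest))
val model = pipeline.fit(trainingDF)                   // train on a structured dataset
model.write.overwrite().save("hdfs:///tmp/rf_model")   // persist for reuse in another pipeline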

Note that while Spark MLlib covers basic machine learning including classification, regression, clustering, and filtering, it does not include facilities for modeling and training deep neural networks (for details see InfoWorld’s Spark MLlib review). However, Deep Learning Pipelines are in the works.

Spark GraphX

Spark GraphX comes with a selection of distributed algorithms for processing graph structures including an implementation of Google’s PageRank. These algorithms use Spark Core’s RDD approach to modeling data; the GraphFrames package allows you to do graph operations on dataframes, including taking advantage of the Catalyst optimizer for graph queries.
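
A small PageRank run over a toy graph built from two RDDs gives a feel for the API (a minimal sketch; the vertices and edges are made up, and spark is assumed to be an active SparkSession):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = spark.sparkContext.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

val ranks = graph.pageRank(0.0001).vertices   // iterate until ranks converge within the tolerance
ranks.collect().foreach(println)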

Spark Streaming

Spark Streaming was an early addition to Apache Spark that helped it gain traction in environments that required real-time or near real-time processing. Previously, batch and stream processing in the world of Apache Hadoop were separate things. You would write MapReduce code for your batch processing needs and use something like Apache Storm for your real-time streaming requirements. This obviously leads to disparate codebases that need to be kept in sync for the application domain despite being based on completely different frameworks, requiring different resources, and involving different operational concerns for running them.

Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches, which could then be manipulated using the Apache Spark API. In this way, code in batch and streaming operations can share (mostly) the same code, running on the same framework, thus reducing both developer and operator overhead. Everybody wins.
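
For example, a DStream word count reads almost exactly like the batch version shown earlier (a minimal sketch, assuming a test socket source on localhost:9999 and the sparkSession created above):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(1))   // 1-second microbatches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()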

A criticism of the Spark Streaming approach is that microbatching, in scenarios where a low-latency response to incoming data is required, may not be able to match the performance of other streaming-capable frameworks like Apache Storm, Apache Flink, and Apache Apex, all of which use a pure streaming method rather than microbatches.

Structured Streaming

Structured Streaming (added in Spark 2.x) is to Spark Streaming what Spark SQL was to the Spark Core APIs: a higher-level API and easier abstraction for writing applications. In the case of Structured Streaming, the higher-level API essentially allows developers to create infinite streaming dataframes and datasets. It also solves some very real pain points that users have struggled with in the earlier framework, especially concerning dealing with event-time aggregations and late delivery of messages. All queries on structured streams go through the Catalyst query optimizer, and can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data.
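
Here is the word count again, this time as an unbounded dataframe (a minimal sketch; the socket source and console sink are stand-ins used purely for illustration, and spark is assumed to be an active SparkSession):

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")   // emit the full updated table after every microbatch
  .format("console")
  .start()
query.awaitTermination()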

Structured Streaming is still a rather new part of Apache Spark, having been marked as production-ready in the Spark 2.2 release. However, Structured Streaming is the future of streaming applications with the platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

What’s next for Apache Spark?

While Structured Streaming provides high-level improvements to Spark Streaming, it currently relies on the same microbatching scheme of handling streaming data. However, the Apache Spark team is working to bring continuous streaming without microbatching to the platform, which should solve many of the problems with handling low-latency responses (they’re claiming ~1ms, which would be very impressive). Even better, because Structured Streaming is built on top of the Spark SQL engine, taking advantage of this new streaming technique will require no code changes.

In addition to improving streaming performance, Apache Spark will be adding support for deep learning via Deep Learning Pipelines. Using the existing pipeline structure of MLlib, you will be able to construct classifiers in just a few lines of code, as well as apply custom TensorFlow graphs or Keras models to incoming data. These graphs and models can even be registered as custom Spark SQL UDFs (user-defined functions) so that the deep learning models can be applied to data as part of SQL statements.

Neither of these features is anywhere near production-ready at the moment, but given the rapid pace of development we’ve seen in Apache Spark in the past, they should be ready for prime time in 2018.


via: oschina
