Book Description:
Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
about the technology
Spark is a powerful general-purpose analytics engine that can handle massive amounts of data distributed across clusters of thousands of servers. Optimized to run in memory, this impressive framework can process data up to 100x faster than most Hadoop-based systems. Spark’s support for SQL, along with its ability to rapidly run repeated queries and quickly adapt to modified queries, makes it well suited for machine learning, so important in this age of big data. Whether you’re using Java, Scala, or Python, Spark offers straightforward APIs to access its core features.
about the book
Spark in Action, Second Edition is an entirely new book that teaches you everything you need to create end-to-end analytics pipelines in Spark. Rewritten from the ground up with lots of helpful graphics, it teaches you the roles of DAGs and data frames, the advantages of “lazy evaluation,” and how to ingest data from files, databases, and streams.
By working through carefully designed Java-based examples, you’ll delve into Spark SQL, interface with Python, and cache and checkpoint your data. Along the way, you’ll learn to interact with common enterprise data technologies like HDFS and file formats like Parquet, ORC, and Avro.
You’ll also discover interesting Spark use cases, like interactive reporting, machine learning pipelines, and even monitoring players in online games. You’ll even get a quick look at machine learning techniques you can apply without a Ph.D. in mathematics! All examples are available on GitHub for you to explore and adapt as you learn. The demand for Spark-savvy developers is so high that they’re among the best paid in the industry today!
what’s inside
- Lots of examples based on the Spark Java APIs, using real-life datasets and scenarios
- Examples based on Spark v2.3
- Ingestion through files, databases, and streaming
- Building custom ingestion processes
- Querying distributed datasets with Spark SQL
- Deploying Spark applications
- Caching and checkpointing your data
- Interfacing with data scientists using Python
- Applied machine learning
- Spark use cases including Lumeris, CERN, and IBM
about the reader
For beginning to intermediate developers and data engineers comfortable programming in Java. No experience with functional programming, Scala, Spark, Hadoop, or big data is required.
about the author
An experienced consultant and entrepreneur passionate about all things data, Jean Georges Perrin was the first IBM Champion in France, an honor he’s now held for ten consecutive years. Jean Georges has managed many teams of software and data engineers.