Fast Data Processing with Spark 2, 3rd Edition, by Krishna Sankar
Product details:
ISBN-10 : 1785882968
ISBN-13 : 9781785882968
Author: Krishna Sankar
This book is for developers with little to no knowledge of Spark, but with a background in Scala/Java programming. It is recommended that you have experience working with big data and a strong interest in data science.
Fast Data Processing with Spark 2, 3rd Edition – Table of contents:
1. Installing Spark and Setting Up Your Cluster
Directory organization and convention
Installing the prebuilt distribution
Building Spark from source
Downloading the source
Compiling the source with Maven
Compilation switches
Testing the installation
Spark topology
A single machine
Running Spark on EC2
Downloading EC2 scripts
Running Spark on EC2 with the scripts
Deploying Spark on Elastic MapReduce
Deploying Spark with Chef (Opscode)
Deploying Spark on Mesos
Spark on YARN
Spark standalone mode
References
Summary
2. Using the Spark Shell
The Spark shell
Exiting out of the shell
Using Spark shell to run the book code
Loading a simple text file
Interactively loading data from S3
Running the Spark shell in Python
Summary
3. Building and Running a Spark Application
Building Spark applications
Data wrangling with iPython
Developing Spark with Eclipse
Developing Spark with other IDEs
Building your Spark job with Maven
Building your Spark job with something else
References
Summary
4. Creating a SparkSession Object
SparkSession versus SparkContext
Building a SparkSession object
SparkContext – metadata
Shared Java and Scala APIs
Python
iPython
Reference
Summary
5. Loading and Saving Data in Spark
Spark abstractions
RDDs
Data modalities
Data modalities and Datasets/DataFrames/RDDs
Loading data into an RDD
Saving your data
References
Summary
6. Manipulating Your RDD
Manipulating your RDD in Scala and Java
Scala RDD functions
Functions for joining the PairRDD classes
Other PairRDD functions
Double RDD functions
General RDD functions
Java RDD functions
Spark Java function classes
Common Java RDD functions
Methods for combining JavaRDDs
Functions on JavaPairRDDs
Manipulating your RDD in Python
Standard RDD functions
The PairRDD functions
References
Summary
7. Spark 2.0 Concepts
Code and Datasets for the rest of the book
Code
IDE
iPython startup and test
Datasets
Car-mileage
Northwind industries sales data
Titanic passenger list
State of the Union speeches by POTUS
MovieLens Dataset
The data scientist and Spark features
Who is this data scientist DevOps person?
The Data Lake architecture
Data Hub
Reporting Hub
Analytics Hub
Spark v2.0 and beyond
Apache Spark – evolution
Apache Spark – the full stack
The art of a big data store – Parquet
Column projection and data partition
Compression
Smart data storage and predicate pushdown
Support for evolving schema
Performance
References
Summary
8. Spark SQL
The Spark SQL architecture
Spark SQL how-to in a nutshell
Spark SQL with Spark 2.0
Spark SQL programming
Datasets/DataFrames
SQL access to a simple data table
Handling multiple tables with Spark SQL
Aftermath
References
Summary
9. Foundations of Datasets/DataFrames – The Proverbial Workhorse for Data Scientists
Datasets – a quick introduction
Dataset APIs – an overview
org.apache.spark.sql.SparkSession/pyspark.sql.SparkSession
org.apache.spark.sql.Dataset/pyspark.sql.DataFrame
org.apache.spark.sql.{Column,Row}/pyspark.sql.(Column,Row)
org.apache.spark.sql.Column
org.apache.spark.sql.Row
org.apache.spark.sql.functions/pyspark.sql.functions
Dataset interfaces and functions
Read/write operations
Aggregate functions
Statistical functions
Scientific functions
Data wrangling with Datasets
Reading data into the respective Datasets
Aggregate and sort
Date columns, totals, and aggregations
The OrderTotal column
Date operations
Final aggregations for the answers we want
References
Summary
10. Spark with Big Data
Parquet – an efficient and interoperable big data format
Saving files in the Parquet format
Loading Parquet files
Saving processed RDDs in the Parquet format
HBase
Loading from HBase
Saving to HBase
Other HBase operations
Reference
Summary
11. Machine Learning with Spark ML Pipelines
Spark’s machine learning algorithm table
Spark machine learning APIs – ML pipelines and MLlib
ML pipelines
Spark ML examples
The API organization
Basic statistics
Loading data
Computing statistics
Linear regression
Data transformation and feature extraction
Data split
Predictions using the model
Model evaluation
Classification
Loading data
Data transformation and feature extraction
Data split
The regression model
Prediction using the model
Model evaluation
Clustering
Loading data
Data transformation and feature extraction
Data split
Predicting using the model
Model evaluation and interpretation
Clustering model interpretation
Recommendation
Loading data
Data transformation and feature extraction
Data splitting
Predicting using the model
Model evaluation and interpretation
Hyperparameters
The final thing
References
Summary
12. GraphX
Graphs and graph processing – an introduction
Spark GraphX
GraphX – computational model
The first example – graph
Building graphs
The GraphX API landscape
Structural APIs
What’s wrong with the output?
Community, affiliation, and strengths
Algorithms
Graph parallel computation APIs
The aggregateMessages() API
The first example – the oldest follower
The second example – the oldest followee
The third example – the youngest follower/followee
The fourth example – inDegree/outDegree
Partition strategy
Case study – AlphaGo tweets analytics
Data pipeline
GraphX modeling
GraphX processing and algorithms