Catalogue Tag Display

MARC 21

Big data analytics: A Handy reference guide for data analysts and data scientists to help to obtain value from big data analytics using Spark on Hadoop clusters
Tag	Description
020	$a9781785884696
041	$aEng
084	$aHD38.7.A55 2016
100	$aAnkam, Venkat
245	$aBig data analytics$bA Handy reference guide for data analysts and data scientists to help to obtain value from big data analytics using Spark on Hadoop clusters$ht
260	$aBirmingham$bPackt Pub.$c2016
300	$ xv$a300 p. : ill. ; 23 cm.
307	$bBook
505	$aBig Data Analytics at a 10,000-Foot View; Big Data analytics and the role of Hadoop and Spark; A typical Big Data analytics project life cycle; Identifying the problem and outcomes; Identifying the necessary data; Data collection; Preprocessing data and ETL; Performing analytics; Visualizing data; The role of Hadoop and Spark; Big Data science and the role of Hadoop and Spark; A fundamental shift from data analytics to data science; Data scientists versus software engineers; Data scientists versus data analysts; Data scientists versus business analysts; A typical data science project life cycle; Hypothesis and modeling; Measuring the effectiveness; Making improvements; Communicating the results; The role of Hadoop and Spark; Tools and techniques; Real-life use cases; Summary; Chapter 2: Getting Started with Apache Hadoop and Apache Spark; Introducing Apache Hadoop; Hadoop Distributed File System; Features of HDFS; MapReduce; MapReduce features; MapReduce v1 versus MapReduce v2; MapReduce v1 challenges; YARN; Storage options on Hadoop; File formats; Compression formats; Introducing Apache Spark; Spark history; What is Apache Spark?; What Apache Spark is not; MapReduce issues; Spark's stack; Why Hadoop plus Spark?; Hadoop features; Spark features; Frequently asked questions about Spark; Installing Hadoop plus Spark clusters; Summary; Chapter 3: Deep Dive into Apache Spark; Starting Spark daemons; Working with CDH; Working with HDP, MapR, and Spark pre-built packages; Learning Spark core concepts; Ways to work with Spark; Spark Shell; Spark applications; Resilient Distributed Dataset; Method 1 -- parallelizing a collection; Method 2 -- reading from a file; Spark context; Transformations and actions; Parallelism in RDDs; Lazy evaluation; Lineage Graph; Serialization; Leveraging Hadoop file formats in Spark; Data locality; Shared variables; Pair RDDs; Lifecycle of Spark program; Pipelining; Spark execution summary; Spark applications; Spark Shell versus Spark applications; Creating a Spark context; SparkConf; SparkSubmit; Spark Conf precedence order; Important application configurations; Persistence and caching; Storage levels; What level to choose?; Spark resource managers -- Standalone, YARN, and Mesos; Local versus cluster mode; Cluster resource managers; Standalone; YARN; Mesos; Which resource manager to use?; Summary; Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets; History of Spark SQL; Architecture of Spark SQL; Introducing SQL, Datasources, DataFrame, and Dataset APIs; Evolution of DataFrames and Datasets; What's wrong with RDDs?; RDD Transformations versus Dataset and DataFrames Transformations; Why Datasets and DataFrames?; Optimization; Speed; Automatic Schema Discovery; Multiple sources, multiple languages; Interoperability between RDDs and others.
650	$aBig data -- Security measures