Apache Spark has revolutionised the way in which large organisations ingest and process Big Data and is quickly becoming the industry standard for large-scale and near real-time analytics.
During this training course, attendees will be introduced to the essential features of the Spark architecture, its data structures and its compatibility with other Big Data and analytical tools (e.g. Hadoop, Hive, SQL, R and Python). They will also gain practical skills in the Scala language, allowing them to design and deploy Spark applications on a multi-node, parallel computing cluster.
The course will also include an introduction to the machine learning techniques available in Spark (through the Spark ML and MLlib libraries), model validation, selection and optimisation methods, and examples of algorithms used for network/graph analytics (using the GraphX library).
Basic course information
Minimum recommended duration: 4-5 full days or 8-10 half-days (can be spread across multiple weeks)
Programming languages used: Scala (also HDFS shell commands and basics of Java)
Minimum number of attendees: 5
Course level: Beginner to intermediate data engineers, data scientists and developers.
Pre-requisites: Good IT skills and practical experience in manipulating large datasets are recommended. Some knowledge of Unix commands will be beneficial; however, these will be explained during the training.
IT recommendations: During the course the attendees will run several MapReduce jobs on a Linux-based Mind Project Hadoop cluster. To benefit fully from the course, attendees should have at least one of the following web browsers installed on their laptops (any operating system): Chrome, Safari, Mozilla Firefox or Internet Explorer. The laptops should also be equipped with a simple text editor suitable for typing code and scripts, e.g. Notepad++ (for Windows users) or TextWrangler (for Mac users). Please be advised that we do not recommend the following applications: WordPad, Gedit or TextEdit. Other IT requirements will apply depending on the agreed setup. Please contact us should you wish to use a different setup for your course.
The programme for each in-house training course is discussed and agreed individually with the client. The proposed contents of the course may include (but are not limited to) the following concepts and topics:
Using the Scala language, the Spark engine and its libraries for data import/export from/to various file formats and storage systems (e.g. standard file formats like csv, tab, txt, or Hadoop, Amazon S3 buckets, Hive etc.),
Understanding the structure and operations applicable to Resilient Distributed Datasets, DataFrames and other Spark data structures and objects; Spark transformations and actions,
Data wrangling within Spark – converting between various data structures, recoding values, joins and merges, working with timestamps and strings, preparing data for further processing, applying Spark ML transformers e.g. normalisation or standardisation,
Calculating descriptive statistics and carrying out essential exploratory data analysis including data aggregations and summaries, cross-tabulations, frequency/contingency tables etc.,
Deploying fully functional Big Data machine learning Spark applications using Spark ML pipelines – multiple linear regression for predicting a continuous numeric target variable and Generalized Linear Models e.g. logistic regression for binary classification – a tutorial on the Spark ML and MLlib libraries,
Performing model cross-validation; calculating and interpreting model evaluation metrics e.g. accuracy, recall, precision, R squared, ROC curve, MSE and RMSE,
Manipulating and extracting information from graphs and networks, estimating essential network/graph parameters e.g. degrees, triangles, or (strongly) connected components and applying graph algorithms e.g. PageRank or label propagation – a tutorial on Spark GraphX,
Understanding the compatibility of Spark with other Big Data and data science tools (e.g. Hadoop, Hive, RDBMSs) and programming languages (e.g. Java, Python and R).
Customise the course
We can adapt our in-house training courses to address your specific needs and requirements e.g.:
The course can be designed to include your own data. If this is not possible, e.g. due to data security issues, we can customise the course to contain exercises that address similar problems,
The course period can be spread across multiple weeks/months depending on your needs and availability – this will allow your delegates to revise and practise the learnt skills before the next session and provide them with additional time to internalise all presented material,
The course can include a custom project spread across several weeks/months with a follow-up session at the end of the period.
As all our in-house training courses are quoted individually, the final cost quotation will be based on several factors: the number of attendees, days of training (plus additional support/project guidance if needed), location of the training, complexity of IT setup and the extent of course customisation.
Arrange this course at your organisation
If you are interested in this in-house training course, please press the Ask For Quote button at the top of the page to request a quote based on your specific needs and the desired outcomes of the training.
In your enquiry please include the following information:
contact details of the person who should receive the quote,
number of delegates you would like to train,
approximate number of days (or half-days) you would like to arrange the course for (including additional support/project guidance if needed),
location of the training venue,
any details on course customisation or specific topics you would like the course to address – most importantly, please indicate the desired outcomes of the course if different from those presented above,
any other questions you may have.