Learning Patterns Your Source for Quality Technology Courseware

Introduction to Spark 2 with Python: Lab Setup Instructions (Windows OS: Java 8, Python 3.6, Spark 2.4.6)

Below are the standard requirements for this course. If you have any questions or issues, please contact us.

Important Note: Student lab files are required on each computer used for the course. The links for these are not in this lab setup, and you should receive them separately.

Other notes:

  • It’s a good idea to keep downloaded software install files on the machines during the class in case of problems that require a re-install.
  • Cloning a setup is generally not a problem. If it is, we’ll mention it in the software section (for example, much of the IBM/RAD-WAS software can be problematic in this regard).

Hardware and classroom setup.

Each student and the instructor shall have a workstation that fulfills the listed requirements.

  • Required: Intel-compatible processor (with reasonably recent hardware).
  • Memory: 8GB min recommended
  • Disk Space: Free disk space for software installs (generally minimum 2GB)
  • Operating System: Windows OS (any modern version, e.g. Windows 10; labs have not been tested on Windows 8 variants)
  • Required: Zip utility. A good free one is 7-zip
  • Required: Adobe Acrobat Reader
  • Required: One of Firefox browser (https://www.mozilla.org/en-US/firefox/new/) or Chrome browser (https://www.google.com/chrome/). Edge browser is not sufficient.
  • Recommended: Internet access
  • Recommended: Class machines networked together - allows students to access a shared network directory.
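
If you are preparing many machines, a scripted check of the free disk space can save some time. The sketch below is only an illustration, and it assumes Python is already available on the machine. The name free_space_ok and the reading of "2GB" as 2 * 1024**3 bytes are our assumptions, not part of the course requirements.

```python
import shutil

# Minimum free space from the requirements list (2 GB, taken here as 2 * 1024**3 bytes).
MIN_FREE_BYTES = 2 * 1024 ** 3


def free_space_ok(path, min_free=MIN_FREE_BYTES):
    """Return True if the drive holding path has at least min_free bytes free."""
    return shutil.disk_usage(path).free >= min_free


if __name__ == "__main__":
    # "." means the drive of the current folder; pass "C:\\" to check the C: drive.
    print("Enough free space:", free_space_ok("."))
```

Run it from the drive where you plan to install the software and extract the labs.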

Install 7-zip 

We’ve found that there are sometimes problems using the built-in Windows archive/zip utility. This generally has to do with long path lengths that it can’t handle. Use 7-zip, which we’ve found very reliable, to extract the labs and any software zips.

  • You can try the direct download link for the 64-bit installer: https://www.7-zip.org/a/7z2301-x64.exe
  • If that doesn’t work, go to the home page https://www.7-zip.org
  • Near the top of the page, find the download link for your bitness (probably 64 bit), and download the installer.
  • Execute the installer, and take all the defaults.
  • You can now extract zip files by right clicking on them, and selecting 7-Zip | Extract ...
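
To see why long paths matter, the sketch below reports the longest path an archive would create under a given folder, and compares it with the classic Windows limit of 260 characters (MAX_PATH). This is a rough illustration only: the function name longest_extracted_path and the sample archive are made up for the example, and we are assuming (not confirming) that the built-in extractor fails at that limit.

```python
import io
import zipfile

# Classic Windows MAX_PATH limit (260 characters, counting the terminating null).
MAX_PATH = 260


def longest_extracted_path(archive, dest_dir):
    """Return (length, member_name) for the longest path the archive would create under dest_dir."""
    longest = (0, "")
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Zip members use forward slashes; Windows paths use backslashes.
            full = dest_dir.rstrip("\\") + "\\" + name.replace("/", "\\")
            if len(full) > longest[0]:
                longest = (len(full), name)
    return longest


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without a real lab zip.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("labs/lab01/readme.txt", "hello")
        zf.writestr("labs/" + "deep/" * 60 + "data.csv", "a,b")
    buf.seek(0)
    length, name = longest_extracted_path(buf, "C:\\")
    print(length, "characters; reaches MAX_PATH:", length >= MAX_PATH)
```

Extracting to a short folder such as C:\ keeps the resulting paths as short as possible.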

Lab Files: Each student and instructor must have lab files installed (links to these files are generally sent separately via e-mail).

  • Extract the lab files to a location conveniently accessible to the student (e.g. C:\ )
  • We recommend using a utility like 7-zip, not the Windows built-in extractor.
  • If you use a folder other than C:\, make sure that students know where the lab files are.

Other instructor requirements for the classroom

  • Projector or large screen TV capable of 1280x800 or higher resolution. Instructor must be able to use this to project slides.
  • Whiteboard (preferred) or flip charts with markers.

Install Java Development Kit – JDK 1.8 Update 411

  • Note that any relatively recent JDK1.8 version is fine.
  • Note that you'll need a free Oracle logon to download the JDK.
  • From https://www.oracle.com/java/technologies/javase/javase8u211-later-archive-downloads.html find the latest Java 8 release installer file for your OS
    • Windows 64 bit: e.g. jdk-8u411-windows-x64.exe (Almost certainly this is the one you want. 32-bit Windows OS installs are now rare).
  • Click the link for the installer file, accept the license agreement, enter your login credentials, and download the installer.
  • Run the installer and take all defaults.

  • Create or modify environment variables as appropriate for your OS. This will add an environment variable JAVA_HOME, and modify your path to include the jdk bin folder
    • JAVA_HOME:
      • Right click My Computer and choose Properties > click the Advanced tab > click the Environment Variables button
      • In the bottom half of the dialog, click New to add a new System variable
      • Variable name: JAVA_HOME (this is case-sensitive)
      • Variable value: C:\Program Files\Java\jdk1.8.0_411 (or adjust to the actual path where you installed the JDK and your JDK version – please double-check this path – probably best to copy and paste it)
      • Click OK
    • Path:
      • Find this existing entry in the bottom half of the Environment Variables dialog, and click Edit
      • Click in the Variable value field and move your cursor all the way to the left (pressing Home on your keyboard should do this quickly for you)
      • Check whether the value below is already present, or add it at the beginning if necessary (make sure you get all of this, including the trailing semicolon, with no spaces):

        %JAVA_HOME%\bin;

    • Click OK repeatedly (likely in 3 different dialogs) until all the dialogs close.

  • Open a terminal prompt, type the command below, and press Enter

    javac -version

    You should get a message that tells you the version. If the command is not found, recheck the JAVA_HOME value and the Path entry you set above.

  • Close the terminal prompt. You’re done installing Java
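
If you want to check several machines the same way, the Java checks above can be scripted. The sketch below is a minimal example, assuming Python is already installed; the function name check_java is ours, and it only checks that JAVA_HOME has a bin folder and that javac can be found on the PATH. It does not check the Java version.

```python
import os
import shutil


def check_java(env=None):
    """Return a list of problems found with JAVA_HOME and the PATH; empty means OK."""
    env = os.environ if env is None else env
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(os.path.join(java_home, "bin")):
        problems.append("JAVA_HOME has no bin folder: " + java_home)
    # shutil.which searches the given PATH much as the shell would.
    if shutil.which("javac", path=env.get("PATH", "")) is None:
        problems.append("javac was not found on the PATH")
    return problems


if __name__ == "__main__":
    found = check_java()
    for problem in found:
        print("PROBLEM:", problem)
    if not found:
        print("Java setup looks OK")
```

Run it in a newly opened command prompt, since environment variable changes are not picked up by prompts that were already open.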

PySpark Environment Setup

  • Set the following Environment Variables using the standard Windows dialogs
    • PYSPARK_PYTHON=C:\Users\[CurrentUserName]\Python36
      • Make sure this is appropriate for your environment - this is where our setup says to put the Python install
    • SPARK_LABS=C:\spark-labs-python
    • SPARK_HOME=%SPARK_LABS%\spark
    • KAFKA_HOME=%SPARK_LABS%\kafka
    • HADOOP_HOME=%SPARK_LABS%\winutils
    • Add the following to the PATH
      • %SPARK_HOME%\bin
      • %SPARK_HOME%\sbin
      • %KAFKA_HOME%\bin\windows
         
  • Add the Visual C++ Redistributable for needed DLLs
  • Test the install of the VC++ components
    • Open a command shell in C:\spark-labs-python (not a PowerShell) and run the following command

      C:\spark-labs-python> winutils\bin\winutils.exe
       
    • You should NOT get a Windows dialog about a missing DLL. If you do, the Visual C++ Redistributable is probably not installed correctly.

  • Test/Initialize the pyspark install and the OS path

    • Open a command shell in C:\spark-labs-python (not a PowerShell)
    • Run the following command

      C:\spark-labs-python> pyspark
       
    • The shell should come up cleanly with a >>> prompt.
    • Exit the spark shell by typing quit() 
    • Run the following command in the same command prompt (You can NOT copy paste all of it from here as noted below).
      • Note that the packages and master options are each preceded by two hyphens. The HTML display may make this hard to see visually
      • DO NOT copy/paste the complete command, as the HTML display is likely to change the hyphen characters into something that doesn’t work.
      • You can copy/paste the “org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6” as that won’t be changed by the HTML display.
         
        C:\spark-labs-python> pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 --master local[2]
    • It should do some downloading of needed jars, then start up cleanly. We do this to make sure the jars are stored locally in case students have issues using the internet.
    • Exit the spark shell by typing quit()
    • You’re done testing this
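
The environment variables at the start of this section refer to each other using the %NAME% syntax, so it can help to see what they should resolve to. The sketch below expands the values listed above in plain Python. The function name expand is ours, and this is a simplification of what Windows does (for example, it does not model the order in which Windows resolves variables), so treat it as a way to double-check paths, not as a description of Windows behavior.

```python
import ntpath
import re

# Values exactly as listed in the setup steps above.
SETTINGS = {
    "SPARK_LABS": r"C:\spark-labs-python",
    "SPARK_HOME": r"%SPARK_LABS%\spark",
    "KAFKA_HOME": r"%SPARK_LABS%\kafka",
    "HADOOP_HOME": r"%SPARK_LABS%\winutils",
}


def expand(value, settings):
    """Replace each %NAME% reference with its (recursively expanded) value."""
    def lookup(match):
        name = match.group(1)
        # Leave unknown references untouched.
        if name not in settings:
            return match.group(0)
        return expand(settings[name], settings)
    return re.sub(r"%([^%]+)%", lookup, value)


if __name__ == "__main__":
    for name in SETTINGS:
        print(name, "=", expand(SETTINGS[name], SETTINGS))
    # Full path of the winutils.exe file used in the DLL test above.
    print(ntpath.join(expand("%HADOOP_HOME%", SETTINGS), "bin", "winutils.exe"))
```

If any of the printed folders does not exist on the machine, the lab files were probably extracted to a different location than the one SPARK_LABS points to.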

Install Cygwin

  • Go to install page https://www.cygwin.com/install.html
  • This also includes the download link - which we give here: https://www.cygwin.com/setup-x86_64.exe
  • Execute the installer, and you will eventually get to the Select packages dialog, where you can select packages to install (most of Cygwin is not installed in the default install)
  • In the Select packages dialog, make sure the “View” drop down in the upper left is set to Category
  • Expand the Net category, and find the nc item.
  • Click the down arrow associated with the nc item, and select the entry that looks like a version number (e.g. 1.107-4)
  • Click Next through all remaining dialogs. When you come to the dialog that asks about shortcuts, choose to create shortcuts on Start and desktop.

Test Cygwin

  • To test, start Cygwin from the desktop shortcut.
  • Once the window comes up, execute the following commands.

    cd C:
    ls

  • You should get a directory listing of C:\

Install Notepad++

Install Firefox Browser

This specific browser is required for this course. Other browsers may not have the exact capabilities needed for some labs.

Install Chrome Browser

This specific browser may be required for this course. Other browsers may not have the exact capabilities needed for some labs.

  • Download and save the installer file.
  • Execute the installer - you can take all the defaults in the installation.
  • Once installed, start it and make sure it starts up normally.
  • Make it the default browser.