Apache Spark is an open-source cluster-computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s in-memory primitives can provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.
In the Syncfusion Big Data Platform, Spark is deployed in YARN-client mode, and a number of sample scripts are shipped to help you get started with Spark. For more information, refer here.
The Spark tab provides a user-friendly interface to manage and run Spark scripts with ease. It provides the following features.
Interactively run Spark scripts (Scala, Python and Spark SQL)
Spark Scala and Python scripts can be run interactively from within Big Data Studio by typing Spark commands directly into the provided console.
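For example, in Python mode a short sequence of commands typed one by one into the console might look like the following sketch; the `sc` (SparkContext) variable is predefined by the Spark shell itself, and the sample data is illustrative:

```python
# Typed line by line into the interactive PySpark console.
# `sc` is the SparkContext that the Spark shell creates automatically.
data = sc.parallelize([1, 2, 3, 4, 5])   # distribute a small list across the cluster
squares = data.map(lambda x: x * x)      # transformation: evaluated lazily
squares.collect()                        # action: runs the job, returns [1, 4, 9, 16, 25]
```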
Execute a complete Spark script.
You can execute a complete Spark script file loaded in the editor by clicking the “Execute” button.
When a script file is executed using the “Execute” button, the output is displayed in a plain-text view in a separate “Result” tab.
Logs generated during execution are displayed under the “Logs” tab.
A history of the Spark jobs submitted by clicking the “Execute” button is maintained separately and can be accessed through the “History” tab.
You can run all commands in the script file loaded in the editor through the interactive console, one by one, by clicking the “Run All” button or by choosing the “Run in Console” option in the context menu.
You can run selected commands in the script file through the interactive console, one by one, by clicking the “Run Selection” button or by choosing the “Run Selection in Console” option in the context menu.
An autocomplete feature is available in the editor. It suggests keywords as you type and allows you to accept a suggestion or select one by pressing the “down arrow” key.
Manage script files
You can create a new script file or load an existing file using the “Script” button.
You can save a file under a new name using the “Save As” button.
You can also import scripts from a folder, create new scripts, and delete scripts present in the tree view.
We ship several samples that you can use to get started.
Spark Memory Configuration
You can configure Spark settings for each Spark shell independently of the Spark configuration set in the Hadoop cluster. This enables you to run Spark shells with different configurations based on your needs.
To change the Spark memory configuration values, click the “Configuration” button.
This opens the “Spark Memory Configuration” form. Update the driver memory, executor memory, executor cores, and number of executors, then restart the shell by clicking the “Restart Shell” button.
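These four values correspond to standard Spark submission options. For comparison, a stand-alone Spark shell started on YARN outside the studio could be given equivalent settings like this (the specific values shown are illustrative, not recommendations):

```
spark-shell --master yarn-client \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 2 \
  --num-executors 4
```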
Switch between Scala, Python, IPython, and Spark SQL
By default, the Spark tab opens a shell in Spark-Scala mode. To switch to PySpark, click “Python” and wait a few seconds for the PySpark console to start within the studio; the samples displayed in the left tree view are then reloaded with PySpark scripts. To switch to IPython, just click “IPython”; here, PySpark is configured to run within IPython. To switch to Spark SQL, just click “Spark SQL”.
The Spark SQL query editor provides the option to interactively run Spark SQL queries against the Spark Thrift Server. You can also visualize Spark SQL tables and database details.
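For example, queries such as the following could be run interactively from the Spark SQL editor (the `sales` table and its columns are hypothetical placeholders, not part of the shipped samples):

```sql
-- List the available databases and tables first.
SHOW DATABASES;
SHOW TABLES;

-- An illustrative aggregation over a hypothetical "sales" table.
SELECT category, COUNT(*) AS total
FROM sales
GROUP BY category
ORDER BY total DESC
LIMIT 10;
```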
IPython enables an interactive notebook for data science and scientific computing. You can use mathematical expressions and plots to analyze the relevant output data. The plots (graphs) can be used to visualize results in statistical form using matplotlib.
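As a minimal sketch of this workflow, the snippet below plots a small, hard-coded result set with matplotlib; in practice, the `results` list would come from a Spark action such as `collect()`:

```python
# Minimal sketch: visualize a (hypothetical) collected Spark result with matplotlib.
import matplotlib.pyplot as plt

# Stand-in for data collected from a Spark job; the values are illustrative.
results = [("2014", 120), ("2015", 180), ("2016", 240)]
years = [r[0] for r in results]
counts = [r[1] for r in results]

positions = range(len(years))
plt.bar(positions, counts)        # draw one bar per year
plt.xticks(positions, years)      # label the bars with the year values
plt.xlabel("Year")
plt.ylabel("Record count")
plt.show()                        # in IPython, the plot is rendered inline
```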
Access Spark SQL data from .NET applications through the Spark Thrift Server
As with Hive, data stored using Spark SQL can be accessed through a Syncfusion-provided .NET API (Syncfusion.ThriftHive.Base). This API provides user-friendly access to Spark SQL data from within the .NET environment. At present, the Spark Thrift Server is managed only through the Syncfusion Cluster Manager application; there is no local standalone manager. You can start and stop the Spark Thrift Server in a Hadoop cluster formed using the Syncfusion Cluster Manager.
By default, the Spark Thrift Server is configured to run on port 10001 by the Syncfusion Cluster Manager. To connect the Hive C# samples to the Spark Thrift Server, just change the hostname and port number and run the sample.
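Independently of the .NET API, you can check that the Spark Thrift Server is reachable using the standard beeline JDBC client shipped with Spark (replace the hostname placeholder with your cluster’s host; port 10001 is the default noted above):

```
beeline -u "jdbc:hive2://<hostname>:10001"
```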
For more information about the Thrift library (Syncfusion.ThriftHive.Base), refer to the documentation available here.