Build Apache Spark Application in IntelliJ IDEA 14.1

My operating system is Windows 7, so some steps in this tutorial may differ slightly in your environment.

First, install Scala 2.10.x on Windows to run Spark; otherwise you will get errors like this:

Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
        at akka.actor.ActorCell$.<init>(ActorCell.scala:305)
        at akka.actor.ActorCell$.<clinit>(ActorCell.scala)
        at akka.actor.RootActorPath.$div(ActorPath.scala:152)
        ......

Please refer to this post.

Second, install the Scala plugin and create a Scala project. You can refer to this document: Getting Started with Scala in IntelliJ IDEA 14.1.

After all the above steps are done, the project view should look like this:

[Screenshot]

Then follow the next steps:

(1) Select “File” -> “Project Structure”:

[Screenshot]

(2) Select “Modules” -> “Dependencies” -> “+” -> “Library” -> “Java”:

[Screenshot]

(3) Select spark-assembly-x.x.x-hadoopx.x.x.jar and press OK:

[Screenshot]

(4) Configure the library and press OK:

[Screenshot]

(5) The final configuration looks like this:

[Screenshot]

(6) Write a simple CountWord application:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object CountWord {
  def main(args: Array[String]): Unit = {
    // On Windows, Hadoop looks for bin\winutils.exe under this directory
    System.setProperty("hadoop.home.dir", "c:\\winutil\\")

    val logFile = "C:\\spark-1.3.1-bin-hadoop2.4\\README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the file with at least 2 partitions and cache it, since it is scanned twice
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop() // Release the Spark context when done
  }
}

Please note the call System.setProperty("hadoop.home.dir", "c:\\winutil\\"): you should download winutils.exe and put it in the folder C:\winutil\bin. For details, refer to the following posts:
a) Apache Spark checkpoint issue on windows;
b) Run Spark Unit Test On Windows 7.

(7) The final execution looks like this:

[Screenshot]

 

The following part describes how to create an SBT project:

(1) Select “New project” -> “Scala” -> “SBT”, then click “Next”:

[Screenshot]

(2) Fill in the “project name” and “project location”, then click “Finish”:

[Screenshot]

(3) On Windows, change the Scala version to 2.10.4 in build.sbt:

[Screenshot]

(4) Add the Spark package and create a Scala object in the “src -> main -> scala-2.10” folder. The final file layout looks like this:

[Screenshot]

(5) Run it!

You can also build a jar file: select “File” -> “Project Structure” -> “Artifacts”, then choose the options like this:

[Screenshot]

Refer to this post on Stack Overflow.

Then use the spark-submit command to run the jar package:

C:\spark-1.3.1-bin-hadoop2.4\bin>spark-submit --class "CountWord" --master local[4] C:\Work\Intellij_scala\CountWord\out\artifacts\CountWord_jar\CountWord.jar
15/06/17 17:05:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:>                                                          (0 + 0) / 2]
[Stage 0:>                                                          (0 + 1) / 2]
[Stage 0:>                                                          (0 + 2) / 2]

Lines with a: 60, Lines with b: 29

Getting Started with Scala in IntelliJ IDEA 14.1

This tutorial uses IntelliJ IDEA 14.1.3.

Prerequisites:

You should install Java and Scala first.

(1) Install Scala plugin:

a) After installing IntelliJ IDEA, install the Scala plugin first: in the welcome window, select Configure -> Plugins:

[Screenshot]

b) Select “Install JetBrains Plugin...”:

[Screenshot]

c) If your computer needs a proxy, click “HTTP Proxy Settings” to configure it; otherwise skip this step:

[Screenshot]

 

d) Select the Scala plugin and click “Install plugin” to install it:

[Screenshot]

 

The installation progress looks like this:

[Screenshot]

e) After installation, restart IntelliJ IDEA:

[Screenshot]

 

 

 

(2) Create Scala project:
a) Select “Create New Project”:

[Screenshot]

b) Select “Scala” -> “Scala”, then click Next:

[Screenshot]

c) Enter a valid project name and choose a folder to store the project files:

[Screenshot]

d) Set Project SDK to the JDK directory:

[Screenshot]

After selecting it, click “OK”:

[Screenshot]

e) For Scala SDK, click “Create”. It will display the installed Scala version; click “OK”:

[Screenshot]

f) Click “Finish”:

[Screenshot]

(3) Create Scala application:

a) Select src -> New -> Scala Class:

[Screenshot]

b) Select Object as the Kind value:

[Screenshot]

c) Write a simple “Hello World” program:

[Screenshot]

d) Select Run -> Run:

[Screenshot]

e) Select HelloWorld:

[Screenshot]

f) The application outputs “Hello World!”:

[Screenshot]

All is OK now!

 

 

How to organize a successful technical party?

Since last year, I have been taking part in some technical parties. Some were very successful, while others were not. In this article, I will share my ideas about how to organize a successful technical party, using the Golang programming language as an example.

To hold a party, there must be a stable user group first. Depending on the number of users, the group may need a committee or a president, and the organizer’s job is to find sponsors, select topics, and so on. Although there are many social platforms now, the group must still have a mailing list: as long as the internet exists, email will not die, while the chosen social platform may.

A party could be held every six weeks or every two months; an interval that is too long or too short may not be appropriate. Before holding a party, the organizer can collect topics from the user group. If there are too many topics, the organizer should decide which ones to use. Personally, I think four presentations are enough for one party. In the first, the speaker can share the latest news or some stories about Golang. The second and third must be Golang-oriented: the speakers can share programming skills, debugging tricks, source code analysis, etc. The final topic can be technical but not necessarily about Golang: the speaker can share *NIX internals, script programming knowledge, etc.

If possible, record videos of the talks and upload them to the internet, because this will increase the group’s influence and attract more people and sponsors. During or after the party, it is reasonable to advertise for the sponsors since they have provided support, and this may encourage them to offer more support in the future!

I hope this post helps some people! Enjoy a successful technical party!

Why do I need root privilege?

Last week, the support engineer told me that a strange issue had occurred on a commercial system, and gave me an account so that I could check it. I used this account to log in to the system, but when I tried to run some commands, the system reported “Permission denied”. I also wanted to use DTrace, but that requires root privilege too. So the following dialogue took place between the administrator and me:

I: I need root privilege, because I want to write some scripts and run some tests.
Administrator: This is a commercial system; only operations team members have root privilege. You can send commands to them, and they will execute them and send the results back to you.
I: I need to investigate further based on the previous results, and this may last a long time. So I think it would be more convenient for me to operate the system myself.
Administrator: No, you are not allowed to operate the commercial system. You can only send your scripts and commands to the operations team members, and they will send the results back.
I:……

In my experience, debugging is a tough process that may last several days or even months, and the engineer needs to dig into and analyse the previous output, then decide what to do next. Sometimes a single digit can give the engineer the spark. So I need root privilege to do the debugging myself, and I don’t want to send mails back and forth. That disrupts my work!

Having no root privilege really sucks!

A trick for building multithreaded applications on Solaris

First, let’s look at a simple multithreaded application:

#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>     /* sleep() */

/* Thread 1: sets errno to 1 after 3 seconds. */
void *thread1_func(void *p_arg)
{
        errno = 0;
        sleep(3);
        errno = 1;
        printf("%s exit, errno is %d\n", (char*)p_arg, errno);
        return NULL;
}

/* Thread 2: never sets errno to a non-zero value itself. */
void *thread2_func(void *p_arg)
{
        errno = 0;
        sleep(5);
        printf("%s exit, errno is %d\n", (char*)p_arg, errno);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, thread1_func, "Thread 1");
        pthread_create(&t2, NULL, thread2_func, "Thread 2");

        sleep(10);      /* give both threads time to finish */
        return 0;
}

What output do you expect from this program? In my understanding, errno should be a thread-safe variable. Although the thread1_func function changes errno, that should not affect errno in the thread2_func function.

Let’s check it on Solaris 10:

bash-3.2# gcc -g -o a a.c -lpthread
bash-3.2# ./a
Thread 1 exit, errno is 1
Thread 2 exit, errno is 1

Oh! The errno in the thread2_func function has also changed to 1. Why does this happen? Let’s find the root cause in the errno.h file:

/*
 * Error codes
 */

#include <sys/errno.h>

#ifdef  __cplusplus
extern "C" {
#endif

#if defined(_LP64)
/*
 * The symbols _sys_errlist and _sys_nerr are not visible in the
 * LP64 libc.  Use strerror(3C) instead.
 */
#endif /* _LP64 */

#if defined(_REENTRANT) || defined(_TS_ERRNO) || _POSIX_C_SOURCE - 0 >= 199506L
extern int *___errno();
#define errno (*(___errno()))
#else
extern int errno;
/* ANSI C++ requires that errno be a macro */
#if __cplusplus >= 199711L
#define errno errno
#endif
#endif  /* defined(_REENTRANT) || defined(_TS_ERRNO) */

#ifdef  __cplusplus
}
#endif

#endif  /* _ERRNO_H */

We can see that errno becomes a thread-safe variable (#define errno (*(___errno()))) only when the following condition is true:

defined(_REENTRANT) || defined(_TS_ERRNO) || _POSIX_C_SOURCE - 0 >= 199506L

Let’s try it:

bash-3.2# gcc -D_POSIX_C_SOURCE=199506L -g -o a a.c -lpthread
bash-3.2# ./a
Thread 1 exit, errno is 1
Thread 2 exit, errno is 0

Yes, the output is right!

From Compiling a Multithreaded Application, we can see:

For POSIX behavior, compile applications with the -D_POSIX_C_SOURCE flag set >= 199506L. For Solaris behavior, compile multithreaded programs with the -D_REENTRANT flag.

So we should pay more attention when building multithreaded applications on Solaris.

P.S. The full code is here.

Reference:
(1) Compiling a Multithreaded Application;
(2) What is the correct way to build a thread-safe, multiplatform C library?