Banner

Friday, October 28, 2016

Kotlin for Data Science

Can Kotlin be an Effective Alternative for Python and Scala?

As I started diving formally into data science, I cannot help but notice there is a large gap between data science and software engineering. It is good to come up with prototypes, ideas, and models out of data analysis. However, executing those ideas is another animal. You can outsource the execution to a software engineering team, and that can go well or badly depending on a number of factors. In my experience, it is often helpful to do the execution yourself or at least offer assistance by modeling towards production.

Although Python can be used to build production software, I find its lack of static typing causes difficulty in scaling. It does not easily plug in with large corporate infrastructures built on the Java platform either. Scala, although an undeniably powerful JVM language, is somewhat esoteric and does not click with everyone, especially those who do not have a software engineering background or love of expressing code with mathematical flair. But Kotlin, a new JVM language by JetBrains (the creator of Intellij IDEA, PyCharm, and dozens of other developer tools), has an active community, rapid growth and adoption, and might serve as a pragmatic alternative to Scala. Although Kotlin is unlikely to replace Python's numerical efficiency and data science libraries, it might make a practical addition to your toolbelt especially since it works with Spark out-of-the-box. as shown below.

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import kotlin.reflect.KClass
 
fun main(args: Array<String>) {
 
    class MyItem(val id: Int, val value: String)
 
    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Line Counter")
 
    conf.registerKryoClasses(MyItem::class)
 
    val sc = JavaSparkContext(conf)
 
    val input = sc.parallelize(listOf(MyItem(1,"Alpha"),MyItem(2,"Beta")))
 
    val letters = input.flatMap { it.value.split(Regex("(?<=.)")) }
            .map { it.toUpperCase() }
            .filter { it.matches(Regex("[A-Z]")) }
    println(letters.collect())
}
/*extension function to register Kotlin classes*/
fun SparkConf.registerKryoClasses(vararg args: KClass<*>) = 
    registerKryoClasses(args.map { it.java }.toTypedArray()

I'll cover Spark with Kotlin another time, but you can look at this simple GitHub project if you like. Apache Spark is definitely a step in the right direction to close the gap between data science and software engineering, or more specifically, turning an idea into immediate execution. You can use SparkR and PySpark to interface R and Python with Spark. But if you want to use a production-grade JVM language, the only mainstream options seem to be Scala and Java. But as stated Kotlin works with Spark too, as it is 100% interoperable with all Java libraries.

A Comparison Between Python and Kotlin

Let's take a look at a somehwat simple data analysis case study. For now, we will leave Scala and Spark out to only compare Kotlin with Python. What I want to highlight is how Kotlin has the tactical conciseness of Python, and maybe in some ways brings a little more to the table as a language for data analysis. Granted, there are not a lot of mainstream JVM data science libraries other than Apache Spark. But they do exist and perhaps there is room for growth, and maybe the language may be worth keeping your eye on (and even exploring).

This comparison was inspired by the social media example from the first chapter of Data Science from Scratch (Grus, O'Reilly). Let's start with declaring two sets of data, users and friendships. Using simply dicts, lists, and tuples without any classes, this is how it could be done in Python.

Python

users = [ 
    { "id" : 0, "name" : "Hero" }, 
    { "id" : 1, "name" : "Dunn" }, 
    { "id" : 2, "name" : "Sue" }, 
    { "id" : 3, "name" : "Chi" }, 
    { "id" : 4, "name" : "Thor" }, 
    { "id" : 5, "name" : "Clive" }, 
    { "id" : 6, "name" : "Hicks" }, 
    { "id" : 7, "name" : "Devin" }, 
    { "id" : 8, "name" : "Kate" }, 
    { "id" : 9, "name" : "Klein" }, 
]
 
friendships = [ 
    (0,1), 
    (0,2), 
    (1,2), 
    (1,3), 
    (2,3), 
    (3,4), 
    (4,5), 
    (5,6), 
    (5,7), 
    (6,8), 
    (7,8), 
    (8,9)
]

The users is a List of dict items, and the friendships are a List of Tuple items. A feature of dynamic typing is you can be fast-and-loose creating data structures that maintain a raw data-like nature. There is no enforcement to uses classes or explicit types.

The equivalent to doing this in Kotlin would look like this:

Kotlin

val users = listOf(
    mapOf("id" to 0, "name" to "Hero"),
    mapOf("id" to 1, "name" to "Dunn"),
    mapOf("id" to 2, "name" to "Sue"),
    mapOf("id" to 3, "name" to "Chi"),
    mapOf("id" to 4, "name" to "Thor"),
    mapOf("id" to 5, "name" to "Clive"),
    mapOf("id" to 6, "name" to "Hicks"),
    mapOf("id" to 7, "name" to "Devin"),
    mapOf("id" to 8, "name" to "Kate"),
    mapOf("id" to 9, "name" to "Klein")
    )
 
 val friendships = listOf(
        listOf(0,1), listOf(0,2), listOf(1,2), listOf(1,3), listOf(2,3), listOf(3,4),
        listOf(4,5), listOf(5,6), listOf(5,7), listOf(6,8), listOf(7,8), listOf(8,9)
 )

For the friendships, we can actually create Pair<Int,Int> items. Kotlin does not really encourage Tuples (or any collection with differing types) and we will see what it offers instead later with the data class. But let's use Pairs in this example instead using the to operator.

val friendships = listOf(
        0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
        4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
)

This may look effective at first glance, and Kotlin is statically typed. It is inferring the type for users and friendships as List<Map<String,Any>> and List<List<Int> respectively. Notice how friendships is a List containing Map<String,Any> items, meaning each item has a String for a key and an Any for the value. The reason for the Any is some values are String and others are Int (due to the "id" and "name"), and because the type is not consistent it cast them back down to Any. If we want to use Hero's "id", we need to cast it back up to an Int for it to be treated like an Int rather than a raw Any.

val herosId = users[0]["id"] as Int

Of course, if an "id" value is slipped in as a String accidentally this would throw an error. You can check if it is an Int, but at this point we are just fighting the statically-typed nature of Kotlin (just like Java and Scala). In Kotlin, we are much better off creating a class and doing things the statically-typed way. While this may make dynamic-typing advocates moan, check this out. Kotlin has a concise, readable way of declaring a class quickly and easily, even exceeding Python's standards

Python

class User(Any):
    def __init__(self, id, name):
        self.id = id
        self.name = name
 
    def __str__(self):
        return "{0}-{1}".format(self.idself.name)
 
users = [ 
    User(0,"Hero"), 
    User(1,"Dunn"), 
    User(2,"Sue"), 
    User(3,"Chi"), 
    User(4,"Thor"), 
    User(5,"Clive"), 
    User(6,"Hicks"), 
    User(7,"Devin"), 
    User(8,"Kate"), 
    User(9,"Klein"), 
]

Kotlin

data class User(val id: Int, val name: String)
 
val users = listOf(
      User(0, "Hero"),
      User(1, "Dunn"),
      User(2, "Sue"),
      User(3, "Chi"),
      User(4, "Thor"),
      User(5, "Clive"),
      User(6, "Hicks"),
      User(7, "Devin"),
      User(8, "Kate"),
      User(9, "Klein")
    )

Not too shabby, right? Technically, we did less typing (as in keyboard typing) than Python (76 characters less to be exact, excluding spaces). And we achieved static typing in the process.

Kotlin is certainly a progressive language compared to Java, and it even has practical features like data classes. We made our User a data class, which will automatically implement functionality typically used for classes holding plain data. It will implement toString() and hashcode()/equals() using the properties, as well as a nifty "copy-and-modify" builder by using a copy() function. (This helps aid flexibility while maintaining immutability, which is valued in software engineering).

Kotlin

data class User(val id: Int, val name: String)
 
val user = User(10,"Tom")
val changedUser = user.copy(name = "Thomas")
 
println("Old user: $user")
println("New user: $changedUser")

OUTPUT:

Old user: User(id=11, name=Tom)
New user: User(id=11, name=Thomas)

NOTE: In Kotlin, val precedes the declaration of an immutable variable. var precedes a mutable one.

Data classes are a valuable tool especially for working with data. And yes, Kotlin supports named arguments for constructors and functions as shown in the copy() function above.

Let's return back to our example. Say we wanted to find the mutal friends between two Users. Traditionally in Python, you would create a series of helper functions to assist in this task.

Python

class User(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name
 
    def __str__(self):
        return "{0}-{1}".format(self.idself.name)
 
users = [ 
    User(0,"Hero"), 
    User(1,"Dunn"), 
    User(2,"Sue"), 
    User(3,"Chi"), 
    User(4,"Thor"), 
    User(5,"Clive"), 
    User(6,"Hicks"), 
    User(7,"Devin"), 
    User(8,"Kate"), 
    User(9,"Klein"), 
]
 
friendships = [ 
    (0,1), 
    (0,2), 
    (1,2), 
    (1,3), 
    (2,3), 
    (3,4), 
    (4,5), 
    (5,6), 
    (5,7), 
    (6,8), 
    (7,8), 
    (8,9)
]
 
def user_for_id(user_id):
    for user in users:
        if user.id == user_id:
            return user
 
 
 
def friends_of(user):
    for friendship in friendships:
        if friendship[0] == user.id or friendship[1] == user.id:
            for other_user_id in friendship:
                if other_user_id != user.id:
                    yield user_for_id(other_user_id)
 
def mutual_friends_of(user, otherUser):
    for friend in friends_of(user):
        for other_friend in friends_of(otherUser):
            if (friend.id == other_friend.id):
                yield friend
 
 
# print mutual friends between Hero and Chi 
 
for friend in mutual_friends_of(users[0],users[3]):
    print(friend)

OUTPUT:

1-Dunn
2-Sue

But we can do something similar in Kotlin. This is our first pass, so I'll show a better way in a moment.

Kotlin

fun main(args: Array<String>)  {
 
    data class User(val id: Int, val name: String)
 
    val users = listOf(
        User(0,"Hero"),
        User(1, "Dunn"),
        User(2, "Sue"),
        User(3, "Chi"),
        User(4, "Thor"),
        User(5, "Clive"),
        User(6, "Hicks"),
        User(7, "Devin"),
        User(8, "Kate"),
        User(9, "Klein")
        )
 
    
    val friendships = listOf(
            0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
            4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
    )
 
    fun userForId(id: Int): User {
        for (user in users)
            if (user.id == id)
                return user
        throw Exception("User not found!")
    }
 
    fun friendsOf(user: User): List<User> {
        val list = mutableListOf<User>()
        for (friendship in friendships) {
            if (friendship.first == user.id)
                list += userForId(friendship.second)
            if (friendship.second == user.id)
                list += userForId(friendship.first)
        }
        return list
    }
 
    fun mutualFriendsOf(user: User, otherUser: User): List<User> {
        val list = mutableListOf<User>()
        for (friend in friendsOf(user))
            for (otherFriend in friendsOf(otherUser))
                if (friend.id == otherFriend.id)
                    list += friend
 
        return list
    }
 
    for (friend in mutualFriendsOf(users[0],users[3]))
        println(friend)
}

OUTPUT:

User(id=1, name=Dunn)
User(id=2, name=Sue)

Although Kotlin seems to have lost in this example by being wordier and resorting to Lists, hold on. Kotlin has no direct concept of generators and yield keywords. However, we can accomplish something that fulfills the same purpose (and is arguably stylistically better) through Sequence.

fun main(args: Array<String>)  {
 
    data class User(val id: Int, val name: String)
 
    val users = listOf(
        User(0,"Hero"),
        User(1, "Dunn"),
        User(2, "Sue"),
        User(3, "Chi"),
        User(4, "Thor"),
        User(5, "Clive"),
        User(6, "Hicks"),
        User(7, "Devin"),
        User(8, "Kate"),
        User(9, "Klein")
        )
 
    
    val friendships = listOf(
            0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
            4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
    )
 
    fun userForId(id: Int) = users.asSequence().filter { it.id == id }.first()
 
    fun friendsOf(user: User) = friendships.asSequence()
            .filter { it.first == user.id || it.second == user.id }
            .flatMap { sequenceOf(it.first,it.second) }
            .filter { it != user.id }
            .map { userForId(it) }
 
    fun mutualFriendsOf(user: User, otherUser: User) = friendsOf(user).flatMap { friend ->
        friendsOf(otherUser).filter { otherFriend -> otherFriend.id == friend.id }
    }
 
    mutualFriendsOf(users[0],users[3]).forEach { println(it) }
}

OUTPUT:

User(id=1, name=Dunn)
User(id=2, name=Sue)

We can use the Sequence to compose a series of operators as a chain, like filter(), map(), flatMap(), and many others. This style of functional programming has been getting a lot of traction over the years thanks to LINQ, primarily because it easily breaks up logic into simple pieces and increases maintainability. 99.99% of the time, I am never using for loops but rather using a Kotlin Sequence, a Java 8 Stream, or an RxKotlin/RxJava Observable. This chain-operator syntax is becoming less alien in Python as well (look at PySpark and RxPy). What is great about this style of programming is you can read what is happening left-to-right, top-to-bottom rather than jumping through several loops and helper functions.

Conclusions

In the coming months, I am going to blog about my experiences using Kotlin for data science, and I will continue to share what I learn. I may throw in an article occasionally covering ReactiveX for data science as well (for both Kotlin and Python). I acknowledge that the Java JVM platform, which Kotlin runs on, does not handle numbers as efficiently as Python or R (maybe Project Valhalla will change that?). But successful models inevitably need to turn into execution, and the Java platform increasingly seems to be the place that happens.

Kotlin merely provides a pragmatic abstraction layer that provides a tactical and concise syntax that seems excellent not just for data analysis, but also executing software. Outside of data science, Kotlin has spurred many successful open-source libraries even before a year after its release. One library, TornadoFX, allows rapid turnaround of complex business user interfaces using Kotlin (As a disclaimer, I have helped with that project). The Kotlin community is active, growing, and engaged on the Slack channel. It continues to be adopted on Android as well as backends, and JetBrains is using Kotlin to build all their tools (including PyCharm and Intellij IDEA). It is also replacing Groovy as the official language for Gradle. Because of these facts, I do not see Kotlin's momentum slowing down anytime soon.

I believe Kotlin could make a great first JVM language, more so than Java or Scala (I struggle to make Jython count). If you are already happy with Scala or Java you will likely not need Kotlin. But for folks wanting to break into JVM languages, there is a new O'Reilly video series that covers Kotlin from scatch. Its instructor Hadi Hariri (one of the JetBrains folks behind Kotlin) believes Pythonistas should be able to follow along. He said anybody familiar with classes, functions, properties, etc should be able to learn Kotlin in a day with this video series. Unfortunately, the existing Kotlin documentation and books assume prior Java knowledge, and hopefully more resources other than the video pop up in the future.

There is a lot of exciting features I have not covered about Kotlin in this article. Features like nullable types, extension properties and functions, and boilerplate-free delegates make the language pleasant to use and productive. So check out Kotlin if you are using Python for data science and wanting to learn a JVM language. Again, this is not a proposal to drop your current tools, but rather consider exploring an additional one that may help you tackle new problems. I will continue blogging about my experiences with Kotlin, and showcase it being used in deeper data science topics as well as Spark.

12 comments:

  1. > Scala, although an undeniably powerful JVM language, is rather esoteric and inaccessible for those who do not have a software engineering background

    Just plain false

    Your title is misleading, you don't actually compare Scala

    ReplyDelete
    Replies
    1. I'll grant I have not been explicit who I was writing this article for: Python developers who struggled to adopt Scala or Java. That's why I was comparing the syntax to Python. If you are a Scala developer, you have plenty of expertise to research Kotlin. And if you like Scala, you have little need to read this article.

      Personally I tried to learn Scala about a year ago and I just couldn't "get it", even as a Java developer. It's too math-driven for me and a lot of Kotlin developers have similar sentiments. But a good platform like the JVM has something for everyone, right? :)

      Delete
    2. Comparison of Scala to Kotlin: https://kotlinlang.org/docs/reference/comparison-to-scala.html

      Delete
    3. Your observation that Pythonistas might find JVM style typing cumbersome is pretty fascinating to me since I came from the reverse direction, moving from C/C++ to Java to Python, which I tend to love precisely for its loose typing. It DOES support type wrapping, which can keep most unexplained behavior in check (especially around any math operations or string manipulation).

      There are plenty of other reasons to use JVM variants for enterprise development, mostly for their approach to private functions and robust threading support - how does Kotlin compare there?

      Delete
    4. Hey Eric. Yes you are right on all points. I like Python and am using it weekly with RxPy. Python is currently the right tool for many tasks due to its data science libraries. But there are other benefits though to JVM and Kotlin in the enterprise. Kotlin has aptly been described as "Scala for dummies", meaning it has a pragmatic syntax and set of language features designed for industry, but removes a lot of what people find to be academic and esoteric. It has full concurrency support since it runs on the JVM, and (although I use Rx for concurrency) there will be support for async/await/yield operations soon.

      https://blog.jetbrains.com/kotlin/2016/04/kotlin-post-1-0-roadmap/

      I see you sent a message to me. I'll follow up later today and get back.

      Delete
  2. Why are you comparing character counts? If that matters to you, use Perl and single char varnames.

    But, in the example when you do it for Python, since it's a criteria of yours: you may want to consider a named tuple.

    >>> Person = namedtuple("Person", "fname lname age")
    >>> Person("Fishy", "McFishface", 34)
    Person(fname='Fish', sname='McFishface', age=34)

    ReplyDelete
  3. I'm afraid I don't follow your first question regarding counts. And yeah, you can use named tuples for sure. As a Java dev they just feel odd since I can create a class, although I see your point in that it yields less boilerplate.

    ReplyDelete
  4. I think Kotlin is going to take over android Community very very soon
    Learn Kotlin android

    ReplyDelete
  5. I definitely appreciate your blog. Excellent work Data science !

    ReplyDelete
  6. Hello,
    The Article on Kotlin for Data Science is nice,It give detail information about it.Thanks for Sharing the information about Data Science. big data scientist

    ReplyDelete

  7. it’s really nice and meanful. it’s really cool blog. Linking is very useful thing.you have really helped lots of people who visit blog and provide them usefull information.
    Data Science Training in Hyderabad

    ReplyDelete