About StackOverkill

A Big Data project for the analysis of the current use of different programming languages, frameworks and software development technologies.




What is this?

This is a Big Data project made by Daniel Sánchez Castelló. It was released in May 2016 as a final degree project for the University of Wales.


What is the source and purpose of this project?

It's built from Stack Overflow's data and measures the activity (questions, answers and votes) around different languages, frameworks and technologies, which are identified by the posts' tags. The intention is to get a picture of how popular those languages and technologies are today and how they have trended over the last few years.
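To make the measurement concrete, here is a minimal single-machine sketch of counting activity per tag and month. It is not the project's actual code: the record layout, the `activity_by_tag` name and the sample posts are all hypothetical, and it assumes that answers carry no tags of their own and inherit them from their parent question.

```python
from collections import Counter

# Hypothetical, already-parsed posts (not real data).
# Assumption: post_type 1 is a question, post_type 2 is an answer.
posts = [
    {"id": 1, "post_type": 1, "parent_id": None,
     "created": "2016-01-15", "tags": ["python", "pandas"]},
    {"id": 2, "post_type": 2, "parent_id": 1,
     "created": "2016-02-03", "tags": []},
    {"id": 3, "post_type": 1, "parent_id": None,
     "created": "2016-02-20", "tags": ["python"]},
]


def activity_by_tag(posts):
    """Count questions and answers per (tag, month, kind)."""
    # Answers are assumed to inherit the tags of the question they belong to.
    question_tags = {p["id"]: p["tags"] for p in posts if p["post_type"] == 1}
    counts = Counter()
    for p in posts:
        if p["post_type"] == 1:
            kind, tags = "questions", p["tags"]
        else:
            kind, tags = "answers", question_tags.get(p["parent_id"], [])
        month = p["created"][:7]  # "YYYY-MM"
        for tag in tags:
            counts[(tag, month, kind)] += 1
    return counts


counts = activity_by_tag(posts)
for key in sorted(counts):
    print(key, counts[key])
```

Votes could be counted the same way by adding one more `kind`; they are left out here to keep the sketch short.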


How is it made?

First of all, I downloaded Stack Overflow's public dataset (updated every 3-4 months), which is in XML format, and converted it to Avro (a much more Big Data friendly format) using a script of my own called SOXTA, which uses the lxml and Avro libraries.
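As a rough idea of what the first half of that conversion involves, the sketch below streams `<row>` elements out of an XML file and turns each one into a flat record that an Avro writer could then consume. It is not SOXTA itself. To stay dependency-free it uses the standard library's `xml.etree.ElementTree.iterparse` instead of lxml, and it stops before the Avro writing step. The attribute names and the `<tag1><tag2>` tag encoding are assumptions about the dump's layout, and the sample rows are made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row sample in the layout the dump is assumed to use.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2016-01-15T10:00:00.000" Score="5" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2016-01-15T11:30:00.000" Score="3" />
</posts>
"""


def parse_tags(raw):
    """Turn '<python><pandas>' into ['python', 'pandas']."""
    if not raw:
        return []
    return raw.strip("<>").split("><")


def iter_posts(source):
    """Yield one flat dict per <row>, clearing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        yield {
            "id": int(elem.get("Id")),
            "post_type": int(elem.get("PostTypeId")),
            "created": elem.get("CreationDate"),
            "score": int(elem.get("Score", "0")),
            "tags": parse_tags(elem.get("Tags")),
        }
        elem.clear()  # drop the row's data so large files don't pile up in memory


records = list(iter_posts(io.BytesIO(SAMPLE_XML)))
for record in records:
    print(record)
```

Parsing incrementally matters here because the real files are far too large to load as a single tree; each record is handed on as soon as its row has been read.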

For the data mining I set up a distributed environment with Apache Hadoop, Hive and Spark. However, I don't own physical machines that can store and process that amount of data, so I used S3, EC2 and EMR from Amazon Web Services.

Once the data was processed, I worked a little bit of witchcraft with another Python script of my own called SOSTC3, which uses the Pandas and NumPy libraries to get the time series coefficients for each language and technology. I also used a polynomial regression algorithm to make a short-term prediction for them.
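The prediction step can be illustrated with NumPy alone. This is only a sketch of polynomial regression on a monthly series, not the SOSTC3 script: the function name `fit_and_forecast`, the degree of 2, the three-month horizon and the sample counts are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical monthly question counts for one tag (not real data).
monthly_counts = np.array([120, 135, 160, 170, 200, 230, 245, 280], dtype=float)
months = np.arange(len(monthly_counts))  # 0, 1, 2, ... as the time axis


def fit_and_forecast(x, y, degree=2, horizon=3):
    """Fit a least-squares polynomial and extrapolate `horizon` steps ahead."""
    coeffs = np.polyfit(x, y, degree)  # coefficients, highest power first
    future_x = np.arange(x[-1] + 1, x[-1] + 1 + horizon)
    return coeffs, np.polyval(coeffs, future_x)


coeffs, forecast = fit_and_forecast(months, monthly_counts)
print("coefficients:", coeffs)
print("next months:", forecast)
```

A low degree and a short horizon are deliberate choices in this sketch: polynomials diverge quickly outside the fitted range, which is one reason to keep the prediction short-term.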

And, finally, the results are shown to you via C3 charts. I hope you find them useful!


Cool! What about you?

I'm glad you asked! I'm Daniel Sánchez Castelló (a.k.a. Dani Sancas). I was born in Bilbao (Spain) in June 1988. I've been a web developer since I was 20 years old. And now, thanks to this project, I'm moving into the Big Data world. This is my first Big Data project. I found it very interesting and exciting, and I'm looking forward to learning more!


GitHub Twitter LinkedIn