Data analytics field to focus on

I remember my first data analyst interview, when I was kindly advised against trying to do everything from A to Z. The takeaway from the interview was that I have to focus.

In bigger organisations, there are separate roles in the field of data analytics. Each data professional contributes his or her area of expertise, from architecting to modeling. In a smaller outfit, the same data analyst might be doing the work end to end.


Data Architecture

Creating a system that can take in data in its raw form, process it, store it, use it and get the results to the program or end user that needs them. Open-source technologies like Hive, Pig and Spark are the tools of the trade. If the organisation is a “Microsoft Shop”, there is a certification exam. Mostafa has a good rundown on preparing for the exam. For a smaller entity, this architecting part might be outsourced to the cloud. I do not see myself working in this area.

Experiment design

Creating a plan to collect enough data of the right type to give meaningful results.
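To make "enough data" concrete, here is a minimal sketch of one common planning calculation: the sample size needed to estimate a proportion within a chosen margin of error. The function name, the 95% confidence level and the worst-case proportion of 0.5 are my own assumptions for illustration, and it only covers simple random sampling.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, expected_p=0.5):
    """Minimum simple-random-sample size to estimate a proportion.

    Uses the textbook formula n = z^2 * p * (1 - p) / e^2.
    expected_p=0.5 is the most conservative (largest n) assumption.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_p * (1 - expected_p) / margin_of_error ** 2
    # Round up: a fraction of a respondent is not possible.
    return math.ceil(n)


if __name__ == "__main__":
    # Respondents needed for a +/- 5 percentage point margin at 95% confidence.
    print(sample_size_for_proportion(0.05))
```

A tighter margin of error pushes the required sample up quickly, which is why this step belongs in the plan before any data is collected.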

Data Wrangling

Creating clean and useful data from messy real-world collections drawn from disparate sources, and concatenating them together.
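As a rough illustration of wrangling, the sketch below cleans two small made-up sources that describe the same thing in different shapes, then concatenates them. The data, column names and helper names are all hypothetical; a real project would more likely use R or a data frame library.

```python
import csv
import io

# Two hypothetical sources: different column names, separators and formatting.
SOURCE_A = """name,rent
 Alice ,1200
bob,950
"""
SOURCE_B = """Tenant;Monthly Rent
CAROL;"1,400"
bob;950
"""


def clean_row(name, rent):
    """Normalise one record: tidy the name, turn the rent into an integer."""
    return {"name": name.strip().title(), "rent": int(rent.replace(",", "").strip())}


def wrangle():
    rows = []
    for rec in csv.DictReader(io.StringIO(SOURCE_A)):
        rows.append(clean_row(rec["name"], rec["rent"]))
    for rec in csv.DictReader(io.StringIO(SOURCE_B), delimiter=";"):
        rows.append(clean_row(rec["Tenant"], rec["Monthly Rent"]))

    # Drop exact duplicates while keeping first-seen order.
    seen = set()
    unique = []
    for row in rows:
        key = (row["name"], row["rent"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    for row in wrangle():
        print(row)
```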

Data Modeling

Creating a mathematical representation of your data.
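The simplest mathematical representation I can think of is a straight line fitted by least squares. The sketch below is only an illustration with made-up numbers; the function name and data are my own assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points that lie exactly on y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```

Once the slope and intercept are known, the model can be used to estimate y for an x that was never observed, which is the point of modeling.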

What is my focus?

I would love to do Experiment Design, Data Wrangling and Data Modeling all together. It is a multi-step process of data analysis, and these steps bring variety to the job.





Application of Machine Learning – Infer gender from first name

Twitter does not ask for your gender or where you are from when you sign up. Thanks to Machine Learning, they can infer it based on what you tweet and serve up relevant ads.

Perhaps Twitter is using an “Infer gender from first name” approach based on your Twitter handle. A search on Google turns up some interesting tools and approaches that help you to “infer gender from first name”.
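To show the idea in its simplest form, here is a lookup-table sketch. It is not how Twitter or any of those tools actually work; the tiny name table and the function name are hypothetical, and a real tool would rely on a large name dataset and report a probability rather than a flat answer.

```python
import re

# Hypothetical, tiny lookup table for illustration only.
NAME_TABLE = {
    "mary": "female",
    "anna": "female",
    "john": "male",
    "peter": "male",
}


def infer_gender(handle_or_name):
    """Guess a gender from the first alphabetic token of a name or handle."""
    tokens = re.findall(r"[A-Za-z]+", handle_or_name)
    if not tokens:
        return "unknown"
    # Anything not in the table stays "unknown" instead of being guessed.
    return NAME_TABLE.get(tokens[0].lower(), "unknown")


if __name__ == "__main__":
    for handle in ["john_doe92", "@Mary.K", "x9", "1234"]:
        print(handle, infer_gender(handle))
```

A handle that does not start with a known first name falls through to "unknown", which is one reason this kind of inference misses so often.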

Twitter does not always get it spot on. Ads for Ukrainian women seeking marriage have appeared on my Twitter timeline on a few occasions. 😊


Maybe I should tweet more about my family instead of just current affairs, data science, Singapore, living in Ireland and aviation stuff.


So you call yourself a Data Scientist?

There should be a law that people can’t call themselves data scientists unless they

  • hold the title conferred by a recognised awarding association,
  • have written a relevant peer-reviewed research paper published by IEEE,
  • have developed and contributed an R/Python package,
  • work with gigabytes of structured and unstructured data daily,
  • and hold at least a Master’s degree.

Sem 3 Project Update 1

As I decided whether and how I could use that data for my project, I revisited the proposal.

The proposal was intended for private rentals, not rentals by local authorities or approved housing bodies (housing associations).

Such a distinction makes a difference. It means data from local authorities or approved housing bodies (housing associations) can be dropped, letting me focus on the data that matters.

The wrong set of data, other things being equal, is going to adversely affect the final outcome of the machine learning algorithm.
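Dropping the out-of-scope records could look something like the sketch below. The records, the `landlord_type` field and its labels are made up for illustration; the real dataset will have its own column names and categories.

```python
# Hypothetical records; the field names and labels are assumptions.
RECORDS = [
    {"id": 1, "landlord_type": "private", "monthly_rent": 1200},
    {"id": 2, "landlord_type": "local authority", "monthly_rent": 400},
    {"id": 3, "landlord_type": "approved housing body", "monthly_rent": 500},
    {"id": 4, "landlord_type": "Private ", "monthly_rent": 1350},
]

# Categories that fall outside the scope of the proposal.
EXCLUDED = {"local authority", "approved housing body"}


def private_rentals(records):
    """Keep only rows whose landlord type is not an excluded category."""
    return [
        r for r in records
        if r["landlord_type"].strip().lower() not in EXCLUDED
    ]


if __name__ == "__main__":
    for row in private_rentals(RECORDS):
        print(row)
```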

Happy Days!


Ultimate R Resources

R resources that guide me when I am stuck. You can’t possibly remember every command.

View Ultimate R Resources on Hackpad.


What is your census story?

Looking at my own Electoral Division (CSO Area Code ED 02158) where I live, the population is 2,164 (1,266 males and 1,353 females). Of this, 2,110 respondents said they were Catholic.

There are 294 Asian or Asian Irish.

741 males were reported as single vs 727 females.

1,824 were reported as not being able to speak Irish.

1,090 people out of 2,159 are at work.

Only 21 out of 1,052 households had no central heating; the vast majority, 770, were heated by natural gas.

As for education, there are 8 people in my local area with PhDs (5 males, 3 females), and 236 people gave their occupation as Professional Occupations.

716 households (out of 1,052) have a computer, while 1 household has four or more cars.

466 of them drive to work, school or college.
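Raw counts are easier to compare as percentages. The sketch below takes a few of the figures quoted above and converts them; the dictionary keys and function name are my own labels.

```python
# (count, total) pairs taken from the figures quoted in this post.
CENSUS = {
    "at_work": (1090, 2159),
    "no_central_heating": (21, 1052),
    "heated_by_natural_gas": (770, 1052),
    "has_computer": (716, 1052),
}


def percent(part, whole):
    """Share of the whole, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {label: percent(part, whole) for label, (part, whole) in CENSUS.items()}

if __name__ == "__main__":
    for label, share in shares.items():
        print(f"{label}: {share}%")
```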


Day 3 with RapidMiner

I hit the wall on my third day with RapidMiner.

Instead of diving in as usual, today I decided to go over the tutorial that came with the software.

Unlike R, which is a command-line tool where you type commands to get things done, RapidMiner offers a point-and-click approach. At every step of the process, there is a dialogue box to alert you to what you have overlooked. The most common one is this.

Screenshot 2016-04-26 17.53.49

Frustrated, I felt like dropping them an email but decided to press ahead by doing something different: trying out their tutorial.

I learnt that if you want to export your data to csv in RapidMiner, you use the Write CSV Operator.


However, I am not pleased with the output CSV. I was expecting each piece of data to be in its own column.
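One possible cause, and this is only a guess since I have not inspected the file closely, is a separator mismatch: the file is written with one column separator and opened by a program that expects another. The sketch below uses a made-up semicolon-separated export to show the symptom and the fix.

```python
import csv
import io

# Made-up export that uses ";" as the column separator (an assumption).
EXPORTED = "id;name;rent\n1;Alice;1200\n2;Bob;950\n"


def read_rows(text, delimiter):
    """Parse CSV text with the given column separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Read with the wrong separator: every line lands in a single column.
as_comma = read_rows(EXPORTED, ",")

# Read with the matching separator: each value gets its own column.
as_semicolon = read_rows(EXPORTED, ";")

if __name__ == "__main__":
    print(as_comma[0])
    print(as_semicolon[0])
```

If this is the cause, either changing the separator used on export or telling the reading program which separator to expect should put the values back into individual columns.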

Until then, I will keep exploring this tool.

As someone who learns things visually, this tool serves as an aid for me to learn more about data mining techniques.

Most likely, RapidMiner is going to be the first tool I go to before writing an R command. When collaborating with others, the visual representation offered by RapidMiner can be less intimidating than being confronted with lines of code.


Which R reference book for me?

View Which R book for me? on Hackpad.


LinkedIn is not Facebook


Much has been said about LinkedIn not being Facebook, which I agree with. However, we are more than our jobs. Work can’t be the only thing that defines us fully. At the workplace, I am sure you don’t just talk about work all day long. You share your favourite life moments with your co-workers, your likes and dislikes, your take on a sporting event over the weekend. You put your family photo on your work desk. You “decorate” your work desk with bottles of hand lotion.

The best place to have the best of both worlds online, professional and personal, is Twitter. No one will censure you if you share what you ate for the day (just don’t overdo it) and, in the next tweet, share a business idea or how to extract unsampled raw data from Google Analytics using an open-source tool.

Let’s connect on Twitter. I am at


Successfully connected RStudio on my PC to GitHub

Tinkering with tech. I successfully connected RStudio on my PC to GitHub and pushed the changes.

The idea behind this tinkering is somewhat similar to Dropbox. When you move a file to a folder, that file syncs, so when you are not with your PC you have access to the file from any PC.

In this case, instead of a file, any change to your R code is reflected on GitHub. This helps when you collaborate with other people. They are able to see what you have done when you use R to manipulate the data.


To get RStudio to push R code to GitHub, you have to install Git. RStudio can act as a GUI front-end for Git. This means one less piece of software, a Git client, to install.

This is a new learning milestone for me. It makes the geek in me happy.

This part of setting up Git for version control is not covered in NCIRL’s Higher Diploma in Science in Data Analytics. If you enrol in one of those Coursera data science courses, you are introduced to Git.

Here is my R code on GitHub