tags: [datascience, Netflix, Python, Jupyter]
categories: [Development]
Arriving at Jupyter Notebooks
Back story
A few years ago, I chose to leave the IT Department and take a lateral move into an Industrial Engineering Department with the same employer. I transitioned from working on execution and movement systems to extracting data from the manufacturing systems and using that data to make timely decisions on the manufacturing floor; for example, deciding which product should run next on a tool based on a demand signal. As a result of my career move, I became much more familiar with statistics, Theory of Constraints, Little’s Law, and innumerable details of the semiconductor manufacturing process. I learned quite a bit more about SQL and significantly improved my Excel skills.
My experience using data to make decisions led me to learn R and take a much deeper interest in statistical analysis. (I’ll be honest, I was also driven by my dislike for Excel.) R is an excellent language for this. I can’t say I’ve read every page of The R Book, but some of the pages I’ve read, marked up, and read again and again. Operation chaining was a huge discovery for me. Being able to chain group-by and aggregation functions together in a pipeline made a lot of sense to me because it’s similar to the Unix pipeline. I first used chaining in dplyr with data frames, then later with data.table.
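To make the idea concrete, here is the same group-by-and-aggregate pipeline written as a pandas method chain, the style I eventually landed on. The data and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical lot-level data from a manufacturing floor
df = pd.DataFrame({
    "tool": ["T1", "T1", "T2", "T2", "T2"],
    "lot": ["A", "B", "C", "D", "E"],
    "cycle_time": [4.2, 3.9, 5.1, 4.8, 5.0],
})

# Group-by and aggregation chained in one pipeline,
# much like a dplyr group_by() %>% summarise() chain
summary = (
    df.groupby("tool")
      .agg({"lot": "count", "cycle_time": "mean"})
      .rename(columns={"lot": "lots", "cycle_time": "mean_cycle_time"})
      .reset_index()
)
print(summary)
```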
I started out with R on the command line in REPL mode. I’ve worked with REPLs for a long time, starting with the first non-compiled language I loved: perl. The debugger was easy to drop into and became my first REPL. (I can still type perl -d -e 1 from finger memory.)
I made quite a bit of progress with the R REPL, but things got much easier, first with the R Console GUI and later with RStudio. There was joy in running a ggplot command and seeing the resulting graph pop up.
Disillusionment
I ran into several challenges with R that eventually caused me to look at other languages. I’m not criticizing others who use R, just explaining what I faced.
- R is difficult to automate; its command-line interface is confusing.
- Creating reusable packages is challenging. I’ve read R Packages and it still isn’t clear to me how to do it.
Although I lost momentum with R, I did not lose my desire for a data analytics language paired with a good graphical REPL.
Python & Jupyter
I started exploring Python as an alternative to R after reading good reviews of pandas. Just like learning R packages, the learning curve for pandas and other Python packages is a bit steep.
It didn’t take me long to start working with Jupyter Notebooks. I ran into them when I picked up Think DSP, which is an excellent blend of Python, Jupyter Notebooks, and package development. If the topic interests you, I highly recommend it as a way to become much more familiar with Python. The code for the chapters is available on GitHub: Think DSP git
Now, whenever I have time to spare, I fire up a Jupyter Notebook running Python and pandas and explore interesting data sets.
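A typical session looks something like the sketch below; the file and column names are hypothetical:

```python
import pandas as pd

# Load a data set and get a quick feel for its shape and contents
df = pd.read_csv("data/wafer_starts.csv")  # hypothetical file

df.head()       # first few rows
df.describe()   # summary statistics for the numeric columns
df.dtypes       # column types

# In a notebook, a quick plot renders inline below the cell
df["cycle_time"].hist(bins=30)
```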
Jupyter in the world
Earlier this year I participated in my first Kaggle event, the Learn Python Challenge, which I found was a great way to extend my Python knowledge and get a better understanding of Kaggle. I recommend the challenge to anyone trying to learn Python; it’s quick but covers a lot of useful ground. I particularly like the tests that help you determine whether you’ve completed each exercise with a sufficient answer.
Netflix appears to be heavily involved in the Jupyter notebook ecosystem, with several presentations and Medium posts describing how they use notebooks. I really like the thesis in Part 1 of their series:
That notebook then becomes an immutable historical record, containing all related artifacts — including source code, parameters, runtime config, execution logs, error messages, and so on.
Posthumously diagnosing failed software is difficult, especially if there is no record of the context as it ran. All production code I have worked on has used traces and log files to support posthumous investigations. Using an executed notebook as that record of execution makes a lot of sense.
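The tool behind that pattern, described in the Netflix posts, is their open-source papermill library. Here is a minimal sketch of the idea, with hypothetical notebook paths and parameters:

```python
import papermill as pm

# Execute a parameterized notebook; the output notebook keeps the
# source, injected parameters, logs, and any error output together
# as an immutable record of the run.
pm.execute_notebook(
    "templates/etl_job.ipynb",        # hypothetical input notebook
    "runs/etl_job_2018-10-01.ipynb",  # executed copy, kept as the record
    parameters={"run_date": "2018-10-01", "region": "us-west"},
)
```

If the run fails, the partially executed output notebook still exists, so the error can be inspected in the exact context where it occurred.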
Summary
If you are interested in getting into data engineering or data science, I think notebooks in general, and Jupyter Notebooks in particular, provide a valuable venue for sharing and presenting data in a compelling way. I’m impressed with the utility of Python as a way to investigate data deeply in a language that is easy to understand and teach to others.