Jupyter Notebooks for Data Analysis

Why Does Apache Spot Include iPython notebooks?

The project team wants Apache Spot to be a versatile tool that can be used by anyone. This means that data scientists and developers need to be able to query and handle the source data to find all the information they need for their decision making. The iPython Notebook is an appropriate platform for easy data exploration. One of its biggest advantages is that it provides parallel and distributed computing to enable code execution and debugging in an interactive environment – thus the ‘i’ in iPython.

The iPython notebook is a web based interactive computational environment that provides access to the Python shell. While iPython notebooks were originally designed to work with the Python language, they support a number of other programming languages, including Ruby, Scala, Julia, R, Go, C, C++, Java and Perl. There are also multiple additional packages that can be used to get the most out of this highly-customizable tool.

Starting on version 4.0, most notebook functionalities are now part of the Project Jupyter, while iPython remains as the kernel to work with Python code in the notebooks.


IPython with Apache Spot for Network Threat Detection

NOTE:  This is not intended to be a step-by-step tutorial on how to code a threat analysis in Apache Spot, but more like an introduction on how to approach the suspicions of a security breach.

Although machine learning (ML) will do most of the work detecting anomalies in the traffic, Apache Spot also includes two notebook templates that can get you started on this. The Threat_Investigation_master.ipynb is designed to query the raw data table to find all connections in a day that are related to any threat you select – even connections that were not necessarily flagged as suspicious by ML on a first run. This gives us the chance to get a new data subset and here is where the fun begins.

If you suspect of a specific type of attack in your network, you can get the whole story by answering the Five ‘W’s.


Maybe there’s been an increase in the logs collected by the system, which indicates abnormal amounts of communication in your network. Or, the amount of POST requests in your network have risen overnight. This is the mystery that needs to be solved by researching through the anomalies previously detected by ML.


Assuming you have a network context, you can identify the name of the infected machine inside the network, as well as the name of the IP or DNS on the other side of the connection (if it is a known host). If you don’t have a network context or are using DHCP, this can be a little tricky to detect using only Netflow logs. But, that’s where DNS and Proxy logs, come in handy. Including a network context file with Apache Spot is really simple and can go a long way when identifying a threat.


To have a broader visibility on the attack, you can customize the queries on the Threat investigation notebook to review the data through a wider time lapse – instead of just checking through the current day. With this, you could find an increase of a certain type of requests to one (or many) URIs and predict its future behavior.


When working only with DNS, having a destination URL might not say much about where your information is going to, but Apache Spot allows you to connect with a geolocation database to identify the location of the suspected attackers IP. Taking advantage of this option, you can visually locate the other end of the connection on a map. You might find that it’s pointing to a country banned by your company, indicating a leak.


This answer to “why” will depend highly on the result of the analysis. For instance, an excessive amount of POST requests from one machine inside the network to an unidentified URI can indicate a data mining attack. Tracing back to patient zero, you can find that this could have originated with a phishing email, malicious software installed by an employee or a one-time visitor’s infected machine that connected to your network.

How to Get Answers to the Five Ws Questions

All of the previous questions can be answered by looking at the raw data collected. Although performing elaborated queries directly to your database can seem tempting, this type of analysis with Hive, or even Impala, can be very time consuming. A better approach would be to use Pandas to read and transform your dataset into a relational structured dataframe. This lets you work with as if it were an offline structured relational database.

Once you have your desired results and data subsets, you can use MatplotLib to easily graph your findings. (We cover this subject in more depth in another post.) Another advantage of the notebook is that you can download it as HTML or a PDF file to store locally and use it in a presentation – or just keep it for future reference.

Wrap Up

This post was meant to be just a brief introduction of how you can use iPython notebooks in Apache Spot to perform further data analysis and include it our executive report (in addition to the already included Story board). Although this is not the only way you can do this, it is a very interactive and fun way to do it. You’ll also see that the overall processing time is very short – thanks to the iPython notebook task parallelism ability.

We want to hear from YOU! Have you used iPython notebooks before? How do you feel about having this tool in Apache Spot? If you’re interested in further data analysis through interactive charts, a new post is coming soon on D3 and jQuery data visualization. Also, check back soon to read more on this and other Cybersecurity subjects.