Updating the models

Today I was working on another branch to write the code that can fill the database. I scanned some code and copied the results so that I could open them in clean JSON format, because I wanted to iterate over the results. I used an online JSON parser for this.


It is a website that shows the results cleanly so you can clearly see what's going on. After checking the results I found that there was some discrepancy with the database. So I left the work of filling the database and started working on updating it.

{
  "files": [
    {
      "licenses": [
        {
          "category": "Permissive",
          "start_line": 2358,
          "short_name": "MIT License",
          "spdx_url": "https://spdx.org/licenses/MIT",
          "text_url": "http://opensource.org/licenses/mit-license.php",
          "spdx_license_key": "MIT",
          "homepage_url": "http://opensource.org/licenses/mit-license.php",
          "score": 10.0,
          "end_line": 2358,
          "key": "mit",
          "owner": "MIT",
          "dejacode_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit",
          "matched_rule": {"licenses": ["mit"], "license_choice": false, "identifier": "mit_14.RULE"}
        },
        {
          "category": "Permissive",
          "start_line": 2532,
          "short_name": "MIT License",
          "spdx_url": "https://spdx.org/licenses/MIT",
          "text_url": "http://opensource.org/licenses/mit-license.php",
          "spdx_license_key": "MIT",
          "homepage_url": "http://opensource.org/licenses/mit-license.php",
          "score": 97.6,
          "end_line": 2547,
          "key": "mit",
          "owner": "MIT",
          "dejacode_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit",
          "matched_rule": {"licenses": ["mit"], "license_choice": false, "identifier": "mit.LICENSE"}
        }
      ],
      "path": "URL/5",
      "packages": [],
      "scan_errors": [],
      "copyrights": [
        {
          "end_line": 2542,
          "holders": ["GitHub Inc."],
          "start_line": 2530,
          "statements": ["Copyright (c) 2011-2017 GitHub Inc."],
          "authors": []
        },
        {
          "end_line": 2588,
          "holders": [],
          "start_line": 2582,
          "statements": ["(c) 2017 span title"],
          "authors": []
        }
      ]
    }
  ],
  "scancode_notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
  "scancode_version": "2.0.0rc3",
  "files_count": 1,
  "scancode_options": {
    "--format": "json",
    "--ignore": [],
    "--license": true,
    "--package": true,
    "--license-score": 0,
    "--copyright": true
  }
}

This is the JSON result that helped me to create the changes in the database.

Following is the commit related to the change in the database.


After this, I went back to work on filling the database. I will be able to push something tomorrow. It is a tough task to keep track of everything. If I write some bad code, it is going to be very hard to debug.

Using celery to run long-running tasks asynchronously

The heading itself is a bit confusing. So, let’s try to break it down by taking the example of what we wanted to achieve with this.

Our job was to run code scans when the user gives us a URL in the URL field. If you have read the earlier posts, you know we have already achieved somewhat similar results. But the problem is that scanning is a long and tedious task, and while the server is generating the output, the user has to wait on the same page for a long time.

So what we wanted was to run these scans on the server as separate background operations, kept apart from the simple tasks that happen on a regular basis. Only the very simple tasks stay in the main request/response cycle.

This can be thought of as related to threads. The main thread runs all the operations, but on some occasions we spawn child threads that run concurrently with the main thread. While that concept is usually used to reduce execution time, we are using a similar thing, celery, to provide a good user experience.

This form was created to take the URL from the user. Once the user gives in the URL, we handle the tasks in the tasks.py file, where all the stuff is defined.

Now we used a library called celery for running these tasks in the background. What we did was as follows:

We sent the users to a waiting URL. The same waiting URL will be used to show the results when the tasks are done and results are generated. The waiting URL will look somewhat like this:

We can also integrate an email system that asks the user for their email address and notifies them when the tasks are done.

After the completion of the tasks, when the user reloads the page, they will get the results. We are not trying to format the results properly for now; that is one of the upcoming tasks.

If you compare the URL shown while waiting with the URL that shows the results, you will find that both of them are the same.

So what’s happening in the backend? 

Basically, I used a small trick up here. I coded the following model.

class CeleryScan(models.Model):
    scan_id = models.AutoField(primary_key=True)
    scan_results = models.CharField(max_length=20000, null=True, blank=True)
    is_complete = models.BooleanField(default=False)

    def __str__(self):
        return str(self.scan_id)

The hack is to use the is_complete attribute in the model. Whenever a user gives a URL to scan, we create an instance of the CeleryScan class and send it to the celery task manager. What we have to remember here is the scan_id. scan_results is initialized to null and is_complete is set to False.

Finally, when the tasks are done, the model instance is updated using the same scan_id: we save the string output to scan_results and set is_complete to True.

In views.py, I wrote the code that takes the scan_id from the URL. If is_complete for that scan_id is True, the results are shown; otherwise the same 'Please wait' message is shown.
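As a framework-free sketch, the decision in that view boils down to the function below. The function and names here are my own illustration, not the project's exact code; the real view also fetches the CeleryScan instance by scan_id and renders a template.

```python
def scan_response(celery_scan):
    """Decide what the waiting URL should show for a scan record.

    `celery_scan` stands in for a CeleryScan instance that the real
    view would fetch from the database using the scan_id in the URL.
    """
    if celery_scan.is_complete:
        return celery_scan.scan_results
    return 'Please wait'
```

The user can keep reloading the same URL; nothing changes until the celery task flips is_complete.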

How does celery work: 

Celery creates a queue of the incoming tasks. First, we register the various tasks that are going to be executed by celery. Whenever Django encounters such a task, it passes it on to celery and does not wait for the results. Each task reaching celery is given a task_id, and we can check various things about the task using this task_id.

How to use celery: 

I used the following sources to include celery in the project.



It’s fairly simple. Please refer to the following commit for the code related to this post.


Using the requests library to grab code present at a URL

Finally, we reached the stage of building the main module of the project, i.e. the module that gets URLs from the users and returns the scan results. As the first part, we are fetching whatever is present at the URL, scanning the retrieved content and showing the results.

The approach that I was thinking of using:

I was thinking that, first of all, we would take the URL from the user in the form. Then we would pick up the files and subfolders from that location, gather everything in one place and run the scan command on the code.

For doing this I was thinking of using the wget and git clone commands to begin with. The idea was to start with a checker that tests whether the files can be downloaded from the provided URL. If yes, we would have used the wget command, and if the files could not be retrieved that way, we would have used git clone.

Clarification by the mentor:

My mentor said that downloading the files this way is of lesser use. So he asked me to use another Python module called requests, which, as far as I know, fetches the raw HTML that the browser would render. He suggested that in a later task we will add some conditions to get the code from websites like GitHub and Bitbucket.

After some clarifications from the mentor, I started writing the code. Templates to get the URL from the user were already implemented. Now I was going to write the basic logic, so I went to the tasks.py file and wrote the function.

This commit has everything that I did today: https://github.com/nexB/scancode-server/commit/fb72a4129fab31131f84622427a11b1675138821

In the tasks.py file, first of all, I imported the requests and os modules. Then I created the function that is called when the user submits the form.

Now I check which folders are already present in the media/URL/ directory. I used a mechanism to give sequential directory names to the incoming files, which works as follows: it puts all the subdirectories of the main directory into a list, sorts the list, and derives the name of the next directory by adding one to the list's last element, retrieved using dir_list[-1].
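A small sketch of that naming mechanism (the function name is my own; note it sorts the names as integers, since a plain string sort would put '10' before '9'):

```python
import os

def next_dir_name(parent_dir):
    """Return the name for the next incoming scan directory, e.g. if
    parent_dir contains '1', '2' and '3', the next name is '4'."""
    dir_list = sorted(
        int(name) for name in os.listdir(parent_dir) if name.isdigit()
    )
    if not dir_list:
        return '1'
    # dir_list[-1] is the highest existing number
    return str(dir_list[-1] + 1)
```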

When this is done, we use the requests module to fetch the HTML code present at the URL and, if the status_code of the request is 200, write that code into the file. After that, we call the same function that is used to scan local code, which I have talked about in the last few posts.
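Put together, the fetch-and-save step could look roughly like this. This is a sketch under my own names (the real tasks.py code and the call into the scanner differ):

```python
import os
import requests

def save_code_from_url(url, dir_name, media_root='media/URL'):
    """Fetch whatever is present at `url` and write it into a file
    under media/URL/<dir_name>, but only if the request succeeded."""
    response = requests.get(url)
    if response.status_code != 200:
        return None
    target_dir = os.path.join(media_root, str(dir_name))
    os.makedirs(target_dir, exist_ok=True)
    file_path = os.path.join(target_dir, 'code.html')
    # response.text is unicode; encode it explicitly as UTF-8 bytes
    with open(file_path, 'wb') as f:
        f.write(response.text.encode('utf-8'))
    return file_path  # the scan function is then called on this path
```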

After that, I had to make the view. In the view, I take the URL from the user using the form and pass it to the function in tasks.py.

Now somewhere in the code you will encounter .encode('utf-8'). This is used to encode the received text in the UTF-8 format. I knew a little about these encoding formats, but this encounter helped me dig into the world of Unicode.

For a word about character sets, see the section “A word about character sets – ASCII, unicode, UTF-8” below.

A word about the requests module:

This module is used to send requests to a particular URL. Just use the following command to install the requests module.

$ pip install requests

Now to make use of the module import the module using the following command

>>> import requests

Simply use the module to send a request to the given URL using the following code.

>>> r = requests.get('https://github.com/singh1114')

The variable r will contain things like the response headers, status code, error codes and more.

>>> print(r.text)

Print the response text to get the output.

>>> r.status_code

A successful response has the status code 200. There are many other codes; please read about them in the official documentation of requests.

>>> r = requests.get('https://api.github.com', auth=('username', 'password'))

We can also pass extra information like the auth tuple above, or a dictionary via the params argument; requests will encode such a dictionary as the query parameters of the request.
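The parameter dictionary can be seen in action without touching the network, by preparing a request and looking at the URL requests builds (the search URL here is just an example of mine):

```python
import requests

# Build, but do not send, a GET request; requests encodes the
# params dictionary into the query string of the final URL.
prepared = requests.Request(
    'GET',
    'https://api.github.com/search/repositories',
    params={'q': 'scancode'},
).prepare()

print(prepared.url)  # https://api.github.com/search/repositories?q=scancode
```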


There are a few tasks in my mind that need to be handled. I will keep you updated in this category.

A word about character sets – ASCII, unicode, UTF-8

Let’s talk about everything as it happened in history. ASCII came into existence around the time UNIX was being built. Other encodings were used before that, but they are not in use now, so there is not much point in talking about them.


So the first thing that was useful and is still being used is ASCII. This format used 8 bits to store a character, so we had 256 places to store everything. The ASCII standard itself only assigned characters to the values 0-127. The remaining places were used by people however they liked, and differently in different countries. This made it difficult to send text from one country to another: the text sent by a person in one country was interpreted differently in other countries.

Say 129 is defined as some character in Greece. When text containing it is transferred to another place where some other character is defined for 129, the text will change to that character, because, as we know, text is transferred in binary format.


This led to a problem which needed to be solved quickly. At the same time, the internet started its journey and the situation was becoming difficult for everyone. This is when Unicode came into existence. A common misconception is that Unicode stores everything in 16-bit characters and that there is a limit to the number of characters that can be stored in Unicode. But that is not true. Unicode uses the concept of a code point: a magic value that is assigned to every character that exists in the world, like U+0048 for H. Here U+ stands for Unicode and the number is hexadecimal.

In memory, these code points were stored in groups of two bytes each. H is stored as

00 48

and e as

00 65, and so on.

That’s why the misconception of 2-byte storage came into existence. This looks good, but there was a problem: the number stored as 00 48 could also be stored as 48 00. This happened because early computers stored memory either high-to-low or low-to-high, and the people who implemented Unicode wanted it to be fast for both. So they had to reserve some bytes at the start to tell which byte order is being used, which was either FE FF or FF FE.

Now the people who wanted to store their text in the minimum space had a complaint about this new scheme: most of their text used letters from the English language, so it would contain a lot of zero bytes when saved in memory, which increased its total size. All these things led to the birth of UTF-8.


UTF-8 uses a simple concept: there is no fixed size, and characters are stored in a variable number of bytes. It also includes the ASCII range as-is, so there is no wastage of space, and as new characters come into existence they are assigned new code points.
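Python makes the difference between these encodings easy to see (a quick illustration, not project code):

```python
# ASCII characters keep their 1-byte size in UTF-8...
print(len('H'.encode('utf-8')))   # 1
# ...while other characters take more bytes as needed.
print(len('é'.encode('utf-8')))   # 2
print(len('€'.encode('utf-8')))   # 3
# The two-byte storage described above really does pad 'H' with a
# zero byte (this is UTF-16, big-endian, without a byte order mark).
print('H'.encode('utf-16-be'))    # b'\x00H'
```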

This is the way things work in the field of character sets.

BTW Thanks for reading.

FileField in forms to upload files to the server

Hello there,

In this post, I will be discussing the process that we used to upload files to the server and process them.

Django provides a field named FileField that allows us to add files to forms. Before doing anything, we need to add something to our settings.py file.

MEDIA_URL = '/media/'

MEDIA_ROOT = os.path.join(BASE_DIR, 'media')

These settings create a special URL for the media files. This is useful when we upload images, because this way we can attach a URL to each image. It can have similar uses in the case of files.

After this, we are going to create a form in forms.py file and attach FileField to it.

class LocalScanForm(forms.Form):

    upload_from_local = forms.FileField(label='Upload from Local')

After this, we need to create a template for this.

{% extends 'scanapp/base.html' %}
{% load staticfiles %}
{% block content %}
    <form action="" method="post">
        {% csrf_token %}
        {{ form }}
        <input type="submit" value="Submit" />
    </form>
{% endblock %}

Create views and attach the URL to that. Here is the commit: https://github.com/nexB/scancode-server/commit/a60cbbbb4d96d46df1232b96f7ff871e5f0e318d
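The file-handling part of such a view usually follows the pattern below. This is a sketch with my own names; Django hands the uploaded file to the view as request.FILES['upload_from_local']:

```python
import os

def handle_uploaded_file(uploaded_file, media_root='media'):
    """Write an uploaded file to disk under MEDIA_ROOT.

    `uploaded_file` is what Django puts in request.FILES: any object
    with a .name and a .chunks() iterator works.
    """
    os.makedirs(media_root, exist_ok=True)
    destination_path = os.path.join(media_root, uploaded_file.name)
    with open(destination_path, 'wb') as destination:
        # chunks() avoids loading a large file into memory at once
        for chunk in uploaded_file.chunks():
            destination.write(chunk)
    return destination_path
```

The returned path is what we later hand to the scanning code.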

After that, we wrote the code to handle the files. We used the subprocess module for that part. The post about it is written here.

Now we are stuck on downloading stuff from a URL. We will write some code and push updates to this blog. Till then, stay tuned.

A word about subprocess module

It has been a few days since I started working on the GSoC project, though I haven't been able to write about everything going on on that side. From now on I will probably talk about it more often. Yesterday I managed to upload the code to our local experimental server and show the results to the mentor. Thank god it didn't break (I haven't written any code to handle the errors) and everything worked as expected. The next thing is to attach celery so that the actions happen in the background. But both the celery part and the server are topics for another day. Today I will be talking about subprocess, a Python module that we are going to use for running bash commands.

What we have done till now:

We have made the models, which was a hectic task because we didn't really know what was coming out of the scans. After that, I made two main views to start with: one for scanning a URL and the other for getting files from the local machine. Today we are going to talk about using the subprocess module to run bash commands in a Python program.

According to the Python docs, subprocess can be used as a replacement for older facilities such as os.system and os.spawn*.


We used the module for our own purpose. We used Django to upload the file to the server using the basic FileField form field. Everything about that task is written in another post, which you will find under the GSoC-2017 category. After the file was uploaded, we wanted to apply some bash commands to scan the code. I was thinking of placing the scancode-toolkit code in the root directory of the project; then in the bash script we would write the following initial commands.

$ cd ../scancode-toolkit/

This assumes we start inside the scancode-server directory. It gives us the opportunity to scan the code using the following command.

$ ./scancode file_name

Then the mentor said that we can directly install scancode using the following command.

$ pip install scancode-toolkit

This changed my approach: now I was simply going to put scancode-toolkit into the requirements.txt file. After that, I was going to import the subprocess module and write a few lines of code that can scan the code.
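Those few lines can be sketched like this. The function name is my own, and the flags are the ones that appear in the scan options shown earlier; the exact invocation in the project may differ:

```python
import subprocess

def scan_file(file_path):
    """Run scancode on a file and return its exit status.

    Assumes the `scancode` command is on PATH after
    `pip install scancode-toolkit`.
    """
    return subprocess.call(
        ['scancode', '--format', 'json', '--license', '--copyright', file_path]
    )
```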

Following is the commit in which I wrote the code to do so.


In the code, I haven’t written anything to handle errors. In the next few commits, we will concentrate on doing so.

The rest of the post discusses some other features of the subprocess module. subprocess.call is the most frequently used function in the module. A simple use of subprocess is as follows:

>>> import subprocess

>>> subprocess.call(['ls', '-l'])

The result is shown as standard output on the command line. With subprocess, we also have the option to call a program by executing it the way the shell does. For that, we use the following code.

>>> subprocess.call('ls -l', shell=True)

For more information, read http://sharats.me/the-ever-useful-and-neat-subprocess-module.html

How to PIPE the result of one module to next module:


from subprocess import Popen, PIPE

grep = Popen('grep ntp'.split(), stdin=PIPE, stdout=PIPE)  # `ls | grep ntp`
ls = Popen('ls'.split(), stdout=grep.stdin)
output = grep.communicate()[0]

Or, as an alternative to communicate(), we can use:

>>> grep.wait()
>>> print(grep.stdout.read())

That’s it for today. We will come back with more exciting things tomorrow.

Using a dynamic IP updater to connect two machines on different networks

For a few days, we have been working on a problem related to reaching two machines that are connected to different networks. This post is an update to the earlier post in which we talked about the installation of TightVNC. In that post, we talked about the problem that we are facing in a lot of detail. You can read it if you want to.

Now coming back to today's topic: like an amateur search-engine user, I searched for a few random words about the problem that came to my mind. The search engine (Google) led me to this post, which explains properly how things work in networking. I would like to summarise it in a paragraph.

According to the post, a machine connected to a network can have one of two kinds of IPs: a dynamic IP or a static IP. Most home systems have a dynamic IP, which keeps changing from time to time; a static IP doesn't change. Whenever we open a website we either give the domain name of the website or the static IP of the server where the website is located. That server has a static IP which doesn't change, which is why we are able to access the website at the same IP. The DNS system keeps the list of domain names and the IPs associated with them.

Also, it is cheaper for ISPs to give a dynamic IP to a home computer than to provide a static IP (I don't know why), so they provide dynamic IPs to all the computers where their service is used. To overcome this, some good people created the concept of Dynamic DNS (Dynamic Domain Name Server). In this, a service keeps telling the changing IP to the second machine that wants to connect to the first machine. In this way, the connection can be made easily.

Now the post gives a few options for Dynamic DNS service providers, but we love open source. So again I used the search engine to find a few open source dynamic DNS service providers.

After hours of searching and reading, I am able to figure out two things,

  1. We need a client that is going to listen for the changes made to our IP and redirect requests for our hostname to the new IP.
  2. We also need software that will stay on our computer and share the changes in the IP with the client on the internet.

Redirection in the first part is handled by making use of a sub-domain. They provide you with a special domain name which can easily be reached; the IP will be taken care of by them. (The subdomain will look something like yourusername.theirdomain.com)

I have figured out the first part: we are going to use an open source service called nsupdate.info.

The rest is going to be figured out in the next post. Till then, goodbye.

Models for the scancode project app

code: https://github.com/nexB/scancode-server/commit/5b2ee1bfbff478dcd720b0797e30838b53fcddae

Hello there!

So this is my first post in which I am going to talk about Google Summer of Code 2017. Yes, it's true: I have been selected for GSoC 2017. My project is to create a RESTful API for an organisation called AboutCode. In this post and some coming posts, I will try to write down the things that I learn in the course of making the project, to retain as much as possible.

It is now the second week since things started and we are still on the models part. It may be difficult to follow for people who don't know Django, but anyone who has worked with an MVC (Model-View-Controller) framework will easily understand what I am trying to say.

In this post, I am going to share the thoughts that made me write these models in particular. I have tried to achieve as high a level of normalisation as I could.

It is bad to start writing code straight away without discussing the situation; the discussion helps to understand the situation well.

First of all, I should tell you a little about what the software does. The main software grabs the code and gives output regarding things like licences, copyrights, packages and other details of the code given as a command line argument.

Here I will share the single-file output JSON generated by the software.

{
    "scancode_notice": "Some notice and a disclaimer",
    "scancode_version": "version_of_scancode_used",
    "files_count": 1,
    "files": [
        {
            "path": "/home/ranvir/Doc...",
            "scan_errors": ["errors_if_any"],
            "packages": [{}]
        }
    ]
}
Now our work is to create a website that can take URL/local code as an input and show the result in graphical form. Pretty easy right?

So, starting from the first task, i.e. creating the models (if you are not familiar with MVC and still reading this: it is the task of creating the database in a normalised form). Before that, we did a lot of basic stuff related to creating the installation documentation, setting up virtual environments and much more. But the first big task under the hood was to create the models.

I had assigned two weeks for this task. So we are running a bit late, but I hope we will manage.

Firstly, we were making the user table the base of every model, referencing it from every other table in the database. But then our mentor said that the website is going to allow anonymous scanning. This allowed me to add the concept of guests and users; in this way we can keep everything almost similar and make the fewest changes.

The next few bulleted points will create one table each.

  • The first table checks whether it is a URL scan or the code is being scanned from the local. This will help us differentiate between the two as the two need to store different things. We need to store URL for the scan from URL and path or directory name for the local scan. This is the most basic table and it associates a scan ID to each and every scan.
  • The next table will take care of the status of the person who runs the scan. It can be either an authenticated user or a guest. I have attached the scan ID to the user type.
  • The next table will store the URL for the scan applied by passing an URL.
  • Another table will store the path from local where the code is passed to the system.
  • One more table will store more attributes of the code, like the codebase size and the number of files.
  • The result table will store the most basic results of the scan like scan errors, file name, file path and an Id for each file.

All the above-discussed tables will have the scan ID as a foreign key, except the first table, which has it as its primary key.

  • The next three tables will have the file ID as a foreign key. The tables are licences, copyrights and packages.

We will definitely need a few more models, like the user information model and some others which I am not able to think of right now.

The other models can be made as the requirements change.

Using TightVNC to attach a desktop to the server


VNC (Virtual Network Computing) is used to add a desktop to the server, allowing users to operate the server with a mouse and keyboard. This allows inexperienced users who want to interact with the server to do so without using commands.

Difference between the server and the desktop OS:

We know that we have to keep the server's processor as free as possible. We do this by installing only the most important packages needed to run the system efficiently, which is why the desktop system is usually skipped on a server. So the server can only be used by experienced users who know the commands for managing processes and adding various things to the server. For a user who does not know even these basic commands, a server without a desktop is very difficult to use.

For these users, VNC was introduced so that they can use the server easily. Now that we know why we need VNC, I will take this opportunity to explain the installation procedure for VNC on Ubuntu 16.04.

In this tutorial, we are going to discuss TightVNC, an open source implementation that is available for both UNIX-based and Windows-based systems. It is a kind of upgrade over plain VNC with fewer bugs. Here we are going to discuss the installation of TightVNC on Ubuntu 16.04.


First of all, let’s go through the procedure. If you are doing the installation on a server, we need to install a basic desktop environment called xfce. For this, we need to apply this command.

$ sudo apt-get install xfce4 xfce4-goodies tightvncserver

Now after doing this installation, we need to configure the installation. For this apply this command,

$ vncserver

After applying this command, we have to set a password that will allow users to connect to the server. There is also an option to set a view-only password: with that, a viewer cannot change things on the server but can see what's happening on it.

When VNC is first set up, it launches a default server instance on port 5901, which is known as the display port. We can open further displays by applying the following command.

$ vncserver :2

This command opens up port 5902, and so on.

Now as we want to change the way the server works, we need to stop the current instance. For that apply the following command.

$ vncserver -kill :1

Now apply the following command to edit the startup script for VNC.

$ vim ~/.vnc/xstartup

Add the following stuff to the file using usual vim commands.

xrdb $HOME/.Xresources

startxfce4 &

Run this command to give executable permissions to the file if you created it as a new file; if you only changed the existing content, there is no need to apply it.

$ sudo chmod +x ~/.vnc/xstartup

After that, you have to restart the system or restart the VNC service.

After restarting the system, start the VNC server using the following command.

$ vncserver :1

Now apply the following command to ssh to the server using the display port.

$ ssh -L 5901:localhost:5901 -N -f -l username server_ip_address

Here username is a user on the server and server_ip_address is the server's IP address. Now apply the final command to see the desktop.

$ vncviewer

It will ask for the address. Give the following address.

localhost:1

You will see the desktop on your local screen.