Create Free Jekyll Blog on GitHub using JekLog.com

I found a great tool called JekLog, the Jekyll blog creator tool, which makes it easy to create blogs using Jekyll and GitHub Pages. The creator has written an in-depth post on how to use the tool over here: How to create a free blog using GitHub Pages and Jekyll with JekLog

I have always been curious about Jekyll blogs because of their speed. They serve static content, so no database queries are involved in the backend. For this reason, it becomes pretty easy for the server to deliver the HTML content quickly.

To back up this claim, let's compare two blogs:

First of all, let us look at the performance check-ups of the two blogs:

Web performance check-up for blogspot.com
Web performance check-up for http://blog.jeklog.com

The above pictures tell the whole story. Blogs created using JekLog are considerably faster than blogs created using Blogger.

There are some other factors to consider too: the domain authority of blogspot.com is 90, while the domain authority of github.io is 94. This means that if you create a blog using JekLog, you have a better chance of being indexed higher than blogs on Blogspot.


GraphFrames

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides
high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX
and extended functionality taking advantage of Spark DataFrames. This extended functionality
includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
What are GraphFrames? GraphX is to RDDs as GraphFrames are to DataFrames.
GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between
users). If you are familiar with GraphX, then GraphFrames will be easy to learn. The key difference is that GraphFrames are based upon Spark DataFrames, rather than RDDs.
GraphFrames also provide powerful tools for running queries and standard graph algorithms.
With GraphFrames, you can easily search for patterns within graphs, find important vertices, and
more. Refer to the User Guide for a full list of queries and algorithms.
Creating a graph and running the PageRank algorithm:

from graphframes import GraphFrame

# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
g = GraphFrame(v, e)
# Query: get the in-degree of each vertex.
g.inDegrees.show()
# Query: count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()
# Run the PageRank algorithm and show the results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()

NetworkX

Figure 4.2: NetworkX logo
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
Features
• Data structures for graphs, digraphs, and multigraphs
• Many standard graph algorithms
• Network structure and analysis measures
• Generators for classic graphs, random graphs, and synthetic networks
• Nodes can be "anything" (e.g., text, images, XML records)
• Edges can hold arbitrary data (e.g., weights, time-series)
• Open source 3-clause BSD license
• Well tested with over 90% code coverage
• Additional benefits from Python include fast prototyping, ease of teaching, and multi-platform support
Installation
sudo apt-get install python-pip python-virtualenv
virtualenv venv
source venv/bin/activate
pip install networkx
Algorithm: PageRank computes a ranking of the nodes in a graph G based on the structure of the incoming links. It was originally designed as an algorithm to rank web pages.
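As a quick sketch, here is PageRank in NetworkX on a small directed graph (mirroring the edges from the GraphFrames example above):

```python
import networkx as nx

# Build a small directed graph (same edges as the GraphFrames example).
G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "b")])

# Compute PageRank; alpha is the damping factor (default 0.85).
pr = nx.pagerank(G, alpha=0.85)

# "b" receives incoming links from both "a" and "c", so it ranks highest.
print(pr)
```

The scores sum to 1.0 and can be read as the stationary probability of a random surfer landing on each node.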
Graph types
• Undirected Simple
• Directed Simple
• With Self-loops
• With Parallel edges
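These four types map directly onto NetworkX's graph classes; a minimal sketch:

```python
import networkx as nx

g = nx.Graph()          # undirected simple graph
dg = nx.DiGraph()       # directed simple graph
mg = nx.MultiGraph()    # undirected, allows parallel edges
mdg = nx.MultiDiGraph() # directed, allows parallel edges

# Self-loops are allowed on all graph types:
g.add_edge("a", "a")

# Parallel edges are only kept by the multigraph variants;
# a simple Graph would collapse these into one edge.
mg.add_edge("a", "b")
mg.add_edge("a", "b")
```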

OSMnx

Figure 4.1: OSMnx map of Manhattan
OSMnx: retrieve, construct, analyze, and visualize street networks from OpenStreetMap.
OSMnx is a Python package that lets you download spatial geometries and construct, project, visualize, and analyze street networks from OpenStreetMap's APIs. Users can download and construct walkable, drivable, or bikeable urban networks with a single line of Python code, and then easily analyze and visualize them.
Features
• Download street networks anywhere in the world with a single line of code
• Download other infrastructure network types, place polygons, or building footprints as well
• Download by city name, polygon, bounding box, or point/address + network distance
• Get drivable, walkable, bikeable, or all street networks
• Visualize the street network as a static image or leaflet web map
• Simplify and correct the network's topology to clean and consolidate intersections
• Save networks to disk as shapefiles or GraphML
• Conduct topological and spatial analyses to automatically calculate dozens of indicators
• Calculate and plot shortest-path routes as a static image or leaflet web map
• Plot figure-ground diagrams of street networks and/or building footprints
• Download node elevations and calculate edge grades
• Visualize travel distance and travel time with isoline and isochrone maps
• Calculate and visualize street bearings and orientations
Installation
sudo apt-get install python-pip python-virtualenv
virtualenv venv
source venv/bin/activate
pip install osmnx
Usage
import osmnx as ox
G = ox.graph_from_place('Punjab, India', network_type='drive')
ox.plot_graph(ox.project_graph(G))

OpenStreetMap (OSM)

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world.
The creation and growth of OSM has been motivated by restrictions on use or availability of map
information across much of the world, and the advent of inexpensive portable satellite navigation
devices.

OSM is considered a prominent example of volunteered geographic information.
Created by Steve Coast in the UK in 2004, it was inspired by the success of Wikipedia and
the predominance of proprietary map data in the UK and elsewhere. Since then, it has grown
to over 2 million registered users, who can collect data using manual survey, GPS devices, aerial
photography, and other free sources.

This crowdsourced data is then made available under the Open Database Licence. The site is supported by the OpenStreetMap Foundation, a non-profit
organisation registered in England and Wales.

Rather than the map itself, the data generated by the OpenStreetMap project is considered its
primary output. The data is then available for use in both traditional applications, like its usage
by Craigslist, OsmAnd, Geocaching, MapQuest Open, JMP statistical software, and Foursquare
to replace Google Maps, and more unusual roles like replacing the default data included with
GPS receivers. OpenStreetMap data has been favourably compared with proprietary datasources,
though data quality varies worldwide.

Map usage: The map is available on the following platforms.

• Web browser Data provided by the OpenStreetMap project can be viewed in a web browser with JavaScript support via Hypertext Transfer Protocol (HTTP) on its official website.
• OsmAnd OsmAnd is free software for Android and iOS mobile devices that can use offline vector data from OSM. It also supports layering OSM vector data with prerendered raster map tiles from OpenStreetMap and other sources.
• Maps.me Maps.me is free software for Android and iOS mobile devices that provides offline maps based on OSM data.
• GNOME Maps GNOME Maps is a graphical front-end written in JavaScript and introduced in GNOME 3.10. It provides a mechanism to find the user's location with the help of GeoClue, finds directions via GraphHopper, and can deliver a list of answers to queries.
• Marble Marble is a KDE virtual globe application which received support for OpenStreetMap.
• FoxtrotGPS FoxtrotGPS is a GTK+-based map viewer that is especially suited to touch input. It is available in the SHR and Debian repositories.
• Emerillon Another GTK+-based map viewer.
• The web site OpenStreetMap.org provides a slippy map interface based on the Leaflet
JavaScript library (and formerly built on OpenLayers), displaying map tiles rendered by
the Mapnik rendering engine, and tiles from other sources including OpenCycleMap.org.
• Custom maps can also be generated from OSM data through various software including Jawg
Maps, Mapnik, Mapbox Studio, Mapzen’s Tangrams.
• OpenStreetMap maintains lists of online and offline routing engines available, such as the
Open Source Routing Machine. OSM data is popular with routing researchers, and is also
available to open-source projects and companies to build routing applications (or for any
other purpose).

Importance of logging in Python

I am not in the mood to write much today, so I will just leave a link that I found useful and that will definitely help if we want to know more about logging.

https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/

There's only one thing that they have not explained properly, and that is the use of `__name__`. Basically, this passes the name of the current module to the logger and improves the log message. It helps us make more sense of the log messages because we can tell where an event occurred during program execution.

Serialization: A week long struggle

Hello folks,

I have been away from my blog because there was nothing really to discuss. I was constantly trying to do some stuff and constantly failing. But after a week-long struggle and some help, I was able to get past it, and I have now moved on to the next task on my list.

So as a whole, this month was well spent learning new stuff, first unit tests and then serializers. Those who have worked with Django Rest Framework will get what I am trying to say in the post.

First things first, Why do we need serializers?

To answer this question, we need to know why were the serializers created anyway.

According to some reliable sources like Wikipedia, serialization is the process of converting data into a format that can be easily stored or transmitted, and reconstructed later.

We know that our data lives in the models. We also know that the models themselves cannot be shipped easily over the network. So we use serialization, which converts the models' data, or any other data, into JSON, XML, or YAML, formats that can be easily transmitted over the network.
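The core idea, independent of Django, can be sketched with the standard library (the field names here are made up to mirror the models below):

```python
import json

# A model instance's data as a plain dict (hypothetical fields).
scan_info = {"scan_type": "URL", "is_complete": True}

# Serialize to a JSON string that can travel over the network...
payload = json.dumps(scan_info)

# ...and deserialize it back into a data structure on the other side.
restored = json.loads(payload)
```

Django REST Framework serializers do the same job, plus validation and conversion to and from full model instances.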

Easy, right?

Let’s dive in and see some code snippets.

class ScanInfo(models.Model):
    def __str__(self):
        return self.scan_type

    scan_types = (
        ('URL', 'URL'),
        ('Local Scan', 'localscan'),
    )

    scan_type = models.CharField(max_length=20, choices=scan_types, default='URL')
    is_complete = models.BooleanField()

class UserInfo(models.Model):
    def __str__(self):
        return self.user.username

    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    scan_info = models.ForeignKey(ScanInfo, on_delete=models.CASCADE)

class URLScanInfo(models.Model):
    def __str__(self):
        return self.URL

    scan_info = models.ForeignKey(ScanInfo, on_delete=models.CASCADE)
    URL = models.URLField(max_length=2000)

class LocalScanInfo(models.Model):
    def __str__(self):
        return self.folder_name

    scan_info = models.ForeignKey(ScanInfo, on_delete=models.CASCADE)
    folder_name = models.CharField(max_length=200)

class CodeInfo(models.Model):
    def __str__(self):
        return str(self.total_code_files)

    scan_info = models.ForeignKey(ScanInfo, on_delete=models.CASCADE)
    total_code_files = models.IntegerField(null=True, blank=True)
    code_size = models.IntegerField(null=True, blank=True, default=0)
Well, those are not all of the models, but you get the idea, right? We have multiple levels of relationships between the models (not really inheritance, but chains of foreign keys, to put it simply). Now the real test is to write the serializers for them.

I decided to use the simple ModelSerializers.

class ScanInfoSerializer(serializers.ModelSerializer):
    class Meta:
        model = ScanInfo
        fields = '__all__'
class UserInfoSerializer(serializers.ModelSerializer):
    class Meta:
        model = UserInfo
        fields = '__all__'
class URLScanInfoSerializer(serializers.ModelSerializer):
    class Meta:
        model = URLScanInfo
        fields = '__all__'
class LocalScanInfoSerializer(serializers.ModelSerializer):
    class Meta:
        model = LocalScanInfo
        fields = '__all__'
class CodeInfoSerializer(serializers.ModelSerializer):
    class Meta:
        model = CodeInfo
        fields = '__all__'

Then I checked the sample outputs of these serializers, and to my surprise I did not get the desired result. The JSON output they produced was quite different from what we were expecting.

So, I experimented with creating a GodSerializer (that was the literal name of the serializer) along with a helper for it. The helper gathers the data and tells the serializer how it is going to work.

class GodSerializer(serializers.Serializer):
    """
    Another good serializer to handle all the serialization activities
    """
    code_info = CodeInfoSerializer()
    url_scan = URLScanInfoSerializer()
    local_scan = LocalScanInfoSerializer()
    scan_result = ScanResultSerializer()
    scan_file_info = ScanFileInfoSerializer(many=True)
    license = LicenseSerializer(many=True)
    matched_rule = MatchedRuleSerializer(many=True)
    matched_rule_license = MatchedRuleLicenseSerializer(many=True)
    copyright = CopyrightSerializer(many=True)
    copyright_holder = CopyrightHolderSerializer(many=True)
    copyright_statement = CopyrightStatementSerializer(many=True)
    copyright_author = CopyrightAuthorSerializer(many=True)
    package = PackageSerializer(many=True)
    scan_error = ScanErrorSerializer(many=True)

After this, I created the GodSerializerHelper, which fetches the related objects that the serializer needs. Here is the code for the helper.

class GodSerializerHelper(object):
    def __init__(self, scan_info):
        self.scan_info = scan_info
        self.code_info = CodeInfo.objects.get(scan_info=scan_info)
        self.url_scan = URLScanInfo.objects.get(scan_info=scan_info)
        self.local_scan = None
        self.scan_result = ScanResult.objects.get(code_info=self.code_info)
        self.scan_file_info = ScanFileInfo.objects.filter(scan_result=self.scan_result)
        self.license = License.objects.filter(scan_file_info__in=(self.scan_file_info))
        self.matched_rule = MatchedRule.objects.filter(license__in=(self.license))
        self.matched_rule_license = MatchedRuleLicenses.objects.filter(matched_rule__in=(self.matched_rule))
        self.copyright = Copyright.objects.filter(scan_file_info__in=(self.scan_file_info))
        self.copyright_holder = CopyrightHolders.objects.filter(copyright__in=(self.copyright))
        self.copyright_statement = CopyrightStatements.objects.filter(copyright__in=(self.copyright))
        self.copyright_author = CopyrightAuthor.objects.filter(copyright__in=(self.copyright))
        self.package = Package.objects.filter(scan_file_info__in=(self.scan_file_info))
        self.scan_error = ScanError.objects.filter(scan_file_info__in=(self.scan_file_info))

Note the use of the __in lookup. It solves the problem of filtering against the multiple rows behind a ForeignKey relationship. That explanation might seem weird, so let me try once more: we know that objects.filter returns a QuerySet, which can contain more than one row. A variable holding more than one row cannot be passed to a plain equality filter, but with the __in lookup we can filter against every row it contains.

After this, for testing, I used the following code to see if things were working.

s = GodSerializerHelper(ScanInfo.objects.get(pk=51))
s = GodSerializer(s)
s.data

Hope this post helps someone in the future. Still in some dilemma? Join the conversation in the comments.

Have a good day.

Writing unit tests for the models

You must have heard the term test-driven development if you are into development work. It is a style of development in which you write tests before writing the logic: first you write the tests that can break the code, and then you write the real code so that it doesn't break, making it unbreakable from that point of view.

I hope this makes sense. If not, keep on reading for some time and you will come to know more about the stuff.

Why do we need tests?

At the intermediate level of development, where I am right now, we rarely write tests for our code. But it is often said that

Untested code is broken code.

That being said, I found a great presentation that strengthens my argument for writing automated tests.

https://www.slideshare.net/wooda/philipp-von-weitershausen-untested-code-is-broken-code

No need to go beyond the first 7-8 slides.

So the basic idea of automated testing is to stop someone from breaking our code in the future. It also helps us find issues in the code that were not visible when we wrote it. Logical errors in a repository with thousands of lines of code are hard to detect. That is why the good people introduced testing for developers.

Similarly, if someone codes something for us in the future and we add it to our main repository without testing, it can break everything. Automated testing is there to save us: run the tests before adding new stuff to the main repository, and move forward without worrying about your code.

What happens during testing?

During testing, certain cases are applied to the code. The output of the code is computed and compared with the expected output that the developer wants. If both outputs are the same, the test passes; otherwise it fails. As simple as that.

In the GSoC project, we wrote tests using the unittest module of Python. Unit testing is a software engineering term that means testing each module separately. unittest is the default testing framework used by Django.
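The idea can be sketched with a plain unittest case outside Django (the add function is just a stand-in):

```python
import unittest

# A stand-in function to test.
def add(a, b):
    return a + b

class AddTestCase(unittest.TestCase):
    def test_add(self):
        # The computed output is compared with the expected output.
        self.assertEqual(add(2, 3), 5)

    def test_add_negative(self):
        self.assertEqual(add(-1, 1), 0)

# Run the case programmatically; `python -m unittest` does the same.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Django's TestCase builds on exactly this machinery, adding database setup and teardown around each test.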

To begin with, we used the module to write tests for the models in the code. Here is the commit for the code.

https://github.com/singh1114/scancode-server/commit/99a36d8fe0c9289a5fac608f02cbf34171abdf28

After applying the tests, I found a few errors in the code, which I fixed in the same commit.

What should we test in models?

While testing models, we should test all the custom methods in the models. We should also test that we cannot add records when a required field is not given. Django takes care of most of the rest.

Since Django provides so much by default, there is very little need to test most things in recent versions. But you should test __str__ and the plural name of the models visible in the admin panel.

As your tests start taking shape you will feel more confident about your code.

Let’s have a coding sample:

from django.test import TestCase

from .models import ScanInfo

class ScanInfoTestCase(TestCase):
    def test_scan_info_added(self):
        scan_info = ScanInfo.objects.create(scan_type='URL', is_complete=True)
        self.assertTrue(scan_info.is_complete)
        self.assertEqual(scan_info.scan_type, str(scan_info))
        self.assertEqual('Scan Info', scan_info._meta.verbose_name_plural)

In the first line, we import TestCase from django.test. After that, in ScanInfoTestCase we inherit from TestCase and use the assertTrue and assertEqual methods to check whether the tests pass. assertTrue checks whether a value is True; similarly, assertEqual checks whether two values are equal.

For running tests in Django, apply:

$ python manage.py test

Conclusion

It is not difficult to write automatic tests for your code. You just have to be patient with tests. It takes some time to write tests and many times it feels useless. But in longer runs, it will benefit you and save a lot of your time.

Updating the models

Today I was working on another branch, writing code that can fill the database. I scanned some code and copied the results so that I could open them in clean JSON format, because I wanted to iterate over them. I used the following online JSON parser.

http://json.parser.online.fr/

It is a website that shows clean results, so you can clearly see what's going on. After checking the results, I found some discrepancies with the database. So I set aside the work of filling the database and started working on updating it.

{"files": [{"licenses": [{"category": "Permissive", "start_line": 2358, "short_name": "MIT License", "spdx_url": "https://spdx.org/licenses/MIT", "text_url": "http://opensource.org/licenses/mit-license.php", "spdx_license_key": "MIT", "homepage_url": "http://opensource.org/licenses/mit-license.php", "score": 10.0, "end_line": 2358, "key": "mit", "owner": "MIT", "dejacode_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit", "matched_rule": {"licenses": ["mit"], "license_choice": false, "identifier": "mit_14.RULE"}}, {"category": "Permissive", "start_line": 2532, "short_name": "MIT License", "spdx_url": "https://spdx.org/licenses/MIT", "text_url": "http://opensource.org/licenses/mit-license.php", "spdx_license_key": "MIT", "homepage_url": "http://opensource.org/licenses/mit-license.php", "score": 97.6, "end_line": 2547, "key": "mit", "owner": "MIT", "dejacode_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit", "matched_rule": {"licenses": ["mit"], "license_choice": false, "identifier": "mit.LICENSE"}}], "path": "URL/5", "packages": [], "scan_errors": [], "copyrights": [{"end_line": 2542, "holders": ["GitHub Inc."], "start_line": 2530, "statements": ["Copyright (c) 2011-2017 GitHub Inc."], "authors": []}, {"end_line": 2588, "holders": [], "start_line": 2582, "statements": ["(c) 2017 span title"], "authors": []}]}], "scancode_notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.", "scancode_version": "2.0.0rc3", "files_count": 1, "scancode_options": {"--format": "json", "--ignore": [], "--license": true, "--package": true, "--license-score": 0, "--copyright": true}}

This is the JSON result that helped me to create the changes in the database.
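Iterating over a result like this is straightforward with the standard library. A sketch using a trimmed-down version of the output above:

```python
import json

# A trimmed-down version of the scan output above.
raw = """{"files": [{"path": "URL/5",
                     "licenses": [{"key": "mit", "short_name": "MIT License", "score": 97.6}],
                     "copyrights": [{"holders": ["GitHub Inc."]}]}],
          "files_count": 1}"""

data = json.loads(raw)
# Walk files, then the licenses detected in each file.
for f in data["files"]:
    for lic in f["licenses"]:
        print(f["path"], lic["key"], lic["score"])
```

Mapping each of these nested objects onto a model row is exactly the database-filling work described in this post.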

Following is the commit related to the change in the database.

https://github.com/singh1114/scancode-server/commit/de5a93a08741f1150ab3fbaa7b57b82f2bc5054a

After this, I went back to working on filling the database. I will be able to push something tomorrow. It is a tough task to keep everything organized: if I write some bad code, it is going to be very hard to debug.

Using Celery to run long-running tasks asynchronously

The heading itself is a bit confusing. So, let’s try to break it down by taking the example of what we wanted to achieve with this.

Our job was to run code scans when the user gives us a URL in the URL field. If you have been reading the earlier posts, we had already achieved something similar. But the problem is that scanning is a long and tedious task, and while the server is generating the output, the user has to wait on the same page for a long time.

So what we wanted was to run these tasks on the server as separate operations, completely apart from the simple tasks that happen on a regular basis. Only the tasks that are very simple stay in the main request cycle.

This can be compared to threads. The main thread runs all the operations, but on some occasions we spawn child threads that run concurrently with the main thread. While threads are used to reduce execution time, we are using a similar idea, Celery, to provide a good user experience.

A form was created to take the URL from the user. Once the user submits the URL, we handle the work in the tasks.py file, where all the tasks are defined.

We used a library called Celery to run these tasks in the background. What we did was as follows:

We sent the users to a waiting URL. The same waiting URL will be used to show the results when the tasks are done and results are generated. The waiting URL will look somewhat like this:

We can also integrate the email system that will ask the user to give the email and the system will notify them when the tasks are done.

After the tasks complete, when the user reloads the page, he will get the results. We are not looking to get proper formatting of the results for now; that is one of the upcoming tasks.

If you compare the URL of the page asking the user to wait with the URL that shows the results, you will find that both of them are the same.

So what’s happening in the backend? 

Basically, I used a small trick up here. I coded the following model.

class CeleryScan(models.Model):
    scan_id = models.AutoField(primary_key=True)
    scan_results = models.CharField(max_length=20000, null=True, blank=True)
    is_complete = models.BooleanField(default=False)

    def __str__(self):
        return str(self.scan_id)

The hack is the is_complete attribute in the model. Whenever a user gives a URL to scan, we create an instance of the CeleryScan class and send it to the Celery task manager. What we have to remember here is the scan_id. scan_results is initialized to null and is_complete is set to False.

Finally, when the tasks are done, the model is updated using the same scan_id: we store the string output in scan_results and set is_complete to True.

In views.py, I wrote the code that takes the scan_id from the URL. If is_complete for that scan_id is True, the results are shown; otherwise the same 'Please wait' message is shown.
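Stripped of the Django plumbing, the view's decision logic boils down to a sketch like this (CeleryScan stands in as a plain class here, not the actual model):

```python
# Hypothetical stand-in for the CeleryScan model, outside Django.
class CeleryScan:
    def __init__(self, scan_id):
        self.scan_id = scan_id
        self.scan_results = None
        self.is_complete = False

def scan_view(scan):
    # Show results only once the background task has finished.
    if scan.is_complete:
        return scan.scan_results
    return "Please wait"

scan = CeleryScan(scan_id=51)
print(scan_view(scan))   # the task has not finished yet: "Please wait"

# Later, the Celery task updates the same record...
scan.scan_results = '{"files": []}'
scan.is_complete = True
print(scan_view(scan))   # ...and the reload shows the results
```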

How does Celery work?

Celery maintains a queue of incoming tasks. First, we register the tasks that Celery is going to execute. Whenever Django encounters such a task, it passes it on to Celery and does not wait for the result. Each task reaching Celery is given a task_id, and we can check various things about the task using this task_id.

How to use Celery: 

I used the following sources to include celery in the project.

https://code.tutsplus.com/tutorials/using-celery-with-django-for-background-task-processing–cms-28732

http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html

It’s fairly simple. Please refer to the following commit for the code related to this post.

https://github.com/nexB/scancode-server/commit/4986bec7edad61f4f182b54e594ee3f5efe3c2e8