Sunday, September 25, 2016

HashSet Internals

Experimenting how to use Objects as elements in HashSet.
The code below explores how hashCode() and equals() functions are used in HashSet and how to override them so that you can tweak it for your own needs.

Comments contain the explanation and check the output to see when hashCode() and equals() functions are called.



The image below shows the map variable expanded.
Explanation why values have the same Object id : stackoverflow

Thursday, September 24, 2015

Unit-Testing+Mocking Nginx Module

The following are the frameworks that I looked into to finalize the one which is the most suitable for unit testing+mocking module in nginx.

  • unity + cmock
    • unit testing was almost okay (have to use ruby to generate test runner)
    • for mocking you need to extract all the functions that you need to mock into a new header file. The framework would then generate a mocked version to header files that can be used. The problem here is the extraction of functions to be mocked requires lots of time and patience.
  • gTest  +  gMock
    • unit testing was perfect, uses macro to register test ;)
    • the problem with mocking free functions is that you need to rewrite your code to use interface.
  • cmocka
    • unit testing was okay (ease of use : between google's and unity's)
    • for mocking all u need to do write the mocked-version of the function that you need to mock. Framework utilizes that wrap feature of linker.


The following is the way how I used cmocka to unit-test+mock the nginx module : 

Folder structure
--src
|-> nginx-1.8.0 [nginx source]
|
|-> nginx_module
|  |-> ngx_redis_module.c
|
|-> test
 |-> test_ngx_redis_module.c
 |-> test_ngx_redis_module.h
 |-> makefile

The function that I need to unit test is the ngx_redis_event_handler() in the ngx_redis_module.c. This function calls two other functions redisAsyncHandleWrite() and redisAsyncHandleRead(), since these functions are not under test it has to be mocked. As the framework uses wrap feature of linker for mocking, you need to define the mocked definition of these two functions with __wrap_ as the prefix for the mocked-function's name (see : test_ngx_redis_module.h ). Now declare the mocked-functions in the test file (see : test_ngx_redis_module.c) and write test for the function to be tested (refer the cmocka on how to write tests). 

In the example I have written three tests for unit-testing ngx_redis_event_handler().
  • test_ngx_redis_event_handler_write() : since line#22 and #23 of the function are executed you need to create the objects necessary to run them without exception. This test sets the ev->write = 1, so it flows through if() in line#25 of the function being tested, where the mocked version of redisAsyncHandleWrite() i.e __wrap_redisAsyncHandleWrite() is called at line#27 
  • test_ngx_redis_event_handler_read() : since line#22 and #23 of the function are executed you need to create the objects necessary to run them without exception. This test sets the ev->write = 0, so it flows through else in line#28 of the function being tested, where the mocked version of redisAsyncHandleRead() i.e __wrap_redisAsyncHandleRead() is called at line#30
  • test_ngx_redis_event_handler_ready : since this test enters the if() at line#18 and returns, line#22 to #33 won't be executed hence objects required may not be created.
Use expect_* and check_* to verify that the mocked functions are called.

Now to compile the test cases you need to include all the objects files that the test and module refers to. This can be found from the makefile that nginx uses to compile the nginx.o, located at 'nginx-1.8.0/objs/Makefile' under the section 'objs/nginx:' (everything between  '$(LINK) -o' and 'objs/ngx_modules.o'(included)). Make sure you remove the objects file of the nginx module (ngx_redis_module.o) from the above list of object files, it is included in src/test/makefile. 

The second thing you need to do is to remove 'main()' from the 'nginx.o', else ld will state that you have two mains. Use the 'strip' command to remove main from nginx.o (located at nginx-1.8.0/objs/src/core)

$> strip -N main -o nginx_without_main.o nginx.o

replace the nginx.o with nginx_without_main.o in your makefile that you use to compile the test.

Now, 'make run' to run the tests.

[==========] Running 3 test(s).
[ RUN           ] test_ngx_redis_event_handler_write
[             OK  ] test_ngx_redis_event_handler_write
[ RUN           ] test_ngx_redis_event_handler_read
[             OK  ] test_ngx_redis_event_handler_read
[ RUN           ] test_ngx_redis_event_handler_ready
[             OK  ] test_ngx_redis_event_handler_ready
[==========] 3 test(s) run.
[  PASSED     ] 3 test(s).






Thursday, August 27, 2015

CHasho : Hashing In C

Had a requirement to write a hashing library in C with 5 character keys and string as value.

Here is the git repo : https://github.com/melwinjose1991/CHasho
The hashing function is based on Java's HashMap.
 Info : http://melwin-jose.blogspot.in/2014/11/internal-working-of-hashmap.html

Code reference : https://gist.github.com/tonious/1377667

You are free to use/modify this as per your requirement.

-Chasho
     |--> src
     |      |-> chasho.h
     |      |-> chasho.c
     |      |-> com/melwin/chasho
     |
     |-> example
     |        |-> example.c
     |        |-> makefile


chasho.h and chasho.c : the header file and the source file.
example : contains a sample code on how to use CHasho
com/melwin/chasho : contains a Java implemtation of the hashing function used in CHasho
                    It shows how the 5Character keys are distributed into hashTable with 64 buckets.

Result when all possible 5-character keys were hashed into table with 16 buckets
Bucket[0] 3884771
Bucket[1] 3888063
Bucket[2] 3891167
Bucket[3] 3893639
Bucket[4] 3895132
Bucket[5] 3895427
Bucket[6] 3894480
Bucket[7] 3892435
Bucket[8] 3889583
Bucket[9] 3886323
Bucket[10] 3883140
Bucket[11] 3880547
Bucket[12] 3878969
Bucket[13] 3878653
Bucket[14] 3879648
Bucket[15] 3881803

Bucketized Tokens : 62193780
Generated Tokens  : 62193780

Hope you find it useful.

Wednesday, February 11, 2015

libuv intro : UDP Server based on libuv

This tutorial is about how to write a UDP Server in C that is based on asynchronous I/O. I would be introducing libuv briefly, explaining those concepts that are mainly required to write the UDP server. The detailed introduction and API documentation is available on the internet.

Event Driven Programming: In a event driven programming, a user register a set of events(in which he is interested in ) and callbacks to those events. In our case, libuv is responsible for gathering events from the operating system and monitoring them. Some of the examples of event loop are file is ready for writing, timer timed out, socket has data ready to be read, etc. When the registered events occur the callbacks invoked by the user are invoked.
Libuv uses asynchronous, non-blocking style to deal with events and callback. Not getting into the details or advantages of it, you can find plenty of articles and blog on it.

Networking: 

Hope you find this useful.

Thursday, January 15, 2015

Multivariate Linear Regression

Multivariate Linear Regression

Uni-variate Linear Regression : I will try to explain linear regression using gradient descent with problem of house price prediction which contains only a single feature.

The graph shows the relationship from the between the no of rooms in a house and its price. As you can see the house price is linearly dependent on only one feature, say the number of bedroom. That is, the house price increases with the increase in the no of bedroom.

Given a training set (sample data for which no of rooms and the corresponding house price are known), our goal is to develop a hypothesis(a linear equation, as it is visible from the graph) which trains on the training set and can be used to predict house prices of data outside the training set. There exists a linear relationship between no of rooms and price and the equation of LINE (of the form y=ax+b) could be simple hypothesis.
We can have an simple hypothesis of the form
h(x) = theta_0 + (theta_1 * x)
where x is the number of rooms in the house

Our job is to find the appropriate values of theta_0 & theta_1 so that our hypothesis gives a prediction with less error. In machine learning it is very difficult to arrive on a perfect hypothesis; the goal of ML is make "perfect guesses". In other words, its goal is to reduce the error in prediction. With different values of theta_0 and theta_1 we get different lines (hypothesis).

Line 1 : Here theta_0=0.25 and theta_1=0.25. This hypothesis doesn't fit the data given in the sample set. For example when no of bedroom is 1, the sample data says that the price of the house is $2K, but according to Line 1 its $ 0.5K. Here there is a large difference between outputs predicted by the hypothesis for the training set, hence this can't be hypothesis that fits it.

Line 2 : Here theta_0=0.5 and theta_1=0.5. As you can see, Line 2 predicts that the price of house for a house with 1 bedroom is $ 1K, which is better than the value predicted by Line 1. But still not the best, theta can be adjusted further to arrive at an even better hypothesis, that has fewer prediction error.

Line 3: Here theta_0=1 and theta_1=1. For x=1 it gives a good prediction of $2K, with zero error. Note, even though Line3 has made a good predicted when x=1, for other values of x it gives a prediction which is not 100% correct, but much better than other hypothesis.

As we can see Line 3 is the one that gave prediction that is close to a the original value. There won't a considerable amount of change in the prediction if we adjust our theta further ie our system has converged to a particular value of theta.

Correctness of the hypothesis is measured using 'cost function' J(theta).
here
x_i is the value the feature(ie no of rooms) of the ith entry in the sample data,
h(x_i) is the predicted output value(ie house price) using the hypothesis with theta_0 and theta_1
y_i is actual ouput price
m is the no of training data

Our aim is find theta_0 and theta_1 which has the least error i.e the value of theta for which cost function is least.The method that we are going to use to tune theta is called gradient descent, it can explained with following steps:

1> Assign some random initial values to theta_0 and theta_1
2> Repeat it 'I' number of times (I=no of iterations) or till theta converges
For each theta_i in theta (ie theta_0 and theta_1)
For each entry in the training sample

      1. let x be the no of bedroom and y be the corresponding output(house price which is known) of current training sample
      2. calculate the output h(x) using 'OLD' values of theta and x
      3. calculate the error, h(x)-y, for current training sample
      4. calculate (error * value of variable x whose coefficient is theta_i in the current sample)

sum all the products obtained in the above step
use this calculated-sum to adjust theta_i so as to make the hypothesis "less wrong"


where alpha is the learning rate; it determines how fast the gradient descent works
3> Use this tuned/trained hypothesis to predict values

In short, gradient descent changes the values of theta so that it finally arrives (converges)at a value of theta for which the hypothesis has least error (ie minimized cost function). The above steps can be represent in the following short equation
---------
---------

Note

  • Theta should be updated simultaneously as shown below
  • x(i) value for theta_0 is 1, h(x) = ( theta_0 * 1 ) + ( theta_1 * x )
You get the following 3D graph if you to plot the values of theta_0 and theta_1 against J(theta)
 
What the gradient descent is trying to do is find the global minimum value of J(theta) from the above graph. At the global minimum the value of J(theta) is the least i.e the prediction error is the least.

Multivariate Linear Regression
In multivariate problems the number of features on which the output depends will be greater than or equal to 2. Our goal here remains the same : Given a sample training data, derive a hypothesis which trains on it and can be used to predict output for inputs outside the training set. Since there are more than 1 feature here, the hypothesis is going to be complex. 
h(x1, x2....) = theta_0 + ( theta_1 * x1 ) + ( theta_2 * x2 ) + ...
As in uni-variate Linear regression, gradient descent is used here to find appropriate values of theta_0, theta_1 ,.. so that the hypothesis gives very less prediction error.

A problem on multivariate linear regression from hackerrank
And my solution to it using gradient descent




Sunday, November 16, 2014

Internal Working of HashMap<String, Integer> in Java

Internal Working of HashMap in Java

Was reading an blog on how hashMaps work in Java and I was actually surprised to know that it uses the concepts like bucket, chaining and collision in them. I had learned about these concepts in college but thought those books were outdated and now they use some complex techniques for Hashes in Java. So I started a investigation to verify if they were actually using these techniques which i learned in college. Read some articles in wiki, other blogs and went through some youtube videos to actually understand how a HashMap (with key as string) works. Wrote the following code, and debugged it line by line to understand the internals of HashMap.

I will try to explain how a hashMap(key-String) works with debugging the put() and get().
By default the HashMap in Java, is an array(size:16(buckets) and indexed:0,1..15) of Entry<K,V> (K:Key, V:Value).This array is named as table.
Entry is class with key(of type K), value (of type V), hash (int) and 'next' pointing to next Entry in the list. table[i] contains the first Entry of the linkedList for bucket with index i. If the linkedList contains more entries, those can be traced by looping through the `next` of the Entry.

Adding Entry into HashMap : put()
Lets take the example of adding (Key="ef",Value=1) to the HashMap
Steps:
1) The HashCode of the new key("ef") is constructed
  To calcluate the hash for a string java uses the following formula
h(s)=\sum_{i=0}^{n-1}s[i] \cdot 31^{n-1-i}
   For our example the hash for "ef" is 3233
  This hash is modified by the HashMap.class to ensure that hashCodes that differ only by constant      multiples at each bit position have a boundednumber of collisions (approximately 8 at default load      factor).
h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4);

2) The bucket index for the hash is calculated, i = h & (length-1);
3) The bucket[i](array of Entry) is checked to see if there exists an entry with the same hash and same key.
If such an entry exists, then value of the found entry is returned
else the new entry is inserted as the head of the linkedList for bucket[i]
( Note : before inserting the new entry, resizing of map is done if number of keys in this map reaches its threshold )

Retrieving a key's value: get()
Steps
1)  Calculate Hash for String/Key
2)  Find Bucket index,i
3) Loop through bucket[i] to find a Entry with hash = Key.hash and same key
if found: return Entry
else: return null

Diagram shows the structure of HashMap
Note : two entries are mapped into table[2], "ef" and "fG" because they have the same hash. The newly added entry becomes the head of the linkedList

Wednesday, November 12, 2014

Executing native shell commands from Java Program - Java.lang.Runtime.exec() Method

Executing native shell commands from Java Program - Java.lang.Runtime.exec() Method

Java.lang.Runtime.exec can be used to execute native shell commands from Java Program. Running native shell commands from Java is not recommended because by doing so you will lose platform independence which is one reason why we use JAVA.

Suppose you want to find files/directories in "C:\Users\josemel\", this can be done from terminal in two ways
1. Navigate to required directory and run "dir" OR
2. Run "dir " from anywhere

The above can be done from Java using exec() function as shown below.


JAVA DOC

Java.lang.Runtime.exec(String[] cmdarray, String[] envp) Method - Executes the specified command and arguments in a separate process with the specified environment.

Parameters:

  • cmdarray - array containing the command to call and its arguments.
  • envp - array of strings, each element of which has environment variable settings in the format name=value, or null if the subprocess should inherit the environment of the current process.

Returns:
A new Process object for managing the subprocess

Throws:

  • SecurityException - If a security manager exists and its checkExec method doesn't allow creation of the subprocess
  • IOException - If an I/O error occurs
  • NullPointerException - If cmdarray is null, or one of the elements of cmdarray is null, or one of the elements of envp is null
  • IndexOutOfBoundsException - If cmdarray is an empty array (has length 0)