Clustering Coefficient in Graph Theory

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971; Watts and Strogatz, 1998).

Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local gives an indication of the embeddedness of single nodes.

Global clustering coefficient

The global clustering coefficient is based on triplets of nodes. A triplet consists of three connected nodes. A triangle therefore includes three closed triplets, one centered on each of the nodes (n.b. this means the three triplets in a triangle come from overlapping selections of nodes). The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed). The first attempt to measure it was made by Luce and Perry (1949). This measure gives an indication of the clustering in the whole network (global), and can be applied to both undirected and directed networks.

Local clustering coefficient

A graph G=(V,E) formally consists of a set of vertices V and a set of edges E between them. An edge e_{ij} connects vertex v_{i} with vertex v_{j}.

The neighborhood N_{i} for a vertex v_{i} is defined as its immediately connected neighbors as follows:

N_i = \{v_j : e_{ij} \in E \or e_{ji} \in E\}.

We define k_{i} as the number of vertices, |N_{i}|, in the neighbourhood, N_{i}, of a vertex.

The local clustering coefficient C_{i} for a vertexv_{i} is then given by the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For a directed graph, e_{ij} is distinct from e_{{ji}} , and therefore for each neighborhoodN_{i} there are k_{i}(k_{i}-1)links that could exist among the vertices within the neighborhood ( k_{i} is the number of neighbors of a vertex). Thus, the local clustering coefficient for directed graphs is given as [2]

C_{i}={\frac  {|\{e_{{jk}}:v_{j},v_{k}\in N_{i},e_{{jk}}\in E\}|}{k_{i}(k_{i}-1)}}.
An undirected graph has the property that e_{ij} and e_{{ji}} are considered identical. Therefore, if a vertex  v_{i} has  k_{i} neighbors, {\frac  {k_{i}(k_{i}-1)}{2}} edges could exist among the vertices within the neighborhood. Thus, the local clustering coefficient for undirected graphs can be defined as

C_{i}={\frac  {2|\{e_{{jk}}:v_{j},v_{k}\in N_{i},e_{{jk}}\in E\}|}{k_{i}(k_{i}-1)}}.
Let \lambda _{G}(v) be the number of triangles on v\in V(G) for undirected graph G. That is,  \lambda _{G}(v) is the number of sub-graphs of G with 3 edges and 3 vertices, one of which is v. Let \tau _{G}(v) be the number of triples on v\in G . That is, \tau _{G}(v) is the number of sub-graphs (not necessarily induced) with 2 edges and 3 vertices, one of which is v and such that v is incident to both edges. Then we can also define the clustering coefficient as

C_{i}={\frac  {\lambda _{G}(v)}{\tau _{G}(v)}} .
It is simple to show that the two preceding definitions are the same, since

\tau _{G}(v)=C({k_{i}},2)={\frac  {1}{2}}k_{i}(k_{i}-1) .
These measures are 1 if every neighbor connected to v_{i} is also connected to every other vertex within the neighborhood, and 0 if no vertex that is connected to v_{i} connects to any other vertex that is connected to v_{i}.

cc

Example local clustering coefficient on an undirected graph. The local clustering coefficient of the blue node is computed as the proportion of connections among its neighbours.

Here is the code to implement the above clustering coefficient in a graph. It is a part of the networkx library and can be directly accessed using it.

def average_clustering(G, trials=1000):
    """Estimates the average clustering coefficient of G.
    The local clustering of each node in `G` is the
    fraction of triangles that actually exist over
    all possible triangles in its neighborhood.
    The average clustering coefficient of a graph
    `G` is the mean of local clusterings.
    This function finds an approximate average
    clustering coefficient for G by repeating `n`
    times (defined in `trials`) the following
    experiment: choose a node at random, choose
    two of its neighbors at random, and check if
    they are connected. The approximate coefficient
    is the fraction of triangles found over the
    number of trials [1]_.
    Parameters
    ----------
    G : NetworkX graph
    trials : integer
        Number of trials to perform (default 1000).
    Returns
    -------
    c : float
        Approximated average clustering coefficient.
   
    """
    n = len(G)
    triangles = 0
    nodes = G.nodes()
    for i in [int(random.random() * n) for i in range(trials)]:
        nbrs = list(G[nodes[i]])
        if len(nbrs) < 2:
            continue
        u, v = random.sample(nbrs, 2)
        if u in G[v]:
            triangles += 1
    return triangles / float(trials)

Note: The above code is valid for undirected networks and not for the directed networks.
The code below has been run on IDLE(Python IDE of windows). You would need to download the networkx library before you run this code. The part inside the curly braces represent the output. It is almost similar as Ipython(for Ububtu users).

 

>>> importnetworkx as nx
>>> G=nx.erdos_renyi_graph(10,0.4)
>>> cc=nx.average_clustering(G)
>>> cc
#Output of Global CC
0.08333333333333333
>>> c=nx.clustering(G)
>>> c
# Output of local CC
{0: 0.0, 1: 0.3333333333333333, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0,
 7: 0.3333333333333333, 8: 0.0, 9: 0.16666666666666666}

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Minimum Deletions

Given a string of S as input. Your task is to write a program to remove or delete minimum number of characters from the string so that the resultant string is palindrome.

Note: The order of characters in the string should be maintained.

Input:
First line of input contains a single integer T which denotes the number of test cases. Then T test cases follows. First line of each test case contains a string S.

Output:
For each test case, print the deletions required to make the string palindrome.

Constraints:
1<=T<=100
1<=length(S)<=10000

Example:
Input:
2
aebcbda
geeksforgeeks
Output:
2
8

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Change the string

Given a string S, the task is to change the string according to the condition; If the first letter in a string is capital letter then change the full string to capital letters, else change the full string to small letters.

Input:
The first line of input contains an integer T denoting the number of test cases. Then T test cases follow. Each test case contains a string S.

Output:
For each test case, print the changed string in a new line.

Constraints:
1<=T<=200
1<=|string length|<=104

Example:
Input:
2
geEkS
FoR
Output:
geeks
FOR

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Ordered arrangement

What is the minimum number of students required in a discrete mathematics class to be sure that at least six will receive the same grade, if there are five possible grades, A, B, C, D, and F?Let S = {1, 2, 3}. The ordered arrangement 3, 1, 2 is a permutation of S. The ordered arrangement 3, 2 is a 2-permutation of S.

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Difference between sums of odd and even digits

Given a long integer, we need to find if the difference between sum of odd digits and sum of even digits is 0 or not. The indexes start from zero (0 index is for leftmost digit).

Input:
First line consists of T test cases. Only line of every test case consists of a number N.

Output:
Print “Yes” if difference is 0 else print “No”.

Constraints:
1<=T<=200
1<=N<=10^18

Example:
Input:

2
1212112
123
Output:
Yes
No

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Amazon Interview Experience | Set 392 (For SDE 2)

Telephonic Round:
Q1: Question was a little tricky one but finally it was converted to Next greater Element
Q2: Given a Singly Linked list, Update the second half of the list such that n-th element becomes sum(1st + nth) element, (n-1)st element becomes sum(2nd + n-1st) element and so on. Eg: 2->3->4->5->6 => 2->3->(4+4)->(5+3)->(6+2)

Face to Face Interview:
Round 1:
Q1: Design the payment module of the Uber App.

Round 2:
Q1: Design Memory Management System and tell about all the data structures you will use and why. How will you allocate and deallocate the memory using those data structure and Time & Space complexity of the operations.
Q2: Given an array of words, print all anagrams together in the output array.

Round 3:
Discussion about the projects worked in previous company and how did you handle certain situation occurred while working on the project.

Round 4:
Q1: You are provided with different Excel files and the data format those files contain. You are also provided with low level parser. You have to design a system which takes the excel file and its data type as the input and returns the list of Data objects in the file.

Round 5:
Discussion about project in previous company.
Q1: Design multiplayer generic board game (like chess or ludo)
Q2: You are provided with 2D char array. You have to provide the 2D char array as response which contains the multiplication od the input array. For eg: input=> {{a,b},{c,d}}, output => {{a,c},{a,d},{b,c},{b,d}}

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

DBMS | How to find the highest normal form of a relation

 

Steps to find the highest normal form of a relation:

  1. Find all possible candidate keys of the relation.
  2. Divide all attributes into two categories: prime attributes and non-prime attributes.
  3. Check for 1st normal form then 2nd and so on. If it fails to satisfy nth normal form condition, highest normal form will be n-1.

Example 1. Find the highest normal form of a relation R(A,B,C,D,E) with FD set {A->D, B->A, BC->D, AC->BE}

Step 1.   As we can see, (AC)+ ={A,C,B,E,D}  but none of its subset can determine all attribute of relation, So AC will be candidate key. A can be derived from B, so we can replace A in AC by B. So BC will also be a candidate key. So there will be two candidate keys {AC, BC}.

Step 2.  Prime attribute are those attribute which are part of candidate key {A,B,C} in this example and others will be non-prime {D,E} in this example.

Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.

The relation is not in 2nd Normal form because A->D is partial dependency (A which is subset of candidate key AC is determining non-prime attribute D) and 2nd normal form does not allow partial dependency.

So the highest normal form will be 1st Normal Form.

 

Example 2. Find the highest normal form of a relation R(A,B,C,D,E) with FD set as {BC->D, AC->BE, B->E}

Step 1.   As we can see, (AC)+ ={A,C,B,E,D}  but none of its subset can determine all attribute of relation, So AC will be candidate key. A or C can’t be derived from any other attribute of the relation, so there will be only 1 candidate key {AC}.

Step 2.  Prime attribute are those attribute which are part of candidate key {A,C} in this example and others will be non-prime {B,D,E} in this example.

Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.

The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not proper subset of candidate key AC) and AC->BE is in 2ndnormal form (AC is candidate key) and B->E is in 2nd normal form (B is not a proper subset of candidate key AC).

The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is a prime attribute) and in B->E (neither B is a super key nor E is a prime attribute) but to satisfy 3rd normal for, either LHS of an FD should be super key or RHS should be prime attribute.

So the highest normal form of relation will be 2nd Normal form.

 

Example 3. Find the highest normal form of a relation R(A,B,C,D,E) with FD set {B->A, A->C, BC->D, AC->BE}

Step 1.   As we can see, (B)+ ={B,A,C,D,E}, so B will be candidate key. B can be derived from AC using AC->B (Decomposing AC->BE to AC->B and AC->E). So AC will be super key but (C)+ ={C} and (A)+ ={A,C,B,E,D}. So A (subset of AC) will be candidate key. So there will be two candidate keys {A,B}.

Step 2.  Prime attribute are those attribute which are part of candidate key {A,B} in this example and others will be non-prime {C,D,E} in this example.

Step 3.  The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or composite attribute.

The relation is in 2nd normal form because B->A is in 2nd normal form (B is a super key) and A->C is in 2nd normal form (A is super key) and BC->D is in 2nd normal form (BC is a super key) and AC->BE is in 2nd normal form (AC is a super key).

The relation is in 3rd normal form because LHS of all FD’s are super keys. The relation is in BCNF as all LHS of all FD’s are super keys. So the highest normal form is BCNF.

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Database Management System – Introduction | Set 1

Database: Database is a collection of inter-related data which helps in efficient retrieval, insertion and deletion of data from database and organizes the data in the form of tables, views, schemas, reports etc. For Example, university database organizes the data about students, faculty, and admin staff etc. which helps in efficient retrieval, insertion and deletion of data from it.

Database Management System: The software which is used to manage database is called Database Management System (DBMS). For Example, MySQL, Oracle etc. are popular commercial DBMS used in different applications. DBMS allows users the following tasks:

Data Definition: It helps in creation, modification and removal of definitions that define the organization of data in database.

Data Updation: It helps in insertion, modification and deletion of the actual data in the database.

Data Retrieval: It helps in retrieval of data from the database which can be used by applications for various purposes.

User Administration: It helps in registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control and recovering information corrupted by unexpected failure.

Paradigm Shift from File System to DBMS

 File System manages data using files in hard disk. Users are allowed to create, delete, and update the files according to their requirement. Let us consider the example of file based University Management System. Data of students is available to their respective Departments, Academics Section, Result Section, Accounts Section, Hostel Office etc. Some of the data is common for all sections like Roll No, Name, Father Name, Address and Phone number of students but some data is available to a particular section only like Hostel allotment number which is a part of hostel office. Let us discuss the issues with this system:

  • Redundancy of data: Data is said to be redundant if same data is copied at many places. If a student wants to change Phone number, he has to get it updated at various sections. Similarly, old records must be deleted from all sections representing that student.
  • Inconsistency of Data: Data is said to be inconsistent if multiple copies of same data does not match with each other. If Phone number is different in Accounts Section and Academics Section, it will be inconsistent. Inconsistency may be because of typing errors or not updating all copies of same data.
  • Difficult Data Access: A user should know the exact location of file to access data, so the process is very cumbersome and tedious. If user wants to search student hostel allotment number of a student from 10000 unsorted students’ records, how difficult it can be.
  • Unauthorized Access: File System may lead to unauthorized access to data. If a student gets access to file having his marks, he can change it in unauthorized way.
  • No Concurrent Access: The access of same data by multiple users at same time is known as concurrency. File system does not allow concurrency as data can be accessed by only one user at a time.
  • No Backup and Recovery: File system does not incorporate any backup and recovery of data if a file is lost or corrupted.

These are the main reasons which made a shift from file system to DBMS.

 

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Microsoft Interview Experience | Set 132 (Software Engineer for Bing Team)

I recently attended Microsoft Interview for Software Engineer role in Bing Team.

Round 1:

Microsoft Interview Experience | Set 131

Its initial screening round. It has happened through skype.

1. Given n*n matrix with some elements in each cell. If there is “0” in any cell then we have to make that corresponding row and column to “0”
Time complexity: O(n*n) Space complexity: O(1)

2. Given an array of elements with size n. You should find the number which is repeated more than n/2 times

Time complexity: O(n)

After screening round, they asked me to come for 4 F2F rounds. I have visited Microsoft IDC hyderabad.

Round 2:

1. Given sorted array of numbers and a sum. we have to find any two numbers whose sum is equal to the given sum.

Time Complexity: O(n)

2. Given Binary tree with parent pointer and two nodes. Find LCA of the given two nodes in a given binary tree

struct TreeNode
{
int data;
TreeNode *left,*right,*parent;
};

Parent pointer of each node points to its parent. Root node’s parent pointer points to NULL

Time Complexity: O(logn)

Round 3:

1. Given two linked lists. Find the intersection point of those two linkedlists

2. Its based on Binary tree. I forgot the question 🙂

3. Design discussion on Search functionality available in smart phones


Round 4:

1. Given a linkedlist with random pointer for each node which points to some random number in the given list. Clone the linkedlist

struct ListNode
{
 int data;
 ListNode *next,*random;
};

2. Design and implement DNS Cache.

Requirements:

Cache must be fixed in size and it will be decided by the user who wants to use this cache.

If the entry is not available in Cache then it should call server to get the details of ipAddress and store it in cache.

 

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org

Pseudo-polynomial Algorithms

What is Pseudo-polynomial?
An algorithm whose worst case time complexity depends on numeric value of input (not number of inputs) is called Pseudo-polynomial algorithm.
For example, consider the problem of counting frequencies of all elements in an array of positive numbers. A pseudo-polynomial time solution for this is to first find the maximum value, then iterate from 1 to maximum value and for each value, find its frequency in array. This solution requires time according to maximum value in input array, therefore pseudo-polynomial. On the other hand, an algorithm whose time complexity is only based on number of elements in array (not value) is considered as polynomial time algorithm.

Pseudo-polynomial and NP-Completeness
Some NP-Complete problems have Pseudo Polynomial time solutions. For example, Dynamic Programming Solutions of 0-1 Knapsack, Subset-Sum and Partition problems are Pseudo-Polynomial. NP complete problems that can be solved using a pseudo-polynomial time algorithms are called weakly NP-complete.

 

Disclaimer: This content belongs to geeksforgeeks, source: http://geeksforgeeks.org