Thursday, October 29, 2015

Analyzing Toronto Business Licenses Data.

New restaurant licenses by number in the city of Toronto for 2014.

So, someone came to us with a request to show a heat map of restaurant licenses granted in the city of Toronto in 2014, to compare it with 2015 (maybe to see the effect of certain urban changes on restaurant openings).
Why restaurants? I think they represent quite a good indicator of discretionary spending and the health of the economy of certain areas. Just look at the difference between 1994, when the city issued 569 licenses, and 2014, when the number shot up to 1,519.



So in this post we will discuss the steps I went through to turn the CSV data available from the city into a heat map of restaurant licenses across the city.





Downloading the Code

The code is available on GitHub.

Data Acquisition


OK, this is the easiest part: just go to the City of Toronto Portal and download the file called 'Business Licenses.csv'.


Another file we will need is one that contains all Canadian postal codes and their latitude and longitude (I contemplated using the Google Maps APIs, but the sheer number of postal code lookups made me realize I would burn through my 2,400-calls-per-day quota in no time).

Once you acquire the data, you can use the code in the PostalCodeBuilder.ipynb notebook to isolate 'ontario.csv' from the all-of-Canada file, which will make searches about three times faster (and believe me, you will need it!).

If you pull the whole project from GitHub, you will also find ontario.csv included.

This is the code involved:

import pandas as pd

# Load the national postal-code file, keep only the Ontario rows, and save them.
all_postals = pd.read_table('canada.csv', sep=',')
on_postals = all_postals[all_postals['prov'] == 'ON']
on_postals.to_csv('ontario.csv')


Munging the Data
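
The notebook output for this step is not reproduced here, but the core of the munging is joining every licence record to the Ontario postal-code table so that each record carries a latitude and longitude. Below is a minimal sketch of that join; the column names ('Licence Address Line 3' for the postal code in the city file, and 'postal', 'lat', 'lng' in ontario.csv) are assumptions and may differ from the actual files.

import pandas as pd

# Licence records from the city portal and the Ontario postal-code lookup
# built in PostalCodeBuilder.ipynb.
licenses = pd.read_csv('Business Licenses.csv')
postals = pd.read_csv('ontario.csv')

# Hypothetical column names: normalize the postal code on both sides so the
# join keys match (no spaces, upper case).
licenses['postal'] = licenses['Licence Address Line 3'].str.replace(' ', '').str.upper()
postals['postal'] = postals['postal'].str.replace(' ', '').str.upper()

# Attach latitude/longitude to every licence record; rows with no match keep
# NaN coordinates and can be dropped later.
geo_licenses = licenses.merge(postals[['postal', 'lat', 'lng']], on='postal', how='left')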


Focus on the Study Subset
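
Again as a hedged sketch (the 'Category' and 'Issued' column names, and the 'EATING ESTABLISHMENT' category string, are assumptions about the city file): the study subset is the restaurant licenses issued in the year of interest, with any records that could not be geocoded dropped.

# Hypothetical column names; adjust to whatever the real file uses.
geo_licenses['Issued'] = pd.to_datetime(geo_licenses['Issued'], errors='coerce')

restaurants_2014 = geo_licenses[
    geo_licenses['Category'].str.contains('EATING ESTABLISHMENT', na=False) &
    (geo_licenses['Issued'].dt.year == 2014)
].dropna(subset=['lat', 'lng'])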


Reporting and Visualization
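
The original heat map came from notebook output that is not shown here. As one possible approach (the original post may well have used a different mapping tool), a matplotlib hexbin plot of the licence coordinates gives a quick density view of where the new restaurants opened.

import matplotlib.pyplot as plt

# Density of new restaurant licences across the city: each hexagon is shaded
# by the number of licence locations that fall inside it.
plt.figure(figsize=(8, 8))
plt.hexbin(restaurants_2014['lng'], restaurants_2014['lat'],
           gridsize=60, cmap='hot', mincnt=1)
plt.colorbar(label='New restaurant licenses (2014)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('New restaurant licenses, Toronto, 2014')
plt.show()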


Wednesday, October 28, 2015

Civic Engagement and Open Data.

"The opinions expressed here are my own personal views and will be used in a paper to fulfill requirements of University of Toronto Management of Big Data Analytics Certification"

Community Activism in the age of Big Data

Open Data and the city of Toronto.

Toronto has joined a host of other Canadian and international cities in posting city-related data for the public on the Toronto Open Data Portal. This has become an increasingly important topic; just look at the news of the past few months alone.

The city has started encouraging third parties to use its openly published data; the most notable example is the TTC bus and train data that is currently used by multiple mobile apps.

Furthermore, the city has encouraged the community to get involved, and this is going to be the topic of my next few blog posts and my paper.


We the people.

Imagine, if you will, a community that wants to reduce speed limits on its streets, or is concerned about the size of a mega condo being planned, or the arrival of a new megastore at the heart of its area.

Any of these events could have a big effect on the quality of life in the neighbourhood, and the big business behind such a project will come armed with 'paid' expert opinions and studies to support its case.

The goal of this work, and my hypothesis, is that we can use open data (traffic, licenses, accidents, weather, etc.) to give voice to the voiceless and to help those who need it by providing them with data that supports their well-being. The availability of such data will also separate rational from emotional resistance and/or support for many decisions, paving the road to a smoother process of community engagement in many projects.

The anticipated users will be

  • Community organizers
  • Campaigners (Political, Social)
  • Individuals
  • Local small business owners
  • School boards
  • Local event boards 
  • ..

Challenges for Open Data Providers

Government open data faces a lot of challenges, from regulations to considerations of safety and privacy, but the municipal level of government has some specific ones:
  • Limited resources (compared to the provincial and federal levels of government).
  • Heightened privacy concerns, as the small size of a data set could expose personal information, especially in municipalities with small populations (so perhaps more of a concern in Georgetown or Woodstock than in Toronto or London).
  • The need not just to make more data available, but to budget for and acquire new sets of data.

Civic Engagement effect on Government Open Data

The topic of big data is slowly moving from the hype stage into the mainstream, but public data in and of itself still deserves a closer look at some of its attributes.

One of the most intriguing attributes of public data is that, so far, the type, quality and size of the data made available is a bottom-up/inside-out process, where the city decides what data may be useful to the public and takes input from the technical startup community.

Once the public starts using the data, a new channel of feedback will start to flow, with requests focusing on:
  • Quality 
    • Field expansion.
    • Data integrity issues.
    • ..
  • Availability
    • Missing data.
    • New data acquisition (I just realized that the pedestrian/traffic data is collected at a given intersection only once a year?!).
  • Context
    • As the public starts using data, new contexts will appear as a result of mixing data sets (can we graph federal interest rates, household debt and the number of new business licenses issued?).
    • Those combinations could pose a challenge, as they may require co-ordination between different levels of government.
    • Some of those contexts may pose threats to privacy, security and/or regulatory compliance, so constant review may be needed.

In response to those challenges, the city may need to partner with technology providers, the private sector and the local tech community for ideas on how to fill the gaps and provide the best data assets to the public.

Data Activism !

So now we have the data, but how does one provide it in a way that helps the community? There are generally two types of approaches, and... well, a hybrid third option.

  1. Ad-Hoc approach.
    • In this approach the data is acquired and searched for a specific topic.
    • In the next post of this series I will use this approach to study new business licenses in a certain Toronto neighbourhood, finding out how many businesses opened in the area through the years, which could be used to show the effect of certain events on the neighbourhood's business health.
    • This approach is perfect for small, targeted issues such as zoning, speed limits, or even councillor-level campaigning.
    • This graph shows the number of restaurant licenses issued from 1990 to date and was built using an IPython notebook, pandas and matplotlib (a rough sketch of that kind of code appears after this list).
  2. General public Service approach.
    • In this approach the data is collected en masse, then processed, hosted and made available to the public.
    • For the example above, the data would be provided through a web interface where the user can see a heat map of the city and the business licenses issued in a given year, or pick a neighbourhood and a range of years to search the number of business licenses.
    • A variant of this is the TTC apps currently available from third-party vendors using TTC data from the city of Toronto (although in that case the data is collected at run time and on request).
    • This type of undertaking is large, and unless the site providing it has some revenue stream from future traffic, it is generally hard to sustain on a volunteer basis.
  3. Hybrid.
    • In this approach developers pick certain data sets they are interested in and provide them with a certain level of customization available to the user.
    • So, for example, you could provide accident-report data and allow the user to choose their data set on a map.
    • A great example of this is the wonderful work at http://censusmapper.ca/. They are probably the original 'data activists' at the federal level, providing census data to any consumer who wishes to view it, as a way to highlight the importance of the long-form census.
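
As a rough sketch of the kind of notebook code behind the graph mentioned in the ad-hoc example above: the column names ('Category', 'Issued') and the category string ('EATING ESTABLISHMENT') are assumptions and may not match the actual city file, so treat this as an illustration rather than the exact code behind the graph.

import pandas as pd
import matplotlib.pyplot as plt

# Load the city licence file (assumed name) and parse the issue dates.
licenses = pd.read_csv('Business Licenses.csv')
licenses['Issued'] = pd.to_datetime(licenses['Issued'], errors='coerce')

# Keep only the restaurant-type licences (category name is an assumption).
restaurants = licenses[licenses['Category'].str.contains('EATING ESTABLISHMENT', na=False)]

# Count licences issued per calendar year from 1990 onward and plot the result.
per_year = restaurants['Issued'].dt.year.value_counts().sort_index()
per_year[per_year.index >= 1990].plot(kind='bar', figsize=(10, 4),
                                      title='New restaurant licenses per year, Toronto')
plt.ylabel('Licenses issued')
plt.show()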

Conclusion

Big data analytics is becoming an essential tool for decision making in every business and at every level of government, and it is about time this power was handed to the public in the most suitable way. From the community level to the federal level, from schools to political campaigning, open data mixed with new technologies and a little bit of community give-back will reshape the face of civic engagement and community campaigning in the future.

Coming soon to a notebook near you!

A detailed blog post on using IPython notebooks to analyze some of the City of Toronto's new licensing data.

Monday, September 22, 2014

The roaming dinosaur series, Episode 1 : Using Self-Signed Certificates to secure Node.js HTTP traffic.

Node.js for the enterprise !

Introduction

Once, annoyed by a novice question about how my phone worked, I shared that thought with a friend at work, who smiled kindly and reminded me that "we are all a bunch of dinosaurs roaming the mobile land." I tend to disagree, as most of the concepts we are dealing with today are actually not new at all, but I digress.

So, as someone who considers himself a master of the old world of legacy web apps, I decided to explore how the new kid on the block (Node.js) responds to my Mesozoic Era standards (that is the era when dinosaurs lived, if you are wondering).

The first post of this series goes through a quick configuration to run Node.js over HTTPS using a self-signed certificate, but first let us understand why we want to do that (there are good reasons, trust me).

DMZ, SSL and those nasty security checks

A few years back I worked with a brilliant (also neurotic) application server administrator who was adamant about turning off all HTTP traffic to his application server farm.
"Only HTTPS is allowed." "HTTP is for the welcome page; go get that from that 'bleep' of a web server, not from my app server."

He was right. This is the typical layout of an enterprise application server:



Using firewalls to protect enterprise application servers.

A potential hacker has no direct access to any of the application servers; he has to go through the perimeter we call the DMZ. If by any chance he gains access to any of the servers in that zone, he must not find any 'powerful tools' under his command, which will leave him stranded until the security team discovers and deals with the breach.

This is why the DMZ will have machines with very limited abilities: no JVMs, and definitely no Node.js. A proxy gateway like Nginx, or even a simple HTTP server like Apache, would be safe to put there. Those servers are usually 'hardened', i.e. only the minimal required software is installed and only the minimal required ports are open, so no telnet and, in some extreme cases, not even ssh; you have to physically be in the room to upload files to such a machine.

But there is something wrong with this picture: the fact that the traffic between Nginx and the Node.js servers runs over plain HTTP is a security hole.

Take a close look at the problem


It is the 'admin' problem: any administrator of those machines can install a network sniffer (Wireshark, anyone?) and, voila, he has access to all unencrypted data going in and out of the servers: user names, birth dates, addresses, social insurance numbers, account numbers, anything the user submits that is not encrypted.

I have seen many customers forgo using HTTPS internally, citing that the data is in the "trusted zone". I quickly ask the question my security mentor once asked me: "Is your company policy such that your server administrator can view the birth dates and social insurance numbers of all your employees?" While they figure out the answer, we start configuring HTTPS.

Unlike external HTTPS, internal HTTPS does not require publicly signed and trusted certificates (those are costly and not easy to obtain). For internal traffic it is sufficient to use self-signed certificates, the ones administrators can create easily and renew at will.

Now you understand the method behind my madness: Node.js should always run on HTTPS, and if possible exclusively on HTTPS. So make it a good practice to configure Node.js with HTTPS every time; after all, it will only take you five minutes, as you will see below.

Configuring Node.js for HTTPS

HTTPS is HTTP over SSL. Explaining SSL is beyond the scope of this post, and to be honest only one person ever explained it to me well, some ten years ago (that was the mentor of my security mentor); I am always amazed at how poorly understood such a widely used concept is.

So in order to configure HTTPS you will need a pair; a key pair, that is, or more precisely a private key and a certificate. The certificate is what your server presents to browsers; if the client (or user) chooses to trust your certificate (i.e. trust your server), the browser takes the public key from the certificate and uses it to encrypt messages that only your site can decrypt with the private key.

In this sample I used the name 'klf' as my organization name when configuring my keys; you can use whatever your project name is. I am also using OpenSSL, an open-source tool for generating keys, which I find much easier to use for Node.js; when it comes to Java servers, keytool is the more suitable tool.

1 - Create the key pair

First, create your private key this way:

openssl genrsa -out klf-key.pem

2 - Create a certificate signing request

openssl req -new -key klf-key.pem -out klf-csr.pem

This command will ask you a few questions to identify your server (you can see my sample replies in the following screenshot). Note that I used the password 'passw0rd'; I am hoping you have more sense than to do that :).



3 - Self sign that request 

openssl x509 -req -in klf-csr.pem -signkey klf-key.pem -out klf-cert.pem


4 - Export the PFX file 

Now that you have the self-signed certificate, you need to export the PFX file that your Node.js code will use to start the HTTPS server:

openssl pkcs12 -export -in klf-cert.pem -inkey klf-key.pem  -out klf_pfx.pfx


You will be asked to specify your PFX password (I used 'passw0rd' again to make writing this post simple; please use something else!). This password will be needed by your Node.js code, as you will see in the next step.


5 - Start your Node.js server

Here are the code snippets necessary to start your Node.js server (I am using Express here, and this is not all the code; it is just what you need to modify in your express(1)-generated app.js):

app.set('port', process.env.PORT || 3000);
app.set('ssl_port', process.env.SPORT || 3443);

var https = require('https');
var fs = require('fs');

// Load the PFX bundle exported in step 4, along with its passphrase.
var options = {
    pfx: fs.readFileSync('ssl/klf_pfx.pfx'),
    passphrase: 'passw0rd',
    requestCert: false
};

https.createServer(options, app).listen(app.get('ssl_port'), function() {
    console.log('Express server listening on SSL port ' + app.get('ssl_port'));
});

6 - Start the server and test from a browser 

Next you need to access your server at https://localhost:3443.
Your browser will likely warn you about the certificate (because it does not recognize any signing authority on it), so tell the browser to trust it.

You can click 'Show Certificate' and you will see that Node.js is presenting the certificate you self-signed.


Conclusion

Using Nginx (or any other HTTP router) to terminate the external SSL connection, which uses a publicly signed certificate, and then initiating a second SSL connection from the DMZ to the Node.js server in the corporate trusted zone is good practice for on-premise enterprise Node.js applications.


Sunday, April 20, 2014

Enterprise IT migrations and/or transformation challenges

 Introduction

In my first blog entries I discussed the brave new ways of building mobile applications, and specifically the use of cloud-hosted technologies.

As IT departments scramble to shift to mobile/cloud/analytics technologies and dev-ops/agile methodologies, it is very important to keep an eye on how these changes approach the existing ecosystem. This is a great opportunity to revitalize IT teams, and they must become partners in the movement instead of being dragged along for a rough ride.

From Migration to Transformation.


For the past few decades, since the introduction of the PC, change has been the only constant in IT, and it has come in waves. Each wave brings quick cycles of change at first; then it matures, the cycles slow down, another wave hits, and we repeat.

The PC was the first wave, with 'killer apps' coming fast and furious (remember WordStar?). Once DOS matured and slowed down, the GUI wave came, followed by the 'Internet' wave and its sibling, the "Internet application" wave, and once those matured and the cycle of development slowed down, the mobile app wave came.

The mobile app wave is still raging with cycles of change; expect it to cool down sometime in the near future, only to be followed by the Internet of Things, wearable gear and apps-everywhere waves.

Every wave 'cycle' brings with it a 'migration'. Enterprises know very well how painful those have been and can still be, but they also know their benefits, and the danger of not doing them at the right time.

I have seen four types of migrations and/or transformations in enterprises over the past decades.

Release Migration

This type of migration happens fast and often, especially in the early days of a wave. Think of the J2EE release frequency in the early days of the 'online application' wave (back when it was still called J2EE), or think of the insane number of frameworks and languages for building mobile applications and dev-ops in the current mobile application wave.

Competitive Migration

Ahh, those were fun. They are usually heated in the early days of a wave, when technology providers jostle for space in the emerging market; I have fond memories of being the 'WebSphere guy' in (then) BEA WebLogic environments.

Technology Migration

This type of migration usually happens just before the previous two, but it happens much less often (and we should thank the IT gods for that).
These migrations are usually chaotic, disruptive and painful, with many jobs lost and new jobs created, and they carry a shift in IT department culture along with them (more on that below).
There was a threshold where mainframe 'screen scraping' just did not cut it and that good old trusted mainframe application was going to be rewritten in Java (horror of all horrors). People like myself try to blog to ease the pain of those moving from one technology to another, but no matter what, there will be casualties at both the department and personal levels.

Methodology Change

Methodology change is what happens when we move from waterfall to agile, or when a giant corporation like Walmart moves to dev-ops.
These changes are just as painful as technology migrations and are usually a result of such moves.
The hallmark of methodology change is "resistance": it is almost impossible to make these changes by just issuing 'top-down' commands. Leadership from behind is a must, and building a grassroots movement to support such change is key.

One of the best books I have enjoyed that discusses exactly this type of change is Succeeding with Agile. It is a great read, not on 'agile' itself but rather on the group psychology of change and how it impacts organizations.

Key challenges for successful change.

From my work on many migrations through the years, if I were to choose the key challenges that IT managers need to keep an eye on, they would be:

Resistance to change 

This is just human nature: we are creatures of habit, and IT professionals (the really good ones that you want to keep happy) identify on a personal level with their work, and change brings with it all kinds of insecurities and vulnerabilities.

Need for grassroots support

This goes hand in hand with resistance to change: the bigger the change, the more we need to build grassroots support, introduce the change gradually and generally lead from behind.

Operation impact 

This is not "Operation Migration" but rather  the impact of this migration on the development/operation ecosystem. factors like bringing outside consultants to help the migration or promoting (related to grassroots factors above).

Skill gaps 

It is a fact of life: every change has a skill gap. Not all skill gaps are created equal, and to complicate matters, not all IT professionals respond to skill gaps equally.


Conclusion

There has never been a greater need for grassroots support and 'push from behind' migrations than today. With modern technologies, the shift to cloud-based services and the outsourcing of IT services, the focus is slowly shifting back to development and developers, and any IT migration and transformation process in the mobile/cloud/analytics era must take that into account.



IIB service development, pitfalls to avoid for new JAX-WS migrants.

DISCLAIMER: The following article discusses the behaviour of IIB 9.x, WMB 8.x and prior releases; future IIB versions may (and most likely will) alter this behaviour. These are my personal views and do not represent IBM's official position.

Introduction 

The more I work with IIB (and WMB), the more I feel it is the best ESB solution available from IBM, and arguably from any provider (my bias notwithstanding). Developers moving from WESB to IIB will be pleased with the powerful range of features available and with the speed and ease of development.
However, it helps to understand that IIB's roots in MQ make it a different environment from WESB, which was a J2EE implementation; once you take that into consideration, migrating projects from WESB to IIB should be a much easier task.

The following blog entry should help you avoid some of the most common pitfalls and make your journey into web service development in IIB a lot friendlier.

You can find a quick tutorial on how to build IIB web services in my Worklight-to-IIB series here.

Use ESQL SET with caution.

The order of elements in an XSD matters for XSD validation; breaking the order will lead to 'invalid XML'.

Take a look at this XSD and note the order of the elements.



When using the ESQL content assist you will see what you expect: the XSD is now known to the SOAP and XMLNC parsers.


The first thing to notice here is that content assist does not display the elements in the same order as the XSD (a sign of things to come!).

In the code snippet above you also see the use of 'REFERENCE', a common technique for writing neater source (and arguably faster-running code).

Now we run the code and examine the resulting XML.
 

Now you get the picture: follow both the green arrows and the blue arrows and you will discover that the order of the XML output actually DEPENDS on the order of your code execution.

The SOAP message that the code above generates is INVALID.

Conclusion.

  • Your code must set elements in the same order as the original XSD.
  • As you saw earlier, the content assist does not list elements in the right order, so do not be fooled by it.
  • If you plan to populate an XSD model through multiple steps (using Environment or LocalEnvironment), you must make sure that all your code pieces execute in the same order every time.
  • To guard against such mistakes, and for peace of mind, I highly recommend using a 'Validate' node before you return your response (performance implications notwithstanding).

Avoid the namespace mayhem.

"Mayhem : A state or situation of great confusion, disorder, trouble or destruction; chaos. " Wiktionary.

It is common for a single WSDL, or even a single web service operation, to have multiple namespaces; the sample I have here has two:
  • Service elements: "http://KLFSamples.ESB.WL/service/"
  • Data elements: "http://www.example.org/DB_Data"
Now look at this simple ESQL code that sets the SOAP response body to fixed values.


NOTE: The use of 'REFERENCE' is highly recommended, but you cannot create a REFERENCE to an element until that element exists, so a SET on an element of the 'Person' object has to happen before I can reference it. Personally I think this is one of those bugs that became a 'feature' as time went on, but I digress.

The previous code looks straightforward, but the XML generated in the SOAP message carries a real "interesting" surprise for those used to JAX-RPC and JAX-WS.

It seems that IIB 9.0 (I have not tested IIB 10 yet) and previous versions of WMB have their own peculiar way of adding and duplicating the same namespace in multiple places.

The problem gets worse, as the same 'NSx' prefix could be defined with two different values in different places in the same SOAP message, within different hierarchical structures.

So far, this is just an 'ugly' message, not an 'invalid' one. But the problem compounds if you are writing into the input of a SOAPRequest node: if that node is calling another IIB (or WMB) service, those unexpected namespaces become embedded in the elements as they travel through the IIB engine, and the reply from that SOAPRequest node will almost certainly carry invalid XML and will probably break your flow (do not ask me how I know).

Solution 

Thanks to my friend and WMB guru in ISSW, Scott Rippley, the answer lies in the developer manually enumerating all the namespaces in their SOAP message and inserting them into the XMLNC or SOAP domain, thus forcing the parsers to use them and cleaning up the output.


The resulting SOAP message will now look like what you would have expected originally.

Technical Summary

  • Keep track of all the namespaces used in your code, abbreviated as NSxx by the ESQL content assist.
  • Keep in mind those namespaces may span multiple files.
  • Insert all of them, under your own naming convention (NSyy), at the top of your SOAP message using the NamespaceDecl field type.
  • Feel free to read more about it in the Infocenter topic here.
  • Remember to declare all your namespaces at the top of your code because... well... order is important, right?