Sunday 20 January 2013

The CAP theorem

When designing distributed systems such as distributed grids, distributed caches or distributed web services, I often see solution architects and technical architects struggle to provide three main characteristics: consistency, availability and partition tolerance.
It appears to be widely accepted nowadays that any enterprise technical architecture should guarantee these three features, almost as a standard.
Unfortunately, when it comes to designing a distributed system, that is not straightforward: you see architects continuously reviewing their designs in order to provide all three guarantees and, ultimately, having to settle for a trade-off.
Let's first see what these features mean in a distributed system:

Consistency
Consistency means that every server in the grid returns the correct, most recent response to each request: the entire distributed system presents the same data regardless of which server receives the request.

Availability
Availability means that each request eventually receives a response. The system is always available.

Partition tolerance
Communication between servers in the grid can be faulty. Messages can be delayed or lost, and servers can crash. If that happens, the system should continue working correctly.

Trying to satisfy all three guarantees in a distributed grid is, theoretically, impossible.
In 2000, at the Symposium on Principles of Distributed Computing, professor Eric Brewer of the University of California, Berkeley presented the conjecture that became known as the CAP theorem (it was formally proved by Gilbert and Lynch in 2002).
According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.

This is something every architect should know when dealing with distributed systems.
So, how do we deal with it? If you google "CAP Theorem" you will find many articles offering solutions and panaceas, with people claiming to have beaten this restriction.
In reality, it all comes down to two steps:

1- drop one of the three constraints
2- design around it (architects already do this, even without knowing the CAP theorem)

Note that dropping one of the constraints does not mean that the system will never be consistent, available or partition tolerant. It means that there will be a time window, be it milliseconds or seconds, in which the system does not provide that guarantee.
NoSQL databases are a good example of distributed systems that face the CAP theorem: many of them drop strict consistency in favour of eventual consistency.
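To make the trade-off concrete, here is a minimal Python sketch of my own (not modelled on any particular product) of a grid that drops consistency: a write lands on one replica and propagates to the other only after a delay, so a read served by the stale replica returns old data until replication catches up.

import time
import threading

class Replica:
    # A single server holding a copy of the data.
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        self.store[key] = value

class EventuallyConsistentGrid:
    # Two replicas; writes propagate asynchronously, so consistency is dropped.
    def __init__(self, replication_delay=0.5):
        self.primary = Replica()
        self.secondary = Replica()
        self.delay = replication_delay

    def write(self, key, value):
        self.primary.write(key, value)
        # Replicate in the background: the grid stays available,
        # but is inconsistent until the timer fires.
        threading.Timer(self.delay, self.secondary.write, (key, value)).start()

grid = EventuallyConsistentGrid()
grid.write("balance", 100)
print(grid.secondary.read("balance"))   # None: a stale read inside the window
time.sleep(1)
print(grid.secondary.read("balance"))   # 100: the replicas have converged

The gap between the two reads is exactly the time window mentioned above.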

Antonio

Sunday 19 August 2012

Big Data - The three Vs

A bit of taxonomy for this blog. I'm currently taking part in a Big Data initiative within my company and I was thinking of sharing some important findings here.
Before doing that, let's define Big Data. I will ask Wikipedia for some help:
“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.”

Basically, we are talking about massive volumes of data that commonly used software cannot process quickly.
Is that all? Not quite.
Big Data is defined by three characteristics, which form what is better known as the three Vs: Volume, Variety and Velocity.


Volume 
We already talked about this in the definition: single data sets range from a few dozen terabytes up to many petabytes.

Variety
This is a key characteristic. Variety represents all types of data. We shift from traditional structured data (e.g. relational database tables), to semi-structured data (e.g. XML files with free-text descriptions), to completely unstructured data like web logs, tweets and social networking feeds, sensor logs, surveys, medical checks, genome mappings, etc.
Traditional analytics platforms can't handle this variety.

Velocity

In the traditional data world, velocity means how fast the data comes in and gets stored. Traditional analytics platforms deal with pretty much static data: RDBMS tables whose rate of change is well under control.
In the Big Data world, data is fast-changing, continuously mutating and growing in volume. We are talking about real-time, streaming data: a web log or a tweet hashtag is continuously growing and flowing. Here, velocity is the speed at which the data flows.
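As a toy illustration of processing data in motion rather than at rest, here is a minimal Python sketch of my own (not tied to any specific Big Data product) that updates hashtag counts one event at a time, as the stream flows, instead of loading a static table first:

from collections import Counter

def count_hashtags(tweet_stream):
    # Consume tweets one at a time, updating counts as the data flows in.
    counts = Counter()
    for tweet in tweet_stream:
        for word in tweet.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
        yield dict(counts)  # current totals, available at any moment

# A stand-in for a live feed; in reality this would be an unbounded stream.
stream = ["Loving #BigData", "#bigdata is the new oil", "the three Vs of #BigData"]
for snapshot in count_hashtags(stream):
    print(snapshot)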


Putting the three characteristics together, a Big Data initiative should be able to cope with massive volumes of fast-flowing, mostly unstructured data.
To achieve what? This will be the subject of my next few posts.
Antonio

Saturday 12 May 2012

Rapid Prototyping: key for innovation


I recently attended a technology summit in London. Quite an interesting one, which I will write about in the next few days.
I was part of the CAB (Customer Advisory Board) of the technology vendor who actually organized the summit.

One of the key questions that arose at the CAB meeting was around innovation. What do you need to innovate? What are the key factors that you find important for innovation?

I didn’t have to think much about the answer, given that I’m actually facing it every day.
Rapid prototyping is becoming a key factor for innovation.

In the era of the SaaS explosion, all the software features are already out there. We are no longer developing features; we are composing features offered by different vendors.

Composing working applications through service orchestration is now faster than ever. Tools for rapid prototyping of service composition will play a key role in the future of innovation.
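As a flavour of what composing features can look like in practice, here is a minimal Python sketch; the endpoints and field names are made up for illustration, standing in for two different vendors' SaaS APIs:

import requests  # third-party HTTP client: pip install requests

# Hypothetical vendor endpoints, for illustration only.
GEOCODE_URL = "https://vendor-a.example.com/geocode"
WEATHER_URL = "https://vendor-b.example.com/weather"

def prototype_weather_lookup(address):
    # Compose two third-party services into one working feature.
    geo = requests.get(GEOCODE_URL, params={"q": address}).json()
    weather = requests.get(
        WEATHER_URL, params={"lat": geo["lat"], "lon": geo["lon"]}
    ).json()
    return {"address": address, "forecast": weather["summary"]}

print(prototype_weather_lookup("10 Downing Street, London"))

A prototype like this can be running on a tablet in a day, which is the whole point.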

Customers are asking every day for working prototypes. "Give me something that I can see and use, rather than showing me another PowerPoint" was a recent comment I heard.

Showing a working prototype on a mobile device, an iPad or an Android tablet, rather than in a PowerPoint, can be the difference that makes your innovative idea stick.
It becomes "that's what it does", rather than "that's what I promise it will do".

Antonio

Thursday 23 February 2012

How do you move to SOA? A pragmatic approach.

This is a common question and there isn't a simple answer.
How do you actually make your entire software architecture 100% SOA compliant, meaning that all of its components are designed as services?
Well, in my opinion, you need to define a clear roadmap for legacy application governance and design standards for new components. It is not something that can happen overnight; an incremental approach is generally necessary. It is a journey for the whole organization, not just a new process to implement.

Someone, though, realized that a quicker and more pragmatic approach would work better.

I stumbled upon an old email sent by the Amazon CEO to his software developers, designers and architects in order to introduce SOA. One of the employees shared the email, by mistake, on his public technology blog.

Before the email was sent, Amazon's software architecture was a siloed architecture, without any internal APIs.
To tackle this problem, the CEO "suggested" this:

subject: Amazon shall use SOA!

body:

  1.  “All teams will henceforth expose their data and functionality through service interfaces
  2. Teams must communicate with each other through these interfaces
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network
  4. It doesn't matter what [API protocol] technology you use.
  5. Service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!”

Quite a pragmatic approach to SOA governance, but looking at what the Amazon Cloud has become nowadays, it seems to me that it worked rather well.
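To see what rule 1 looks like in code, here is a minimal Python sketch of my own (not Amazon's actual implementation): a team exposes its data through a small HTTP service interface, so other teams call the interface over the network instead of reading the data store directly.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The team's private data store: no other team may read this directly (rule 3).
_ORDERS = {"42": {"item": "book", "status": "shipped"}}

class OrderServiceHandler(BaseHTTPRequestHandler):
    # The only sanctioned way in: a service interface over the network (rule 1).
    def do_GET(self):
        order = _ORDERS.get(self.path.rstrip("/").split("/")[-1])
        if order is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(order).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Other teams call e.g. GET http://localhost:8000/orders/42
    HTTPServer(("localhost", 8000), OrderServiceHandler).serve_forever()

Rule 5, externalizable by design, is what later allowed the same interfaces to be offered to outside developers, which is arguably how the Amazon Cloud emerged.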


Antonio

Friday 3 February 2012

Volunia Search Engine to be launched soon

Volunia, the Italian web search engine, is due to be launched on Monday the 6th of February.


Developed by Massimo Marchiori (one of the top 100 researchers in the world), it will be launched via web conference from the University of Padua, where Massimo currently teaches Databases and Information Systems.

Massimo was the creator of the HyperSearch technique, whose ideas influenced the ranking algorithms adopted by most search engines (Google's PageRank above all).

He claims the new engine will have features that all the other search engines will adopt within the next four or five years. He stated that it is more than just another search engine: "It will be different, it will do things that Google is currently not able to do".

I'm looking forward to the launch event. I will post an update here discussing the new features and the core technology concepts behind the Volunia search engine.

To attend the event, check this web site: http://launch.volunia.com/

Antonio

Friday 23 December 2011

Will the future of professional education be online?

I'm a great believer in on-line education. I often attend live webcasts, follow YouTube videos and download shared presentations. I generally find them very helpful, but I believe that, with current technologies, more can be done for professional education.
Why not offer complete courses on-line?

It looks like I'm not the only one who believes that.
Over three months ago, Stanford University launched three online classes, and I joined two of them: Machine Learning and Introduction to Artificial Intelligence.


Both courses are offered on-line for free, and they cover very interesting subjects.
I was mainly interested in Machine Learning, which is a branch of artificial intelligence, but I decided to also attend Introduction to Artificial Intelligence in order to get a wider view of the whole field.

Last week I completed the final assessments for both courses.
The AI professors have already sent the certificate of accomplishment; I'm still waiting for the Machine Learning one.


Introduction to AI was really theoretical, and we covered the following topics:

  • Overview of AI, Search
  • Statistics, Uncertainty, and Bayes networks
  • Machine Learning
  • Logic and Planning
  • Markov Decision Processes and Reinforcement Learning
  • Hidden Markov Models and Filters
  • Adversarial and Advanced Planning
  • Image Processing and Computer Vision
  • Robotics and robot motion planning
  • Natural Language Processing and Information Retrieval

Machine Learning was really practical. Each week we had programming exercises to implement in Octave, a high-level, open-source, interpreted language suited to numerical computation. Below are the topics we covered over the last few weeks (a small sketch of one such exercise follows the list):


  • Introduction to Machine Learning
  • Linear regression with one variable
  • (Optional) Linear algebra review
  • Linear regression with multiple variables
  • Octave tutorial
  • Logistic Regression
  • One-vs-all Classification
  • Regularization
  • Neural Networks
  • Backpropagation Algorithm
  • Practical advice for applying learning algorithms
  • How to develop and debug learning algorithms
  • Feature and model design, setting up experiments
  • Support Vector Machines (SVMs)
  • Survey of other algorithms: Naive Bayes, Decision Trees, Boosting
  • Unsupervised learning: Agglomerative clustering, k-Means, PCA
  • Combining unsupervised and supervised learning
  • Independent component analysis
  • Anomaly detection
  • Other applications: Recommender systems, Learning to rank
  • Large-scale/parallel machine learning and big data
  • Machine learning design / practical methods
  • Team design of machine learning systems
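
To give a flavour of those weekly exercises, here is a minimal sketch of linear regression with one variable trained by gradient descent. The course assignments were in Octave, but I am sketching the same idea in Python with NumPy, on made-up data:

import numpy as np

# Made-up training data: x = feature, y = target (roughly y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

theta0, theta1 = 0.0, 0.0       # hypothesis h(x) = theta0 + theta1 * x
alpha, iterations = 0.01, 5000  # learning rate and number of descent steps

for _ in range(iterations):
    error = (theta0 + theta1 * x) - y
    # Simultaneous update of both parameters, as taught in the course.
    theta0 -= alpha * error.mean()
    theta1 -= alpha * (error * x).mean()

print("h(x) = %.3f + %.3f * x" % (theta0, theta1))  # expect roughly 0 + 2x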


I found both courses very interesting, and I really learned a lot. While AI gave me a deep theoretical understanding of the whole artificial intelligence world, Machine Learning was very hands-on, and I found many topics directly applicable to my everyday job.
How? I will talk about some of the topics I learned in the next few weeks.

In the meantime, Stanford has launched another series of online classes.

I'm planning to attend a few of them, and I strongly advise all my readers to enrol in at least a couple.


I have only one suggestion for Stanford, and it is about the assessments: how can they prove that students are really following the Stanford Honour Code?
I think Stanford should host the examinations at external testing centres, such as Prometric. This would give the certificates of accomplishment more formal value and better industry recognition.

Stay tuned: in the next few days we will talk about Machine Learning and Artificial Intelligence and see if we can find applications in the business world.


Antonio

Saturday 3 December 2011

BPM and ESB: pick the right one

I recently noticed, in different enterprise architectures, an overlap between BPM (Business Process Management) and ESB (Enterprise Service Bus) technologies.
Architects tend to pick one or the other based on previous choices, company partnerships, costs or know-how, rather than real technical benefits or functional reasons.
It is becoming common to come across problems solved using BPM where an ESB solution would have been more suitable, and vice versa.

I have attended various ESB seminars (online and in person), and some of the most common questions have always been:

"Will BPM replace ESB in the near future? 
Do you see BPM and ESB working together in enterprise architectures? 
How do we pick one or the other?"

Let's try to answer those questions by discussing BPM and ESB separately.

BPM

Business Process Management assists business analysts in optimizing an organization's processes.


These are generally human-centric, long-lived processes, lasting days, weeks, months or even years, where a certain degree of human interaction is expected.
BPM solutions cover process definition, execution and monitoring, and generally orchestrate work across different systems, people and processes.
BPM does provide connectivity between different systems, but it doesn't support a large number of protocols and data transformation formats.
Examples of human-centric business processes are HR and payroll workflows, or low-volume, high-value e-commerce transactions.

In general, we can say that BPM should be used for human-centric, long-lasting, non-real-time business processes, where performance is not a key factor and latency won't affect process execution.

 

ESB


An Enterprise Service Bus (ESB) is optimized for high-volume, low-latency, system-to-system real-time communication, in the order of milliseconds or seconds.


An ESB is also a flexible and scalable solution, which becomes very important when transaction volumes are significantly high.
An ESB supports many different transport protocols and data transformation formats.
In general, an ESB should be used in high-volume, non-human-centric transaction scenarios, where real-time behaviour and low latency are key, and where the orchestration spans different systems supporting different kinds of protocols and data formats.

BPM and ESB together

BPM and ESB are not only complementary, they can also be used together; there are several scenarios and case studies.
A very popular scenario is an ESB kicking off different BPM processes, orchestrating them, collecting the results and sending them on to different systems.
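A minimal Python sketch of that scenario, with made-up process names and plain functions standing in for real BPM endpoints and ESB delivery channels:

# Stand-ins for BPM process endpoints; in a real deployment these would be
# remote calls into a BPM engine, not local functions.
def run_credit_check(order):
    return {"order": order["id"], "credit": "approved"}

def run_fulfilment(order):
    return {"order": order["id"], "warehouse": "WH-3"}

def esb_handle(order, downstream_systems):
    # ESB role in this scenario: kick off the BPM processes,
    # collect their results, and route them on to other systems.
    results = [run_credit_check(order), run_fulfilment(order)]
    for deliver in downstream_systems:
        deliver(results)  # each channel would transform to its own protocol/format

esb_handle({"id": "42"}, [print])  # 'print' stands in for a real delivery channel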

Another scenario, albeit less common, is the other way around: BPM orchestrating a main business process, with the ESB handling the individual tasks of communicating with different systems and managing transactions, availability, scalability and data transformation.

In conclusion, BPM and ESB fit different purposes, depending on business scenario, transaction volume, performance requirements, protocols and data formats. But they are not only complementary: they can be used together in many business cases.


Antonio