THE SEVEN DEADLY SINS OF SCIENCE GATEWAYS INITIATIVES


    1. They act as a Grid “cache-misère” (a fig leaf). They maintain the illusion that grids can be “fixed”, “recycled” or made “more appealing”. They delay a necessary moratorium on a costly and obsolete technology and paradigm whose inevitable death is now as plain as day.

    2. They assume that the only way to interact with a federated infrastructure is a job scheduler of some kind. By providing a “federation layer” to e-Infrastructure, they make everything look like a grid. Such a choice compromises the interaction design that could be envisaged at the user-facing layer. In particular, interactive computing and real-time collaboration are no longer possible. The grid mentality should die. Interactive computing (the IPython way) should receive more focus.

    3. They envisage the infrastructure with a pre-cloud mind-set. Ahead of elasticity, the most compelling feature of clouds is scriptability: a few lines of code can describe and bring to life a complex hardware/software architecture. The back end for computation can and should be built on the fly, in real time, from libraries of infrastructure-describing scripts. Everything should target Infrastructure-as-a-Service-style clouds and make use of their full potential.

    4. They consider graphical user interfaces as just software that can be built by developers and researchers. Building usable human-machine interfaces requires expertise and should be done by people whose job is to design interaction. Usability is hard; it doesn’t just happen. Systematic involvement of interaction designers is key.

    5. They overlook the fact that building sustainable engineering artifacts is different from research, and that the structures and frameworks that work for research projects may not be effective for building and delivering infrastructures and tools for science. They keep reinventing the wheel and proposing yet another “middleware”. They build software under conditions and with processes that do not allow high-quality software to be built. They reproduce again and again the “death march” (E. Yourdon) towards software doomed to fail. They get overwhelmed by the technical complexity and forget that keeping software alive is a more daunting task than designing and building it. Right after the software is delivered, and in the absence of an ecosystem, another “death march” starts, this time towards obsolescence. They should do one of two things. Either take the challenge of building a software ecosystem seriously: involve experts in software design and software ecosystems in the project and in its strategic thinking, treat growing the ecosystem as the core objective, and potentially get the necessary guidance from a central European agency (to be invented) that would provide expertise and coaching. Or get connected from day one to an existing ecosystem and shape the project’s outcome into an artifact that is valuable to an established community.

    6. They overlook the fact that an application based on a frozen set of requirements can’t be a tool for science, where everything is moving, exploratory and transient by nature. Scientists love Matlab, R, Python, etc. because those tools allow them to progress towards understanding their data, building their models and comparing their results with others’: they follow a “Brownian motion” towards the unknown. R, Python and Matlab allow them to capture, in the form of a “script”, a trajectory towards a scientifically relevant, “publishable” result that could not be predicted in advance. That script can be shared and reused as is, or in the form of a component/library/module/package that others can import into their own environments to reproduce their peers’ trajectory before setting out to explore a new one of their own. Science gateways and the workflow paradigm they often rely on fall short of allowing such a “hyper agile”, traceable and reproducible scientific process. If science gateways are ever to be useful to more than a handful of scientists, they have to comply with and empower this way of working, in particular: (a) No IT people should be involved in creating those science gateways; scientists should be able to build and deploy them from the R, Python or Matlab command lines. Interaction components, views for data visualization, etc. should be scriptable and easy to combine with the tools scientists use to program with data. (b) Science gateways should bring significant added value to convince scientists to consider them. For instance, enabling real-time collaboration (the Google Docs way) while accessing, analysing and visualising data would bring to scientists’ desks capabilities they are currently eager to have. Adding social components that allow them to engage with each other as small groups or communities would also be valuable.
Those scenarios are no longer science fiction, thanks to the capabilities of cloud technologies and to the maturity reached by hundreds of open source tools, frameworks, computational libraries and infrastructure software.

    7. They lobby to give the science gateway/e-Infrastructure they build a fictitious appearance of popularity. The incentive “lines of force” currently at work create a bubble of fictitious use cases, imposed software and “non-organic” communities. Darwinism should rule, to discard the “dancing bears” (a metaphor for software that hardly works for people, coined by A. Cooper). Darwinism led to the long-lasting success of R, Python, OpenStack, GitHub, ResearchGate, Hadoop, Spark, etc.
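
To make the scriptability argument of point 3 a little more concrete, here is a minimal sketch in Python of what an “infrastructure-describing script” could look like. It is an illustration only, not taken from any real gateway or cloud toolkit: the names NodeGroup, Architecture and r_cluster, the package names and the machine sizes are all hypothetical, and the sketch only prints a launch plan instead of calling a cloud API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeGroup:
    """One homogeneous group of virtual machines (hypothetical schema)."""
    role: str
    count: int
    cpus: int
    packages: tuple = ()


@dataclass
class Architecture:
    """A whole back end, described as data that can be versioned and shared."""
    name: str
    groups: list = field(default_factory=list)

    def plan(self):
        """Expand the description into one launch step per machine."""
        steps = []
        for group in self.groups:
            for index in range(group.count):
                steps.append({
                    "host": f"{self.name}-{group.role}-{index}",
                    "cpus": group.cpus,
                    "install": list(group.packages),
                })
        return steps


def r_cluster(name, workers):
    """A reusable 'library' entry: one head node plus `workers` R workers."""
    return Architecture(name, [
        NodeGroup("head", 1, 2, ("r-base", "rstudio-server")),
        NodeGroup("worker", workers, 8, ("r-base",)),
    ])


if __name__ == "__main__":
    # A real tool would hand each step to an IaaS API; here we only print it.
    for step in r_cluster("demo", workers=3).plan():
        print(step)
```

The point of the sketch is that the whole back end fits in a function call such as r_cluster("demo", workers=3), so it can be stored, shared and rebuilt on demand like any other script.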



    Karim Chine

2 thoughts on “THE SEVEN DEADLY SINS OF SCIENCE GATEWAYS INITIATIVES”

  1. The term “science gateway” needs a precise definition, otherwise there is a risk of throwing the baby out with the bath water.

    In a paper my colleagues and I wrote in 2013 (http://dx.doi.org/10.2481/dsj.GRDI-013), the term “science gateway” is treated as “just” a synonym of Virtual Research Environment (VRE), which is defined as “innovative, web-based, community-oriented, comprehensive, flexible, and secure working environments conceived to serve the needs of modern science”. That definition already signals what scientists expect of a supporting environment. There is no need to expose scientists (VRE users) to the technical issues of the underlying infrastructure(s), no matter how many and how diverse they are.

    Besides the paper, my colleagues and I have put its ideas into practice by developing a software system (www.gcube-system.org) that makes it possible to build and operate a Hybrid Data Infrastructure, i.e. an infrastructure built with the system-of-systems approach that is able to integrate diverse infrastructures (including Grid and Cloud) and systems into a coherent whole. This software currently supports the D4Science infrastructure (www.d4science.org) and its VREs.

    In the light of these experiences, please find below some comments on the seven sins.

    1. VREs must abstract from the underlying technologies and infrastructure(s). It is the duty of their developers to hide the complexity of what is behind the scenes and provide scientists with the environments they deserve for doing science. Moreover, VREs should be developed to make the most of what is available: if a Grid infrastructure is available, then it can be used for “batch” distributed computing.

    2. This is definitely true. There are different infrastructures and systems to be integrated when developing a VRE, each with its own peculiarities, and job-based interaction is just one of the possible patterns. When realising a “federation”/“integration” layer, one should offer more than one “interface”/“abstraction” concept and take care to implement all of them on top of the offerings of the diverse systems/infrastructures, e.g. via a set of specific mediators/plug-ins.

    3. Cloud is definitely important; however, its delivery models go well beyond IaaS. To build a VRE, scientists need the entire stack of delivery models. In fact, they need “as-a-Service” delivery of everything, including the entire VRE.

    4. I do agree with this; in gCube, GUIs are part of the offering. However, each piece of functionality should be offered “as-a-Service” both via a GUI and via an API. In some cases a GUI is appropriate, in other cases an API is.

    5. The willingness to realise the “as-a-Service” model definitely imposes constraints and requirements. VREs should not be built from scratch; rather, their realisation should aim to reuse what exists and focus only on complementing/adapting it to the concrete needs of the target scenario. To make this the norm, communities need to be informed of cutting-edge offerings and helped to compare, in a cost-benefit analysis, the exploitation of these offerings with development from scratch.

    6. “Dynamicity” is definitely among the most challenging requirements for anyone who is going to provide scientists with a system. Very often scientists are looking for “openness” in the proposed approaches, i.e. approaches where scientists can extend the capabilities by themselves without being exposed to IT technicalities. However, finding the right “abstractions” is not an easy task. In addition to functionality, data are important too. Being able to dynamically and transparently access existing data repositories is as important as being able to load and analyse “proprietary” data. Moreover, being able to easily share datasets produced in one context so that others can benefit from them is among the principles that must become the norm in science.
    In “our” approach there are a number of solutions to “openness” and “added value”:
    * VREs can be created and modified dynamically by simply selecting components and data from a portfolio of existing ones hosted by the underlying infrastructure. No IT expertise is needed to create a VRE;
    * Some of the components a VRE can be equipped with are actually engines that enable scientists either to use existing facilities or to easily plug in their own. For instance, this is the case of the analytics environment, which offers a number of methods and enables scientists to inject their own methods (including R scripts). The analytics environment executes these methods by exploiting the computing capacity of the underlying infrastructure(s). A description is given here http://dx.doi.org/10.1002/cpe.3435;
    * Some components are oriented to provide scientists with cutting-edge solutions promoting an approach to science that is collaborative and compliant with Science 2.0 practices. Our VREs are equipped with (a) a workspace for storing and sharing any research outcome, (b) a social networking environment for sharing any research outcome, e.g. to comment on it, to get access to it and to reuse it, (c) feature-rich characterizations of the research objects, automatically produced to carefully describe “how” the objects were produced, by whom, when, etc., with the goal of actually enabling their (re)use by others. An innovative approach aiming at equipping any infrastructure with Science 2.0 practices is described here http://dx.doi.org/10.1045/january2015-assante

    7. It is definitely true that the success of any “VRE” strongly depends on its diffusion and on the impact it has on scientists’ working practices. However, there is a large variety in science, and there is a need to cross-fertilise less developed domains by advertising “best practices”.

    If you would like more information on our experiences, please do not hesitate to contact me … we have been working on VRE development since 2004.

  2. Couldn’t agree more that we need improved interfaces. We have a role model: SQL has proven tremendously successful for business and administration data; not in science and engineering, though, due to missing support for core data types. (And, yes, NoSQL has already (re)discovered the value of query languages.)
    Leveraging the success and experience of query languages is possible: “Science SQL” (ISO 9075 SQL Part 15: MDA) is going to lead to interfaces for science and engineering that are just as flexible, scalable and commodity-style as those classic SQL has provided in business, a move that will effectively change the game for e-Infrastructures and beyond. Such languages are easier to use as an interface (scripting by experts, or as a hidden API for client developers) and have proven highly optimizable and parallelizable in the back end. For example, the rasdaman Array Database has distributed an incoming query ad hoc and transparently over 1,000+ cloud nodes.
    Time to put our minds together: databases, programming languages, HPC (and possibly more domains) should finally unite their forces for a better service world. Yes, we can, if we want.
