Big data analyses are critical for decision support in business data
processing. These analyses involve the execution of many activities such as:
programs to explore data from the web, databases, data warehouses and
files; data cleaning procedures; programs to aggregate data; core programs
that perform analyses; and tools to visualize and interpret the results. Each
step (activity) of the analysis is performed in isolation from the others, and
analysts must manually manage the larger life cycle of big data analysis.
Big data analyses have started to be represented as pipelines or dataflows.
However, current approaches lack features to provide a consistent view of
many different explorations and activities as part of a broader analysis, like a
computational experiment. Scientific workflows have long provided such
features for scientific experiments, and although originally designed for
science, they may be useful to support the life cycle of big data analysis.
Scientific analyses typically involve experimenting with several steps using
different datasets and computer programs. Scientists need to manage the
composition, execution and analysis of their experiments carefully, so the
results can be trusted and the experiments are reproducible. To help manage
experiments, scientific workflow management systems (SWfMS) have been
proposed to let scientists design workflows of different complexities and
manage their execution, including high performance computing (HPC) in cloud
environments. Most SWfMS also have provenance data support. Provenance
tracks how the results of the experiments were produced, which is essential to
make an experiment (big data analysis) reproducible and trustworthy.
Business process workflows focus on modeling the process rather
than managing big data flows with provenance and HPC. In this talk, we
discuss provenance support along the big data analysis workflow as an
alternative to improve results of big data analysis, especially in a long-term