A study on big data technologies, commercial considerations,
associated opportunities and challenges
Zeituni Baraka
Opportunities to manage big data
efficiently and effectively
Zeituni Baraka
2014-08-22
Dublin Business School, tunibaraka@yahoo.com
Word count 20,021
Dissertation
MBA
Acknowledgements
I would like to express my gratitude to my supervisor Patrick O’Callaghan who has taught
me so much this past year about technology and business. The team at SAP and partners have
been key to the success of this project overall.
I would also like to thank all those who participated in the surveys and who so generously
shared their insight and ideas.
Additionally, I thank my parents for providing the fantastic academic foundation on which I have built at postgraduate level. I would also like to thank them for modelling rather than preaching and for driving me on with their unconditional love and support.
TABLE OF CONTENTS
ABSTRACT
BACKGROUND
BIG DATA DEFINITION, HISTORY AND BUSINESS CONTEXT
WHY IS BIG DATA RESEARCH IMPORTANT?
BIG DATA ISSUES
BIG DATA OPPORTUNITIES
Use case- US Government
BIG DATA FROM A TECHNICAL PERSPECTIVE
Data management issues
1.1 Data structures
1.2 Data warehouse and data mart
Big data management tools
Big data analytics tools and Hadoop
Technical limitations relating to Hadoop
1.3 Table 1. View of the difference between OLTP and OLAP
1.4 Table 2. View of a modern data warehouse using big data and in-memory technology
1.5 Table 3. Data life cycle- An example of a basic data model
DIFFERENCES BETWEEN BIG DATA ANALYTICS AND TRADITIONAL DBMS
1.6 Table 4: View of cost difference between data warehousing costs in comparison to Hadoop
1.7 Table 5. Major differences between traditional database characteristics and big data characteristics
BIG DATA COSTS- FINDINGS FROM PRIMARY AND SECONDARY DATA
1.8 Table 6: Estimated project cost for 40TB data warehouse system – big data investment
RESEARCH OBJECTIVE
RESEARCH METHODOLOGY
Data collection
Literary review
Research survey
1.9 Table 7: Survey questions
SUMMARY OF KEY RESEARCH FINDINGS
RECOMMENDATIONS
Business strategy recommendations
Technical recommendations
SELF-REFLECTION
Thoughts on the projects
Formulation
Main learnings
BIBLIOGRAPHY
Web resources
Other recommended readings
APPENDICES
Appendix A: Examples of big data analysis methods
Appendix B: Survey results
Abstract
Research enquiry: Opportunities to manage big data efficiently and effectively
Big data can enable part-automated decision making. By bypassing the possibility of human error through the use of advanced algorithms, information can be found that would otherwise remain hidden. Banks can use big data analytics to spot fraud, governments can use it to cut costs through deeper insight, and the private sector can use it to optimize service and product offerings and to target customers through more advanced marketing.
Organizations across all sectors, and government in particular, are currently investing heavily in big data (Enterprise Ireland, 2014). One would think that an investment in superior technology that can support competitiveness and business insight should be a priority for organizations, but due to the sometimes high costs associated with big data, decision makers struggle to justify the investment and to find the right talent for big data projects.
Because big data research is still at an early stage, the supply of knowledge has not been able to keep up with the demand from organizations that want to leverage big data analytics. Big data explorers and adopters struggle to access both qualitative and quantitative research on big data.
The lack of access to big data know-how, best-practice advice and guidelines drove this study. The objective is to contribute to efforts being made to support wider adoption of big data analytics. This study provides unique insight through a primary data study that aims to support big data explorers and adopters.
Background
This research contains secondary and primary data to provide readers with a
multidimensional view of big data for the purpose of knowledge sharing. The emphasis of
this study is to provide information shared by experts that can help decision makers with
budgeting, planning and execution of big data projects.
One of the challenges with big data research is that there is no agreed academic definition of big data. A section is devoted to the definitions that previous researchers have contributed and to the historical background of the concept, to create context for the current discussions around big data, such as the existing skills gap.
An emphasis was placed on providing use cases and technical explanations to readers that
may want to gain an understanding of the technologies associated with big data as well as the
practical application of big data analytics.
The original research idea was to create a like-for-like data management environment to
measure the performance difference and gains of big data compared to traditional database
management systems (DBMS). Different components would be tested and swapped to
conclude the optimal technical set up to support big data. This experiment has already been
tried and tested by other researchers and the conclusions have been that the results are
generally biased. Often the results weigh in favor of the sponsor of the study. Due to the
assumption that no true conclusion can be reached in terms of the ultimate combination of
technologies and most favorable commercial opportunity for supporting big data, the
direction of this research changed.
An opportunity appeared to gain insight and know-how from big data associated IT professionals who were willing to share their experiences of big data projects. This dissertation focuses on findings from a survey carried out with 23 big data associated professionals, to help government and education bodies with the effort to provide guidance for big data adopters (Yan, 2013).
Big data definition, history and business context
To understand why big data is an important topic today, it is important to understand the term and its background. The term big data has been traced back to discussions in the 1940s. Early discussions were, just like today, about handling large, complex data sets that were difficult to manage using traditional DBMS. The discussions were led by industry specialists as well as academic researchers. Big data still has no scientific, pragmatic definition today, but efforts to find a clear definition continue (Forbes, 2014).
The first academic definition of big data was submitted in a paper in July 2000 by Francis Diebold of the University of Pennsylvania, in his work in the area of econometrics and statistics. In this paper he states the following:
“Big Data refers to the explosion in the quantity (and sometimes, quality) of available and
potentially relevant data, largely the result of recent and unprecedented advancements in
data recording and storage technology. In this new and exciting world, sample sizes are no
longer fruitfully measured in “number of observations,” but rather in, say, megabytes. Even
data accruing at the rate of several gigabytes per day are not uncommon.”
(Diebold.F, 2000)
A modern definition of big data describes it as the ways of capturing, containing, distributing, managing and analyzing data volumes often above a petabyte, arriving with high velocity and in diverse structures that are not manageable using conventional data management methods. The restrictions are caused by technological limitations. Big data can also be described as data sets that are too large and complex for a regular DBMS to capture, retain and analyze (Laudon, Laudon, 2014).
In 2001, Doug Laney explained in research for META Group that the characteristic of big data was data sets that cannot be managed with traditional data management tools. He also summarized the characteristics in a concept called the ''Three V's'': volume (size of datasets and storage), velocity (speed of incoming data), and variety (data types). Further discussions have led to the concept being expanded into the ''Five V's'', adding veracity (integrity of data) and value (usefulness of data), with complexity (degree of interconnection among data structures) sometimes included as a further dimension (Laney.D, 2001).
Research firm McKinsey also offers their interpretation of what big data is:
“Big data” refers to datasets whose size is beyond the ability of typical database software
tools to capture, store, manage, and analyze. This definition is intentionally subjective and
incorporates a moving definition of how big a dataset needs to be in order to be considered
big data—i.e., we don’t define big data in terms of being larger than a certain number of
terabytes (thousands of gigabytes). We assume that, as technology advances over time, the
size of datasets that qualify as big data will also increase. Also note that the definition can
vary by sector, depending on what kinds of software tools are commonly available and what
sizes of datasets are common in a particular industry. With those caveats, big data in many
sectors today will range from a few dozen terabytes to multiple petabytes (thousands of
terabytes)’’
(McKinsey&Company,2011)
The big challenge with the definition of big data is the lack of associated measurable metrics, such as a minimum data volume or a particular data format. The common understanding today is that big data is linked with discussions around data growth, which in turn is linked with data retention law, globalization and market changes such as the growth of web-based businesses. It often refers to data volumes above a petabyte or an exabyte, but big data can be any amount of data that is complex for the individual organization to manage and analyze.
Why is big data research important?
This research is relevant because big data has never been as business-critical as it is today. Legal pressures and competition are adding to the pressure not just to retain data, but to leverage it for smarter, faster and more accurate decision making. The ability to process historical data to analyze patterns and trends, and to uncover previously unknown facts, provides a more holistic view for decision makers.
Decision makers see value in the ability to leverage larger sets of data, which gives them granular analysis to validate decisions. The sort of information that organizations look for can be contrasting information, discrepancies in data, or evidence of quality and credibility. The rationale behind the concept of big data is simple: the more evidence gathered from current and historical data, the easier it is to turn a theory into fact, and the higher the probability that what the data shows is conclusive. It sounds like a simple task: gather some data and use a state-of-the-art Business Intelligence (BI) solution to find information. It has proven not to be easy, as management of larger sets of data is often time consuming, resource heavy and in many cases expensive (Yan, 2013).
Big data can help improve prediction, improve efficiency and create opportunities for cost reductions (Columbus, 2012). The inability to find information in a vast set of data can affect competitiveness and hamper progress, as decision makers lack the facts to support justification. Yet organizations struggle to justify the investment in big data despite awareness of the positive impact that big data analytics can have.
Big data issues
Decision makers find it difficult to decide on budget allocation for big data. Is big data an IT
matter that should be invested in using IT budget? Or is big data a marketing and sales
matter? Perhaps big data is a leadership matter that should be funded by operations
management budget? There is no wrong or right answer. Another issue that decision makers struggle with is defining the measurements and key performance indicators (KPIs) used to assess potential and results. What defines return on investment (ROI) can be difficult to establish, and results often cannot be proven before the investment is already made (Capgemini, 2014).
Performance-demanding applications and internet-based applications, along with data retention laws, have forced software developers to rethink the way software is developed and the way data management is carried out. Business processes are today often data driven, and business decisions often rely on business intelligence and analytics for justification. There are global pressures around accountability and an emphasis on the importance of historical documentation to assure compliance and best practice.
Governments are currently working to narrow the technology knowledge gaps associated with big data. They are also working to provide guidelines, policies and standards, and to enforce regulations for the use of big data technologies (Yan, 2013). Moral issues around big data mainly concern legislation and privacy laws. Experts worry that if larger data volumes are retained, the risk is higher should the data be compromised. There are few internationally agreed standards for data management. The lack of legislation around web data in particular can lead to misuse.
Big data is subject to laws such as the Data Protection (Amendment) Act 2003 and the ePrivacy Regulations 2011. However, these do not give much guidance in terms of best practice. Data sources such as social media are also very loosely regulated (Data Protection Commissioner, 2014). New legislation around security and accounting law requires organizations to retain email archives for longer than before; in the US, for example, it is five years (Laudon, Laudon, 2014).
Organizations are challenged with preparing legacy systems and traditional IT environments for big data adoption. If, for example, a company struggles with data quality or with poor implementation results from previous hardware and software, a big data investment would be ineffective. To ensure success, knowledgeable management is needed. McKinsey states that the US needs 1.5 million data-knowledgeable managers to take advantage of the potential that big data brings, along with 140,000-190,000 analytical professionals (McKinsey&Company, 2011).
In a study commissioned by ITAC in 2002, the most sought-after IT skills were identified: SQL Server, SQL Windows, IT security, Windows NT Server, Microsoft Exchange and wide area network skills topped the list (McKeen, Smith, 2004). Just 12 years later, the demand looks very different, with advanced analytics, cloud, mobility technology and web skills at the forefront of discussions. All of these skills are relevant for big data projects.
Big data opportunities
Researcher David J. Teece discusses competitive advantage in his publication Managing Intellectual Capital, in a chapter called The Knowledge Economy. He points out that competitive advantage has transformed as a concept with the emergence of advanced information technology. Already in 1981 he stated that ''economic prosperity rests upon knowledge'', and it is fair to say today, 33 years later, that history shows he was accurate in his statement (Teece, 2002). The complex business issues that have been solved through big data analytics are testament to the importance of using technology for innovation, and innovation for business gains.
Steve Ellis explained in 2005 that knowledge-based working is when intellectual assets are used collectively to create a superior ability to meet market challenges before the competition. The emphasis is on moving away from tacit knowledge, which is knowledge held only by an individual for individual task completion (Ellis, 2005). It has been shown that explicit knowledge benefits organizations, as it leaves them less vulnerable to staff turnover and change management issues when intelligence is widely accessible (Dalkir, 2005). Knowledge-based working requires a change of approach to organizational operations. This change can be supported only through faster, deeper and more accurate intelligence gathering, which is something that big data analytics can provide. With the use of big data, knowledge-based working can be applied optimally.
Organizations seek predictability for stability and sustainability. The ability to see ahead provides security and the ability to introduce initiatives that help avoid risks, as well as initiatives that leverage the opportunities that change can bring. The demand for insight from web traffic, growing email messaging, social media content and information from connected machines with sensors, such as GPS products, mobile devices and point-of-sale devices, drives data growth. The constant flow of large volumes of data drives organizations to invest in new data management tools in order to capture, store and gain business intelligence through analytics of larger sets of data.
Big data has many known use cases. Most commonly it is used by governments or associated agencies for things like national statistics, weather forecasting, traffic control, fraud prevention, disaster prevention, finance management, and the management of national education, national security and health care. There are also many use cases in the private sector, such as retail, banking, manufacturing, wholesale, distribution, logistics, the communications industry and utilities. In short, there is a use case in most sectors (Yan, 2013).
Gartner Inc. estimated in 2012 that organizations would spend 28 billion USD on big data that year and that the figure would rise to 56 billion USD by 2016. Market revenues are projected to be 47.5 billion USD by 2017. According to Gartner, the general profile of a big data user is an organization with a database larger than 1.5TB and a data growth rate of 20% per year.
Research firm McKinsey estimates a projected 40% year-on-year growth in data generated by the US. Adoption of big data and improved data management could, for example, help reduce US health care expenditure by 8%, and retailers can expect a possible 60% increase in operating margin if they adopt big data (McKinsey&Company, 2011).
Peter Sondergaard, Senior Vice President and Global Head of Research at research company Gartner, stated in 2013 that "By 2015, 4.4 million IT jobs globally will be created to support big data" (Gartner, 2012). Another demonstration of the growing interest can be found in statistics provided by Google, which show a 90% increase in big data searches between 2011 and March 2014 (Google, 2014).
Use case- US Government
The US Government formed a new organization called the Big Data Senior Steering Group
in 2010, consisting of 17 agencies to support research and development. Two years later the
Big Data Research and Development Initiative was provided a 200 million USD budget to
accelerate the technical science and engineering effort of big data for the purpose of
improved national security.
In 2013 the Big Data Community of Practice was founded, which is a collaboration between the US government and big data communities. This was followed by the Big Data Symposium, which was founded to promote big data awareness. Furthermore, significant investments have been made to support higher education programs that train data scientists to cover the existing knowledge gap around big data (The White House, 2012).
An example of the benefits that have been seen is the case of the Internal Revenue Service in the US, which has documented a decrease in the time spent loading tax returns from over four months in 2005 to ten hours in 2012, achieved through big data initiatives (Butler.J, 2012).
Big data from a technical perspective
To understand big data, it is helpful to understand data from a corporate management point of view, along with the associated concerns. When decision makers review analytics, one of the main tasks is also to review the accuracy, completeness, validity and consistency of the data (Chaffey, Wood, 2005). One big threat to any analytics initiative is the lack of access to usable data. Poor data is a threat to companies and inhibits organizations from leveraging analytics. Big data depends on high-quality data, but it can also be used to spot data discrepancies.
Historically, businesses have been constrained by a lack of processing power, but this has changed today due to decreasing costs for hardware and processing. The new constraint is growing data volumes that hamper efficient data management. To be able to leverage big data, organizations need to prepare systems to take full advantage of the opportunity that big data brings. Data needs to be prepared, and existing IT systems need the capability not just to handle the data volume but also to maintain the running of business applications. Organizations worry about latency, faultiness, lack of atomicity, consistency, isolation and durability (ACID), security, and access to skilled staff that can manage the data and systems (Yan, 2013).
Data management issues
One common cause of IT issues is architectural drift, which is when the implementation of software deviates from the original architectural plan over time and causes complexity and confusion. Experts point out that complexity can be caused by a lack of synchronization, standardization and awareness as new code is developed and data models change. Furthermore, architects are often reluctant to revisit problematic areas due to time constraints, and sometimes due to lack of skills or motivation.
For data to be used efficiently, the data modelling structure needs to consist of logic that determines rules for the data. Experts talk about entities, attributes and relationships. Data modelling enables identification of relationships between data, and the model defines the logic behind the relationships and the processing of the data in a database. One example is data modelling for data storage: the model will determine which data will be stored, how it will be stored and how it can be accessed (Rainer, Turban, 2009).
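To make the entity-attribute-relationship idea concrete, the minimal Python sketch below defines two example entities, a customer and an order, with their attributes and the relationship between them. The table and column names are invented purely for illustration and do not refer to any system discussed in this study.

```python
import sqlite3

# In-memory database used purely to illustrate entity, attribute and relationship.
conn = sqlite3.connect(":memory:")

# Entity: customer, with its attributes as columns.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        surname     TEXT NOT NULL,
        address     TEXT
    )""")

# Entity: customer_order. The foreign key expresses the relationship
# 'a customer places many orders', i.e. the rule the model enforces.
conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        product     TEXT,
        price       REAL
    )""")

# The model determines which data is stored, how it is stored and how it is accessed.
conn.execute("INSERT INTO customer VALUES (1, 'Smith', 'Dublin')")
conn.execute("INSERT INTO customer_order VALUES (100, 1, 'Widget', 9.99)")

query = """
    SELECT c.surname, o.product, o.price
    FROM customer c JOIN customer_order o ON o.customer_id = c.customer_id
"""
for row in conn.execute(query):
    print(row)
```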
Data is often managed at different stages and often in several places as it can be scattered
across an organization and be managed by multiple individuals, leaving room for
compromise of data integrity. One big issue is data decay, which is an expression used to
explain data change. An example could be a change of a customer surname, change of
address or an update of product pricing.
Multiple conditions can affect data management, such as poorly written software, software incompatibility, and hardware failures caused by, for example, insufficient storage space affecting the running of software. Data can also be affected by operator errors, for example the wrong data being entered, or a script instructing the computer to do the wrong thing, which can affect mainframes and minicomputers and cause issues with batch jobs. All of these issues can cause downtime and disruption to business operations. Hardware also affects data management: a computer can fail to run a backup or install new software, or struggle with multiple tasks such as processing real-time data at the same time as restoring files to a database. This can affect multiple simultaneous tasks, causing confusion and time loss (Chartered Institute of Bankers, 1996).
Big data discussions are directly linked to data management and database discussions. Databases enable management of business transactions from business applications. Using a DBMS enables data redundancy, which helps avoid losing data, through storage of the data at different locations. It also helps with data isolation, as a precaution that enables the assignment of access rights for security. By using one database, inconsistencies can be avoided: it can act as a single point of truth, rather than having different sets of the same data that can be subject to discrepancies.
Previously, transactional data was the most common data, and due to its simple structured format it could easily be stored in the rows and columns of a relational database. However, with the introduction of large volumes of web data, which is often unstructured or semi-structured, traditional relational databases no longer suffice to manage the data. The data can no longer be organized in columns and rows, and the volume adds additional strain on traditional database technologies. Big data enables management of all types of data formats, including images and video, which makes it suitable for modern business analytics (Harry, 2001).
1.1 Data structures
One issue associated with big data is the management of different data structures. The lack of standardization makes data difficult to process. As mentioned in the section above, the introduction of large volumes of data makes it difficult for traditional DBMS to organize the data in columns and rows. There are several types of data formats, such as structured transactional data, loosely structured data as found in social media feeds, complex data such as web server log files, and unstructured data such as audio and video files. The data mix can also be referred to as an enterprise mashup: the integration of heterogeneous digital data and applications from multiple sources, used for business purposes.
The difficulty with web data is that it was inserted without following rules, such as the rules that a database administrator would follow as standard. Data can generally be divided into three categories: structured, unstructured and semi-structured. Structured data is often described as data that is modelled in a format that makes it easy to shape and manage. The reason it may be easier to manage is that formal data modelling techniques that are considered standard have been applied. A familiar example of a solution based on structured data is an Excel spreadsheet.
The opposite of structured data is unstructured data, which is difficult to define as it is both language based and non-language based, for example pictures, audio and video. Popular websites like Twitter, Amazon and Facebook contain a high volume of unstructured data, which can make reporting and analysis difficult due to the mixture of data and the difficulty of translating images and video, for example, into text to make the items easier to search for (Laudon, Laudon, 2014).
Semi-structured data is a combination of structured and unstructured data: the data does not fit into fixed fields but does contain some sort of identifier, tag or marker that gives it a unique identity. In a scenario of building a database with this sort of data set, parts of it would be easier to manage than others. The online companies mentioned above, along with the likes of LinkedIn, Google and Yahoo.com, will all have databases containing this sort of data. XML and HTML tagged text is an example of semi-structured data (McKinsey, 2011).
To give an example of the importance of data structure, consider the following scenario. If UK retailer Tesco wanted to release a new product on its site, it would decide on the order and structure of the data associated with that product in advance, to enable quick search and reporting relating to that product. The product would have attributes such as color, size, salt content and price, inserted in a structure, order and format that makes the associated data easier to manage than, for example, data from a public online blog post. The ability to identify the individual product is critical to being able to analyze sales and marketing associated with the product. If the product is not searchable, opportunities can be missed (Laudon, Laudon, 2014).
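As an illustration of the three categories, the short Python sketch below shows hypothetical product-related content in structured, semi-structured and unstructured form. The field names and values are invented for the purpose of the example and are not drawn from any real retailer.

```python
import json
import xml.etree.ElementTree as ET

# Structured: fixed fields agreed in advance, easy to query and report on.
structured_row = ("SKU-1234", "Crisps", "salted", 1.15)  # (product id, name, variant, price)

# Semi-structured: no fixed schema, but tags and identifiers give each value meaning.
semi_structured = json.loads('{"sku": "SKU-1234", "reviews": [{"stars": 4, "text": "Tasty"}]}')
xml_fragment = ET.fromstring("<product><sku>SKU-1234</sku><colour>red</colour></product>")

# Unstructured: free text, images, audio or video; here simply a customer comment.
unstructured = "Loved these, but the packet was half empty when it arrived!"

print(structured_row[3])                       # price is always in a known position
print(semi_structured["reviews"][0]["stars"])  # located via its tag, not a fixed column
print(xml_fragment.find("colour").text)        # XML tagged text is semi-structured
print(len(unstructured.split()))               # little beyond word counts without further processing
```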
1.2 Data warehouse and data mart
The traditional transactional DBMS is at the core of big data, but it does not allow retrieval of optimal analytics in the same way as data warehousing does. Data warehouses have been used for over 20 years to consolidate data from business applications into a single repository for analytical purposes. Many businesses use the data warehouse as the source of truth for validation and data quality management. Data warehouses provide ad-hoc and standardized query tools, analytical tools and graphical reporting capability.
Data warehouse technologies started to be widely promoted in the 1990s, a little while before ERP systems were introduced. Just like today, the associated data consisted of feeds from a transactional database. The addition today is that the data can also be fed from an analytical database, and faster than ever before, using in-memory transactional capability. Previously, data warehousing was not used for daily transactions; this is shifting with the introduction of real-time data processing.
Data warehouse management requires significant upfront development and effort to be able to provide value. A common implementation project would run as follows (a minimal sketch of these steps is shown after the list):
Create a business model of the data
Create the logical data definition (schema)
Create the physical database design
Create extract-transform-load (ETL) processes to clean, validate and integrate the data
Load the data into the data warehouse
Ensure the format conforms to the model created
Create business views for data reporting
(Winter Corporation, 2013)
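The following minimal Python sketch illustrates the extract, transform and load steps listed above using an in-memory SQLite database. The source rows, table names and cleansing rules are invented assumptions for illustration rather than a real project configuration.

```python
import sqlite3

# In-memory stand-in for the physical warehouse design (the target schema).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (sale_date TEXT, store TEXT, amount REAL)")

def extract():
    # Extract: in practice this would read from an operational system or a file export;
    # a small in-memory list stands in for the source here.
    return [
        {"date": "2014-08-01", "store": "dublin", "amount": "120.50"},
        {"date": "2014-08-01", "store": "cork", "amount": "not a number"},  # fails validation
    ]

def transform(rows):
    # Transform: cleanse and validate so the data conforms to the model.
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # discard rows that fail validation
        yield (row["date"].strip(), row["store"].strip().upper(), amount)

def load(rows):
    # Load: insert the conformed rows into the warehouse.
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    warehouse.commit()

load(transform(extract()))

# A 'business view' for reporting, as in the final step listed above.
warehouse.execute(
    "CREATE VIEW sales_by_store AS SELECT store, SUM(amount) AS total FROM sales_fact GROUP BY store"
)
for row in warehouse.execute("SELECT * FROM sales_by_store"):
    print(row)
```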
One of the most common data sources for data warehouses is ERP data. ERP systems feed data warehouses and vice versa. Many organizations use Enterprise Resource Planning (ERP) solutions to consolidate business applications onto one platform. An ERP system provides the ability to automate a whole business operation and retrieve reports for business strategy. The system also provides a ready-made IT architecture and is therefore very relevant to big data.
The most important data that makes up a data warehouse is metadata, which can be described as data about the data. It provides information about all the components that contribute to the data: relationships, ownership, source, and who can access the data. Metadata is critical as it gives the data meaning and relevance; without it, a data warehouse is of no value. Data warehouse data needs to be readable and accurate to be useful, particularly in relation to big data analytics, as it would defeat the purpose of the use case if the information provided was questionable (McNurlin, Sprague, 2006).
Users rely on ETL (extract, transform and load) tools for data uploading. This uploading process, or its opposite, data extraction, can be tedious. However, the biggest issue around data warehouses is search time, because queries have to search across a larger data set. Sometimes organizations want segmented data warehouses, for example to enable faster search, minimize data access, or separate divisions or areas of interest. In those cases data marts can be used: a data mart is a subset of a data warehouse, stored on a separate database. The main issue around data marts is ensuring that their metadata is unified with the metadata in the data warehouse, so that all the data uses the same definitions; otherwise there will be inconsistencies in the information gathered (Hsu, 2013).
As the data volume grows and becomes big data and new tools are introduced to manage the
data, data warehousing remains part of the suite of tools used for processing of big data
analytics.
Big data management tools
The process flow of big data analytics is data aggregation, data analysis, data visualization and then data storage. At present no single package or solution tends to fulfill all requirements, and therefore organizations often use solutions from multiple vendors to manage big data. This can be costly, especially if decision makers do not have enough insight into cost-saving options.
The key tools needed to manage big data, apart from a data warehouse, are tools that enable semi-structured and unstructured data management and that can support huge data volumes simultaneously. The main technologies that need consideration for big data, in comparison to traditional DBMS, are storage, computing and processing capability, and analytical tools.
The most critical element of a big data system is data processing capability. This can be helped by using a distributed system, in which multiple computers communicate through a network, allowing tasks to be divided across machines and giving superior performance at a lower cost. This is because a cluster of lower-end computers can be cheaper than one more powerful computer. Furthermore, distributed systems allow scalability by adding nodes, in contrast to the replacement of the central computer that would be necessary for expansion where only one computer is used. This technology is what enables cloud computing.
Big data analytics tools and Hadoop
To enable advanced statistics, big data adopters use the programming languages R and pbdR for the development of statistical software. The R language is the standard amongst statisticians and developers of statistical software. Another program used is Cassandra, an open-source DBMS designed to manage large data sets on a distributed system. The Apache Software Foundation currently manages the project, although it was originally developed by Facebook.
The most important of all big data analytics tools is Hadoop. Yahoo.com originally developed Hadoop, but it is today managed by the Apache Software Foundation. Hadoop has no proprietary predecessor and has been developed through contributions from the open-source community. The software enables simultaneous processing of huge data volumes across multiple computers by creating subsets that are distributed across thousands of processing nodes, and then aggregating the data into smaller data sets that are easier to manage and use for analytics.
Hadoop is written in Java and built on four modules. It is designed to process data sets across clusters of machines using simple programming models. It can scale up to thousands of servers, each offering local computation and storage, so users do not have to rely extensively on hardware for high availability. The software library itself can detect and handle failures at the application layer, which means that there is a backup for clusters (McKinsey&Company, 2011).
Through Hadoop data processing, semi-structured and unstructured data is converted into structured data that can be read in different formats depending on the analytics solution used (Yan, 2013). Each Hadoop cluster has a special Hadoop file system. A central master node spreads the data across the machines in a file structure. It uses a hash algorithm to cluster data with similarity or affinity, and all data has a three-fold failover plan to ensure that processing is not disrupted if hardware fails (Marakas, O'Brien, 2013).
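The short Python sketch below illustrates the general idea of hash-based placement with three-fold replication. It is a simplified illustration of the principle rather than Hadoop's actual placement algorithm, and the node and block names are invented.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]  # invented node names
REPLICAS = 3  # analogous to the three-fold failover described above

def placement(block_key):
    # Hash the block key to a starting node, then keep further copies on the
    # following nodes so the loss of a single machine is not fatal.
    start = int(hashlib.md5(block_key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

for block in ["weblog-part-0000", "weblog-part-0001", "weblog-part-0002"]:
    print(block, "->", placement(block))
```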
The Hadoop system captures data from different sources, stores it, cleanses it, distributes it, indexes it, transforms it, makes it available for search, analyses it and enables a user to visualize it. When unstructured and semi-structured data is transformed into structured data, it is easier to consume. Imagine going through millions of online videos and images in an effort to uncover illegal content, such as inappropriately violent content or child pornography, and being able to find that content automatically rather than manually. Hadoop enables a ground-breaking ability to make more sense out of content.
Hadoop consists of several services: the Hadoop Distributed File System (HDFS), MapReduce and HBase. HDFS is used for data storage and interconnects the file systems on the numerous nodes in a Hadoop cluster to turn them into one larger file system. MapReduce enables advanced parallel data processing and was inspired by the Google File System and Google's MapReduce system; it breaks down the processing and assigns work to the various nodes in a cluster. HBase is Hadoop's non-relational database, which provides access to the data stored in HDFS and is also used as a transactional platform on which real-time applications can sit.
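To illustrate the MapReduce idea, the sketch below shows the word-count example often used to introduce Hadoop: a map step emits key-value pairs and a reduce step aggregates them. In a real Hadoop job the two functions would run as distributed tasks over data in HDFS (for example via Hadoop Streaming); here they are chained in a single Python process purely for illustration, and the sample input lines are invented.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map: break each input line into (word, 1) pairs.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce: pairs arrive grouped by key (Hadoop sorts between the two phases);
    # here they are sorted explicitly, then the counts for each word are summed.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big tools", "hadoop processes big data"]
    for word, count in reducer(mapper(sample)):
        print(f"{word}\t{count}")
```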
From a cost perspective, Hadoop is favorable. It is open source and runs on clusters of mostly inexpensive servers, and processors can be added and removed as needed. However, one area that can be costly is the tools used for inserting and extracting data to enable analytics within Hadoop (Laudon, Laudon, 2014).
As mentioned above, a Hadoop license is free of charge and only requires hardware for the Hadoop clusters. The administrator only needs to install HDFS and MapReduce, transfer data into a cluster and begin processing the data in the analytics environment that has been set up. The area that can be problematic is the configuration and implementation of the cluster, which can be costly if an organization does not have the skills in-house.