Given below is a listing of key Big Data terms you should know, with a very brief, plain-language explanation of each. Hope you find it useful.
1. Hadoop: System for processing very large data sets
2. HDFS or Hadoop Distributed File System: For storing large volumes of data (key elements – DataNodes and a NameNode)
3. MapReduce: Think of it as the assembly-level language of distributed computing. Used for computation in Hadoop
4. Pig: Developed by Yahoo. It is a higher level language than MapReduce
5. Hive: Higher level language developed by Facebook with SQL like syntax
6. Apache HBase: For real-time access to Hadoop data
7. Accumulo: A key/value store similar to HBase, with added features such as cell-level security
8. Avro: Data serialization system (comparable to Protocol Buffers)
9. Apache ZooKeeper: Distributed co-ordination system
10. HCatalog: Table and storage management layer that exposes Hive's metastore to tools like Pig and MapReduce
11. Oozie: Workflow scheduling system for Hadoop jobs, developed by Yahoo
12. Flume: Log aggregation system
13. Whirr: For automating the provisioning of Hadoop clusters in the cloud
14. Sqoop: For transferring structured data between relational databases and Hadoop
15. Mahout: Machine learning on top of MapReduce
16. Bigtop: Integrates multiple Hadoop sub-systems into one stack that works as a whole
17. Crunch: Java API on top of MapReduce that simplifies tedious tasks like joins and data aggregation.
18. Giraph: Used for large scale distributed graph processing
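Since MapReduce underpins most of the tools above, here is a minimal sketch of its two phases, using plain Python (no Hadoop cluster) purely to illustrate the programming model with the classic word-count example:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts per word."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

docs = ["big data is big", "data is everywhere"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

On a real cluster, Hadoop distributes the map tasks across DataNodes and handles the shuffle between phases; higher-level tools like Pig and Hive compile down to jobs of this shape.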
Also embedded below is an excellent TechTalk by Jakob Homan of LinkedIn explaining these terms.
Posted in Big Data
Tagged Big Data
With the hype surrounding Big Data and the current focus on tools and technologies such as Hadoop, it is easy to forget that the success of any technology project rests more on strategy than on technology and tools. That is true even for Big Data solutions.
Architects and managers implementing Big Data solutions would do well to remember that to truly leverage and derive insights from Big Data, it is important to have a Master Data Management (MDM) solution in place, with a repository of the relevant non-transactional data entities (also known as master data).
For example, if an organization wants to leverage social media data for better sales, marketing or customer support, it is important that a master database of all customers and prospects is in place, with the social media profiles/handles of each customer. Master Data Management (MDM) “comprises a set of processes, governance, policies, standards and tools that consistently defines and manages the master data of an organization” (for more, see this).
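To make the example concrete, here is a hypothetical sketch (the customer IDs, handles and field names are invented for illustration) of why that master database matters: incoming social media events carry only a handle, and the master data is what lets you resolve each event back to a single golden customer record:

```python
# Hypothetical master data: one golden record per customer,
# with the social media handle captured as a master data attribute.
master = {
    "C001": {"name": "Jane Doe", "twitter": "@janedoe"},
    "C002": {"name": "John Roe", "twitter": "@jroe"},
}

# Incoming social media events are keyed only by handle.
events = [
    {"handle": "@janedoe", "text": "Loving the new product!"},
    {"handle": "@unknown", "text": "Hello world"},
]

# Build an index from handle to customer ID, then resolve each event.
handle_index = {rec["twitter"]: cid for cid, rec in master.items()}
resolved = [(handle_index.get(e["handle"]), e["text"]) for e in events]
print(resolved)  # [('C001', 'Loving the new product!'), (None, 'Hello world')]
```

Without the master repository, every event looks like the second one: data you cannot attribute to any customer, and therefore cannot act on.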
Trying to implement a Big Data solution without a repository of relevant master data is a recipe for disaster in my opinion. What to you think? Do you agree that MDM is key to Big Data success? Please share your thoughts:
Force multiplier, a noun, means something that increases the effect of a force. In military usage, force multiplication refers to “an attribute or a combination of attributes which make a given force more effective than that same force would be without it” (for more, see this).
Big Data, which is characterized by the three Vs (Volume, Variety and Velocity), can be a major force in the running of any large or medium-sized business, as it adds tremendous value by improving the quality of decision making. Thanks to the Big Data revolution, it is possible to process large volumes of structured and unstructured data in real time and derive insights from large data sets. This by itself is a huge improvement over the pre-Big Data era.
What’s even better is that predictive analytics makes it possible not only to analyze the past, but also to predict the future with a high degree of confidence. For example, social media data can enrich risk modeling and help auto insurance companies build a much better risk profile of an individual. Car sensor data can help insurers better assess the risk posed by a driver’s habits (such as speeding, fast acceleration or hard braking) and offer an auto insurance policy tailored to that specific individual, with premiums set at the individual level (not at a zip code or city level).
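As a toy illustration of that idea (the event names, weights and pricing rule below are entirely hypothetical, not any insurer's actual model), telematics events from a car's sensors could feed a per-driver risk score that adjusts an individual premium:

```python
def risk_score(events):
    """Toy risk score: sum hypothetical weights for risky driving events."""
    weights = {"speeding": 3.0, "hard_braking": 2.0, "fast_acceleration": 1.5}
    return sum(weights.get(e, 0.0) for e in events)

def individual_premium(base_premium, events):
    """Adjust a base premium using the driver's own sensor data,
    rather than a zip-code or city-level average."""
    return base_premium * (1 + risk_score(events) / 100)

trip_events = ["speeding", "hard_braking", "speeding"]
print(round(individual_premium(1000.0, trip_events), 2))  # 1080.0
```

A real model would of course be learned from historical claims data rather than hand-set weights, but the structure is the same: individual behavior in, individual price out.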
Another good example is the assessment of customer lifetime value (CLV). Using Big Data, companies can arrive at a much better assessment of CLV. Better still, predictive modeling can be applied to social media or sensor data to refine that estimate, so that companies can better target customers with high CLV. This has been used very effectively in the travel and hospitality industry.
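One common simple formulation of CLV multiplies average purchase value, purchase frequency and expected customer lifespan, scaled by profit margin. A sketch with made-up numbers (a hypothetical hotel guest, in keeping with the travel and hospitality example):

```python
def simple_clv(avg_purchase_value, purchases_per_year, expected_years, margin=0.2):
    """Simple CLV: revenue per year times expected lifespan, scaled by margin."""
    return avg_purchase_value * purchases_per_year * expected_years * margin

# Hypothetical guest: $300 average stay, 4 stays/year, 10-year horizon, 20% margin.
print(simple_clv(300, 4, 10))  # 2400.0
```

The role of predictive modeling is to replace the static inputs (stays per year, expected lifespan) with per-customer predictions driven by behavioral and social data.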
The point I want to highlight here is that Big Data is a revolution in itself, as it enables organizations to identify, store, process and analyze data sets from outside the organization in a way that was not possible before. Add predictive analytics to the mix and it pushes Big Data capability to a whole new level: a true force multiplier. Don’t you agree?
The question is how many large and medium-sized companies are in a position not only to take advantage of the Big Data revolution, but also to effectively leverage predictive analytics for better insights and decision making. Not too many, in my opinion. What do you think? Please do share your opinion: