My thinking loud: July 2016

Let's start with the basics. Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as relational database management systems (RDBMS). The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have shown otherwise. Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. You can think of the relationship between big data and data science as being like the relationship between crude oil and an oil refinery. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.

The characteristics of big data are often referred to as the three Vs:
Volume—How much data is there?
Variety—How diverse are different types of data?
Velocity—At what speed is new data generated?
Often these characteristics are complemented with a fourth V, veracity: How accurate is the data? These four properties make big data different from the data found in traditional data management tools. Consequently, the challenges they bring can be felt in almost every aspect: data capture, curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized techniques to extract the insights.

Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics. In a research note from Laney and Kart, Emerging Role of the Data Scientist and the Art of Data Science, the authors sifted through hundreds of job descriptions for data scientist, statistician, and BI (Business Intelligence) analyst to detect the differences between those titles. The main things that set a data scientist apart from a statistician are the ability to work with big data and experience in machine learning, computing, and algorithm building. Their tools tend to differ too, with data scientist job descriptions more frequently mentioning the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others. Don’t worry if you feel intimidated by this list; most of these will be gradually introduced here, though we’ll focus on Python. Python is a great language for data science because it has many data science libraries available, and it’s widely supported by specialized software. For instance, almost every popular NoSQL database has a Python-specific API. Because of these features and the ability to prototype quickly with Python while keeping acceptable performance, its influence is steadily growing in the data science world. As the amount of data continues to grow and the need to leverage it becomes more important, every data scientist will come across big data projects throughout their career.

Data science and big data are used almost everywhere in both commercial and noncommercial settings. The number of use cases is vast, and the examples we’ll provide throughout this book only scratch the surface of the possibilities. Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products. Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings. A good example of this is Google AdSense, which collects data from internet users so relevant commercial messages can be matched to the person browsing the internet. MaxPoint (http://maxpoint.com/us) is another example of real-time personalized advertising. Human resource professionals use people analytics and text mining to screen candidates, monitor the mood of employees, and study informal networks among coworkers. People analytics is the central theme in the book Moneyball: The Art of Winning an Unfair Game. In the book (and movie) we saw that the traditional scouting process for American baseball was random, and replacing it with correlated signals changed everything. Relying on statistics allowed the team to hire the right players and pit them against the opponents where they would have the biggest advantage.

Financial institutions use data science to predict stock markets, determine the risk of lending money, and learn how to attract new clients for their services. At the time of writing this book, at least 50% of trades worldwide are performed automatically by machines based on algorithms developed by quants, as data scientists who work on trading algorithms are often called, with the help of big data and data science techniques.

Governmental organizations are also aware of data’s value. Many governmental organizations not only rely on internal data scientists to discover valuable information, but also share their data with the public. You can use this data to gain insights or build data-driven applications. Data.gov is but one example; it’s the home of the US Government’s open data. A data scientist in a governmental organization gets to work on diverse projects such as detecting fraud and other criminal activity or optimizing project funding. A well-known example was provided by Edward Snowden, who leaked internal documents of the American National Security Agency and the British Government Communications Headquarters that show clearly how they used data science and big data to monitor millions of individuals. Those organizations collected 5 billion data records from widespread applications such as Google Maps, Angry Birds, email, and text messages, among many other data sources. Then they applied data science techniques to distill information.

Nongovernmental organizations (NGOs) are also no strangers to using data. They use it to raise money and defend their causes. The World Wildlife Fund (WWF), for instance, employs data scientists to increase the effectiveness of their fundraising efforts. Many data scientists devote part of their time to helping NGOs, because NGOs often lack the resources to collect data and employ data scientists. DataKind is one such data scientist group that devotes its time to the benefit of mankind.

Universities use data science in their research but also to enhance the study experience of their students. The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities to study how this type of learning can complement traditional classes. MOOCs are an invaluable asset if you want to become a data scientist and big data professional, so definitely look at a few of the better-known ones: Coursera, Udacity, and edX. The big data and data science landscape changes quickly, and MOOCs allow you to stay up to date by following courses from top universities. If you aren’t acquainted with them yet, take time to do so now; you’ll come to love them as we have.

In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:

Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
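
To make a few of these categories concrete, here is a minimal Python sketch. All sample values (the names, the log line and its "timestamp level message" layout, the small friendship graph) are made up for illustration and use only the standard library; real projects would typically reach for dedicated tools for each data type.

```python
import csv
import io
import re

# Structured data: fixed fields that fit neatly into a table (hypothetical rows).
structured_csv = "name,age\nAlice,34\nBob,29\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Natural language: free text that has to be tokenized before analysis.
text = "Big data is difficult to process with traditional tools."
tokens = re.findall(r"[a-z]+", text.lower())

# Machine-generated data: a log line in an assumed "timestamp level message" layout.
log_line = "2016-07-02T10:15:00 ERROR disk full"
timestamp, level, message = log_line.split(" ", 2)

# Graph-based data: an adjacency list describing who knows whom.
graph = {"Alice": ["Bob"], "Bob": ["Alice", "Carol"], "Carol": ["Bob"]}
degree = {person: len(friends) for person, friends in graph.items()}

print(rows[0]["name"], tokens[:3], level, degree["Bob"])
```

Each category needs a different first step: parsing columns, tokenizing text, splitting fields, or walking relationships. That is why the tools differ so much from one data type to the next.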

The data science process typically consists of six steps, as you can see in the mind map. We will introduce them briefly here and handle them in more detail later.

Setting the research goal
Collecting or retrieving data
Data preparation and validation
Data exploration and analysis
Data modeling or model building
Presentation and automation
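
The six steps above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the research question, the ad-spend and sales numbers, and the choice of a least-squares line as the model are all invented for the example.

```python
import statistics

# 1. Setting the research goal (hypothetical): does ad spend predict sales?

# 2. Retrieving data: a made-up in-memory data set; None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8)]

# 3. Data preparation and validation: drop rows with missing values.
clean = [(x, y) for x, y in raw if y is not None]
spend = [x for x, _ in clean]
sales = [y for _, y in clean]

# 4. Data exploration: look at simple summary statistics first.
mean_spend = statistics.mean(spend)
mean_sales = statistics.mean(sales)

# 5. Data modeling: fit a least-squares line, sales = slope * spend + intercept.
sxy = sum((x - mean_spend) * (y - mean_sales) for x, y in clean)
sxx = sum((x - mean_spend) ** 2 for x in spend)
slope = sxy / sxx
intercept = mean_sales - slope * mean_spend

# 6. Presentation: report the result in plain language.
print(f"Each extra unit of ad spend adds about {slope:.2f} units of sales.")
```

In a real project each step would be far larger, but the order of the steps and the hand-off from one to the next stay the same.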

This is an iterative process. The previous description of the data science process may give you the impression that you walk through it in a linear way, but in reality you often have to step back and rework certain findings. For instance, you might find outliers in the data exploration phase that point to data import errors. As part of the data science process you gain incremental insights, which may lead to new questions. To limit rework, make sure that you scope the business question clearly and thoroughly at the start.
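
As a small illustration of how exploration can send you back to an earlier step, the sketch below flags outliers with the common 1.5 x IQR rule. The ages are hypothetical, and the value 999 stands in for the kind of placeholder an import error might leave behind; the rule itself is one convention among several, not the only way to detect outliers.

```python
import statistics

# Hypothetical ages loaded from a file; 999 looks like an import placeholder.
ages = [23, 25, 31, 35, 29, 41, 38, 27, 999, 33]

# Quartiles and interquartile range (IQR) of the sample.
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < low or a > high]
cleaned = [a for a in ages if low <= a <= high]

print("Suspicious values to trace back to the import step:", outliers)
```

Finding a value like this during exploration is the signal to return to data retrieval or preparation and fix the import, rather than to model around it.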

In the next blog post we will look at "The big data ecosystem and data science". Feel free to share your suggestions at RavindraPande@gmail.com.

Saturday, July 2, 2016

Big Data to Machine Learning