

From time to time I ponder what the future might hold, and I am sure many of us picture the science-fiction scene where the future data analyst is handed a pair of holographic gloves and performs three-dimensional analysis. Let's stop daydreaming and get back to basics. Dashboards have come a long way. These days many vendors are catering to consuming big data and the like, but I remember the days when the common target of BI vendors was "Excel". The focus seems to have shifted entirely from "Excel as the enemy" to "Big Data as the elephant".
Read more…

Originally posted on Data Science Central

This infographic on Shopper Marketing was created by Steve Hashman and his team. Steve is Director at Exponential Solutions (The CUBE) Marketing. 

Shopper marketing focuses on the customer in and at the point of purchase. It is an integrated and strategic approach to a customer’s in-store experience which is seen as a driver of both sales and brand equity.

For more information, click here


Read more…

Why Your Brain Needs Data Visualization


This is a well-known fact nowadays: a goldfish has a higher attention span than the average Internet user. That's the reason you're not interested in reading huge paragraphs of text. Research by the Nielsen Norman Group showed that Internet users have time to read at most 28% of the words on a web page; most of them read only 20%. Visual content, on the other hand, has the power to hold your attention longer.

If you were just relying on the Internet as a casual user, not reading all the text wouldn't be a problem. However, when you have a responsibility to process information, things get more complicated. A student, for example, has to read several academic and scientific studies and process a huge volume of data to write a single research paper. 65% of people are visual learners, so they find long stretches of text difficult to process, and a pressing deadline may eventually lead the student to hiring the best coursework writing service. If the data is presented visually, however, they will need less time to process it and get their own ideas for the paper.

Let’s explore some reasons why your brain needs that kind of visualization.

1.     Visual Data Triggers Preattentive Processing

Our low-level visual system needs only about 200-250 milliseconds to accurately detect visual properties. That capacity of the brain is called pre-attentive processing. It is triggered by colors, patterns, and forms. When you use different colors to create a data visualization, you emphasize the important details, so those are the elements your eye catches first. You then use your long-term memory to interpret that data and connect it with information you already know.

2.     You Need a Visual Tier to Process Large Volumes of Data

When you’re dealing with production or sales, you face a huge volume of data you need to process, compare, and evaluate. If you represented it through a traditional Excel spreadsheet, you would have to invest countless hours looking through the tiny rows of data. Through data visualization, you can interpret the information in a way that makes it ready for your brain to process.

3.     Visual Data Brings Together All Aspects of Memory

The memory functions of our brain are quite complex. We have three aspects of memory: sensory, short-term (also known as working memory) and long-term. When we first hear, see, touch, taste, or smell something, our senses trigger the sensory memory. While processing information, we preserve it in the working memory for a short period of time. The long-term memory function enables us to preserve information for a very long time.

Visual data presentation connects these three memory functions. When we see the information presented in a visually-attractive way, it triggers our sensory memory and makes it easy for us to process it (working memory). When we process that data, we essentially create a new “long-term memory folder” in our brain.

Data visualization is everywhere. Internet marketing experts understood it, and the most powerful organizations on a global level understood it, too. It’s about time we started implementing it in our own practice.

Read more…

Originally posted on Data Science Central

Do you want to learn the history of data visualization? Or do you want to learn how to create more engaging visualizations and see some examples? It’s easy to feel overwhelmed with the amount of information available today, which is why sometimes the answer can be as simple as picking up a good book.

These seven amazing data visualization books are a great place for you to get started:

1) Show Me the Numbers: Designing Tables and Graphs to Enlighten, Second Edition

Stephen Few

2) The Accidental Analyst: Show Your Data Who’s Boss

Eileen and Stephen McDaniel

3) Information Graphics

Sandra Rendgen, Julius Wiedemann

4) Visualize This: The FlowingData Guide to Design, Visualization, and Statistics

Nathan Yau

5) Storytelling with Data

Cole Nussbaumer Knaflic

6) Cool Infographics

Randy Krum

7) Designing Data Visualizations: Representing Informational Relationships

Noah Iliinsky, Julie Steele

To check out the 7 data visualization books, click here. For other articles about data visualization, click here.


Read more…

10 Dataviz Tools To Enhance Data Science

Originally posted on Data Science Central

This article on data visualization tools was written by Jessica Davis. She's passionate about the practical use of business intelligence, predictive analytics, and big data for smarter business and a better world.

Data visualizations can help business users understand analytics insights and actually see the reasons why certain recommendations make the most sense. Traditional business intelligence and analytics vendors, as well as newer market entrants, are offering data visualization technologies and platforms.

Here's a collection of 10 data visualization tools worthy of your consideration:

Tableau Software

Tableau Software is perhaps the best known platform for data visualization across a wide array of users. Some Coursera courses dedicated to data visualization use Tableau as the underlying platform. The Seattle-based company describes its mission this way: "We help people see and understand their data."

This company, founded in 2003, offers a family of interactive data visualization products focused on business intelligence. The software is offered in desktop, server, and cloud versions. There's also a free public version used by bloggers, journalists, quantified-self hobbyists, sports fans, political junkies, and others.

Tableau was one of three companies featured in the Leaders square of the 2016 Gartner Magic Quadrant for Business Intelligence and Analytics Platforms.

Qlik

Qlik was founded in Lund, Sweden in 1993. It's another of the Leaders in Gartner's 2016 Magic Quadrant for Business Intelligence and Analytics Platforms. Now based in Radnor, Penn., Qlik offers a family of products that provide data visualization to users. Its new flagship Qlik Sense offers self-service visualization and discovery. The product is designed for drag-and-drop creation of interactive data visualizations. It's available in versions for desktop, server, and cloud.

Oracle Visual Analyzer

Gartner dropped Oracle from its 2016 Magic Quadrant Business Intelligence and Analytics Platform report. One of the company's newer products, Oracle Visual Analyzer, could help the database giant make it back into the report in years to come.

Oracle Visual Analyzer, introduced in 2015, is a web-based tool provided within the Oracle Business Intelligence Cloud Service. It's available to existing customers of Oracle's Business Intelligence Cloud. The company's promotional materials promise advanced analysis and interactive visualizations. Configurable dashboards are also available.

SAS Visual Analytics

SAS is one of the traditional vendors in the advanced analytics space, with a long history of offering analytical insights to businesses. SAS Visual Analytics is among its many offerings.

The company offers a series of sample reports showing how visual analytics can be applied to questions and problems in a range of industries. Examples include healthcare claims, casino performance, digital advertising, environmental reporting, and the economics of Ebola outbreaks.

Microsoft Power BI

Microsoft Power BI, the software giant's entry in the data visualization space, is the third and final company in the Leaders square of the Gartner 2016 Magic Quadrant for Business Intelligence and Analytics Platforms.

Power BI is not a monolithic piece of software. Rather, it's a suite of business analytics tools Microsoft designed to enable business users to analyze data and share insights. Components include Power BI dashboards, which offer customizable views for business users for all their important metrics in real-time. These dashboards can be accessed from any device.

Power BI Desktop is a data-mashup and report-authoring tool that can combine data from several sources and then enable visualization of that data. Power BI gateways let organizations connect SQL Server databases and other data sources to dashboards.

TIBCO Spotfire

TIBCO acquired data discovery specialist Spotfire in 2007. The company offers the technology as part of its lineup of data visualization and analytics tools. TIBCO updated Spotfire in March 2016 to improve core visualizations. The updates expand built-in data access and data preparation functions, and improve data collaboration and mashup capabilities. The company also redesigned its Spotfire server topology with simplified web-based admin tools.

ClearStory Data

Founded in 2011, ClearStory Data is one of the newer players in the space. Its technology lets users discover and analyze data from corporate, web, and premium data sources. Supported sources include relational databases, Hadoop, web and social application interfaces, as well as third-party data providers. The company offers a set of solutions for vertical industries. Its customers include Del Monte, Merck, and Coca-Cola.

Sisense

The web-enabled platform from Sisense offers interactive dashboards that let users join and analyze big and multiple datasets and share insights. Gartner named the company a Niche Player in its Magic Quadrant report for Business Intelligence and Analytics Platforms. The research firm said the company was one of the top two in terms of accessing large volumes of data from Hadoop and NoSQL data sources. Customers include eBay, Lockheed Martin, Motorola, Experian, and Fujitsu.

Dundas BI

Mentioned as a vendor to watch by Gartner, but not included in the company's Magic Quadrant for Business Intelligence and Analytics Platforms, Dundas BI enables organizations to create business intelligence dashboards for the visualization of key business metrics. The platform also enables data discovery and exploration with drag-and-drop menus. According to the company's website, a variety of data sources can be connected, including relational, OLAP, flat files, big data, and web services. Customers include AAA, Bank of America, and Kaiser Permanente.

InetSoft

Inet Software is another vendor that didn't qualify for the Gartner report, but was mentioned by the research firm as a company to watch.

InetSoft offers a colorful gallery of BI Visualizations. A free version of its software provides licenses for two users. It lets organizations take the software for a test drive. Serious users will want to upgrade to the paid version. Customers include Flight Data Services, eScholar, ArcSight, and Dairy.com.

You can find the original article, here. For other articles about data visualization, click here.


Read more…

The Top 5 Benefits of Using Data Visualization

Whether you're working on a school presentation or preparing a monthly sales report for your boss, presenting your data in a detailed and easy-to-follow form is essential. It's hard to keep your audience's focus if you can't help them fully understand the data you're trying to explain. The best way to understand complex data is to show your results in graphic form. This is the main reason why data visualization has become a key part of presentations and data analysis. So let's look at the top 5 benefits of using data visualization in your work.

Easier data discovery

Visualizing your data helps you and your audience find specific information. Pointing out a single piece of information in one-dimensional graphics can be difficult if you have a lot of data to work with. Data visualization can make this effort a whole lot easier.

Simple way to trace data correlations

Sometimes it's hard to notice the correlation between two sets of data. If you present your data in graphic form, you can see how one set of data influences another. This is a major benefit, as it greatly reduces the amount of effort you need to invest.

Live interaction with data

Data visualization offers you the benefit of live interaction with any piece of data you need. This enables you to spot changes in the data as they happen. And you don't just get simple information about the change; you can also get predictive analysis.

Promote a new business language

One of the major benefits of data visualization over simple graphic solutions is the ability to "tell a story" through data. For example, with a simple chart you get a piece of information and that's it. Data visualization enables you not only to see the information but also to understand the reasons behind it.

Identify trends

The ability to identify trends is one of the most interesting benefits that data visualization tools have to offer. You can watch the progress of certain data and see the reasons for changes. With predictive analysis, you can also predict how those trends will behave in the future.

Conclusion

Data visualization tools have become a necessity in modern data analysis. This need has spurred the start of many businesses that offer data visualization services.

All in all, data visualization tools have shifted analytics to a whole new level and allow better insight into business data. Let us know about your experience with data visualization tools and how you use them; we'd love to hear how they have improved your work.

Read more…

3D Data Visualisation Survey

Back in 2012 we released Datascape, a general-purpose 3D immersive data visualisation application. We are now getting ready to release our second-generation application, Datascape2XL, which allows you to plot and interact with over 15 million data points in 3D space, and to view them on either a conventional PC screen or an Oculus Rift virtual reality headset (if you must...).

In order to inform our work we have created a survey to examine the current "state of the market": what applications people are using for data visualisation, how well they are meeting needs, and what users want of a 3D visual analytics application. The survey builds on an earlier survey we did in 2012, the results of which are still available on our web site.

We will again be producing a survey report for public consumption, which you can sign up to receive at the end of the survey; we'll also post it here.

The aim of this survey is to understand current use of, and views on, data visualisation and visual analytics tools. We recognise that this definition can include a wide variety of different application types from simple Excel charts and communications orientated infographics to specialist financial, social media and even intelligence focussed applications. We hope that we have pitched this initial survey at the right level to get feedback from users from across the spectrum.

I hope that you can find 5 minutes to complete the survey - which you can find here: https://www.surveymonkey.co.uk/r/XXMXPP2

Thanks

David

 

Read more…

Originally posted on Data Science Central

This article on going deeper into regression analysis with assumptions, plots and solutions was posted by Manish Saraswat. Manish, who works in marketing and data science at Analytics Vidhya, believes that education can change the world. R, data science and machine learning keep him busy.

Regression analysis marks the first step in predictive modeling. No doubt, it's fairly easy to implement: neither its syntax nor its parameters create any kind of confusion. But merely running one line of code doesn't solve the purpose, and neither does looking only at R² or MSE values. Regression tells you much more than that!

In R, regression analysis returns four diagnostic plots via the plot(model_name) function. Each plot provides significant information, or rather an interesting story, about the data. Sadly, many beginners either fail to decipher the information or don't care about what these plots say. Once you understand these plots, you'll be able to bring significant improvements to your regression model.
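
To make this concrete, here is a minimal sketch (mine, not taken from the article) using R's built-in mtcars data; it fits a simple linear model and produces the four diagnostic plots discussed below:

model <- lm(mpg ~ wt, data = mtcars)   # simple linear regression
par(mfrow = c(2, 2))                   # arrange the four plots in a 2 x 2 grid
plot(model)                            # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage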

For model improvement, you also need to understand regression assumptions and ways to fix them when they get violated.

In this article, I’ve explained the important regression assumptions and plots (with fixes and solutions) to help you understand the regression concept in further detail. As said above, with this knowledge you can bring drastic improvements in your models.

What you can find in this article:

Assumptions in Regression

What if these assumptions get violated?

  1. Linear and Additive
  2. Autocorrelation
  3. Multicollinearity
  4. Heteroskedasticity
  5. Normal Distribution of error terms

Interpretation of Regression Plots

  1. Residual vs Fitted Values
  2. Normal Q-Q Plot
  3. Scale Location Plot
  4. Residuals vs Leverage Plot

You can find the full article here. For other articles about regression analysis, click here. 

Note from the Editor: For a robust regression that will work even if all these model assumptions are violated, click here. It is simple (it can be implemented in Excel and it is model-free), efficient and very comparable to the standard regression (when the model assumptions are not violated).  And if you need confidence intervals for the predicted values, you can use the simple model-free confidence intervals (CI) described here. These CIs are equivalent to those being taught in statistical courses, but you don't need to know stats to understand how they work, and to use them. Finally, to measure goodness-of-fit, instead of R-Squared or MSE, you can use this metric, which is more robust against outliers. 


Read more…

BI Tools for SMEs? Not Just Maybe, But DEFINITELY

I work as a BI consultant and aim to provide the best BI solutions to my clients. I focus on BI for Tally and on upgrading Tally customers to a self-service BI environment with interactive reports and dashboards for Tally. Apart from this I like traveling, participating in business intelligence forums, reading and social networking.
Read more…

KPIs and Business Intelligence: What and Why to Measure

‘Key Performance Indicators’, or KPIs as we say, are very important to the enterprise, and nearly every company is talking about them these days. But there are still a lot of businesses that don’t know how to define the right KPIs to get a good picture of success.

To really understand where you are succeeding and where you are falling short, you have to measure the right things. For example, if your goal is to increase sales in the Minneapolis store by 5% in the year 2015, you couldn’t determine success by establishing a KPI to measure the number of shopping bags you have on hand in the store. Do we care about the number of sales people on staff at a certain time of day, and whether that affects our sales? Do we want to look at the store hours for a particular day of the week to determine whether extended hours in a certain season or on a certain day may result in more sales? Should we look at the impact of sales rep training on closed sales?

In like manner, if you want to establish metrics to evaluate the effectiveness of your internet marketing program, you’ll probably have to look at your program from various perspectives. That is true of nearly every initiative in your company, and that is where many businesses go awry. They assume that they can establish one metric for each goal when, in fact, your business is more complex than that, and your goals usually have more than one factor or aspect that will determine success.

Let’s consider the KPIs for an internet marketing program. We can’t just say that we want to increase sales; we have to decide how we will determine success. Will we include site visits, visits per page, the click-to-conversion ratio, the number of email and newsletter ‘unsubscribe’ requests, the click-through rates for visitors coming to the website from a social media site, and so on? These factors might tell us which internet marketing techniques are driving traffic to our site, but do they tell us whether this traffic is coming from our target audience, or what percentage of the traffic from each source actually results in a purchase? Do they measure the time of day, the day of the week, or the season in which these sales conversions are most likely?
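
To make the idea concrete, here is a tiny illustration in R; the numbers are invented for the example, not taken from the article:

impressions <- 25000    # hypothetical ad or email impressions for one channel
clicks <- 1800          # clicks on the campaign link
conversions <- 90       # purchases attributed to the campaign
click_through_rate <- clicks / impressions     # 0.072, i.e. 7.2%
click_to_conversion <- conversions / clicks    # 0.05, i.e. 5%

Each of these is only one piece of the picture; as argued above, no single metric on its own tells you whether the program is succeeding.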

Of course, every business, industry, location and team is different and you have to look carefully at your own business to determine what is relevant. The most important thing to ask yourself when you establish KPIs is, ‘how does this measurement correlate to our success?’ If I measure this particular thing, does the resulting number or data point give me any insight into how well we are doing, how much money we are making, and whether this task, activity or goal is actually having an effect on the overall performance of the business?

There is one final point to consider when establishing Key Performance Indicators (KPIs) and an integrated business intelligence approach to decision-making: enterprise culture and communication are important. There are industry-standard and business-function-specific business intelligence tools with KPI modules, but these solutions still have to be tailored to the individual organization and its targets, with the minimums and maximums defined and then gradually rolled out to the teams for adoption. In order to get a true picture of KPIs and business intelligence, the enterprise must integrate data from disparate data sources and systems, and that takes careful planning and implementation.

Throughout this process, the business must be committed to building a performance-driven culture and to streamlining and improving communication, and, in all likelihood, getting to the desired state will be an iterative process. It may seem like the enterprise is taking the long way around. But the business team must focus on building for the long term, on achieving solid results, and on a culture that supports clear, concise, objective decision-making and full commitment to business success at every level.

If a business is committed to performance-driven management, it must link its goals to its processes and create key performance indicators that objectively measure performance and keep the company on track. Whether your goal is to create a successful eCommerce site, increase customer satisfaction by 15% or reduce expenses, you must have a good understanding of what you mean by the word ‘success’.

Read more…

Successful sales force management is dependent on up-to-date, accurate information. With appropriate, easy access to business intelligence, a Sales Director and Sales Managers can monitor goals and objectives. But that’s not all a business intelligence tool can do for a sales team.

In today’s competitive market, marketing, advertising and sales teams cannot afford to wait to be outstripped by the competition. They must begin to court and engage a customer before the customer has the need for an item. By building brand awareness and improving product and service visibility, the sales team can work seamlessly throughout the marketing and sales channel to educate and enlighten prospects and then carry them through the process to close the deal. To do that, the sales staff must have a comprehensive understanding of buying behaviors, current issues with existing products, pricing points and the impact of changing prices, products or distribution channels.

With access to data integrated from CRM, ERP, warehousing, supply chain management, and other functions and data sources, a sales manager and sales team can create personalized business intelligence dashboards to guide them through the process and to help them analyze and understand trends and patterns before the competition strikes.

The enterprise must monitor sales results at the international, national, regional, local, team and individual sales professional level. As a sales manager, you should be able to manage incentives and set targets with complete confidence, and provide accurate sales forecasts and predictions to ensure that the enterprise consistently meets its goals and can depend on the predicted revenue and profits for investment, new product development, market expansion and resource acquisition.

Business Intelligence for the sales function must include Key Performance Indicators (KPI) to help the team manage each role and be accountable for objectives and goals. If a sales region fails to meet the established plan, the business can quickly ascertain the root cause of the issue, whether it is product dissatisfaction, poor sales performance, or any one of a number of other sources.

Since the demand generated by the sales force directly affects the production cycle and plan, the sales team must monitor sales targets and objectives against product capacity and production to ensure that they can satisfy the customer without shortfalls or back orders. If some customers are behind on product payments, a business must be able to identify the source of the issue and address it before it results in decreased revenue and results.

The ten benefits listed below comprise a set of ‘must haves’ for every sales team considering a business intelligence solution:

  1. Set targets and allocate resources based on authentic data, rather than speculation
  2. Establish, monitor and adapt accurate forecasts and budgets based on up-to-date, verified data and objective KPIs
  3. Analyze current data, and possible cross-sell and up-sell revenue paths and the estimated lifetime value of a customer
  4. Analyze the elements of sales efforts (prospecting, up-selling, discounts, channel partners, sales collaterals, presentations) and adapt processes that do not provide a competitive edge and strong customer relationships and client loyalty
  5. Measure the factors affecting sales effectiveness to improve sales productivity and correct strategies that do not work
  6. Achieve a consistent view of sales force performance, with a clear picture of unexpected variations in sales and immediate corrective action and strategic adjustment based on trends and patterns
  7. Understand product profitability and customer behavior, by spotlighting customers and products with the highest contribution to the bottom line
  8. Revise expense and resource allocation using the net value of each customer segment or product group
  9. Identify the most effective sales tactics and mechanisms, and the best resources and tools, to meet organizational sales objectives
  10. Establish a personalized, automated alert system to identify and monitor upcoming opportunities and threats

When the enterprise provides a single source, integrated view of enterprise data from numerous sources and enables every user to build views, dashboards and KPIs, every member of the sales team is engaged in the pursuit of strategic, operational and tactical goals. In this way, the enterprise can acquire new clients, retain existing clients, and sell new products and services without a misstep.

Read more…

Taxonomy of 3D DataViz

I've been trying to pull together a taxonomy of 3D data viz. The biggest difference, I think, is between allocentric (data moves) and egocentric (you move) viewpoints. Whether you then view/explore the egocentric 3D visualisation on a 2D screen or in a 3D headset is, I think, a lesser distinction (and an HMD is possibly less practical in most cases).

We have a related benefits escalator for 2D->3D dataviz, but again I'm not convinced that "VR" should represent another level on this - it's more of an orthogonal element, another way to view the upper tiers.

Care to discuss or extend/expand/improve?

Read more…

Guest blog post by SupStat

Contributed by Sharan Duggal.  You can find the original article here.

Introduction


We know that war and civil unrest account for a significant proportion of deaths every year, but how much can mortality rates be attributed to a simple lack of basic resources and amenities, and what relationship do mortality rates have with such factors? That’s what I set out to uncover using WorldBank data that covers the globe for up to the last 50 odd years, and I found a strong relationship with some of the available data.

If you were to look at overall mortality rates, the numbers would be muddied by several factors, including the aforementioned causes of death, so I decided to look at two related, but more specific outcome variables – infant mortality as well as risk of maternal death.

Infant mortality is defined as the number of infants dying before reaching one year of age, per 1,000 live births in a given year.
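
As a quick arithmetic illustration (the numbers are invented): if a country records 500,000 live births in a year and 35,000 of those infants die before their first birthday, its infant mortality rate is 35,000 / 500,000 × 1,000 = 70 per 1,000 live births, i.e. 7%.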

Lifetime risk of maternal death is the probability that a 15-year-old female will die eventually from a maternal cause assuming that current levels of fertility and mortality (including maternal mortality) do not change in the future, taking into account competing causes of death.

While I am sure these numbers can also be affected by things like civil unrest, they focus on individuals who are arguably more likely to be affected by communicable diseases and a lack of basic provisions like clean water, electricity or adequate medical resources, among others.

So, what do overall mortality rates even look like?

The density plot below includes the overall infant mortality distribution along with some metrics indicating the availability of key resources. Infant mortality rates peak at around 1%, and the availability of resources peaks closer to 100%. In both cases we see really long tails, indicating that there is a portion of the population experiencing less than ideal numbers.

So to drill down further, let’s have a closer look at the distribution of both outcome variables by year. The boxplots below suggest that both infant mortality rates and risk of maternal death have shown not only steady overall improvement over the years but also a reduction in the disparity across country-specific observations. Still, the upper end of these distributions represents shocking numbers for some countries: over 10% of infants dying every year (down from a high of 24% in 1961), and a 7.5% probability that a 15-year-old girl living today will eventually die of a maternal cause (down from over 15% twenty-five years ago).

Please note: points have been marginally jittered above for clearer visual representation
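
For readers who want to reproduce this kind of plot, here is a minimal ggplot2 sketch; the data frame and column names are placeholders, not the author's actual code:

library(ggplot2)
# 'mortality' is a hypothetical data frame with one row per country and year
ggplot(mortality, aes(x = factor(year), y = infant_mortality)) +
  geom_boxplot(outlier.shape = NA) +        # one box per year
  geom_jitter(width = 0.2, alpha = 0.2) +   # jittered country-level points, as in the note above
  labs(x = "Year", y = "Infant mortality (per 1,000 live births)")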

Mortality Rates across the Globe


The below map plots the 2012 distribution of infant mortality rates by country. I chose 2012 because most of the covariates I would eventually like to use contain the best information from this year, with a couple of exceptions. It also presents a relatively recent picture of the variables of interest.

As can be seen, the world is distinctly divided, with many African, and some South Asian, countries bearing a bigger burden of infant mortality. And if it wasn’t noticeable on the previous boxplot, the range of values, as shown in the scale below is particularly telling of the overall disparity of mortality rates, pointing to a severe imbalance across the world.

The map representing the risk of maternal death is almost identical, and as such has been represented in a different color for differentiation. Here, the values range from close to 0% to over 7%.

Bottom Ranked Countries Over the Years


After factoring in all 50+ years of data for infant mortality and 26 years of data for risk of maternal death, and then ranking countries, the same set of countries feature at the bottom of the list.

The below chart looks at the number of times a country has had one of the worst three infant mortality rates in any given year since 1960.

The chart for maternal data goes from 1990 through to 2015. It’s important to note that Chad and Sierra Leone were ranked in the bottom 3 for maternal risk of death in every year since 1990.

Please note that numbers may be slightly impacted by missing data for some countries, especially for earlier years in the data set.

Relationship between Mortality & Resources


Getting back to the original question, are there any low hanging fruit and easy fixes for such a dichotomous situation? While my efforts during this analysis did not include any regressions, I did want to get an initial understanding of whether the availability of basic resources had a strong association with mortality rates, and if such a relationship existed, which provisions were more strongly linked with these outcomes? The findings could serve as a platform to do further research.

The below correlation analysis helped home in on some of the stronger linkages and helped weed out some of the weaker ones.

Note, the correlation analysis was run using 2012 data for all metrics, except for “Nurses and Midwives (per 1000 people)” and “Hospital beds (per 1000 people)” for which 2010 and 2009 data was used respectively, due to poorer availability of 2012 data for these measures.
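
A minimal sketch of such a correlation plot in R is shown below; the data frame, the corrplot package choice and the column names are my assumptions, not the author's actual code:

library(corrplot)
# 'wb_2012' is a hypothetical data frame with one row per country and the metrics discussed
metrics <- wb_2012[, c("infant_mortality", "maternal_death_risk",
                       "electricity_access", "sanitation_access", "water_access")]
corr_mat <- cor(metrics, use = "pairwise.complete.obs")   # ignore missing values pairwise
corrplot(corr_mat, method = "circle")                     # draw the correlation matrix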

 

Focusing on the first two columns of the above correlation plot, which represent risk of maternal death and infant mortality, we see a very similar pattern across the variables included in the analysis. Besides basic resources, I had also included items like the availability of renewable freshwater resources and land area, to see if naturally available resources had any linkages to the outcomes in question. They didn't, and so they were removed from the analysis. The plot above also shows that average rainfall and population density don't have much of a relationship with the mortality rates in question. What was also surprising was that access to anti-retroviral therapy, too, had only a weak correlation with mortality rates in general.

The metrics that had the strongest relationship (in the 0.75 to 0.85 range) were:

  • Percent of population with electricity
  • Percent of population with access to non-solid fuel
  • Percent of population with access to improved sanitation facilities, and
  • Percent of population with access to improved water sources


The first two require no definitional explanation, but access to improved sanitation facilities ensures the hygienic separation of human excreta from human contact. Access to improved water sources refers to the percentage of the population using an improved drinking water source, including piped water on premises and other improved sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection).

Analyzing the strongly correlating factors by Region


The following 4 charts look at regional performance on the key identified metrics. The pattern is the same as that seen on the static world map from 2012, but this also gives us a view into how things have been trending on the resources that seem to be strongly linked with infant and maternal mortality over the past 25 years. We see a fairly shallow slope for Sub-Saharan Africa on access to non-solid fuel as well as on improved sanitation facilities. Improvements in drinking water access have been much better.

South Asian countries ranked lowest on the provision of sanitation facilities in the early ’90s, but have made improvements since.

Conclusion


My analysis found a very strong relationship between mortality rates and basic provisions. It also weeded out some factors which were less important. As a next step, it may be helpful to do a deeper country-specific analysis for African and South Asian nations that suffer from a chronic lack of basic infrastructure, to see where investments would be most fruitful in bringing these countries to a closer state of parity with the developed world.

Read more…

Who are you, Data Scientist? Answer with a survey

Originally posted on Data Science Central

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding”

Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley, said this on the 5th of August 2009.

Today, what Hal Varian said almost seven years ago has been confirmed, as highlighted in the following graph taken from Google Trends, which gives a good idea of the current attention to the figure of the Data Scientist.

The Observatory for Big Data Analytics & BI of Politecnico di Milano has been working on the theme of Data Scientists for a few years, and has now prepared a survey to be submitted to Data Scientists that will be used to create a picture of the Data Scientist, within their company and the context in which they operate.

If you work with data in your company, please support us in our research and take this totally anonymous survey here. Thank-you from the Observatory for Big Data Analytics & BI.

 

Graph 1: How many times the term "Data Scientist" has been searched on Google. The numbers in the graph represent searches for the term in relation to the highest point in the graph. The value of 100 is given to the point with the maximum number of searches; the other values are proportional.

Mike Loukides, VP of O’Reilly Media, summarized the Data Scientist’s job description in these words:

"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."

We are in the era of Big Data, an era in which 2.5 quintillion (10^18) bytes are generated every day. Both the private and public sectors everywhere are adapting so that they can exploit the potential of Big Data by introducing into their organizations people who are able to extract information from data.

Getting information out of data is of increasing importance because of the huge amount of data available. As Daniel Keys Moran, programmer and science fiction writer, said: “You can have data without information, but you cannot have information without data”.

In companies today, we are seeing positions like the CDO (Chief Data Officer) and Data Scientists more often than we were used to.

The CDO is a business leader, typically a member of the organization’s executive management team, who defines and executes analytics strategies. This is the person actually responsible for defining and developing the strategies that will direct the company’s processes of data acquisition, data management, data analysis and data governance. This means that new governance roles and new professional figures have been introduced in many organizations to exploit what Big Data offer them in terms of opportunities.

According to the report on “Big Success with Big Data” (Accenture, 2014), 89% of companies believe that, without a big data analytics strategy, in 2015 they risk losing market share and will no longer be competitive.

Collecting data is not simply retrieving information: the Data Scientists’ role is to translate data into information, and currently there is a dearth of people with this set of skills.

It may seem controversial, but both companies and Data Scientists know very little about what skills are needed. They are operating in a turbulent environment where frequent monitoring is needed to know who actually uses which tools, which tools are considered old and becoming obsolete, and which are those used by the highest and lowest earners. According to a study by RJMetrics (2015), the Top 20 Skills of a Data Scientist are those contained in the following graph. 

The graph clearly shows the importance of tools and programming languages such as R and Python. Machine Learning, Data Mining and Statistics are also high up in the set of most requested skills. Skills relating to Big Data are at about the 15th place.

The most recent research on Data Scientists showed that these professionals are more likely to be found in companies belonging to the ICT sector, internet companies and software vendors, such as Microsoft and IBM, rather than in social networks (Facebook, LinkedIn, Twitter), Airbnb, Netflix, etc. The following graph, provided, like the previous one, by RJMetrics, gives the proportion of Data Scientists by industry.

It is important to keep monitoring Data Scientists throughout industrial sectors, their diffusion and their main features, because, in the unsettled business world of today, we can certainly expect a great many changes to take place while companies become aware, at different times and in different ways, of the importance of Data Scientists.

Read more…

Guest blog post by Kevin Smith

I teach AP Statistics at an international school in China, and I believe it's important to show my students how to do plots and inferential statistics not only on their TI-Nspire calculators, but also in R using ggplot, dplyr, and R Markdown.

We are starting the third unit in AP Statistics and we will be learning about scatter plots and regression. I will teach them how to do this in R and use R Markdown to export to Word.

I have already gone over some of the basics of opening RStudio and entering some data and saving to their home directory. We have R and RStudio on all forty of our school computers. They are also required to install R and RStudio on their home computer. I’ll keep the online Microsoft Data Scientist Workbench as a backup.

Here are some ggplot basics that I’ll start with.

I’ll use examples from our AP stats book and the IB book. We are using The Practice of Statistics 4th edition  by Starnes, Yates and Moore (TPS4e) for AP Statistics class. I want to recreate some of the plots in the textbook so I can teach my students how they can create these same plots. We can probably improve in some way on these plots and at the same time, teach them the basics of regression and R programming.

Here is my general plan:

  • Enter the data into the TI nspire cx.
  • Generate a scatter plot on the TI.
  • Use the Smartboard to show the code in R using RStudio.
  • On the first day use an R Script for the R code.
  • All following days, use R Markdown to create and annotate the scatter plots.
  • Publish to our Moodle page or maybe saturnscience website.

Making a scatter plot

Now let’s make a scatter plot with the example in the TPS4e book Chapter 3, page 145.

The general form of a ggplot command will look like this:

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) + geom()

Here is the data from page 145 in the TPS 4e textbook and how we enter it in. We use the “c” command to combine or concatenate into a vector. We then turn these two vectors into a data frame.

body.wt=c(120,187,109,103,131,165,158,116)  
backpack.wt=c(26,30,26,24,29,35,31,28)
TPS145 = data.frame(body.wt, backpack.wt)
TPS145

Now we put this data frame into the ggplot object and name it scatter145 and call the ggplot2 package.
library(ggplot2)
scatter145 = ggplot(data = TPS145, aes(body.wt, backpack.wt)) +
  geom_point()
scatter145

Here is the scatter plot below produced from the above code:

This is a starting point and we can add to this plot to really spruce it up.

I added some blue color to the plot based on the body weight.

scatter145=ggplot(data=TPS145, aes(body.wt,backpack.wt,colour=body.wt)) + 
geom_point()

scatter145




Adding Labels And Adjusting The Size Of The Data Point

To add the x, y and main labels, I add on to my plot with the xlab(), ylab(), and ggtitle() functions. I also increased the size of the plotted data points to make them easier to see.

scatter145 = scatter145+ geom_point(size=2) +     
xlab("Body Weight (lb)") +
ylab("Pack weight (lb)") +
ggtitle("Backpack Weight")

scatter145

How To Add The Regression Line.

I will keep adding to the plot by plotting the regression line. The function for fitting a linear model is "lm", passed to geom_smooth() as its method. The gray shaded area is the 95% confidence interval.

Here is the final code for creating the scatter plot with the regression line.

  scatter145=scatter145+ geom_point(size=3) +    
xlab("Body Weight (lb)") +
ylab("Pack weight (lb)")+
ggtitle("Backpack Weight")+
geom_smooth(method = "lm")

Here is the scatter plot with the regression line.



My motivation for working in R Markdown is that I want to teach my students that R Markdown is an excellent way to integrate their R code, writing, plots and output. This is the way of the near future in Introductory Statistics. I also want to model how reproducible research should be done.
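
As a minimal sketch of what such a document can look like (illustrative only, not the author's actual handout), an R Markdown file that knits the backpack scatter plot to Word might contain:

---
title: "Backpack Weight vs Body Weight"
output: word_document
---

```{r scatterplot, message=FALSE}
library(ggplot2)
body.wt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack.wt <- c(26, 30, 26, 24, 29, 35, 31, 28)
TPS145 <- data.frame(body.wt, backpack.wt)
ggplot(TPS145, aes(body.wt, backpack.wt)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")   # add the least-squares line with its confidence band
```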

Two research papers I read recently support this view.

Some Recent Research On Reproducible Research And Intro Statistics

The authors Deborah Nolan and Jamis Perrett, in their paper Teaching and Learning Data Visualization: Ideas and Assignments (available here), argue that statistical graphics should have a more prominent role in an introductory statistics course.

This article discusses how to make statistical graphics a more prominent element of the undergraduate statistics curricula. The focus is on several different types of assignments that exemplify how to incorporate graphics into a course in a pedagogically meaningful way. These assignments include having students deconstruct and reconstruct plots, copy masterful graphs, create one-minute visual revelations, convert tables into `pictures’, and develop interactive visualizations with, e.g., the virtual earth as a plotting canvas.

Another paper, R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics by Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi and Nicholas J. Horton, argues that teaching students R Markdown helps them to grasp the concept of reproducible research.

R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation.

 
 
Read more…

R for SQListas (1): Welcome to the Tidyverse

Guest blog post by Sigrid Keydana

R for SQListas, what's that about?

This is the 2-part blog version of a talk I've given at DOAG Conference this week. I've also uploaded the slides (no ppt; just pretty R presentation ;-) ) to the articles section, but if you'd like a little text I'm encouraging you to read on. That is, if you're in the target group for this post/talk.
For this post, let me assume you're a SQL girl (or guy). With SQL you're comfortable (an expert, probably), you know how to get and manipulate your data, no nesting of subselects has you scared ;-). And now there's this R language people are talking about, and it can do so many things, they say, so you'd like to make use of it too - so does this mean you have to start from scratch and learn not only a new language, but a whole new paradigm? Turns out, not really, as we'll see. So that's the context for this post.

Let’s talk about the weather

So in this post, I'd like to show you how nice R is to use if you come from SQL. But this isn't going to be a syntax-only post. We'll be looking at real datasets and trying to answer a real question.
Personally I’m very interested in how the weather's going to develop in the future, especially in the nearer future, and especially regarding the area where I live (I know. It’s egocentric.). Specifically, what worries me are warm winters, and I'll be clutching to any straw that tells me it's not going to get warmer still ;-)
So I’ve downloaded / prepared two datasets, both climate / weather-related. The first is the average global temperatures dataset from the Berkeley Earth Surface Temperature Study, nicely packaged by Kaggle (a website for data science competitions; https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data). This contains measurements from 1743 on, up till 2013. The monthly averages have been obtained using sophisticated scientific procedures available on the Berkeley Earth website (http://berkeleyearth.org/).
The second is daily weather data for Munich, obtained from www.wunderground.com. This dataset was retrieved manually, and the period was chosen so as to not contain too many missing values. The measurements range from 1997 to 2015, and have been aggregated by taking a monthly average.
Let’s start our journey through R land, reading in and looking at the beginning of the first dataset:

library(tidyverse)
library(lubridate)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)

## 1 1743-11-01 6.068 1.737 Århus
## 2 1743-12-01 NA NA Århus
## 3 1744-01-01 NA NA Århus
## 4 1744-02-01 NA NA Århus
## 5 1744-03-01 NA NA Århus
## 6 1744-04-01 5.788 3.624 Århus
## # ... with 3 more variables: Country <chr>, Latitude <chr>,
## #   Longitude <chr>

Now we’d like to explore the dataset. With SQL, this is easy: We use WHERE to filter rows, SELECT to select columns, GROUP BY to aggregate by one or more variables...And of course, we often need to JOIN tables, and sometimes, perform set operations. Then there’s all kinds of analytic functions, such as LAG() and LEAD(). How do we do all this in R?

Entering the tidyverse

Luckily for the SQLista, writing elegant, functional, and often rather SQL-like code in R is easy. All we need to do is ... enter the tidyverse. Actually, we’ve already entered it – doing library(tidyverse) – and used it to read in our csv file (read_csv)!
The tidyverse is a set of packages, developed by Hadley Wickham, Chief Scientist at Rstudio, designed to make working with R easier and more consistent (and more fun). We load data from files using readr, clean up datasets that are not in third normal form using tidyr, manipulate data with dplyr, and plot them with ggplot2.
For our task of data exploration, it is dplyr we need. Before we even begin, let’s rename the columns so they have shorter names:

df <- rename(df, avg_temp = AverageTemperature, avg_temp_95p = AverageTemperatureUncertainty, city = City, country = Country, lat = Latitude, long = Longitude)
head(df)

## # A tibble: 6 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
## 2 1743-12-01 NA NA Århus Denmark 57.05N 10.33E
## 3 1744-01-01 NA NA Århus Denmark 57.05N 10.33E
## 4 1744-02-01 NA NA Århus Denmark 57.05N 10.33E
## 5 1744-03-01 NA NA Århus Denmark 57.05N 10.33E
## 6 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E

distinct() (SELECT DISTINCT)

Good. Now that we have this new dataset containing temperature measurements, really the first thing we want to know is: What locations (countries, cities) do we have measurements for?
To find out, just do distinct():

distinct(df, country)

## # A tibble: 159 × 1
## country
##
## 1 Denmark
## 2 Turkey
## 3 Kazakhstan
## 4 China
## 5 Spain
## 6 Germany
## 7 Nigeria
## 8 Iran
## 9 Russia
## 10 Canada
## # ... with 149 more rows

distinct(df, city)

## # A tibble: 3,448 × 1
## city
##
## 1 Århus
## 2 Çorlu
## 3 Çorum
## 4 Öskemen
## 5 Ürümqi
## 6 A Coruña
## 7 Aachen
## 8 Aalborg
## 9 Aba
## 10 Abadan
## # ... with 3,438 more rows

filter() (WHERE)

OK. Now as I said I'm really first and foremost curious about measurements from Munich, so I'll have to restrict the rows. In SQL I'd need a WHERE clause, in R the equivalent is filter():

filter(df, city == 'Munich')
## # A tibble: 3,239 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 1.323 1.783 Munich Germany 47.42N 10.66E
## 2 1743-12-01 NA NA Munich Germany 47.42N 10.66E
## 3 1744-01-01 NA NA Munich Germany 47.42N 10.66E
## 4 1744-02-01 NA NA Munich Germany 47.42N 10.66E
## 5 1744-03-01 NA NA Munich Germany 47.42N 10.66E
## 6 1744-04-01 5.498 2.267 Munich Germany 47.42N 10.66E
## 7 1744-05-01 7.918 1.603 Munich Germany 47.42N 10.66E

This is how we combine conditions if we have more than one of them in a where clause:

# AND
filter(df, city == 'Munich', year(dt) > 2000)
## # A tibble: 153 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 2001-01-01 -3.162 0.396 Munich Germany 47.42N 10.66E
## 2 2001-02-01 -1.221 0.755 Munich Germany 47.42N 10.66E
## 3 2001-03-01 3.165 0.512 Munich Germany 47.42N 10.66E
## 4 2001-04-01 3.132 0.329 Munich Germany 47.42N 10.66E
## 5 2001-05-01 11.961 0.150 Munich Germany 47.42N 10.66E
## 6 2001-06-01 11.468 0.377 Munich Germany 47.42N 10.66E
## 7 2001-07-01 15.037 0.316 Munich Germany 47.42N 10.66E
## 8 2001-08-01 15.761 0.325 Munich Germany 47.42N 10.66E
## 9 2001-09-01 7.897 0.420 Munich Germany 47.42N 10.66E
## 10 2001-10-01 9.361 0.252 Munich Germany 47.42N 10.66E
## # ... with 143 more rows

# OR
filter(df, city == 'Munich' | year(dt) > 2000)

## # A tibble: 540,116 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 2001-01-01 1.918 0.381 Århus Denmark 57.05N 10.33E
## 2 2001-02-01 0.241 0.328 Århus Denmark 57.05N 10.33E
## 3 2001-03-01 1.310 0.236 Århus Denmark 57.05N 10.33E
## 4 2001-04-01 5.890 0.158 Århus Denmark 57.05N 10.33E
## 5 2001-05-01 12.016 0.351 Århus Denmark 57.05N 10.33E
## 6 2001-06-01 13.944 0.352 Århus Denmark 57.05N 10.33E
## 7 2001-07-01 18.453 0.367 Århus Denmark 57.05N 10.33E
## 8 2001-08-01 17.396 0.287 Århus Denmark 57.05N 10.33E
## 9 2001-09-01 13.206 0.207 Århus Denmark 57.05N 10.33E
## 10 2001-10-01 11.732 0.200 Århus Denmark 57.05N 10.33E
## # ... with 540,106 more rows

select() (SELECT)

Now, often we don't want to see all the columns/variables. In SQL we SELECT what we're interested in, and it's select() in R, too:
select(filter(df, city == 'Munich'), avg_temp, avg_temp_95p)

## # A tibble: 3,239 × 2
## avg_temp avg_temp_95p
##
## 1 1.323 1.783
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 5.498 2.267
## 7 7.918 1.603
## 8 11.070 1.584
## 9 12.935 1.653
## 10 NA NA
## # ... with 3,229 more rows

arrange() (ORDER BY)

How about ordered output? This can be done using arrange():

arrange(select(filter(df, city == 'Munich'), dt, avg_temp), avg_temp)

## # A tibble: 3,239 × 2
## dt avg_temp
##
## 1 1956-02-01 -12.008
## 2 1830-01-01 -11.510
## 3 1767-01-01 -11.384
## 4 1929-02-01 -11.168
## 5 1795-01-01 -11.019
## 6 1942-01-01 -10.785
## 7 1940-01-01 -10.643
## 8 1895-02-01 -10.551
## 9 1755-01-01 -10.458
## 10 1893-01-01 -10.381
## # ... with 3,229 more rows

Do you think this is starting to get difficult to read? What if we add FILTER and GROUP BY operations to this query? Fortunately, with dplyr it is possible to avoid paren hell as well as stepwise assignment using the pipe operator, %>%.

Meet: %>% - the pipe

The pipe transforms an expression of the form x %>% f(y) into f(x, y) and so allows us to write the above operation like this:

df %>% filter(city == 'Munich') %>% select(dt, avg_temp) %>% arrange(avg_temp)

This looks a lot like the fluent API design popular in some object oriented languages, or the bind operator, >>=, in Haskell.
It also looks a lot more like SQL. However, keep in mind that while SQL is declarative, the order of operations matters when you use the pipe (as the name says, the output of one operation is piped to the next). You cannot, for example, write this (trying to emulate SQL‘s SELECT – WHERE – ORDER BY): df %>% select(dt, avg_temp) %>% filter(city == 'Munich') %>% arrange(avg_temp). This can’t work because after the new data frame has been returned from the select, the column city is no longer available.
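
A quick way to see the rewrite rule itself in action (a throwaway example of mine, not from the original post): both lines below return the same result.

sqrt(c(1, 4, 9))        # ordinary call: f(x)
c(1, 4, 9) %>% sqrt()   # pipe: x %>% f(), rewritten to f(x)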

group_by() and summarise() (GROUP BY)

Now that we’ve introduced the pipe, on to group by. This is achieved in dplyr using group_by() (for grouping, obviously) and summarise() for aggregation.
Let’s find the countries we have most – and least, respectively – records for:

# most records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count %>% desc())

## # A tibble: 159 × 2
## country count
##
## 1 India 1014906
## 2 China 827802
## 3 United States 687289
## 4 Brazil 475580
## 5 Russia 461234
## 6 Japan 358669
## 7 Indonesia 323255
## 8 Germany 262359
## 9 United Kingdom 220252
## 10 Mexico 209560
## # ... with 149 more rows

# least records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count)

## # A tibble: 159 × 2
## country count
##
## 1 Papua New Guinea 1581
## 2 Oman 1653
## 3 Djibouti 1797
## 4 Eritrea 1797
## 5 Botswana 1881
## 6 Lesotho 1881
## 7 Namibia 1881
## 8 Swaziland 1881
## 9 Central African Republic 1893
## 10 Congo 1893

How about finding the average, minimum, and maximum temperatures per month, looking only at records from Germany that originate after 1949?

df %>% filter(country == 'Germany', !is.na(avg_temp), year(dt) > 1949) %>% group_by(month(dt)) %>% summarise(count = n(), avg = mean(avg_temp), min = min(avg_temp), max = max(avg_temp))

## # A tibble: 12 × 5
## `month(dt)` count avg min max
##
## 1 1 5184 0.3329331 -10.256 6.070
## 2 2 5184 1.1155843 -12.008 7.233
## 3 3 5184 4.5513194 -3.846 8.718
## 4 4 5184 8.2728137 1.122 13.754
## 5 5 5184 12.9169965 5.601 16.602
## 6 6 5184 15.9862500 9.824 21.631
## 7 7 5184 17.8328285 11.697 23.795
## 8 8 5184 17.4978752 11.390 23.111
## 9 9 5103 14.0571383 7.233 18.444
## 10 10 5103 9.4110645 0.759 13.857
## 11 11 5103 4.6673114 -2.601 9.127
## 12 12 5103 1.3649677 -8.483 6.217

In this way, we can write aggregation queries that are both powerful and very readable. At this point, we know how to do basic selects with filtering and grouping. How about joins?
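One piece of the SQL vocabulary we haven't mapped yet is HAVING, i.e., filtering on aggregated values. In dplyr this is simply a filter() applied after summarise(); here is a minimal sketch (the 500,000-row threshold is purely illustrative):

# SQL: SELECT country, COUNT(*) AS count FROM df
#      GROUP BY country HAVING COUNT(*) > 500000 ORDER BY count DESC
df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count > 500000) %>%   # filtering after summarise() plays the role of HAVING
  arrange(desc(count))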

JOINs

dplyr provides inner_join(), left_join(), right_join(), and full_join(), as well as semi_join() and anti_join(). From the SQL viewpoint, these work exactly as you would expect.
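As a minimal illustration (using two toy tibbles invented for this purpose, not the weather data), here is how inner_join(), left_join(), and anti_join() differ:

library(dplyr)
library(tibble)

cities <- tibble(city = c("Munich", "Bern", "Oslo"),
                 country = c("Germany", "Switzerland", "Norway"))
temps  <- tibble(city = c("Munich", "Oslo", "Reykjavik"),
                 avg_temp = c(9.1, 5.9, 4.6))

inner_join(cities, temps, by = "city")  # rows present in both: Munich, Oslo
left_join(cities, temps, by = "city")   # all rows of cities; Bern gets NA for avg_temp
anti_join(cities, temps, by = "city")   # rows of cities without a match in temps: Bern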
To demonstrate a join, we’ll now load the second dataset, containing daily weather data for Munich, and aggregate it by month:

# daily_1997_2015: the second dataset, with daily weather data for Munich (loading call not shown);
# aggregate it to monthly means
monthly_1997_2015 <- daily_1997_2015 %>%
  group_by(month = floor_date(day, "month")) %>%
  summarise(mean_temp = mean(mean_temp))
monthly_1997_2015

## # A tibble: 228 × 2
## month mean_temp
##
## 1 1997-01-01 -3.580645
## 2 1997-02-01 3.392857
## 3 1997-03-01 6.064516
## 4 1997-04-01 6.033333
## 5 1997-05-01 13.064516
## 6 1997-06-01 15.766667
## 7 1997-07-01 16.935484
## 8 1997-08-01 18.290323
## 9 1997-09-01 13.533333
## 10 1997-10-01 7.516129
## # ... with 218 more rows

Fine. Now let’s join the two datasets on the date column (their respective keys), telling R that this column is named dt in one dataframe, month in the other:

# put the post-1949 subset (just date and temperature) in its own frame so df stays intact
df_1950 <- df %>% select(dt, avg_temp) %>% filter(year(dt) > 1949)
df_1950 %>% inner_join(monthly_1997_2015, by = c("dt" = "month"))

## # A tibble: 705,510 × 3
## dt avg_temp mean_temp
##
## 1 1997-01-01 -0.742 -3.580645
## 2 1997-02-01 2.771 3.392857
## 3 1997-03-01 4.089 6.064516
## 4 1997-04-01 5.984 6.033333
## 5 1997-05-01 10.408 13.064516
## 6 1997-06-01 16.208 15.766667
## 7 1997-07-01 18.919 16.935484
## 8 1997-08-01 20.883 18.290323
## 9 1997-09-01 13.920 13.533333
## 10 1997-10-01 7.711 7.516129
## # ... with 705,500 more rows

As we can see, the average temperatures obtained for the same month differ quite a bit. Evidently, the averaging methods used (ours and Berkeley Earth's) were very different, so we will have to use each dataset separately for exploration and inference.

Set operations

Having looked at joins, let's move on to set operations. The set operations known from SQL can be performed using dplyr's intersect(), union(), and setdiff() functions. For example, let's combine the Munich weather data from before 2016 and from 2016 into one data frame:

# daily_2016 holds the corresponding daily Munich data for 2016 (loaded analogously)
union(daily_1997_2015, daily_2016) %>% arrange(day)

## # A tibble: 7,195 × 23
## day max_temp mean_temp min_temp dew mean_dew min_dew max_hum
##
## 1 1997-01-01 -8 -12 -16 -13 -14 -17 92
## 2 1997-01-02 0 -8 -16 -9 -13 -18 92
## 3 1997-01-03 -4 -6 -7 -6 -8 -9 93
## 4 1997-01-04 -3 -4 -5 -5 -6 -6 93
## 5 1997-01-05 -1 -3 -6 -4 -5 -7 100
## 6 1997-01-06 -2 -3 -4 -4 -5 -6 93
## 7 1997-01-07 0 -4 -9 -6 -9 -10 93
## 8 1997-01-08 0 -3 -7 -7 -7 -8 100
## 9 1997-01-09 0 -3 -6 -5 -6 -7 100
## 10 1997-01-10 -3 -4 -5 -4 -5 -6 100
## # ... with 7,185 more rows, and 15 more variables: mean_hum ,
## # min_hum , max_hpa , mean_hpa , min_hpa ,
## # max_visib , mean_visib , min_visib , max_wind ,
## # mean_wind , max_gust , prep , cloud ,
## # events , winddir
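Since the example above only shows a union, here is a minimal sketch of intersect() and setdiff() on two toy tibbles (again invented purely for illustration; dplyr supplies the data-frame methods for these verbs):

a <- tibble(day = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")))
b <- tibble(day = as.Date(c("2016-01-02", "2016-01-03", "2016-01-04")))

intersect(a, b)  # rows present in both: Jan 2 and Jan 3
setdiff(a, b)    # rows in a but not in b: Jan 1
union(a, b)      # all distinct rows: Jan 1 through Jan 4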

Window (AKA analytic) functions

Joins and set operations are great to have, but that's not all: dplyr also provides a large number of window (analytic) functions. We have the ranking functions familiar from SQL, e.g., dense_rank(), row_number(), ntile(), and cume_dist():

# 5% hottest days
filter(daily_2016, cume_dist(desc(mean_temp)) < 0.05) %>% select(day, mean_temp)

## # A tibble: 5 × 2
## day mean_temp
##
## 1 2016-06-24 22
## 2 2016-06-25 22
## 3 2016-07-09 22
## 4 2016-07-11 24
## 5 2016-07-30 22

# 3 coldest days
filter(daily_2016, dense_rank(mean_temp) <= 3) %>% select(day, mean_temp) %>% arrange(mean_temp)

## # A tibble: 4 × 2
## day mean_temp
##
## 1 2016-01-22 -10
## 2 2016-01-19 -8
## 3 2016-01-18 -7
## 4 2016-01-20 -7
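The remaining ranking functions work along the same lines. As a minimal sketch, ntile() buckets the 2016 days into equally sized groups, here quartiles of mean temperature (the column name temp_quartile is made up for this example):

# assign each day to a temperature quartile, then count days and average per quartile
daily_2016 %>%
  mutate(temp_quartile = ntile(mean_temp, 4)) %>%
  group_by(temp_quartile) %>%
  summarise(days = n(), avg_temp = mean(mean_temp))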

We have lead() and lag():

# consecutive days where mean temperature changed by more than 5 degrees:
daily_2016 %>% mutate(yesterday_temp = lag(mean_temp)) %>% filter(abs(yesterday_temp - mean_temp) > 5) %>% select(day, mean_temp, yesterday_temp)

## # A tibble: 6 × 3
## day mean_temp yesterday_temp
##
## 1 2016-02-01 10 4
## 2 2016-02-21 11 3
## 3 2016-06-26 16 22
## 4 2016-07-12 18 24
## 5 2016-08-05 14 21
## 6 2016-08-13 19 13

dplyr also enhances a number of aggregation functions that already exist in base R, for example by letting you choose the column that dictates accumulation order (more on this below). New in dplyr is, for example, cummean(), the cumulative mean:

daily_2016 %>% mutate(cum_mean_temp = cummean(mean_temp)) %>% select(day, mean_temp, cum_mean_temp)

## # A tibble: 260 × 3
## day mean_temp cum_mean_temp
##
## 1 2016-01-01 2 2.0000000
## 2 2016-01-02 -1 0.5000000
## 3 2016-01-03 -2 -0.3333333
## 4 2016-01-04 0 -0.2500000
## 5 2016-01-05 2 0.2000000
## 6 2016-01-06 2 0.5000000
## 7 2016-01-07 3 0.8571429
## 8 2016-01-08 4 1.2500000
## 9 2016-01-09 4 1.5555556
## 10 2016-01-10 3 1.7000000
## # ... with 250 more rows
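As for choosing the column that dictates accumulation order, dplyr's order_by() helper can wrap a window function such as cummean(); a minimal sketch (accumulating from the most recent day backwards instead of in row order, assuming the daily_2016 frame from above):

# cumulative mean computed in order of descending date, independent of row order
daily_2016 %>%
  mutate(cum_mean_from_latest = order_by(desc(day), cummean(mean_temp))) %>%
  select(day, mean_temp, cum_mean_from_latest)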

OK, wrapping up so far: dplyr should make data manipulation easy if you're used to SQL. So why not just use SQL? What can we do in R that we couldn't do before?

Visualization

Well, one thing R excels at is visualization. First and foremost, there is ggplot2, Hadley Wickham's famous plotting package, the realization of a "grammar of graphics". ggplot2 predates the tidyverse, but became part of it when the tidyverse was created. We can use ggplot2 to plot the average monthly temperatures from Berkeley Earth for selected cities and time ranges, like this:

cities = c("Munich", "Bern", "Oslo")
df_cities <- df %>% filter(city %in% cities, year(dt) > 1949, !is.na(avg_temp))
(p_1950 <- ggplot(df_cities, aes(dt, avg_temp, color = city)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized())


While this plot is two-dimensional (with time and temperature on the axes), a third "dimension" is added via the color aesthetic (aes(..., color = city)).

We can easily reuse the same plot, zooming in on a shorter time frame:

start_time <- as.Date("1992-01-01")
end_time <- as.Date("2013-08-01")
limits <- c(start_time,end_time)
(p_1992 <- p_1950 + (scale_x_date(limits=limits)))


It seems like overall, Bern is warmest, Oslo is coldest, and Munich is in the middle somewhere.
We can add smoothing lines to see this more clearly (by default, confidence intervals would also be displayed, but I'm suppressing them here to keep the three lines clearly visible):

(p_1992 <- p_1992 + geom_smooth(se = FALSE))


Good. Now that we have these lines, can we rely on them to obtain a trend for the temperature? Because that is, ultimately, what we want to find out about.
From here on, we’re zooming in on Munich. Let’s display that trend line for Munich again, this time with the 95% confidence interval added:

# p_munich_1950 is the analogous scatterplot for Munich only, built in the same way as p_1950 above
p_munich_1992 <- p_munich_1950 + scale_x_date(limits = limits)
p_munich_1992 + stat_smooth()


Calling stat_smooth() without specifying a smoothing method uses local polynomial regression fitting (LOESS). However, we could just as well use another smoothing method; for example, we could fit a straight line using lm(). Let's compare the two:

loess <- p_munich_1992 + stat_smooth(method = "loess", colour = "red") + labs(title = 'loess')
lm <- p_munich_1992 + stat_smooth(method = "lm", color = "green") + labs(title = 'lm')
grid.arrange(loess, lm, ncol = 2)


The two fits behave quite differently, especially as regards the shape of the confidence band near the beginning and end of the time range. If we want to form an opinion about a possible trend, we will have to do more than just look at the graphs - time for some time series analysis!
Given that this post has become quite long already, we'll continue in the next one - so how about next winter? Stay tuned :-)

Note: This post was originally posted here

Read more…


Guest blog post by Rick Riddle

It can be tempting to lump all the people who've spent most of their lives with social media together as one large group with largely the same interests and aims. Indeed, that is what you will find many marketing firms doing. The internet is rife with articles about how to market to millennials, how Snapchat is the new social media platform of the millennial generation, and how Instagram has overtaken Facebook.

To do so, however, would be wrong.

Because if you dig a little deeper, you'll quickly find that millennials are far from the homogeneous group they've been made out to be. For example, "they" most certainly don't all use Snapchat. In fact, an Ipsos study of 1,000 millennials between the ages of 20 and 35 found that more than half don't have a Snapchat account, 1 in 10 doesn't have a Facebook account, and 40% do not use Instagram.

Figure 1: Infographic by the Smart Paper Help writing service.

In effect, marketing to millennials is a little like marketing to women or black people: the category is just far too big, and by using it you lump together people who are entirely different and have entirely different interests.

Broad categories mean less engagement

What's more, by targeting categories this broad, there is almost no way a person will feel personally addressed by your marketing campaign. In other words, you're not making use of one of the biggest trends in marketing today: the personalization of products and websites.

In order to take advantage of that, you need to slice market segments far more thinly than the word "millennial" ever could. You wouldn't just focus on millennials, or even millennial women; you would focus on, say, millennial single mothers.

What's more, this is in many ways far easier to do: studying the numbers for smaller groups makes it easier both to find out which social media they're using and to find ways to appeal to those groups directly, by exploring topics that are immediately relevant to them. That, in turn, will significantly raise their interest in and engagement with your brand.

Modern social media allows for thin-slicing  

And besides, why wouldn't you approach modern advertising this way? Many social platforms let you thin-slice whom you approach, and how you approach them, to an amazing degree. For example, it is possible to target people in specific jobs, in specific areas, even people who work at a specific business.

This is immensely advantageous, as it means you can tailor your message exactly to that group, giving them the feeling that you're talking directly to them and offering exactly what they might be looking for.

What's more, by thin-slicing whom you address with your advertisements and posts, you tighten the bolts on the leaky faucet: people who won't benefit from your ad (because they aren't interested or can't take advantage of what you're offering) won't be exposed to it. That makes them happier, and it means you spend far less money on people you're not interested in targeting.

A warning

At the same time, there is a growing body of evidence that people are less and less comfortable with the way social media have encroached on their privacy. A recent survey by Nation under A-Hack revealed that 55% of millennials said they would stay away from social media if they could start afresh, and that 75% were considering closing their accounts if the security breaches continued. And that's not the only source showing these kinds of trends; other infographics about how millennials use social media show similar patterns.

This matters for you because it is vital not to make them feel as if you know too much about them. Rather than feeling that you've personally connected with them, there's a good chance they will simply be creeped out.

And that can’t be good for your business.

A fine line

In other words, stay away from addressing them too directly, letting them know that you're aware of where they live, or revealing what other information you have about them. And just as with older generations, it might be about time we ask for permission before broadcasting what we know about individuals across our networks.

The truth is, though the US has not yet caught up to the European Union in terms of privacy protection, the way the mood is currently going, it will sooner or later start swinging in that direction. When that happens, you want to make certain you're on the right side of the fence.

So personalize and thin-slice, but do not become too personal, as people still do not like the idea of a business peeking into their living rooms.

About the author:

Rick Riddle is a successful blogger whose articles aim to help readers with digital marketing, entrepreneurship, career, and self-development. Feel free to connect with Rick on Twitter and LinkedIn.

Read more…
