Post by Admin on Dec 4, 2018 23:34:38 GMT
So I have Modern Data Science with R, by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton (2017),
Chapman & Hall/CRC
Texts in Statistical Science Series.
It is a huge book packed with info. And quite new too.
The series lists over 50 titles, and they all look really good. CRC Press is always good, and our academic libraries carry a good number of their titles.
I want to start here by looking at the material in Appendix F, on how to connect R to databases. That has, after all, been an area of concern for me.
I am finding that I have to absorb information holographically: lots and lots of books to draw on to get everything I want.
So what I am seeing, and this book confirms it, is this: to do what you want, you add extension modules to the interpreter. The R interpreter itself is a straight C program, not too big. The extension modules are typically C code plus some R code, and you can load them without changing the original source or recompiling and re-linking, because they are brought in by dynamic linking. That is available on Linux, Windows, and macOS.
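A small illustration of the dynamic-linking mechanism itself, sketched in Python rather than R (the same run-time loading is what an R extension module relies on under the hood). This assumes a Unix-like system where the C math library is available as `libm`:

```python
# Load the C math library at run time and call into it. No recompiling
# or re-linking of the interpreter is needed; the library name is
# resolved when the program runs.
# (Assumption: a Unix-like system with libm; shown in Python for brevity.)
import ctypes
import ctypes.util

libm_name = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libm_name)

# Declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(9.0))  # → 3.0
```

The interpreted side handles the bookkeeping (finding the library, declaring signatures), and the compiled code does the work, which is exactly the division of labor described above.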
I presume that the R side of an extension can pass connection information through to its new C code.
So then you use this to get to the databases, MySQL or PostgreSQL. You do this via the client/server interface, which each of the operating systems now accommodates.
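The client/server interface here is ultimately the operating system's socket layer; a MySQL or PostgreSQL driver wraps it in the database's wire protocol. A bare sketch of that underlying mechanism, in Python, with a toy in-process server standing in for the database server (everything here is illustrative, not a real database protocol):

```python
import socket
import threading

def toy_server(listener):
    """Stand-in for a database server: answers one request and exits."""
    conn, _ = listener.accept()
    request = conn.recv(1024)  # in a real protocol, a parsed query
    conn.sendall(b"result for " + request)
    conn.close()
    listener.close()

# The "server" side: bind to a port the OS picks, serve in a thread.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=toy_server, args=(listener,)).start()

# The "client" side: connect, send a request, read the reply.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"SELECT 1")
reply = client.recv(1024)
client.close()
print(reply.decode())  # → result for SELECT 1
```

The point is that client and server are separate processes talking through an OS facility, which is why the database can enforce its own access control independently of your program.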
So this alleviates my concern about having to transfer data through text files. It is actually very well done.
And they cover MySQL and PostgreSQL. They do not talk about MongoDB or any other NoSQL databases.
They also talk about SQLite. Generally with SQLite you just write a program, link against its code, and include it in your executable; the engine runs in-process, so there is no client/server interface involved. From R, though, you probably still load it as a dynamically linked extension module. I am not sure of the details, but I am confident it will work out well.
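A minimal sketch of the embedded style, using Python's bundled sqlite3 module since the same in-process engine is what an R SQLite extension would load. Note there is no host, port, or credential anywhere; the database lives inside the process:

```python
import sqlite3

# In-memory database: the SQLite engine runs inside this process.
# No server, no client/server round trip, no access-control layer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, note TEXT)")
db.executemany("INSERT INTO runs (note) VALUES (?)",
               [("first",), ("second",)])
count, = db.execute("SELECT COUNT(*) FROM runs").fetchone()
print(count)  # → 2
db.close()
```

Swapping `":memory:"` for a filename gives you a persistent single-file database, which is the usual way SQLite serves a single program's own data.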
As I see it, if it is "Enterprise" data and you need complex access control, then you need MySQL or PostgreSQL. But if the data just pertains to running your own program, then SQLite is probably better.
As I see it though, no one wants to write much of a program in an interpreted language; it can easily be 100x slower. For that you want C++.
But the thing is, you need some way for the user to feed data into the program and to control it, and it is nice to let the user change that themselves. This is where the interpreted language comes in, as the natural dividing line between what the user can change and what they cannot.
You don't ever want a user to be able to change code and then expect you to maintain it; that is how your software gets forked. And dealing with a large corporation, run very sloppily, that could happen if you are not careful. So anything of any complexity goes in your C++, compiled up and kept shut. Then in the interpreted language you just set up a chain-type program which moves through its parts. That is what the user can change.
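The chain-type split can be sketched like this. The heavy steps would live in the compiled core (stubbed here as plain Python functions; the names and pipeline shape are hypothetical, just to show the dividing line), while the user-editable part is nothing but the ordering of step names:

```python
# The "compiled core": in real life these would be C++ routines exposed
# to the interpreter as an extension module. Here they are stubs that
# pass a state dictionary along the chain.
def load_data(state):
    state["data"] = [3, 1, 2]
    return state

def process(state):
    state["data"] = sorted(state["data"])
    return state

def report(state):
    state["report"] = "values: " + ", ".join(map(str, state["data"]))
    return state

STEPS = {"load": load_data, "process": process, "report": report}

# The user-editable part: just a chain of step names. Reordering or
# dropping a step changes behavior without touching the compiled core.
user_chain = ["load", "process", "report"]

state = {}
for name in user_chain:
    state = STEPS[name](state)
print(state["report"])  # → values: 1, 2, 3
```

Users can rearrange the chain all they like, but the steps themselves stay sealed, so there is nothing of substance for them to fork.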
Okay, but the upshot of this is that, considering Python, Ruby, R, and Io, for anything one of them can do it should not be too difficult to make the others do it too. You just write your own program in the form of an extension module.
So is there any reason to do some stuff using one of them and other stuff using another? Or better just to pick one and use it for everything?
Python, at least initially, does not impress me. But I see Python being invoked for lots of R-like applications.
Ruby seems to be intended as a corrective, and it seems to be getting used as such. Io, at least initially, looks to me like something for embedded applications, a replacement for Forth perhaps. But for R, and really for all of them, I still have much more to learn.
As for why R uses something written in Fortran, it sounds like it is just legacy numerical code. I, though, am not averse to using Fortran, not even for new work, if that makes things fit well within a tradition.
Still much more to learn.