15/07/2023

From textbooks to introductory tomes and mass-market nonfiction.

Chico Camargo, a postdoctoral researcher in data science at the Oxford Internet Institute, came to data science from a background in biology.
"Biology is big, messy and complex," he told Built In, "so I was drawn toward tools that could help me make some sense out of that."
Usually, humans make sense of the natural world's complexity with our own natural tools: our brains and our senses. Data science augments those innate capacities, though, with algorithms and predictive models.
Camargo was especially drawn to unsupervised machine learning and natural language processing, which help humans with everything from detecting signs of metastasizing cancer to understanding foreign languages with Google Translate.
At this point, in fact, data science has gotten so sophisticated that it doesn't just enhance our natural abilities; it mimics them.
Take deep learning, for example. It "uses multiple layers [of algorithms] to progressively extract higher-level features from raw input," Camargo explained.
Human vision works in a similarly layered way. The first layers of neurons in our visual system are responsible for identifying light and dark, Camargo said, while the deeper layers respond to patterns like curves and straight lines. Ultimately, the nth layer of neurons recognizes the visual for what it is: "Aha, it's a face!"
In a way, data science has become humanity's sixth sense. Yet it's also probably the sense the average person understands the least. So for anyone hoping to learn more, we asked three experts to recommend their favorite data science books. Our panel included:
The resulting reading list ranges from technical machine learning and math textbooks to sociological studies of how algorithms impact our daily lives.
GENERAL INTEREST BOOKS
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz
CAMARGO: This book is like Freakonomics in the age of data science. It's 100 percent not a technical book. Every chapter tells some peculiar story illustrating a data science concept. Like, there's one chapter about Google searches, another about news, another about image data, etc. It's a bunch of stories of people being creative and finding patterns in the most random things, because these random things actually reveal a lot. The book has that name because you can lie about what you eat and read, and you can lie about who you're going to vote for, but if I have access to your search history, I can figure out the truth. It's a book for people that are curious about what data science is and what it can do, especially when it comes to social data. The author finishes by saying the next Freud will be a data scientist, the next Foucault will be a data scientist, the next Marx will be a data scientist. I think that's a bit much perhaps, because data science doesn't answer every question ever. But it's a fun book, to be read with a grain of salt.
Naked Statistics: Stripping the Dread from Data by Charles Wheelan
HERMAN: This book gives a lot of examples of how statistical concepts apply in the real world. Wheelan does not go into a lot of theory, but he has some pretty interesting examples and a kind of dry sense of humor. This is the only statistics book that's ever made me laugh, and it's the book that we recommend our incoming students at the Flatiron School read beforehand. Our students come from a wide variety of statistics backgrounds, but I've always gotten really positive feedback on it. It's ideal for beginners, but I also think that if you've never read it and you're in data science, it's a great read.
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O'Neil
CAMARGO: The author of this book, Cathy O'Neil, used to be an academic mathematician. Then she went to Wall Street, then she went to Occupy Wall Street, and now she's an activist raising awareness of how algorithms rule our lives, and how they are not as neutral or unbiased as we like to believe. The book is a collection of stories of algorithms' real-world applications, and a lot of them are about people who were classified as "unworthy" by an algorithm. Like, someone purchased an item at a particular shop and automatically got their credit card limit lowered, or a college student couldn't get a job at a local grocery store because the algorithm said so.
She doesn't just say "boo hoo, bad algorithm, bad machine!" though. She makes an effort to explain the mechanisms that might make an algorithm racist, for instance. So, why is a policing algorithm sending officers to black neighborhoods more often? Well, what happened in that case is that the algorithm was fed data on previous police patrols, which were more often in black neighborhoods. So the algorithm learned that those neighborhoods are the ones that receive more patrols. The algorithm simply reproduced what it was taught. The book makes you think a lot about how you can design algorithms and data science practices to deal with that.
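To make that mechanism concrete, here is a minimal sketch of a "model" that does nothing more than learn from past patrol records. It is not the system described in the book; the neighborhood labels, the patrol counts and the function names are all hypothetical, chosen only to show how a skew in the training data comes straight back out as a skew in the recommendations.

```python
from collections import Counter

# Hypothetical patrol log (made-up numbers): each entry is the
# neighborhood that one past patrol was sent to.
historical_patrols = ["A"] * 80 + ["B"] * 20


def train(patrol_log):
    """'Learn' each neighborhood's share of past patrols."""
    counts = Counter(patrol_log)
    total = sum(counts.values())
    return {hood: n / total for hood, n in counts.items()}


def allocate(model, patrols_available):
    """Hand out new patrols in proportion to the learned shares."""
    return {hood: round(share * patrols_available)
            for hood, share in model.items()}


model = train(historical_patrols)
plan = allocate(model, 10)
print(plan)
```

Nothing in this sketch measures where patrols are actually needed. The only signal is where patrols went before, so the recommended plan mirrors the historical split.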
Algorithms of Oppression by Safiya Noble
CAMARGO: This book has a few stories, with very simple data, which the author explores in depth. I found it a very interesting read, because the author's background is almost diametrically opposed to mine. She's 100 percent qualitative, telling stories based on small data with a lot of context.
In one of these stories, the author, Safiya Noble, was organizing a party for her niece and other children, and she searched something like "black girls" on Google. To her surprise, she didn't find pictures of children. She found websites like "HOT BLACK SINGLES IN YOUR AREA." For other search terms, like "Latina girls" and "Asian girls," she found the same stuff.
The reason this happened, she explained, is Google's revenue model. The algorithm will serve whatever ad pays the most. And it becomes a troubling situation, because even though Google is an advertising company, we use it like a public library, like some sort of publicly accessible repository of information. I found it a very sobering read.
BEGINNER-FRIENDLY TEXTBOOKS
An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
HERMAN: When I was first learning data science, most statistical textbooks were kind of unreadable. They went in-depth on theory and didn't really show the application side. This book doesn't go as deep statistically as a lot of other books, but it gives you enough knowledge to be successful as a data scientist, and it goes over the key machine learning algorithms. One of the issues people have with data science is that algorithms are these black boxes where you put data in and you get data out and you have no idea what happens in the middle. This book gives you enough statistical knowledge to understand what's going on in that black box.
It's geared toward people that don't have any programming or statistics background. That being said, I've actually read this book multiple times. Even if you're an experienced data scientist, you kind of forget about a lot of statistical concepts over time. As you work in a job, you're not going to be using every single algorithm. You get comfortable. This book allows you to say, "Okay, maybe I should try this other algorithm."
Data Science From Scratch: First Principles With Python by Joel Grus
MILLER: This book is about how to write data science algorithms in Python. It's a mix between a textbook and a normal book: a great entryway book, very appropriate for a layperson. So for instance, if I wanted to learn the machine learning algorithm Naive Bayes, this book says, "We're going to literally program Naive Bayes as if it doesn't exist in the world. We're going to learn the math first and then write the code as part of that. We'll build this algorithm together with nothing but Python."
You probably want to know a little bit of Python and a little bit of statistics going in, but this book assumes almost no depth of knowledge. It's not one of those books that's like, "This is left to the reader because it's easy." And it will teach you all the standard machine learning algorithms, probably 10 or 15 different ones.
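For a sense of what building Naive Bayes "with nothing but Python" can look like, here is a minimal sketch of a multinomial Naive Bayes text classifier using only the standard library. It is not the book's code: the function names, the add-k smoothing default and the four toy reviews are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter()            # class -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab, k=1.0):
    """Return the class with the highest log-posterior (add-k smoothing)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior: how common is this class overall?
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Smoothed estimate of P(word | class).
            p = (word_counts[label][word] + k) / (total_words + k * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy reviews, invented for this example.
training = [
    ("great food and friendly staff", "positive"),
    ("loved the great service", "positive"),
    ("terrible food and rude staff", "negative"),
    ("awful service never again", "negative"),
]
model = train(training)
print(predict("great staff", *model))
print(predict("rude and awful", *model))
```

Working in log probabilities avoids multiplying many tiny numbers together, and the smoothing term keeps a single unseen word from forcing a class probability to zero.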
Hands-On Machine Learning With Scikit-Learn, Keras and TensorFlow by Aurélien Géron
HERMAN: This book will teach you how to run predictive analytics. In the data science world, there are two main programming languages: Python and R. There are pros and cons to both, but this book is specifically for Python. Scikit-Learn, Keras and TensorFlow are all libraries of machine learning and deep learning functions within the Python programming language.
You have to be pretty good at these libraries to be a data scientist. When I was starting out, I would reference this book daily. To this day I probably look at it at least monthly as a reference, because he really goes deep into explaining how each algorithm works. A lot of algorithms have a lot of knobs or levers that you can turn, so depending on what the data is doing, you might change the algorithm a little bit. The author explains what those different knobs and levers are in a way that a beginner can understand, but someone with more experience can appreciate the level of detail that he goes into.
Think Stats by Allen B. Downey
MILLER: Data science is a mix of three different disciplines. One is programming and computer science; one is linear algebra, stats, very math-heavy analytics; and then one is machine learning and algorithms. The ideal data scientist is really good at all of them. But that doesn't always happen, so this book is about building out that analytics, math and stats side of your data science knowledge. How do you do testing, how do you determine whether your solutions are working and the distributions are right, and how do you use that math stuff to solve business problems?
It's textbook-y, but it isn't a hardcore textbook. It also merges the statistical analysis with how you would write it in Python. Early in my career, I found statistics fairly easy, but making statistics into a program was more challenging. I found this very helpful for making that connection.
Grokking Deep Learning by Andrew W. Trask
CAMARGO: This book is an introductory textbook for the beginner who wants to go beyond usage and understand a bit of how deep learning works. People who develop deep learning tools are usually drawing from a lot of mathematics: multivariate calculus, linear algebra, optimization, often some physics too. But you don't need all these things to understand what deep learning is doing. In the author's words, "If you've passed high school mathematics and hacked around in Python, you're ready for this book." It covers some very general and fundamental bits, such as gradient descent, backpropagation and regularization, which are used in so many advanced tools that you cannot progress without a decent understanding of them.
I think books like this are important because thanks to online tutorials, you can get to a point where you're implementing complex stuff without actually understanding how it works; all you need is Python and an internet connection. And that is troublesome, sometimes. People can waste resources by using deep neural networks where a linear regression would do (using a bazooka to kill a fruit fly, in a sense) or by implementing algorithms that lead to decisions that harm people, without the programmers realizing that's happening.
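As a small illustration of two ideas mentioned here, gradient descent and plain linear regression, the sketch below fits a straight line by repeatedly nudging a slope and an intercept downhill on the mean squared error. It is not taken from the book; the function name, learning rate, step count and data points are all assumptions chosen for the example.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            # Partial derivatives of the mean squared error.
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fitted slope and intercept should land very close to 2 and 1. Deep learning libraries apply the same update rule to millions of parameters, with backpropagation supplying the gradients.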
 
Linear Algebra Done Right by Sheldon Axler
MILLER: This book is an undergraduate math textbook. It's designed for a mid-level linear algebra course, which is something every data scientist can use. It's not sexy. It's not machine learning, it's not flash programming. But the thing that I use more than anything else is my ability to take a matrix or a high-dimensional space and think about it. This is one of those books that, when you're done, you will know inside and out how to do matrices and how to handle the vector space and how to do pure math about high-dimensional spaces. I wouldn't say it's for everybody, though. If this was your first math book, you would find it daunting. This is for a 200- or 300-level course.
MORE ADVANCED TEXTBOOKS
Pattern Recognition and Machine Learning by Christopher M. Bishop
MILLER: This book is definitely a textbook. Also, if you take Data Science From Scratch and then turn up the math level to 11, that's what this book is. It bases everything on what is known as a Bayesian viewpoint, and it says that it has an intro for Bayesian learning, which it technically does, but any beginner would be mortified by it about two pages in. When I talked to other data scientists who are as nerdy as me, though, this is the book that we always end up talking about.
As far as what "pattern recognition" means here: any machine learning is pattern recognition, right? Looking at how the stock market used to perform and then projecting how it should perform next, that's pattern recognition. But similarly, looking at a bunch of signs and learning, "this pattern means stop," that's a similar thing. Machine learning is a big, fancy, shiny term, which basically just means using the old data to think about the data you haven't seen before. This is probably the best book I've read on the subject, just in terms of depth and clarity of presentation. He's not glossing over anything and he's not making it super beginner-friendly. It's just, "This is how it works, and you can take it or leave it."
 
Deep Learning With Python by François Chollet
HERMAN: The author of this book is the creator of the library called Keras, which makes it a lot easier to build neural networks in Python. Usually, in deep learning, you're using neural networks on unstructured data. So if you're trying to predict if there's a person in an image, or whether a review on Yelp is positive or negative, you would use a deep neural network. I remember when I was reading this, in the second chapter, you build a neural network for the first time. He writes out code in the book, and then you try it out for yourself on your computer, and you get 98 percent accuracy. The data set is a bunch of handwritten numbers and you're trying to predict what the number is, even though everyone's handwriting is different. The ones the algorithm gets incorrect are ones that I would probably get incorrect. Being able to do that in the second chapter, I was like, "OK, I'm definitely gonna be finishing this book."
Designing Data-Intensive Applications by Martin Kleppmann
MILLER: This book isn't a standard pick for a data science book because it's very much in that data engineering, computer science corner of data science's three pillars. It's more about designing databases and making sure that your data can flow in and out of your system. If I wanted to build a system to store every Yelp review that's ever existed, every Yelp user and all of that information, this book is about how you store that. How do you make sure that the data can go in and out? How do you make sure that the data is consistent and reliable? How do you make sure that your system doesn't break when you get a million users instead of 100,000 users?
It's not super data science-y, but I think it's a piece of the puzzle that a lot of data scientists ignore, and it explains why your system should be this way very clearly. It doesn't assume that you're a data engineer or an admin. I would say anybody who's a data scientist owes it to themselves to learn about how the systems they rely on work. But you probably aren't going to sit down and read this one end to end. It's more of a reference.
 
Data Science With Python and Dask by Jesse Daniel
HERMAN: The focus of this book is big data, specifically working on it with Dask.
Dask is a new library in Python and it's this buzzword right now. I see it in pretty much every job description my students apply for, and I'm very fond of it. Most companies that work with big data use a library called Spark, but it has a huge learning curve. You have to learn essentially a new language to use it. Dask allows you to interact with massive data sets in libraries that you're already comfortable with. In this book, I really liked seeing how concepts were applied. The author introduces a data set at the beginning (it's 42 million parking tickets around New York City), and he'll explain a concept and then apply it on that data set.
Responses have been condensed and edited.