XML is a markup language used to store and transport data. It stands for eXtensible Markup Language. XML is quite similar to HTML in structure, but the two were designed to accomplish different goals.
XML is designed to transport data, while HTML is designed to display data. Many systems use incompatible data formats. This makes data exchange between such systems a time-consuming task for web developers, because large amounts of data have to be converted, and there is a risk that incompatible data is lost along the way. XML stores data in plain text, thereby providing a software- and hardware-independent method of storing and sharing data.
Another major difference is that HTML tags are predefined, whereas XML tags are not.
❖ Example of XML:
<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Harry Potter</to>
  <from>Albus Dumbledore</from>
  <heading>Reminder</heading>
  <body>It does not do to dwell on dreams and forget to live!</body>
</note>
As mentioned earlier, XML tags are not pre-defined so we need to find the tag that holds the information that we want to extract. Thus there are two major aspects governing the parsing of XML files:
Finding the required Tags.
Extracting data after identifying the Tags.
BeautifulSoup and LXML Installation
When it comes to web scraping with Python, BeautifulSoup is one of the most commonly used libraries. The recommended way of parsing XML files with BeautifulSoup is to use the lxml parser, which is a separate third-party package.
You can install both libraries using the pip installation tool. Please have a look at our blog tutorial to learn how to install them if you want to scrape data from an XML file using BeautifulSoup.
Note: Before we proceed with our discussion, please have a look at the following XML file that we will be using throughout the course of this article. (Please create a file with the name sample.xml and copy-paste the code given below to practice further.)
Since the tags are not pre-defined in XML, we must identify the tags and search them using the different methods provided by the BeautifulSoup library. Now, how do we find the right tags? We can do so with the help of BeautifulSoup's search methods.
Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:
find()
find_all()
We have an entire blog tutorial on the two methods. Please have a look at the following tutorial to understand how these search methods work.
If you have read the above-mentioned article, then you can easily use the find() and find_all() methods to search for tags anywhere in the XML document.
Relationship Between Tags
It is extremely important to understand the relationship between tags, especially while scraping data from XML documents.
The three key relationships in the XML parse tree are:
Parent: The tag which is used as the reference tag for navigating to child tags.
Children: The tags contained within the parent tag.
Siblings: As the name suggests these are the tags that exist on the same level of the parse tree.
Let us have a look at how we can navigate the XML parse tree using the above relationships.
Finding Parents
❖ The parent attribute allows us to find the parent of a given tag, as shown in the example below.
Example: In the following code we will find the parent of the common tag.
print(soup.common.parent.name)
Output:
plant
Note: The name attribute allows us to extract the name of the tag instead of extracting the entire content.
Finding Children
❖ The children attribute allows us to find the child tag as shown in the example below.
Example: In the following code we will find out the children of the plant tag.
for child in soup.plant.children:
    if child.name is None:
        continue  # skip the whitespace strings between tags
    print(child.name)
Output:
common
botanical
zone
light
price
availability
Finding Siblings
A tag can have siblings before and after it.
❖ The previous_siblings attribute returns the siblings before the referenced tag, and the next_siblings attribute returns the siblings after it.
Example: The following code finds the previous and next sibling tags of the light tag of the XML document.
print("***Previous Siblings***")
for sibling in soup.light.previous_siblings:
    if sibling.name is not None:
        print(sibling.name)

print("\n***Next Siblings***")
for sibling in soup.light.next_siblings:
    if sibling.name is not None:
        print(sibling.name)
Output:
***Previous Siblings***
zone
botanical
common

***Next Siblings***
price
availability
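The three relationships can be tried out together in one self-contained sketch. The one-plant document below is a hypothetical stand-in for the sample file (the element names follow the outputs above, the values are made up), and the built-in html.parser is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-plant document; element names mirror the outputs above,
# the values are invented for illustration.
xml_doc = (
    "<catalog><plant>"
    "<common>Bloodroot</common>"
    "<botanical>Sanguinaria canadensis</botanical>"
    "<zone>4</zone>"
    "<light>Mostly Shady</light>"
    "<price>$2.44</price>"
    "<availability>031599</availability>"
    "</plant></catalog>"
)

# Assumption: html.parser instead of lxml, to avoid an extra dependency.
soup = BeautifulSoup(xml_doc, "html.parser")

# Parent: the tag that directly contains <common>
parent_name = soup.common.parent.name

# Children: every tag directly inside <plant> (strings have name None)
child_names = [child.name for child in soup.plant.children if child.name is not None]

# Siblings: tags on the same level as <light>, before and after it
previous_names = [tag.name for tag in soup.light.previous_siblings if tag.name is not None]
next_names = [tag.name for tag in soup.light.next_siblings if tag.name is not None]

print(parent_name)
print(child_names)
print(previous_names)
print(next_names)
```

Note that previous_siblings walks backwards from the reference tag, which is why zone comes before common in the output above.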
Extracting Data From Tags
By now, we know how to navigate and find data within tags. Let us have a look at the attributes that help us to extract data from the tags.
Text And String Attributes
To access the text values within tags, you can use the text or string attribute.
Example: Let us extract the text from the common and botanical tags using the text and string attributes.
plant_name = soup.find_all('common')
scientific_name = soup.find_all('botanical')

print('***PLANT NAME***')
for tag in plant_name:
    print(tag.text)

print('\n***BOTANICAL NAME***')
for tag in scientific_name:
    print(tag.string)
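The two attributes behave the same for a tag that holds a single string, but they differ for a tag with several children. The following sketch illustrates this on a hypothetical two-element snippet (the values are made up, and html.parser is used only to avoid the lxml dependency).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the sample file from the article.
soup = BeautifulSoup(
    "<plant><common>Bloodroot</common><price>$2.44</price></plant>",
    "html.parser",
)

# A tag with exactly one string child: text and string agree.
single_text = soup.common.text
single_string = soup.common.string

# A tag with several children: text joins all strings, string is None.
nested_text = soup.plant.text
nested_string = soup.plant.string

print(single_text, single_string)
print(nested_text, nested_string)
```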
The contents attribute allows us to extract the entire content from the tags, that is the tag along with the data. The contents attribute returns a list, therefore we can access its elements using their index.
Example:
print(soup.plant.contents)
# Accessing content using index
print()
print(soup.plant.contents[1])
If you observe closely, the tags look rather messy when we print them on the screen. While this may not directly hurt productivity, a cleaner, structured print style helps us parse the document more effectively.
The following code shows how the output looks when we print the BeautifulSoup object normally:
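As a rough illustration of the difference, the sketch below prints a small hypothetical snippet twice: once as a plain string and once through Beautiful Soup's prettify() method. Using prettify() here is an assumption about which structured print style is meant.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; html.parser is used to avoid the lxml dependency.
soup = BeautifulSoup(
    "<plant><common>Bloodroot</common><price>$2.44</price></plant>",
    "html.parser",
)

# Normal printing: everything ends up on a single line.
compact = str(soup)
print(compact)

# prettify(): one tag or string per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```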
We are now well versed with all the concepts required to extract data from a given XML document. It is now time to have a look at the final code where we shall be extracting the Name, Botanical Name, and Price of each plant in our example XML document (sample.xml).
Please follow the comments in the code given below to understand the logic used in the solution.
from bs4 import BeautifulSoup

# Open and read the XML file
with open("sample.xml", "r") as file:
    contents = file.read()

# Create the BeautifulSoup object and use the lxml parser
soup = BeautifulSoup(contents, 'lxml')

# Extract the contents of the common, botanical and price tags
plant_name = soup.find_all('common')  # names of the plants
scientific_name = soup.find_all('botanical')  # scientific names of the plants
price = soup.find_all('price')  # prices of the plants

# Use a for loop with enumerate() to keep count of each iteration
for n, title in enumerate(plant_name):
    print("Plant Name:", title.text)
    # Use the counter to index the lists of scientific names and prices
    print("Botanical Name: ", scientific_name[n].text)
    print("Price: ", price[n].text)
    print()
XML documents are an important means of transporting data, and hopefully after reading this article you are well equipped to extract the data you want from these documents. You might be tempted to have a look at this video series where you can learn how to scrape webpages.
To become successful in coding, you need to get out there and solve real problems for real people. That’s how you can become a six-figure earner easily. And that’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?
Practice projects are how you sharpen your saw in coding!
Do you want to become a code master by focusing on practical code projects that actually earn you money and solve problems for people?
Then become a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.
This chapter draft is part of my upcoming book “From One to Zero” (NoStarch 2021). You’ll learn about the concept of premature optimization and why it hurts your programming productivity. Premature optimization is one of the main problems of poorly written code. But what is it anyway?
Definition Premature Optimization
Definition: Premature optimization is the act of spending valuable resources—such as time, effort, lines of code, or even simplicity—on unnecessary code optimizations.
There’s nothing wrong with optimized code.
The problem is that there’s no such thing as a free lunch. When you think you are optimizing a code snippet, what you’re really doing is trading one variable (e.g., complexity) against another (e.g., performance).
Sometimes you can obtain clean code that is also more performant and easier to read—but you must spend time to get to this state! Other times, you prematurely spend more lines of code on a state-of-the-art algorithm to improve execution speed. For example, you may add 30% more lines of code to improve execution speed by 0.1%. These types of trade-offs will screw up your whole software development process when done repeatedly.
Donald Knuth Quote Premature Optimization
But don’t take my word for it. Here’s what one of the most famous computer scientists of all time, Donald Knuth, says about premature optimization:
“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97 % of the time: premature optimization is the root of all evil.” — Donald Knuth
Knuth argues that most of the time, you shouldn’t bother tweaking your code to obtain small efficiency gains. Let’s dive into six practical instances of premature optimization to see how it can get you.
Six Examples of Premature Optimization
There are many situations where premature optimization may occur. Watch out for those! Next, I’ll show you six instances—but I’m sure there are more.
Premature Optimization of Code Functions
First, you spend a lot of time optimizing a code function or code snippet that you just cannot stand leaving unoptimized. You argue that it’s a bad programming style to use the naïve method, and you should use more efficient data structures or algorithms to tackle the problem. So, you dive into learning mode, and you find better and better algorithms. Finally, you decide on one that’s considered best, but it takes you hours and hours to make it work. The optimization was premature because, as it turns out, your code snippet is executed only seldom, and it doesn’t result in meaningful performance improvements.
Premature Optimization of Software Product’s Features
Second, you add more features to your software product because you believe that users will need them. You optimize for expected but unproven user needs. Say you develop a smartphone app that translates text into morse code lights. Instead of developing the minimum viable product (MVP, see Chapter 3) that does just that, you add more and more features that you expect are necessary, such as a text to audio conversion and even a receiver that translates light signals to text. Later you find out that your users never use these features. Premature optimization has significantly slowed down your product development cycle and reduced your learning speed.
Premature Optimization of Planning Phase
Third, you prematurely optimize your planning phase, trying to find solutions to all kinds of problems that may occur. While it’s very costly to avoid planning, many people never stop planning, which can be just as costly! Only now the costs are opportunity costs of not taking action. Making a software product a reality requires you to ship something of value to the real world—even if this thing is not perfect, yet. You need user feedback and a reality check before even knowing which problems will hit you the hardest. Planning can help you avoid many pitfalls, but if you’re the type of person without a bias towards action, all your planning will turn into nothing of value.
Premature Optimization of Scalability
Fourth, you prematurely optimize the scalability of your application. Expecting millions of visitors, you design a distributed architecture that dynamically adds virtual machines to handle peak load if necessary. Distributed systems are complex and error-prone, and it takes you months to make your system work. Even worse, I’ve seen more cases where the distribution has reduced an application’s scalability due to an increased overhead for communication and data consistency. Scalable distributed systems always come at a price—are you sure you need to pay it? What’s the point of being able to scale to millions of users if you haven’t even served your first one?
Premature Optimization of Test Design
Fifth, you believe in test-driven development, and you insist on 100% test coverage. Some functions don’t lend themselves to unit tests because of their non-deterministic input (e.g., functions that process free text from users). Even though it has little value, you prematurely optimize for a perfect coverage of unit tests, and it slows down the software development cycle while introducing unnecessary complexity into the project.
Premature Optimization of Object-Orientated World Building
Sixth, you believe in object orientation and insist on modeling the world using a complex hierarchy of classes. For example, you write a small computer game about car racing. You create a class hierarchy where the Porsche class inherits from the Car class, which inherits from the Vehicle class. In many cases, these types of stacked inheritance structures add unnecessary complexity and could be avoided. You’ve prematurely optimized your code to model a world with more details than the application needs.
Code Example of Premature Optimization Gone Bad
Let’s consider a small Python application that should serve as an example for a case where premature optimization went bad. Say, three colleagues Alice, Bob, and Carl regularly play poker games in the evenings. They need to keep track of who owes whom during a game night. As Alice is a passionate programmer, she decides to create a small application that tracks the balances of a number of players.
She comes up with code that serves the purpose well.
Listing: Simple script to track transactions and balances.
The script has two global variables transactions and balances. The list transactions tracks the transactions as they occurred during a game night. Each transaction is a tuple of sender identifier, receiver identifier, and the amount to be transferred from the sender to the receiver. The dictionary balances tracks the mapping from user identifier to the number of credits based on the occurred transactions.
The function transfer(sender, receiver, amount) creates and stores a new transaction in the global list, creates new balances for users sender and receiver if they haven’t already been created, and updates the balances according to the transaction. The function get_balance(user) returns the balance of the user given as an argument. The function max_transaction() goes over all transactions and returns the one that has the maximum value in the third tuple element—the transaction amount.
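A minimal sketch consistent with the description above could look as follows. The exact listing may differ, and the three sample transfers at the bottom are invented for illustration.

```python
transactions = []  # list of (sender, receiver, amount) tuples
balances = {}      # maps user identifier -> number of credits


def transfer(sender, receiver, amount):
    """Store a transaction and update both balances."""
    transactions.append((sender, receiver, amount))
    if sender not in balances:
        balances[sender] = 0
    if receiver not in balances:
        balances[receiver] = 0
    balances[sender] -= amount
    balances[receiver] += amount


def get_balance(user):
    """Return the current balance of the given user."""
    return balances[user]


def max_transaction():
    """Return the transaction with the largest amount (third tuple element)."""
    return max(transactions, key=lambda t: t[2])


# Illustrative game night (made-up amounts)
transfer("Alice", "Bob", 2000)
transfer("Bob", "Carl", 4000)
transfer("Alice", "Carl", 2000)

print("Balance Alice:", get_balance("Alice"))
print("Balance Bob:", get_balance("Bob"))
print("Balance Carl:", get_balance("Carl"))
print("Max transaction:", max_transaction())
```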
The application works—it returns the following output:
But Alice isn’t happy with the application. She realizes that calling max_transaction() results in some inefficiencies due to redundant calculations: if the function is called twice, the script goes over the whole list of transactions both times to find the transaction with the maximum amount. The second time, it could theoretically reuse the result of the first call and only look at the new transactions.
To make the code more efficient, she adds another global variable max_transaction that keeps track of the maximum transaction amount ever seen.
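One way this change could look is sketched below. It is an assumption about the shape of the optimized script, not the original listing; the helper get_max_transaction() and the sample transfers are hypothetical.

```python
transactions = []
balances = {}
max_transaction = None  # largest transaction seen so far (extra global state)


def transfer(sender, receiver, amount):
    """Store a transaction, update balances, and maintain the running maximum."""
    global max_transaction
    transaction = (sender, receiver, amount)
    transactions.append(transaction)
    balances[sender] = balances.get(sender, 0) - amount
    balances[receiver] = balances.get(receiver, 0) + amount
    # The bookkeeping that replaces the scan over all transactions
    if max_transaction is None or amount > max_transaction[2]:
        max_transaction = transaction


def get_max_transaction():
    """Return the cached maximum without touching the transaction list."""
    return max_transaction


transfer("Alice", "Bob", 2000)
transfer("Bob", "Carl", 4000)
transfer("Alice", "Carl", 2000)
print("Max transaction:", get_max_transaction())
```

The lookup is now constant time, but every call to transfer() has to maintain one more piece of global state.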
By adding more complexity to the code, it is now more performant, but at what cost? The added complexity results in no meaningful performance benefit for the small applications for which Alice is using the code. It makes the code more complicated and reduces maintainability. Nobody will ever notice the performance benefit in the evening gaming sessions. But Alice’s progress will slow down as she adds more and more global variables (e.g., tracking the minimal transaction amounts). The optimization clearly was premature: the concrete application had no need for it.
Do you want to develop the skills of a well-rounded Python professional—while getting paid in the process? Become a Python freelancer and order your book Leaving the Rat Race with Python on Amazon (Kindle/Print)!
HTML (Hypertext Markup Language) consists of numerous tags and the data we need to extract lies inside those tags. Thus we need to find the right tags to extract what we need. Now, how do we find the right tags? We can do so with the help of BeautifulSoup's search methods.
Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:
find()
find_all()
The other methods are quite similar in terms of their usage. Therefore, we will be focusing on the find() and find_all() methods in this article.
The following example will be used throughout this document while demonstrating the concepts:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Searching Tree</title></head>
<body>
<h1>Searching Parse Tree In BeautifulSoup</h1>
<p class="Main">Learning <a href="https://docs.python.org/3/" class="language" id="python">Python</a>,
<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a> and
<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
Types Of Filters
There are different filters that can be passed into the find() and find_all() methods, and it is crucial to have a clear understanding of them because they are used again and again throughout the search mechanism. These filters can match on:
the name of a tag,
its attributes,
the text of a string,
or a mix of these.
❖ A String
When we pass a string to a search method then Beautiful Soup performs a match against that passed string. Let us have a look at an example and find the <h1> tags in the HTML document:
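A minimal sketch of the string filter is shown below. It uses a shortened, hypothetical stand-in for the example document so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (assumption for illustration)
html_doc = (
    "<html><body>"
    "<h1>Searching Parse Tree In BeautifulSoup</h1>"
    "<p class='Main'>Learning is fun!</p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Passing the string "h1" matches every tag whose name is exactly "h1"
headings = soup.find_all("h1")
for heading in headings:
    print(heading)
```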
❖ A Regular Expression
Passing a regular expression object allows Beautiful Soup to filter results according to that regular expression. In case you want to master the concepts of the regex module in Python, please refer to our regex tutorial.
Note:
We need to import the re module to use a regular expression.
To get just the name of the tag instead of the entire content (tag + content within the tag), use the .name attribute.
Example: The following code finds all instances of the tags starting with the letter “b”.
import re

# find all tags whose names start with "b"
for regular in soup.find_all(re.compile("^b")):
    print(regular.name)
❖ A Function
We can define a function that takes an element as its only argument. The function returns True in case of a match; otherwise it returns False.
Example: The following code defines a function that returns True for all tags that have both a class and an id attribute in the HTML document. We then pass this function to the find_all() method to get the desired output.
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')

for tag in soup.find_all(func):
    print(tag)
➠ Now that we have gone through the different kinds of filters that we use with the search methods, we are well equipped to dive deep into the find() and find_all() methods.
The find() Method
The find() method is used to search for the occurrence of the first instance of a tag with the needed name.
Syntax:
find(name, attrs, recursive, string, **kwargs)
➠ find() returns an object of type bs4.element.Tag.
Example:
print(soup.find('h1'), "\n")
print("RETURN TYPE OF find(): ",type(soup.find('h1')), "\n")
# note that only the first instance of the tag is returned
print(soup.find('a'))
Output:
<h1>Searching Parse Tree In BeautifulSoup</h1>

RETURN TYPE OF find():  <class 'bs4.element.Tag'>

<a class="language" href="https://docs.python.org/3/" id="python">Python</a>
➠ The above operation is the same as soup.h1 or soup.a, which also return the first instance of the given tag. So, what’s the difference? The find() method helps us find a particular instance of a given tag using key-value pairs, for example by passing id='java' as a keyword argument.
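The following sketch shows the idea on a shortened, hypothetical stand-in for the example document: without a keyword argument find() returns the first link, while a key-value pair selects a specific one.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (assumption for illustration)
html_doc = (
    '<p class="Main">'
    '<a href="https://docs.python.org/3/" class="language" id="python">Python</a> '
    '<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Same result as soup.a: the first <a> tag in the document
first_link = soup.find("a")

# A key-value pair narrows the search to the tag with id="java"
java_link = soup.find("a", id="java")

print(first_link)
print(java_link)
```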
The find_all() Method
We saw that the find() method is used to search for the first tag. What if we want to find all instances of a tag, or numerous instances of a given tag, within the HTML document? The find_all() method helps us search for all tags with the given tag name and returns a list-like object of type bs4.element.ResultSet. Since the items are returned in a list, they can be accessed with the help of their index.
Now there are numerous other arguments apart from the filters that we already discussed earlier. Let us have a look at them one by one.
❖ The name Argument
As stated earlier the name argument can be a string, a regular expression, a list, a function, or the value True.
Example:
for tag in soup.find_all('p'):
    print(tag)
Output:
<p class="Main">Learning <a class="language" href="https://docs.python.org/3/" id="python">Python</a>,
<a class="language" href="https://docs.oracle.com/en/java/" id="java">Java</a> and
<a class="language" href="https://golang.org/doc/" id="golang">Golang</a>;
is fun!</p>
<p class="Secondary"><b>Please subscribe!</b></p>
<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>
❖ The keyword Arguments
Just like the find() method, find_all() also allows us to find particular instances of a tag. For example, if the id argument is passed, Beautiful Soup filters against each tag’s ‘id’ attribute and returns the result accordingly.
Often we need to find a tag that has a certain CSS class, but class is a reserved keyword in Python. Thus, using class as a keyword argument gives a syntax error. Beautiful Soup 4.1.2 and later allow us to search by CSS class using the keyword argument class_.
❖ Note: Searching with class_="Secondary" returns all instances of the p tag with the class “Secondary”. You can also filter searches based on multiple attributes by passing a dictionary.
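Both variants can be sketched as follows, again on a shortened, hypothetical stand-in for the example document.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (assumption for illustration)
html_doc = (
    '<p class="Main">Learning is fun!</p>'
    '<p class="Secondary"><b>Please subscribe!</b></p>'
    '<p class="Secondary" id="finxter"><b>copyright - FINXTER</b></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class
secondary = soup.find_all("p", class_="Secondary")

# A dictionary passed as attrs filters on several attributes at once
finxter = soup.find_all("p", attrs={"class": "Secondary", "id": "finxter"})

for tag in secondary:
    print(tag)
print(finxter)
```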
❖ The limit Argument
The find_all() method scans through the entire HTML document and returns all the matching tags and strings. This can be extremely tedious and take a lot of time if the document is large. So, you can limit the number of results by passing in the limit argument.
Example: There are three links in the example HTML document, but this code only finds the first two:
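A minimal sketch of the limit argument, using a shortened stand-in for the example document that keeps only its three links:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (assumption for illustration)
html_doc = (
    '<p class="Main">'
    '<a href="https://docs.python.org/3/" class="language" id="python">Python</a>'
    '<a href="https://docs.oracle.com/en/java/" class="language" id="java">Java</a>'
    '<a href="https://golang.org/doc/" class="language" id="golang">Golang</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Stop searching as soon as two matching tags have been found
links = soup.find_all("a", limit=2)
link_ids = [link["id"] for link in links]

for link in links:
    print(link)
```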
We have successfully explored the most commonly used search methods, i.e., find() and find_all(). Beautiful Soup also has other methods for searching the parse tree, but they are quite similar to what we already discussed above. The only differences are where they are used. Let us have a quick look at these methods.
find_parents() and find_parent(): these methods are used to traverse the parse tree upwards and look for a tag’s/string’s parent(s).
find_next_siblings() and find_next_sibling(): these methods are used to find the next sibling(s) of an element in the HTML document.
find_previous_siblings() and find_previous_sibling(): these methods are used to find and iterate over the sibling(s) that appear before the current element.
find_all_next() and find_next(): these methods are used to find and iterate over the tags and strings that appear after the current element in the HTML document.
find_all_previous() and find_previous(): these methods are used to find and iterate over the tags and strings that appear before the current element in the HTML document.
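A few of these methods can be sketched side by side. The snippet below is a hypothetical, shortened document with three links, and the middle link serves as the starting point for every search.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened document (assumption for illustration)
html_doc = (
    '<p class="Main">'
    '<a id="python">Python</a>'
    '<a id="java">Java</a>'
    '<a id="golang">Golang</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
java = soup.find("a", id="java")

# Upwards: the closest enclosing <p> tag
parent = java.find_parent("p")

# Sideways: the nearest <a> siblings after and before the current tag
next_sibling = java.find_next_sibling("a")
previous_sibling = java.find_previous_sibling("a")

# Document order: all <a> tags that come after or before the current tag
later_links = java.find_all_next("a")
earlier_links = java.find_all_previous("a")

print(parent["class"])
print(next_sibling["id"], previous_sibling["id"])
print([tag["id"] for tag in later_links], [tag["id"] for tag in earlier_links])
```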
With that we come to the end of this article; I hope that after reading this article you can search elements within a parse tree with ease! Please subscribe and stay tuned for more interesting articles.
Problem: You’ve just learned about the list.clear() method in Python. You wonder, what’s its purpose? Why not create a new list and overwrite the variable instead of clearing an existing list?
Example: Say, you have the following list.
lst = ['Alice', 'Bob', 'Carl']
If you clear the list, it becomes empty:
lst.clear()
print(lst)
# []
However, you could have accomplished the same thing by just assigning a new empty list to the variable lst, that is, with lst = [].
The output is the same. Why does the list.clear() method exist in the first place?
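However, the two variants lead to different results if multiple variables point to the same list object. Say a second variable lst_2 refers to the same list as lst. After lst.clear(), both variables see an empty list, because the list object itself was emptied.
In the second variant, lst = [], only lst is rebound to a new list, so the variable lst_2 still points to a non-empty list object!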
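The difference can be checked directly in a short sketch; the variable names lst and lst_2 follow the text above.

```python
# Variant 1: clear the list object in place
lst = ['Alice', 'Bob', 'Carl']
lst_2 = lst              # second name for the same list object
lst.clear()
cleared_alias = list(lst_2)
print(lst, lst_2)        # both names see the emptied list

# Variant 2: rebind the name to a new empty list
lst = ['Alice', 'Bob', 'Carl']
lst_2 = lst
lst = []                 # only the name lst changes
rebound_alias = list(lst_2)
print(lst, lst_2)        # lst_2 still refers to the original list
```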
So, there are at least two reasons why the list.clear() method can be superior to creating a new list:
Release Memory: If you have a large list that fills your memory—such as a huge data set or a large file read via readlines()—and you don’t need it anymore, you can immediately release the memory with list.clear(). Especially in interactive mode, Python doesn’t know which variable you still need – so it must keep all variables till session end. But if you call list.clear(), it can release the memory for other processing tasks.
Clear Multiple List Variables: Multiple variables may refer to the same list object. If you want to reflect that the list is now empty, you can either call list.clear() on one variable and all other variables will see it, or you must assign var1 = [], var2 = [], ..., varn = [] for all variables. This can be a pain if you have many variables.
Python’s built-in abs(x) function returns the absolute value of the argument x, which can be an integer, a float, or an object implementing the __abs__() method. For a complex number, the function returns its magnitude. The absolute value of any numerical input argument -x or +x is the corresponding non-negative value.
Argument: x (int, float, complex, or an object with an __abs__() implementation)
Return Value: |x|, the absolute value of the input argument.
Integer input –> Integer output
Float input –> Float output
Complex input –> Float output (the magnitude)
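To illustrate the last kind of argument, here is a minimal sketch of a custom class that implements __abs__(). The class Vector2D is hypothetical and exists only for this example.

```python
import math


class Vector2D:
    """Hypothetical 2D vector used only to demonstrate __abs__()."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __abs__(self):
        # abs(vector) is defined here as the Euclidean length
        return math.hypot(self.x, self.y)


length = abs(Vector2D(3, 4))  # abs() delegates to Vector2D.__abs__()
print(length)
# 5.0
```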
Example Integer abs()
The following code snippet shows how to compute the absolute value of the positive integer 42.
# POSITIVE INTEGER
x = 42
abs_x = abs(x)
print(f"Absolute value of {x} is {abs_x}")
# Absolute value of 42 is 42
The following code snippet shows how to compute the absolute value of the negative integer -42.
# NEGATIVE INTEGER
x = -42
abs_x = abs(x)
print(f"Absolute value of {x} is {abs_x}")
# Absolute value of -42 is 42
Example Float abs()
The following code snippet shows how to compute the absolute value of the positive float 42.42.
# POSITIVE FLOAT
x = 42.42
abs_x = abs(x)
print(f"Absolute value of {x} is {abs_x}")
# Absolute value of 42.42 is 42.42
The following code snippet shows how to compute the absolute value of the negative float -42.42.
# NEGATIVE FLOAT
x = -42.42
abs_x = abs(x)
print(f"Absolute value of {x} is {abs_x}")
# Absolute value of -42.42 is 42.42
Example Complex abs()
The following code snippet shows how to compute the absolute value of the complex number (3+10j).
# COMPLEX NUMBER
complex_number = (3+10j)
abs_complex_number = abs(complex_number)
print(f"Absolute value of {complex_number} is {abs_complex_number}")
# Absolute value of (3+10j) is 10.44030650891055
Python abs() vs fabs()
Python’s built-in function abs(x) calculates the absolute value of the argument x. Similarly, the fabs(x) function of the math module calculates the same absolute value. The difference is that math.fabs(x) always returns a float number while Python’s built-in abs(x) returns an integer if the argument x is an integer as well. The name “fabs” is shorthand for “float absolute value”.
Here’s a minimal example:
import math

x = 42

# abs()
print(abs(x))
# 42

# math.fabs()
print(math.fabs(x))
# 42.0
Python abs() vs np.abs()
Python’s built-in function abs(x) calculates the absolute value of the argument x. Similarly, NumPy’s np.abs(x) function calculates the same absolute value. There are two differences: (1) np.abs(x) returns a NumPy scalar rather than a built-in Python number (an integer input gives a NumPy integer, whereas np.fabs(x) always returns a float), and (2) np.abs(arr) can also be applied to a NumPy array arr, in which case it calculates the absolute values element-wise.
Here’s a minimal example:
import numpy as np

x = 42

# abs()
print(abs(x))
# 42

# numpy.fabs()
print(np.fabs(x))
# 42.0

# numpy.abs() on an array
a = np.array([-1, 2, -4])
print(np.abs(a))
# [1 2 4]
np.abs and np.absolute are completely identical. It doesn’t matter which one you use. The short name has two advantages: it is shorter, and it is familiar to Python programmers because it matches the name of the built-in Python function.
Summary
The abs() function is a built-in function that returns the absolute value of a number. The function accepts integers, floats, and complex numbers as input.
If you pass abs() an integer or float, n, it returns the non-negative value of n and preserves its type. In other words, if you pass an integer, abs() returns an integer, and if you pass a float, it returns a float.
# Int returns int
>>> abs(20)
20
# Float returns float
>>> abs(20.0)
20.0
>>> abs(-20.0)
20.0
The first example returns an int, the second returns a float, and the final example returns a float and demonstrates that abs() never returns a negative number.
Complex numbers are made up of two parts and can be written as a + bj where a and b are either ints or floats. The absolute value of a + bj is defined mathematically as math.sqrt(a**2 + b**2). Thus, the result is always non-negative and always a float (since taking the square root always returns a float).
Here you can see that abs() always returns a float and that the result of abs(a + bj) is the same as math.sqrt(a**2 + b**2).
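A short check of this statement, using the complex number 3 + 4j as an illustrative value:

```python
import math

z = 3 + 4j  # a = 3, b = 4

magnitude = abs(z)
by_formula = math.sqrt(z.real**2 + z.imag**2)

print(magnitude)
# 5.0
print(by_formula)
# 5.0
print(type(magnitude))
# <class 'float'>
```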
In this article, you’ll explore how to generate exponential fits with the curve_fit() function from the SciPy library. SciPy’s curve_fit() lets you build custom fit functions to describe data points that follow an exponential trend.
In the first part of the article, the curve_fit() function is used to fit the exponential trend of the number of COVID-19 cases registered in California (CA).
The second part of the article deals with fitting histograms that also follow an exponential trend.
Disclaimer: I’m not a virologist, and I assume that the spread of a viral infection is described by more complicated and accurate models; the only aim of this article is to show how to apply an exponential fit to model (to a certain degree of approximation) the increase in the total number of COVID-19 infection cases.
Exponential fit of COVID-19 total cases in California
Data related to the COVID-19 pandemic were obtained from the official website of the “Centers for Disease Control and Prevention” (https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36) and downloaded as a .csv file. The first step is to import the data into a Pandas dataframe using the Pandas functions pandas.read_csv() and pandas.DataFrame(). The resulting dataframe is made up of 15 columns, including the submission_date, the state, the total cases, the confirmed cases and other related observables. To see the order in which these categories are displayed, we print the header of the dataframe; the total cases are listed under the header “tot_cases”.
Since in this article we are only interested in the data for California, we create a sub-dataframe that contains only the rows for that state, using Pandas boolean indexing. This dataframe is called df_CA (from California) and contains all the rows of the main dataframe for which the column “state” is equal to “CA”. After this step, we build two arrays: one (called tot_cases) that contains the total cases (the respective header column is “tot_cases”) and one (called days) that contains the number of days elapsed since the first recording. Since the data were recorded daily, the “days” array is simply an array of equally spaced integers running from 0 up to (but not including) the length of the “tot_cases” array; each number is the number of days elapsed since the first recording (day 0).
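The selection step can be sketched on a tiny, invented dataframe. Only the column names “state” and “tot_cases” come from the article; the values are made up for illustration and np.arange() is one possible way to build the integer day counter.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset (invented values, real column names).
df = pd.DataFrame({
    "state": ["CA", "NY", "CA", "NY", "CA"],
    "tot_cases": [10, 7, 25, 12, 60],
})

# Boolean indexing keeps only the rows whose "state" column equals "CA".
df_CA = df[df["state"] == "CA"]

# Total cases as a NumPy array, plus one integer per recording day.
tot_cases = df_CA["tot_cases"].to_numpy()
days = np.arange(len(tot_cases))

print(tot_cases)  # [10 25 60]
print(days)       # [0 1 2]
```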
At this point, we can define the function that will be used by curve_fit() to fit the created dataset. An exponential function is defined by the equation:
y = a*exp(b*x) + c
where a, b and c are the fitting parameters. We will hence define the function exp_fit(), which returns the exponential function y defined above. The curve_fit() function takes as required inputs the fitting function that we want to fit the data with, and the x and y arrays that store the values of the data points. It is also possible to provide initial guesses for the fitting parameters by passing them as a list through the p0 argument, as well as upper and lower bounds for these parameters (for a comprehensive description of the curve_fit() function, please refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html ). In this example, we will only provide initial guesses for our fitting parameters. Moreover, we will only fit the total cases of the first 200 days, because after that the number of cases no longer followed an exponential trend (possibly due to a decrease in the number of new cases). To refer only to the first 200 values of the arrays “days” and “tot_cases”, we use array slicing (e.g. days[:200]).
The first element of the output of curve_fit() holds the fitting parameters, in the same order in which they appear in the definition of the fitting function (the second element is their covariance matrix). Keeping this in mind, we can build the array that contains the fitted results, calling it “fit_eq”.
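The call can be sketched on synthetic data, so that the recovered parameters can be compared with known ones. The parameter values, the array lengths and the initial guesses below are assumptions chosen for illustration; they are not the values of the COVID-19 dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_fit(x, a, b, c):
    """Exponential model y = a*exp(b*x) + c."""
    return a * np.exp(b * x) + c

# Synthetic, noise-free data generated from known (assumed) parameters.
true_params = (2.0, 0.05, 10.0)
days = np.arange(100)
tot_cases = exp_fit(days, *true_params)

# Fit only the first 60 points, starting from rough initial guesses.
popt, pcov = curve_fit(exp_fit, days[:60], tot_cases[:60], p0=[1.0, 0.04, 5.0])

# popt lists the parameters in the order of the function signature: a, b, c.
a, b, c = popt
fit_eq = exp_fit(days[:60], a, b, c)
print(popt)
```

Unpacking the result into popt and pcov is equivalent to indexing the returned tuple with [0] and [1].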
Now that we have built the fitting array, we can plot both the original data points and their exponential fit.
The final result will be a plot like the one in Figure 1:
Figure 1
Application of an exponential fit to histograms
Now that we know how to define and use an exponential fit, we will see how to apply it to the data displayed in a histogram. Histograms are frequently used to display the distributions of specific quantities like prices, heights, etc. The most common type of distribution is the Gaussian distribution; however, some types of observables follow a decaying exponential distribution. In a decaying exponential distribution, the frequency of the observables decreases following an exponential trend; a possible example is the amount of time that the battery of your car will last (i.e. the probability of having a battery that lasts for long periods decreases exponentially).
The exponentially decaying array is generated with the Numpy function random.exponential(). According to the Numpy documentation, the random.exponential() function draws samples from an exponential distribution; it takes two inputs, the “scale”, which is a parameter defining the exponential decay, and the “size”, which is the length of the array that will be generated. Once we have random values from an exponential distribution, we have to generate the histogram; to do this, we employ another Numpy function, called histogram(), which generates a histogram from the data (we set the binning to “auto”, so that the width of the bins is computed automatically). The output of histogram() is a tuple of two arrays; the first array contains the frequencies of the distribution while the second one contains the edges of the bins. Since we are only interested in the frequencies, we assign the first output to the variable “hist”. For this example, we will generate the array containing the bin positions by using the Numpy arange() function; the bins will have a width of 1 and their number will be equal to the number of elements contained in the “hist” array.
At this point, we have to define the fitting function and call curve_fit() on the values of the histogram we just created. The equation describing an exponential decay is similar to the one defined in the first part; the only difference is that the exponent has a negative sign, which makes the values decrease exponentially. Since the elements in the “x” array, defined for the bin positions, are the coordinates of the left edge of each bin, we define another x array that stores the position of the center of each bin (called “x_fit”); this allows the fitting curve to pass through the center of each bin, leading to a better visual impression. This array is defined by taking the values of the left side of the bins (“x” array elements) and adding half the bin size, which corresponds to half the value of the second bin position (element of index 1). As in the previous part, we now call curve_fit(), generate the fitting array and assign it to the variable “fit_eq”.
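The histogram fit can be sketched as follows. This is a variation on the procedure described above rather than a copy of it: it assumes a seeded random generator for reproducibility, computes the bin centers from the bin edges returned by histogram() instead of from unit-width positions, and uses the hypothetical function name exp_decay and assumed initial guesses.

```python
import numpy as np
from scipy.optimize import curve_fit

# Seeded generator so that the sketch is reproducible (an assumption).
rng = np.random.default_rng(seed=0)
data = rng.exponential(scale=5, size=10000)

# np.histogram returns a tuple: (counts per bin, bin edges).
hist, edges = np.histogram(data, bins="auto")

# Center of each bin, computed from its left and right edge.
centers = (edges[:-1] + edges[1:]) / 2

def exp_decay(x, a, b):
    """Decaying exponential y = a*exp(-b*x)."""
    return a * np.exp(-b * x)

# Initial guesses: height of the first bin and a rough decay rate.
popt, _ = curve_fit(exp_decay, centers, hist, p0=[hist[0], 0.1])
fit_eq = exp_decay(centers, *popt)

# With centers in data units, b is expected to land near 1/scale = 0.2.
print(popt)
```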
Once the distribution has been fitted, the last thing to do is to check the result by plotting both the histogram and the fitting function. In order to plot the histogram, we will use the matplotlib function bar(), while the fitting function will be plotted using the classical plot() function.
The final result is displayed in Figure 2:
Figure 2
Summary
In these two examples, the curve_fit() function was used to apply two different exponential fits to specific data points. However, the power of the curve_fit() function is that it allows you to define your own custom fit functions, whether linear, polynomial or logarithmic. The procedure is identical to the one shown in this article; the only difference is in the shape of the function that you have to define before calling curve_fit().
Full Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

url = "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time"  # url of the .csv file
file = pd.read_csv(url, sep=';', thousands=',')  # import the .csv file
df = pd.DataFrame(file)  # build up the pandas dataframe
print(df.columns)  # visualize the header
df_CA = df[df['state'] == 'CA']  # sub-dataframe storing only the values for California
tot_cases = np.array(df_CA['tot_cases'])  # array with the total n° of cases
days = np.arange(len(tot_cases))  # integer n° of days elapsed since the first recording

# ----DEFINITION OF THE FITTING FUNCTION----
def exp_fit(x, a, b, c):
    y = a*np.exp(b*x) + c
    return y

# ----CALL THE FITTING FUNCTION----
fit = curve_fit(exp_fit, days[:200], tot_cases[:200], p0=[0.005, 0.03, 5])
fit_eq = fit[0][0]*np.exp(fit[0][1]*days[:200]) + fit[0][2]

# ----PLOTTING----
fig = plt.figure()
ax = fig.subplots()
ax.scatter(days[:200], tot_cases[:200], color='b', s=5)
ax.plot(days[:200], fit_eq, color='r', alpha=0.7)
ax.set_ylabel('Total cases')
ax.set_xlabel('N° of days')
plt.show()

# ----APPLY AN EXPONENTIAL FIT TO A HISTOGRAM----
data = np.random.exponential(5, size=10000)  # generating a random exponential distribution
hist = np.histogram(data, bins="auto")[0]  # generating a histogram from the exponential distribution
x = np.arange(0, len(hist), 1)  # array that contains the coordinates of the left edge of each bar

# ----DECAYING FIT OF THE DISTRIBUTION----
def exp_fit(x, a, b):  # defining a decaying exponential function
    y = a*np.exp(-b*x)
    return y

x_fit = x + x[1]/2  # the points of the fit will be positioned at the center of the bins
fit_ = curve_fit(exp_fit, x_fit, hist)  # calling the fit function
fit_eq = fit_[0][0]*np.exp(-fit_[0][1]*x_fit)  # building the y-array of the fit

# ----PLOTTING----
plt.bar(x, hist, alpha=0.5, align='edge', width=1)
plt.plot(x_fit, fit_eq, color='red')
plt.show()
This tutorial, taken from my upcoming programming book “From One to Zero” (NoStarch, 2021), will show you how to write great comments. While most online tutorials focus on a bullet list of commenting tips, we dive deeper and explore the underlying reasons for the commonly recommended commenting principles. So, let’s get started!
Code For Humans Not Machines
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” — Martin Fowler
The main purpose of source code is to define what machines should do and how to do it.
Yet, if this were the only criterion, you’d use a low-level machine language such as assembler to accomplish this goal because it’s the most expressive and most powerful language.
The purpose of high-level programming languages such as Python is to help people write better code and do it more quickly. Our next principle for clean code is to constantly remind yourself that you’re writing code for other people and not for machines.
If your code will have any impact in the real world, it’ll be read multiple times by you or a programmer that takes your place if you stop working on the code base. Always assume that your source code will be read by other people. What can you do to make their job easier? Or, to put it more plainly: what can you do to mitigate the negative emotions they’ll experience against the original programmer of the code base they’re working on? Code for people, not machines!
Reduce Time to Understanding
If you write code for humans not machines, you’ll need to use comments to help readers of your code understand it better and quicker. A short comment can greatly reduce the time to cognitively grasp the meaning of the code base. Consider the following code example:
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

f_words = re.findall('\\bf\w+\\b', text)
print(f_words)

l_words = re.findall('\\bl\w+\\b', text)
print(l_words)

'''
OUTPUT:
['frost', 'flower', 'field']
['let', 'lips', 'long', 'lies', 'like']
'''
Bad code example without comments.
The previous code snippet analyzes a short text snippet from Shakespeare’s Romeo and Juliet using regular expressions. If you’re not very familiar with regular expressions, you’ll probably struggle to understand what the code does. Even the meaningful variable names don’t help much. Let’s see if a few comments can resolve your confusion!
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

# Find all words starting with character 'f'
f_words = re.findall('\\bf\w+\\b', text)
print(f_words)

# Find all words starting with character 'l'
l_words = re.findall('\\bl\w+\\b', text)
print(l_words)

'''
OUTPUT:
['frost', 'flower', 'field']
['let', 'lips', 'long', 'lies', 'like']
'''
Good code example with comments.
The two short comments go a long way toward explaining the regular expression patterns '\\bf\w+\\b' and '\\bl\w+\\b'. While I won’t dive deeply into regular expressions here, the example shows how comments can help you get a rough understanding of other people’s code without understanding every piece of syntax. For introductory tutorials on regular expressions, check out our two technical books Python One-Liners and The Smartest Way to Learn Python Regular Expressions.
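For readers who want slightly more detail, the pattern can be broken into its parts. The sketch below uses the raw-string spelling r'\bf\w+\b', which denotes the same pattern as the string used above, and a shortened one-line text chosen for illustration.

```python
import re

text = "Death lies on her like an untimely frost upon the sweetest flower of all the field."

# Raw-string spelling of the pattern, piece by piece:
#   \b   word boundary (start of a word)
#   f    the literal first letter
#   \w+  one or more further word characters
#   \b   word boundary (end of the word)
pattern = r'\bf\w+\b'

f_words = re.findall(pattern, text)
print(f_words)  # ['frost', 'flower', 'field']
```

Note that "of" is not matched: its letter f is not preceded by a word boundary.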
You’re the Expert—Share Your Wisdom!
Helpful comments give a glimpse into your thinking. As the author of the code, you possess valuable insight into it that very few other people share. Don’t miss out on sharing your insights with other people! Comments can be very useful to “abstract” over blocks of code. For example, if you have five lines of code dealing with updating customer information in a database, add a short comment before the block to explain this. This helps the reader get a quick overview of your code and accelerates both their and your “code loading time”. You can find an example of such an instance next:
# Process next order
order = get_next_order()
user = order.get_user()
database.update_user(user)
database.update_product(order.get_order())

# Ship order & confirm customer
logistics.ship(order, user.get_address())
user.send_confirmation()
Commented blocks help get an overview of the code.
The code exemplifies how an online shop completes a customer order in two high-level steps: (1) Processing the next order, and (2) Shipping the order. The comments help you understand the purpose of the code in a few seconds without the need to look at each individual method call.
Comments as WARNINGS!
You can also use comments as a warning of potentially undesired consequences. This increases the level of attention of the programmer working with your code. The following code shows such an example where programmers are warned before calling a function ship_yacht() that will actually ship an expensive yacht to a customer.
##########################################################
# WARNING #
# EXECUTING THIS FUNCTION WILL SHIP A $1,569,420 YACHT!! #
##########################################################
def ship_yacht(customer):
    database.update(customer.get_address())
    logistics.ship_yacht(customer.get_address())
    logistics.send_confirmation(customer)
Comments as warnings.
There are many more ways to use comments in a useful way. Comments are always there for the human reader, so always keep in mind that you’re writing code for humans not machines!
Avoid Unnecessary Comments
Not all comments help readers understand code better. In fact, there are many cases where comments reduce the clarity and confuse the readers of a given code base. If your goal is to write clean code, you must not only use valuable comments but also avoid unnecessary comments. But what are unnecessary comments? Let’s dive into those next.
During my time as a computer science researcher, many of my senior-level students described to me in great detail how their job interviews at various companies went. A very skilled student had successfully applied for a job at Google. He told me that the Google headhunters (usually Google engineers) criticized his code style because he had added too many unnecessary comments. Such comments are so-called “code smells”: they let expert coders figure out very quickly whether you’re a beginner, intermediate, or expert coder yourself. In most cases, unnecessary comments add a level of redundancy to the code. A great coder will use meaningful variable names (Principle: Use the Right Names), so the code often becomes self-explanatory, at least in comparison to code that doesn’t use the right names. Let’s revisit the code snippet with meaningful variable names.
investments = 10000
yearly_return = 0.1
years = 10

for year in range(years):
    print(investments * (1 + yearly_return)**year)
No comments needed.
The code calculates your cumulative investment return for ten years assuming a 10% yield. Now, let’s add some unnecessary comments!
investments = 10000 # your investments, change if needed
yearly_return = 0.1 # annual return (e.g., 0.1 --> 10%)
years = 10 # number of years to compound

# Go over each year
for year in range(years):
    # Print value of your investment in current year
    print(investments * (1 + yearly_return)**year)
Unnecessary comments.
All comments in the previous code snippet are redundant. Some of them would’ve been useful if you’d chosen less meaningful variable names such as x, y, or z. But explaining a variable named yearly_return by means of a comment doesn’t provide any additional value. On the contrary, it reduces the value because it adds unnecessary clutter to the code. The additional clutter makes your code less readable and less concise. There are a few rules that may help you avoid unnecessary comments, although the best rule is to use your common sense to identify whether a comment really improves the readability of your code.
Code Smells — Negative Commenting Principles
Don’t use inline comments. They have little value and can be completely avoided by choosing meaningful variable names.
Don’t be redundant. Redundancy is the enemy of clarity—this also holds for comments!
Don’t add obvious comments. You can see an obvious comment in the previous code snippet just before the for loop statement. Any coder knows the for loop, so what additional value do you provide with the comment # Go over each year when the for loop already states for year in range(years)?
Don’t comment out code. If you’re a programmer, it’s very likely that you’ve been guilty of this. We programmers often hang on to our beloved code snippets even if we already (grudgingly) decided to remove them. The shy approach to removing unnecessary code is to comment it out. However, commented-out code is a readability killer and you should avoid it at all costs if you want to write clean code. Instead of commenting out the unnecessary code, boldly remove it. For your peace of mind, you should use a version history tool such as Git that allows you to get any old code snippet back if you need it.
Many programming languages such as Python come with documentation functionality that allows you to describe the purpose of each function, method, and class in your code. If you’ve carefully chosen the abstraction level of each function (Single-Responsibility Principle), it’s often enough to use the built-in documentation functionality instead of comments to describe what your code does. This largely removes the need for additional comments in your code.
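In Python, this documentation functionality is the docstring. As a minimal sketch, the investment calculation from earlier could be wrapped in a function; the name investment_value and its parameters are hypothetical and chosen only for illustration.

```python
def investment_value(investment, yearly_return, years):
    """Return the value of an investment compounded for a number of years."""
    return investment * (1 + yearly_return) ** years

# The docstring is available at runtime, for example through help() or __doc__.
print(investment_value.__doc__)
print(investment_value(10000, 0.1, 1))
```

The docstring describes the purpose of the function once, so the body needs no further comments.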
This chapter draft from my upcoming book From One to Zero, to appear in 2021 with NoStarch, will teach you why and how to write clean and simple code. To stay tuned about the book release, sign up for the Finxter email coding academy (it’s free)!
Write Clean & Simple Code
Story: I learned to focus on writing clean code the hard way. One of my research projects during my time as a doctoral researcher in distributed systems was to code a distributed graph processing system from scratch. The system allowed you to run graph algorithms such as computing the shortest path on a large map in a distributed environment to speed up computation among multiple machines. If you’ve ever written a distributed application where two processes that reside on different computers interact with each other via messages, you know that the complexity can quickly become overwhelming. My code had thousands of lines of code and bugs were popping up frequently. I didn’t make any progress for weeks at a time—it was very frustrating. In theory, the concepts I developed sounded great and convincing. But practice got me! Finally, after a month or so working full-time on the code base without seeing any encouraging progress, I decided to radically simplify the code base. I started to use libraries instead of coding functions myself. I removed large code blocks of premature optimizations (see later). I removed code blocks that I had commented out for a possible later use. I refactored variable and function names. I structured the code in logical units and classes. And, after a week or so, not only was my code more readable and understandable by other researchers, it was also more efficient and less buggy. I managed to make progress again and my frustration quickly morphed into enthusiasm—clean code had rescued my research project!
Complexity: In the previous chapters, you’ve learned how harmful complexity is for any code project in the real world. Complexity kills your productivity, motivation, and time. Because most of us haven’t learned to speak in source code from an early age, it can quickly overwhelm our cognitive abilities. The more code you have, the more overwhelming it becomes. But even short code snippets and algorithms can be complicated. The following one-liner code snippet from our book Python One-Liners is a great example of a piece of source code that is short and concise, but still complex!
# Quicksort algorithm to sort a list of integers
unsorted = [33, 2, 3, 45, 6, 54, 33]
q = lambda l: q([x for x in l[1:] if x <= l[0]]) + [l[0]] + q([x for x in l if x > l[0]]) if l else []
print(q(unsorted))
# [2, 3, 6, 33, 33, 45, 54]
Complexity comes from many directions when working with source code. It slows down our understanding of the code. And it increases the number of bugs in our code. Both slow understanding and more bugs increase the project costs and the number of person-hours required to finish it. Robert C. Martin, author of the book Clean Code, argues that the more difficult it is to read and understand code, the higher the cost of writing code as well:
“Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.” — Robert C. Martin
This relationship is visualized in Figure 5-1. The x axis corresponds to the number of lines written in a given code project. The y axis corresponds to the time to write one additional line of code. In general, the more code you’ve already written in one project, the more time it takes to write an additional line of code. Why is that? Say you’ve written n lines of code and you add line n+1. Adding this line may have an effect on potentially all previously written lines. It may have a small performance penalty which impacts the overall project. It may use a variable that is defined at another place. It may introduce a bug (with probability c) and to find that bug, you must search the whole project (so your expected cost per line of code is c * T(n), where T is a function that increases steadily with its input n). It may force you to write additional lines of code to ensure backward compatibility. There are many more reasons but you get the point: the more code you’ve written, the more the additional complexity slows down your progress.
Figure 5-1: Clean code improves scalability and maintainability of your code base.
But Figure 5-1 also shows the difference between writing dirty versus clean code. If writing dirty code didn’t result in any benefit, nobody would do it! There’s a very real benefit of writing dirty code: it’s less time consuming in the short term and for small code projects. If you cram all the functionality into a 100-line script, you don’t need to invest a lot of time thinking about and structuring your project. But as you add more and more code, the monolithic code file grows from 100 to 1000 lines and at a certain point, it’ll be much less efficient compared to a more thoughtful approach where you structure the code logically in different modules, classes, or files. As a rule of thumb: try to always write thoughtful and clean code, because the additional costs for thinking, refactoring, and restructuring will pay back many times over for any non-trivial project. Besides, writing clean code is just the right thing to do. The philosophy of carefully crafting your programming art will carry you further in life.
You don’t always know the second-order consequences of your code. Think of the spacecraft on a mission towards Venus in 1962 where a tiny bug—an omission of a hyphen in the source code—caused NASA engineers to issue a self-destruct command which resulted in a loss of the rocket worth more than $18 million at the time.
To mitigate all of those problems, there’s a simple solution: write simpler code. Simple code is less error-prone, less crowded, easier to grasp, and easier to maintain. It is more fun to read and write. In many cases, it’s more efficient and takes less space. It also facilitates scaling your project because people won’t be scared off by the complexity of the project. If new coders peek into your code project to see whether they want to contribute, they’d better believe that they can understand it. With simple code, everything in your project will get simpler. You’ll make faster progress, get more support, spend less time debugging, be more motivated, and have more fun in the process.
So, let’s learn how to write clean and simple code, shall we?
Clean code is elegant and pleasing to read. It is focused in the sense that each function, class, or module focuses on one idea. A function transfer_funds(A,B) in your banking application does just that: it transfers funds from account A to account B. It doesn’t check the credit of the sender A; for this, there’s another function check_credit(A). Simple, easy to understand, and focused. How do you get simple and clean code? By spending time and effort to edit and revise the code. This is called refactoring and it must be a scheduled and crucial element of your software development process.
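A minimal sketch of this separation of concerns follows. Only the function names transfer_funds and check_credit come from the text; the in-memory balances dictionary and the extra amount parameter are assumptions added for illustration.

```python
# Hypothetical in-memory accounts; a real application would use a database.
balances = {"A": 100, "B": 20}

def check_credit(account, amount):
    """Return True if the account holds at least the given amount."""
    return balances[account] >= amount

def transfer_funds(sender, receiver, amount):
    """Move the amount from sender to receiver, and do nothing else."""
    balances[sender] -= amount
    balances[receiver] += amount

# The caller combines the two focused functions.
if check_credit("A", 30):
    transfer_funds("A", "B", 30)

print(balances)  # {'A': 70, 'B': 50}
```

Each function can be read, tested, and changed without touching the other.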
Let’s dive into some principles to write clean code. Revisit them from time to time—they’ll become meaningful sooner or later if you’re involved in some real-world projects.
Principles to Write Clean Code
Next, you’re going to learn a number of principles that’ll help you write cleaner code.
Principle 1: You Ain’t Going to Need It
The principle suggests that you should never implement code if you only expect that you’re going to need its provided functionality someday in the future—because you ain’t gonna need it! Instead, write code only if you’re 100% sure that you need it. Code for today’s needs and not tomorrow’s.
It helps to think from first principles: The simplest and cleanest code is the empty file. It doesn’t have any bugs and it’s easy to understand. Now, go from there: what do you need to add to that? In Chapter 4, you’ve learned about the minimum viable product. If you minimize the number of features you pursue, you’ll harvest cleaner and simpler code than you could ever attain through refactoring methods or all other principles combined. As you know by now, leaving out features is not only useful if they’re unnecessary. Leaving them out even makes sense if they provide relatively little value compared to other features you could implement instead. Opportunity costs are seldom measured but most often they are very significant. Just because a feature provides some benefit doesn’t justify its implementation. You have to really need the feature before you even consider implementing it. Reap the low-hanging fruit first before you reach higher!
Principle 2: The Principle of Least Surprise
This principle is one of the golden rules of effective application and user experience design. If you open the Google search engine, the cursor will be already focused in the search input field so that you can start typing your search keyword right away without needing to click into the input field. Not surprising at all—but a great example of the principle of least surprise. Clean code also leverages this design principle. Say, you write a currency converter that converts the user’s input from USD to RMB. You store the user input in a variable. Which variable name is better suited, user_input or var_x? The principle of least surprise answers this question for you!
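The currency converter can be sketched in a few lines. The exchange rate is an invented illustrative value, not a real quote, and the function name usd_to_rmb is hypothetical.

```python
USD_TO_RMB = 7.0  # assumed illustrative rate, not a real exchange rate

def usd_to_rmb(usd_amount):
    """Convert an amount in USD to RMB using the assumed rate."""
    return usd_amount * USD_TO_RMB

# The name user_input tells the reader where the value comes from;
# a name such as var_x would force them to search for its origin.
user_input = "100"
rmb_amount = usd_to_rmb(float(user_input))
print(rmb_amount)  # 700.0
```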
Principle 3: Don’t Repeat Yourself
Don’t Repeat Yourself (DRY) is a widely recognized principle: code that partially repeats itself, or that is even copied and pasted from your own code, is a sign of bad coding style. A negative example is Python code that prints the same string to the shell five times by repeating the same print statement five times.
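A minimal sketch of such repetitive code, assuming the same 'hello world' string that the loop version uses:

```python
# The same statement, copied and pasted five times.
print('hello world')
print('hello world')
print('hello world')
print('hello world')
print('hello world')
```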
The code repeats itself so the principle suggests that there will be a better way of writing it. And there is!
for i in range(5):
    print('hello world')
The code is much shorter but semantically equivalent. There’s no redundancy in the code.
The principle also shows you when to create a function and when it isn’t required to do so. Say, you need to convert miles into kilometers in multiple instances in your code (see Listing 5-1).
miles = 100
kilometers = miles * 1.60934

# ...

# BAD EXAMPLE
distance = 20 * 1.60934

# ...

print(kilometers)
print(distance)

'''
OUTPUT:
160.934
32.1868
'''
Listing 5-1: Convert miles to kilometers twice.
The principle Don’t Repeat Yourself suggests that it would be better to write a function miles_to_km(miles) once—rather than performing the same conversion explicitly in the code multiple times (see Listing 5-2).
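A minimal sketch of such a function, reusing the conversion factor and the two distances from Listing 5-1:

```python
def miles_to_km(miles):
    """Convert a distance in miles to kilometers."""
    return miles * 1.60934

# Both conversions now go through the same function.
kilometers = miles_to_km(100)
distance = miles_to_km(20)

print(kilometers)
print(distance)
```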
Listing 5-2: Using a function to convert miles to kilometers.
This way, the code is easier to maintain: you can increase the precision of the conversion afterwards without searching the code for all instances where you used the imprecise conversion. It’s also easier to understand for human readers of your code. There’s no doubt about the purpose of the function call miles_to_km(20), while you may have to think harder about the purpose of the computation 20 * 1.60934.
The principle Don’t Repeat Yourself is often abbreviated as DRY and violations of it as WET: We Enjoy Typing, Write Everything Twice, and Waste Everyone’s Time.
Principle 4: Code For People Not Machines
The main purpose of source code is to define what machines should do and how to do it. Yet, if this were the only criterion, you’d use a low-level machine language such as assembly to accomplish this goal because it’s the most expressive and most powerful language. The purpose of high-level programming languages such as Python is to help people write better code and do it more quickly. Our next principle for clean code is to constantly remind yourself that you’re writing code for other people and not for machines. If your code has any impact in the real world, it’ll be read multiple times by you or by a programmer who takes your place if you stop working on the code base. Always assume that your source code will be read by other people. What can you do to make their job easier? Or, to put it more plainly: what can you do to mitigate the negative emotions they’ll feel toward the original programmer of the code base they’re working on? Code for people, not machines!
What does this mean in practice? There are many implications. First of all, use meaningful variable names. Listing 5-3 shows a negative example without meaningful variable names.
# BAD
xxx = 10000
yyy = 0.1
zzz = 10
for iii in range(zzz):
    print(xxx * (1 + yyy)**iii)
Listing 5-3: Example of writing code for machines.
Take a guess: what does the code compute?
Let’s have a look at the semantically equivalent code in Listing 5-4 that uses meaningful variable names.
# GOOD
investments = 10000
yearly_return = 0.1
years = 10
for year in range(years):
    print(investments * (1 + yearly_return)**year)
Listing 5-4: Example of writing code for people, using meaningful variable names.
The variable names indicate that you calculate the value of an initial investment of 10,000 compounded over 10 years assuming an annual return of 10%.
The principle of writing code for people has many more applications. It also applies to indentation, whitespace, comments, and line lengths. Clean code radically optimizes for human readability. As Martin Fowler, international expert on software engineering and author of the popular book Refactoring, argues:
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
Principle 5: Stand on the Shoulders of Giants
There’s no value in reinventing the wheel. Programming is a decades-old industry and the best coders in the world have given us a great legacy: a collective database of millions of fine-tuned and well-tested algorithms and code functions. Accessing the collective wisdom of millions of programmers is as simple as using a one-liner import statement. You’d be crazy not to use this superpower in your own projects. Besides being easy to use, library code is likely to improve the efficiency of your code because functions that have been used by thousands of coders tend to be much more optimized than your own. Furthermore, library calls are easier to understand and take less space in your code project. For example, if you need a clustering algorithm to visualize clusters of customers, you can either implement it yourself or stand on the shoulders of giants: import a clustering algorithm from an external library and pass your data into it. The latter is far more time efficient. You’ll need much less time to implement the same functionality, with fewer bugs, less space, and more performant code. Libraries are one of the top reasons why master coders can be 10,000 times more productive than average coders.
Here’s the two-liner that imports the KMeans class from the scikit-learn Python library rather than reinventing the wheel (X is assumed to hold your data as a two-dimensional array):
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
If you wanted to implement the k-means algorithm yourself, it would take you a few hours and 50 lines of code, and it would clutter your code base so that all future code becomes harder to implement.
Principle 6: Use the Right Names
Your decisions on how to name your functions, function arguments, objects, methods, and variables reveal whether you’re a beginner, intermediate, or expert coder. How? In any programming language, there are many naming conventions that are used by all experienced coders. If you violate them, it immediately tells the reader of your code base that you’ve not had a lot of experience with practical code projects. The more such “tells” exist in your code, the less seriously a reader will take it.
There are a lot of explicit and implicit rules governing the correct naming of your code elements. These rules may even differ from programming language to programming language. For example, you’ll use camelCaseNaming for variables in the Java programming language while you’ll use underscore_naming in Python. If you start using camel case in Python, everyone will immediately see that you’re a Python beginner. While you may not like this, it’s not really a big problem to be perceived as a beginner. Everyone has been one at some point. Far worse is that other coders will be negatively surprised when reading your code. Instead of thinking about what the code does, they start thinking about how it is written. You know the principle of least surprise: there’s no value in surprising other coders by choosing unconventional variable names.
So, let’s dive into a list of naming rules of thumb you can consider when writing source code. This will speed up your ability to learn how to choose clean names. However, the best way to learn is to study the code of people who are better than you. Read a lot of programming tutorials, join the Stack Overflow community, and check out the GitHub code of open-source projects.
Choose descriptive names. Say you create a function to convert currencies from USD to EUR in Python. Call it usd_to_eur(amount) rather than f(x).
Choose unambiguous names. You may think that dollar_to_euro(amount) would be a good name as well for the previously discussed function. While it is better than f(x), it’s worse than usd_to_eur(amount) because it introduces an unnecessary degree of ambiguity. Do you mean US, Canadian, or Australian Dollar? If you’re in the US, the answer may be obvious to you. But an Australian coder may not know that the code was written in the US and may assume a different output. Minimize these confusions!
Use Pronounceable Names. Most coders subconsciously read code by pronouncing it in their mind. If they cannot do this subconsciously because a variable name is unpronounceable, the problem of deciphering the variable name takes their precious attention. They have to actively think about possible ways to resolve the unexpected naming. For example, the variable name cstmr_lst may be descriptive and unambiguous, but it’s not pronounceable. Choosing the variable name customer_list is well worth the additional space in your code!
Use Named Constants, Not Magic Numbers. In your code, you may use the magic number 0.9 multiple times as a factor to convert a sum in USD to a sum in EUR. However, the reader of your code, including your future self rereading it, has to think about the purpose of this number. It’s not self-explanatory. A far better way of handling this “magic number” 0.9 is to store it in a variable CONVERSION_RATE = 0.9 and use it as a factor in your conversion computations. For example, you may then calculate your income in EUR as income_euro = CONVERSION_RATE * income_usd. This way, there’s no magic number in your code and it becomes more readable.
These are only some of the naming conventions. Again, to pick the conventions up, it’s best to Google them once (for example, “Python Naming Conventions”) and study GitHub code projects from experts in your field.
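The rules above can be combined in a short sketch. The names usd_to_eur, CONVERSION_RATE, customer_list, income_usd, and income_euro come from the text; the rate of 0.9 is the illustrative value used above, not a real exchange rate, and the sample data is made up:

```python
# Named constant instead of the magic number 0.9 (illustrative rate only).
CONVERSION_RATE = 0.9


def usd_to_eur(amount):
    """Convert an amount in US dollars to euros."""
    return CONVERSION_RATE * amount


# Pronounceable and descriptive, unlike cstmr_lst (sample data is made up).
customer_list = ["Alice", "Bob"]

income_usd = 1000
income_euro = usd_to_eur(income_usd)
print(income_euro)
```

A reader can tell what each name means without looking up where it was defined, which is the point of all four rules.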
Principle 7: Single-Responsibility Principle
The single responsibility principle means that every function has one main task. A function should be small and do only one thing. It is better to have many small functions than one big function doing everything at the same time. The reason is simple: the encapsulation of functionality reduces overall complexity in your code.
As a rule of thumb: every class and every function should have only one reason to change. If there are multiple reasons to change, multiple programmers may want to change the same class at the same time. You’ve mixed too many responsibilities in your class and now it becomes messy and cluttered.
Let’s consider a small example using Python code that may run on an ebook reader to model and manage the reading experience of a user (see Listing 5-5).
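A minimal sketch of what such a class might look like, with one class handling both the data and its output. The method names next_page() and print_page() come from the text; the attribute values, the getter names, and the page-content placeholder are assumptions made for illustration:

```python
class Book:
    """Models a book AND prints it: two responsibilities in one class."""

    def __init__(self):
        # Placeholder values, assumed for illustration.
        self.title = "Python One-Liners"
        self.publisher = "NoStarch"
        self.author = "Mayer"
        self.current_page = 0

    def get_title(self):
        return self.title

    def get_author(self):
        return self.author

    def get_publisher(self):
        return self.publisher

    def next_page(self):
        # Called each time the user presses the "next" button.
        self.current_page += 1
        return self.current_page

    def print_page(self):
        # Stub: a real device would render the page content here.
        print(f"... Page Content {self.current_page} ...")


python_one_liners = Book()
print(python_one_liners.get_publisher())
python_one_liners.print_page()
python_one_liners.next_page()
python_one_liners.print_page()
```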
Listing 5-5: Modeling the book class with violation of the single responsibility principle—the book class is responsible for both data modeling and data representation. It has two responsibilities.
The code in Listing 5-5 defines a class Book with four attributes: title, author, publisher, and current page number. You define getter methods for the attributes, as well as some minimal functionality to move to the next page. The function next_page() may be called each time the user presses a button on the reading device. Another function print_page() is responsible for printing the current page to the reading device. This is only given as a stub and it’ll be more complicated in the real world. While the code looks clean and simple, it violates the single responsibility principle: the class Book is responsible for modeling the data such as the book content, but it is also responsible for printing the book to the device. You have multiple reasons to change. You may want to change the modeling of the book’s data, for example by using a database instead of a file-based input/output method. But you may also want to change the representation of the modeled data, for example by using another book formatting scheme on other types of screens. Modeling and printing are two different functions encapsulated in a single class. Let’s change this in Listing 5-6!
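A minimal sketch of the decoupled design, assuming a Book class that only models data and a separate Printer class that only handles output. The method Printer.print_page() is named in the text; the attribute values, the getter names, and the page-content placeholder are assumptions made for illustration:

```python
class Book:
    """Models the book: meta information and the current page number."""

    def __init__(self):
        # Placeholder values, assumed for illustration.
        self.title = "Python One-Liners"
        self.publisher = "NoStarch"
        self.author = "Mayer"
        self.current_page = 0

    def get_title(self):
        return self.title

    def get_author(self):
        return self.author

    def get_publisher(self):
        return self.publisher

    def get_page(self):
        return self.current_page

    def next_page(self):
        self.current_page += 1
        return self.current_page


class Printer:
    """Represents the book on the device; knows nothing about storage."""

    def print_page(self, book):
        # Stub: a real device would render the page content here.
        print(f"... Page Content {book.get_page()} ...")


python_one_liners = Book()
printer = Printer()

printer.print_page(python_one_liners)
python_one_liners.next_page()
printer.print_page(python_one_liners)
```

With this split, a change to how pages are rendered touches only Printer, and a change to how book data is stored touches only Book.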
Listing 5-6: Adhering to the single responsibility principle—the book class is responsible for data modeling and the printing class is responsible for data representation.
The code in Listing 5-6 accomplishes the same task but it satisfies the single responsibility principle. You create both a book and a printer class. The book class represents book meta information and the current page number. The printer class prints the book to the device. You pass the book for which you want to print the current page into the method Printer.print_page(). This way, data modeling and data representation are decoupled and the code becomes easier to maintain.