Welcome to here

Simplicity is life

Python: how to use simple statistics methods

2017-06-24

python

Count the occurrences of each element in a list
Count the elements that fall in each interval

First, AIM.txt contains:

23
45
89
...

Count the occurrences of each element in a list

temp = []  # the integers read from the file

with open("AIM.txt") as file_object:
    for line in file_object:
        line = line.strip()
        if line == '':
            continue  # skip blank lines
        temp.append(int(line))

#print(temp)

# map each value to the number of times it occurs
counts = {}
for key in temp:
    counts[key] = counts.get(key, 0) + 1

#print(counts)
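The same tally can also be sketched with collections.Counter from the standard library. This is an alternative to the dictionary loop above, not the code the post uses; the sample values below are assumptions standing in for the contents of AIM.txt.

```python
from collections import Counter

# Sample values standing in for the integers read from AIM.txt (assumed data).
temp = [23, 45, 89, 23, 45, 23]

# Counter builds the value -> occurrences mapping in one call.
counts = Counter(temp)

print(counts[23])             # occurrences of the value 23
print(counts.most_common(1))  # the most frequent value together with its count
```

Counter is a dict subclass, so counts.get(key, 0) and iteration work as with the hand-built dictionary.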

Count the elements that fall in each interval

# code for python3
from itertools import groupby

lst= [
    2648, 2648, 2648, 63370, 63370, 425, 425, 120,
    120, 217, 217, 189, 189, 128, 128, 115, 115, 197,
    19752, 152, 152, 275, 275, 1716, 1716, 131, 131,
    98, 98, 138, 138, 277, 277, 849, 302, 152, 1571,
    68, 68, 102, 102, 92, 92, 146, 146, 155, 155,
    9181, 9181, 474, 449, 98, 98, 59, 59, 295, 101, 5
]

for k, g in groupby(sorted(lst), key=lambda x: x//50):
    print('{}-{}: {}'.format(k*50, (k+1)*50-1, len(list(g))))

Output:

0-49: 1
50-99: 10
100-149: 15
150-199: 8
200-249: 2
250-299: 5
300-349: 1
400-449: 3
450-499: 1
800-849: 1
1550-1599: 1
1700-1749: 2
2600-2649: 3
9150-9199: 2
19750-19799: 1
63350-63399: 2
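A variation that avoids sorting the list first: count the bin index of each value with collections.Counter. This is only a sketch; bin_counts is a hypothetical helper name and the sample list is made up for the demonstration.

```python
from collections import Counter

def bin_counts(values, width=50):
    """Return {bin_start: count}, ordered by bin_start."""
    bins = Counter(v // width for v in values)
    return {k * width: n for k, n in sorted(bins.items())}

width = 50
sample = [5, 59, 59, 98, 101, 120, 152, 197]  # small assumed sample
for start, n in bin_counts(sample, width).items():
    print('{}-{}: {}'.format(start, start + width - 1, n))
```

Empty intervals are simply absent from the result, as in the groupby version.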

Drawing charts in Python: an introduction to pygal

install

sudo pip install pygal

Python can draw rather attractive charts. This post introduces the basic usage of pygal by simulating a dice-rolling game.

Code in die.py:

from random import randint

class Die():
    """A class representing a single die."""
    def __init__(self, num_sides=6):
        # by default a die has 6 sides
        self.num_sides = num_sides

    def roll(self):
        """Return a random number between 1 and num_sides."""
        return randint(1, self.num_sides)

The code below creates an instance. Code in die_visual.py:

from die import Die

die = Die()
# throw die, result in a list
results = []

for roll_num in range(100):
    result = die.roll()
    results.append(result)

print(results)

You should already know how to run the code above.

Analyzing the results

This is simple: count how often each of the values 1, 2, ..., 6 appears:

--[snip]--
frequencier = []
for value in range(1, die.num_sides + 1):
    frequency = results.count(value)
    frequencier.append(frequency)
--[snip]--

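The idea is the same as before. One thing puzzled me at first: why does the range run from 1 to die.num_sides + 1 rather than to die.num_sides? Because stopping at die.num_sides would not work, as the following code shows:

for i in range(1, 6):
    print(i)
# output: 1 2 3 4 5

range() includes its start value but excludes its end value.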
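A self-contained sketch of the counting step, runnable without die.py or pygal. The fixed seed and the use of random.Random are assumptions made only to keep the run repeatable.

```python
from random import Random

num_sides = 6
rng = Random(0)  # fixed seed, assumed here only so the demo is repeatable
results = [rng.randint(1, num_sides) for _ in range(100)]

# range(1, num_sides + 1) yields 1..6; range(1, num_sides) would stop at 5
# and the count for the face 6 would be lost.
faces = list(range(1, num_sides + 1))
frequencies = [results.count(value) for value in faces]

print(faces)
print(frequencies, sum(frequencies))
```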

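Drawing the histogram

This is the main point of today's post:

--[snip]--
import pygal

# draw the histogram
hist = pygal.Bar()
hist.title = "Results of rolling one D6 100 times"
hist.x_labels = ['1','2','3','4','5','6']
hist.x_title = "Result"
hist.y_title = "Frequency of result"

hist.add('D6', frequencier)
hist.render_to_file('die_visual.svg')

The result looks like this:

The next step is to plot the distribution for two dice. The code needs only one modification, so the full listing is left out to save space.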
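A minimal sketch of the counting step for two dice, assuming the result of a roll is the sum of both dice. roll_two_dice is a hypothetical helper, not part of die.py, and the seed is only there to make the run repeatable.

```python
from random import Random

def roll_two_dice(num_rolls, num_sides=6, seed=0):
    """Return {sum: frequency} for num_rolls throws of two dice."""
    rng = Random(seed)
    sums = [rng.randint(1, num_sides) + rng.randint(1, num_sides)
            for _ in range(num_rolls)]
    max_result = 2 * num_sides
    # Possible sums run from 2 to max_result inclusive.
    return {total: sums.count(total) for total in range(2, max_result + 1)}

frequencies = roll_two_dice(1000)
print(frequencies)
```

The keys could serve as the x labels and the values as the data series of a bar chart, in the same way as the single-die version.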


Python data visualization: an introduction to matplotlib

2017-06-22

python

Where matplotlibrc is installed
install
Random walk

Where matplotlibrc is installed

import matplotlib

print (matplotlib.matplotlib_fname())

Run it:

$ python chinese_issue.py
D:\Program File\python\lib\site-packages\matplotlib\mpl-data\matplotlibrc

You can edit this configuration file.

install

python 2.7:

sudo apt-get install python-matplotlib

A first try

import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]
plt.plot(squares)
plt.show()

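This snippet opens a two-dimensional plot. We import the pyplot module first because it contains many functions for generating charts.

A more detailed introduction follows.

Changing label text and line thickness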

import matplotlib.pyplot as plt
import matplotlib.font_manager

zhfont1 = matplotlib.font_manager.FontProperties(fname='/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc')

squares = [1, 4, 9, 16, 25]
plt.plot(squares, linewidth = 5)

# set title
plt.title(u"平方数据",fontsize = 24, fontproperties=zhfont1)
plt.xlabel(u"值", fontsize = 12, fontproperties=zhfont1)
plt.ylabel(u"值的平方", fontsize = 12, fontproperties=zhfont1)

# set scale
plt.tick_params(axis='both', labelsize=14)
plt.show()

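The linewidth parameter controls the thickness of the line that plot() draws. title, xlabel, and ylabel need little explanation; the fontproperties argument is explained below.

plt.tick_params sets the style of the tick marks; the arguments given here affect the ticks on both the x-axis and the y-axis (axis='both').

The result looks like this:

Note the zhfont1 object: it specifies the font that matplotlib uses. Without it, Chinese text is likely to come out garbled. The fname argument matters as well: if the path passed in is wrong, it will not work either.

How do you find out which fonts your system supports? Use:

fc-list

Then pass the path of the chosen font.

Correcting the plot

You may have noticed a problem in the figure above: the point at 4.0 maps to 25. When plot() receives a single list, it assumes the first value belongs to x = 0, so every point is shifted by one. The fix is to supply the input values as well: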

input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]
plt.plot(input_values, squares, linewidth = 5)

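The result is shown below:

This approach is easy to follow, mainly because the input values are supplied explicitly and matplotlib does not have to guess them.

Scatter plots (scatter)

The plots above are continuous lines. To draw separate points, use the scatter() method.

To save space, the code below carries many comments, so I will not explain each line separately.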

plt.scatter(2,4)

# a series of points
x_values = [1,2,3,4,5]
y_values = [1, 4, 9, 16, 25]
plt.scatter(x_values, y_values, s=100)

# generate the numbers automatically and set the axis ranges
# the points are so dense that they merge into a curve
x_values = list(range(1, 1001))
y_values = [x**2 for x in x_values]

plt.axis([0, 1100, 0, 1100000])

# edgecolor='none' removes the outline of the points; c changes their color
plt.scatter(x_values, y_values, c = 'red', edgecolor = 'none', s=40)

# colormap: use color to show the pattern in the data
# c = y_values passes the values that select the colors
plt.scatter(x_values, y_values,  edgecolor = 'none', c = y_values, s=40)

# save the figure automatically; note that this call should replace plt.show(),
# otherwise the saved image will be blank
plt.savefig('squares_plot.png', bbox_inches = 'tight')

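The rest of the code stays the same; just use these calls in place of plt.plot().

Random walk

This is somewhat similar to a randomized algorithm. Here is the implementation.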

code in random_walk.py:

from random import choice

class RandomWalk():
    def __init__(self, num_point = 5000):
        """Initialize the attributes of a random walk."""
        self.num_point = num_point

        # every random walk starts at (0, 0)
        self.x_values = [0]
        self.y_values = [0]


    def fill_walk(self):
        # keep taking steps until the walk reaches the desired length
        while len(self.x_values) < self.num_point:
            # decide the direction and distance of the step
            x_direction = choice([-1, 1])
            x_distance = choice([0, 1, 2, 3, 4])
            x_step = x_direction * x_distance

            y_direction = choice([-1, 1])
            y_distance = choice([0, 1, 2, 3, 4])
            y_step = y_direction * y_distance

            # reject steps that do not move at all
            if x_step == 0 and y_step == 0:
                continue

            # calculate the next x and y: add the step to the last stored value
            next_x = self.x_values[-1] + x_step
            next_y = self.y_values[-1] + y_step

            self.x_values.append(next_x)
            self.y_values.append(next_y)
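To check the properties of such a walk without plotting, the same stepping rule can be sketched as a plain function. random_walk is a hypothetical name, and the seed is an assumption used only to make the run repeatable.

```python
from random import Random

def random_walk(num_points, seed=0):
    """Return (x_values, y_values) for a walk that starts at (0, 0)."""
    rng = Random(seed)  # seeded only so the demo is repeatable
    xs, ys = [0], [0]
    while len(xs) < num_points:
        x_step = rng.choice([-1, 1]) * rng.choice([0, 1, 2, 3, 4])
        y_step = rng.choice([-1, 1]) * rng.choice([0, 1, 2, 3, 4])
        if x_step == 0 and y_step == 0:
            continue  # skip steps that do not move
        xs.append(xs[-1] + x_step)
        ys.append(ys[-1] + y_step)
    return xs, ys

xs, ys = random_walk(10)
print(list(zip(xs, ys)))
```

Each coordinate changes by at most 4 per step, and no two consecutive points coincide.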

The code above only defines the class. Here is how to use it.

code in rw_visual.py:

import matplotlib.pyplot as plt

from random_walk import RandomWalk

# create a RandomWalk() instance

rw = RandomWalk()
rw.fill_walk()

plt.scatter(rw.x_values, rw.y_values, s=15)

plt.show()

The result looks like this:

You can also use a loop to generate several different walks:

while True:
    --[snip]--

    keep_running = raw_input("Make another walk? (y/n): ")
    if keep_running == 'n':
        break

You can also use a colormap as before:

point_numbers = list(range(rw.num_point))
plt.scatter(rw.x_values, rw.y_values, c=point_numbers, cmap=plt.cm.Blues, edgecolor='none', s=15)

You can also emphasize the start and end points:

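plt.scatter(rw.x_values, rw.y_values, c=point_numbers, cmap=plt.cm.Blues, edgecolor='none', s=15)
--[snip]--
plt.scatter(0, 0, c='green', edgecolor='none', s=100)
plt.scatter(rw.x_values[-1], rw.y_values[-1], c='red', edgecolor='none', s=15)
--[snip]--

In other words, plt.scatter() can be called several times in a row: once the random points have been drawn, later calls can highlight the start and end points on top of them.

You can also hide the axes:

plt.gca().get_xaxis().set_visible(False)
plt.gca().get_yaxis().set_visible(False)

That is as far as this introduction goes. I still recommend reading the documentation: there is too little substance here to give you a thorough understanding.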


Notes on using Requests in Python

2017-06-21

python
- send a Request
- Request Content
send a Request
```
import requests

re = requests.get('https://api.github.com/events')

# this is an HTTP POST
re = requests.post('http://httpbin.org/post', data={'key': 'value'})
print(re)

# the params dictionary is appended to the URL as a query string
payload = {'keys1': 'val1', 'keys2': 'val2'}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
## http://httpbin.org/get?keys1=val1&keys2=val2
```
I have to say this library is really impressive: it solved my problem right away, and I am delighted with it.
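How the params dictionary ends up in the URL can be sketched with the standard library alone. This only illustrates the resulting query string; it is not how Requests is implemented internally, and no request is sent.

```python
from urllib.parse import urlencode

base_url = 'http://httpbin.org/get'
payload = {'keys1': 'val1', 'keys2': 'val2'}

# urlencode turns the dictionary into a query string; nothing goes over the network.
url = base_url + '?' + urlencode(payload)
print(url)  # http://httpbin.org/get?keys1=val1&keys2=val2
```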

Request Content

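Like BS (see my BS notes), this library starts by guessing the encoding on its own (Requests bases the guess on the HTTP headers) and gives you the content decoded to Unicode. See below:
```
print(re.text)
```
This guess is what you get when you do not know the character set of the page source. If you do know the correct encoding, you must set it explicitly through re.encoding:
```
print(re.encoding)
re.encoding = 'utf-8'  # example value only; use the encoding the page really has
```
So the workflow is: use re.content to find out how the source is encoded, then set re.encoding to that encoding instead of relying on a guess, and finally use re.text to get the text you want.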
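Why the encoding matters can be shown with plain bytes, which is roughly what re.content holds. The sample text and the choice of GBK are assumptions for the demonstration.

```python
# Bytes as they might arrive from a server that uses GBK (assumed sample).
content = '编码'.encode('gbk')

wrong = content.decode('latin-1')  # a wrong guess yields unreadable text
right = content.decode('gbk')      # the correct encoding recovers the text

print(repr(wrong))
print(right)
```

Setting re.encoding before reading re.text plays the role of choosing the codec passed to decode() here.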

JSON responses

Continuing with the example above:
```
re.json()
```
Status codes

I have skipped several topics above. For now I only cover the ones I expect to need; the rest will be added later.
```
re.status_code

## @2
re.headers

re.headers['Content-Type']
```
Interesting: 4xx codes are client errors, whereas 5xx codes are server errors.
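The first digit of the status code decides its class. A small sketch; status_class is a hypothetical helper, not part of Requests.

```python
def status_class(code):
    """Name the class of an HTTP status code by its first digit."""
    classes = {1: 'informational', 2: 'success', 3: 'redirection',
               4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'unknown')

for code in (200, 404, 503):
    print(code, status_class(code))
```

With a real response object, the value to pass in would be re.status_code.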

cookies

If a response contains cookies, you can access them quickly:
```
re.cookies['example_cookie_name']

## If you send your own cookies to the server, you can use the *cookies* parameter:

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')

re = requests.get(url, cookies=cookies)
print(re.text)
# {
#   "cookies": {
#     "cookies_are": "working"
#   }
# }
```

Python web scraping: using BeautifulSoup

2017-06-20

python

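Quick start
- Installation
A few simple uses
Object types
A worked example
- Line breaks

Quick start

Installation

sudo pip install beautifulsoup4

The examples below use the lxml parser, which is a separate package (pip install lxml). For installing Python packages in general, see the separate notes on that topic.

For a detailed tutorial, see:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

A reminder: when you use this library, be prepared for the pitfalls of character encoding.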

from bs4 import BeautifulSoup
import requests

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())

All the examples below build on the code above.

A few simple uses

The comments in the code below show the output.

print soup.title
#<title>The Dormouse's story</title>

print type(soup.title)
#<class 'bs4.element.Tag'>

print soup.title.name
#title

print soup.title.string
#The Dormouse's story

print soup.title.parent.name
#head

print soup.p
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print soup.p['class']
#['title']

print soup.a
#<a class="sister" href="http://example.com/elsie"
#id="link1"><!-- Elsie --></a>

print soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.find(id="link3")
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

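In fact, the example on the project's home page contains a few mistakes; I have already reported them by email.

Take a look at this example first; I will explain each part step by step.

A loop can extract the links: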

for link in soup.find_all('a'):
    print(link.get('href'))
#http://example.com/elsie
#http://example.com/lacie
#http://example.com/tillie
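For comparison, the same link extraction can be sketched with only the standard library's html.parser. This is a different technique from BeautifulSoup, shown to make explicit what the loop above does; LinkCollector is a hypothetical helper and the HTML fragment is shortened.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<p><a href="http://example.com/elsie">Elsie</a> and '
               '<a href="http://example.com/lacie">Lacie</a></p>')
print(collector.links)
```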

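Object types

Tag

This type already appeared in the code above: see <class 'bs4.element.Tag'>. A Tag has many features; the most important ones to remember are its name and its attributes.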

name

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print type(tag)
print tag.name
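# you can change the tag's name here; the change is reflected in the generated HTML

tag.name = 'yubo'
print tag
#### output:
# <class 'bs4.element.Tag'>
# b
# <yubo class="boldest">Extremely bold</yubo>

attrs

For the code above, the attribute is class, which is an attribute of the tag.

print tag['class']

# you can also access the attributes directly; note that this is a dictionary:

print tag.attrs
# {'class': ['boldest']}

class is a multi-valued attribute, which I will not explain for now.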

NavigableString

This object represents the text stored inside a tag; BeautifulSoup uses the NavigableString class to hold that text.

print tag.string
#Extremely bold

print type(tag.string)
#<class 'bs4.element.NavigableString'>

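This is where the annoying encoding issue comes in. A NavigableString behaves like a Python Unicode string, and you can convert it with unicode() (Python 2; in Python 3 use str()).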

unicode_string = unicode(tag.string)
print unicode_string
#Extremely bold

tag.string.replace_with("hello, yubo")
print tag
#<b class="boldest">hello, yubo</b>

To quote the documentation directly: If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.

BeautifulSoup

Unlike the Tag objects above, the BeautifulSoup object represents the whole document. It is rarely used directly.

Comments and other objects

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

Going back to the earlier example with html_doc unchanged, try the following statements.

soup = BeautifulSoup(html_doc, 'lxml')
print soup.a
# prints the first matching <a> tag

print soup.find_all('a')
# prints all <a> tags; the result is clearly a list

.contents and .children

head_tag = soup.head
print head_tag
#<head><title>The Dormouse's story</title></head>

print head_tag.contents
# [<title>The Dormouse's story</title>]
# soup.head is a Tag; .contents holds its
# direct children as a list


print head_tag.contents[0]
#<title>The Dormouse's story</title>
# indexing the list gives the element itself

print head_tag.contents[0].contents
# [u"The Dormouse's story"] (the children of that child)

The BeautifulSoup object itself has children; in this case, the <html> tag is the child of the BeautifulSoup object.

head_tag = soup.head
print len(soup.contents)
# 1

print soup.contents[0].name
# html
# the same usage as above applies here

title_tag = soup.title
print title_tag
# <title>The Dormouse's story</title>

print title_tag.contents
#[u"The Dormouse's story"]

print title_tag.contents[0]
#The Dormouse's story

.descendants

The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

head_tag = soup.head
print head_tag
# <head><title>The Dormouse's story</title></head>

print head_tag.contents
#[<title>The Dormouse's story</title>]

Note, however, that the <title> tag itself has a child: the string "The Dormouse's story". The .descendants attribute iterates over all of a tag's children recursively: its direct children, their children, and so on.

for child in head_tag.descendants:
    print child
#<title>The Dormouse's story</title>
#The Dormouse's story

If a tag has only one child, and that child is a NavigableString, the child is available as .string:

title_tag = soup.title
print title_tag.string
# The Dormouse's story

.strings and .stripped_strings

for string in soup.strings:
    print(repr(string))
# prints every string, including whitespace-only ones

for string in soup.stripped_strings:
    print(repr(string))
# whitespace-only strings such as "\n" are skipped

Going up

.parent

You can access an element's parent with .parent, for instance:

title_tag = soup.title
print title_tag
#<title>The Dormouse's story</title>

print title_tag.parent
#<head><title>The Dormouse's story</title></head>

.parents

link = soup.a

for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None

This iterates over all of an element's ancestors, from its direct parent up to the top of the document.

Sideways

soup_sibling = BeautifulSoup("<a><b>text1</b><c>Test2</c></a>", 'lxml')

print (soup_sibling.prettify())
#<html>
# <body>
#  <a>
#   <b>
#    text1
#   </b>
#   <c>
#    Test2
#   </c>
#  </a>
# </body>
#</html>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them *siblings*.

.next_sibling and .previous_sibling

print soup_sibling.b.next_sibling
# <c>Test2</c>

print soup_sibling.c.previous_sibling
# <b>text1</b>

There are also .next_siblings and .previous_siblings, which iterate over all of a tag's following or preceding siblings.

You should also know about .next_elements and .previous_elements.

Searching the tree

A string

print soup.find_all('b')
#[<b>The Dormouse's story</b>]
# This is a list.

A regular expression

import re

### @1
for tag in soup.find_all(re.compile("^b")):
    print (tag.name)
##body
##b

### @2
for tag in soup.find_all(re.compile("t")):
    print (tag.name)
## html
## title

### @3
print soup.find_all(["a", "b"])

### @4
for tag in soup.find_all(True):
    print (tag.name)

### @5
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print soup.find_all(has_class_but_no_id)


### @6
def not_lacie(href):
    return href and not re.compile("lacie").search(href)

print (soup.find_all(href=not_lacie))

@1: this code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag.

@2: this code finds the tags whose names contain "t".

@3: the argument is a list, and so is the result: every <a> and <b> tag is returned.

@4: the argument is the boolean True, which matches every tag in html_doc.

@5: the function picks up only the <p> tags, which define "class" but not "id". It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <title>, because those tags don't define "class".

@6: note that the function is passed as the value of href. The result is a list of two elements, both <a> tags.
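The filter from @6 can be tried on plain values to see why it starts with "href and". The assumption here, based on the Beautiful Soup documentation, is that the function receives the attribute value and that this value is None for tags without an href; the list of sample values is made up.

```python
import re

def not_lacie(href):
    # Same filter as @6: truthy only for an href that does not contain "lacie".
    return href and not re.compile("lacie").search(href)

hrefs = [None, "http://example.com/elsie",
         "http://example.com/lacie", "http://example.com/tillie"]
kept = [h for h in hrefs if not_lacie(h)]
print(kept)
```

The leading "href and" keeps the function from calling search() on None.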

find_all()

The examples above already used find_all(). Here are the name argument and keyword arguments in brief:

print soup.find_all("title")

print soup.find_all(id='link2')

print soup.find_all(href=re.compile("elsie"))

You can't use a keyword argument to search for HTML's 'name' attribute, because Beautiful Soup uses the name argument to hold the name of the tag itself. Instead, put 'name' in the attrs argument. The code below shows this:

soup = BeautifulSoup('<input name="email"/>', 'lxml')
## @1
print soup.find_all(name="email")

## @2
print soup.find_all(attrs={"name":"email"})
#[<input name="email"/>]

@1: it prints an empty list ([]).

@2: it prints the correct result.

CSS class

class is a reserved word in Python, so you have to use class_:

print soup.find_all("a", class_ = "sister")

CSS selector

Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself.

soup.select("title")

…

Output

prettify() turns the BS parse tree into a nicely formatted Unicode string.

print (soup.prettify())
## prints a Unicode string

str(soup)
## returns a UTF-8 encoded string (Python 2)

Encodings

Use .original_encoding to display the encoding that BS detected for the document:

print soup.original_encoding
# ascii

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
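# these bytes are encoded in iso-8859-8
# without a hint, soup.original_encoding would report iso-8859-7

soup = BeautifulSoup(markup , 'lxml', from_encoding="iso-8859-8")
# this argument tells BS the correct encoding

print soup.original_encoding
# now this prints iso-8859-8

This is the real reason I wrote this post. By default BS converts the input document to Unicode. The encoding of the input is guessed by BS, and the guess can be wrong, so it is best to pass from_encoding to state the encoding of the input explicitly.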
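Why a wrong guess is possible can be seen by decoding the same four bytes with both codecs using only the standard library. This sketch does not involve BeautifulSoup; it just shows that both decodings succeed and give different text.

```python
raw = b"\xed\xe5\xec\xf9"

as_hebrew = raw.decode("iso-8859-8")  # the encoding the bytes were written in
as_greek = raw.decode("iso-8859-7")   # a plausible but wrong guess

print(as_hebrew)
print(as_greek)
```

Neither call raises an error, so the bytes alone cannot tell a detector which reading is the intended one.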

Output encoding

BS outputs UTF-8 by default. See the following example:

html_doc = """
<html>
    <head>
     <meta content="text/html; charset=ISO-Latin-1"
     http-equiv="Content-type" />
    </head>
    <body>
        <p>Sacr\xe9 bleu!</p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print soup.original_encoding
#iso-latin-1

print (soup.prettify())
#<html>
#	<head>
#		<meta content="text/html;
#		charset=utf-8" http-equiv="Content-type"/>
#	</head>
#	<body>
#	<p>
#		  Sacré bleu!
#	</p>
#	</body>
#</html>

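What do you notice? It is that simple: BS detects the encoding from the charset declared in the HTML <meta> tag, but the output of prettify() is UTF-8, and the <meta> tag is rewritten to match.

Note this down: if UTF-8 is not what you want, you can pass an encoding to prettify().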
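What choosing an output encoding changes can be illustrated with plain strings. This is a standard-library sketch of the byte-level difference, not a call into BeautifulSoup.

```python
text = "Sacr\xe9 bleu!"  # Unicode text, as a parser would hold it internally

as_utf8 = text.encode("utf-8")      # the accented letter becomes two bytes
as_latin1 = text.encode("latin-1")  # the accented letter becomes one byte

print(as_utf8)
print(as_latin1)
```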

A worked example

Take this article as an example.

To extract the text shown in the picture, you can use the following code:

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    # target = 'http://paper.people.com.cn/rmrb/html/2017-11/15/nbs.D110000renmrb_09.htm'
    target = 'http://paper.people.com.cn/rmrb/html/2017-11/15/nw.D110000renmrb_20171115_1-09.htm'
    req = requests.get(url=target)
    req.encoding = 'utf-8'  # state the page encoding explicitly
    content = req.text
    bf = BeautifulSoup(content, 'lxml')
    context = bf.find(id='postContent')  # the element that holds the article body
    print(bf.h1.text)
    print(context.text)

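Reference: https://jiayi.space/post/yong-beautifulsoupti-qu-wang-ye-xin-xi-shi-li

Line breaks

This bothered me from the very beginning, though it is better now. Here is the effect:

We know that the <p></p> tag marks a paragraph; similarly, the <br> tag marks a line break. You can use the .get_text() method: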

    bf = BeautifulSoup(content ,'lxml')
    context = bf.find(id='postContent')
    print context.get_text(separator = u'\n')

See the picture:


An introduction to web crawlers in Python

2017-06-19

python
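The basic learning path is: Python fundamentals -> basic usage of urllib2 or requests -> regular expressions. Following it will make your crawlers more effective and let you go further.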

urllib2
```
import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()
```
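Save this code as demo.py and run it; you will see a screenful of output. That is your first crawler. We call the urlopen method of the urllib2 library and pass in a URL, here the Baidu home page, using the HTTP protocol. You could replace HTTP with FTP, FILE, HTTPS, and so on; the scheme only names the access protocol. urlopen generally takes three parameters: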
```
urlopen(url, data, timeout)
```
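The second parameter, data, is the data to send when accessing the URL; the third, timeout, sets the timeout. Both are optional: data defaults to None and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT. The first parameter, url, is required; in this example we pass the Baidu URL. urlopen returns a response object, which holds the returned information.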
```
print response.read()
```
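The response object has a read method that returns the content of the fetched page.

This code can be improved by constructing a Request. urlopen also accepts a request object, which is an instance of the Request class; you construct it with the URL, data, and so on. For example, the two lines above can be rewritten as: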
```
import urllib2
request = urllib2.Request("http://www.aftermath.cn")
response = urllib2.urlopen(request)
print response.read()
```
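In Python 3, urllib2 was merged into urllib.request. A minimal sketch of the same Request construction there, which also shows how supplying data turns a GET into a POST. No request is actually sent, and the form field is a made-up example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Building a Request does not touch the network; urlopen(request) would.
get_request = Request("http://www.aftermath.cn")
print(get_request.get_method())   # GET

# Supplying data switches the method to POST; data must be bytes.
form = urlencode({"key": "value"}).encode("ascii")
post_request = Request("http://httpbin.org/post", data=form)
print(post_request.get_method())  # POST
```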
POST and GET

[IMPORTANT] This part will be filled in later. For now, here is a picture about regular expressions.

Using the requests library

requests is used as follows:
```
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import requests

if __name__ == '__main__':
    target = 'http://gitbook.cn/'
    req = requests.get(url=target)
    print(req.text)
```
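Run this code and a lot of text content is printed. The next snippet is the same as the one above except for the URL; I use it to explore the character-encoding problem.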
```
import requests

if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url = target)
    print(req.text)
```
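Result:

Note the character set declared in the page source; it matters a great deal for handling Chinese text later. In other words, from here on the text that requests returns displays correctly on my computer but may not on yours, which is exactly what I ran into when I first worked on this code.

Beautiful Soup

The encoding handling in this approach currently has problems. (This deserves a separate post.)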
```
from bs4 import BeautifulSoup
import requests
if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'

    req = requests.get(url=target)
    content = req.text
    #html = urllib2.urlopen(target)
    #content = html.read().decode('gbk', 'ignore')
    # from_encoding need not be given here; BS works it out itself
    bf_1 = BeautifulSoup(content,'lxml')
    print "------------\n\n"
    texts = bf_1.find_all('div', class_ = 'showtxt')
    print "type(texts)=", type(texts)
    print "len(texts)=", len(texts)

    for each_div in texts:
        print "type(each_div)=", type(each_div)
        print "each_div.string=", each_div.string # the tag's .string attribute
        print "type(each_div.string)=", type(each_div.string)
        print "each_div=", each_div
        print "each_div.renderContents()=", each_div.renderContents()
        print "each_div.__str__()=", each_div.__str__()
```
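Why so verbose? Because the character-encoding problem kept getting in the way. BeautifulSoup converts its input to Unicode by itself, so we have to specify the output encoding explicitly when printing.

The result looks like this:

See this article for reference.

In the output above you will notice leftover <br> tags and the like. The code below removes the <br> tags: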
```
import sys
reload(sys)
sys.setdefaultencoding('gbk')
# these three lines are also crucial for the encoding to work (Python 2 only)

import urllib2
from bs4 import BeautifulSoup
import requests
if __name__ == '__main__':
    target = 'http://www.biqukan.com/1_1094/5403177.html'

    req = requests.get(url=target)
    content = req.text
    bf_1 = BeautifulSoup(content,'lxml')
    texts = bf_1.find_all('div', class_ = 'showtxt')
    print "type(texts)=", type(texts)
    print "len(texts)=", len(texts)
    print(texts[0].text.replace('\xa0'*8,'\n\n'))
```
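You can surely spot how this code differs from the previous version; the last few lines of the previous version were only there for testing.

Without the first three lines, Python raises a UnicodeDecodeError. My reading is that the encoding passed to sys.setdefaultencoding() must match the encoding of the page you are scraping, so that the text handed to BeautifulSoup is treated as gbk (this is my own inference and may not be correct). A likelier explanation is that '\xa0' is a byte string, which Python 2 implicitly decodes with the default encoding (ASCII unless changed) when it is combined with the Unicode text; writing it as u'\xa0' avoids the error without changing the default encoding.

To summarize:

Text that is already Unicode must not be decoded again. Instead, encode it into the encoding your terminal uses.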
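The rule can be sketched in Python 3, where the distinction is enforced by the types: bytes are decoded once into str, and str is encoded on output. The sample text and the choice of GBK and UTF-8 are assumptions for the demonstration.

```python
raw = "中文".encode("gbk")      # bytes as fetched from a GBK page (assumed sample)

text = raw.decode("gbk")        # decode exactly once: bytes -> Unicode text
out = text.encode("utf-8")      # encode for a UTF-8 terminal or file

# In Python 3 a second decode is impossible: str has no decode method.
print(hasattr(text, "decode"))
print(out.decode("utf-8") == text)
```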

The result looks like this:

About encoding

Here is a good article on the topic for reference:

[ASCII encoding](http://blog.csdn.net/songjinshi/post/details/7868866)

Regular expressions: