Skip to content

Latest commit

 

History

History
587 lines (533 loc) · 41.3 KB

8.markdown

File metadata and controls

587 lines (533 loc) · 41.3 KB

6. C++ Primer

  • static_cast:well-behaved且非casting away const
    • implicit casting的逆变换: 如果S->D可隐式转换(即casting行为明确定义),则D->S可以用static_cast
      • void*->T*、narrowing cast、Base*->Derived*
  • reinterpret_cast: 对数据的二进制表示提供另一种解释
    • 完全无视类型系统,位于C++语言之下、体系结构一层,因此必然是平台相关的,不可移植
  • C++ casting的引入是为了填补C casting的类型漏洞。两者的比较:
    • C casting在形式上不明显,不易搜索,而新的casting的格式xxx_cast非常醒目
    • C++ casting分门别类,明确用意
      • 有时候重构过后,casting的语义发生变化了,C casting无法发现从而引起运行时错误,而C++ casting却能通过静态检查捕获错误、要求更改cast方式
    • C casting将不可移植的reinterpret_cast隐藏起来
  • C++11的lambda简单的声明了一个functor类型,并创建了一个对象
    • 使用STL算法时,传入lambda比传入函数指针快,因为前者根据类型分派,产生了针对匿名lambda functor类型特化的算法,包含inline;而后者是根据函数地址值分派,公用的同一份代码,有大量函数调用
    • 普通的lambda的operator()声明是const的,而mutable的lambda是非const的
  • 迭代器类型
    • input_iterator: 只读、单趟算法(一个位置只能读一次)
      • 因为input_iteratoroutput_iterator所属的容器有一个"当前位置"的概念,而移动迭代器将导致当前位置更新;所以,当同一个容器创建多个迭代器时,任何一个迭代器的前进都将影响其他迭代器;所以,两次读/写当前位置的结果是未定义的: 虽然本迭代器没移动,但其他迭代器可能移动,因为"当前位置"的状态共享,所以不保证第2次读取能得到相同结果
    • output_iterator: 只写、单趟算法(只能写一次)
      • input_iterator同样的理由,不能重复写
    • forward_iterator: 允许多次读/写同一位置
    • bidirectional iterator: 多次读写同一位置,并且支持双向迭代
    • random-access iterator: O(1)的时间访问任意元素
  • input_iteratoroutput_iterator的迭代状态位于容器中,而其他3种迭代器的状态是自包含的
  • unique_ptr独占指针,所以支持release动作释放指针所有权,而shared_ptr不支持release
  • 需要析构函数的类,肯定要考虑实现拷贝构造、赋值
  • 对象的值语义分类:
    • 不可移动、不可复制: 禁用右值构造/赋值,禁用左值构造/赋值
      • 不能作为返回值,不能放入容器,如scoped_ptr
    • 可移动、不可复制: 允许右值构造/赋值
      • 可作为返回值,可放入C++11容器,如unique_ptr
    • 可移动、可复制:允许左/右值构造/赋值
      • 可作为返回值、可放入C++98容器,如shared_ptr
  • 任何变量都是左值,即使类型是int&&
    • 因此,要将移动变量,一定要用std::move,哪怕它是个右引用变量
    • 模板内一定要用std::forward来完美转发
  • 模板函数通过template argument deduction进行实例化时,其implicit casting规则:
    • 只允许3种casting:
      1. 数组到指针: T[n]->T*
      2. 函数到指针: R(A)->R(*)(A)
      3. 添加const: T&->const T&
    • 其他隐式转换规则都无效,包括自动拓宽、派生类到基类指针等
      • 比如std::min(3, 2.0),由于不会进行整形的自动拓宽,所以会是一个编译错误
      • 要进行这些转换,显示实例化函数模板
  • 函数重载规则
    1. 首先,将非模板函数、函数模板1、函数模板2等,按精确匹配->隐式转换后匹配的优先级排序
    2. 如果优先级最高的备选项有多个,优先选非模板、其次是最特化的模板、然后是更泛化的模板
      • 可以看做是类型的模式匹配,总是选最特化的(非模板看做模板的特例)
      • 想要调用对应版本的函数时,一定要确定该版本可见,否则会调用新的模板实例
  • C++11中的sizeof...可以取得variadic template的项数
  • C++11中注意用std::forward来转发实参列表
  • 右值引用和完美转发(perfect forwarding)
    • 函数模板的参数
      • 基本规则:T&捕获左引用;const T&,捕获任意引用,但不能修改;T&&捕获右引用
      • 特殊规则: T&&能够捕获左引用,此时T是T2&,利用了引用折叠规则
        • 引用的引用是C++11引入的,引用折叠规则是:T&& && -> T&&,而T& &&T&& &T& &都 -> T&
      • 特殊规则:static_cast可以将左值转换成右值
    • std::move
      • 使用环境:模板/非模板中,任何想被移动的值,包括右引用
      • 使用方式: 类型推演(template type deduction), std::move(v)
      • 实现:参数是T&&,其中T可能是T2或T2&&,返回remove_reference&&
    • std::forwrad
      • 使用环境: 模板中,参数类型应该是右引用(虽然const T&也能绑定任意值,但不能修改)
      • 使用方式:显示实例化(explicit instantiate), std::forward(v)
      • 实现:参数是T&&,其中T可能是T2或T2&&,返回T&&
  • 类模板,实例化过后,并不立即生成所有成员,而是只在每个call site生成被访问的成员
    • 因为没被访问的成员根本不生成,所以,即使类型T不完全匹配模板的requirements,也能实例化模板:只访问成员的一个子集即可
    • 显示实例化,将立即生成所有成员
  • 使用显示实例化,用extern template来引用显示实例化,可以控制目标文件的代码规模,不至于每个obj文件都生成各自的模板示例
  • 函数模板只支持特化而不支持部分特化,所以一个惯用技巧是,用函数模板接收参数,然后转发给一个类模板的静态成员,利用类模板来特化、部分特化;这一技巧利用了函数模板的template argument deduction能力,很强大
  • 模板对参数类型应该只有尽量少的要求,比如,用<来代替>和==
  • 模板类的成员没有限制,而类/模板类的模板成员,不能是虚函数
  • 可以单独特化模板类的成员函数
  • 引用作为局部变量的时候(非形参),可以彻底优化为别名(aliasing),完全不占用栈空间
  • C++11引入的随机数
    • 随机引擎:一个callable对象,支持min()/max()/seed()/discard(n);不同的随机引擎区别是随机质量和速度
      • 默认随机引擎类型:default_random_engine
    • 随机分布:一个callable对象,operator()接收随机引擎作为参数,返回不同的分布
      • 均匀分布:uniform_int_distribution、uniform_real_distribution
        • 用rand()得到的浮点随机数精度太低
      • 伯努利分布:bernoulli_distribution
      • 泊松分布:poisson_distribution
      • 正态分布:normal_distribution
  • IO stream
    • stdio的缺陷
      • 无法扩展。不支持用户自定义类型的输出。即使某些扩展支持register,但由于每种类型都要使用独占字符,所以用途有限
      • 类型不安全,...是类型漏洞,丢失了实参的静态类型
        • 比较boost::format,使用%连接每个实参,于是能够进行类型检查,数量、类型的不匹配都可以抛出运行时异常
    • IO stream使用manipulator操作stream的内部状态
      • 输出方式控制: boolalpha/noboolalpha,dec/hex/oct,showcase/noshowcase(输出8/16进制的前缀0/0x)
      • 对齐控制
      • 填充控制
    • IO stream的缺点
      • operator<<的虚函数调用开销
      • manipulator的2次函数调用开销
      • 格式控制极其繁琐,在可读性方面比stdio差很多,使用operator<<的iostream是一种更差的notation
  • 模板算法的参数,选择函数子/函数指针,是一种效率/灵活性间的取舍,前者是静态绑定,后者是动态绑定
  • C++11的final和override
    • final用于class,避免该class被继承;object based范式可以使用
    • final用于函数,避免被子类覆盖
    • override用于函数,明确指出要覆盖基类;避免了由于失误、重构引起的签名不匹配导致的覆盖失败
  • C++模板可以看做是编译期的函数式语言
    • 围绕template<typename car, typename cdr> struct Cons{}编程,template meta programming于是成了编译期的lisp编程
    • 以int等integral number作为负载(即Cons的car),可以进行编译期的计算,如求素数、快排、二叉树等
    • 以typename作为负载,可以操纵typelist
  • 元编程(meta programming)和泛型编程(generic programming)
    • 编写程序A,而A会生成最终代码B,该过程叫元编程
    • 宏、模板、反射/检查都是元编程
    • 当模板参数(template parameter)是类型时(typename),叫泛型编程
  • 模板相关的一些术语
    • class template, function template, member template
    • template argument, default template argument, template argument deduction,
    • instantiate, instantiation, explicit instantiation
    • template specialization, partial specialization
    • variadic template, function parameter pack, pack expansion, parameter pack,

6. C++编程规范, 101条规则

  • new object之后接虚函数init的两段构造法,可以用private constructor + 模板函数create来封装
  • 不使用匈牙利记法的理由:多态
    • 动态多态:Base *baseObj = derivedObj,这里的obj应该标记运行时类型还是静态类型呢
    • 静态多态: generic programming不可能用匈牙利命名;比如C++模板、Haskell
  • SESE(single entry, single exit)已经过时了,异常会造成多出口, 应该采用early exit
    • 阅读嵌套代码需要在脑中维护一个栈
  • 未使用变量的警告
    • 参数:直接注释掉,只留下类型
    • 局部变量:(void)v;
  • 不要用不必要的运算符重载来耍小聪明
  • memory barrier: 指一种特殊的指令,处理器保证在该指令前后的内存操作是顺序的,避免乱序在并反中带来的问题
  • 避免对智能指针的过度使用
    • 作用域有限的指针用裸指针或者scoped_ptr即可
  • C++最强大的静态检查工具是类型系统本身
    • 动态语言只能依赖于单元测试
  • C++反对使用宏,C++的设计目标之一就是通过模板等设施使宏称为多余的
    • 宏没有作用域、类型系统、不卫生
    • 模板在定义时就已经施加了一部分检查
      • concept使得宏施加更多的检查,类似Haskell中的type constraint
  • 有可能一个大函数无法拆小,因为任何拆分的尝试都需要传递大量的参数/上下文给子函数,这种特例下的大函数是允许的
    • 但这应该是非常非常罕见的特例
  • 能够通过forward declaration完成的工作就不要include完整的声明
  • 头文件中应该include它所有的依赖
  • 重载常用来避免不必要的隐式转换带来的开销
    • 如string的operator+等接口
  • 重载||、&&、,将使得这几个运算符变成普通函数,应用applicative order的参数求值规则;而默认情况下他们都是有编译器特殊照顾的特殊求值规则
  • 函数参数的求值顺序不定,确定的只是求值后的压栈规则
    • 注意避免副作用
    • 另一条类似的规则: 同一个语句只允许写一个变量一次,多次修改行为是undefined-behavior
  • C++的继承是仅次于friend的第二大依赖,应该尽量避免
  • 非成员函数优于成员函数
    • "太多的语法糖导致了逗号癌!"
  • 最需要保证正确的是接口,错了可能就再没机会改了(广泛发布了)
  • 派生类通过覆盖重写虚函数的时候,要保证函数的不变式
  • 不要重载隐式转换符
    • string如果提供了operator const char*的话,诸如s + '0'if (s == "0")的意外会发生
  • 将数据私有化有利于封装变化和保持不变式,只有单纯的数据聚集的时候用POD的struct
    • public、protected都有可能破坏不变式
  • 如果一个struct全是getter/setter的话,说明它只是数据聚集(data aggregation),不需要抽象和不变式,用struct+public field就好
  • C++的private成员不可访问,但可见,所以需要pimpl编译防火墙
  • 基类的析构函数,要么是公开虚函数,要么是保护非虚函数
  • NVI(non-virtual interface)模式:虚函数私有,公开非虚函数作为包装
    • 和模板模式的用法比较类似
    • 一个好例子:为避免某个类不覆盖clone,将虚函数声明为clone_,以非虚函数clone调用虚函数,虚函数返回后根据typeid和this的typeid比较
  • 要防止切片(slicing),用clone
  • 在C++98中,实现拷贝赋值的一种办法是,声明它为T& operator = (T o) { swap(*this, o); return *this; },这里将形参声明为T而非引用,其实已经利用了惯用法;同时,该做法还有利于编译器优化右值
  • std::swap的特化其实只是锦上添花,如果实在做不到也无所谓
  • ADL(argument dependent lookup),也称Koening查找。调用非成员函数时,编译器会先查找参数所在的名空间,最后才查找全局空间
    • 鉴于这个理由,将类型相关的非成员函数放到类所在的名空间下,用起来接近成员函数
  • 只能在cpp文件的include之后调用using,它只会影响本cpp。如果在头文件中using,会影响所有包含该头文件的文件;如果在cpp的include之前using,会影响头文件
    • 典型的错误用法:在头文件中using namespace std;
  • 头文件中不能有链接实体(entity with linkage),因此函数要inline或extern,变量要extern;如果语义不变的话,也可以static
    • 一个例外是模板类的static成员数据
  • 无差别捕获catch(...)只应该用在有限的几个地方:(另外,它还将捕获平台异常,而C++异常也是平台异常实现的)
    • 模块接汇处
      • 模块的C接口返回前,为了避免ABI不兼容的异常泄露到模块外,在这里将异常转换成错误码
      • 线程的兜底异常捕获,包括main和threadProc
      • 模块提供的回调末尾
    • 析构函数
  • 模块接口的类型越底层,可发布范围就越广(如C接口的模块几乎可以发布给所有语言);类型越高级,可发布范围越窄(比如用了class过后,只能发布给C++用)
    • 可以用C作为接口,在内部用C++实现。比如windows api
  • template对参数类型要求有三种方式(表现为concept的requirements),模板对类型的要求基于哪种形式,应该文档化(STL现在在这方面不行)
    • 类成员,如T::compare(使用了域访问子,scope resolution operator)
    • 非成员,如compare(a, b),基于ADL
    • type_traits,如sort_traits::compare
  • 不要特化函数,改为在当前类名空间下重载(基于ADL,重载的优先级更高)
    • 当然,整形模板无法重载,只能特化
  • 编写模板的时候,只对类型提出最少要求
    • 比如sort、set、priority_queue只要求类型实现<,而不要求>和==
  • 不要用异常代替断言,虽然logic_error就是为此而生的。断言失败的时候,我们一般不希望栈回滚(stack unwinding)
  • 析构不能抛异常
    • 原因:以下对象因为异常而析构会导致双异常,从而触发terminate
      • 局部变量
      • 数组
      • 静态/全局变量
    • 都是RAII引起的,所以大部分其他语言功能重叠的设施一样面临此问题
      • Java面临finally里面的异常,如果简单的接收nested exception,那么会有资源泄露
      • C#的using的核心设施IDispose,它内部如果再抛出异常,一样面临资源泄露
  • 异常相比错误码的优点
    • 不可忽略
    • 隐式传播,不干扰正常业务的控制流;错误码要求手工传播,和业务码混在一起,难以辨识和维护
    • 在catch块中集中进行错误处理
    • 构造函数、运算符只能抛异常
  • 捕获、并转发异常的场合
    • 模块接口处,将模块内的异常导出成错误码
    • 捕获低层异常添加高层的业务信息后重抛
  • C++的异常规范不靠谱,不要用
    • 例外是,依赖的模块使用了异常规范,只能继承。如exception::what
  • 尽量依赖于STL的检查
    • &v[n]优于&v[0] + n,后者绕过了STL的range checking
  • 尽量使用STL算法,避免手工编写循环。原因:
    • 函数名是功能的抽象,而while/for什么也没说明,需要仔细看
    • 避免了编码上的不必要劣化,如反复调用c.end()等
    • STL是专家编写,有很多优化知识
    • 能够利用特化进行memcpy等优化
    • STL广泛发布。一般而言,发布范围越广,质量、性能都更优
  • STL的函数子应该是纯函数,即其operator ()应该是const成员。理由:
    • 函数子可能在内部有多份拷贝,如果有写操作,将不能累积
      • 如果函数子类包含的是状态的引用,而非状态本身,可以避免此问题
    • 函数子被调用的顺序未定义
  • STL算法使用lambda作为参数比函数指针快
  • 类型安全(type-safety),要求不能访问无类型的内存
  • 对比用type tag进行类型分派(比如用if/else+dynamic_cast)和用虚函数来分派
    • 添加新类型要求修改分派代码,破坏了开放封闭原则
    • 添加新类型,编译器发现不了,而虚函数会自动分派
      • 前者至少应该通过switch-default/else中的assert来捕获运行时异常
  • POD,在内存布局上应该同C的struct等价,同时没有构造+big three
  • 由于对齐问题,无法在任意位置写入任意类型对象
    • 这种unaligned write,势必会用到reinterpret_cast,这是不可移植的
  • reinterpret_cast位于类型系统之下的体系结构一层,注定是不可移植的
    • reinterpret_cast不受class可见性的影响(C casting同样的问题),不会在Base*->Derived*转换的时候调整偏移
  • union也是弱类型的C引入的,它比reinterpret_cast更彻底的擦除了类型,后者至少还会在某些时候失败
  • 可变参数...也擦除了类型信息,应该用类型安全的IO,如IO stream
    • printf等函数在C++中广泛存在,是由于IO stream在可用性、性能等方面设计不够好造成的

6. C++语言的设计与演化

  • 支持局部变量、全局/静态变量的class,而不必每次都new,对C++的性能增益很大
    • C#有struct
    • Java依赖escape analysis和minor GC
  • inline的设计初衷,是为了补偿getter/setter等简单访问子引入的抽象代价
  • 只提供一个特性是不够的,还必须以可以负担得起的代价提供它
    • 还应该有足够的可用性
  • narrowing cast本来是想禁止的,因为它绕过了类型系统且很可能不安全,但为了兼容C...
    • 作为补偿,提供警告
  • 在没有模板的年代,泛型容器就是通过宏的hackdefine; include; undef来实现的
  • Cfront采用C作为目标码,是因为,C是当时移植性最好的汇编

6. Linux多线程服务器编程

  • 动态库的升级(类似COM级别,要求升级后所有客户不用重新发布)
    • 虚函数采用运行时绑定,接口签名变了会是一个运行时错误
    • 非虚函数采用name mangling后symbol lookup,如果接口签名变了,会是一个动态链接错误
  • pimpl手法中的Impl class前置声明应该放在class类而不是外部
  • 如果是target级别的多态,可以简单的将接口的声明和实现分离,然后根据target不同编译、链接不同的cpp。这里不必引入虚函数开销
    • 例: OS.h,OS.cpp,根据系统不同,分别编译、链接Linux.cpp、Windows.cpp
    • 例: A.h, A.cpp,如果是unittest的target,则链接A_mock.o
  • 在linux下hook系统调用,只需要实现同名函数即可,它会影响dlsym的查找
  • C语言之所以必须先声明后使用,是因为早期编译器内存受限,使用一趟扫描(one-pass)的方式编译
    • 分离编译的原因之一,也是内存无法装载所有文件的语法树

6. 重构

  • 单元测试(unitest)和功能测试(functional test),前者是白盒后者是黑盒测试
    • 像解释器这种程序,也需要程序员编写大量的功能测试
  • 编写未臻完善的测试并运行,好过对完美测试的无尽等待
  • 测试的焦点应该在边界
  • 花合理的时间抓出大多数bug,好过花一生的时间找出所有bug
  • 当测试数量达到一定程度过后,继续增加测试其收益会开始锐减,你反而可能因为工作量太大而放弃

6. 杂项

  • C遗留给C++的类型安全漏洞
    • explicit conversion
      • 应该用xxx_cast替代
    • union
      • 不要使用
    • 可变参数...
      • 用类型安全的函数,如iostream::operator<<或者C++11的variadic template
  • C++自身的类型安全问题
    • static_cast进行Base*->Derived*
      • 至少应该在debug版中用dynamic_cast
    • dynamic_cast+if/else进行类型分派
      • 用虚函数
    • 数组的多态
      • 应该用base**而不是base*
  • C++ concept很像编译期的duck typing
  • 在C++中使用gtest
  • C++ project
    • 将C++模板看做编译期的函数式语言,当以integral number作为负载时,能进行编译期计算:包括求素数、快排、二叉树等

7. 常见编译器优化

  • Machine dependent factors

    • The architecture of the target CPU
      • RISC vs CISC
      • Pipelines
      • Number of functional units
    • Cache size
    • Cache/Memory transfer rates
  • Intended use of the generated code

    • Debugging
    • General purpose use
    • Special-purpose use
    • Embedded systems
  • 窥孔优化(Peephole optimizations)

    • Usually performed late in the compilation process after machine code has been generated
    • For instance, a multiplication of a value by 2 might be more efficiently executed by left-shifting the value or by adding the value to itself. (This example is also an instance of strength reduction.)
  • 语言相关优化(Language-dependent optimizations)

    • the existence of pointers in C and C++ makes it difficult to optimize array accesses (see alias analysis)
  • 循环优化(Loop optimizations)

    • 归纳变量优化(Induction variable analysis)

      • if a variable in a loop is a simple linear function of the index variable, such as j := 4*i + 1, it can be updated appropriately each time the loop variable is changed. This is a strength reduction
      • This information is also useful for bounds-checking elimination and dependence analysis, among other things.
    • 循环分裂(Loop fission)

      • loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body

      • This optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor. It is the opposite to loop fusion, which can also improve performance in other situations.

      • 例子

           int i, a[100], b[100];
           for (i = 0; i < 100; i++) {
             a[i] = 1; 
             b[i] = 2;
           }
        
           int i, a[100], b[100];
           for (i = 0; i < 100; i++) {
             a[i] = 1;                     
           }
           for (i = 0; i < 100; i++) {
             b[i] = 2;
           }
        
    • 循环交换(Loop interchange)

      • loop interchange is the process of exchanging the order of two iteration variables used by a nested loop. The variable used in the inner loop switches to the outer loop, and vice versa. It is often done to ensure that the elements of a multi-dimensional array are accessed in the order in which they are present in memory, improving locality of reference.

      • 例子(列矩阵)

          for i from 0 to 10
           for j from 0 to 20
              a[i,j] = i + j
        
          loop interchange would result in:
        
          for j from 0 to 20
           for i from 0 to 10
              a[i,j] = i + j
        
      • On occasion, such a transformation may create opportunities to further optimize, such as vectorization of the array assignments.

    • 循环不变式提取(Loop-invariant code motion)

      • loop-invariant code consists of statements or expressions (in an imperative programming language) which can be moved outside the body of a loop without affecting the semantics of the program

      • 例子

          for (int i = 0; i < n; i++) {
              x = y + z;
              a[i] = 6 * i + x * x;
          }
        

        The calculation x = y + z and x * x can be moved outside the loop since within they are loop invariant---they do not change over the iterations of the loop— so the optimized code will be something like this:

          x = y + z;
          t1 = x * x;
          for (int i = 0; i < n; i++) {
              a[i] = 6 * i + t1;
          }
        
    • 循环展开(Loop unwinding)

      • is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size (space-time tradeoff). The transformation can be undertaken manually by the programmer or by an optimizing compiler.

      • 例子

           int x;
           for (x = 0; x < 100; x++)
           {
               delete(x);
           }
           int x; 
           for (x = 0; x < 100; x+=5)
           {
               delete(x);
               delete(x+1);
               delete(x+2);
               delete(x+3);
               delete(x+4);
           }
        
    • 循环分割(Loop splitting)

      • is a compiler optimization technique. It attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range.
    • 循环分支移除(Loop unswitching)

      • Loop unswitching is a compiler optimization. It moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional. This can improve the parallelization of the loop. Since modern processors can operate fast on vectors this increases the speed.

      • 例子

            int i, w, x[1000], y[1000];
            for (i = 0; i < 1000; i++) {
              x[i] = x[i] + y[i];
              if (w)
                y[i] = 0;
            }
        

      The conditional inside this loop makes it difficult to safely parallelize this loop. When we unswitch the loop, this becomes:

              int i, w, x[1000], y[1000];
              if (w) {
                for (i = 0; i < 1000; i++) {
                  x[i] = x[i] + y[i];
                  y[i] = 0;
                }
              } else {
                for (i = 0; i < 1000; i++) {
                  x[i] = x[i] + y[i];
                }
              }
      
    • 软流水(Software pipelining)

      • software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor

      • 例子

          for (i = 1) to bignumber
            A(i)
            B(i)
            C(i)
          end
        

        In this example, let A(i), B(i), C(i), be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory . A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ... A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ... Software pipelining is often used in combination with loop unrolling.

          for i = 1 to (bignumber - 2) step 3
            A(i)
            A(i+1)
            A(i+2)
            B(i)
            B(i+1)
            B(i+2)
            C(i)
            C(i+1)
            C(i+2)
          end
        
    • 自动向量化(Automatic vectorization)

      • is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation which processes one operation on multiple pairs of operands at once

      • 例子

          for (i = 0; i < 1024; i+=4)
              for (ii = 0; ii < 4; ii++)
                 C[i+ii] = A[i+ii]*B[i+ii];
        

      After loop distribution using temporary arrays

              for (i = 0; i < 1024; i+=4)
              {
                for (ii = 0; ii < 4; ii++) tA[ii] = A[i+ii];
                for (ii = 0; ii < 4; ii++) tB[ii] = B[i+ii];
                for (ii = 0; ii < 4; ii++) tC[ii] = tA[ii]*tB[ii];
                for (ii = 0; ii < 4; ii++) C[i+ii] = tC[ii];
              }
      

      After replacing with vector codes

            for (i = 0; i < 1024; i+=4)
              {
                vA = vec_ld( &A[i] );
                vB = vec_ld( &B[i] );
                vC = vec_mul( vA, vB );
                vec_st( vC, &C[i] );
              }
      
  • 公共子表达式消除(Common subexpression elimination )

    • is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyses whether it is worthwhile replacing them with a single variable holding the computed value

    • 例子

        a = b * c + g;
        d = b * c * e;
      

      it may be worth transforming the code to:

        tmp = b * c;
        a = tmp + g;
        d = tmp * e;
      
  • 常量折叠(Constant folding)

    • is the process of simplifying constant expressions at compile time. Terms in constant expressions are typically simple literals, such as the integer literal 2, but can also be variables whose values are never modified, or variables explicitly marked as constant. Consider the statement

    • 例子

        i = 320 * 200 * 32;
        i = 2048000
      
  • 常量传播(Constant propagation )

    • is the process of substituting the values of known constants in expressions at compile time. Such constants include those defined above, as well as intrinsic functions applied to constant values

    • 例子

        int x = 14;
        int y = 7 - x / 2;
        return y * (28 / x + 2);
      

      Propagating x yields:

        int x = 14;
        int y = 7 - 14 / 2;
        return y * (28 / 14 + 2);
      

      Continuing to propagate yields the following (which would likely be further optimized by dead code elimination of both x and y.)

        int x = 14;
        int y = 0;
        return 0;
      
  • SSA-based optimizations

    • These optimizations are intended to be done after transforming the program into a special form called static single assignment (see SSA form), in which every variable is assigned in only one place
    • 全局值编码(Global value numbering)
      • is a compiler optimization based on the SSA intermediate representation. It sometimes helps eliminate redundant code that common subexpression elimination (CSE) does not. At the same time, however, CSE may eliminate code that GVN does not, so both are often found in modern compilers. Global value numbering is distinct from local value numbering in that the value-number mappings hold across basic block boundaries as well, and different algorithms are used to compute the mappings.

      • 例子

          w := 3
          x := 3
          y := x + 4
          z := w + 4
        

        a good GVN routine would assign the same value number to w and x, and the same value number to y and z. For instance, the map [{w} \mapsto 1, {x} \mapsto 1, {y} \mapsto 2, {z} \mapsto 2] would constitute an optimal value-number mapping for this block. Using this information, the previous code fragment may be safely transformed into:

          w := 3
          x := w
          y := w + 4
          z := y
        
  • 寄存器分配(Register allocation)

    • The most frequently used variables should be kept in processor registers for fastest access
  • 指令选择(Instruction selection)

    • Most architectures, particularly CISC architectures and those with many addressing modes, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to do a good job overall of choosing which instructions to implement which operators in the low-level intermediate representation with. For example, on many processors in the 68000 family and on the x86 architecture, complex addressing modes can be used in statements like lea 25(a1,d5*4), a0, allowing a single instruction to perform a significant amount of arithmetic with less storage.

    • 例子

        t1 = a
        t2 = b
        t3 = t1 + t2
        a = t3
        b = t1
      

      A good tiling for the x86 architecture is a succinct set of instructions:

        MOV EAX, a
        XCHG EAX, b
        ADD a, EAX
      
  • 指令调度(Instruction scheduling)

    • Instruction scheduling is an important optimization for modern pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.
    • Software pipelining是它在循环中的应用? 也就是out-of-order?
  • Rematerialization

    • Rematerialization recalculates a value instead of loading it from memory, preventing a memory access. This is performed in tandem with register allocation to avoid spills.
  • 函数式语言优化(Functional language optimizations)

    • 递归消除(Removing recursion)
      • Recursion is often expensive, as a function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of stack space, through a process called tail recursion elimination or tail call optimization. Some functional languages (e.g., Scheme and Erlang) mandate that tail calls be optimized by a conforming implementation, due to their prevalence in these languages.
  • 边界检查消除(Bounds-checking elimination)

    • Many languages, for example Java, enforce bounds-checking of all array accesses. This is a severe performance bottleneck on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds-checking in many situations where it can determine that the index must fall within valid bounds, for example if it is a simple loop variable.
    • Natively compiled languages
      • One technique for bounds-checking elimination is to use a typed static single assignment form representation and for each array create a new type representing a safe index for that particular array. The first use of a value as an array index results in a runtime type cast (and appropriate check), but subsequently the safe index value can be used without a type cast, without sacrificing correctness or safety.
    • JIT-compiled languages
      • Just-in-time compiled languages such as Java often check indexes at runtime before accessing Arrays. Some just-in-time compilers such as HotSpot are able to eliminate some of these checks if they discover that the index is always within the correct range, or if an earlier check would have already thrown an exception
  • 内链展开(Inline expansion or macro expansion)

    • When some code invokes a procedure, it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it. This saves the overhead related to procedure calls, as well as providing great opportunity for many different parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures. A "fewer jumps" optimization. The statements of imperative programming languages are also an example of such an optimization. Although statements could be implemented with function calls they are almost always implemented with code inlining.
  • 过程间优化(Interprocedural optimizations)

    • Interprocedural optimization works on the entire program, across procedure and file boundaries. It works tightly with intraprocedural counterparts, carried out with the cooperation of a local part and global part. Typical interprocedural optimizations are: procedure inlining, interprocedural dead code elimination, interprocedural constant propagation, and procedure reordering. As usual, the compiler needs to perform interprocedural analysis before its actual optimizations. Interprocedural analyses include alias analysis, array access analysis, and the construction of a call graph. Interprocedural optimization is common in modern commercial compilers from SGI, Intel, Microsoft, and Sun Microsystems. For a long time the open source GCC was criticized[citation needed] for a lack of powerful interprocedural analysis and optimizations, though this is now improving.[citation needed] Another good open source compiler with full analysis and optimization infrastructure is Open64, which is used by many organizations for research and for commercial purposes. Due to the extra time and space required by interprocedural analysis, most compilers do not perform it by default. Users must use compiler options explicitly to tell the compiler to enable interprocedural analysis and other expensive optimizations.
  • 逃逸分析(Escape analysis)

    • Escape analysis determines all the places where a pointer can be stored and whether the lifetime of the pointer can be proven to be restricted only to the current procedure and/or thread.
    • A compiler can use the results of escape analysis as a basis for optimizations
      • Converting heap allocations to stack allocations.
      • Synchronization elision. If an object is found to be accessible from one thread only, operations on the object can be performed without synchronization.
      • Breaking up objects or scalar replacement. An object may be found to be accessed in ways that do not require the object to exist as a sequential memory structure. This may allow parts (or all) of the object to be stored in CPU registers instead of in memory.

19. 杂项

  • Java的checked exception缺点

    • versioning问题
      • 一个函数调用栈中的大多数帧,都不关心异常如何处理,他们需要做的仅仅是finally/using;一种看法是catch和finally的比例是1:10;而checked exception要求异常被显示的沿调用链来声明,这已经部分具备错误码的特征了
      • 接口函数的checked exception声明也是接口的一部分,如果接口的实现变化,导致需要抛出新的异常,除非显示的在实现中将新异常转换成已有异常,否则,新异常的声明将意味着接口变化,要求所有使用接口的客户端代码修改,而实际上,这大部分代码都只是在finally,真正应该修改的只是数量更少的catch代码(位于调用链的更浅层次)
      • 实际使用中,不关心异常处理方式的finally帧往往直接声明throw exceptions,这实际上意味着checked exception不实用
    • scalability问题
      • 一个函数可以调用多个子系统的函数,caller需要应对的异常数是callees的异常类型数量之和;调用链的越低层次,需要应对的异常种类越多,这是扩展性问题
      • 实践上可能通过throw exceptions来避免传递所有异常类型,这其实是在宣告checked exception不实用
  • 函数调用的方式

    • 静态绑定(compile time binding)

      • 全局函数/非虚函数:

          mov ecx, eax;
          call 0x12345;
        
    • 动态绑定(runtime binding)

      • 虚函数:

          mov ecx, eax;
          mov ebx, (eax,0);
          mov eax, (ebx,12);
          call eax;
        
      • 委托(包括指向虚函数的委托):(可见,委托是比虚函数快的,因为在创建委托的时候有一次预先的vtbl lookup)

          mov ecx, (eax,4);
          call (eax,0)
        
  • .Net通过Strong names来应对DLL hells问题:引用的每个类都通过assemble名+版本+locale+密钥+类名来标识

  • JVM和CLR的设计初衷是不同的:

    • JVM强调的是平台独立性
    • CLR在平台独立性的基础上,还强调互操作性(interop)
      • 语言间的互操作性:CLR支持某些语言特有的feature如指针,它的能力是常见语言的超集
      • 与已有系统的互操作性:如对DLL、COM、OLE的支持;相反,Java相信世界是纯Java的,要和已有代码交互,需要JNI,后者的使用及其复杂,需要考虑和GC的协作等